GSS-YOLO: Vehicle detection method and embedded deployment in complex traffic road scenarios

Research Article (Research Square)

Shengning Lu, Zhihao Ren, Yan Zhi, Xinhua Wang, Xu Yu, Yong Liang

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-5357943/v1
This work is licensed under a CC BY 4.0 License.
Status: Posted. Version 1 posted.

Abstract

In real-world vehicle detection scenarios, numerous complex and highly uncertain factors, including variations in lighting, motion blur, occlusion, and weather conditions, can significantly degrade performance. Autonomous driving and intelligent traffic systems must respond quickly to a wide range of traffic situations. To reduce the impact of these uncertainties and improve the accuracy of vehicle detection in complex backgrounds, we propose GSS-YOLO, a new YOLO detector based on YOLOv5s.
First, to reduce computation while maintaining detection accuracy, we replace all standard convolutions (Conv) in the neck with GSConv. Second, to reduce sequence length and computational complexity while enlarging the receptive field and improving feature extraction, we embed the Swin Transformer attention mechanism into the C3 module. Finally, to improve the handling of small, hard-to-detect objects and objects in complex backgrounds, we replace the original spatial pyramid pooling fast (SPPF) module with the FocalModulation module. Compared with the original YOLOv5s, our method reduces model parameters by 21.21% and GFLOPS by 20.88%. On the challenging UA-DETRAC vehicle detection dataset, GSS-YOLO increases mAP by 4.2% and precision by 5.5%. We deployed GSS-YOLO on the Atlas200DK A2 embedded system, where it reaches 37.04 FPS with an accuracy drop of only 0.3, meeting the requirements of real-time detection.

Keywords: Vehicle detection, YOLO, Intelligent transportation, Embedded devices

1 Introduction

Accurate vehicle detection is a key technology for urban smart transportation, and related research has been widely applied in fields such as autonomous driving, intelligent transportation, and safety monitoring. The rapid advancement of autonomous driving has made object detection in traffic scenes a critical area of research. Object detection is recognized as one of the most significant and challenging tasks in computer vision.
Owing to its robust feature extraction capabilities, deep learning has found widespread application across fields including security, the military, and medicine. In recent years it has been extended to transportation, where it has achieved major breakthroughs [1]. Research in the transportation sector relies on data collected by road monitoring systems, and many researchers have therefore designed vehicle detection and classification methods [2]. Vehicles play a vital role in modern life, but they also bring challenges such as traffic jams and accidents. To address these problems, autonomous driving technology has attracted widespread attention. Its core is the vehicle detection algorithm, which can be combined with lidar to accurately measure and identify vehicle targets [3] and help prevent traffic accidents. Vehicle detection algorithms therefore show large market potential and broad application prospects.

Vision-based object detection can be divided into traditional object detection and deep learning-based object detection [4-6]. Traditional methods are complex to operate, exhibit a high false positive rate, and face challenges in practical application. In contrast, deep learning-based detection offers higher accuracy, better generalization, and greater robustness. Traditional techniques combine manual feature extraction with machine learning algorithms, so that vehicle detection is divided into two steps: feature extraction and classifier training. Haar/Haar-like features [7], the histogram of oriented gradients (HOG) [8], and the Deformable Part Model [9] can still extract features stably when the type or state of the vehicle changes, and they perform well in vehicle detection.
Feature extraction methods such as SIFT and SURF [10-11] generate rich vehicle features, which are then used to train classifiers that identify vehicle targets. Common classifiers include the K-nearest neighbor algorithm (KNN) and the support vector machine (SVM), which must strike a balance between generalization ability and fitting accuracy [12]. However, traditional methods decompose detection into multiple steps, so they lack real-time performance and have limited detection accuracy and generalization ability.

Deep learning-based object detection falls into two categories. Two-stage detectors, such as Faster R-CNN, adopt a region proposal mechanism: they first generate candidate bounding boxes and then classify and regress them [13]. One-stage detectors, such as YOLO and SSD, treat detection as a regression problem and directly predict the category and location of each object [14]. Both approaches have advantages and disadvantages. Two-stage detectors are valued for their detection accuracy, but their relatively slow speed limits their use, especially where an immediate response is required. One-stage detectors are significantly faster, but their accuracy is usually slightly lower than that of two-stage detectors.
Many optimized algorithms have been verified and applied in vehicle detection. Li Kang et al. proposed a fuzzy attention mechanism that uses fuzzy entropy to reweight the feature map, reducing its uncertainty and making the detector focus on the center of the target, which effectively improves vehicle detection accuracy [15]. Ren Jinghui et al. designed the ResFusion module to expand the receptive field of the model, capture features at different scales, retain more feature information, and improve detection accuracy [16]. Dong Xudong et al. proposed an improved vehicle detection method that introduces the C3Ghost module in the neck network to improve feature expression and the convolutional block attention module (CBAM) in the backbone network to improve feature extraction [17]. Hamzenejadi et al. introduced the squeeze-and-excitation (SE) attention mechanism and used high-resolution feature maps for detection, which improved the detection accuracy of small objects [18]. Li Yuhua et al. proposed a vehicle detection algorithm based on the coordinate attention (CA) mechanism, which embeds location information into channel attention during feature extraction to reduce the loss of target feature information and improve detection [19].

Although these methods have shown good results in vehicle detection, they still face challenges. The fuzzy attention mechanism helps reduce the uncertainty of feature maps, but it also increases the complexity and the number of parameters of the model.
Attention modules such as CBAM and SE can weaken the interference of redundant noise to a certain extent. However, in complex traffic scenes with changing lighting, motion blur, and varying weather, the high uncertainty of the scenes themselves still has a significant impact on detector accuracy and limits further performance gains.

To address these problems, we propose a new vehicle detection method, GSS-YOLO, which uses Swin Transformer attention and GSConv convolution to reduce the computational complexity of the model, and introduces the FocalModulation module to improve the detection of uncertain objects in complex backgrounds. The main contributions of this paper are as follows:

We propose the GSS-YOLO algorithm, which better captures global and local feature information while reducing computation, improving detection accuracy and small-target detection, and reducing false and missed detections.

We combine Swin Transformer attention and GSConv convolution into the GSC3 module, which reduces sequence length and computational complexity while enlarging the receptive field of the model and improving feature extraction.

We replace the original SPPF module with the FocalModulation module to improve the handling of small, hard-to-detect objects and objects in complex backgrounds.

To evaluate the proposed method, we conducted object detection experiments on the UA-DETRAC dataset using both a PC and the Atlas200DK A2 embedded platform, assessing metrics such as GFLOPS and mAP. The experimental results show that the proposed method performs strongly in vehicle detection within complex backgrounds.

2 Related Work

Visual vehicle detection is a significant research area within computer vision.
Its primary objective is to use computer vision to automatically detect and identify vehicles in images or videos. This work has wide applications in autonomous driving, traffic monitoring, intelligent transportation systems, and other fields. With the advancement of deep learning, vehicle detection algorithms are typically categorized as one-stage or two-stage algorithms.

A two-stage algorithm first generates candidate boxes and then classifies them. Typical examples include Region-CNN (R-CNN) [20] and Faster R-CNN [21]. Huang et al. [22] proposed an enhanced framework based on Faster R-CNN for rapid vehicle detection. They used the MobileNet architecture to build the base convolutional layers of Faster R-CNN and replaced the original Non-Maximum Suppression (NMS) step following the Region Proposal Network (RPN) with soft NMS to address duplicate detections. While this two-stage detector achieves high accuracy, the redundant region proposals generated by the RPN increase computational cost, so its real-time detection efficiency is suboptimal.

In contrast, one-stage algorithms offer a better balance between accuracy and speed [23], making them more suitable for vehicle detection scenarios that demand real-time performance. The YOLO series is the typical representative of one-stage algorithms and the most popular choice in industrial applications. In real-time video analysis tasks such as traffic monitoring, YOLO algorithms are the most commonly used because they detect quickly while preserving accuracy. Building on the original version [24], the newer YOLOv3, YOLOv4, YOLOv5, YOLOv6, YOLOv7, and the currently most popular YOLOv8 [25] have further improved classification accuracy. The open source YOLOv5 is commonly used for real-time vehicle detection: it extracts vehicle features effectively with high detection accuracy and balances speed and accuracy. YOLOv8 improves computational efficiency and accuracy over YOLOv5. However, YOLOv5 trains and infers significantly faster, with fewer parameters and lower GFLOPS, which makes it more advantageous for embedded object detection applications.

In addition to direct use of the R-CNN and YOLO series for vehicle detection, improved methods based on these general detectors have been proposed to meet the requirements of real traffic scenarios. YOLOv5 remains one of the most commonly used base detectors owing to its versatility. To address the low accuracy of small target detection, Liu Haiying et al. designed an efficient spatiotemporal interaction module to replace the residual structure of the original network, and introduced recursive gated convolution in the feature fusion part to enable better interaction of high-order spatial semantic information [26]. Guo Shangrong et al. changed the YOLOv5 neck to the S6 feature fusion structure to improve the recognition of multi-scale defects, replaced the neck network with a slim neck to improve the fusion of multi-scale defect features, and used the CARAFE upsampling operator to enlarge the network's receptive field [27]. To minimize computational load and model size, Hu et al. replaced the backbone feature network of the original YOLOv5 algorithm with the lightweight MobileNetV3.
They also introduced a Convolutional Block Attention Module into the neck network, optimizing attention during the feature fusion stage to enhance detection accuracy [28]. Instead of directly employing the Swin Transformer, we designed the GSC3 module based on the Swin Transformer block to achieve a better balance between model parameters and accuracy.

3 Methods

In this section, we first introduce the overall structure of GSS-YOLO in Section 3.1, and then describe its main components, FocalModulation, GSConv, and GSC3, in Sections 3.2, 3.3, and 3.4, respectively.

3.1 Overall network structure

Taking YOLOv5s as the base model, this paper proposes the GSC3 module, which is built from GSConv convolution and Swin_Transformer_Block. GSS-YOLO is built on this module, and its overall structure is shown in Fig. 1. The GSC3, GSConv, and FocalModulation modules are described in detail below.

GSS-YOLO consists of three parts: the backbone network, the neck, and the head. First, the backbone downsamples the input image, extracts image features, and progressively reduces the feature map size. In the last layer of the backbone, FocalModulation replaces SPPF (spatial pyramid pooling fast) to strengthen feature extraction for small objects and objects in complex backgrounds. Then, in the neck, our GSC3 module replaces the conventional C3 module, and all Conv convolutions are replaced with GSConv convolutions to further reduce computation. Finally, the head outputs the detection results.

3.2 FocalModulation module

The FocalModulation module comes from Focal Modulation Networks (FocalNets), proposed by Yang et al. in 2022 [29] to replace self-attention (SA) for modeling token interactions in vision.
Self-attention requires complex query-key interactions and query-value aggregation for each query token, which can be computationally expensive. In contrast, focal modulation simplifies these operations by first aggregating spatial context at different granularities into a modulator. Figure 2 provides an intuitive comparison between self-attention and focal modulation. FocalModulation uses a multi-level feature fusion mechanism to capture and integrate coarse-grained spatial information and fine-grained feature details at the same time. This allows the network to learn effectively at different levels of feature representation, improving the adaptability and accuracy of the model. Compared with SPPF, focal modulation can adjust to the size of the target and strengthen the focus on hard-to-detect targets, which improves detection accuracy. We therefore replace the SPPF module of YOLOv5 with the FocalModulation module to improve the handling of small, hard-to-detect objects and objects in complex backgrounds.

Figure 3(a) illustrates the overall structure of FocalModulation, while Fig. 3(b) presents the aggregation process in detail. The aggregation process has two key steps. The first is hierarchical contextualization, which proceeds from local to global scope and extracts contextual information at different levels of granularity. The second is gated aggregation, which acts as a selection mechanism: it compresses the collected context features across all granularity levels into the modulator, so that features are regulated and fused precisely.
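The two aggregation steps just described can be illustrated with a toy, single-channel, one-dimensional sketch. It is not the authors' implementation: the function names, the hand-written depthwise convolution, the use of a plain average as the extra global level, and the omission of all linear projection layers are simplifying assumptions made only for illustration.

```python
import math

def gelu(v):
    """Exact GELU: v * Phi(v), with Phi the standard normal CDF."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def depthwise_conv1d(signal, kernel):
    """Single-channel 'same' convolution with zero padding (odd kernel length)."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(w * padded[i + j] for j, w in enumerate(kernel))
        for i in range(len(signal))
    ]

def focal_modulation_1d(x, query, gates, kernels):
    """Toy focal modulation on one channel (hypothetical helper).

    x       : input features, one value per position
    query   : projected query values, one per position
    gates   : L + 1 lists of per-position gate values
    kernels : L depthwise kernels, one per focal level
    """
    levels = []
    z = list(x)
    # Step 1: hierarchical contextualization, from local to global context.
    for kernel in kernels:
        z = [gelu(v) for v in depthwise_conv1d(z, kernel)]
        levels.append(z)
    # Extra global level (assumption): average of the last level, broadcast.
    global_context = sum(z) / len(z)
    levels.append([global_context] * len(x))
    # Step 2: gated aggregation of all levels into one modulator per position.
    modulator = [
        sum(gates[lvl][i] * levels[lvl][i] for lvl in range(len(levels)))
        for i in range(len(x))
    ]
    # Modulation: element-wise product of query and modulator.
    return [q * m for q, m in zip(query, modulator)]

x = [1.0, 2.0, 3.0, 4.0]
identity = [0.0, 1.0, 0.0]
# Gates select only the first (local) level.
local_only = focal_modulation_1d(x, [1.0] * 4, [[1.0] * 4, [0.0] * 4], [identity])
# Gates select only the global level.
global_only = focal_modulation_1d(x, [2.0] * 4, [[0.0] * 4, [1.0] * 4], [identity])
```

With the gates set to pick only the local level, each output depends only on its own neighborhood; with the gates set to pick only the global level, every position receives the same context value. Intermediate gate values blend the two, which is the role gated aggregation plays in the module.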
$$y_i = T_2\left(M_2\left(i, X\right), x_i\right) \tag{1}$$

$$y_i = q\left(x_i\right) \odot m\left(i, X\right) \tag{2}$$

Equation (1) describes focal modulation, which generates a refined representation \(y_i\) through early aggregation: context features are first aggregated at each position \(i\) using \(M_2\), after which the query interacts with the aggregated features through \(T_2\) to produce \(y_i\). Eq. (2) is a specific instance of focal modulation, where \(q(\cdot)\) is the query projection function and \(\odot\) denotes element-wise multiplication. The function \(m(\cdot)\) is the context aggregation function, and its output is referred to as the modulator. The modulator is constructed in two steps: hierarchical contextualization, Eq. (3), extracts multi-level context from the data, and gated aggregation, Eq. (4), integrates and compresses the extracted features.

$$Z^{l} = f_a^{l}\left(Z^{l-1}\right) \triangleq \text{GeLU}\left(\text{DWConv}\left(Z^{l-1}\right)\right) \in R^{H \times W \times C} \tag{3}$$

$$Z^{out} = \sum_{l=1}^{L+1} G^{l} \odot Z^{l} \in R^{H \times W \times C} \tag{4}$$

$$y_i = q\left(x_i\right) \odot h\left(\sum_{l=1}^{L+1} g_i^{l} \cdot Z_i^{l}\right) \tag{5}$$

In Eq. (3), \(f_a^{l}\) is the contextualization function at level \(l\), implemented by a depthwise convolution with kernel size \(k^{l}\) followed by a GeLU activation. Hierarchical contextualization extracts context from local to global scope at different levels of granularity. In Eq. (4), \(G^{l} \in R^{H \times W \times 1}\) is the slice of \(G\) for level \(l\). Specifically, a linear layer produces the spatial- and level-aware gating weights \(G = f_g\left(X\right) \in R^{H \times W \times (L+1)}\). The features are then weighted by element-wise multiplication and summed, which yields a feature map \(Z^{out}\) with the same size as the input \(X\). Combining hierarchical contextualization, gated aggregation, and modulation, focal modulation is expressed by Eq. (5), where \(g_i^{l}\) and \(Z_i^{l}\) are the gating value and the context feature at position \(i\) and level \(l\), and \(h(\cdot)\) is a linear layer.

3.3 GSConv module

To improve real-time vehicle detection on mobile embedded devices, we replace the standard convolution of YOLOv5s with GSConv convolution, which reduces the model burden while maintaining accuracy. GSConv + Slim-Neck is a lightweight design proposed by Li et al. [30] for vehicle-mounted edge computing platforms in autonomous driving. Slim-Neck reduces computational complexity through the cross-stage partial network module GSCSP, thereby improving detection speed and accuracy. This scheme balances resource consumption and detection performance and is suitable for edge computing environments.

Depthwise separable convolution (DSC) reduces the computational burden to a certain extent, but its core mechanism, the separate processing of channel information, often limits the accuracy a model can reach. This limitation weakens feature extraction and fusion and becomes a bottleneck for lightweight, high-precision detection. In contrast, GSConv combines standard convolution with depthwise separable convolution.
This design retains the comprehensiveness and accuracy of standard convolution in feature extraction while incorporating the lower computational complexity of DSC, and thereby achieves lightweight, efficient real-time detection on edge devices. In this way, GSConv overcomes the limitations of the plain DSC model and supports real-time detection tasks in edge computing. GSConv begins with a conventional convolution for downsampling, followed by a depthwise convolution (DWConv); the outputs of the standard convolution (SC) and the depthwise convolution are then combined, and a shuffle operation mixes the corresponding channels. The structure of the GSConv module is illustrated in Fig. 4, where "Conv" comprises the convolution layer, batch normalization, and activation layer, while "DWConv" denotes the DSC operation. The cross-stage partial network (GSCSP) is designed to aggregate computations, reducing overall calculation and network complexity while maintaining adequate accuracy.

GSConv compresses the spatial dimension of the feature map while significantly increasing the number of channels. This design retains the important connections between channels, so the model can capture and keep rich high-level semantic information while compressing information. Furthermore, the shuffle operation reduces the computational cost of the convolution operation. It simplifies the calculation and improves the operating efficiency and processing speed of the model, making it better suited to real-time processing or resource-constrained environments while maintaining high performance. If GSConv is used at all stages of the network, inference time may be prolonged because of the increase in network depth, reducing overall computational efficiency. In view of this, we use GSConv modules only in the neck.
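The structure described above can be made concrete with a small, dependency-free sketch. It only counts convolution weights and shows the channel interleaving that a shuffle performs; it is not the authors' code. The half-and-half channel split, the 5x5 depthwise kernel, the ShuffleNet-style interleave, and all function names are assumptions chosen for illustration.

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weight count of a k x k convolution (bias and batch norm ignored)."""
    return (c_in // groups) * c_out * k * k

def gsconv_params(c_in, c_out, k, dw_k=5):
    """Weight count of a GSConv-style layer under the assumptions above.

    A dense convolution produces half of the output channels; a depthwise
    convolution (assumed kernel size dw_k) produces the other half from them.
    """
    half = c_out // 2
    dense = conv_params(c_in, half, k)
    depthwise = conv_params(half, half, dw_k, groups=half)
    return dense + depthwise

def channel_shuffle(channels, groups=2):
    """Interleave `groups` equal-sized blocks of channels."""
    per_group = len(channels) // groups
    return [
        channels[g * per_group + i]
        for i in range(per_group)
        for g in range(groups)
    ]

# Hypothetical 3x3 layer with 256 input and 256 output channels.
standard = conv_params(256, 256, 3)
gsconv = gsconv_params(256, 256, 3)
print(standard, gsconv, round(gsconv / standard, 3))

# "sc" = channels from the standard convolution, "dw" = depthwise channels.
print(channel_shuffle(["sc0", "sc1", "dw0", "dw1"]))
```

Under these assumptions the GSConv-style layer needs roughly half the weights of the standard layer, and the shuffle places each depthwise channel next to a standard-convolution channel so that later layers see both kinds of feature together.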
Restricting GSConv to the neck avoids unnecessary computational overhead and also lets the attention mechanism focus on key features more efficiently, which improves overall detection accuracy.

3.4 GSC3 Module

To reduce sequence length and computational complexity while enlarging the model's receptive field and enhancing feature extraction, we combine GSConv convolution with the Swin Transformer block into the GSC3 module. This new module replaces the original C3 module in YOLOv5s. The overall structure of the GSC3 module is shown in Fig. 5.

Swin Transformer is a Transformer architecture designed for computer vision tasks. It introduces self-attention based on shifted windows and adopts a hierarchical feature representation, which balances computational complexity and performance [31]. The Swin_Transformer_Block uses the shifted window approach to compute attention, creating connections across the windows of the previous layer. This reduces the complexity of the original attention calculation and addresses the lack of global context, thereby enhancing the model's performance.

As illustrated in Fig. 6, the Swin Transformer block comprises a shifted window-based multi-head self-attention (MSA) module followed by a two-layer multi-layer perceptron (MLP) with a GELU nonlinearity in between. A LayerNorm (LN) layer is applied before each MSA module and each MLP, and a residual connection is applied after each module. Two consecutive Swin Transformer blocks use a window MSA (W-MSA) module and a shifted window MSA (SW-MSA) module, respectively, so that different windows can exchange information while the computational load stays low.
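The saving from window-based attention can be estimated with the complexity expressions given in the Swin Transformer paper [31]: for a feature map of h x w tokens with C channels, global MSA costs about 4hwC² + 2(hw)²C operations, while W-MSA with M x M windows costs about 4hwC² + 2M²hwC. The sketch below evaluates both; the feature map size, channel count, and window size are hypothetical values, not the configuration used in GSS-YOLO.

```python
def msa_cost(h, w, c):
    """Operation estimate for global multi-head self-attention."""
    tokens = h * w
    return 4 * tokens * c * c + 2 * tokens * tokens * c

def window_msa_cost(h, w, c, m):
    """Operation estimate when attention is restricted to m x m windows."""
    tokens = h * w
    return 4 * tokens * c * c + 2 * m * m * tokens * c

# Hypothetical setting: 56 x 56 tokens, 96 channels, 7 x 7 windows.
h, w, c, m = 56, 56, 96, 7
global_cost = msa_cost(h, w, c)
window_cost = window_msa_cost(h, w, c, m)
print(global_cost, window_cost, round(global_cost / window_cost, 1))
```

The global term grows quadratically with the number of tokens, whereas the windowed term grows linearly for a fixed window size, which is why the shifted window scheme is attractive for high-resolution feature maps.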
Based on this window partitioning mechanism, consecutive Swin Transformer blocks are computed as follows:

$$\hat{z}^{i} = \text{W-MSA}\left(\text{LN}\left(z^{i-1}\right)\right) + z^{i-1} \tag{6}$$

$$z^{i} = \text{MLP}\left(\text{LN}\left(\hat{z}^{i}\right)\right) + \hat{z}^{i} \tag{7}$$

$$\hat{z}^{i+1} = \text{SW-MSA}\left(\text{LN}\left(z^{i}\right)\right) + z^{i} \tag{8}$$

$$z^{i+1} = \text{MLP}\left(\text{LN}\left(\hat{z}^{i+1}\right)\right) + \hat{z}^{i+1} \tag{9}$$

where \(\hat{z}^{i}\) denotes the output of the W-MSA module and \(z^{i}\) the output of the MLP module of the \(i\)-th block; \(\hat{z}^{i+1}\) and \(z^{i+1}\) are the corresponding outputs of the following block, which uses SW-MSA.

4 Experiment

With all improvements to YOLOv5s in place, this paper proposes a real-time traffic vehicle detection algorithm called GSS-YOLO, which is trained and tested on a PC and deployed on the Atlas 200 DK A2 embedded system. This section first introduces the dataset and the experimental environment, and then describes the training results.

4.1 Experimental Dataset

We used the challenging UA-DETRAC dataset to evaluate the proposed GSS-YOLO model on the vehicle detection task. UA-DETRAC is known for its size and diversity; it covers various vehicle types such as cars, trucks, and buses, and provides rich test scenarios for vehicle detection algorithms. The dataset was collected from real traffic environments in Beijing and Tianjin, China. The training set contains 82,085 high-resolution images from 60 independent video sequences, capturing vehicles under different times of day, weather, and traffic conditions.
The test set contains 56,127 images from 40 different video sequences. The UA-DETRAC dataset accounts for the impact of weather and includes data collected under four conditions: cloudy, night, sunny, and rainy. Because UA-DETRAC consists of frame sequences, it contains a large number of similar images. We therefore preprocessed the dataset by sampling one image every 10 frames, and divided the images into training, validation, and test sets in a ratio of 8:1:1. The resulting training set has 8639 images, the validation set has 1165 images, and the test set has 1166 images. Vehicle data under different weather conditions are shown in Fig. 7.

4.2 Experimental Environment

Training Configuration. The computer is equipped with an AMD Ryzen 5 5600 processor and an NVIDIA GeForce RTX 2080 Ti graphics processor with 11GB of video memory, and runs a Windows operating system. It uses PyTorch 1.10.0 as the deep learning framework and CUDA 10.2 for GPU acceleration. The model is trained in the PyCharm integrated development environment with Python 3.8. During training, the input image size is 640x640 pixels, the number of epochs is 300, the batch size is 16, and the learning rate is 0.001. To ensure a fair comparison, no model uses pre-trained weights during training.

Table 1 Atlas200DK A2 hardware parameters

| Parameter | Specification |
| CPU | TAISHANV200M |
| AI processor | DaVinciV300 AI core |
| Memory | 4GB LPDDR4X |
| AI computing power | 8TOPS(INT8) |
| Power consumption | 21W |
| Wired network | Gigabit Ethernet |

Embedded Devices. The mainstream embedded products currently on the market include the NVIDIA Jetson series and the Raspberry Pi series. As the computing power of embedded devices has improved, they can handle most deep learning inference tasks.
The GSS-YOLO vehicle detection algorithm designed in this paper is deployed on an embedded device and then loaded into the mobile terminal, in order to promote the adoption of vehicle detection at the edge. Because this paper targets real-time detection on video streams, the embedded device must meet certain requirements for CPU and computing power. NVIDIA Jetson products integrate CUDA-capable NVIDIA GPUs and compute very quickly, but they are relatively expensive and offer low cost performance. In comparison, the Atlas200DK A2 offers higher computing power at a price similar to the Jetson Nano. The Atlas200DK A2 is therefore used as the embedded device in this experiment. Its hardware parameters are shown in Table 1, and the development board is shown in Fig. 8.

4.3 Evaluation Metrics

This paper uses precision (P), number of parameters (Params), floating-point operations (FLOPs, reported as GFLOPS), and mean average precision (mAP) as evaluation metrics. Precision is the ratio of correctly predicted positive samples to all samples predicted as positive, as shown in Eq. (10):

$$P = \frac{TP}{TP + FP} \tag{10}$$

TP (true positives) is the number of samples that are predicted as positive and are actually positive. FP (false positives) is the number of samples that are predicted as positive but are actually negative.

The number of parameters is the total number of weights and biases that the model learns and optimizes. This metric helps assess the model's complexity and indicates its training and storage requirements.

The FLOPs metric measures the number of floating-point operations required for a single forward propagation in a neural network.
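As a rough illustration of how parameter and operation counts arise, the sketch below computes both for a single convolution layer. The layer shape is hypothetical, the helper name is invented for this example, and counting one multiply-accumulate as two floating-point operations is a common convention that not every profiling tool follows, so the numbers are indicative only.

```python
def conv2d_cost(c_in, c_out, k, h_out, w_out, bias=True):
    """Return (parameters, FLOPs) of one k x k convolution layer.

    Parameters: one k x k x c_in filter per output channel, plus biases.
    FLOPs: two operations (multiply and add) per weight use, at every
    output position; the bias addition is ignored.
    """
    weights = c_in * c_out * k * k
    params = weights + (c_out if bias else 0)
    macs = weights * h_out * w_out
    return params, 2 * macs

# Hypothetical layer: 3 -> 32 channels, 3 x 3 kernel, 320 x 320 output map.
params, flops = conv2d_cost(3, 32, 3, 320, 320)
print(params, flops, round(flops / 1e9, 3))  # last value is in GFLOPs
```

Summing such per-layer figures over the whole network gives the Params and GFLOPS values used to compare models.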
This indicator is used to evaluate the computational complexity and efficiency of the model. Generally speaking, the higher the FLOPs value, the more computing resources and time the model consumes when processing data; 1 TFLOPs equals 1000 GFLOPs.

mAP is one of the standard indicators of object detection performance. The average precision (AP) of each category is calculated first, and the AP values of all categories are then averaged to obtain mAP, as shown in Eqs. (11) and (12). The larger the mAP value, the better the detection performance.

4.4 Experimental Results

Model training results. Figure 9 shows the loss curves of the improved model on the training and validation sets. The box_loss represents the error between the predicted bounding box and the ground truth, while the object_loss reflects the confidence with which the algorithm predicts that an object is present. The classification_loss measures whether the anchor_box is correctly classified against the corresponding ground truth. As shown in Fig. 9, all loss curves level off as training proceeds, indicating that the model converges.

Comparative experiment. Compared with the original YOLOv5s algorithm, GSS-YOLO has fewer parameters and lower computational complexity, and detects vehicles in complex backgrounds well. To further verify the detection performance of the proposed algorithm, we compared it with several current mainstream detection models, such as YOLOv8s. All network models were trained on the UA-DETRAC dataset using the same training procedure, and precision (P), number of parameters (Params), floating-point operations (GFLOPs), and mean average precision (mAP) were used as evaluation indicators. The comparison results are shown in Table 2, where bold text indicates the best result.
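Returning to the mAP metric from Section 4.3, the two-step calculation (per-class AP, then the mean over classes) can be sketched as follows. The text does not state how AP is integrated, so the all-point interpolation of the precision-recall curve used here is an assumption, and the class names and values in the example are purely illustrative.

```python
from typing import Dict, Sequence


def average_precision(recalls: Sequence[float],
                      precisions: Sequence[float]) -> float:
    """Area under the precision-recall curve (all-point interpolation).

    `recalls` must be sorted in ascending order, with `precisions[i]`
    being the precision measured at `recalls[i]`.
    """
    r = [0.0, *recalls, 1.0]
    p = [0.0, *precisions, 0.0]
    # Replace each precision by the maximum precision to its right, so the
    # curve becomes non-increasing as recall grows.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the rectangles between consecutive recall points.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def mean_average_precision(ap_per_class: Dict[str, float]) -> float:
    """mAP: arithmetic mean of the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


# Illustrative numbers only (not results from the paper).
ap_example = average_precision([0.5, 1.0], [1.0, 0.5])
map_example = mean_average_precision({"class_a": ap_example, "class_b": 0.5})
print(ap_example, map_example)
```

Other AP conventions (for example 11-point interpolation or averaging over several IoU thresholds) give slightly different numbers, so comparisons are only meaningful when the same convention is used throughout.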
Table 2 shows that, on the same dataset, GSS-YOLO outperforms YOLOv5s on every reported metric. Although its precision is slightly lower than that of the more advanced YOLOv8s, the GFLOPs and Params of GSS-YOLO are only 44.01% and 49.68% of those of YOLOv8s, respectively. This demonstrates that GSS-YOLO achieves good detection performance with fewer parameters when detecting vehicles in complex backgrounds. Figure 10 shows some detection results on the UA-DETRAC dataset.

Table 2 Comparative experiments of different algorithms

| Model | Backbone | GFLOPs | Params | Precision (%) |
| --- | --- | --- | --- | --- |
| Faster-RCNN | Resnet-50 | 201.09 | 137098724 | 54.4 |
| YOLOv5s | CSP-Darknet53 | 15.8 | 7020913 | 62.7 |
| YOLOv6s | EfficientRep | 45.17 | 18507345 | 66.5 |
| YOLOv7-tiny | ELAN | 13.01 | 6014737 | 65.8 |
| YOLOv8s | Darknet-53 | 28.4 | 11127132 | 70.8 |
| Traffic YOLO [32] | CSP-Darknet53 | - | 45M | 65.5 |
| GSS-YOLO | CSP-Darknet53 | 12.5 | 5528050 | 68.2 |

Ablation experiment. Our ablation experiments are based on YOLOv5s. To compare the contribution of each module used in this study, we conducted several sets of experiments and evaluated each configuration on four indicators, focusing mainly on precision and the number of model parameters. The performance after adding different modules is shown in Table 3, where bold text marks the best results. Table 3 shows that Model1, which replaces the Conv convolutions in the neck of YOLOv5s with GSConv convolutions, reduces both the parameters and GFLOPs while leaving precision and mAP essentially unchanged. Comparing Model1 and Model2, adding the FocalModulation module slightly increases the model parameters and GFLOPs, but improves both precision and mAP. Comparing Model1 and Model3, replacing the C3 module with our proposed GSC3 module substantially reduces the model parameters and GFLOPs and significantly improves precision and mAP.
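The relative figures quoted alongside Table 2 are plain ratios of the tabulated values. A minimal sketch of that arithmetic follows; the helper names `relative_reduction` and `share_of` are illustrative and not from the paper. With the tabulated numbers, GSS-YOLO comes out at about 44.01% of the GFLOPs and 49.68% of the parameters of YOLOv8s, while the reductions relative to YOLOv5s come out at roughly 20.89% (GFLOPs) and 21.26% (parameters), marginally different from the 20.88% and 21.21% quoted in the abstract.

```python
# Values copied from Table 2 (GFLOPs and parameter counts).
YOLOV5S = {"gflops": 15.8, "params": 7_020_913}
YOLOV8S = {"gflops": 28.4, "params": 11_127_132}
GSS_YOLO = {"gflops": 12.5, "params": 5_528_050}


def relative_reduction(baseline: float, new: float) -> float:
    """Percentage by which `new` is smaller than `baseline`."""
    return (baseline - new) / baseline * 100.0


def share_of(reference: float, value: float) -> float:
    """`value` expressed as a percentage of `reference`."""
    return value / reference * 100.0


for key in ("gflops", "params"):
    reduction = relative_reduction(YOLOV5S[key], GSS_YOLO[key])
    share = share_of(YOLOV8S[key], GSS_YOLO[key])
    print(f"{key}: {reduction:.2f}% below YOLOv5s, "
          f"{share:.2f}% of YOLOv8s")
```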
Finally, compared with the original YOLOv5s algorithm, the precision of our proposed GSS-YOLO algorithm increases from 62.7% to 68.2%, a gain of 5.5 percentage points, and the model parameters are reduced from 7020913 to 5528050, a reduction of 21.21%. GSS-YOLO therefore offers better detection precision with fewer model parameters. Figure 11 compares the detection results of YOLOv5s and GSS-YOLO under complex backgrounds (red arrows mark missed targets).

Table 3 Ablation experiment

| Model | GSConv | FocalModulation | GSC3 | Params | GFLOPs | Precision (%) | mAP (%) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| YOLOv5s | | | | 7020913 | 15.8 | 62.7 | 61.9 |
| Model1 | √ | | | 6579953 | 15.2 | 62.8 | 61.9 |
| Model2 | √ | √ | | 6983698 | 15.7 | 63.5 | 62.1 |
| Model3 | √ | | √ | 5133041 | 12.2 | 66.7 | 65.8 |
| GSS-YOLO | √ | √ | √ | 5528050 | 12.5 | 68.2 | 66.1 |

4.5 Embedded Deployment

As the performance of small embedded devices improves, their computing power has become sufficient to support most deep learning-based object detection tasks. We therefore deployed the GSS-YOLO vehicle detection algorithm developed in this paper on an embedded device. After weighing cost, power consumption, and computing power, we selected the Atlas200DK A2 as the embedded development platform. First, the model is trained and converted on the PC, and the converted model is uploaded to the Atlas200DK A2 development board over SSH using MobaXterm. The detection script is then run from Jupyter on the board, and the inference results are displayed in the Jupyter interface. The detection results are shown in Fig. 12. To verify that the proposed algorithm detects better than the original YOLOv5s on the embedded system, we ran both models on the Atlas200DK A2; the experimental results are shown in Fig. 13.
The two sets of experiments show that, when deployed on the Atlas200DK A2, GSS-YOLO performs better than YOLOv5s in complex backgrounds and detects distant vehicles more reliably. We used power-metering sockets to measure the power consumption of the PC and the Atlas200DK A2 during inference. The average inference power consumption of the PC is 177.3W, while that of the Atlas200DK A2 is only 11.5W. Compared with the PC equipped with a high-performance RTX 2080 Ti GPU, the embedded system loses slightly in precision, but its overall power consumption is only 6.49% of that of the PC, and its inference speed of 37.04FPS meets the speed requirements of real-time detection. The comparison between the Atlas200DK A2 and the GPU is shown in Table 4.

Table 4 Comparison between Atlas200DK A2 and GPU

| Device | Average power | Inference time | Precision (%) |
| --- | --- | --- | --- |
| PC | 177.3W | 11ms | 68.2 |
| Atlas200DK A2 | 11.5W | 26.99ms | 67.9 |

To analyze real-time inference performance, we compare the inference frame rate with results reported in the related literature. The comparison results are shown in Table 5.

Table 5 Real-time performance comparison

| Algorithm | Platform | Resolution | Frame rate (FPS) |
| --- | --- | --- | --- |
| [33] | NVIDIA Jetson Nano | 512×512 | 12.8 |
| [34] | NVIDIA Jetson Nano | 640×640 | 16.0 |
| [35] | NVIDIA Xavier NX | 512×320 | 26.04 |
| | | 1024×576 | 14.2 |
| [36] | Raspberry Pi 4B | 640×640 | 25.0 |
| [37] | Nvidia Jetson Tegra X2 | 384×384 | 30.9 |
| Ours | Atlas200DK A2 | 640×640 | 37.04 |

5 Conclusion and discussion

To address vehicle detection in complex backgrounds for autonomous driving and intelligent transportation systems, and to meet the requirements of edge detection, this paper proposes GSS-YOLO, an improved algorithm based on YOLOv5s. Replacing the Conv convolutions in the neck network with GSConv convolutions reduces the amount of computation while maintaining, or even slightly improving, model performance.
The FocalModulation module replaces the original SPPF module. Finally, our proposed GSC3 module replaces the C3 module in the original neck network. The model was trained on the UA-DETRAC dataset, with inference conducted on both a PC and the Atlas200DK A2. The results indicate that, compared with the original YOLOv5s, precision improved from 62.7% to 68.2%, a gain of 5.5 percentage points, while model parameters were reduced by 21.21%. Comparative experiments on the UA-DETRAC dataset show that the algorithm performs well in detecting distant vehicles, validating the effectiveness of the new model.

This paper proposes a vehicle detection algorithm, deploys it on an embedded device, and obtains good detection results. However, in actual traffic environments the proposed algorithm still falls somewhat short of real-time detection. Further optimization of the algorithm and AI acceleration on the Atlas200DK A2 therefore require substantial follow-up research, and there is still considerable room for improvement in detecting difficult targets such as high-speed and blurred vehicles. Given the complex environmental factors, image preprocessing methods could be introduced to handle images with abnormal lighting and blur. Further experiments will be carried out in the future to achieve a more efficient vehicle detection algorithm.

Declarations

Author Contribution

Shengning Lu wrote the entire manuscript, Zhihao Ren typeset the paper, Yan Zhi and Xinhua Wang prepared the figures in the paper, Yong Liang organized the data in the paper, and all authors reviewed the manuscript.

Acknowledgements

This work was financially supported by the Project of the Guilin University of Technology (No. GLUTQD2017003).

Data availability

The dataset (UA-DETRAC) used in this study is an open source dataset and is publicly available on the Internet.
References

1. Gholamhosseinian A, Seitz J (2021) Vehicle classification in intelligent transport systems: An overview, methods and software perspective. IEEE Open J Intell Transp Syst 2:173–194. https://doi.org/10.1109/OJITS.2021.3096756
2. Lin CJ, Jhang JY (2022) Intelligent traffic-monitoring system based on YOLO and convolutional fuzzy neural networks. IEEE Access 10:14120–14133. https://doi.org/10.1109/ACCESS.2022.3147866
3. Wang Z, Zhan J, Duan C, Guan X, Lu P, Yang K (2022) A review of vehicle detection techniques for intelligent vehicles. IEEE Trans Neural Networks Learn Syst 34(8):3811–3831. https://doi.org/10.1109/TNNLS.2021.3128968
4. Zhao ZQ, Zheng P, Xu S (2019) Object detection with deep learning: A review. IEEE Trans Neural Networks Learn Syst 30(11):3212–3232. https://doi.org/10.1109/TNNLS.2018.2876865
5. Zou Z, Chen K, Shi Z, Guo Y (2023) Object detection in 20 years: A survey. Proceedings of the IEEE 111(3):257–276. https://doi.org/10.1109/JPROC.2023.3238524
6. Solunke BR, Gengaje SR (2023) A Review on traditional and deep learning based object detection methods. In 2023 International Conference on Emerging Smart Computing and Informatics (ESCI) pp 1–7. https://doi.org/10.1109/ESCI56872.2023.10099639
7. Lee C, Kim D (2018) Visual homing navigation with Haar-like features in the snapshot. IEEE Access 6:33666–33681. https://doi.org/10.1109/ACCESS.2018.2842679
8. Min W, Liu R, He D (2022) Traffic sign recognition based on semantic scene understanding and structural traffic sign location. IEEE Trans Intell Transp Syst 23(9):15794–15807. https://doi.org/10.1109/TITS.2022.3145467
9. Donnelly J, Barnett AJ, Chen C (2022) Deformable protopnet: An interpretable image classifier using deformable prototypes. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition pp 10265–10275. https://doi.org/10.1109/cvpr52688.2022.01002
10. Sasikala N, Swathipriya V, Ashwini M (2020) Feature extraction of real-time image using Sift algorithm. Eur J Electr Eng Comput Sci 4(3):206–214. https://doi.org/10.24018/ejece.2020.4.3.206
11. Bansal M, Kumar M, Kumar M (2021) 2D object recognition: a comparative analysis of SIFT, SURF and ORB feature descriptors. Multimedia Tools Appl 80(12):18839–18857. https://doi.org/10.1007/s11042-021-10646-0
12. Bansal M, Goyal A, Choudhary A (2022) A comparative analysis of K-nearest neighbor, genetic, support vector machine, decision tree, and long short term memory algorithms in machine learning. Decis Analytics J 3:100071. https://doi.org/10.1016/j.dajour.2022.100071
13. Maity M, Banerjee S, Chaudhuri SS (2021) Faster r-cnn and yolo based vehicle detection: A survey. In 2021 5th international conference on computing methodologies and communication (ICCMC) pp 1442–1447. https://doi.org/10.1109/ICCMC51019.2021.9418274
14. Jiang P, Ergu D, Liu F (2022) A Review of Yolo algorithm developments. Procedia Comput Sci 199:1066–1073. https://doi.org/10.1016/j.procs.2022.01.135
15. Kang L, Lu Z, Meng L (2024) YOLO-FA: Type-1 fuzzy attention based YOLO detector for vehicle detection. Expert Syst Appl 237:121209. https://doi.org/10.1016/j.eswa.2023.121209
16. Ren J, Yang J, Zhang W (2024) RBS-YOLO: a vehicle detection algorithm based on multi-scale feature extraction. SIViP 18(4):3421–3430. https://doi.org/10.1007/s11760-024-03007-5
17. Dong X, Yan S, Duan C (2022) A lightweight vehicles detection network model based on YOLOv5. Eng Appl Artif Intell 113:104914. https://doi.org/10.1016/j.engappai.2022.104914
18. Hamzenejadi MH, Mohseni H (2023) Fine-tuned YOLOv5 for real-time vehicle detection in UAV imagery: Architectural improvements and performance boost. Expert Syst Appl 231:120845. https://doi.org/10.1016/j.eswa.2023.120845
19. Li Y, Zhang M, Zhang C (2024) YOLO-CCS: Vehicle detection algorithm based on coordinate attention mechanism. Digit Signal Proc 153:104632. https://doi.org/10.1016/j.dsp.2024.104632
20. Gidaris S, Komodakis N (2015) Object detection via a multi-region and semantic segmentation-aware cnn model. In Proceedings of the IEEE international conference on computer vision pp 1134–1142. https://doi.org/10.1109/iccv.2015.135
21. Girshick R (2015) Fast R-CNN. Proceedings of the IEEE international conference on computer vision pp 1440–1448. https://doi.org/10.48550/arXiv.1504.08083
22. Nguyen H (2019) Improving Faster R-CNN Framework for Fast Vehicle Detection. Math Probl Eng 1:3808064. https://doi.org/10.1155/2019/3808064
23. Xiao Y, Tian Z, Yu J (2020) A review of object detection based on deep learning. Multimedia Tools Appl 79:23729–23791. https://doi.org/10.1007/s11042-020-08976-6
24. Redmon J (2016) You only look once: Unified, real-time object detection. Proceedings of the IEEE conference on computer vision and pattern recognition pp 779–788. https://doi.org/10.1109/cvpr.2016.91
25. Terven J, Córdova-Esparza DM, Romero-González JA (2023) A comprehensive review of yolo architectures in computer vision: From yolov1 to yolov8 and yolo-nas. Mach Learn Knowl Extr 5:1680–1716. https://doi.org/10.3390/make5040083
26. Liu H, Duan X, Lou H (2023) Improved GBS-YOLOv5 algorithm based on YOLOv5 applied to UAV intelligent traffic. Sci Rep 13:9577. https://doi.org/10.3390/make5040083
27. Guo S, Li S, Han Z (2024) Efficient detection of multiscale defects on metal surfaces with improved YOLOv5. Multimedia Tools Appl 1–23. https://doi.org/10.1007/s11042-024-19477-1
28. Hu T, Gong Z, Song J (2024) Research and implementation of an embedded traffic sign detection model using improved YOLOV5. Int J Autom Technol 25:881–892. https://doi.org/10.1007/s12239-024-00082-y
29. Yang J, Li C, Dai X (2022) Focal modulation networks. Adv Neural Inf Process Syst 35:4203–4217. https://doi.org/10.48550/arXiv.2203.11926
30. Li H, Li J, Wei H (2022) Slim-neck by GSConv: A better design paradigm of detector architectures for autonomous vehicles. arXiv preprint arXiv:2206.02424. https://doi.org/10.48550/arXiv.2206.02424
31. Liu Z, Lin Y, Cao Y (2021) Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision pp 10012–10022. https://doi.org/10.1109/iccv48922.2021.00986
32. Chen X, Zou Y, Ke H (2024) TrafficYOLO: YOLO with Multi-Head Attention Mechanism for Traffic Detection Scenarios. In 2024 5th International Seminar on Artificial Intelligence, Networking and Information Technology (AINIT) pp 2276–2279. https://doi.org/10.1109/AINIT61980.2024.10581465
33. Koay HV, Chuah JH, Chow CO et al (2021) YOLO-RTUAV: Towards real-time vehicle detection through aerial images with low-cost edge devices. Remote Sens 13(21):4196. https://doi.org/10.3390/rs13214196
34. Zhang ZD, Tan ML, Lan ZC et al (2022) CDNet: A real-time and robust crosswalk detection network on Jetson nano based on YOLOv5. Neural Comput Appl 34(13):10719–10730. https://doi.org/10.1007/s00521-022-07007-9
35. Balamuralidhar N, Tilon S, Nex F (2021) MultEYE: Monitoring system for real-time vehicle detection, tracking and speed estimation from UAV imagery on edge-computing platforms. Remote Sens 13(4):573. https://doi.org/10.3390/rs13040573
36. Wu H, Hua Y, Zou H et al (2022) A lightweight network for vehicle detection based on embedded system. J Supercomputing 78(16):18209–18224. https://doi.org/10.1007/s11227-022-04596-z
37. Chen J, Zhang X, Peng X et al (2023) Shuffle-octave-yolo: a tradeoff object detection method for embedded devices. J Real-Time Image Proc 20(2):25. https://doi.org/10.1007/s11554-023-01284-w

Additional Declarations

No competing interests reported.
As a division of Research Square Company, we’re committed to making research communication faster, fairer, and more useful. We do this by developing innovative software and high quality services for the global research community. Our growing team is made up of researchers and industry professionals working together to solve the most critical problems facing scientific publishing. Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-5357943","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":373400028,"identity":"0bf67580-6669-474f-990e-abbeb7a5facc","order_by":0,"name":"Shengning Lu","email":"","orcid":"","institution":"Guilin University of Technology","correspondingAuthor":false,"prefix":"","firstName":"Shengning","middleName":"","lastName":"Lu","suffix":""},{"id":373400031,"identity":"bb4acaf8-88f3-474b-996a-b2d0026e1836","order_by":1,"name":"Zhihao Ren","email":"","orcid":"","institution":"Guilin University of Technology","correspondingAuthor":false,"prefix":"","firstName":"Zhihao","middleName":"","lastName":"Ren","suffix":""},{"id":373400032,"identity":"96483738-a0c9-43a1-91af-a2381973c643","order_by":2,"name":"Yan Zhi","email":"","orcid":"","institution":"Guilin University of Technology","correspondingAuthor":false,"prefix":"","firstName":"Yan","middleName":"","lastName":"Zhi","suffix":""},{"id":373400033,"identity":"b3ed0c52-098f-4e59-8fa4-1f716b8e709a","order_by":3,"name":"Xinhua Wang","email":"","orcid":"","institution":"Guilin University of 
Technology","correspondingAuthor":false,"prefix":"","firstName":"Xinhua","middleName":"","lastName":"Wang","suffix":""},{"id":373400034,"identity":"9ac7632d-f61f-4fc9-b883-154ea7001473","order_by":4,"name":"Xu Yu","email":"","orcid":"","institution":"Guilin University of Technology","correspondingAuthor":false,"prefix":"","firstName":"Xu","middleName":"","lastName":"Yu","suffix":""},{"id":373400035,"identity":"e0abe3f3-f4e4-4a61-9db6-06074c62ddc1","order_by":5,"name":"Yong Liang","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAAsUlEQVRIiWNgGAWjYFACxgYJCQM2OTb29gMkaLEo4DPm4zmTQLw9EhUf5BLnSTgYEKfc4Hhz440bBmbpbRIMCQw/KrYRoeXMwWbLGQZpuW3SjQcYe87cJqxFckZim7SEwbHcNpkDCcyMbcRq+WPwP51NIsGAOC38EoltoEBOIEELz8FmC6AWwzZgIB8kyi/AGHx4Q+IPm7x8e/vBBz8qiNCCAg6QqH4UjIJRMApGAS4AAE/gOMiFTkVUAAAAAElFTkSuQmCC","orcid":"","institution":"Guilin University of Technology","correspondingAuthor":true,"prefix":"","firstName":"Yong","middleName":"","lastName":"Liang","suffix":""}],"badges":[],"createdAt":"2024-10-30 03:23:45","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-5357943/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-5357943/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":68965724,"identity":"093db021-c3f1-4051-8ead-4a43721de512","added_by":"auto","created_at":"2024-11-14 04:19:24","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":181338,"visible":true,"origin":"","legend":"\u003cp\u003eGSS-YOLO overall structure\u003c/p\u003e","description":"","filename":"1.png","url":"https://assets-eu.researchsquare.com/files/rs-5357943/v1/3e777f6d0be3a4825d4e8e3b.png"},{"id":68965719,"identity":"1af593f6-3e24-4b9e-a6c2-f2a3b7519f16","added_by":"auto","created_at":"2024-11-14 04:19:23","extension":"png","order_by":2,"title":"Figure 
2","display":"","copyAsset":false,"role":"figure","size":4781188,"visible":true,"origin":"","legend":"\u003cp\u003e(a) Self-Attention, (b) Focal Modulation, blue and red arrows represent attention interaction and query-related aggregation respectively.\u003c/p\u003e","description":"","filename":"2.png","url":"https://assets-eu.researchsquare.com/files/rs-5357943/v1/9a23a3d155145f3d923f324a.png"},{"id":68965720,"identity":"6d693434-bebc-44da-aa28-04d7fac40275","added_by":"auto","created_at":"2024-11-14 04:19:23","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":93612,"visible":true,"origin":"","legend":"\u003cp\u003eFigure (a) shows the overall architecture of FocalModulation, and Figure (b) details its core aggregation mechanism. The aggregation mechanism is divided into two stages: first, a multi-granularity strategy is adopted to gradually expand from local details to the global field of view to fully capture contextual information; then, through gated aggregation technology, the contextual features obtained at each granularity level are effectively integrated and compressed into the modulator to achieve information refinement and fusion.\u003c/p\u003e","description":"","filename":"3.png","url":"https://assets-eu.researchsquare.com/files/rs-5357943/v1/2255a271160d0f66ea551311.png"},{"id":68966429,"identity":"bd5f3871-7ec4-4727-ab6f-63fc764fd3a2","added_by":"auto","created_at":"2024-11-14 04:27:24","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":192437,"visible":true,"origin":"","legend":"\u003cp\u003eGSConv convolution structure\u003c/p\u003e","description":"","filename":"4.png","url":"https://assets-eu.researchsquare.com/files/rs-5357943/v1/8c6a6a6d71d17b8ba3194c55.png"},{"id":68965725,"identity":"ababda7a-64a4-4958-941c-345eae6c7fbb","added_by":"auto","created_at":"2024-11-14 04:19:24","extension":"png","order_by":5,"title":"Figure 
5","display":"","copyAsset":false,"role":"figure","size":35729,"visible":true,"origin":"","legend":"\u003cp\u003eGSC3 module structure\u003c/p\u003e","description":"","filename":"5.png","url":"https://assets-eu.researchsquare.com/files/rs-5357943/v1/5bbc8d286ca97166466c3a83.png"},{"id":68965721,"identity":"cd77d001-6db5-412f-922c-83422197bde0","added_by":"auto","created_at":"2024-11-14 04:19:23","extension":"png","order_by":6,"title":"Figure 6","display":"","copyAsset":false,"role":"figure","size":41208,"visible":true,"origin":"","legend":"\u003cp\u003eSwin_Transformer_Block module structure\u003c/p\u003e","description":"","filename":"6.png","url":"https://assets-eu.researchsquare.com/files/rs-5357943/v1/29b96851df929b77d2b026b9.png"},{"id":68965727,"identity":"a510a783-b219-4e27-a6b7-2d01599af5ee","added_by":"auto","created_at":"2024-11-14 04:19:24","extension":"png","order_by":7,"title":"Figure 7","display":"","copyAsset":false,"role":"figure","size":5550276,"visible":true,"origin":"","legend":"\u003cp\u003eSome images of the UA-DETRAC dataset under different weather conditions\u003c/p\u003e","description":"","filename":"7.png","url":"https://assets-eu.researchsquare.com/files/rs-5357943/v1/b116ab98cdb73df4313f1658.png"},{"id":68965723,"identity":"6f8cf0dc-7f12-4153-ab25-ec9d75d00ba0","added_by":"auto","created_at":"2024-11-14 04:19:23","extension":"jpeg","order_by":8,"title":"Figure 8","display":"","copyAsset":false,"role":"figure","size":390026,"visible":true,"origin":"","legend":"\u003cp\u003eAtlas200DK A2 development board physical picture\u003c/p\u003e","description":"","filename":"8.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-5357943/v1/8a61b3af6322bb9bb9eb2d8e.jpeg"},{"id":68966428,"identity":"91e89870-10ec-4332-8be5-de47728fd618","added_by":"auto","created_at":"2024-11-14 04:27:23","extension":"png","order_by":9,"title":"Figure 
9","display":"","copyAsset":false,"role":"figure","size":345946,"visible":true,"origin":"","legend":"\u003cp\u003eLoss results\u003c/p\u003e","description":"","filename":"9.png","url":"https://assets-eu.researchsquare.com/files/rs-5357943/v1/2ad832e6b94fe965c8e43343.png"},{"id":68965728,"identity":"a32602a4-9562-4d92-86db-ea33834842b2","added_by":"auto","created_at":"2024-11-14 04:19:24","extension":"png","order_by":10,"title":"Figure 10","display":"","copyAsset":false,"role":"figure","size":6558951,"visible":true,"origin":"","legend":"\u003cp\u003eDetection on UA-DETRAC\u003c/p\u003e","description":"","filename":"10.png","url":"https://assets-eu.researchsquare.com/files/rs-5357943/v1/7cd1455323572e76a1cceb5b.png"},{"id":68965731,"identity":"6c7433d0-bb87-4e50-91d4-d198bdedf234","added_by":"auto","created_at":"2024-11-14 04:19:29","extension":"png","order_by":11,"title":"Figure 11","display":"","copyAsset":false,"role":"figure","size":1046464,"visible":true,"origin":"","legend":"\u003cp\u003eComparison of detection results on UA-DETRAC. The left side shows the detection results of YOLOv5s, and the right side shows the detection results of GSS-YOLO. 
The missed detection parts are marked with red arrows.\u003c/p\u003e","description":"","filename":"11.png","url":"https://assets-eu.researchsquare.com/files/rs-5357943/v1/badb467d243462df4580b566.png"},{"id":68965732,"identity":"02d857ce-7a67-4954-a602-37241ab5ea4a","added_by":"auto","created_at":"2024-11-14 04:19:29","extension":"jpeg","order_by":12,"title":"Figure 12","display":"","copyAsset":false,"role":"figure","size":278031,"visible":true,"origin":"","legend":"\u003cp\u003eUA-DETRAC detection on Atlas200DK A2\u003c/p\u003e","description":"","filename":"12.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-5357943/v1/81e3d25ad84e1539374d46db.jpeg"},{"id":68966430,"identity":"461e12ea-d063-462a-b52f-1f1a356ddc27","added_by":"auto","created_at":"2024-11-14 04:27:24","extension":"jpeg","order_by":13,"title":"Figure 13","display":"","copyAsset":false,"role":"figure","size":379166,"visible":true,"origin":"","legend":"\u003cp\u003eComparison of the UA-DETRAC dataset on Atlas200DK A2. The left side shows the detection results using the YOLOv5s model, and the right side shows the GSS-YOLO model. 
The missed detection parts are marked with red arrows.\u003c/p\u003e","description":"","filename":"13.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-5357943/v1/fc022b29bf399ce168500f00.jpeg"},{"id":82435479,"identity":"d02a8a86-00ca-463b-b58d-fe0135f6b91e","added_by":"auto","created_at":"2025-05-10 19:46:34","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":26133152,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-5357943/v1/1d0791c4-5395-40e9-b7e7-b246cc9b1ba7.pdf"}],"financialInterests":"No competing interests reported.","formattedTitle":"GSS-YOLO: Vehicle detection method and embedded deployment in complex traffic road scenarios","fulltext":[{"header":"1 Introduction","content":"\u003cp\u003eAccurate detection of vehicles is a key technology for realizing urban smart transportation, and related technical research has been widely used and developed in fields such as autonomous driving, intelligent transportation, and safety monitoring. The rapid advancement of autonomous driving has rendered object detection in traffic scenes a critical area of research. Object detection is recognized as one of the most significant and challenging tasks within computer vision. Due to its robust feature extraction capabilities, deep learning has found widespread application across various fields, including security, military, and medicine.In recent years, it has been extended to the transportation field and has made major breakthroughs [\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e]. References in the transportation sector rely on data collected by road monitoring systems. 
Therefore, many researchers have designed various vehicle detection and classification methods [\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eVehicles play a vital role in modern life, but at the same time, they also bring challenges such as traffic jams and accidents. In order to solve these problems, autonomous driving technology has gradually attracted widespread attention. Its core lies in the vehicle detection algorithm. The algorithm is combined with lidar technology to accurately measure and identify vehicle targets [\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e] and effectively prevent traffic accidents. Because of this, vehicle detection algorithms have shown huge market potential and application prospects.\u003c/p\u003e \u003cp\u003eVision-based object detection can be mainly divided into traditional object detection and deep learning-based object detection[\u003cspan additionalcitationids=\"CR5\" citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e–\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e]. Traditional object detection methods are complex to operate, exhibit a high false positive rate, and face challenges in practical application. In contrast, deep learning-based object detection offers higher accuracy, improved generalization, and robustness. This approach builds on traditional detection techniques, incorporating manual feature extraction and machine learning algorithms for effective object detection.The steps of vehicle detection based on machine learning are divided into feature extraction and classifier training. 
Haar/Haarlike [\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e], Histogram of oriented gradient (HOG) [\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e] and Deformable Part Model [\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e], they can still extract features stably when the type state of the vehicle changes, and have good effects in vehicle detection. Feature extraction methods such as SIFT and SURF [\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e–\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e] are used to generate rich vehicle detection features, which are used to train classifiers to identify vehicle targets. Common classifiers include K-nearest neighbor algorithm (KNN) and support vector machine (SVM), which need to strike a balance between generalization ability and fitting accuracy [\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eHowever, traditional methods decompose the detection process into multiple steps, so they lack real-time performance and have limited detection accuracy and generalization ability. The application of deep learning in the field of object detection can be divided into two categories: one is the two-stage detection algorithm (Two-stage detectors) that adopts the region proposal mechanism. During the detection process, they first generate candidate regions and then perform fine classification and positioning; the other is the single-stage detection algorithm (One-stage detectors) that directly performs target prediction and positioning. Both methods have their own advantages and disadvantages. such as Faster R-CNN, which first generates a bounding box, and then needs to classify and regress the bounding box [\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e]. 
The other type is the single-stage detection algorithm, such as YOLO and SSD, which treats the detection task as a regression problem and directly predicts the category and location of the object [\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e]. Although two-stage detection algorithms are praised for their detection accuracy, their relatively slow detection speed is a limiting factor, especially in situations that require an immediate response. In contrast, single-stage detection algorithms are significantly faster, but their detection accuracy is usually slightly lower than that of two-stage algorithms.\u003c/p\u003e \u003cp\u003eCurrently, many optimized algorithms have been verified and applied in the field of vehicle detection. Li Kang et al. proposed a fuzzy attention mechanism, which introduces fuzzy entropy to reweight the feature map, reducing its uncertainty and making the detector focus on the center of the target, thereby improving the accuracy of vehicle detection [\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e]. Ren Jinghui et al. designed the ResFusion module to expand the receptive field of the model, capture features at different scales, make the feature information more inclusive, and improve detection accuracy [\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e]. Dong Xudong et al. proposed an improved vehicle detection method that introduces the C3Ghost module in the neck network to improve feature expression and the convolutional block attention module (CBAM) in the backbone network to improve feature extraction [\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e]. Hamzenejadi et al. 
introduced the squeeze-and-excitation attention mechanism and used high-resolution feature maps for detection, which improved the detection accuracy of small objects [\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e]. Li Yuhua et al. proposed a vehicle detection algorithm based on the coordinate attention (CA) mechanism, which embeds location information into channel attention during feature extraction, reducing the loss of target feature information and improving detection [\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eAlthough these methods have shown good results in the field of vehicle detection, they still face some challenges. The fuzzy attention mechanism helps to reduce the uncertainty of feature maps, but it also increases the complexity and parameter count of the model. Attention modules such as CBAM and SE can weaken the interference of redundant noise to a certain extent, but in complex traffic scenes with changing lighting conditions, motion blur, and varying weather, the high uncertainty of the scenes still has a significant impact on the accuracy of vehicle detectors and limits further performance gains. To solve the above problems, we propose a new vehicle detection method, GSS-YOLO, which uses Swin_Transformer attention and GSConv convolution to reduce the computational complexity of the model, and introduces the FocalModulation module to improve the model's ability to detect uncertain objects in complex backgrounds. 
The main work done in this paper is as follows:\u003c/p\u003e \u003cp\u003e \u003c/p\u003e\u003col\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003eWe propose a GSS-YOLO algorithm that can better capture global and local feature information while reducing the amount of computation, improving detection accuracy, improving the ability to detect small targets, and reducing false detections and missed detections in vehicle detection.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003eWe combine Swin_Transformer attention and GSConv convolution to propose the GSC3 module, which reduces the sequence length and computational complexity while increasing the receptive field of the model and improving feature extraction capabilities.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003eWe use the FocalModulation module to replace the original SPPF module to improve the model's ability to handle small objects that are difficult to detect or objects in complex backgrounds.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003eTo evaluate the effectiveness of the proposed method, we conducted object detection experiments on the UA-DETRAC dataset using both a PC and the Atlas200DK A2 embedded platform, assessing parameters such as GFLOPS and mAP. The experimental results demonstrate that the proposed method exhibits strong performance in vehicle detection within complex backgrounds.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003c/ol\u003e \u003cp\u003e\u003c/p\u003e "},{"header":"2 Related Work","content":"\u003cp\u003eVisual vehicle detection represents a significant research area within computer vision. 
Its primary objective is to leverage computer vision technologies for the automatic detection and identification of vehicles in images or videos. This work has wide applications in autonomous driving, traffic monitoring, intelligent transportation systems, and other fields. With the advancement of deep learning, vehicle detection algorithms are typically categorized into two main types: one-stage algorithms and two-stage algorithms. The two-stage algorithm first generates candidate boxes and then classifies them. Typical examples include Region-CNN (R-CNN) [\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e] and Faster R-CNN [\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e]. Huang et al. [\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e] proposed an enhanced framework based on Faster R-CNN for rapid vehicle detection. They incorporated the MobileNet architecture to construct the foundational convolutional layers of Faster R-CNN and replaced the original Non-Maximum Suppression (NMS) algorithm following the Region Proposal Network (RPN) with a soft NMS algorithm to address the issue of duplicate detections. While this two-stage detection algorithm achieves high accuracy, the redundant candidate regions generated by the RPN increase computational resource usage, so its real-time detection efficiency is suboptimal. In contrast to the two-stage algorithm, the one-stage algorithm offers a better balance between accuracy and speed [\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e], making it more suitable for vehicle detection scenarios that demand real-time performance. The YOLO series of algorithms are typical representatives of the one-stage algorithm and are also the most popular detection algorithms in industrial applications because they balance accuracy and speed well. 
It is worth noting that in real-time video analysis applications such as traffic monitoring, the YOLO series is the most commonly used family of algorithms because it offers fast detection while maintaining detection accuracy. YOLO has been developed continuously: building on the original version [\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e], the newer YOLOv3, YOLOv4, YOLOv5, YOLOv6, YOLOv7, and the currently most popular YOLOv8 [\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e] have further improved accuracy. The open-source YOLOv5 is commonly used for real-time vehicle detection. It effectively extracts vehicle features with high detection accuracy, achieving a balance between speed and accuracy. YOLOv8 has enhanced computational efficiency and accuracy compared to YOLOv5. However, YOLOv5 demonstrates significantly faster training and inference times, with fewer parameters and lower GFLOPS, making it more advantageous for embedded object detection applications.\u003c/p\u003e\u003cp\u003eIn addition to directly using the R-CNN series and the YOLO series for vehicle detection, there are also improved methods based on these general detectors that target the requirements of actual traffic scenarios. YOLOv5 is still one of the most commonly used vehicle detectors due to its versatility and its balance between speed and accuracy. To address the problem of low accuracy in small target detection, Liu Haiying et al. designed an efficient spatiotemporal interaction module to replace the residual network structure in the original network, and introduced recursive gated convolution in the feature fusion part to enable better interaction of high-order spatial semantic information [\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e]. Guo Shangrong et al. 
changed the YOLOv5 neck structure to the S6 feature fusion structure to improve the recognition of multi-scale defects, replaced the neck network with a thin neck to improve the fusion of multi-scale defect features, and used the CARAFE upsampling operator to increase the network's receptive field [\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e]. To minimize computational load and model size, Hu et al. replaced the backbone feature network of the original YOLOv5 algorithm with the lightweight MobileNetV3. They also introduced a Convolutional Block Attention Module into the neck network, optimizing attention during the feature fusion stage to enhance detection accuracy [\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e]. Instead of directly employing the Swin Transformer, we designed the GSC3 module based on the Swin Transformer Block module to achieve a better balance between model parameters and accuracy.\u003c/p\u003e"},{"header":"3 Methods","content":"\u003cp\u003eIn this section, we first introduce the overall structure of GSS-YOLO in Section \u003cspan refid=\"Sec3\" class=\"InternalRef\"\u003e3.1\u003c/span\u003e, and then introduce the main components of GSS-YOLO in detail in Sections \u003cspan refid=\"Sec4\" class=\"InternalRef\"\u003e3.2\u003c/span\u003e, \u003cspan refid=\"Sec5\" class=\"InternalRef\"\u003e3.3\u003c/span\u003e, and \u003cspan refid=\"Sec6\" class=\"InternalRef\"\u003e3.4\u003c/span\u003e, namely the FocalModulation, GSConv, and GSC3 modules.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003e3.1 Overall network structure\u003c/h2\u003e \u003cp\u003eTaking YOLOv5s as the base model, this paper proposes the GSC3 module, built on GSConv convolution and Swin_Transformer_Block. 
On this basis, GSS-YOLO is proposed, and its overall structure is shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e. The GSC3, GSConv, and FocalModulation modules are described in detail below.\u003c/p\u003e \u003cp\u003eGSS-YOLO mainly consists of three parts: the backbone network, the neck structure, and the head structure. First, the backbone network downsamples the input image, extracting image features while progressively reducing the feature map size. In the last layer of the backbone network, FocalModulation replaces SPPF (fast spatial pyramid pooling) to enhance the model's ability to extract features of small objects or objects in complex backgrounds. Then, our GSC3 module replaces the conventional C3 module in the neck, and all Conv convolutions in the neck are replaced with GSConv convolutions to further reduce the amount of computation. Finally, the head outputs the detection results.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec4\" class=\"Section2\"\u003e \u003ch2\u003e3.2 FocalModulation module\u003c/h2\u003e \u003cp\u003eThe FocalModulation module comes from the Focal Modulation Networks (FocalNets) proposed by Yang et al. in 2022 [\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e], where it replaces self-attention (SA) for modeling token interactions in vision. Self-attention requires complex query-key interactions and query-value aggregation for each query token, which can be computationally expensive. In contrast, focal modulation simplifies these operations by first aggregating spatial contexts of different granularities into a modulator. Figure\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e provides an intuitive comparison between traditional self-attention and focal modulation. 
FocalModulation uses a multi-level feature fusion mechanism to simultaneously capture and integrate coarse-grained spatial information and fine-grained feature details to enhance and optimize the overall performance of the network. This mechanism ensures that the network can effectively learn at different levels of feature representation, thereby improving the adaptability and accuracy of the model. Compared with traditional SPPF, focal modulation can adjust according to the size of the target, enhance the focus on difficult-to-detect targets, and thus improve detection accuracy.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eTherefore, we use the FocalModulation module to replace the traditional SPPF module in YOLOv5, thereby improving the model's ability to handle small objects that are difficult to detect or objects in complex backgrounds. Figure\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e(a) illustrates the overall structure of FocalModulation, while Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e(b) presents the detailed aggregation process. 
The aggregation process is divided into two key steps: first, hierarchical contextualization, which systematically traverses a wide range from local to global, accurately extracts and integrates contextual information across different granularity levels; followed by gated aggregation, which acts as an intelligent screening mechanism and is responsible for effectively compressing all collected contextual features and injecting them into their corresponding granularity-level modulators to achieve precise regulation and fusion of features.\u003cdiv id=\"Equa\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equa\" name=\"EquationSource\"\u003e\n$$\\:\\begin{array}{c}{y}_{i}={T}_{2}\\left({M}_{2}\\left(i,X\\right),{x}_{i}\\right),\\#\\left(1\\right)\\end{array}$$\u003c/div\u003e\u003c/div\u003e\u003cdiv id=\"Equb\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equb\" name=\"EquationSource\"\u003e\n$$\\:\\begin{array}{c}{y}_{i}=q\\left({x}_{i}\\right)\\odot\\:m\\left(i,X\\right),\\#\\left(2\\right)\\end{array}$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eEquation (1) describes FocalModulation, which generates a refined representation \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{y}_{i}\\)\u003c/span\u003e\u003c/span\u003e through an early aggregation process. 
In this process, context features are first aggregated using \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{M}_{2}\\)\u003c/span\u003e\u003c/span\u003e at each position \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:i\\)\u003c/span\u003e\u003c/span\u003e, after which the query interacts with the aggregated features based on \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{T}_{2}\\)\u003c/span\u003e\u003c/span\u003e to produce \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{y}_{i}\\)\u003c/span\u003e\u003c/span\u003e. Eq.\u0026nbsp;(2) provides a specific instance of FocalModulation, where \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:q\\)\u003c/span\u003e\u003c/span\u003e represents the query projection function, and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\odot\\)\u003c/span\u003e\u003c/span\u003e denotes element-wise multiplication. The function \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:m\\left(\u0026middot;\\right)\\)\u003c/span\u003e\u003c/span\u003e serves as the context aggregation function, and its output is referred to as the modulator. 
The construction of the modulator consists of two steps: first, the hierarchical contextualization implemented by Eq.\u0026nbsp;(3) extracts multi-level contextual information from the data; second, Eq.\u0026nbsp;(4) performs a gated aggregation operation, which integrates and compresses the contextual features extracted previously.\u003cdiv id=\"Equc\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equc\" name=\"EquationSource\"\u003e\n$$\\:\\begin{array}{c}{Z}^{l}={f}_{a}^{l}\\left({Z}^{l-1}\\right)\\triangleq\\:GeLU\\left(DWConv\\left({Z}^{l-1}\\right)\\right)\\in\\:{R}^{H\\times\\:W\\times\\:C},\\#\\left(3\\right)\\end{array}$$\u003c/div\u003e\u003c/div\u003e\u003cdiv id=\"Equd\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equd\" name=\"EquationSource\"\u003e\n$$\\:\\begin{array}{c}{Z}^{out}=\\sum\\:_{l=1}^{L+1}{G}^{l}\\odot\\:{Z}^{l}\\in\\:{R}^{H\\times\\:W\\times\\:C},\\#\\left(4\\right)\\end{array}$$\u003c/div\u003e\u003c/div\u003e\u003cdiv id=\"Eque\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Eque\" name=\"EquationSource\"\u003e\n$$\\:\\begin{array}{c}{y}_{i}=q\\left({x}_{i}\\right)\\odot\\:h\\left(\\sum\\:_{l=1}^{L+1}{g}_{i}^{l}\\times\\:{Z}_{i}^{l}\\right),\\#\\left(5\\right)\\end{array}$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eIn Eq.\u0026nbsp;(3), \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{f}_{a}^{l}\\)\u003c/span\u003e\u003c/span\u003e is the contextualization function at level l, implemented by a depthwise convolution with kernel size \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{k}^{l}\\)\u003c/span\u003e\u003c/span\u003e followed by a GeLU activation function. Hierarchical contextualization extracts contextual information from local to global ranges at different levels of granularity. 
In Eq.\u0026nbsp;(4), \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{G}^{l}\\in\\:{R}^{H\\times\\:W\\times\\:1}\\)\u003c/span\u003e\u003c/span\u003e is the slice of G for level l. Specifically, we use a linear layer to obtain the spatial- and level-aware gating weights: \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\text{G}={f}_{g}\\left(x\\right)\\in\\:{R}^{H\\times\\:W\\times\\:(L+1)}\\)\u003c/span\u003e\u003c/span\u003e. Subsequently, the features are weighted and summed by element-wise multiplication, which generates a feature map \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{Z}^{out}\\)\u003c/span\u003e\u003c/span\u003e with the same size as the input X. Combining the hierarchical contextualization, gated aggregation, and query interaction described above, focal modulation as a whole can be expressed by Eq.\u0026nbsp;(5).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec5\" class=\"Section2\"\u003e \u003ch2\u003e3.3 GSConv module\u003c/h2\u003e \u003cp\u003eTo improve real-time vehicle detection performance on mobile embedded devices, we replace the standard convolution of YOLOv5s with GSConv convolution. GSConv reduces the model burden while maintaining accuracy. GSConv\u0026thinsp;+\u0026thinsp;Slim-Neck is a lightweight network proposed by Li et al. [\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e] for vehicle-mounted edge computing platforms in autonomous driving. Slim-Neck reduces the computational complexity through the cross-stage partial network module GSCSP, thereby improving the detection speed and accuracy. 
This optimization scheme effectively balances resource consumption and detection performance and is suitable for edge computing environments.\u003c/p\u003e \u003cp\u003eAlthough depthwise separable convolution (DSC) reduces the computational burden to a certain extent, its core mechanism, the separate processing of channel information, often limits the accuracy the model can achieve. This limitation weakens feature extraction and fusion and becomes a bottleneck for lightweight, high-precision detection. In contrast, GSConv combines standard convolution (SC) with depthwise separable convolution. This design retains the comprehensiveness and accuracy of standard convolution in feature extraction while incorporating the advantage of DSC in reducing computational complexity, thereby achieving lightweight and efficient real-time detection on edge devices. Through this combination, GSConv overcomes the limitations of DSC and supports real-time detection tasks in edge computing.\u003c/p\u003e \u003cp\u003eGSConv begins with a standard convolution for downsampling, applies a depthwise convolution (DWConv) to its output, concatenates the SC and DSC outputs, and concludes with a shuffle operation that mixes the corresponding channels. The structure of the GSConv module is illustrated in Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e, where \"Conv\" encompasses the convolution layer, batch normalization, and activation layer, while \"DWConv\" denotes the DSC operation. 
The cross-stage partial network (GSCSP) is designed to aggregate computations, reducing overall calculation and network complexity while maintaining adequate accuracy.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eGSConv compresses the spatial dimension of the feature map while significantly increasing the number of channels, and it retains the important connections between channels. This ensures that the model can capture and retain rich high-level semantic information while compressing information. Furthermore, the shuffle operation reduces the computational cost of the convolution operation. It simplifies the calculation process and improves the operating efficiency and processing speed of the model, making the model more suitable for real-time processing or resource-constrained environments while maintaining high performance. If GSConv is used at all levels, the inference time may be prolonged due to the increase in network depth, affecting the overall computational efficiency. In view of this, we use GSConv modules only in the neck. This layout avoids unnecessary computational overhead and also lets the attention mechanism focus on key features more efficiently, thereby improving the overall detection accuracy.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec6\" class=\"Section2\"\u003e \u003ch2\u003e3.4 GSC3 Module\u003c/h2\u003e \u003cp\u003eTo reduce sequence length and computational complexity while increasing the model's receptive field and enhancing feature extraction capabilities, we combined GSConv convolution with the Swin Transformer block into the GSC3 module. This new module replaces the original C3 module in YOLOv5s. The overall structure of the GSC3 module is shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e. Swin Transformer is a Transformer architecture designed for computer vision tasks. 
It introduces a self-attention mechanism based on shifted windows and adopts a hierarchical feature representation, which allows the model to balance computational complexity and performance [\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e]. The Swin_Transformer_block uses the shifted window approach to compute attention within local windows, allowing connections across the windows of the previous layer.\u003c/p\u003e \u003cp\u003eThis method reduces the complexity of the original attention calculation and addresses the issue of insufficient global context, thereby enhancing the model's performance.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eAs illustrated in Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003e, the Swin Transformer block comprises a shifted window-based multi-head self-attention (MSA) module followed by a two-layer multi-layer perceptron (MLP) with a GELU nonlinearity in between. A LayerNorm (LN) layer is applied prior to each MSA module and MLP, with a residual connection incorporated after each module. Two consecutive Swin Transformer blocks use a window MSA (W-MSA) module and a shifted window MSA (SW-MSA) module, respectively, enabling different windows to exchange information while minimizing computational load. 
Based on this window partitioning mechanism, consecutive Swin Transformer blocks are computed as follows:\u003cdiv id=\"Equf\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equf\" name=\"EquationSource\"\u003e\n$$\\:\\begin{array}{c}{\\widehat{z}}^{i}=W-MSA\\left(LN\\left({z}^{i-1}\\right)\\right)+{z}^{i-1},\\#\\left(6\\right)\\end{array}$$\u003c/div\u003e\u003c/div\u003e\u003cdiv id=\"Equg\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equg\" name=\"EquationSource\"\u003e\n$$\\:\\begin{array}{c}{z}^{i}=MLP\\left(LN\\left({\\widehat{z}}^{i}\\right)\\right)+{\\widehat{z}}^{i},\\#\\left(7\\right)\\end{array}$$\u003c/div\u003e\u003c/div\u003e\u003cdiv id=\"Equh\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equh\" name=\"EquationSource\"\u003e\n$$\\:\\begin{array}{c}{\\widehat{z}}^{i+1}=SW-MSA\\left(LN\\left({z}^{i}\\right)\\right)+{z}^{i},\\#\\left(8\\right)\\end{array}$$\u003c/div\u003e\u003c/div\u003e\u003cdiv id=\"Equi\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equi\" name=\"EquationSource\"\u003e\n$$\\:\\begin{array}{c}{z}^{i+1}=MLP\\left(LN\\left({\\widehat{z}}^{i+1}\\right)\\right)+{\\widehat{z}}^{i+1},\\#\\left(9\\right)\\end{array}$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eWhere \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\widehat{z}}^{i}\\)\u003c/span\u003e\u003c/span\u003e represents the output of the \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:W-MSA\\)\u003c/span\u003e\u003c/span\u003e module, and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{z}^{i}\\)\u003c/span\u003e\u003c/span\u003e represents the output of the MLP module of the i-th Block.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e"},{"header":"4 Experiment","content":"\u003cp\u003eAfter 
completing all the improvements to YOLOv5s, this paper proposes a real-time traffic vehicle detection algorithm called GSS-YOLO, which is trained and tested on a PC and deployed on the Atlas200DK A2 embedded system. This section first introduces the dataset and the experimental environment, and then describes the training results.\u003c/p\u003e \u003cdiv id=\"Sec8\" class=\"Section2\"\u003e \u003ch2\u003e4.1 Experimental Dataset\u003c/h2\u003e \u003cp\u003eIn this study, we used the challenging UA-DETRAC dataset to evaluate the performance of our proposed GSS-YOLO model on the vehicle detection task. The UA-DETRAC dataset is known for its large size and diversity, covering various types of vehicles such as cars, trucks, and buses, and providing rich test scenarios for vehicle detection algorithms. The dataset was collected from real traffic environments in Beijing and Tianjin, China. The training set contains 82,085 high-resolution images from 60 independent video frame sequences, covering vehicle image features under different times, weather, and traffic conditions. The test set contains 56,127 images from 40 different video frame sequences. The UA-DETRAC dataset accounts for the impact of weather on the data and was collected under four conditions: cloudy, night, sunny, and rainy. Because the UA-DETRAC dataset consists of frame sequences, it contains a large number of similar images. Therefore, we preprocessed the dataset by taking one image every 10 frames and divided the images into a training set, a validation set, and a test set in a ratio of 8:1:1. The resulting training set has 8639 images, the validation set has 1165 images, and the test set has 1166 images. 
Example vehicle images under different weather conditions are shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig7\" class=\"InternalRef\"\u003e7\u003c/span\u003e.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec9\" class=\"Section2\"\u003e \u003ch2\u003e4.2 Experimental Environment\u003c/h2\u003e \u003cp\u003e \u003cb\u003eTraining Configuration\u003c/b\u003e. The computer is equipped with an AMD Ryzen 5 5600 processor and an NVIDIA GeForce RTX 2080 Ti graphics processor with 11 GB of video memory, and runs on a Windows operating system. It uses PyTorch 1.10.0 as the deep learning framework and CUDA 10.2 for GPU acceleration. The deep learning model is trained in the PyCharm integrated development environment with Python 3.8. During training, the input image size is set to 640x640 pixels, the number of training epochs is set to 300, the batch size is 16, and the learning rate is set to 0.001. To ensure fairness in the experiment, no model uses pre-trained weights during training.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eAtlas200DK A2 hardware parameters\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"2\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eParameter\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eSpecification\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCPU\u003c/p\u003e 
\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eTAISHANV200M\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eAI processor\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eDaVinciV300 AI core\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMemory\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e4 GB LPDDR4X\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eAI computing power\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e8 TOPS (INT8)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003ePower consumption\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e21 W\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eWired network\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eGigabit Ethernet\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cb\u003eEmbedded Devices.\u003c/b\u003e At present, the mainstream embedded products on the market include the NVIDIA Jetson series and the Raspberry Pi series. As the computing power of embedded devices has improved, they can handle most deep learning inference tasks. The GSS-YOLO vehicle detection algorithm designed in this paper is deployed on embedded devices and then loaded into the mobile terminal, so as to promote the adoption of edge vehicle detection. 
Since vehicle detection in this paper operates on video streams in real time, it places certain requirements on the CPU and computing power of the embedded device. Although NVIDIA Jetson series products integrate CUDA-capable NVIDIA GPUs and offer very fast computing speeds, their cost is relatively high and their cost performance is low. In comparison, the Atlas200DK A2 offers higher computing power at a price similar to that of the Jetson Nano. Therefore, the Atlas200DK A2 is used as the embedded device in this experiment. The hardware parameters of the Atlas200DK A2 development board are shown in Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e, and the actual development board is shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig8\" class=\"InternalRef\"\u003e8\u003c/span\u003e.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec10\" class=\"Section2\"\u003e \u003ch2\u003e4.3 Evaluation Metrics\u003c/h2\u003e \u003cp\u003eThis paper employs precision (P), number of parameters (Params), floating point operations (FLOPs), and mean average precision (mAP) as evaluation metrics. Precision is defined as the ratio of correctly predicted positive samples to all samples predicted as positive, as shown in Eq.\u0026nbsp;(10):\u003cdiv id=\"Equj\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equj\" name=\"EquationSource\"\u003e\n$$P=\\frac{TP}{TP+FP}\\tag{10}$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eTP (true positive) denotes a sample that is predicted as positive and is actually positive, and FP (false positive) denotes a case where the predicted result differs from the actual result. 
That is, the prediction is judged as a positive sample while the actual result is a negative sample.\u003c/p\u003e \u003cp\u003eThe number of parameters refers to the total number of weights and biases that the model can learn and optimize. This metric not only helps assess the model's complexity but also indicates its training and storage requirements. Floating point operations (FLOPs, reported in the tables as GFLOPS) measure the number of floating point operations required for a single forward propagation through the neural network. This indicator is used to evaluate the computational complexity and efficiency of the model. Generally speaking, the higher the FLOPs value, the more computing resources and time the model consumes when processing data, where 1 TFLOPs equals 1000 GFLOPs.\u003c/p\u003e \u003cp\u003emAP is one of the indicators used to evaluate the performance of an object detection algorithm. First, the average precision (AP) of each category is calculated, and then the AP values of all categories are averaged to obtain mAP, as shown in equations (11) and (12). The larger the mAP value, the better the detection performance.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec11\" class=\"Section2\"\u003e \u003ch2\u003e4.4 Experimental Results\u003c/h2\u003e \u003cp\u003e \u003cb\u003eModel training results.\u003c/b\u003e Figure\u0026nbsp;\u003cspan refid=\"Fig9\" class=\"InternalRef\"\u003e9\u003c/span\u003e displays the loss metrics of the improved model on both the training and validation sets. The box_loss represents the error between the predicted bounding box and the ground truth, while the object_loss measures the error in the predicted object confidence. The classification_loss assesses whether the anchor_box is correctly classified against the corresponding ground truth. 
As shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig9\" class=\"InternalRef\"\u003e9\u003c/span\u003e, the various loss curves stabilize over time, indicating that the model converges effectively.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003cb\u003eComparative experiment.\u003c/b\u003e Compared with the traditional YOLOv5s algorithm, the GSS-YOLO algorithm has fewer parameters and lower computational complexity, and it performs well when detecting vehicles in complex backgrounds. To further verify the detection performance of the proposed algorithm, we compared it with several current mainstream detection models, such as YOLOv8s. Each network model was trained on the UA-DETRAC dataset using the same training method, and the precision (P), number of parameters (Params), floating point operations (GFLOPS), and mean average precision (mAP) over all samples were used as evaluation indicators for the comparison. The comparison results are shown in Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e, where bold text indicates the best result. Analysis of Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e shows that, on the same dataset, the proposed algorithm outperforms YOLOv5s on every reported metric. Although the precision of the proposed algorithm is slightly lower than that of the more advanced YOLOv8s (68.2% versus 70.8%), the GFLOPS and Params of GSS-YOLO are only 44.01% and 49.68% of those of YOLOv8s, respectively.\u003c/p\u003e \u003cp\u003eThis shows that GSS-YOLO achieves a good detection effect with fewer parameters in the task of detecting vehicle targets in complex backgrounds. 
Figure\u0026nbsp;\u003cspan refid=\"Fig10\" class=\"InternalRef\"\u003e10\u003c/span\u003e shows some detection results on the UA-DETRAC dataset.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eComparative experiments of different algorithms\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"5\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eModel\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eBackbone\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eGFLOPS\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eParams\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003ePrecision (%)\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eFaster-RCNN\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eResnet-50\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e201.09\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e 
\u003cp\u003e137098724\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e54.4\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eYOLOv5s\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eCSP-Darknet53\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e15.8\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e7020913\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e62.7\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eYOLOv6s\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eEfficientRep\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e45.17\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e18507345\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e66.5\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eYOLOv7-tiny\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eELAN\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e13.01\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e6014737\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e65.8\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eYOLOv8s\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eDarknet-53\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e 
\u003cp\u003e28.4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e11127132\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e\u003cb\u003e70.8\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eTraffic YOLO[\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eCSP-Darknet53\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e-\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e45M\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e65.5\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGSS-YOLO\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eCSP-Darknet53\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e\u003cb\u003e12.5\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u003cb\u003e5528050\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e68.2\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cb\u003eAblation experiment.\u003c/b\u003e Our ablation experiments are based on YOLOv5s. In order to compare the impact of each module used in this study on the proposed algorithm, we conducted multiple sets of experiments to test the performance of each module on four evaluation indicators. We mainly focus on average accuracy and model parameters. 
The performance after adding different modules is shown in Table\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e, where bold text represents the best results of the experiment. From Table\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e, we can see that Model1 replaces the Conv convolutions in the neck of YOLOv5s with GSConv convolutions. Although both the parameters and GFLOPS are reduced, the precision and mAP are almost unaffected. Comparing Model1 and Model2, although the addition of the FocalModulation module slightly increases the model parameters and GFLOPS, the precision and mAP both improve. Comparing Model1 and Model3, after replacing the C3 module with our proposed GSC3 module, the model parameters and GFLOPS are greatly reduced, and the precision and mAP are significantly improved. Finally, compared with the original YOLOv5s algorithm, the precision of our proposed GSS-YOLO algorithm increases from 62.7% to 68.2%, an increase of 5.5 percentage points, and the model parameters are reduced from 7020913 to 5528050, a reduction of 21.21% compared to the original version.\u003c/p\u003e \u003cp\u003eTherefore, our proposed GSS-YOLO has better detection accuracy and fewer model parameters. 
Figure\u0026nbsp;\u003cspan refid=\"Fig11\" class=\"InternalRef\"\u003e11\u003c/span\u003e is a comparison of the detection results of YOLOv5s and GSS-YOLO under complex backgrounds (the red arrow marks the missed target).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab3\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eAblation experiment\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"8\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c8\" colnum=\"8\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eModel\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eGSConv\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eFocalModulation\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eGSC3\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eParams\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e 
\u003cp\u003eGFLOPS\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c7\"\u003e \u003cp\u003ePrecision(%)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c8\"\u003e \u003cp\u003emAP(%)\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eYOLOv5s\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e7020913\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e15.8\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e62.7\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e61.9\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eModel1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u0026radic;\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e6579953\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e15.2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e62.8\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e61.9\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eModel2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" 
colname=\"c2\"\u003e \u003cp\u003e\u0026radic;\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e\u0026radic;\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e6983698\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e15.7\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e63.5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e62.1\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eModel3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u0026radic;\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u0026radic;\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e\u003cb\u003e5133041\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e\u003cb\u003e12.2\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e66.7\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e65.8\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGSS-YOLO\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u0026radic;\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e\u0026radic;\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u0026radic;\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" 
char=\".\" colname=\"c5\"\u003e \u003cp\u003e5528050\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e12.5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e\u003cb\u003e68.2\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e\u003cb\u003e66.1\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec12\" class=\"Section2\"\u003e \u003ch2\u003e4.5 Embedded Deployment\u003c/h2\u003e \u003cp\u003eWith the improvement in the performance of small embedded devices, their computing power is now sufficient to support most object detection tasks based on deep learning. Therefore, we deployed the vehicle detection algorithm GSS-YOLO developed in this paper on an embedded device. This study comprehensively considered factors such as cost, power consumption, and computing power, and selected the Atlas200DK A2 as the embedded development platform.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eFirst, the model is trained and converted on the PC, and the converted model is uploaded to the Atlas200DK A2 development board over SSH using the MobaXterm software. The detection script is then run from Jupyter on the board, and the inference results are displayed in the Jupyter interface. The detection results are shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig12\" class=\"InternalRef\"\u003e12\u003c/span\u003e. To verify that the proposed algorithm detects better than the original YOLOv5s on the embedded system, we conducted experiments on the Atlas200DK A2, and the experimental results are shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig13\" class=\"InternalRef\"\u003e13\u003c/span\u003e. 
From the two sets of experiments, it can be seen that when the GSS-YOLO algorithm is deployed on the Atlas200DK A2, it performs better in complex backgrounds than YOLOv5s and has stronger detection capabilities for distant vehicles. We use a metering power socket to measure the power consumption of the PC and the Atlas200DK A2 during inference. The average inference power consumption of the PC is 177.3 W, while that of the Atlas200DK A2 is only 11.5 W. Compared with a PC with a high-performance RTX 2080 Ti GPU, although the accuracy of the embedded system is slightly reduced (from 68.2% to 67.9%), the overall power consumption is only 6.49% of that of the PC, and the inference speed is 37.04 FPS, which meets the speed requirements of real-time detection. The comparison data between the Atlas200DK A2 and the GPU are shown in Table\u0026nbsp;\u003cspan refid=\"Tab4\" class=\"InternalRef\"\u003e4\u003c/span\u003e.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab4\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 4\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eComparison between Atlas200DK A2 and GPU\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"4\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eDevice\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eAverage power\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eInference time\u003c/p\u003e \u003c/th\u003e 
\u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eAccuracy (%)\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003ePC\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e177.3 W\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e11 ms\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e68.2\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eAtlas200DK A2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e11.5 W\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e26.99 ms\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e67.9\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eFor the real-time inference performance analysis, we compare the frame rate achieved during inference with the frame rates reported in the relevant literature. 
The comparison results are shown in Table\u0026nbsp;\u003cspan refid=\"Tab5\" class=\"InternalRef\"\u003e5\u003c/span\u003e.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab5\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 5\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eReal-time performance comparison\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"4\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\"\u0026times;\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eAlgorithm\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003ePlatform\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eResolution\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eFrame rate\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e[\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e33\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eNVIDIA Jetson Nano\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026times;\" colname=\"c3\"\u003e \u003cp\u003e512\u0026times;512\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e12.8\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e 
\u003cp\u003e[\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eNVIDIA Jetson Nano\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026times;\" colname=\"c3\"\u003e \u003cp\u003e640\u0026times;640\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e16.0\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003e[\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eNVIDIA Xavier NX\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026times;\" colname=\"c3\"\u003e \u003cp\u003e512\u0026times;320\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e26.04\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"char\" char=\"\u0026times;\" colname=\"c3\"\u003e \u003cp\u003e1024\u0026times;576\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e14.2\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e[\u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e36\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eRaspberry Pi 4B\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026times;\" colname=\"c3\"\u003e \u003cp\u003e640\u0026times;640\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e25.0\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e[\u003cspan citationid=\"CR37\" 
class=\"CitationRef\"\u003e37\u003c/span\u003e]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eNVIDIA Jetson Tegra X2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026times;\" colname=\"c3\"\u003e \u003cp\u003e384\u0026times;384\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e30.9\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eOurs\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eAtlas200DK A2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026times;\" colname=\"c3\"\u003e \u003cp\u003e640\u0026times;640\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e37.04\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e"},{"header":"5 Conclusion and discussion","content":"\u003cp\u003eTo address the problem of vehicle detection in complex backgrounds in autonomous driving and intelligent transportation systems, and to meet the requirements of detection on edge devices, this paper proposes GSS-YOLO, an improved algorithm based on YOLOv5s. Replacing the Conv convolutions in the neck network with GSConv convolutions reduces the amount of computation while maintaining, or even slightly improving, model performance. The FocalModulation module is used to replace the original SPPF module. Finally, the GSC3 module we proposed replaces the C3 module in the original neck network. The model was trained using the UA-DETRAC dataset, with inference conducted on both a PC and the Atlas200DK A2. Results indicate that, compared to the original YOLOv5s, precision improved from 62.7% to 68.2%, an increase of 5.5 percentage points, while model parameters were reduced by 21.21%. 
Comparative experiments on the UA-DETRAC dataset demonstrate that the algorithm performs effectively in detecting distant vehicles, thereby validating the new model's effectiveness.\u003c/p\u003e \u003cp\u003eThis paper proposes a vehicle detection algorithm, deploys it on an embedded device, and obtains good detection results. However, in actual traffic environments, the proposed algorithm still falls somewhat short of fully real-time detection. Further optimization of the algorithm and AI acceleration on the Atlas200DK A2 therefore require additional research, and there remains considerable room for improvement in detecting difficult targets such as high-speed and blurred vehicles. Given the complex environmental factors, image preprocessing methods could be introduced to handle images with abnormal lighting and blur. Further experiments will be carried out in the future to achieve a more efficient vehicle detection algorithm.\u003c/p\u003e"},{"header":"Declarations","content":"\u003ch2\u003eAuthor Contribution\u003c/h2\u003e\u003cp\u003eShengning Lu wrote the entire manuscript, Zhihao Ren typeset the paper, Yan Zhi and Xinhua Wang prepared the figures in the paper, Yong Liang organized the data in the paper, and all authors reviewed the manuscript.\u003c/p\u003e\u003ch2\u003eAcknowledgements\u003c/h2\u003e \u003cp\u003eThis work was financially supported by the Project of the Guilin University of Technology (No. GLUTQD2017003).\u003c/p\u003e\u003ch2\u003eData availability\u003c/h2\u003e \u003cp\u003eThe dataset (UA-DETRAC) used in this study is an open source dataset and is publicly available on the Internet.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eGholamhosseinian A, Seitz J (2021) Vehicle classification in intelligent transport systems: An overview, methods and software perspective. 
IEEE Open J Intell Transp Syst 2:173\u0026ndash;194. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1109/OJITS.2021.3096756\u003c/span\u003e\u003cspan address=\"10.1109/OJITS.2021.3096756\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLin CJ, Jhang JY (2022) Intelligent traffic-monitoring system based on YOLO and convolutional fuzzy neural networks. IEEE Access 10:14120\u0026ndash;14133. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1109/ACCESS.2022.3147866\u003c/span\u003e\u003cspan address=\"10.1109/ACCESS.2022.3147866\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWang Z, Zhan J, Duan C, Guan X, Lu P, Yang K (2022) A review of vehicle detection techniques for intelligent vehicles. IEEE Trans Neural Networks Learn Syst 34(8):3811\u0026ndash;3831. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1109/TNNLS.2021.3128968\u003c/span\u003e\u003cspan address=\"10.1109/TNNLS.2021.3128968\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhao ZQ, Zheng P, Xu S (2019) Object detection with deep learning: A review. IEEE Trans Neural Networks Learn Syst 30(11):3212\u0026ndash;3232. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1109/TNNLS.2018.2876865\u003c/span\u003e\u003cspan address=\"10.1109/TNNLS.2018.2876865\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZou Z, Chen K, Shi Z, Guo Y (2023) Object detection in 20 years: A survey. Proceedings of the IEEE 111(3):257\u0026ndash;276. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1109/JPROC.2023.3238524\u003c/span\u003e\u003cspan address=\"10.1109/JPROC.2023.3238524\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSolunke BR, Gengaje SR (2023) A Review on traditional and deep learning based object detection methods. In 2023 International Conference on Emerging Smart Computing and Informatics (ESCI) pp 1\u0026ndash;7. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1109/ESCI56872.2023.10099639\u003c/span\u003e\u003cspan address=\"10.1109/ESCI56872.2023.10099639\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLee C, Kim D (2018) Visual homing navigation with Haar-like features in the snapshot. IEEE Access 6:33666\u0026ndash;33681. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1109/ACCESS.2018.2842679\u003c/span\u003e\u003cspan address=\"10.1109/ACCESS.2018.2842679\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMin W, Liu R, He D (2022) Traffic sign recognition based on semantic scene understanding and structural traffic sign location. IEEE Trans Intell Transp Syst 23(9):15794\u0026ndash;15807. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1109/TITS.2022.3145467\u003c/span\u003e\u003cspan address=\"10.1109/TITS.2022.3145467\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDonnelly J, Barnett AJ, Chen C (2022) Deformable protopnet: An interpretable image classifier using deformable prototypes. 
In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition pp 10265\u0026ndash;10275. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1109/cvpr52688.2022.01002\u003c/span\u003e\u003cspan address=\"10.1109/cvpr52688.2022.01002\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSasikala N, Swathipriya V, Ashwini M (2020) Feature extraction of real-time image using Sift algorithm. Eur J Electr Eng Comput Sci 4(3):206\u0026ndash;214. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.24018/ejece.2020.4.3.206\u003c/span\u003e\u003cspan address=\"10.24018/ejece.2020.4.3.206\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBansal M, Kumar M, Kumar M (2021) 2D object recognition: a comparative analysis of SIFT, SURF and ORB feature descriptors. Multimedia Tools Appl 80(12):18839\u0026ndash;18857. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1007/s11042-021-10646-0\u003c/span\u003e\u003cspan address=\"10.1007/s11042-021-10646-0\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBansal M, Goyal A, Choudhary A (2022) A comparative analysis of K-nearest neighbor, genetic, support vector machine, decision tree, and long short term memory algorithms in machine learning. Decis Analytics J 3:100071. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.dajour.2022.100071\u003c/span\u003e\u003cspan address=\"10.1016/j.dajour.2022.100071\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMaity M, Banerjee S, Chaudhuri SS (2021) Faster r-cnn and yolo based vehicle detection: A survey. In 2021 5th international conference on computing methodologies and communication (ICCMC) pp 1442\u0026ndash;1447. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1109/ICCMC51019.2021.9418274\u003c/span\u003e\u003cspan address=\"10.1109/ICCMC51019.2021.9418274\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eJiang P, Ergu D, Liu F (2022) A Review of Yolo algorithm developments. Procedia Comput Sci 199:1066\u0026ndash;1073. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.procs.2022.01.135\u003c/span\u003e\u003cspan address=\"10.1016/j.procs.2022.01.135\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKang L, Lu Z, Meng L (2024) YOLO-FA: Type-1 fuzzy attention based YOLO detector for vehicle detection. Expert Syst Appl 237:121209. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.eswa.2023.121209\u003c/span\u003e\u003cspan address=\"10.1016/j.eswa.2023.121209\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRen J, Yang J, Zhang W (2024) RBS-YOLO: a vehicle detection algorithm based on multi-scale feature extraction. SIViP 18(4):3421\u0026ndash;3430. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1007/s11760-024-03007-5\u003c/span\u003e\u003cspan address=\"10.1007/s11760-024-03007-5\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDong X, Yan S, Duan C (2022) A lightweight vehicles detection network model based on YOLOv5. Eng Appl Artif Intell 113:104914. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.engappai.2022.104914\u003c/span\u003e\u003cspan address=\"10.1016/j.engappai.2022.104914\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHamzenejadi MH, Mohseni H (2023) Fine-tuned YOLOv5 for real-time vehicle detection in UAV imagery: Architectural improvements and performance boost. Expert Syst Appl 231:120845. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.eswa.2023.120845\u003c/span\u003e\u003cspan address=\"10.1016/j.eswa.2023.120845\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLi Y, Zhang M, Zhang C (2024) YOLO-CCS: Vehicle detection algorithm based on coordinate attention mechanism. Digit Signal Proc 153:104632. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.dsp.2024.104632\u003c/span\u003e\u003cspan address=\"10.1016/j.dsp.2024.104632\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGidaris S, Komodakis N (2015) Object detection via a multi-region and semantic segmentation-aware cnn model. In Proceedings of the IEEE international conference on computer vision pp 1134\u0026ndash;1142. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1109/iccv.2015.135\u003c/span\u003e\u003cspan address=\"10.1109/iccv.2015.135\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGirshick R (2015) Fast R-CNN. Proceedings of the IEEE international conference on computer vision pp 1440\u0026ndash;1448. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.48550/arXiv.1504.08083\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.1504.08083\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNguyen H (2019) Improving Faster R-CNN Framework for Fast Vehicle Detection. Math Probl Eng 1:3808064. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1155/2019/3808064\u003c/span\u003e\u003cspan address=\"10.1155/2019/3808064\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eXiao Y, Tian Z, Yu J (2020) A review of object detection based on deep learning. Multimedia Tools Appl 79:23729\u0026ndash;23791. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1007/s11042-020-08976-6\u003c/span\u003e\u003cspan address=\"10.1007/s11042-020-08976-6\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRedmon J (2016) You only look once: Unified, real-time object detection. Proceedings of the IEEE conference on computer vision and pattern recognition pp 779\u0026ndash;788. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1109/cvpr.2016.91\u003c/span\u003e\u003cspan address=\"10.1109/cvpr.2016.91\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eTerven J, C\u0026oacute;rdova-Esparza DM, Romero-Gonz\u0026aacute;lez JA (2023) A comprehensive review of yolo architectures in computer vision: From yolov1 to yolov8 and yolo-nas. Mach Learn Knowl Extr 5:1680\u0026ndash;1716. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.3390/make5040083\u003c/span\u003e\u003cspan address=\"10.3390/make5040083\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLiu H, Duan X, Lou H (2023) Improved GBS-YOLOv5 algorithm based on YOLOv5 applied to UAV intelligent traffic. Sci Rep 13:9577. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.3390/make5040083\u003c/span\u003e\u003cspan address=\"10.3390/make5040083\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGuo S, Li S, Han Z (2024) Efficient detection of multiscale defects on metal surfaces with improved YOLOv5. Multimedia Tools Appl 1\u0026ndash;23. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1007/s11042-024-19477-1\u003c/span\u003e\u003cspan address=\"10.1007/s11042-024-19477-1\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHu T, Gong Z, Song J (2024) Research and implementation of an embedded traffic sign detection model using improved YOLOV5. Int J Autom Technol 25:881\u0026ndash;892. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1007/s12239-024-00082-y\u003c/span\u003e\u003cspan address=\"10.1007/s12239-024-00082-y\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eYang J, Li C, Dai X (2022) Focal modulation networks. Adv Neural Inf Process Syst 35:4203\u0026ndash;4217. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.48550/arXiv.2203.11926\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.2203.11926\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLi H, Li J, Wei H (2022) Slim-neck by GSConv: A better design paradigm of detector architectures for autonomous vehicles. arXiv preprint. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.48550/arXiv.2206.02424\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.2206.02424\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. arXiv:2206.02424\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLiu Z, Lin Y, Cao Y (2021) Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision pp 10012\u0026ndash;10022. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1109/iccv48922.2021.00986\u003c/span\u003e\u003cspan address=\"10.1109/iccv48922.2021.00986\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChen X, Zou Y, Ke H (2024) TrafficYOLO: YOLO with Multi-Head Attention Mechanism for Traffic Detection Scenarios. 
In 2024 5th International Seminar on Artificial Intelligence, Networking and Information Technology (AINIT) pp 2276\u0026ndash;2279. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1109/AINIT61980.2024.10581465\u003c/span\u003e\u003cspan address=\"10.1109/AINIT61980.2024.10581465\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKoay HV, Chuah JH, Chow CO et al (2021) YOLO-RTUAV: Towards real-time vehicle detection through aerial images with low-cost edge devices. Remote Sens 13(21):4196. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.3390/rs13214196\u003c/span\u003e\u003cspan address=\"10.3390/rs13214196\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhang ZD, Tan ML, Lan ZC et al (2022) CDNet: A real-time and robust crosswalk detection network on Jetson nano based on YOLOv5. Neural Comput Appl 34(13):10719\u0026ndash;10730. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1007/s00521-022-07007-9\u003c/span\u003e\u003cspan address=\"10.1007/s00521-022-07007-9\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBalamuralidhar N, Tilon S, Nex F (2021) MultEYE: Monitoring system for real-time vehicle detection, tracking and speed estimation from UAV imagery on edge-computing platforms. Remote Sens 13(4):573. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.3390/rs13040573\u003c/span\u003e\u003cspan address=\"10.3390/rs13040573\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWu H, Hua Y, Zou H et al (2022) A lightweight network for vehicle detection based on embedded system. J Supercomputing 78(16):18209\u0026ndash;18224. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1007/s11227-022-04596-z\u003c/span\u003e\u003cspan address=\"10.1007/s11227-022-04596-z\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChen J, Zhang X, Peng X et al (2023) Shuffle-octave-yolo: a tradeoff object detection method for embedded devices. J Real-Time Image Proc 20(2):25. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1007/s11554-023-01284-w\u003c/span\u003e\u003cspan address=\"10.1007/s11554-023-01284-w\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"Vehicle detection, YOLO, Intelligent transportation, Embedded devices","lastPublishedDoi":"10.21203/rs.3.rs-5357943/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-5357943/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eIn real-world vehicle detection scenarios, numerous complex and highly uncertain factors, including variations in lighting, motion blur, occlusion, and weather conditions, can significantly impact performance. Autonomous driving and intelligent traffic systems must be able to respond quickly to various traffic situations. In order to reduce the impact of these uncertainties in actual scenarios and improve the accuracy of vehicle detection in complex backgrounds, we propose a new YOLO detector GSS-YOLO based on YOLOv5s. First, in order to reduce the amount of calculation while improving the performance of model detection and maintaining detection accuracy, we replaced all Conv convolutions in the neck with GSConv convolutions. Secondly, in order to reduce the sequence length and reduce the computational complexity while increasing to improve the receptive field of the model and improve feature extraction capabilities, we embed the Swin-Transformer attention mechanism into the C3 module. Finally, in order to increase the model's ability to handle small objects that are difficult to detect or objects in complex backgrounds, we use the FocalModulation module to replace the original fast spatial pyramid pooling module. 
Compared with traditional YOLOv5s, our method reduces model parameters by 21.21% and GFLOPS by 20.88%. GSS-YOLO can increase mAP by 4.2% and accuracy by 5.5% on the challenging vehicle detection data set UA-DETRAC. We deployed the GSS-YOLO algorithm on the Atlas200DK A2 embedded system. After testing, it can achieve an FPS of 37.04 when the accuracy is only reduced by 0.3, meeting the requirements of real-time detection.\u003c/p\u003e","manuscriptTitle":"GSS-YOLO: Vehicle detection method and embedded deployment in complex traffic road scenarios","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2024-11-14 04:19:16","doi":"10.21203/rs.3.rs-5357943/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"8fb24c80-c4c5-4062-af06-13983f7769df","owner":[],"postedDate":"November 14th, 2024","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[],"tags":[],"updatedAt":"2025-05-10T19:38:15+00:00","versionOfRecord":[],"versionCreatedAt":"2024-11-14 04:19:16","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-5357943","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-5357943","identity":"rs-5357943","version":["v1"]},"buildId":"qtupq5eGEP_6zYnWcrvyt","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}