Identifying Rice Field Weeds from Unmanned Aerial Vehicle Remote Sensing Imagery Using Deep Learning

Research Square, Research Article
Zhonghui Guo, Dongdong Cai, Yunyi Zhou, Tongyu Xu, Fenghua Yu
This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-4008720/v1
This work is licensed under a CC BY 4.0 License
Status: Published. Journal Publication published 16 Jul, 2024. Read the published version in Plant Methods.
Version 1 posted 12

Abstract
Background
Rice field weed object detection can provide key information on weed species and locations for precise spraying, which is of great significance in actual agricultural production. However, facing the complex and changing real farm environments, traditional object detection methods still have difficulties in identifying small-sized, occluded and densely distributed weed instances.
To address these problems, this paper proposes a multi-scale feature enhanced DETR network, named MS-DETR. By adding multi-scale feature extraction branches on top of DETR, this model fully utilizes the information from different semantic feature layers to improve recognition capability for rice field weeds in real-world scenarios.

Methods
We introduce multi-scale feature layers into the DETR model and design each semantic feature layer differently. The high-level semantic feature layer adopts a Transformer structure to extract contextual information between barnyard grass and rice plants. The low-level semantic feature layer uses a CNN structure to extract local detail features of barnyard grass. Introducing multi-scale feature layers inevitably increases model computation and thus lowers inference speed. Therefore, we employ PConv (Partial convolution) to replace the standard convolutions in the model, so as to reduce memory access time and computational redundancy.

Results
On our constructed rice field weed dataset, compared with the original DETR model, our proposed MS-DETR model improved average recognition accuracy of rice field weeds by 2.8 percentage points, reaching 0.792. The MS-DETR model size is 40.8M with an inference time of 0.0081 seconds. Compared with three classical DETR models (Deformable DETR, Anchor DETR and DAB-DETR), the MS-DETR model improved average precision by 2.1%, 4.9% and 2.4%, respectively.

Discussion
This model has advantages such as high recognition accuracy and fast recognition speed. It is capable of accurately identifying rice field weeds in complex real-world scenarios, thus providing key technical support for precision spraying and management of variable-rate spraying systems.

Keywords: Rice field weeds, Target detection, Transformer, DETR, UAV

1. Introduction
During the growth period of rice, competition for soil nutrients and water between rice and weeds can lead to the loss of water and fertilizer resources. Additionally, the proliferation of weeds can contribute to the emergence and spread of diseases and pests. Weeds in rice fields have become a critical biological threat limiting rice yield and quality [1]. Therefore, effective weed control is a necessary step to achieve high and stable rice production. The characteristics of rice field environments, including soft and wet soil, low-lying terrain, and narrow spaces, impose certain limitations on traditional mechanical weed control methods. In this context, unmanned aerial vehicles (UAVs) have demonstrated unique applicability due to their flexible maneuverability. In recent years, with the improvement of payload capacity and performance of agricultural UAVs, aerial spraying has become the mainstream method for weed control in rice fields [2].

Currently, weed control using agricultural UAVs in rice fields suffers from indiscriminate spraying. Widespread spraying may not accurately target weed locations, leading to low pesticide utilization rates and potential negative environmental impacts [3–4]. Utilizing high-resolution rice field remote sensing images captured by UAVs for precise weed identification and generating variable-rate prescription maps can enable targeted pesticide application based on weed locations and quantities, addressing this issue effectively [5].

Barnyard grass is one of the most common weeds in rice fields and belongs to the same Poaceae family as rice. The two share a high degree of similarity in appearance and growth habits [6]. In rice field images obtained by UAVs, barnyard grass often occupies only a few dozen pixels, representing a typical small target. The recognition process for such small targets is prone to false positives or negatives due to lighting conditions and mutual occlusion.
The high similarity between barnyard grass and rice, coupled with the small size of the targets and the complex and dynamic background, poses a significant challenge for accurate identification of barnyard grass in rice fields based on UAV remote sensing images [7].

In recent years, deep learning approaches have demonstrated significant potential in weed identification tasks [8–9]. Deep learning, with advantages such as end-to-end learning, high-level feature learning, and large-scale data-driven capabilities, has rapidly emerged as the mainstream method in the field of object detection [10]. Deep learning can learn end-to-end directly from large-scale annotated weed image data, automatically extracting the visual features required for weed classification without manual feature design and selection. Furthermore, as datasets expand, deep models show continuous improvement in performance and adaptability to different agricultural environments [11].

There are two typical architectures for modern object detectors: CNN-based and Transformer-based. Over the past few years, extensive research has been conducted on CNN-based object detectors. These detectors have evolved from the initial two-stage structures, such as the R-CNN series, into one-stage structures, represented by models such as the YOLO series [15]. Zhang et al. [12] embedded the CBAM attention mechanism after the pooling layers in the latter part of VGG19, forming the VGG19-CBAM structure as the optimal backbone feature extraction network for the Faster R-CNN model. They utilized this model for weed detection in soybean fields, achieving an average recognition accuracy of 99.16%, with an average recognition speed of 336 ms per image. Gallo et al. [13] collected over 3000 UAV remote sensing images of weeds in a chicory plantation to create a weed dataset for chicory production.
They trained a YOLOv7 model on this dataset for weed target detection, achieving an average recognition accuracy of 56.6%. Both two-stage and one-stage models involve many manually crafted components, such as anchor generation, rule-based training target assignment, and non-maximum suppression (NMS) post-processing, and are therefore not fully end-to-end.

Since its introduction as a Transformer-based object detector, DETR (DEtection TRansformer) has garnered widespread attention in the academic community because it eliminates various manually crafted components, such as NMS. This architecture significantly simplifies the object detection pipeline, achieving end-to-end object detection [15]. In recent years, Transformer-based detectors have made significant progress in performance, thanks to researchers' efforts in accelerating training convergence and reducing optimization difficulty [16]. Zhu et al. [14] pointed out that when Transformer components are initialized, attention modules apply almost identical attention weights to all pixels in the feature map, leading to a longer training time to converge. To address this issue, they proposed a deformable attention module, combining the sparse spatial sampling of deformable convolution with the relationship modeling capability of Transformers, to overcome the slow convergence problem in DETR models. Li et al. [17] attributed the slow convergence of DETR models to the instability of bipartite graph matching, which results in inconsistent optimization objectives in the early training stages. To resolve this issue, they introduced noisy ground truth bounding boxes into the Transformer decoder, effectively reducing the difficulty of bipartite graph matching and accelerating convergence. However, despite these improvements in convergence speed and overall performance, such models still perform poorly in detecting small targets.
Current research has demonstrated that integrating multi-scale feature layers into the model can effectively enhance its detection performance for small targets [ 18 ]. In this paper, we propose an MS-DETR model, which enhances the DETR model's ability to detect small targets by introducing multi-scale feature layers into the DETR framework. Existing research indicates that low-level semantic information typically contains more fine-grained and local features, which may be more distinctive and sensitive for small targets[ 19 – 21 ]. Therefore, we differentially design the various feature layers of multi-scale features. For high-level semantic information, we apply Transformer structures to extract features, fully integrating context information from different perceptual domains. For low-level semantic information, we use a more computationally efficient CNN structure for feature extraction and encoding. Subsequently, effective fusion of the two types of features is achieved through cross-scale feature fusion, leveraging their respective advantages and forming an information-rich feature space. In traditional Transformer structures, due to all heads sharing the same input features and relying on isolated learning with non-shared parameters, there is often a highly homogeneous and redundant representation across different heads[ 22 – 23 ]. To reduce computational redundancy, we adopt a novel structure called Cascaded Group Attention module to replace the traditional Transformer structure. This module provides different channel subsets of features as input for each head, allowing each head to learn more unique features, thereby enhancing the model's learning ability and reducing computational redundancy. The introduction of multi-scale feature layers inevitably increases the model's computational complexity and slows down the inference speed. 
In this study, we use an efficient and parallelizable Partial convolution (PConv) in the MS-DETR model to replace conventional convolution, aiming to maximize the model's inference speed.

2. Experimental Design
2.1 Data Collection:
In rice production management, the "two seals and one kill" weed control strategy is commonly employed. This strategy involves two soil herbicide applications before rice transplanting and after the tillering stage, along with one herbicide application during the panicle initiation stage. Therefore, between May and June 2022, at the experimental field of Shenyang Agricultural University in Haicheng City, Liaoning Province, UAVs were used to collect remote sensing data for both rice and barnyard grass during the tillering and panicle initiation stages. The experimental area measured 165 meters in length and 97 meters in width, with a total area of 16,005 square meters, as illustrated in Fig. 1. The DJI M300 drone served as the flight platform, flying at an altitude of 30 meters and equipped with the Zenmuse P1 lens with an effective pixel count of 45 million. To ensure image registration accuracy, the drone followed a predetermined flight path with 80% forward overlap and 80% side overlap. Images were captured in a vertical perspective to cover the entire experimental field. The collected image resolution was 8192×5460 pixels. DJI Terra was used for image registration and fusion of the acquired rice field remote sensing images. To prevent disturbances to both rice plants and weeds caused by strong winds and to ensure the accuracy of subsequent image registration and fusion, UAV remote sensing data collection was conducted under weather conditions with wind speeds below level 4. A total of 171 UAV remote sensing images of rice fields were collected.
2.2 Dataset Generation:
To avoid inconsistencies in annotation when the same target appears in different images, this study first used DJI Terra to register and fuse the collected UAV remote sensing data of weeds. The registered and fused images were then segmented into non-overlapping sub-images of 600×600 pixels each, yielding a total of 3,094 rice field weed remote sensing images. Weed distribution in the field is uneven, with dense areas showing continuous weed growth, while sparse areas exhibit individual weed plants. Therefore, during the manual annotation process, barnyard grass was classified into two types: continuous patches of barnyard grass and single barnyard grass plants. The Labelme software (v4.5.6) was employed for manual annotation. An example from the dataset is shown in Fig. 2, and the sample quantities after annotation are presented in Table 1. The dataset was partitioned into training, validation, and testing sets at a ratio of 7:2:1, with no duplicate data between sets. In Fig. 2, the yellow boxes indicate the "Field ridge" labels, the green boxes represent the "Single barnyard grass plant" labels, and the red boxes correspond to the "Continuous patches of barnyard grass" labels.

Figure 2: Example of a remote sensing image dataset of weeds in rice fields

Table 1 Number of samples in the dataset.
Label category | Number of original images | Total number of images after augmentation
Field ridge | 438 | 1752
Continuous patches of barnyard grass | 218 | 872
Single barnyard grass plant | 2876 | 11504

2.3 Data Augmentation:
To enhance the model's robustness to the aforementioned samples, we augmented the training dataset fourfold through techniques such as random cropping, color jittering, noise addition, and random rotation. This resulted in a dataset containing a total of 14,128 images, with the sample distribution outlined in Table 1.

3. Materials and Methods
3.1: Methodology Employed in this Study
Model Overview:
To address the limitations of the DETR model in small object detection [24], we introduced a multi-scale feature layer into the DETR model, creating the MS-DETR model. The purpose of this modification is to better adapt to small targets, particularly the detection of single barnyard grass plants in rice fields. The model framework, illustrated in Fig. 3, consists of three core components: Backbone, Encoder, and Decoder. We designed a hybrid encoder, which comprises a feature extraction module and a cross-scale feature fusion module. In the feature extraction module, we designed the feature layers differently. The high-level semantic feature layers employ a Transformer structure to emphasize the extraction of contextual information related to barnyard grass and rice, while the low-level semantic feature layers efficiently extract detailed barnyard grass features using a CNN structure. The cross-scale feature fusion module combines the features extracted by the Transformer and CNN structures across different scales. This hybrid encoder design combines features from different levels, creating favorable conditions for improving the overall performance of the model. To better extract barnyard grass features in the high-level semantic feature layers, we introduced the Cascaded Group Attention module, replacing the traditional multi-head attention mechanism to enhance learning ability while reducing the computational burden. Finally, to further improve inference speed, MS-DETR adopts the efficient and parallelizable PConv in place of conventional convolution operations. Together, these designs improve the model's performance in small object detection tasks, especially the detection of barnyard grass.

We selected a rice field weed image with dimensions of 600×600.
Firstly, we preprocessed the image by resizing it to 640×640×3 using bilinear interpolation. The resized image was then fed into the Backbone module for initial feature extraction. In the middle layers of the Backbone, we concurrently inserted multiple convolutional layers to perform multi-scale convolution operations on the feature map, generating fixed-dimensional multi-scale feature representations. This resulted in three multi-scale feature layers labeled s3, s4, and s5, with dimensions of 80×80×128, 40×40×256, and 20×20×512, respectively. The high-level semantic feature layer s5 was input into the Transformer structure for feature encoding, producing the encoded feature layer y1 with dimensions of 20×20×512. Simultaneously, the low-level semantic feature layer s3 was input into a CNN network for feature extraction, yielding the feature layer y2 with dimensions of 80×80×256. Finally, we input y1, y2, and s4 into the feature fusion module for cross-scale feature fusion. The output is the fused feature layer y3 with dimensions of 80×80×512. This y3 serves as our final feature representation, which is then input into the subsequent decoding layers to produce the weed recognition predictions.

Feature Extraction Based on Cascaded Group Attention:
In traditional Transformer structures, because all heads share the same input features, different heads may learn redundant information. At the same time, because parameters are not shared, each head independently learns weights and feature representations, potentially leading to overly similar features across different heads and a lack of diversity. To overcome this issue, this study introduces the Cascaded Group Attention module. This module provides each head with a different channel subset of the features as input, enabling each head to learn more unique features. Additionally, it cascades output features among heads, thereby enhancing the model's learning ability and reducing computational redundancy [25]. As illustrated in Fig.
3, the structure of the Cascaded Group Attention module can be described as follows:

$$\tilde{X}_{ij}=\mathrm{Attn}\left(X_{ij}W_{ij}^{Q},\;X_{ij}W_{ij}^{K},\;X_{ij}W_{ij}^{V}\right) \tag{1}$$

$$\tilde{X}_{i+1}=\mathrm{Concat}\left[\tilde{X}_{ij}\right]_{j=1:h}W_{i}^{P} \tag{2}$$

where the \(j\)-th head computes the self-attention over \(X_{ij}\), which is the \(j\)-th split of the input feature \(X_{i}\), i.e., \(X_{i}=\left[X_{i1},X_{i2},\dots,X_{ih}\right]\) and \(1\le j\le h\). \(h\) is the total number of heads, \(W_{ij}^{Q}\), \(W_{ij}^{K}\), and \(W_{ij}^{V}\) are projection layers mapping the input feature split into different subspaces, and \(W_{i}^{P}\) is a linear layer that projects the concatenated output features back to the dimension consistent with the input.

The Cascaded Group Attention calculates the attention maps for each head in a cascading manner, adding the output of each head to the input of the next one. This design encourages the Q, K, V layers to learn feature projections with richer information, progressively improving the capacity of the feature representation. Through the cascading structure, the model accumulates and propagates richer information from one attention head to the next:

$$X'_{ij}=X_{ij}+\tilde{X}_{i(j-1)},\quad 1<j\le h \tag{3}$$

where \(X'_{ij}\) is the sum of the \(j\)-th input split \(X_{ij}\) and the \((j-1)\)-th head output \(\tilde{X}_{i(j-1)}\) calculated by Eq. (1). It replaces \(X_{ij}\) as the new input feature for the \(j\)-th head when calculating the self-attention. The Cascaded Group Attention reduces the FLOPs and parameters of the QKV layers by a factor of \(h\), since the input and output channels of these layers are each reduced by a factor of \(h\).
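As an illustration of Eqs. (1) to (3), the following minimal pure-Python sketch runs one cascaded group attention pass over a tiny feature matrix. It is not the authors' implementation: the helper names (`matmul`, `softmax_rows`, `attention`, `cascaded_group_attention`) are hypothetical, the projection weights are random stand-ins for learned parameters, and each head is assumed to project its split to the same width \(C/h\).

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def softmax_rows(m):
    """Row-wise softmax, subtracting the row maximum for numerical stability."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention, as in Eq. (1)."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / scale for s in row] for row in matmul(q, k_t)]
    return matmul(softmax_rows(scores), v)


def random_matrix(rows, cols, rng):
    """Random weights standing in for learned projections (assumption)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def cascaded_group_attention(x, heads, rng):
    """x: tokens x channels. Returns a tokens x channels matrix."""
    n_tokens, channels = len(x), len(x[0])
    if channels % heads:
        raise ValueError("channels must be divisible by heads")
    d = channels // heads
    # Split the channels into h groups: X_i = [X_i1, ..., X_ih].
    splits = [[row[j * d:(j + 1) * d] for row in x] for j in range(heads)]
    outputs, prev = [], None
    for j in range(heads):
        head_in = splits[j]
        if prev is not None:
            # Eq. (3): add the previous head's output to this head's input.
            head_in = [[a + b for a, b in zip(r1, r2)]
                       for r1, r2 in zip(head_in, prev)]
        wq, wk, wv = (random_matrix(d, d, rng) for _ in range(3))
        prev = attention(head_in, wq, wk, wv)
        outputs.append(prev)
    # Eq. (2): concatenate the head outputs and project back to C channels.
    concat = [sum((out[t] for out in outputs), []) for t in range(n_tokens)]
    return matmul(concat, random_matrix(channels, channels, rng))
```

Each head here owns three \(d\times d\) projections with \(d=C/h\), so the QKV layers hold \(3h(C/h)^{2}=3C^{2}/h\) weights in total, compared with \(3C^{2}\) when every head projects the full \(C\)-channel input.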
In addition, cascading attention heads can increase the network depth, thereby further enhancing the model capacity without introducing any additional parameters.

Feature Extraction Based on CNN Structure:
Existing research indicates that low-level semantic feature layers contain more fine-grained local information, which is crucial and sensitive for the detection of small targets [26]. CNN convolution and pooling operations help extract local information such as textures and shapes, making it easier to capture local features and details in the images. This makes CNNs more suitable for extracting and encoding detailed features from low-level semantic feature layers [27–28]. Therefore, in the design of the hybrid encoder for the MS-DETR model, we used a CNN structure in the feature extraction module to extract detailed information about weeds from the low-level semantic feature layer. When using the CNN to extract low-level details, appropriately expanding its receptive field enables it to capture richer features of the target and the surrounding background, thereby improving the quality of small target detection [29–30]. Compared with regular convolution, dilated convolution enlarges the receptive field and obtains broader and richer features, which is crucial for detecting small targets of different scales [31–32]. Therefore, we employed dilated convolution for feature extraction on the low-level semantic feature layer, as illustrated in Fig. 5. To capture multi-scale features within different receptive fields, we employed dilated convolutions with dilation factors of 6, 12, and 18 to extract low-level semantic information from the multi-scale feature layers. Here, the kernel size of the dilated convolution is 3×3, and each dilated convolution, together with Batch Normalization and a ReLU activation function, forms a separate branch.
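To make the receptive-field effect of these dilation factors concrete, the short sketch below applies the standard relationship that a \(k\times k\) kernel with dilation factor \(d\) spans \(k+(k-1)(d-1)\) input positions per side. The function names are hypothetical, and the padding helper assumes stride 1 and an output of the same spatial size as the input, which the text does not state.

```python
def effective_kernel_size(kernel_size, dilation):
    """Side length of the input window covered by one dilated kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def same_padding(kernel_size, dilation):
    """Padding per side that keeps the spatial size at stride 1 (assumption)."""
    return dilation * (kernel_size - 1) // 2


for d in (6, 12, 18):
    span = effective_kernel_size(3, d)
    print(f"dilation {d}: 3x3 kernel spans {span}x{span}, padding {same_padding(3, d)}")
```

Under these assumptions the three 3×3 branches cover windows of 13×13, 25×25, and 37×37 positions while keeping nine weights per channel pair.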
To alleviate potential issues of gradient vanishing or exploding during training, we introduced a residual structure, including a 1×1 convolution layer, in each branch. The output of each branch is obtained by summation with its residual path, and the branch outputs are then concatenated. By applying a 1×1 convolution operation, we reduced the channel number from 240 to 80, obtaining a globally fused feature representation that incorporates multi-scale contextual information. This helps in capturing subtle features of barnyard grass in the rice field scene.

Efficient and Parallelizable PConv as an Alternative to Conventional Convolution:
The introduction of multi-scale feature layers increases the computational load of the model and slows down its inference speed. Current research indicates that frequent memory access by operators is the primary cause of low FLOPS (floating-point operations per second). To enhance the inference speed of the model as much as possible, we replaced the conventional convolutions in the model with PConv, which simultaneously reduces memory access time and computational redundancy. PConv applies a regular convolution to only the first or last \(c_{p}\) consecutive channels, which keeps memory access contiguous, and treats them as representative of the entire feature map, while the remaining channels are left unchanged [33]. As a result, for a feature map of height \(h\) and width \(w\) and a kernel size of \(k\), the FLOPs of PConv are only:

$$h\times w\times k^{2}\times c_{p}^{2} \tag{4}$$

A regular convolution over all \(c\) channels costs \(h\times w\times k^{2}\times c^{2}\) FLOPs, so with a typical partial ratio \(r=\frac{c_{p}}{c}=\frac{1}{4}\), the FLOPs of a PConv are only \(r^{2}=\frac{1}{16}\) of those of a regular Conv. PConv also requires less memory access, i.e.,

$$h\times w\times 2c_{p}+k^{2}\times c_{p}^{2}\approx h\times w\times 2c_{p} \tag{5}$$

which is only \(\frac{1}{4}\) of a regular Conv for \(r=\frac{1}{4}\).

3.2: Model Training and Evaluation Metrics
Parameter Configuration:
To ensure the fairness of the experiments, identical initial training parameters are set for each group.
Taking into account physical memory constraints and learning efficiency, the number of training images per batch is set to 4, and the maximum iteration count is set to 500. During training, the model employs the Stochastic Gradient Descent (SGD) optimizer, and the learning rate (\(lr\)) decay strategy can be described as follows:

$$lr=base\_lr\cdot\left(1-\frac{iter\_num}{max\_iterations}\right)^{p} \tag{6}$$

Here, \(base\_lr\) represents the base learning rate, \(max\_iterations\) is the maximum iteration count, \(iter\_num\) is the iteration index, and \(p\) is the polynomial decay exponent. In this study, the base learning rate is set to 0.001, momentum is set to 0.9, weight decay is set to 1e-4, and the lower limit for learning rate updates is 0. These settings are consistently applied across all model training sessions.

This study employs the Cross-Entropy Loss function to quantify the distance between the predicted probability distribution of pixel categories and the true label category probability distribution during the training process. The specific calculation method is as follows:

$$Loss=-\frac{1}{M}\sum_{i=1}^{M}\sum_{c=1}^{N}h\left(b_{i}\right)\log\left(p_{ic}\right) \tag{7}$$

In the formula, \(M\) represents the number of pixels; \(N\) represents the number of categories; \(i\) represents the current pixel; \(c\) represents the current category; \(b_{i}\) is the true label category for pixel \(i\); \(h\) is an indicator function that equals 1 if \(b_{i}=c\) and 0 otherwise; \(p_{ic}\) is the predicted probability of pixel \(i\) belonging to category \(c\), obtained by applying the Sigmoid function to the predicted category scores. The loss value computed at each iteration is used to evaluate the model's training performance.
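As a minimal sketch of Eqs. (6) and (7), the functions below compute the polynomial learning-rate decay and the mean cross-entropy loss in plain Python. The function names are hypothetical, and the default exponent `power=0.9` is an assumption for illustration only, since the text does not state the value of \(p\).

```python
import math


def poly_lr(base_lr, iter_num, max_iterations, power=0.9, min_lr=0.0):
    """Eq. (6): polynomial decay, clamped below at min_lr.

    power=0.9 is an assumed default; the value of p is not given in the text.
    """
    frac = max(0.0, 1.0 - iter_num / max_iterations)
    return max(min_lr, base_lr * frac ** power)


def cross_entropy(probs, labels):
    """Eq. (7): mean negative log-probability of the true category.

    probs[i][c] is the predicted probability p_ic and labels[i] is b_i.
    The indicator h keeps only the term with c == b_i in the inner sum.
    """
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for p, b in zip(probs, labels):
        total += -math.log(max(p[b], eps))
    return total / len(labels)


# Learning rate at a few iterations with the settings reported in the text.
for it in (0, 250, 500):
    print(it, poly_lr(base_lr=0.001, iter_num=it, max_iterations=500))
```

With these settings the schedule starts at the base learning rate of 0.001 and reaches the lower limit of 0 at the final iteration.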
The weights are adjusted through backpropagation to gradually reduce the error represented by the loss value, aiming to achieve the training objectives.

Evaluation Metrics:
To quantitatively analyze the model's performance, this study employs Average Precision (AP), precision, and recall to assess the effectiveness of the proposed MS-DETR model. For precision and recall, there are four possible states after a test sample is predicted: true positive (TP), false positive (FP), true negative (TN), and false negative (FN). The definitions are as follows:

$$precision=\frac{TP}{TP+FP} \tag{8}$$

$$recall=\frac{TP}{TP+FN} \tag{9}$$

The recall rate and precision rate are based on the threshold value of 0.5.

Experimental Platform Configuration:
Table 2 Experimental Environment
Category | Item | Value
Operating system | | Windows 10
Hardware environment | CPU | Intel(R) Core(TM) i7-9700 @3.0GHz
Hardware environment | Hard drive capacity | 64G
Hardware environment | GPU | NVIDIA GeForce RTX 5000
Software environment | Python | 3.7
Software environment | cuDNN | 8.5.0
Software environment | CUDA | 11.7

4. Experimental Results
In this section, we conducted multiple experiments to validate the performance and reliability of the proposed MS-DETR model in rice field weed detection, and we analyze and discuss the experimental results.

4.1 Visualization
To visually demonstrate the effectiveness of the proposed approach in improving the recognition performance of rice field weed images, this study introduced three successive improvements based on the original DETR model. The first improvement replaced the multi-head attention mechanism in DETR with Cascaded Group Attention (CGA), resulting in DETR-CGA. The second improvement added multi-scale feature layers to DETR-CGA and used CGA and CNN to extract high-level and low-level semantic features separately, yielding DETR-CGA + CNN. Finally, the MS-DETR model was obtained by replacing conventional Conv with PConv on the basis of DETR-CGA + CNN.
Subsequently, the attention maps for weed feature extraction were compared using the Grad-CAM visualization technique between the original DETR, MS-DETR, and the two intermediate variants. All attention maps are from the last encoding layer of the model's encoder. The results are shown in Fig. 6.

Figure 6: Visualization of Target Heatmaps under Different Improvement Methods

Comparing Fig. 6c and 6b, it is evident that the DETR-CGA model, incorporating the Cascaded Group Attention module, enhances attention to key feature regions when recognizing single barnyard grass plants and field ridges compared to the original DETR model. Although it expanded the attention scope on the features of contiguous weeds, the DETR-CGA model compensates for the missed detections present in the original DETR model, as illustrated by the red boxes in the figure. Comparing Fig. 6e and 6d, it is evident that the MS-DETR model, utilizing PConv, focuses its attention more strongly on the main feature regions of all target categories than the DETR-CGA + CNN model with conventional convolutions. The innovation of the MS-DETR model lies in the effective fusion of global and local features. As depicted in Fig. 6e, when detecting single barnyard grass plants and continuous patches of barnyard grass, the MS-DETR model primarily focuses on their growth positions between field ridges. The growth position of barnyard grass between field ridges is a typical local feature distinguishing barnyard grass from rice. When identifying field ridge categories, the MS-DETR model emphasizes both the boundary parts of the field ridge and the presence of weeds on the ridge, ensuring comprehensive attention to both global and local features.
This indicates that the MS-DETR model, through the effective fusion of global and local features in the image, enhances the recognition of typical target features, thereby improving the model's detection performance for rice field weeds.

4.2 Sensitivity Analysis
To verify the contribution of each proposed improvement to the model's performance, this study conducted ablation experiments on a self-constructed rice field weed dataset. Starting from the original DETR base model, the improvement modules were progressively incorporated to create multiple model variants, and each variant was evaluated using the mAP50 metric. This provides a quantitative assessment of the impact of each improvement on the model's performance in rice field weed detection tasks. The results are presented in Table 3.

Table 3 Recognition Results of Different Improvement Methods for Rice Field Weeds
NO. | Model | All | Single barnyard grass plant | Continuous patches of barnyard grass | Field ridge
1 | DETR | 0.764 | 0.647 | 0.77 | 0.875
2 | DETR-CGA | 0.772 | 0.687 | 0.81 | 0.818
3 | DETR-CGA + CNN | 0.784 | 0.73 | 0.782 | 0.839
4 | MS-DETR | 0.792 | 0.686 | 0.816 | 0.873

1 → 2: In terms of overall accuracy, the DETR-CGA model slightly improves mAP50 over the original DETR model, from 0.764 to 0.772. By category, the DETR-CGA model improves recognition accuracy by 4 percentage points for both single barnyard grass plants (0.647 to 0.687) and continuous patches of barnyard grass (0.77 to 0.81). This indicates that the CGA module enhances the model's ability to extract complex features, effectively improving the recognition accuracy of complex targets such as barnyard grass. However, we also observed that recognition accuracy for the relatively regular and simple field ridge targets decreased from 0.875 to 0.818, a relative decrease of 6.5%.
The reason might be that the attention heads of the CGA module concentrate on capturing complex semantic information, so simple low-level visual features are under-represented and simple targets receive less support.

2 → 3: The DETR-CGA + CNN model builds on the DETR-CGA model by introducing a multi-scale feature extraction module and fusing the semantic information extracted by the Transformer and CNN structures. Its overall mAP50 improves from 0.772 to 0.784, which shows that fusing global and local features benefits target detection. Recognition accuracy improves for both the single barnyard grass plant category (0.687 to 0.73) and the field ridge category (0.818 to 0.839), with the larger gain on single barnyard grass plants. This suggests that the multi-scale feature extraction module improves recognition of small target categories to some extent.

3 → 4: The MS-DETR model builds on the DETR-CGA + CNN model by replacing the conventional convolutions with PConv. This raises the overall mAP50 from 0.784 to 0.792. For large-area targets, namely continuous patches of barnyard grass and field ridges, recognition accuracy increases by 0.034 (3.4 percentage points) in both cases. For small-area targets such as single barnyard grass plants, however, recognition accuracy decreases by 0.044 (4.4 percentage points). This suggests that the PConv structure may be better suited to extracting features of large-area targets and more limited for small-area targets.

4.3 Analysis of Other Metrics

To provide a more comprehensive analysis of the impact of the proposed improvements on model performance, several additional metrics were analyzed.
Figure 7 illustrates the trend of the loss values during training. As can be seen from Fig. 7, MS-DETR and DETR-CGA + CNN reached the lowest loss value of 0.13. However, DETR-CGA + CNN converged relatively slowly and fluctuated strongly during training, whereas the MS-DETR model achieved the best result in both loss value and convergence speed.

Figure 8 depicts the precision-recall curves for models using different improvement methods. The precision-recall curve and the area under it (AUC) are commonly used metrics for evaluating detection models. The curve shows how precision and recall trade off under different confidence thresholds, and the AUC summarizes this trade-off in a single value. The larger the AUC, the better the overall performance of the model. As shown in Fig. 8, the curve of the MS-DETR model is closest to the top right corner, meaning that the MS-DETR model performs best, with an AUC of 0.79, outperforming the other compared models. This indicates that the MS-DETR model has higher average recognition accuracy and better overall performance.

4.4 Validation of Enhanced Small Object Recognition Capability

The experiments in this section verify the effect of the proposed model on small target detection. Model performance is evaluated by the mean Average Precision (AP) and mean Average Recall (AR) in different size ranges, where higher AP and AR values indicate better detection of targets within the corresponding size range. The AP and AR values in Table 4 are obtained at IoU = 0.50:0.95. The subscripts are defined as follows: S denotes small targets (area ≤ 32²), M denotes medium targets (32² < area ≤ 96²), and L denotes large targets (area > 96²), where area is the number of pixels [34].
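The size ranges used for the S, M and L subscripts follow the COCO convention, with thresholds at 32² and 96² pixels. A minimal sketch of that bucketing is given below, assuming box width and height are measured in pixels; the function name `size_bucket` and the constants are illustrative and not taken from the paper.

```python
# Area thresholds of the COCO size convention, in pixels.
SMALL_MAX_AREA = 32 ** 2    # 1024
MEDIUM_MAX_AREA = 96 ** 2   # 9216


def size_bucket(width: float, height: float) -> str:
    """Return 'S', 'M' or 'L' for a box of the given width and height in pixels."""
    area = width * height
    if area <= SMALL_MAX_AREA:
        return "S"
    if area <= MEDIUM_MAX_AREA:
        return "M"
    return "L"
```

Under this rule a 30×30 pixel detection box falls in the small bucket, which is the range the AP-s and AR-s rows report on.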
Table 4 Recognition results of models with different improvement methods for different sizes of rice weeds

Metric  MS-DETR  DETR-CGA + CNN  DETR-CGA  DETR
AP-s    0.111    0.073           0.103     0.058
AP-m    0.153    0.139           0.136     0.118
AP-l    0.635    0.624           0.617     0.617
AR-s    0.266    0.443           0.364     0.191
AR-m    0.397    0.318           0.396     0.402
AR-l    0.807    0.798           0.814     0.788

The experimental results show that, compared with the original DETR model, the proposed MS-DETR model improves detection performance on small, medium and large targets on all metrics except medium-target AR. The gain on small targets is the largest: AP and AR improve by 91% and 39% respectively, and the small-target AP of 0.111 is the highest among the compared variants, although DETR-CGA + CNN and DETR-CGA reach higher small-target AR (0.443 and 0.364). Medium and large targets also improve, with the AP of medium targets increasing by 29% and the AP and AR of large targets improving by 2.9% and 2.4% respectively. The AR of medium targets dropped slightly, by 1.2%. The reason may be that, because small weeds are more densely distributed in the scenes, optimizing the model for small target detection drew attention away from medium targets and cost some recall on medium-sized weeds. Considering the small number and relatively easy detection of medium-sized weeds, this loss is acceptable.

4.5 Feasibility Analysis of Agricultural Production

To assess the computational complexity of the proposed method, we ran tests on the collected rice weed dataset. To eliminate other influencing factors, all comparisons were performed in the same experimental environment: model parameters and GFLOPs were computed on a single NVIDIA RTX5000 GPU for an input size of 640×640 pixels, and inference time was calculated as the average over 100 runs on 640×640 pixel test images.
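The timing protocol described here (mean latency over 100 runs, with FPS derived from it) can be sketched as follows. This is only an illustration under stated assumptions: `benchmark` and `fps_from_latency` are hypothetical helper names, the workload is passed in as a plain callable, and the sketch times on the CPU clock only. On a GPU, the device would additionally have to be synchronized before each clock reading, which is omitted here.

```python
import statistics
import time
from typing import Callable, Tuple


def benchmark(run_once: Callable[[], None], runs: int = 100) -> Tuple[float, float]:
    """Return (mean, standard deviation) of wall-clock latency in seconds."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        run_once()  # one forward pass on one image in the real setting
        latencies.append(time.perf_counter() - start)
    return statistics.mean(latencies), statistics.stdev(latencies)


def fps_from_latency(mean_latency_s: float) -> float:
    """Frames per second implied by a mean per-image latency."""
    return 1.0 / mean_latency_s
```

The FPS row of Table 5 is consistent with this relation: a mean latency of 0.00750 s corresponds to about 133.3 frames per second.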
The experimental results are presented in Table 5.

Table 5 Performance parameters of models with different improvement methods

Model       DETR                 DETR-CGA             DETR-CGA + CNN       MS-DETR
Parameters  38.6MB               38.3MB               42.5MB               40.8MB
Latency     0.00750s ± 0.00145s  0.00705s ± 0.00095s  0.00829s ± 0.01191s  0.00818s ± 0.00105s
FPS         133.3                141.8                120.6                122.2
mAP50       0.764                0.772                0.784                0.792

Compared with the original DETR model, the DETR-CGA model with the efficient Cascaded Group Attention module reduced the model size by 0.3MB while improving accuracy by 0.008. This indicates that feeding a different channel subset of the features to each head reduces model parameters while allowing each head to learn more distinct features, which improves recognition accuracy for rice field weeds. The DETR-CGA + CNN model then introduced multi-scale feature layers, whose additional parameters increased the model size by about 11% (from 38.3MB to 42.5MB); model accuracy increased by 0.012. On this basis, the MS-DETR model adopted the efficient PConv in place of conventional convolutions. With the model structure otherwise unchanged, the model size decreased by 4%, FPS increased by 1.6, and accuracy improved by a further 0.008. Overall, compared with the original DETR model, our model has no significant advantage in number of parameters or inference time. It completes weed recognition at 0.00818 seconds per image. Although it is not the fastest in inference, it achieves the best recognition results. Our model strikes a good balance between recognition performance and computational efficiency, making it suitable for deployment on intelligent devices with limited computing power.
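The step-by-step changes between model variants can be recomputed directly from Table 5. The sketch below copies the table values into a dictionary and reports the change in model size, FPS and mAP50 for any pair of variants. The names `TABLE5` and `step_change` are illustrative assumptions, and the size change is expressed relative to the earlier variant.

```python
# Values copied from Table 5 (model size in MB, frames per second, mAP50).
TABLE5 = {
    "DETR":           {"size_mb": 38.6, "fps": 133.3, "map50": 0.764},
    "DETR-CGA":       {"size_mb": 38.3, "fps": 141.8, "map50": 0.772},
    "DETR-CGA + CNN": {"size_mb": 42.5, "fps": 120.6, "map50": 0.784},
    "MS-DETR":        {"size_mb": 40.8, "fps": 122.2, "map50": 0.792},
}


def step_change(before: str, after: str) -> dict:
    """Change in size (percent of `before`), FPS and mAP50 from one variant to another."""
    a, b = TABLE5[before], TABLE5[after]
    return {
        "size_pct": round(100 * (b["size_mb"] - a["size_mb"]) / a["size_mb"], 1),
        "fps_delta": round(b["fps"] - a["fps"], 1),
        "map50_delta": round(b["map50"] - a["map50"], 3),
    }
```

Calling `step_change("DETR-CGA + CNN", "MS-DETR")` gives the size, FPS and mAP50 changes attributable to replacing conventional convolutions with PConv, and the other steps of the ablation can be checked the same way.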
4.6 Comparison with Other Classic Algorithms

To comprehensively evaluate the performance of the model on the rice weed detection task, we conducted comparative experiments on the rice weed dataset, comparing the MS-DETR model with other classic DETR variants, including Deformable DETR [14], Anchor DETR [35], and DAB-DETR [36]. The experimental results are presented in Table 6 and Fig. 9.

Table 6 Detection performance of different DETR variant models

Model       MS-DETR  Deformable DETR  Anchor DETR  DAB-DETR
mAP50       0.792    0.775            0.755        0.773
Parameters  40.8M    41M              36.8M        44M
GFLOPs      187G     86G              151G         94G

Figure 9: Recognition results of different models on rice weeds. (a) Single barnyard grass plant; (b) Field ridge; (c) Continuous patches of barnyard grass.

Deformable DETR was the first to introduce multi-scale features on top of DETR, which effectively improved detection performance, and Anchor DETR and DAB-DETR build on the Deformable DETR model, so we chose these models for comparison. As shown in Table 6, among the DETR variants, MS-DETR achieved the highest mAP50 value of 0.792, the best recognition performance. In terms of the number of parameters, MS-DETR used 40.8M parameters, more than only the smallest model, Anchor DETR (36.8M). Given that MS-DETR also has the highest recognition accuracy, its parameter utilization efficiency is high. However, the computational complexity of MS-DETR reached 187 GFLOPs, the largest among the compared models. Taking both recognition accuracy and parameter utilization efficiency into account, MS-DETR achieved the best balance between the two, obtaining the highest recognition performance while keeping the number of parameters and the computational complexity within a reasonable range.

As shown in Fig. 9, the proposed MS-DETR model performs best in recognizing the smaller single barnyard grass plant targets, accurately identifying all weed instances and field ridges in the image, whereas the Deformable DETR and DAB-DETR models failed to detect the field ridge in the bottom right corner (blue box in Fig. 9a) and missed some weeds (yellow box in Fig. 9a). There are two possible reasons. First, Deformable DETR does not distinguish between feature layers at different scales: the independent Deformable Attention modules on low semantic feature layers cannot capture detailed features as effectively as CNNs, so they do not fully exploit the key localization information that these layers provide for small targets. Second, a multi-scale feature extraction and fusion process based on simple "stacking-summing" is too simple to model the rich interactions between features, which limits how well the model can represent and integrate multi-scale information. Although the Anchor DETR model detected the field ridges, it also missed some weed targets (yellow box in Fig. 9a). For the larger field ridge targets, all models perform relatively well. However, the Anchor DETR model also detected barnyard grass growing on the field ridges. Such plants were not annotated during data annotation, so the training data contained no samples of barnyard grass on field ridges, and these detections count as false positives. For continuous patches of barnyard grass, Anchor DETR failed to detect the patch in the bottom left corner (black box in Fig. 9c), while the other models detected the areas of continuous weed patches for the most part, with some differences in the positioning of the detection boxes.
The MS-DETR model left a small unlabeled area when recognizing continuous patches of barnyard grass, while the detection boxes of the Deformable DETR and DAB-DETR models overlap to some extent; the two boxes produced by Deformable DETR have the largest overlap. Possible reasons for Anchor DETR missing a patch of weeds (black box in Fig. 9c) are: (1) The concept of "dense weeds" is relatively subjective, and different people have different understandings of and criteria for weed density. Even for the same person, the understanding of "dense" may change when annotating data at different times, resulting in inconsistent labels in the training data. (2) The current training data volume is relatively small, and the samples do not cover the range of weed density scenarios comprehensively. This limits the model's ability to learn the concept of "dense weeds".

5. Discussion

Because barnyard grass is morphologically very similar to rice plants and appears as small objects in UAV remote sensing imagery, accurately identifying barnyard grass in rice fields from UAV remote sensing is challenging. To address this problem, this study proposes targeted improvements and develops a rice field barnyard grass object detection model that balances detection performance and efficiency for barnyard grass detection in complex real-world scenarios.

To improve the recognition accuracy of barnyard grass in remote sensing imagery, we proposed the MS-DETR model, which introduces multi-scale feature layers on the basis of DETR and uses a different design for each kind of feature layer. The high-level semantic feature layer adopts a Transformer structure to emphasize the extraction of contextual relationships between barnyard grass and rice plants. The low-level semantic feature layer uses a CNN structure to extract the detail features of barnyard grass.
This is because high-level semantic feature layers usually contain more abstract, semantic information. The self-attention mechanism in Transformers allows each input position to attend to all other positions, whereas CNNs are limited by fixed window sizes. This all-to-all mechanism lets the model relate any two pixels in the image and thereby extract global feature information more effectively. Low-level semantic feature layers usually contain more detailed information. In a CNN, sliding the convolution kernel over the feature layer and multiplying element by element is essentially a weighted aggregation of neighboring features, which effectively captures local features in the feature layer.

When using the Transformer structure to extract context information about rice field weeds, we introduced the Cascaded Group Attention module to replace the traditional multi-head attention mechanism. Because the Cascaded Group Attention module splits the input features into multiple channel subsets and feeds each subset to a different self-attention head, it avoids repeated encoding of the same information by different heads and reduces computational redundancy. Meanwhile, having each head extract features from its own channel subset helps the model learn more diverse representations of the input features. Experimental results show that this improvement increased the detection accuracy (mAP50) by 1%, reduced the model size from 38.6M to 38.3M, and shortened the inference time from 0.0075 seconds to 0.00705 seconds.

When using the CNN to extract barnyard grass detail features, we apply atrous convolutions with different dilation rates to the same semantic feature layer to observe it at multiple scales, which enables the model to capture small barnyard grass features.
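To make the effect of the dilation rate concrete, here is a minimal pure-Python sketch of a 1-D atrous convolution applied at several dilation rates to the same input. It is an illustration only: the dilation rates (1, 2, 3), the kernel, and the function names are assumptions, since this section does not state the rates or layer configuration actually used in MS-DETR.

```python
from typing import Dict, List, Sequence


def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Span covered by a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def atrous_conv1d(signal: Sequence[float], kernel: Sequence[float], dilation: int) -> List[float]:
    """'Valid' 1-D cross-correlation whose taps are spaced `dilation` samples apart."""
    span = effective_kernel_size(len(kernel), dilation)
    out = []
    for start in range(len(signal) - span + 1):
        out.append(sum(w * signal[start + i * dilation] for i, w in enumerate(kernel)))
    return out


def multi_branch(signal: Sequence[float], kernel: Sequence[float],
                 dilations: Sequence[int] = (1, 2, 3)) -> Dict[int, List[float]]:
    """Apply the same kernel at several dilation rates, one output per branch."""
    return {d: atrous_conv1d(signal, kernel, d) for d in dilations}
```

With a kernel of size 3, the branches cover spans of 3, 5 and 7 samples while using the same three weights each, which is why a larger dilation rate enlarges the receptive field without adding parameters per branch. The extra cost comes from running several branches side by side.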
Experimental results show that applying atrous convolutions in this way increased the detection accuracy (mAP50) by 1.6%. This is mainly attributed to the dilation rates, which enlarge the receptive field of the convolution kernels so that they capture richer features of barnyard grass objects and the surrounding background. However, this multi-branch structure increases the computational burden and slows inference: the model size increased from 38.3M to 42.5M, and the detection time increased from 0.00705 seconds to 0.00829 seconds.

To maximize the model's inference speed, we made extensive use of the efficient, parallelizable PConv in place of conventional convolutions. PConv treats the first or last consecutive subset of channels of the feature map as representative of the entire feature map, applies a regular convolution to it for spatial feature extraction, and leaves the remaining channels unchanged. Focusing only on these key channels significantly improves computational efficiency and reduces channel redundancy. Experimental results show that the PConv modules not only reduced the model size from 42.5M to 40.8M but also reduced the average inference time by 1.3%. More importantly, the detection accuracy (mAP50) also increased from 0.784 to 0.792.

Although the MS-DETR model performs well on our self-built rice field weed dataset, many factors were not evaluated in this study. First, our training set was collected from a single experimental field, without considering the effects of different farm management measures on dominant weed species. Second, changes in lighting conditions may affect image features, and the current dataset does not cover variations under different weather conditions. These two limitations may affect the model's generalization ability in other environments.
To mitigate these limitations, future research will collect rice field weed datasets across more regions and time spans, including samples under varying lighting conditions and with different weed species, so as to expand the applicability of the MS-DETR model.

6. Conclusion

This study proposes a rice field weed detection method for UAV remote sensing. The main conclusions are as follows:

(1) Introducing multi-scale feature layers into the DETR model and designing them differently effectively improves detection performance, especially for small targets. Compared with the original DETR model, the overall detection accuracy of the proposed MS-DETR model is improved by 3.6%, and the detection accuracy for small targets is increased by 91%.

(2) Replacing the traditional multi-head attention mechanism in the DETR model with the Cascaded Group Attention module reduces model computation while improving detection accuracy. The model size is reduced by 0.3M and the overall detection accuracy is improved by 1%.

(3) Using PConv extensively in the model decreases model computation and improves inference speed. The inference speed is increased by 1.3% and the model size is reduced by 1.7M.
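As a rough illustration of why PConv reduces model size and computation, the sketch below compares the parameter and multiply-accumulate counts of a regular convolution with those of a partial convolution that convolves only a subset of channels and passes the rest through unchanged. The partial ratio of 1/4, the 3×3 kernel, the 80×80 feature map, the 256 channels, and all function names are assumptions chosen for illustration and are not taken from the paper's configuration.

```python
from typing import Callable, List, Tuple


def conv_cost(h: int, w: int, k: int, c_in: int, c_out: int) -> Tuple[int, int]:
    """(parameters, multiply-accumulates) of a k x k convolution.

    Assumes no bias, stride 1, and an output map of the same h x w size.
    """
    params = k * k * c_in * c_out
    macs = h * w * params
    return params, macs


def pconv_cost(h: int, w: int, k: int, channels: int, ratio: float = 0.25) -> Tuple[int, int]:
    """Cost of a partial convolution that convolves only `ratio` of the channels."""
    c_p = int(channels * ratio)
    return conv_cost(h, w, k, c_p, c_p)


def pconv_apply(channels: List, conv: Callable[[List], List], ratio: float = 0.25) -> List:
    """Convolve the first c_p channels and pass the remaining channels through unchanged."""
    c_p = int(len(channels) * ratio)
    return conv(channels[:c_p]) + channels[c_p:]


full_params, full_macs = conv_cost(80, 80, 3, 256, 256)
part_params, part_macs = pconv_cost(80, 80, 3, 256)
```

Under these assumptions the partial convolution needs one sixteenth of the parameters and multiply-accumulates of the regular convolution on the same feature map, because both the input and output channel counts of the convolved part shrink by a factor of four.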
Declarations

Funding

Liaoning Province Applied Basic Research Program Project (2023JH2/101300120), Liaoning Province's "Xingliao Talent Plan" project (XLYC2203005), and Open Project of the South China Tropical Smart Agriculture Technology Key Laboratory of the Ministry of Agriculture and Rural Affairs (HNZHNY-KFKT-202208).

Authors and Affiliations

Zhonghui Guo, Dongdong Cai, Yunyi Zhou, Tongyu Xu: School of Information and Electrical Engineering, Shenyang Agricultural University; National Digital Agriculture Regional Innovation Center (Northeast); Key Laboratory of Smart Agriculture Technology in Liaoning Province

Fenghua Yu: School of Information and Electrical Engineering, Shenyang Agricultural University; National Digital Agriculture Regional Innovation Center (Northeast); Key Laboratory of Smart Agriculture Technology in Liaoning Province; Key Laboratory of Smart Agriculture in the South China Tropical Region, Ministry of Agriculture and Rural Affairs

Corresponding author

Correspondence to Fenghua Yu, E-mail: [email protected]

Conflict of interest

None

References

1. Ghosh, D.; Brahmachari, K.; Skalicky, M.; Roy, D.; Das, A.; Sarkar, S.; Moulick, D.; Brestič, M.; Hejnak, V.; Vachova, P.; et al. The combination of organic and inorganic fertilizers influence the weed growth, productivity and soil fertility of monsoon rice. PLoS ONE 2022, 17, e0262586.
2. Rosle, R.; Che'Ya, N.N.; Ang, Y.; Rahmat, F.; Wayayok, A.; Berahim, Z.; Fazlil Ilahi, W.F.; Ismail, M.R.; Omar, M.H. Weed detection in rice fields using remote sensing technique: A review. Applied Sciences 2021, 11, 10701.
3. Meshram, A.T.; Vanalkar, A.V.; Kalambe, K.B.; Badar, A.M. Pesticide spraying robot for precision agriculture: A categorical literature review and future trends. Journal of Field Robotics 2022, 39, 153–171.
4. Talaviya, T.; Shah, D.; Patel, N.; Yagnik, H.; Shah, M. Implementation of artificial intelligence in agriculture for optimisation of irrigation and application of pesticides and herbicides. Artificial Intelligence in Agriculture 2020, 4, 58–73.
5. Roslim, M.H.M.; Juraimi, A.S.; Che'Ya, N.N.; Sulaiman, N.; Manaf, M.N.H.A.; Ramli, Z.; Motmainna, M. Using remote sensing and an unmanned aerial system for weed management in agricultural crops: A review. Agronomy 2021, 11, 1809.
6. Rahaman, F.; Juraimi, A.S.; Rafii, M.Y.; Uddin, M.K.; Hassan, L.; Chowdhury, A.K.; Bashar, H.M.K. Allelopathic effect of selected rice (Oryza sativa) varieties against barnyard grass (Echinochloa cruss-gulli). Plants 2021, 10, 2017.
7. Singh, V.; Rana, A.; Bishop, M.; Filippi, A.M.; Cope, D.; Rajan, N.; Bagavathiannan, M. Unmanned aircraft systems for precision weed detection and management: Prospects and challenges. Advances in Agronomy 2020, 159, 93–134.
8. Zhang, Y.; Wang, M.; Zhao, D.; Liu, C.; Liu, Z. Early weed identification based on deep learning: A review. Smart Agricultural Technology 2023, 3, 100123.
9. Al-Badri, A.H.; Ismail, N.A.; Al-Dulaimi, K.; Salman, G.A.; Khan, A.R.; Al-Sabaawi, A.; Salam, M.S.H. Classification of weed using machine learning techniques: a review—challenges, current and future potential techniques. Journal of Plant Diseases and Protection 2022, 129, 745–768.
10. Wang, W.; Lai, Q.; Fu, H.; Shen, J.; Ling, H.; Yang, R. Salient object detection in the deep learning era: An in-depth survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 2021, 44, 3239–3259.
11. Huang, H.; Lan, Y.; Yang, A.; Zhang, Y.; Wen, S.; Deng, J. Deep learning versus Object-based Image Analysis (OBIA) in weed mapping of UAV imagery. International Journal of Remote Sensing 2020, 41, 3446–3479.
12. Zhang, X.; Cui, J.; Liu, H.; Han, Y.; Ai, H.; Dong, C.; Zhang, J.; Chu, Y. Weed Identification in Soybean Seedling Stage Based on Optimized Faster R-CNN Algorithm. Agriculture 2023, 13, 175.
13. Gallo, I.; Rehman, A.U.; Dehkordi, R.H.; Landro, N.; La Grassa, R.; Boschetti, M. Deep object detection of crop weeds: Performance of YOLOv7 on a real case dataset from UAV images. Remote Sensing 2023, 15, 539.
14. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159 2020.
15. Lv, W.; Xu, S.; Zhao, Y.; Wang, G.; Wei, J.; Cui, C.; Du, Y.; Dang, Q.; Liu, Y. DETRs beat YOLOs on real-time object detection. arXiv preprint arXiv:2304.08069 2023.
16. Meng, D.; Chen, X.; Fan, Z.; Zeng, G.; Li, H.; Yuan, Y.; Sun, L.; Wang, J. Conditional DETR for fast training convergence. In Proceedings of the IEEE/CVF International Conference on Computer Vision; 2021; pp. 3651–3660.
17. Li, F.; Zhang, H.; Liu, S.; Guo, J.; Ni, L.M.; Zhang, L. DN-DETR: Accelerate DETR training by introducing query denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2022; pp. 13619–13627.
18. Ning, X.; Tian, W.; Yu, L.; Li, W. Brain-inspired CIRA-DETR full inference model for small and occluded object detection. Chinese Journal of Computers 2022, 45.
19. Ke, X.; Cai, Y.; Chen, B.; Liu, H.; Guo, W. Granularity-aware distillation and structure modeling region proposal network for fine-grained image classification. Pattern Recognition 2023, 137, 109305.
20. Meng, H.; Tian, Y.; Ling, Y.; Li, T. Fine-grained ship recognition for complex background based on global to local and progressive learning. IEEE Geoscience and Remote Sensing Letters 2022, 19, 1–5.
21. Wang, Y.; Tian, Y.; Liu, J.; Xu, Y. Multi-Stage Multi-Scale Local Feature Fusion for Infrared Small Target Detection. Remote Sensing 2023, 15, 4506.
22. Yin, A.; Ren, C.; Yan, Z.; Xue, X.; Zhou, Y.; Liu, Y.; Lu, J.; Ding, C. C2S-RoadNet: road extraction model with depth-wise separable convolution and self-attention. Remote Sensing 2023, 15, 4531.
23. Ye, T.; Qin, W.; Zhao, Z.; Gao, X.; Deng, X.; Ouyang, Y. Real-Time Object Detection Network in UAV-Vision Based on CNN and Transformer. IEEE Transactions on Instrumentation and Measurement 2023, 72, 1–13.
24. Rekavandi, A.M.; Rashidi, S.; Boussaid, F.; Hoefs, S.; Akbas, E.; et al. Transformers in small object detection: A benchmark and survey of state-of-the-art. arXiv preprint arXiv:2309.04902 2023.
25. Liu, X.; Peng, H.; Zheng, N.; Yang, Y.; Hu, H.; Yuan, Y. EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2023; pp. 14420–14430.
26. Lei, T.; Xue, D.; Ning, H.; Yang, S.; Lv, Z.; Nandi, A.K. Local and global feature learning with kernel scale-adaptive attention network for VHR remote sensing change detection. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 2022, 15, 7308–7322.
27. Mumuni, A.; Mumuni, F. CNN architectures for geometric transformation-invariant feature representation in computer vision: a review. SN Computer Science 2021, 2, 1–23.
28. Wang, H.; Chen, X.; Zhang, T.; Xu, Z.; Li, J. CCTNet: Coupled CNN and transformer network for crop segmentation of remote sensing images. Remote Sensing 2022, 14, 1956.
29. Li, S.; Li, B.; Li, J.; Liu, B.; Li, X. Semantic Segmentation Algorithm of Rice Small Target Based on Deep Learning. Agriculture 2022, 12, 1232.
30. Qi, M.; Liu, L.; Zhuang, S.; Liu, Y.; Li, K.; Yang, Y.; Li, X. FTC-net: fusion of transformer and CNN features for infrared small target detection. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 2022, 15, 8613–8623.
31. Hou, J.; Zhou, H.; Yu, H.; Hu, H. HPAC: a forest tree species recognition network based on multi-scale spatial enhancement in remote sensing images. International Journal of Remote Sensing 2023, 44, 5960–5975.
32. Wang, X.; Lv, R.; Zhao, Y.; Yang, T.; Ruan, Q. Multi-scale context aggregation network with attention-guided for crowd counting. In Proceedings of the 2020 15th IEEE International Conference on Signal Processing (ICSP); IEEE, 2020; Vol. 1, pp. 240–245.
33. Chen, J.; Kao, S.-H.; He, H.; Zhuo, W.; Wen, S.; Lee, C.-H.; Chan, S.-H.G. Run, Don't Walk: Chasing Higher FLOPS for Faster Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2023; pp. 12021–12031.
34. Rostianingsih, S.; Setiawan, A.; Halim, C.I. COCO (creating common object in context) dataset for chemistry apparatus. Procedia Computer Science 2020, 171, 2445–2452.
35. Wang, Y.; Zhang, X.; Yang, T.; Sun, J. Anchor DETR: Query design for transformer-based object detection. arXiv preprint arXiv:2109.07107 2021, 3.
36. Liu, S.; Li, F.; Zhang, H.; Yang, X.; Qi, X.; Su, H.; Zhu, J.; Zhang, L. DAB-DETR: Dynamic anchor boxes are better queries for DETR. arXiv preprint arXiv:2201.12329 2022.

Additional Declarations

No competing interests reported.
Editorial decision: Revision requested 05 May, 2024
Reviews received at journal 22 Apr, 2024
Reviews received at journal 17 Apr, 2024
Reviews received at journal 09 Apr, 2024
Reviewers agreed at journal 25 Mar, 2024
Reviewers agreed at journal 25 Mar, 2024
Reviewers agreed at journal 22 Mar, 2024
Reviewers agreed at journal 21 Mar, 2024
Reviewers invited by journal 21 Mar, 2024
Editor assigned by journal 06 Mar, 2024
Submission checks completed at journal 06 Mar, 2024
First submitted to journal 03 Mar, 2024
reported.","formattedTitle":"Identifying Rice Field Weeds from Unmanned Aerial Vehicle Remote Sensing Imagery Using Deep Learning","fulltext":[{"header":"1. Introduction","content":"\u003cp\u003eDuring the growth period of rice, competition for soil nutrients and water between rice and weeds can lead to the loss of water and fertilizer resources. Additionally, the proliferation of weeds can contribute to the emergence and spread of diseases and pests. Weeds in rice fields have become a critical biological threat limiting rice yield and quality [\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e]. Therefore, effective weed control is a necessary step to achieve high and stable rice production.\u003c/p\u003e \u003cp\u003eThe characteristics of rice field environments, including soft and wet soil, low-lying terrain, and narrow spaces, impose certain limitations on traditional mechanical weed control methods. In this context, unmanned aerial vehicles (UAVs) have demonstrated unique applicability due to their flexible maneuverability. In recent years, with the improvement of payload capacity and performance of agricultural UAVs, aerial spraying has become the mainstream method for weed control in rice fields [\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eCurrently, there is a problem of indiscriminate spraying in weed control using agricultural UAVs in rice fields. The widespread spraying may not accurately target weed locations, leading to low pesticide utilization rates and potential negative environmental impacts [\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e]. 
Utilizing high-resolution rice field remote sensing images captured by UAVs for precise weed identification and generating variable-rate prescription maps can enable targeted pesticide application based on weed locations and quantities, addressing this issue effectively [\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eBarnyard grass is one of the most common weeds in rice fields, belonging to the same Poaceae family as rice. Both share a high degree of similarity in appearance and growth habits [\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e]. In rice field images obtained by UAVs, barnyard grass weeds often occupy only a few dozen pixels, representing typical small targets. The recognition process for such small targets is prone to false positives or negatives due to lighting conditions and mutual occlusion. The high similarity between barnyard grass and rice, coupled with the small size of the targets and the complex and dynamic background, poses a significant challenge for accurate identification of barnyard grass in rice fields based on UAV remote sensing images [\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eIn recent years, deep learning approaches have demonstrated significant potential in weed identification tasks [\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e]. Deep learning, with advantages such as end-to-end learning, high-level feature learning, and large-scale data-driven capabilities, has rapidly emerged as the mainstream method in the field of object detection [\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e]. 
Deep learning can learn end-to-end directly from large-scale annotated weed image data, automatically extracting the visual features required for weed classification without manual feature design and selection. Furthermore, as datasets expand, deep models show continuous improvement in performance and in adaptability to different agricultural environments [\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eThere are two typical architectures for modern object detectors: CNN-based and Transformer-based. Over the past few years, extensive research has been conducted on CNN-based object detectors. These detectors have evolved from the initial two-stage structures, such as the R-CNN series, to one-stage structures, represented by the YOLO series [\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e]. Zhang et al. [\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e] embedded the CBAM attention mechanism after the pooling layers in the latter part of VGG19, forming the VGG19-CBAM structure as the optimal backbone feature extraction network for the Faster R-CNN model. They used this model for weed detection in soybean fields, achieving an average recognition accuracy of 99.16% with an average recognition time of 336 ms per image. Gallo et al. [\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e] collected over 3000 weed remote sensing images using drones in a chicory plantation, creating a weed dataset for chicory production. They trained a YOLOv7 model on this dataset for weed detection, achieving an average recognition accuracy of 56.6%. 
Both two-stage and one-stage models rely on many hand-crafted components, such as anchor generation, rule-based training target assignment, and non-maximum suppression (NMS) post-processing, so they are not fully end-to-end.\u003c/p\u003e \u003cp\u003eSince its introduction, DETR (DEtection TRansformer), a Transformer-based object detector, has garnered widespread attention in the academic community because it eliminates various hand-crafted components, such as NMS. This architecture significantly simplifies the object detection pipeline, achieving end-to-end object detection [\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e]. In recent years, Transformer-based detectors have made significant progress in performance, thanks to sustained efforts to accelerate training convergence and reduce optimization difficulty [\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e]. Zhu et al. [\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e] pointed out that when Transformer components are initialized, the attention modules apply almost identical attention weights to all pixels in the feature map, so training takes longer to converge. To address this issue, they proposed a deformable attention module, which combines the sparse spatial sampling of deformable convolution with the relation modeling capability of Transformers, to overcome the slow convergence of DETR models. Li et al. [\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e] attributed the slow convergence of DETR models to the instability of bipartite graph matching, which results in inconsistent optimization objectives in the early training stages. 
To resolve this issue, they introduced noisy ground truth bounding boxes into the Transformer decoder, effectively reducing the difficulty of bipartite graph matching and accelerating convergence. However, despite these improvements in convergence speed and overall performance, the model still performs poorly on small targets. Current research has demonstrated that integrating multi-scale feature layers into the model can effectively enhance its detection performance for small targets [\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eIn this paper, we propose an MS-DETR model, which enhances the DETR model's ability to detect small targets by introducing multi-scale feature layers into the DETR framework. Existing research indicates that low-level semantic information typically contains more fine-grained and local features, which may be more distinctive and sensitive for small targets [\u003cspan additionalcitationids=\"CR20\" citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e]. Therefore, we design the individual feature layers of the multi-scale features differently. For high-level semantic information, we apply Transformer structures to extract features, fully integrating context information from different perceptual domains. For low-level semantic information, we use a more computationally efficient CNN structure for feature extraction and encoding. The two types of features are then fused through cross-scale feature fusion, leveraging their respective advantages and forming an information-rich feature space. 
In traditional Transformer structures, due to all heads sharing the same input features and relying on isolated learning with non-shared parameters, there is often a highly homogeneous and redundant representation across different heads[\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e]. To reduce computational redundancy, we adopt a novel structure called Cascaded Group Attention module to replace the traditional Transformer structure. This module provides different channel subsets of features as input for each head, allowing each head to learn more unique features, thereby enhancing the model's learning ability and reducing computational redundancy. The introduction of multi-scale feature layers inevitably increases the model's computational complexity and slows down the inference speed. In this study, we use an efficient and parallelizable Partial convolution (PConv) in the MS-DETR model to replace conventional convolution, aiming to maximize the model's inference speed.\u003c/p\u003e"},{"header":"2. Experimental Design","content":"\u003cp\u003e2.1 Data Collection: In rice production management, the \u0026quot;two seals and one kill\u0026quot; weed control strategy is commonly employed. This strategy involves two soil herbicide applications before rice transplanting and after the tillering stage, along with one herbicide application during the panicle initiation stage. Therefore, between May and June 2022, at the experimental field of Shenyang Agricultural University in Haicheng City, Liaoning Province, unmanned aerial vehicles were utilized to collect remote sensing data for both rice and barnyard grass during the tillering and panicle initiation stages. The experimental area measured 165 meters in length, 97 meters in width, with a total area of 16,005 square meters, as illustrated in Fig. \u003cspan\u003e1\u003c/span\u003e. 
The DJI M300 drone served as the flight platform, flying at an altitude of 30 meters and equipped with the Zenmuse P1 lens with an effective pixel count of 45\u0026nbsp;million. To ensure image registration accuracy, the drone followed a predetermined flight path with 80% forward overlap and 80% side overlap. Images were captured in a vertical perspective to cover the entire experimental field. The collected image resolution was 8192\u0026times;5460 pixels. DJI Terra was used for image registration and fusion of the acquired rice field remote sensing images. To prevent disturbances to both rice plants and weeds caused by strong winds and ensure the accuracy of subsequent image registration and fusion, unmanned aerial vehicle (UAV) remote sensing data collection was conducted under weather conditions with wind speeds below level 4. A total of 171 UAV remote sensing images of rice fields were collected.\u003c/p\u003e\n\u003cp\u003e2.2 Dataset Generation: To avoid inconsistencies in annotation when the same target appears in different images, this study initially used DJI Terra to register and fuse the collected unmanned aerial vehicle remote sensing data of weeds. Subsequently, the registered and fused images were segmented into non-overlapping sub-images of 600\u0026times;600 pixels each. After image segmentation, a total of 3,094 rice field weed remote sensing images were obtained. Weed distribution in the field is uneven, with dense areas showing continuous weed growth, while sparse areas exhibit individual weed plants. Therefore, during the manual annotation process, barnyard grass was classified into two types: continuous patches of barnyard grass and single barnyard grass plants. The Labelme software (v4.5.6) was employed for manual annotation. The schematic diagram of the dataset is shown in Fig. \u003cspan\u003e2\u003c/span\u003e, and the sample quantities after annotation are presented in Table \u003cspan\u003e1\u003c/span\u003e. 
The dataset was partitioned into training, validation, and testing sets at a ratio of 7:2:1, with no duplicate data between sets.\u003c/p\u003e\n\u003cp\u003eThe yellow-highlighted boxes in the image indicate the \u0026quot;Field ridge\u0026quot; labels, the green-highlighted boxes represent the \u0026quot;Single barnyard grass plant\u0026quot; labels, and the red-highlighted boxes correspond to the \u0026quot;Continuous patches of barnyard grass\u0026quot; labels.\u003c/p\u003e\n\u003cp\u003eFigure \u003cspan\u003e2\u003c/span\u003e. Example of a remote sensing image dataset of weeds in rice fields\u003c/p\u003e\n\u003cdiv\u003e\n \u003ctable id=\"Tab1\" border=\"1\"\u003e\n \u003ccaption language=\"En\"\u003e\n \u003cdiv\u003eTable 1\u003c/div\u003e\n \u003cdiv\u003e\n \u003cp\u003eNumber of samples in the dataset.\u003c/p\u003e\n \u003c/div\u003e\n \u003c/caption\u003e\n \u003ccolgroup cols=\"3\"\u003e\u003c/colgroup\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eLabel Category\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eNumber of original images\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eTotal number of images after augmentation\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eField ridge\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e438\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e1752\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eContinuous patches of barnyard grass\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e218\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n 
 \u003cp\u003e872\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eSingle barnyard grass plant\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e2876\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e11504\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n\u003c/div\u003e\n\u003cp\u003e2.3 Data Augmentation: To enhance the model\u0026apos;s robustness to the aforementioned samples, we augmented the training dataset fourfold through techniques such as random cropping, color jittering, noise addition, and random rotation. This resulted in a dataset containing a total of 14,128 images, with the sample distribution outlined in Table \u003cspan\u003e1\u003c/span\u003e.\u003c/p\u003e"},{"header":"3. Materials and Methods","content":"\u003cdiv id=\"Sec4\" class=\"Section2\"\u003e \u003ch2\u003e3.1: Methodology Employed in this Study\u003c/h2\u003e \u003cp\u003eModel Overview: To address the limitations of the DETR model in small object detection [\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e], we introduced a multi-scale feature layer into the DETR model, creating the MS-DETR model. The purpose of this modification is to better adapt to small targets, particularly the detection of single barnyard grass plants in rice fields. The model framework, illustrated in Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e, consists of three core components: Backbone, Encoder, and Decoder. We designed a hybrid encoder, which comprises a feature extraction module and a cross-scale feature fusion module. In the feature extraction module, we designed the individual feature layers differently. 
The high-level semantic feature layers employ a Transformer structure to emphasize the extraction of contextual information related to barnyard grass and rice, while the low-level semantic feature layers efficiently extract detailed barnyard grass features using a CNN structure. The cross-scale feature fusion module combines the features extracted by the Transformer and CNN structures across different scales. This design of the hybrid encoder combines features from different levels, creating favorable conditions for improving the overall performance of the model. To better extract barnyard grass features in the high-level semantic feature layers, we introduced the Cascaded Group Attention module, replacing the traditional multi-head attention mechanism to enhance learning ability while reducing computational burden. Finally, to further improve the inference speed, MS-DETR adopts the efficient and parallelizable PConv in place of conventional convolution operations. Together, these designs improve the performance of our model in small object detection tasks, especially in the detection of barnyard grass.\u003c/p\u003e \u003cp\u003eTaking a rice field weed image with dimensions of 600\u0026times;600 as an example, we first preprocessed the image by resizing it to 640\u0026times;640\u0026times;3 using bilinear interpolation. The resized image was then fed into the Backbone module for initial feature extraction. In the middle layers of the Backbone, we concurrently inserted multiple convolutional layers to perform multi-scale convolution operations on the feature map, generating fixed-dimensional multi-scale feature representations. This resulted in three multi-scale feature layers labeled as s3, s4, and s5, with dimensions of 80\u0026times;80\u0026times;128, 40\u0026times;40\u0026times;256, and 20\u0026times;20\u0026times;512, respectively. 
The high-level semantic feature layer s5 was input into the Transformer structure for feature encoding, producing the encoded feature layer y1 with dimensions of 20\u0026times;20\u0026times;512. Simultaneously, the low-level semantic feature layer s3 was input into a CNN for feature extraction, yielding the feature layer y2 with dimensions of 80\u0026times;80\u0026times;256. Finally, we input y1, y2, and s4 into the feature fusion module for cross-scale feature fusion. The output is the fused feature layer y3 with dimensions of 80\u0026times;80\u0026times;512. This y3 serves as our final feature representation, which is then input into the subsequent decoding layers to produce the weed recognition predictions.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eFeature Extraction Based on Cascaded Group Attention: In traditional Transformer structures, all heads share the same input features, so different heads may learn redundant information. At the same time, because parameters are not shared, each head independently learns weights and feature representations, which can lead to overly similar features across heads and a lack of diversity. To overcome this issue, this study introduces the Cascaded Group Attention module. This module provides each head with a different channel subset of the features as input, enabling each head to learn more distinct features. Additionally, it cascades output features among heads, thereby enhancing the model's learning ability and reducing computational redundancy [\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e]. 
As illustrated in Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e, the structure of the Cascaded Group Attention module can be described as follows:\u003cdiv id=\"Equ1\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ1\" name=\"EquationSource\"\u003e\n$${\\stackrel{\\sim}{X}}_{ij}=Attn({X}_{ij}{W}_{ij}^{Q},{X}_{ij}{W}_{ij}^{K},{X}_{ij}{W}_{ij}^{V})$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e1\u003c/div\u003e\u003c/div\u003e\u003cdiv id=\"Equ2\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ2\" name=\"EquationSource\"\u003e\n$${\\stackrel{\\sim}{X}}_{i+1}=Concat{\\left[{\\stackrel{\\sim}{X}}_{ij}\\right]}_{j=1:h}{W}_{i}^{P}$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e2\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003ewhere the \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(j\\)\u003c/span\u003e\u003c/span\u003e-th head computes the self-attention over \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\({X}_{ij}\\)\u003c/span\u003e\u003c/span\u003e, which is the \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(j\\)\u003c/span\u003e\u003c/span\u003e-th split of the input feature \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\({X}_{i}\\)\u003c/span\u003e\u003c/span\u003e, i.e., \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\({X}_{i}=\\)\u003c/span\u003e\u003c/span\u003e \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\left[{X}_{i1},{X}_{i2},\\dots ,{X}_{ih}\\right]\\)\u003c/span\u003e\u003c/span\u003e and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(1\\le j\\le h\\)\u003c/span\u003e\u003c/span\u003e. 
\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(h\\)\u003c/span\u003e\u003c/span\u003e is the total number of heads, \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\({W}_{ij}^{\\text{Q}},{W}_{ij}^{\\text{K}}\\)\u003c/span\u003e\u003c/span\u003e, and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\({W}_{ij}^{\\text{V}}\\)\u003c/span\u003e\u003c/span\u003e are projection layers mapping the input feature split into different subspaces, and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\({W}_{i}^{\\text{P}}\\)\u003c/span\u003e\u003c/span\u003e is a linear layer that projects the concatenated output features back to the dimension consistent with the input.\u003c/p\u003e \u003cp\u003eThe Cascaded Group Attention calculates the attention maps for each head in a cascading manner, adding the output of each head to the subsequent ones. This design encourages the Q, K, V layers to learn feature projections with richer information, progressively improving the capacity of feature representation. 
Through the cascading structure, the model continuously accumulates and propagates richer information from one attention head to the next, which enhances its learning ability and further refines the feature representation:\u003cdiv id=\"Equ3\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ3\" name=\"EquationSource\"\u003e\n$${X}_{ij}^{{\\prime }}={X}_{ij}+{\\stackrel{\\sim}{X}}_{i(j-1)}, 1\u0026lt;j\\le h$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e3\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003ewhere \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\({X}_{ij}^{{\\prime }}\\)\u003c/span\u003e\u003c/span\u003e is the sum of the \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(j\\)\u003c/span\u003e\u003c/span\u003e-th input split \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\({X}_{ij}\\)\u003c/span\u003e\u003c/span\u003e and the \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\((j-1)\\)\u003c/span\u003e\u003c/span\u003e-th head output \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\({\\stackrel{\\sim}{X}}_{i(j-1)}\\)\u003c/span\u003e\u003c/span\u003e calculated by Eq.\u0026nbsp;(\u003cspan refid=\"Equ1\" class=\"InternalRef\"\u003e1\u003c/span\u003e). It replaces \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\({X}_{ij}\\)\u003c/span\u003e\u003c/span\u003e to serve as the new input feature for the \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(j\\)\u003c/span\u003e\u003c/span\u003e-th head when calculating the self-attention. 
Besides, another token\u003c/p\u003e \u003cp\u003eFirst, the Cascaded Group Attention reduces the FLOPs and parameters of the QKV layers by a factor of \u003cem\u003eh\u003c/em\u003e, since the input and output channels of these layers are reduced by \u003cem\u003eh\u003c/em\u003e\u0026times;. Second, cascading attention heads increases the network depth, thereby further enhancing the model capacity without introducing any additional parameters.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eFeature Extraction Based on CNN Structure: Existing research indicates that low-level semantic feature layers contain more fine-grained local information, which is crucial for the detection of small targets [\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e]. The convolution and pooling operations of a CNN help extract local information such as textures and shapes, making it easier to capture local features and details in the images. This makes a CNN more suitable for extracting and encoding detailed features from low-level semantic feature layers [\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e]. Therefore, in the design of the hybrid encoder for the MS-DETR model, we utilized a CNN structure in the feature extraction module to extract detailed information about weeds from the low-level semantic feature layer. When a CNN is used to extract low-level details, appropriately expanding its receptive field enables it to capture richer features of the target and the surrounding background areas, thereby improving the quality of small target detection [\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e]. 
Compared with regular convolution, dilated convolution enlarges the receptive field and obtains broader and richer features, which is crucial for detecting small targets of different scales [\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e]. Therefore, we employed dilated convolution for feature extraction on the low-level semantic feature layer, as illustrated in Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eTo capture multi-scale features within different receptive fields, we employed dilated convolutions with dilation factors of 6, 12, and 18 to extract low-level semantic information from the multi-scale feature layers. Here, the kernel size of the dilated convolution is 3\u0026times;3, and each dilated convolution, together with Batch Normalization and a ReLU activation function, forms a separate branch. To alleviate potential issues of gradient vanishing or exploding during training, we introduced a residual structure in each branch, including a 1\u0026times;1 convolution layer. The output of each branch is obtained by summing the dilated-convolution path and the residual path, and the branch outputs are then concatenated. By applying a 1\u0026times;1 convolution operation, we reduced the channel number from 240 to 80, obtaining a globally fused feature representation that incorporates multi-scale contextual information. This helps in capturing subtle features of barnyard grass in the rice field scene.\u003c/p\u003e \u003cp\u003eEfficient and Parallelizable PConv as an Alternative to Conventional Convolution: The introduction of multi-scale feature layers is bound to increase the computational load of the model, slowing down its inference speed. Current research indicates that frequent memory access by operators is the primary cause of low FLOPS (floating-point operations per second). 
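To maximize the inference speed of the model, we replaced its conventional convolutions with PConv, which reduces both memory access and computational redundancy. PConv applies a regular convolution to only \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\({c}_{p}\\)\u003c/span\u003e\u003c/span\u003e of the \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(c\\)\u003c/span\u003e\u003c/span\u003e input channels, taking the first or last consecutive channels (for contiguous, regular memory access) as representatives of the entire feature map, while the remaining channels are left unchanged [\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e33\u003c/span\u003e]. As a result, for an \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(h\\times w\\)\u003c/span\u003e\u003c/span\u003e feature map and a \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(k\\times k\\)\u003c/span\u003e\u003c/span\u003e kernel, the FLOPs of PConv are only:\u003cdiv id=\"Equ4\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ4\" name=\"EquationSource\"\u003e\n$$h\\times w\\times {k}^{2}\\times {c}_{p}^{2}.$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e4\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eWith a typical partial ratio \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(r=\\frac{{c}_{p}}{c}=\\frac{1}{4}\\)\u003c/span\u003e\u003c/span\u003e, the FLOPs of a PConv are only \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\frac{1}{16}\\)\u003c/span\u003e\u003c/span\u003e of those of a regular Conv, which costs \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(h\\times w\\times {k}^{2}\\times {c}^{2}\\)\u003c/span\u003e\u003c/span\u003e.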
Besides, PConv has a smaller amount of memory access, i.e.,\u003cdiv id=\"Equ5\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ5\" name=\"EquationSource\"\u003e\n$$h\\times w\\times 2{c}_{p}+{k}^{2}\\times {c}_{p}^{2}\\approx h\\times w\\times 2{c}_{p}$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e5\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003ewhich is only \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\frac{1}{4}\\)\u003c/span\u003e\u003c/span\u003e of a regular Conv for \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(r=\\frac{1}{4}\\)\u003c/span\u003e\u003c/span\u003e.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec5\" class=\"Section2\"\u003e \u003ch2\u003e3.2: Model Training and Evaluation Metrics\u003c/h2\u003e \u003cp\u003eParameter Configuration: To ensure the fairness of the experiments, identical initial training parameters are set for each group. Taking into account physical memory constraints and learning efficiency, the number of training images per batch is set to 4, and the maximum iteration count is set to 500. 
During training, the model employs the Stochastic Gradient Descent (SGD) optimizer, and the learning rate (\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(lr\\)\u003c/span\u003e\u003c/span\u003e) decay strategy can be described as follows:\u003cdiv id=\"Equ6\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ6\" name=\"EquationSource\"\u003e\n$$lr=base\\_lr\\cdot {\\left(1-\\frac{iter\\_num}{max\\_iterations}\\right)}^{p}$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e6\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eHere, \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(base\\_lr\\)\u003c/span\u003e\u003c/span\u003e represents the base learning rate, \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(max\\_iterations\\)\u003c/span\u003e\u003c/span\u003e is the maximum iteration count, \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(iter\\_num\\)\u003c/span\u003e\u003c/span\u003e is the iteration index, and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(p\\)\u003c/span\u003e\u003c/span\u003e is the polynomial decay exponent. In this study, the base learning rate is set to 0.001, momentum is set to 0.9, weight decay is set to 1e-4, and the lower limit for learning rate updates is 0. These settings are consistently applied across all model training sessions.\u003c/p\u003e \u003cp\u003eThis study employs the Cross-Entropy Loss function to quantify the distance between the predicted probability distribution of pixel categories and the true label category probability distribution during the training process.
The specific calculation method is as follows:\u003cdiv id=\"Equ7\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ7\" name=\"EquationSource\"\u003e\n$$Loss=-\\frac{1}{M}\\sum _{i=1}^{M}\\sum _{c=1}^{N}h\\left({b}_{i}\\right)\\log \\left({p}_{ic}\\right)$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e7\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eIn the formula, \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(M\\)\u003c/span\u003e\u003c/span\u003e represents the number of pixels; \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(N\\)\u003c/span\u003e\u003c/span\u003e represents the number of categories; \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(i\\)\u003c/span\u003e\u003c/span\u003e represents the current pixel; \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(c\\)\u003c/span\u003e\u003c/span\u003e represents the current category; \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\({b}_{i}\\)\u003c/span\u003e\u003c/span\u003e is the true label category for pixel \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(i\\)\u003c/span\u003e\u003c/span\u003e; \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(h\\)\u003c/span\u003e\u003c/span\u003e is an indicator function that equals 1 if \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\({b}_{i}=c\\)\u003c/span\u003e\u003c/span\u003e and 0 otherwise; \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\({p}_{ic}\\)\u003c/span\u003e\u003c/span\u003e is the predicted probability of pixel \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(i\\)\u003c/span\u003e\u003c/span\u003e belonging to category \u003cspan class=\"InlineEquation\"\u003e\u003cspan
class=\"mathinline\"\u003e\\(c\\)\u003c/span\u003e\u003c/span\u003e, obtained through the Sigmoid function applied to the calculation of predicted category scores. Through the computation of the loss function during the iteration process, the model's training performance is evaluated. The weights are adjusted through backpropagation to gradually reduce the error represented by the loss value, aiming to achieve the training objectives.\u003c/p\u003e \u003cp\u003eEvaluation Metrics: To quantitatively analyze the model's performance, this study employs Average Precision (AP), precision, and recall to assess the effectiveness of the proposed MS-DETR model. For precision and recall, there are four states after the test sample is predicted: true positive (TP), false positive (FP), true negative (TN), and false negative (FN). The definitions are as follows:\u003cdiv id=\"Equ8\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ8\" name=\"EquationSource\"\u003e\n$$precision=\\frac{TP}{TP+FP}$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e8\u003c/div\u003e\u003c/div\u003e\u003cdiv id=\"Equ9\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ9\" name=\"EquationSource\"\u003e\n$$recall=\\frac{TP}{TP+FN}$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e9\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eThe recall rate and precision rate are based on the threshold value of 0.5.\u003c/p\u003e \u003cp\u003eExperimental Platform Configuration:\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eExperimental Environment\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"7\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" 
colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eOperating\u003c/p\u003e \u003cp\u003esystem\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colspan=\"3\" nameend=\"c4\" namest=\"c2\"\u003e \u003cp\u003eHardware environment\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colspan=\"3\" nameend=\"c7\" namest=\"c5\"\u003e \u003cp\u003eSoftware environment\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eCPU\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eHard drive\u003c/p\u003e \u003cp\u003ecapacity\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eGPU\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003ePython\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003ecuDNN\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003eCUDA\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eWindows 10\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eIntel(R)\u003c/p\u003e 
\u003cp\u003eCore(TM)\u003c/p\u003e \u003cp\u003ei7-9700\u003c/p\u003e \u003cp\[email protected]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e64G\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eNVIDIA\u003c/p\u003e \u003cp\u003eGeForce RTX 5000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e3.7\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e8.5.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e11.7\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e"},{"header":"4. Experimental Results","content":"\u003cp\u003eIn this section, we conducted multiple experiments to validate the performance and reliability of the proposed MS-DETR model in rice field weed detection. Comprehensive analysis and discussion of the experimental results were performed.\u003c/p\u003e\n\u003ch2\u003e4.1 Visualization\u003c/h2\u003e\n\u003cp\u003eTo visually demonstrate the effectiveness of the proposed approach in improving the recognition performance of rice field weed images, this study introduced three successive improvements based on the original DETR model. The first improvement replaced the multi-head attention mechanism in DETR with Cascaded Group Attention, resulting in DETR-CGA. The second improvement added multiscale feature layers to DETR-CGA and used CGA and CNN to extract high and low-level semantic features separately, yielding DETR-CGA\u0026thinsp;+\u0026thinsp;CNN. Finally, the MS-DETR model was obtained by replacing conventional Conv with PConv on the basis of DETR-CGA\u0026thinsp;+\u0026thinsp;CNN. Subsequently, the attention maps for weed feature extraction were compared using Grad-CAM visualization technique between the original DETR, MS-DETR, and the two intermediate variants.
All attention maps are from the last encoding layer of the model\u0026apos;s encoder. The results are shown in Fig. \u003cspan\u003e6\u003c/span\u003e.\u003c/p\u003e\n\u003cp\u003eFigure \u003cspan\u003e6\u003c/span\u003e: Visualization of Target Heatmaps under Different Improvement Methods\u003c/p\u003e\n\u003cp\u003eObserving Fig. \u003cspan\u003e6\u003c/span\u003ec and \u003cspan\u003e6\u003c/span\u003eb, it is evident that the DETR-CGA model, incorporating the Cascaded Group Attention module, enhances attention to key feature regions when recognizing single barnyard grass plants and field ridges compared to the original DETR model. Although it expanded the attention scope on the features of contiguous weeds, the DETR-CGA model compensates for the missed detection issues present in the original DETR model, as illustrated by the red boxes in the figure. Observing Fig. \u003cspan\u003e6\u003c/span\u003ee and \u003cspan\u003e6\u003c/span\u003ed, it is evident that the MS-DETR model, utilizing PConv, exhibits a pronounced focus in the attention distribution on the main feature regions of all target categories compared to the DETR-CGA\u0026thinsp;+\u0026thinsp;CNN model with conventional convolutions. The innovation of the MS-DETR model lies in the effective fusion of global and local features. As depicted in Fig. \u003cspan\u003e6\u003c/span\u003ee, when detecting single barnyard grass plants and continuous patches of barnyard grass, the MS-DETR model primarily focuses on their growth positions between field ridges. The growth position of barnyard grass between field ridges is a typical local feature distinguishing barnyard grass from rice. When identifying field ridge categories, the MS-DETR model emphasizes both the boundary parts of the field ridge and the presence of weeds on the ridge, ensuring comprehensive attention to both global and local features. 
This indicates that the MS-DETR model, through the effective fusion of global and local features in the image, enhances the recognition ability of typical features in targets, thereby improving the detection performance of rice field weed by the model.\u003c/p\u003e\n\u003cdiv id=\"Sec7\"\u003e\n \u003ch2\u003e4.2 Sensitivity Analysis\u003c/h2\u003e\n \u003cp\u003eTo verify the contribution of the proposed improvement method to the model\u0026apos;s performance, this study conducted ablation experiments based on a self-constructed rice field weed dataset. Starting with the framework of the original DETR base model, various improvement modules were progressively incorporated to create multiple model variants. The performance of each variant was then evaluated using the mAP50 metric. Through ablation experiments, a quantitative analysis was conducted to assess the impact of each improvement method on the performance enhancement of the model in rice field weed detection tasks. The results are presented in Table \u003cspan\u003e3\u003c/span\u003e.\u003c/p\u003e\n \u003cdiv\u003e\n \u003ctable id=\"Tab3\" border=\"1\"\u003e\n \u003ccaption language=\"En\"\u003e\n \u003cdiv\u003eTable 3\u003c/div\u003e\n \u003cdiv\u003e\n \u003cp\u003eRecognition Results of Different Improvement Methods for Rice Field Weeds\u003c/p\u003e\n \u003c/div\u003e\n \u003c/caption\u003e\n \u003ccolgroup cols=\"6\"\u003e\u003c/colgroup\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eNO.\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eModel\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eAll\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eSingle barnyard grass plant\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eContinuous patches of barnyard grass\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eField 
ridge\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eDETR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.764\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.647\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.77\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.875\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eDETR-CGA\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.772\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.687\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.81\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.818\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eDETR-CGA\u0026thinsp;+\u0026thinsp;CNN\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.784\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.73\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.782\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.839\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e4\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eMS-DETR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n 
 \u003cp\u003e0.792\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.686\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.816\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.873\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n \u003c/div\u003e\n \u003cdiv\u003e\n \u003c/div\u003e\n \u003cp\u003e1 \u0026rarr; 2: In terms of overall accuracy, the DETR-CGA model slightly improves mAP50 over the original DETR model, from 0.764 to 0.772. By category, the DETR-CGA model improves recognition accuracy by about 4 percentage points for both single barnyard grass plants (0.647 to 0.687) and continuous patches of barnyard grass (0.77 to 0.81). This indicates that the CGA module enhances the model\u0026apos;s ability to extract complex features, effectively improving the recognition accuracy of complex targets such as barnyard grass. However, recognition accuracy for the relatively regular and simple field ridge targets fell from 0.875 to 0.818, a relative decrease of 6.5%. The reason might be that the attention heads of the CGA module are overly concentrated on capturing crucial complex semantic information, leading to insufficient representation of simple low-level visual features and failing to provide effective support for simple targets.\u003c/p\u003e\n \u003cp\u003e2 \u0026rarr; 3: The DETR-CGA\u0026thinsp;+\u0026thinsp;CNN model is built on the DETR-CGA model by introducing a multi-scale feature extraction module and fusing the semantic information extracted from both Transformer and CNN structures. Its overall mAP50 improves from 0.772 to 0.784. This suggests that fusing global and local features is beneficial for target detection.
For the single barnyard grass plant and field ridge categories, the recognition accuracy of the DETR-CGA\u0026thinsp;+\u0026thinsp;CNN model improved to varying degrees (0.687 to 0.73 and 0.818 to 0.839, respectively), with the single barnyard grass plant category gaining the most; accuracy for continuous patches of barnyard grass, however, fell from 0.81 to 0.782. This shows that adding the multi-scale feature extraction module can improve the model\u0026apos;s recognition accuracy for small target categories to some extent.\u003c/p\u003e\n \u003cp\u003e3 \u0026rarr; 4: The MS-DETR model, built on the DETR-CGA\u0026thinsp;+\u0026thinsp;CNN model, replaces the conventional convolutions with PConvs. This improvement enhances the model\u0026apos;s recognition capability, raising the overall mAP50 from 0.784 to 0.792. For large-area targets like continuous patches of barnyard grass and field ridges, the recognition accuracy of the MS-DETR model increased by 3.4 percentage points for both (0.782 to 0.816 and 0.839 to 0.873). However, for small-area targets like a single barnyard grass plant, the recognition accuracy decreased by 4.4 percentage points (0.73 to 0.686). This suggests that the PConv structure may be more suitable for extracting features of large-area targets, while having limitations in extracting features of small-area targets.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec8\"\u003e\n \u003ch2\u003e4.3 Analysis of Other Metrics\u003c/h2\u003e\n \u003cp\u003eTo provide a more comprehensive analysis of the impact of the proposed improvement methods on model performance, various metrics were analyzed. Figure \u003cspan\u003e7\u003c/span\u003e illustrates the trend of loss values during the training process.\u003c/p\u003e\n \u003cp\u003eAs can be seen from Fig. \u003cspan\u003e7\u003c/span\u003e, MS-DETR and DETR-CGA\u0026thinsp;+\u0026thinsp;CNN reached the lowest loss value of 0.13 during the training process. However, it is noteworthy that DETR-CGA\u0026thinsp;+\u0026thinsp;CNN converged relatively slowly and fluctuated greatly during training.
In contrast, the MS-DETR model achieved the best performance in both loss value and convergence speed.\u003c/p\u003e\n \u003cp\u003eFigure \u003cspan\u003e8\u003c/span\u003e depicts the precision-recall (PR) curves for models utilizing different improvement methods. PR curves and the area under the curve (AUC) are commonly used metrics to evaluate the performance of detection models. The PR curve illustrates the trade-off between precision and recall under different threshold conditions. The AUC value summarizes the precision achieved across all recall levels. The larger the AUC, the better the overall performance of the model.\u003c/p\u003e\n \u003cp\u003eAs shown in Fig. \u003cspan\u003e8\u003c/span\u003e, the PR curve of the MS-DETR model is closest to the top right corner, meaning the MS-DETR model has the best performance with an AUC of 0.79, outperforming the other compared models. This indicates that the MS-DETR model has higher average recognition accuracy and better overall performance.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec9\"\u003e\n \u003ch2\u003e4.4 Validation of Enhanced Small Object Recognition Capability\u003c/h2\u003e\n \u003cp\u003eThe experiments in this section verify the improvement of our proposed model on small target detection tasks. The model performance is evaluated by the mean Average Precision (AP) and mean Average Recall (AR) in different size ranges, where higher AP and AR values indicate better detection of targets within the corresponding size ranges. The AP and AR in Table \u003cspan\u003e4\u003c/span\u003e are obtained at IoU\u0026thinsp;=\u0026thinsp;0.50:0.95.
The subscripts are defined as follows: S represents small targets (area\u0026thinsp;\u0026le;\u0026thinsp;32\u003csup\u003e2\u003c/sup\u003e), M represents medium targets (32\u003csup\u003e2\u003c/sup\u003e\u0026thinsp;\u0026lt;\u0026thinsp;area\u0026thinsp;\u0026le;\u0026thinsp;96\u003csup\u003e2\u003c/sup\u003e), L represents large targets (area\u0026thinsp;\u0026gt;\u0026thinsp;96\u003csup\u003e2\u003c/sup\u003e), and area represents the number of pixels [\u003cspan\u003e34\u003c/span\u003e].\u003c/p\u003e\n \u003cdiv\u003e\n \u003ctable id=\"Tab5\" border=\"1\"\u003e\n \u003ccaption language=\"En\"\u003e\n \u003cdiv\u003eTable 4\u003c/div\u003e\n \u003cdiv\u003e\n \u003cp\u003eRecognition results of models with different improvement methods for different sizes of rice weeds.\u003c/p\u003e\n \u003c/div\u003e\n \u003c/caption\u003e\n \u003ccolgroup cols=\"5\"\u003e\u003c/colgroup\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\"\u003e\u0026nbsp;\u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eMS-DETR\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eDETR-CGA\u0026thinsp;+\u0026thinsp;CNN\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eDETR-CGA\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eDETR\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eAP-s\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e\u003cstrong\u003e0.111\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.073\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.103\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.058\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eAP-m\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n
\u003cp\u003e\u003cstrong\u003e0.153\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.139\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.136\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.118\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eAP-l\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e\u003cstrong\u003e0.635\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.624\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.617\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.617\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eAR-s\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.266\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e\u003cstrong\u003e0.443\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.364\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.191\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eAR-m\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.397\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.318\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.396\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e\u003cstrong\u003e0.402\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eAR-l\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n 
 \u003cp\u003e0.807\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.798\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e\u003cstrong\u003e0.814\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.788\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n \u003c/div\u003e\n \u003cdiv\u003e\n \u003c/div\u003e\n \u003cp\u003eThe experimental results show that, compared with the original DETR model, our proposed MS-DETR model improves AP on small, medium and large targets. The gain on small targets is the largest: AP rises by 91% (0.058 to 0.111), the highest among the compared models, and AR rises by 39% (0.191 to 0.266), although DETR-CGA\u0026thinsp;+\u0026thinsp;CNN (0.443) and DETR-CGA (0.364) attain higher small-target AR. The recognition of medium and large targets also improves, with the AP of medium targets increased by about 30% (0.118 to 0.153), and the AP and AR of large targets improved by 2.9% and 2.4% respectively. It should be noted that the AR of medium targets dropped slightly, by 1.2%. A possible reason is that optimizing the model for small target detection left less attention for medium targets: since small weeds are more densely distributed in weed scenes, the optimization favors small targets and sacrifices some recall on medium-sized weeds. Considering the small number and relatively easy detection of medium-sized weeds, this loss is acceptable.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec10\"\u003e\n \u003ch2\u003e4.5 Feasibility Analysis of Agricultural Production\u003c/h2\u003e\n \u003cp\u003eTo assess the computational complexity of our proposed method, we conducted testing experiments on the collected rice weed dataset.
To eliminate other influencing factors, we performed comparisons under the same experimental environment, where model parameters and GFLOPs were computed on a single NVIDIA RTX5000 GPU for input sizes of 640\u0026times;640 pixels. Inference time was calculated as the average over 100 runs on test samples of 640\u0026times;640 pixel images. The experimental results are presented in Table \u003cspan\u003e5\u003c/span\u003e.\u003c/p\u003e\n \u003cdiv\u003e\n \u003ctable id=\"Tab7\" border=\"1\"\u003e\n \u003ccaption language=\"En\"\u003e\n \u003cdiv\u003eTable 5\u003c/div\u003e\n \u003cdiv\u003e\n \u003cp\u003ePerformance parameters of models with different improvement methods.\u003c/p\u003e\n \u003c/div\u003e\n \u003c/caption\u003e\n \u003ccolgroup cols=\"5\"\u003e\u003c/colgroup\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eModel\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eDETR\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eDETR-CGA\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eDETR-CGA\u0026thinsp;+\u0026thinsp;CNN\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eMS-DETR\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eParameters\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e38.6MB\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e38.3MB\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e42.5MB\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e40.8MB\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eLatency\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n 
\u003cp\u003e0.00750s\u0026thinsp;\u0026plusmn;\u0026thinsp;0.00145s\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.00705s\u003c/p\u003e\n \u003cp\u003e\u0026plusmn;\u0026thinsp;0.00095s\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.00829s\u0026thinsp;\u0026plusmn;\u0026thinsp;0.01191s\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.00818s\u0026thinsp;\u0026plusmn;\u0026thinsp;0.00105s\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eFPS\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e133.3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e141.8\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e120.6\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e122.2\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003emAP50\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.764\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.772\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.784\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.792\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n \u003c/div\u003e\n \u003cp\u003eCompared with the original DETR model, the DETR-CGA model with the efficient Cascaded Group Attention module reduced the model size by 0.3MB. 
While reducing model size, DETR-CGA also improved mAP50 by 0.008 (from 0.764 to 0.772). This suggests that feeding a different channel subset of the features to each head reduces model parameters while allowing each head to learn more distinctive features, thereby improving the model\u0026apos;s recognition accuracy for rice field weeds. The DETR-CGA\u0026thinsp;+\u0026thinsp;CNN model then introduced multi-scale feature layers, whose additional parameters increased the model size from 38.3MB to 42.5MB (about 11%). In return, mAP50 rose by 0.012. On this basis, the efficient PConv was adopted to replace conventional convolutions in the MS-DETR model. With no other change in model structure, the model size decreased by 4%, FPS increased by 1.6, and mAP50 improved by a further 0.008. Overall, compared with the original DETR model, our model has no significant advantage in parameter count or inference time. It completes weed recognition in 0.00818 seconds per image. Although it is not the fastest model, it achieved the best recognition results. Our model strikes a good balance between recognition performance and computational efficiency, making it suitable for deployment on intelligent devices with limited computing power.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec11\"\u003e\n \u003ch2\u003e4.6 Comparison with Other Classic Algorithms\u003c/h2\u003e\n \u003cp\u003eTo comprehensively evaluate the model on the rice weed detection task, we conducted comparative experiments on the rice weed dataset, comparing MS-DETR with other classic DETR variants, including Deformable DETR [\u003cspan\u003e14\u003c/span\u003e], Anchor DETR [\u003cspan\u003e35\u003c/span\u003e], and DAB-DETR [\u003cspan\u003e36\u003c/span\u003e]. 
The experimental results are presented in Table \u003cspan\u003e6\u003c/span\u003e and Fig. \u003cspan\u003e9\u003c/span\u003e.\u003c/p\u003e\n \u003cdiv\u003e\n \u003ctable id=\"Tab8\" border=\"1\"\u003e\n \u003ccaption language=\"En\"\u003e\n \u003cdiv\u003eTable 6\u003c/div\u003e\n \u003cdiv\u003e\n \u003cp\u003eDetection performance of different DETR variant models.\u003c/p\u003e\n \u003c/div\u003e\n \u003c/caption\u003e\n \u003ccolgroup cols=\"5\"\u003e\u003c/colgroup\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eModel\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eMS-DETR\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eDeformable DETR\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eAnchor DETR\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eDAB-DETR\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003emAP50\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.792\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.775\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.755\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.773\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eParameters\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e40.8M\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e41M\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e36.8M\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e44M\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n 
\u003cp\u003eGFLOPs\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e187G\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e86G\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e151G\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e94G\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n \u003c/div\u003e\n \u003cp\u003e(a) Single barnyard grass plant (b) Field ridge (c) Continuous patches of barnyard grass\u003c/p\u003e\n \u003cp\u003eFigure \u003cspan\u003e9\u003c/span\u003e: Recognition results of different models on rice weeds.\u003c/p\u003e\n \u003cp\u003eWe chose these models for comparison because Deformable DETR was the first to introduce multi-scale features into DETR, effectively improving detection performance, and Anchor DETR and DAB-DETR build on the Deformable DETR model. As shown in Table \u003cspan\u003e6\u003c/span\u003e, MS-DETR achieved the highest mAP50 among the DETR variants (0.792). In terms of parameters, MS-DETR uses 40.8M, more than only Anchor DETR, the smallest model (36.8M). Given that MS-DETR also has the highest recognition accuracy, its parameter utilization efficiency is high. However, the computational complexity of MS-DETR reached 187 GFLOPs, the largest among all compared models. Overall, MS-DETR obtained the highest recognition accuracy while keeping the number of parameters within a reasonable range, at the cost of higher computational complexity.\u003c/p\u003e\n \u003cp\u003eAs shown in Fig. 
\u003cspan\u003e9\u003c/span\u003e, our proposed MS-DETR model performs best on the smaller single barnyard grass plant targets, accurately identifying all weed instances and field ridges in the image. In contrast, the Deformable DETR and DAB-DETR models failed to detect the field ridge in the bottom right corner (blue box in Fig. \u003cspan\u003e9\u003c/span\u003ea) and missed some weeds (yellow box in Fig. \u003cspan\u003e9\u003c/span\u003ea). There are two possible reasons. First, Deformable DETR does not distinguish between feature layers at different scales: its independent Deformable Attention modules on low-level semantic feature layers cannot capture detailed features as effectively as CNNs, so they do not fully exploit the key localization information that these layers provide for small targets. Second, multi-scale feature extraction and fusion by simple \u0026ldquo;stacking-summing\u0026rdquo; is too simple to model the rich interactions between features, which limits how effectively the model represents and integrates multi-scale information. Although the Anchor DETR model detected the field ridges, it also missed some weed targets (yellow box in Fig. \u003cspan\u003e9\u003c/span\u003ea).\u003c/p\u003e\n \u003cp\u003eFor larger field ridge targets, all models perform relatively well. However, the Anchor DETR model detected barnyard grass growing on the field ridges. Such plants were not annotated during data annotation, so the training data contained no samples of barnyard grass on field ridges, and these detections count as false positives. For continuous patches of barnyard grass, Anchor DETR failed to detect the patch in the bottom left corner (as shown in the black box in Fig. 
\u003cspan\u003e9\u003c/span\u003ec), while the other models detected most of the continuous weed patch area, with some differences in the positioning of the detection boxes. The MS-DETR model left a small part of the patch uncovered by any detection box, while the detection boxes of the Deformable DETR and DAB-DETR models overlap, most noticeably two boxes produced by Deformable DETR. The possible reasons for Anchor DETR missing this weed patch (black box in Fig. \u003cspan\u003e9\u003c/span\u003ec) are: (1) The concept of \u0026ldquo;dense weeds\u0026rdquo; is itself relatively subjective, and different people apply different criteria for weed density. Even for the same person, the understanding of \u0026ldquo;dense\u0026rdquo; may change when annotating data at different times, resulting in inconsistent labels in the training data. (2) The current training data volume is relatively small, and the samples do not cover the full range of weed density scenarios. This limits the model\u0026apos;s ability to learn the concept of \u0026ldquo;dense weeds\u0026rdquo;.\u003c/p\u003e\n\u003c/div\u003e"},{"header":"5. Discussion","content":"\u003cp\u003eBecause barnyard grass and rice plants are highly similar in morphology, and barnyard grass plants appear as small objects in UAV remote sensing imagery, accurately identifying barnyard grass in rice fields from UAV remote sensing is challenging. 
To address this problem, this study proposes targeted improvements and develops a rice field barnyard grass object detection model that balances detection performance and efficiency for barnyard grass detection in complex real-world scenarios.\u003c/p\u003e \u003cp\u003eTo improve the recognition accuracy of barnyard grass in remote sensing imagery, we proposed the MS-DETR model, which introduces multi-scale feature layers on the basis of DETR. We designed the feature layers differently according to their semantic level. The high-level semantic feature layer adopts a Transformer structure to emphasize the extraction of contextual relationships between barnyard grass and rice plants. The low-level semantic feature layer uses a CNN structure to extract barnyard grass detail features. The reasoning is as follows. High-level semantic feature layers usually contain more abstract, semantic information. The self-attention mechanism in Transformers allows each input position to attend to all other positions, unlike CNNs, which are limited by fixed window sizes. This all-to-all connectivity enables the model to build relationships between any two positions in the image, thereby better extracting global feature information. Low-level semantic feature layers usually contain more detailed information. Convolving a kernel with the feature layer element by element is essentially a weighted aggregation of neighboring features, which effectively captures local features in the feature layer.\u003c/p\u003e \u003cp\u003eWhen using the Transformer structure to extract contextual information about rice field weeds, we introduced the Cascaded Group Attention module to replace the traditional multi-head attention mechanism. 
Since the Cascaded Group Attention module splits the input features into multiple channel subsets and takes these channel subsets as the inputs to different self-attention heads separately, it avoids repetitive encoding of the same information by different heads and reduces computational redundancy. Meanwhile, different heads extracting features from their own channel subsets help the model learn more diverse representations of the input features. Experimental results show that this improvement increased the detection accuracy (mAP50) by 1%, reduced the model size from 38.6M to 38.3M, and shortened the inference time from 0.0075 seconds to 0.00705 seconds.\u003c/p\u003e \u003cp\u003eWhen using CNN to extract barnyard grass detail features, we apply atrous convolutions with different dilation rates on the same semantic feature layer to achieve multi-scale observation of the feature layer, thereby enabling the model to capture small barnyard grass features. Experimental results show that this improvement increased the barnyard grass recognition accuracy by 1.6%. This is mainly attributed to the enlarged receptive field of convolution kernels by introducing dilation rates in atrous convolution, which can capture richer features of barnyard grass objects and surrounding background regions. However, the introduction of this multi-branch structure leads to increased computational burden and slower inference speed. The model size increased from 38.3M to 42.5M, and the detection time increased from 0.00705 seconds to 0.00829 seconds.\u003c/p\u003e \u003cp\u003eIn order to maximize the model's inference speed, we extensively adopted the efficient parallelizable PConv in the model to replace conventional convolutions. PConv treats the first or last consecutive channel subset of the feature map as the representative of the entire feature map, performs spatial feature extraction on it using Conv, while keeping the remaining channels unchanged. 
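\u003c/p\u003e \u003cp\u003eThe channel-splitting idea described above can be illustrated with a minimal pure-Python sketch. This is not the implementation used in MS-DETR: the function names (conv2d_same, partial_conv, conv_weight_count), the nested-list tensor layout, and the choice of convolving the first n_conv channels are assumptions made only for illustration. The helper conv_weight_count shows why the approach is cheap: if, for example, a quarter of the channels are convolved (an illustrative ratio, not a value reported here), a 3\u0026times;3 convolution needs one sixteenth of the weights of a full convolution over all channels.\u003c/p\u003e

```python
from typing import List

Tensor3D = List[List[List[float]]]        # [channel][row][col]
Kernel = List[List[List[List[float]]]]    # [out_ch][in_ch][ky][kx]


def conv2d_same(x: Tensor3D, weight: Kernel) -> Tensor3D:
    """Plain k x k convolution, stride 1, zero padding, no bias."""
    in_ch, height, width = len(x), len(x[0]), len(x[0][0])
    k = len(weight[0][0])
    pad = k // 2
    out = []
    for w_out in weight:
        plane = [[0.0] * width for _ in range(height)]
        for c in range(in_ch):
            for i in range(height):
                for j in range(width):
                    acc = 0.0
                    for di in range(k):
                        for dj in range(k):
                            ii, jj = i + di - pad, j + dj - pad
                            # Positions outside the map contribute zero.
                            if 0 <= ii < height and 0 <= jj < width:
                                acc += w_out[c][di][dj] * x[c][ii][jj]
                    plane[i][j] += acc
        out.append(plane)
    return out


def partial_conv(x: Tensor3D, weight: Kernel, n_conv: int) -> Tensor3D:
    """Convolve only the first n_conv channels; pass the rest through unchanged."""
    convolved = conv2d_same(x[:n_conv], weight)
    untouched = [[row[:] for row in plane] for plane in x[n_conv:]]
    return convolved + untouched


def conv_weight_count(channels: int, k: int) -> int:
    """Weights of a k x k convolution mapping `channels` to `channels` (no bias)."""
    return channels * channels * k * k


if __name__ == "__main__":
    feature_map = [
        [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]],  # convolved
        [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]],  # passed through
    ]
    box_kernel = [[[[1.0] * 3 for _ in range(3)]]]  # one 3 x 3 all-ones kernel
    result = partial_conv(feature_map, box_kernel, n_conv=1)
    print(result[0][1][1], result[1] == feature_map[1])
    print(conv_weight_count(64, 3) // conv_weight_count(16, 3))
```

\u003cp\u003e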
This strategy of convolving only a subset of channels improves computational efficiency and reduces channel redundancy. Experimental results show that using PConv modules not only reduced the model size from 42.5M to 40.8M, but also shortened the average inference time by 1.3%. More importantly, the barnyard grass detection accuracy also increased from 0.784 to 0.792.\u003c/p\u003e \u003cp\u003eAlthough the MS-DETR model performs well on our self-built rice field weed dataset, several factors were not evaluated in this study. First, our training set was collected from a single experimental field, without considering the effects of different farm management measures on dominant weed species. Second, changes in lighting conditions may affect image features, and the current dataset does not cover variations under different weather conditions. These two limitations may affect the model's generalization ability in other environments. To mitigate these effects, in future research we will collect rice field weed datasets across more regions and time spans, including samples under varying lighting conditions and with different weed species, so as to expand the applicability of the MS-DETR model.\u003c/p\u003e"},{"header":"6. Conclusion","content":"\u003cp\u003eThis study proposes a rice field weed detection method for UAV remote sensing. The main conclusions are summarized as follows:\u003c/p\u003e \u003cp\u003e(1) Introducing multi-scale feature layers into the DETR model and designing them differently effectively improves detection performance, especially for small targets. 
Compared with the original DETR model, our proposed MS-DETR model improves overall detection accuracy (mAP50) by 3.6% in relative terms (from 0.764 to 0.792), and improves detection accuracy for small targets substantially, by 91%.\u003c/p\u003e \u003cp\u003e(2) Incorporating the Cascaded Group Attention module into the DETR model to replace the traditional multi-head attention mechanism can effectively reduce model computation while improving detection accuracy. The model size is reduced by 0.3M and the overall detection accuracy is improved by 1%.\u003c/p\u003e \u003cp\u003e(3) Extensively using PConv in the model can effectively decrease model computation and improve model inference speed. The model inference speed is increased by 1.3% and the model size is reduced by 1.7M.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eFunding\u003c/strong\u003e\u003cbr\u003e\u0026nbsp;Liaoning Province Applied Basic Research Program Project (2023JH2/101300120), Liaoning Province\u0026apos;s \u0026quot;Xingliao Talent Plan\u0026quot; project (XLYC2203005), and Open Project of the South China Tropical Smart Agriculture Technology Key Laboratory of the Ministry of Agriculture and Rural Affairs (HNZHNY-KFKT-202208).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthors and Affiliations\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eSchool of Information and Electrical Engineering, Shenyang Agricultural University、National Digital Agriculture Regional Innovation Center (Northeast)、Key Laboratory of Smart Agriculture Technology in Liaoning Province\u003c/p\u003e\n\u003cp\u003eZhonghui Guo\u003c/p\u003e\n\u003cp\u003eSchool of Information and Electrical Engineering, Shenyang Agricultural University、National Digital Agriculture Regional Innovation Center (Northeast)、Key Laboratory of Smart Agriculture Technology in Liaoning Province\u003c/p\u003e\n\u003cp\u003eDongdong Cai\u003c/p\u003e\n\u003cp\u003eSchool of Information and Electrical Engineering, Shenyang 
Agricultural University、National Digital Agriculture Regional Innovation Center (Northeast)、Key Laboratory of Smart Agriculture Technology in Liaoning Province\u003c/p\u003e\n\u003cp\u003eYunyi Zhou\u003c/p\u003e\n\u003cp\u003eSchool of Information and Electrical Engineering, Shenyang Agricultural University、National Digital Agriculture Regional Innovation Center (Northeast)、Key Laboratory of Smart Agriculture Technology in Liaoning Province\u003c/p\u003e\n\u003cp\u003eTongyu Xu\u003c/p\u003e\n\u003cp\u003eSchool of Information and Electrical Engineering, Shenyang Agricultural University、National Digital Agriculture Regional Innovation Center (Northeast)、Key Laboratory of Smart Agriculture Technology in Liaoning Province、Key Laboratory of Smart Agriculture in the South China Tropical Region, Ministry of Agriculture and Rural Affairs\u003c/p\u003e\n\u003cp\u003eFenghua Yu\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCorresponding author\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eCorrespondence to Fenghua Yu , E-mail: [email protected]\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConflict of interest\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNone\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eGhosh, D.; Brahmachari, K.; Skalicky, M.; Roy, D.; Das, A.; Sarkar, S.; Moulick, D.; Brestič, M.; Hejnak, V.; Vachova, P.; et al. The combination of organic and inorganic fertilizers influence the weed growth, productivity and soil fertility of monsoon rice. \u003cem\u003ePloS one\u003c/em\u003e \u003cstrong\u003e2022\u003c/strong\u003e, \u003cem\u003e17\u003c/em\u003e, e0262586.\u003c/li\u003e\n\u003cli\u003eRosle, R.; Che\u0026rsquo;Ya, N.N.; Ang, Y.; Rahmat, F.; Wayayok, A.; Berahim, Z.; Fazlil Ilahi, W.F.; Ismail, M.R.; Omar, M.H. Weed detection in rice fields using remote sensing technique: A review. 
\u003cem\u003eApplied sciences\u003c/em\u003e \u003cstrong\u003e2021\u003c/strong\u003e, \u003cem\u003e11\u003c/em\u003e, 10701.\u003c/li\u003e\n\u003cli\u003eMeshram, A.T.; Vanalkar, A.V.; Kalambe, K.B.; Badar, A.M. Pesticide spraying robot for precision agriculture: A categorical literature review and future trends. \u003cem\u003eJournal of Field Robotics\u003c/em\u003e \u003cstrong\u003e2022\u003c/strong\u003e, \u003cem\u003e39\u003c/em\u003e, 153\u0026ndash;171.\u003c/li\u003e\n\u003cli\u003eTalaviya, T.; Shah, D.; Patel, N.; Yagnik, H.; Shah, M. Implementation of artificial intelligence in agriculture for optimisation of irrigation and application of pesticides and herbicides. \u003cem\u003eArtificial Intelligence in Agriculture\u003c/em\u003e \u003cstrong\u003e2020\u003c/strong\u003e, \u003cem\u003e4\u003c/em\u003e, 58\u0026ndash;73.\u003c/li\u003e\n\u003cli\u003eRoslim, M.H.M.; Juraimi, A.S.; Che\u0026rsquo;Ya, N.N.; Sulaiman, N.; Manaf, M.N.H.A.; Ramli, Z.; Motmainna, M. Using remote sensing and an unmanned aerial system for weed management in agricultural crops: A review. \u003cem\u003eAgronomy\u003c/em\u003e \u003cstrong\u003e2021\u003c/strong\u003e, \u003cem\u003e11\u003c/em\u003e, 1809.\u003c/li\u003e\n\u003cli\u003eRahaman, F.; Juraimi, A.S.; Rafii, M.Y.; Uddin, M.K.; Hassan, L.; Chowdhury, A.K.; Bashar, H.M.K. Allelopathic effect of selected rice (Oryza sativa) varieties against barnyard grass (Echinochloa cruss-gulli). \u003cem\u003ePlants\u003c/em\u003e \u003cstrong\u003e2021\u003c/strong\u003e, \u003cem\u003e10\u003c/em\u003e, 2017.\u003c/li\u003e\n\u003cli\u003eSingh, V.; Rana, A.; Bishop, M.; Filippi, A.M.; Cope, D.; Rajan, N.; Bagavathiannan, M. Unmanned aircraft systems for precision weed detection and management: Prospects and challenges. 
\u003cem\u003eAdvances in Agronomy\u003c/em\u003e \u003cstrong\u003e2020\u003c/strong\u003e, \u003cem\u003e159\u003c/em\u003e, 93\u0026ndash;134.\u003c/li\u003e\n\u003cli\u003eZhang, Y.; Wang, M.; Zhao, D.; Liu, C.; Liu, Z. Early weed identification based on deep learning: A review. \u003cem\u003eSmart Agricultural Technology\u003c/em\u003e \u003cstrong\u003e2023\u003c/strong\u003e, \u003cem\u003e3\u003c/em\u003e, 100123.\u003c/li\u003e\n\u003cli\u003eAl-Badri, A.H.; Ismail, N.A.; Al-Dulaimi, K.; Salman, G.A.; Khan, A.R.; Al-Sabaawi, A.; Salam, M.S.H. Classification of weed using machine learning techniques: a review\u0026mdash;challenges, current and future potential techniques. \u003cem\u003eJournal of Plant Diseases and Protection\u003c/em\u003e \u003cstrong\u003e2022\u003c/strong\u003e, \u003cem\u003e129\u003c/em\u003e, 745\u0026ndash;768.\u003c/li\u003e\n\u003cli\u003eWang, W.; Lai, Q.; Fu, H.; Shen, J.; Ling, H.; Yang, R. Salient object detection in the deep learning era: An in-depth survey. \u003cem\u003eIEEE Transactions on Pattern Analysis and Machine Intelligence\u003c/em\u003e \u003cstrong\u003e2021\u003c/strong\u003e, \u003cem\u003e44\u003c/em\u003e, 3239\u0026ndash;3259.\u003c/li\u003e\n\u003cli\u003eHuang, H.; Lan, Y.; Yang, A.; Zhang, Y.; Wen, S.; Deng, J. Deep learning versus Object-based Image Analysis (OBIA) in weed mapping of UAV imagery. \u003cem\u003eInternational Journal of Remote Sensing\u003c/em\u003e \u003cstrong\u003e2020\u003c/strong\u003e, \u003cem\u003e41\u003c/em\u003e, 3446\u0026ndash;3479.\u003c/li\u003e\n\u003cli\u003eZhang, X.; Cui, J.; Liu, H.; Han, Y.; Ai, H.; Dong, C.; Zhang, J.; Chu, Y. Weed Identification in Soybean Seedling Stage Based on Optimized Faster R-CNN Algorithm. \u003cem\u003eAgriculture\u003c/em\u003e \u003cstrong\u003e2023\u003c/strong\u003e, \u003cem\u003e13\u003c/em\u003e, 175.\u003c/li\u003e\n\u003cli\u003eGallo, I.; Rehman, A.U.; Dehkordi, R.H.; Landro, N.; La Grassa, R.; Boschetti, M. 
Deep object detection of crop weeds: Performance of YOLOv7 on a real case dataset from UAV images. \u003cem\u003eRemote Sensing\u003c/em\u003e \u003cstrong\u003e2023\u003c/strong\u003e, \u003cem\u003e15\u003c/em\u003e, 539.\u003c/li\u003e\n\u003cli\u003eZhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable transformers for end-to-end object detection. \u003cem\u003earXiv preprint arXiv:2010.04159\u003c/em\u003e \u003cstrong\u003e2020\u003c/strong\u003e.\u003c/li\u003e\n\u003cli\u003eLv, W.; Xu, S.; Zhao, Y.; Wang, G.; Wei, J.; Cui, C.; Du, Y.; Dang, Q.; Liu, Y. DETRs beat YOLOs on real-time object detection. \u003cem\u003earXiv preprint arXiv:2304.08069\u003c/em\u003e \u003cstrong\u003e2023\u003c/strong\u003e.\u003c/li\u003e\n\u003cli\u003eMeng, D.; Chen, X.; Fan, Z.; Zeng, G.; Li, H.; Yuan, Y.; Sun, L.; Wang, J. Conditional DETR for fast training convergence. In Proceedings of the IEEE/CVF International Conference on Computer Vision; 2021; pp. 3651\u0026ndash;3660.\u003c/li\u003e\n\u003cli\u003eLi, F.; Zhang, H.; Liu, S.; Guo, J.; Ni, L.M.; Zhang, L. DN-DETR: Accelerate DETR training by introducing query denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2022; pp. 13619\u0026ndash;13627.\u003c/li\u003e\n\u003cli\u003eNing, X.; Tian, W.; Yu, L.; Li, W. Brain-inspired CIRA-DETR full inference model for small and occluded object detection. \u003cem\u003eChinese Journal of Computers\u003c/em\u003e \u003cstrong\u003e2022\u003c/strong\u003e, 045.\u003c/li\u003e\n\u003cli\u003eKe, X.; Cai, Y.; Chen, B.; Liu, H.; Guo, W. Granularity-aware distillation and structure modeling region proposal network for fine-grained image classification. \u003cem\u003ePattern Recognition\u003c/em\u003e \u003cstrong\u003e2023\u003c/strong\u003e, \u003cem\u003e137\u003c/em\u003e, 109305.\u003c/li\u003e\n\u003cli\u003eMeng, H.; Tian, Y.; Ling, Y.; Li, T. 
Fine-grained ship recognition for complex background based on global to local and progressive learning. \u003cem\u003eIEEE Geoscience and Remote Sensing Letters\u003c/em\u003e \u003cstrong\u003e2022\u003c/strong\u003e, \u003cem\u003e19\u003c/em\u003e, 1\u0026ndash;5.\u003c/li\u003e\n\u003cli\u003eWang, Y.; Tian, Y.; Liu, J.; Xu, Y. Multi-Stage Multi-Scale Local Feature Fusion for Infrared Small Target Detection. \u003cem\u003eRemote Sensing\u003c/em\u003e \u003cstrong\u003e2023\u003c/strong\u003e, \u003cem\u003e15\u003c/em\u003e, 4506.\u003c/li\u003e\n\u003cli\u003eYin, A.; Ren, C.; Yan, Z.; Xue, X.; Zhou, Y.; Liu, Y.; Lu, J.; Ding, C. C2S-RoadNet: road extraction model with depth-wise separable convolution and self-attention. \u003cem\u003eRemote Sensing\u003c/em\u003e \u003cstrong\u003e2023\u003c/strong\u003e, \u003cem\u003e15\u003c/em\u003e, 4531.\u003c/li\u003e\n\u003cli\u003eYe, T.; Qin, W.; Zhao, Z.; Gao, X.; Deng, X.; Ouyang, Y. Real-Time Object Detection Network in UAV-Vision Based on CNN and Transformer. \u003cem\u003eIEEE Transactions on Instrumentation and Measurement\u003c/em\u003e \u003cstrong\u003e2023\u003c/strong\u003e, \u003cem\u003e72\u003c/em\u003e, 1\u0026ndash;13.\u003c/li\u003e\n\u003cli\u003eRekavandi, A.M.; Rashidi, S.; Boussaid, F.; Hoefs, S.; Akbas, E.; et al. Transformers in small object detection: A benchmark and survey of state-of-the-art. \u003cem\u003earXiv preprint arXiv:2309.04902\u003c/em\u003e \u003cstrong\u003e2023\u003c/strong\u003e.\u003c/li\u003e\n\u003cli\u003eLiu, X.; Peng, H.; Zheng, N.; Yang, Y.; Hu, H.; Yuan, Y. EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2023; pp. 14420\u0026ndash;14430.\u003c/li\u003e\n\u003cli\u003eLei, T.; Xue, D.; Ning, H.; Yang, S.; Lv, Z.; Nandi, A.K. 
Local and global feature learning with kernel scale-adaptive attention network for VHR remote sensing change detection. \u003cem\u003eIEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing\u003c/em\u003e \u003cstrong\u003e2022\u003c/strong\u003e, \u003cem\u003e15\u003c/em\u003e, 7308\u0026ndash;7322.\u003c/li\u003e\n\u003cli\u003eMumuni, A.; Mumuni, F. CNN architectures for geometric transformation-invariant feature representation in computer vision: a review. \u003cem\u003eSN Computer Science\u003c/em\u003e \u003cstrong\u003e2021\u003c/strong\u003e, \u003cem\u003e2\u003c/em\u003e, 1\u0026ndash;23.\u003c/li\u003e\n\u003cli\u003eWang, H.; Chen, X.; Zhang, T.; Xu, Z.; Li, J. CCTNet: Coupled CNN and transformer network for crop segmentation of remote sensing images. \u003cem\u003eRemote Sensing\u003c/em\u003e \u003cstrong\u003e2022\u003c/strong\u003e, \u003cem\u003e14\u003c/em\u003e, 1956.\u003c/li\u003e\n\u003cli\u003eLi, S.; Li, B.; Li, J.; Liu, B.; Li, X. Semantic Segmentation Algorithm of Rice Small Target Based on Deep Learning. \u003cem\u003eAgriculture\u003c/em\u003e \u003cstrong\u003e2022\u003c/strong\u003e, \u003cem\u003e12\u003c/em\u003e, 1232.\u003c/li\u003e\n\u003cli\u003eQi, M.; Liu, L.; Zhuang, S.; Liu, Y.; Li, K.; Yang, Y.; Li, X. FTC-net: fusion of transformer and CNN features for infrared small target detection. \u003cem\u003eIEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing\u003c/em\u003e \u003cstrong\u003e2022\u003c/strong\u003e, \u003cem\u003e15\u003c/em\u003e, 8613\u0026ndash;8623.\u003c/li\u003e\n\u003cli\u003eHou, J.; Zhou, H.; Yu, H.; Hu, H. HPAC: a forest tree species recognition network based on multi-scale spatial enhancement in remote sensing images. 
\u003cem\u003eInternational Journal of Remote Sensing\u003c/em\u003e \u003cstrong\u003e2023\u003c/strong\u003e, \u003cem\u003e44\u003c/em\u003e, 5960\u0026ndash;5975.\u003c/li\u003e\n\u003cli\u003eWang, X.; Lv, R.; Zhao, Y.; Yang, T.; Ruan, Q. Multi-scale context aggregation network with attention-guided for crowd counting. In Proceedings of the 2020 15th IEEE International Conference on Signal Processing (ICSP); IEEE, 2020; Vol. 1, pp. 240\u0026ndash;245.\u003c/li\u003e\n\u003cli\u003eChen, J.; Kao, S.-H.; He, H.; Zhuo, W.; Wen, S.; Lee, C.-H.; Chan, S.-H.G. Run, Don\u0026rsquo;t Walk: Chasing Higher FLOPS for Faster Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2023; pp. 12021\u0026ndash;12031.\u003c/li\u003e\n\u003cli\u003eRostianingsih, S.; Setiawan, A.; Halim, C.I. COCO (creating common object in context) dataset for chemistry apparatus. \u003cem\u003eProcedia Computer Science\u003c/em\u003e \u003cstrong\u003e2020\u003c/strong\u003e, \u003cem\u003e171\u003c/em\u003e, 2445\u0026ndash;2452.\u003c/li\u003e\n\u003cli\u003eWang, Y.; Zhang, X.; Yang, T.; Sun, J. Anchor DETR: Query design for transformer-based object detection. \u003cem\u003earXiv preprint arXiv:2109.07107\u003c/em\u003e \u003cstrong\u003e2021\u003c/strong\u003e.\u003c/li\u003e\n\u003cli\u003eLiu, S.; Li, F.; Zhang, H.; Yang, X.; Qi, X.; Su, H.; Zhu, J.; Zhang, L. DAB-DETR: Dynamic anchor boxes are better queries for DETR. 
\u003cem\u003earXiv preprint arXiv:2201.12329\u003c/em\u003e \u003cstrong\u003e2022\u003c/strong\u003e.\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":true,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"plant-methods","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"plme","sideBox":"Learn more about [Plant Methods](http://plantmethods.biomedcentral.com/)","snPcode":"13007","submissionUrl":"https://submission.nature.com/new-submission/13007/3","title":"Plant Methods","twitterHandle":"@PlantMethods","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"em","reportingPortfolio":"BMC/SO AJ","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"Rice field weeds, Target detection, Transformer, DETR, UAV","lastPublishedDoi":"10.21203/rs.3.rs-4008720/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-4008720/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003ch2\u003eBackground\u003c/h2\u003e \u003cp\u003eRice field weed object detection can provide key information on weed species and locations for precise spraying, which is of great significance in actual agricultural production. However, facing the complex and changing real farm environments, traditional object detection methods still have difficulties in identifying small-sized, occluded and densely distributed weed instances. To address these problems, this paper proposes a multi-scale feature enhanced DETR network, named MS-DETR. 
By adding multi-scale feature extraction branches on top of DETR, this model fully utilizes the information from different semantic feature layers to improve recognition capability for rice field weeds in real-world scenarios.\u003c/p\u003e\u003ch2\u003eMethods\u003c/h2\u003e \u003cp\u003eIntroducing multi-scale feature layers on the basis of the DETR model, we conduct a differentiated design for different semantic feature layers. The high-level semantic feature layer adopts Transformer structure to extract contextual information between barnyard grass and rice plants. The low-level semantic feature layer uses CNN structure to extract local detail features of barnyard grass. Introducing multi-scale feature layers inevitably leads to increased model computation, thus lowering model inference speed. Therefore, we employ a new type of Pconv (Partial convolution) to replace traditional standard convolutions in the model, so as to reduce memory access time and computational redundancy.\u003c/p\u003e\u003ch2\u003eResults\u003c/h2\u003e \u003cp\u003eOn our constructed rice field weed dataset, compared with the original DETR model, our proposed MS-DETR model improved average recognition accuracy of rice field weeds by 2.8%, reaching 0.792. The MS-DETR model size is 40.8M with inference time of 0.0081 seconds. Compared with three classical DETR models (Deformable DETR, Anchor DETR and DAB-DETR), the MS-DETR model respectively improved average precision by 2.1%, 4.9% and 2.4%.\u003c/p\u003e\u003ch2\u003eDiscussion\u003c/h2\u003e \u003cp\u003eThis model has advantages such as high recognition accuracy and fast recognition speed. 
It is capable of accurately identifying rice field weeds in complex real-world scenarios, thus providing key technical support for precision spraying and management of variable-rate spraying systems.\u003c/p\u003e","manuscriptTitle":"Identifying Rice Field Weeds from Unmanned Aerial Vehicle Remote Sensing Imagery Using Deep Learning","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2024-03-08 07:06:26","doi":"10.21203/rs.3.rs-4008720/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Revision requested","date":"2024-05-05T22:04:30+00:00","index":"","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2024-04-22T13:09:16+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2024-04-17T06:17:21+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2024-04-09T11:31:06+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"a90752ff-ce9c-4501-b366-7bfab043fdc7","date":"2024-03-26T03:28:46+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"f083425f-69ac-4ff9-8da2-d2ca0165fb41_SNPRID","date":"2024-03-25T07:00:36+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"3a16ac6e-6c1a-4ca0-91a7-aa6d4eb5e05a","date":"2024-03-22T05:16:14+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"bef38636-f93c-40eb-9291-cdbb8b15dea5","date":"2024-03-22T00:21:16+00:00","index":"hide","fulltext":""},{"type":"reviewersInvited","content":"","date":"2024-03-21T10:38:42+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2024-03-06T09:59:44+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2024-03-06T09:59:44+00:00","index":"","fulltext":""},{"type":"submitted","content":"Plant Methods","date":"2024-03-03T12:37:48+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"[email 
protected]","identity":"plant-methods","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"plme","sideBox":"Learn more about [Plant Methods](http://plantmethods.biomedcentral.com/)","snPcode":"13007","submissionUrl":"https://submission.nature.com/new-submission/13007/3","title":"Plant Methods","twitterHandle":"@PlantMethods","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"em","reportingPortfolio":"BMC/SO AJ","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"80df3a9f-7d09-4f4e-9dc6-387db1cbfc85","owner":[],"postedDate":"March 8th, 2024","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"published-in-journal","subjectAreas":[],"tags":[],"updatedAt":"2024-08-01T17:23:05+00:00","versionOfRecord":{"articleIdentity":"rs-4008720","link":"https://doi.org/10.1186/s13007-024-01232-0","journal":{"identity":"plant-methods","isVorOnly":false,"title":"Plant Methods"},"publishedOn":"2024-07-16 16:13:38","publishedOnDateReadable":"July 16th, 2024"},"versionCreatedAt":"2024-03-08 07:06:26","video":"","vorDoi":"10.1186/s13007-024-01232-0","vorDoiUrl":"https://doi.org/10.1186/s13007-024-01232-0","workflowStages":[]},"version":"v1","identity":"rs-4008720","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-4008720","identity":"rs-4008720","version":["v1"]},"buildId":"qtupq5eGEP_6zYnWcrvyt","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}

