Improvement of YOLOv8 algorithm through integration of Pyramid Vision Transformer architecture

doi:10.21203/rs.3.rs-4987159/v1

2024 · doi:10.21203/rs.3.rs-4987159/v1

preprint OA: closed

Research Square preprint · Article

Zhiqiang Dong, Shu Yang, Yang Xiao

This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-4987159/v1. This work is licensed under a CC BY 4.0 License. Status: Posted, Version 1.

Abstract

To address the poor target detection accuracy of the YOLOv8s model in complex backgrounds, this paper proposes an improved YOLOv8s model that incorporates the Pyramid Vision Transformer (PVT). Specifically, to enhance the feature extraction capability of the base module, PVT replaces the basic convolutional feature extraction blocks in the Backbone stage of YOLOv8s. This structure allows the model to process images at different resolution levels and thereby capture details and contextual information more effectively.
Subjects: Physical sciences/Mathematics and computing/Computer science; Physical sciences/Mathematics and computing/Information technology

Introduction

With the rapid development of deep learning, the field of computer vision has changed dramatically. Object detection is one of the core tasks of computer vision, and its performance directly determines the level of intelligence achievable in many application scenarios. The YOLO (You Only Look Once) series of algorithms [1] has received widespread attention and application since its inception because it is both efficient and accurate. However, as application requirements become increasingly complex, enhancing the accuracy and efficiency of YOLO [2] algorithms has become a research hotspot. YOLOv8, the latest member of the YOLO series, inherits many advantages of YOLOv5 [3][4] and achieves significant performance improvements. Nevertheless, when faced with more complex and diverse scenes and with multi-scale object detection tasks, existing Convolutional Neural Network (CNN) structures still have limitations.

To overcome this challenge, this paper proposes an improved YOLOv8 algorithm that incorporates the Pyramid Vision Transformer (PVT) architecture [5] (Figure 1). PVT is a deep learning model that combines the Transformer architecture with a pyramid structure [6]. Unlike the original Vision Transformer (ViT) [7][8][9], PVT produces feature maps at different scales across multiple stages, forming a feature pyramid. This design lets PVT capture feature information at different scales and improves the model's ability to handle objects of varying sizes within an image.
Introducing PVT into the backbone network of YOLOv8 aims to leverage its robust feature extraction and multi-scale feature processing to further improve the detection accuracy and efficiency of YOLOv8 [10].

Currently, both single-stage and two-stage deep-learning object detection algorithms largely adopt feature pyramid structures to enrich intermediate feature maps. In methods based on the Vision Transformer (ViT) structure [11], the input image is first converted into a sequence of patches, which are then fed into the Transformer Encoder to extract object features and obtain feature maps. However, because the feature map size remains the same at every stage [12], this approach is difficult to apply to downstream vision tasks. To bridge the gap between ViT and feature pyramid techniques, Wang et al. [5] proposed PVT, which can be trained on high-resolution images without significantly increasing computational complexity. PVT employs a progressive shrinking pyramid to produce multi-scale output feature maps. It also introduces Spatial Reduction Attention (SRA) [13][14] to reduce resource consumption and time complexity during attention computation. Compared with CNNs and ViT, PVT inherits the global receptive field of ViT and also incorporates the pyramid structure of CNNs, which makes it easy to obtain multi-scale feature maps and to migrate to dense prediction tasks such as object detection and instance segmentation.

Figure 1 illustrates the overall architecture of PVT, which comprises four stage modules (Stage 1 to Stage 4). Each stage module contains a Patch Embedding and n Transformer Encoder Layers [5], and the stages output feature maps of different sizes (downsampled by factors of 4, 8, 16, and 32) [6].
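The four downsampling factors can be made concrete with a small shape calculation. The sketch below is only illustrative: the helper name `pvt_stage_shapes` is hypothetical, and the 640×640 input size is an assumption (a common YOLO input size), not a value stated in the paper.

```python
def pvt_stage_shapes(height, width, strides=(4, 8, 16, 32)):
    """Return (stage, feature_height, feature_width, token_count) per stage.

    `strides` are the per-stage downsampling factors described in the text.
    """
    shapes = []
    for stage, stride in enumerate(strides, start=1):
        if height % stride or width % stride:
            raise ValueError(f"{height}x{width} is not divisible by stride {stride}")
        fh, fw = height // stride, width // stride
        # Each spatial position of the feature map is one token for the encoder.
        shapes.append((stage, fh, fw, fh * fw))
    return shapes


# Assumed input size: 640 x 640 (not taken from the paper).
for stage, fh, fw, tokens in pvt_stage_shapes(640, 640):
    print(f"Stage {stage}: {fh} x {fw} feature map, {tokens} tokens")
```

Each stage halves the height and width of the previous one, so the token count drops by a factor of four from stage to stage.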
In the first stage module, an input image of size H×W×3 is first divided into patches of size 4×4×3, giving H×W/16 patches in total. Each patch is flattened and linearly projected to an embedding vector. These embedded vectors, along with positional embeddings, are then fed into the Transformer Encoder Layers, and the output is reshaped into a feature map f1 of size H/4×W/4×C1. The subsequent three stages follow a similar process and yield feature maps f2 (H/8×W/8×C2), f3 (H/16×W/16×C3), and f4 (H/32×W/32×C4).

To reduce the computational load, PVT replaces the Multi-Head Attention (MHA) module of the Transformer with SRA [5]. Like MHA, SRA takes a query, key, and value as inputs and outputs a refined feature [15]. Unlike MHA, SRA reduces the spatial dimensions (width and height) of the key and value before computing self-attention, which significantly decreases the attention computation and memory usage [14][16]. The structures of the MHA and SRA modules are depicted in Figure 2.

PVT has four variants: PVT-Tiny, PVT-Small, PVT-Medium, and PVT-Large. Considering computational cost and complexity, this paper adopts YOLOv8s as the baseline and uses PVT-Small as the feature extraction network, replacing the stacked convolutional blocks in YOLOv8s [8][17].

To address the poor detection accuracy of YOLOv8s in complex backgrounds [18][19], this paper aims to strengthen its feature extraction network by substituting PVT for the stacked convolutional block structure. This approach uses PVT's pre-trained weights to initialize the model parameters and benefits from the Transformer's strong feature modeling, enabling multi-dimensional feature extraction and fusion and thereby strengthening the model's image feature extraction ability.
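The saving that SRA offers over MHA, described above, can be estimated by counting query-key pairs. This is a rough sketch under stated assumptions: the function name `attention_pairs` is hypothetical, the 640×640 input is assumed, and the per-stage reduction ratios 8, 4, 2, 1 follow the original PVT design rather than anything reported in this paper. It counts attention pairs only and ignores the cost of the reduction step itself.

```python
def attention_pairs(height, width, stride, reduction=1):
    """Number of query-key pairs in one self-attention layer.

    Queries: one per position of the (height/stride) x (width/stride) map.
    Keys/values: the same map after its height and width are each divided
    by `reduction` (reduction=1 corresponds to plain MHA).
    """
    fh, fw = height // stride, width // stride
    queries = fh * fw
    keys = (fh // reduction) * (fw // reduction)
    return queries * keys


# Assumed values: (stage stride, SRA reduction ratio) for the four stages.
STAGES = [(4, 8), (8, 4), (16, 2), (32, 1)]

for stride, reduction in STAGES:
    mha = attention_pairs(640, 640, stride)
    sra = attention_pairs(640, 640, stride, reduction)
    print(f"stride {stride:2d}: MHA {mha:>11,d}  SRA {sra:>10,d}  saving x{mha // sra}")
```

Under these assumptions the saving is the square of the reduction ratio, so it is largest in the first stage, where the feature map has the most tokens.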
The overall structure of the proposed model is shown in Figure 3. We replace the original convolutional block structure with PVT in the Backbone stage and keep the Head and Prediction stages identical to those of YOLOv8s.

Experimental dataset

MS COCO (Microsoft Common Objects in Context) is a large-scale, publicly available benchmark dataset for computer vision tasks, including object detection, instance segmentation, and image captioning. It was released by Microsoft Research to advance research and development in computer vision. MS COCO has three versions, released in 2014, 2015, and 2017; the 2017 version was a significant update that further expanded its scale and annotation quality. The MS COCO 2017 dataset consists of hundreds of thousands of high-resolution images and covers 80 common object categories, including people, animals, furniture, electronic devices, and vehicles. The dataset is divided into a training set of 118,287 images, a validation set of 5,000 images, and a test set of 5,000 images, used for model training, performance evaluation, and image testing, respectively.

The self-built dataset for fall detection of elderly individuals contains 6,490 images captured in the Smart Elderly Care Laboratory of Guiyang Healthcare Vocational University and other locations. The training set contains 4,401 images, the validation set 1,089 images, and the test set 1,000 images. We used the LabelImg tool for annotation; some annotated training images are shown in Figure 4.

Experimental equipment and environment

This paper uses Ubuntu 20.04 as the experimental platform, with an Intel Xeon Gold 6330 CPU, an NVIDIA GeForce RTX 3090 GPU, and 64 GB of system memory.
In terms of software configuration, the GPU driver version is 525.60.11, and the versions of CUDA (Compute Unified Device Architecture) and cuDNN (NVIDIA CUDA Deep Neural Network library) are 11.8 and 8.7.0, respectively. The Python interpreter version is 3.8, and the PyTorch deep learning framework version is 2.0.1. This paper uses the Ultralytics library, version 8.1.18, to reproduce and improve the YOLOv8s model. Additional dependencies include OpenCV 4.7.0.72, NumPy 1.21.6, Pillow 9.5.0, Matplotlib 3.8.2, and tqdm 4.65.0.

Experimental hyperparameter setting

The batch size for all experiments was set to 64, and all experiments were trained for 100 epochs. The optimizer was Stochastic Gradient Descent (SGD), with an initial learning rate of 0.01 that gradually decays to a final learning rate of 0.0001. The total decay factor for the learning rate was set to 0.937, and the weight decay factor for the optimizer was set to 0.0005. For data augmentation during training, the probabilities for color hue, saturation, and brightness adjustments were set to 0.015, 0.7, and 0.4, respectively. The probabilities for left-right flipping and scale variation of images were both set to 0.5 to increase the diversity of the training data. We also employed the mosaic method to augment the dataset. For the loss function, the Complete Intersection over Union (CIoU) Loss was used for the bounding box loss, with a weight coefficient of 7.5. The weight coefficient for the Distribution Focal Loss (DFL) was set to 1.5, and the weight coefficient for the classification loss was set to 0.5.
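The settings above can be collected into a single configuration, sketched here as a plain Python dictionary. Several things are assumptions rather than statements from the paper: the key names follow Ultralytics training-argument names as best understood; `lrf` is treated as the final learning rate expressed as a fraction of `lr0` (0.01 × 0.01 = 0.0001); the value 0.937 is mapped to `momentum` because it coincides with the Ultralytics default for that argument, although the text calls it a decay factor; `mosaic=1.0` is assumed because the text only says mosaic was used; and the linear shape of the decay is assumed, since only the two endpoints are given.

```python
# Hypothetical mapping of the reported settings onto Ultralytics-style names.
TRAIN_ARGS = {
    "batch": 64,
    "epochs": 100,
    "optimizer": "SGD",
    "lr0": 0.01,            # initial learning rate
    "lrf": 0.01,            # final LR as a fraction of lr0 (assumed meaning)
    "momentum": 0.937,      # assumed: reported as a "decay factor" in the text
    "weight_decay": 0.0005,
    "hsv_h": 0.015,         # hue augmentation
    "hsv_s": 0.7,           # saturation augmentation
    "hsv_v": 0.4,           # brightness augmentation
    "fliplr": 0.5,          # left-right flip
    "scale": 0.5,           # scale variation
    "mosaic": 1.0,          # assumed: the text only says mosaic was used
    "box": 7.5,             # CIoU bounding box loss weight
    "dfl": 1.5,             # Distribution Focal Loss weight
    "cls": 0.5,             # classification loss weight
}


def learning_rate(epoch, args=TRAIN_ARGS):
    """Linear decay from lr0 to lr0 * lrf (the linear shape is an assumption)."""
    progress = min(epoch / args["epochs"], 1.0)
    factor = (1.0 - progress) * (1.0 - args["lrf"]) + args["lrf"]
    return args["lr0"] * factor


for epoch in (0, 50, 100):
    print(f"epoch {epoch:3d}: lr = {learning_rate(epoch):.6f}")
```

With the `ultralytics` package installed, a dictionary like this could in principle be unpacked as keyword arguments to a model's `train` method; that call is deliberately left out so the sketch runs without the library.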
Results

Figure 5 shows the metric curves obtained by reproducing the YOLOv8s model on the MS COCO dataset. The training loss decreases consistently and converges to a small value. In object detection tasks, we mainly focus on Precision, Recall, mAP50, and mAP50-95. mAP50 is the average precision across all categories at an Intersection over Union (IoU) threshold of 0.5 for the bounding boxes. mAP50-95 is the average precision across all categories averaged over the IoU thresholds 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, and 0.95. Higher mAP50 and mAP50-95 values indicate stronger detection performance. YOLOv8s achieves 67.3%, 54.8%, 59.9%, and 43.9% in Precision, Recall, mAP50, and mAP50-95, respectively. Compared to other methods with the same number of parameters, such as YOLOv5, YOLOv6, and YOLOv7, YOLOv8 shows significant performance improvements with only 100 epochs of training while maintaining a relatively fast detection speed.

Figure 6 shows the metric results of our method on the MS COCO 2017 dataset. Our method achieves 69.4%, 56.0%, 61.7%, and 45.2% in Precision, Recall, mAP50, and mAP50-95, respectively, which are 2.1, 1.2, 1.8, and 1.3 percentage points higher than the YOLOv8s model. The main reason for this improvement is that our method uses a stronger PVT model in the feature extraction stage, which can learn and represent object features more effectively.

Figure 7 shows the metric curves of the YOLOv8s model on the self-built elderly fall detection dataset.
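The IoU thresholds behind mAP50 and mAP50-95, defined earlier in this section, can be illustrated with a single box pair. The sketch below covers only the matching criterion, not the full average-precision computation; the function names and the two example boxes are hypothetical.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# The ten thresholds used by mAP50-95: 0.5, 0.55, ..., 0.95.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def thresholds_passed(prediction, ground_truth):
    """Thresholds at which the prediction would count as a correct match."""
    overlap = iou(prediction, ground_truth)
    return [t for t in IOU_THRESHOLDS if overlap >= t]


# Hypothetical boxes: a 100x100 prediction shifted by 10 pixels in x and y.
prediction = (10, 10, 110, 110)
ground_truth = (20, 20, 120, 120)
print(f"IoU = {iou(prediction, ground_truth):.3f}")
print("counts as a match at:", thresholds_passed(prediction, ground_truth))
```

A prediction like this one contributes to mAP50 but stops counting as correct at the stricter thresholds, which is why mAP50-95 is always the lower of the two numbers.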
On the validation set, the bounding box losses of the YOLOv8s model shown in Figure 7, CIoU Loss and DFL Loss, converge to 1.33 and 1.206 after 100 epochs, and the classification loss converges to 0.488. YOLOv8s achieves 94.2%, 94.6%, 94.8%, and 64.9% in Precision, Recall, mAP50, and mAP50-95, respectively.

Figure 8 shows the metric curves of our method on the self-built fall detection dataset. After 100 epochs of training, the CIoU Loss, DFL Loss, and classification loss show a stable and rapid downward trend and converge to 0.76, 0.989, and 0.356, respectively. In the validation phase, these three losses decrease to 1.214, 1.138, and 0.429, respectively. Our method achieves 97.1%, 96.9%, 97.3%, and 66.8% in Precision, Recall, mAP50, and mAP50-95, respectively. Compared to the YOLOv8s model, these are improvements of 2.9, 2.3, 2.5, and 1.9 percentage points, respectively.

Figure 9 shows the detection results of our method on the self-built elderly fall detection test set. Our method accurately predicts fall targets in different scenarios with very high detection confidence scores.

Discussion

The Pyramid Vision Transformer (PVT), a deep learning model that combines the Transformer architecture with a pyramid structure, excels in feature extraction and multi-scale processing. This paper explores strategies for incorporating the PVT architecture into the YOLOv8 algorithm and analyzes its potential advantages and challenges [20][21].

The original backbone network of YOLOv8 typically employs Convolutional Neural Networks (CNNs), such as CSPDarknet [22]. To introduce the multi-scale feature extraction capabilities of PVT, we can replace the backbone network of YOLOv8 with the PVT architecture.
Through its pyramid structure, PVT uses feature maps of different scales at multiple stages and thereby captures feature information for targets of different sizes in the image. This design gives PVT a significant advantage in handling multi-scale targets.

After adopting PVT as the backbone network, we need to optimize the feature fusion network to ensure effective transmission and integration of multi-scale features. YOLOv8 commonly uses structures such as the Feature Pyramid Network (FPN) [6] or the Path Aggregation Network (PAN) [23] for feature fusion. We can adjust the parameters and configurations of these structures based on the characteristics of the feature maps output by PVT to achieve more efficient feature fusion [24].

The design of YOLOv8's detection head is usually closely related to the backbone network [3]. After introducing PVT as the backbone network, we need to adapt the detection head accordingly [25]. This includes adjusting the number of input channels of the detection head, the anchor box settings, and the loss functions so that the detection head can fully use the multi-scale features extracted by PVT and achieve more accurate object detection.

The pyramid structure of PVT enables it to capture feature information at different scales in images, which is crucial for multi-scale object detection tasks. Incorporating PVT into YOLOv8 can significantly enhance the algorithm's ability to detect multi-scale objects.

The Transformer architecture excels in feature extraction, but its computational complexity is usually high. PVT reduces computational cost through designs such as its progressive feature pyramid and spatial-reduction attention (SRA) layers [26][27][28]. Using PVT as the backbone network of YOLOv8 can maintain high feature extraction efficiency while reducing the model's computational complexity [29].
Both PVT and YOLOv8 are deep learning-based models that can learn rich feature representations during training. Combining the two can leverage their respective strengths to form stronger feature extraction and detection capabilities. This combination helps improve the model's generalization ability, allowing it to perform well in different scenarios [30].

Experimental results on benchmark datasets and self-built datasets verify the effectiveness of incorporating the PVT architecture. Compared to the original YOLOv8, the improved algorithm achieves significant improvements in mean average precision (mAP) [31], especially in the detection of small objects and in complex backgrounds [32]. This result indicates that introducing the PVT architecture enhances the feature extraction and expression capabilities of YOLOv8 [33].

Although the PVT structure is relatively complex, we have maintained the real-time performance of the improved algorithm [34] by optimizing model parameters and inference strategies. This characteristic is crucial for real-time object detection systems in practical applications.

Despite its impressive performance in multiple aspects, YOLOv8 with the PVT architecture also has some limitations [35]. First, because of the complexity of the PVT structure, model training requires more computational resources and time. Second, there is still room for improvement in detecting extremely small objects and objects under heavy occlusion [36]. Future research can further explore how to optimize the PVT structure to enhance performance in these aspects [37].

To address the current limitations of the PVT structure, future work can explore more efficient Transformer variants or optimization strategies to further reduce computational complexity [38] and improve feature extraction capabilities [39].
With the rise of multimodal learning, future work can attempt to fuse PVT with other types of sensor data (such as LiDAR and radar) [40] to further enhance the robustness and accuracy [41] of object detection systems. For resource-constrained application scenarios such as mobile and embedded devices, future research can investigate how to design a lightweight version of YOLOv8 with PVT that meets the real-time and accuracy requirements of these scenarios [42].

Conclusion

The proposed improved YOLOv8 algorithm fused with PVT effectively enhances the accuracy and robustness of object detection, especially when dealing with small and dense objects. This improvement provides a new research direction in the field of real-time object detection and offers powerful technical support for complex scene detection in practical applications.

Declarations

Data availability

The datasets used or analysed during the current study are available from the corresponding author on reasonable request.

Authors' contributions

ZD and SY contributed equally to this work. ZD and SY contributed to the conceptualization, methodology, software, data curation, writing (original draft preparation), and writing (reviewing and editing). SY contributed to the writing (reviewing and editing), supervision, project administration, and funding acquisition. YX contributed to the investigation, data curation, and resources.

Funding

This research was supported by the doctoral research launch project of Guiyang Healthcare Vocational University.

Competing interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Ethics declarations

Ethics approval and consent to participate

All procedures performed in the study involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards. All individuals and/or their parents provided informed consent to participate in this study, and approval was provided by the Research Ethics Committee of Guiyang Healthcare Vocational University.

Statement

All subjects and/or their legal guardian(s) agreed to the publication of identifying information/images in an online open-access publication.

References

1. Redmon, J., Divvala, S., Girshick, R. & Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 779–788). (2016).
2. Redmon, J. & Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 7263–7271). (2017).
3. Bochkovskiy, A., Wang, C. Y. & Liao, H. Y. M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv preprint arXiv:2004.10934. (2020).
4. Jocher, G., Chaurasia, A., Qiu, J. & Stoken, A. YOLOv5. GitHub repository. (2020).
5. Wang, W. et al. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. In International Conference on Computer Vision (ICCV). (2021).
6. Lin, T. Y. et al. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 2117–2125). (2017).
7. Dosovitskiy, A. et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations (ICLR). (2021).
8. Liu, Z. et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 10012–10022). (2021).
9. Carion, N. et al. End-to-End Object Detection with Transformers. In European Conference on Computer Vision (ECCV) (pp. 213–229). (2020).
10. Ge, Z., Liu, S., Wang, F., Li, Z. & Sun, J. YOLOX: Exceeding YOLO Series in 2021. arXiv preprint arXiv:2107.08430. (2021).
11. Huang, Z., Wang, X. & Li, L. J. CrossViT: Cross-Attention Vision Transformer for Image Classification. arXiv preprint arXiv:2007.00666. (2020).
12. He, K., Zhang, X., Ren, S. & Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 770–778). (2016).
13. Chu, X., Wu, Y. & Liu, X. TokenLearner: What Can 8.4 Billion Tokens Do for Visual Recognition? In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (pp. 5117–5127). (2021).
14. Lin, J., Li, J., Wang, Z., Xu, M. & Zhang, Z. Simplified Self-Attention Mechanisms in Vision Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 5698–5708). (2022).
15. Vaswani, A. et al. Attention is All You Need. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS) (pp. 5998–6008). (2017).
16. Cheng, B. & Liu, X. Adaptive Attention: A New Mechanism for Transformer Models. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, No. 05, pp. 8620–8627). (2020).
17. Liu, S., Qi, L., Qin, H., Shi, J. & Jia, J. Switchable Atrous Convolution for Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 6878–6887). (2020).
18. Zhang, Z., Li, M. & Qi, X. Replacing Convolutional Neural Networks with Transformer Networks for Image Classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 1450–1459). (2020).
19. Chen, J., Yu, K., Xie, L. & Zhang, X. Efficient and Robust Object Detection with Attention Mechanisms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 1245–1255). (2021).
20. Redmon, J. & Farhadi, A. YOLOv3: An Incremental Improvement. arXiv preprint arXiv:1804.02767. (2018).
21. Zhu, X., Lu, L., Li, B., Dai, J. & Wang, X. Deformable DETR: Deformable Transformers for End-to-End Object Detection. In International Conference on Learning Representations (ICLR). (2021).
22. Wang, C. & Xu, Z. CSPDarknet: A New Backbone Network for Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 5294–5303). (2020).
23. Liu, S., Qi, L., Qin, H., Shi, J. & Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 8759–8768). (2020).
24. Zhou, X., Wang, D. & Zhu, J. Objects as Points. arXiv preprint arXiv:2006.05987. (2020).
25. Zhang, Y., Li, M. & Qi, X. A Survey on Backbone Networks for Object Detection. J. Comput. Vis. Res. 36 (2), 109–125 (2021).
26. Xie, E. et al. Multiscale Vision Transformers. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 2860–2869). (2021).
27. Cao, Y. & Yang, J. A Survey on Vision Transformers. arXiv preprint arXiv:2108.10654. (2021).
28. Chen, L. & Wu, J. Efficient Attention Mechanism in Transformers for Vision Tasks. IEEE Trans. Neural Networks Learn. Syst. 32 (5), 1804–1816 (2021).
29. Li, X., Xie, E., Wang, C., Zhang, Z. & Fan, D. Vision Transformers: A Survey of Methods and Applications. arXiv preprint arXiv:2205.12476. (2022).
30. Zhang, H., Li, H. & Lin, H. Enhancing YOLO with Transformers for Improved Object Detection Performance. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 1479–1488). (2022).
31. Wu, T. & Dong, Y. YOLO-SE: Improved YOLOv8 for remote sensing object detection and recognition. Appl. Sci. 13 (24), 12977 (2023).
32. Liu, Y., Sun, P., Wergeles, N. & Shang, Y. A survey and performance evaluation of deep learning methods for small object detection. Expert Syst. Appl. 172, 114602 (2021).
33. Ma, N., Su, Y., Yang, L., Li, Z. & Yan, H. Wheat Seed Detection and Counting Method Based on Improved YOLOv8 Model. Sensors 24 (5), 1654 (2024).
34. Yao, J. et al. A real-time detection algorithm for Kiwifruit defects based on YOLOv5. Electronics 10 (14), 1711 (2021).
35. Swathi, Y. & Challa, M. YOLOv8: Advancements and Innovations in Object Detection. In International Conference on Smart Computing and Communication (pp. 1–13). Singapore: Springer Nature Singapore. (January 2024).
36. Lin, Y., Zhang, J. & Huang, J. Centralised visual processing center for remote sensing target detection. Sci. Rep. 14 (1), 17021 (2024).
37. Maghrabie, H. M. et al. Building-integrated photovoltaic/thermal (BIPVT) systems: Applications and challenges. Sustain. Energy Technol. Assess. 45, 101151 (2021).
38. So, D. et al. Searching for efficient transformers for language modeling. Adv. Neural. Inf. Process. Syst. 34, 6010–6022 (2021).
39. Zebari, R., Abdulazeez, A., Zeebaree, D., Zebari, D. & Saeed, J. A comprehensive review of dimensionality reduction techniques for feature selection and feature extraction. J. Appl. Sci. Technol. Trends 1 (1), 56–70 (2020).
40. Hasan, M. et al. LiDAR-based detection, tracking, and property estimation: A contemporary review. Neurocomputing 506, 393–405 (2022).
41. Zhang, Y., Hou, J. & Yuan, Y. A comprehensive study of the robustness for lidar-based 3d object detectors against adversarial attacks. Int. J. Comput. Vision 132 (5), 1592–1624 (2024).
42. Zamri, F. N. M. et al. Enhanced Small Drone Detection using Optimized YOLOv8 with Attention Mechanisms. IEEE Access. (2024).

Additional Declarations

No competing interests reported.
Author affiliations

Zhiqiang Dong, Shu Yang (corresponding author), and Yang Xiao are with Guiyang Healthcare Vocational University.

Figure captions

Figure 1. Overall structure diagram of the PVT model.
Figure 2. Structural comparison diagram of the MHA and SRA modules.
Figure 3. Structural diagram of the improved YOLOv8s model incorporating the Pyramid Vision Transformer.
Figure 4. Partial annotated samples from the self-built fall detection dataset.
Figure 5. Metric results of YOLOv8s on the MS COCO 2017 dataset.
Figure 6. Metric results of the proposed algorithm on the MS COCO 2017 dataset.
Figure 7. Metric results of YOLOv8s on the self-built fall dataset.
Figure 8. Metric results of the proposed algorithm on the self-built fall dataset.
Figure 9. Detection results of the proposed algorithm on the fall-target test set.
architecture","fulltext":[{"header":"Introduction","content":"\u003cp\u003eWith the rapid development of deep learning technology, the field of computer vision has undergone unprecedented changes. Object detection is one of the core tasks of computer vision, and its performance directly affects the level of intelligence achievable in many application scenarios. The YOLO (You Only Look Once) series of algorithms, known for their efficiency and accuracy\u003csup\u003e[1]\u003c/sup\u003e, have received widespread attention and application since their inception. However, as application requirements become increasingly complex, enhancing the accuracy and efficiency of YOLO\u003csup\u003e[2]\u003c/sup\u003e algorithms has become a research hotspot. YOLOv8, a recent member of the YOLO series, inherits many advantages of YOLOv5\u003csup\u003e[3]\u003c/sup\u003e\u003csup\u003e[4]\u003c/sup\u003e and achieves significant performance improvements. Nevertheless, when faced with more complex and diverse scenarios and multi-scale object detection tasks, existing Convolutional Neural Network (CNN) structures still have certain limitations. To overcome this challenge, this paper proposes an improved YOLOv8 algorithm that incorporates the Pyramid Vision Transformer (PVT) architecture\u003csup\u003e[5]\u003c/sup\u003e (Figure 1).\u003c/p\u003e\n\u003cp\u003ePyramid Vision Transformer is a deep learning model that combines the Transformer architecture with a pyramid structure\u003csup\u003e[6]\u003c/sup\u003e. Unlike the original Vision Transformer (ViT)\u003csup\u003e[7]\u003c/sup\u003e\u003csup\u003e[8]\u003c/sup\u003e\u003csup\u003e[9]\u003c/sup\u003e, which keeps a single feature map scale, PVT produces feature maps of different scales across multiple stages, thereby forming a pyramid structure. This design enables PVT to capture feature information at different scales, enhancing the model\u0026apos;s ability to process objects of varying sizes within images.
By introducing PVT into the backbone network of YOLOv8, this paper aims to leverage its robust feature extraction capabilities and multi-scale feature processing abilities to further improve the object detection accuracy and efficiency of YOLOv8\u003csup\u003e[10]\u003c/sup\u003e.\u003c/p\u003e\n\u003cp\u003eCurrently, both single-stage and two-stage object detection algorithms based on deep learning largely adopt feature pyramid structures to enhance intermediate feature map information. In Vision Transformer (ViT) structure-based methods\u003csup\u003e[11]\u003c/sup\u003e, the input image is first converted into a series of patches, which are then fed into the Transformer Encoder to extract object features and obtain feature maps. However, since the feature map size remains the same\u003csup\u003e[12]\u003c/sup\u003e at each stage, this approach is difficult to apply to downstream vision tasks that rely on multi-scale features. To bridge the gap between ViT and feature pyramid techniques, Wang et al.\u003csup\u003e[5]\u003c/sup\u003e proposed PVT, which can be trained on high-resolution images without significantly increasing the model\u0026apos;s computational complexity. PVT employs a progressive shrinking pyramid, in which the feature map size decreases from stage to stage, to produce multi-scale output feature maps. Additionally, it introduces Spatial Reduction Attention (SRA)\u003csup\u003e[13]\u003c/sup\u003e\u003csup\u003e[14]\u003c/sup\u003e to reduce resource consumption and time complexity during attention computations. Compared to CNNs and ViT, PVT not only inherits the global receptive field of ViT but also incorporates the pyramid structure of CNNs, which makes it easier to obtain multi-scale feature maps and to transfer the model to dense prediction tasks such as object detection and instance segmentation.\u003c/p\u003e\n\u003cp\u003eFigure 1 illustrates the overall architecture of PVT, which comprises four stage modules (Stage1, Stage2, Stage3, and Stage4).
Each stage module contains a Patch Embedding and n Transformer Encoder Layers\u003csup\u003e[5]\u003c/sup\u003e, and the four stages output feature maps downsampled by factors of 4, 8, 16, and 32, respectively\u003csup\u003e[6]\u003c/sup\u003e. In the first stage module, an input image of size H\u0026times;W\u0026times;3 is first divided into non-overlapping patches of size 4\u0026times;4\u0026times;3, giving H/4\u0026times;W/4 = H\u0026times;W/16 patches in total. The flattened patches are linearly projected into embedded vectors, which, along with positional embeddings, are then fed into the Transformer Encoder Layer, and the output is reshaped to obtain an H/4\u0026times;W/4\u0026times;C1 feature map f1. The subsequent three stages follow a similar process, each taking the previous feature map as input and yielding an H/8\u0026times;W/8\u0026times;C2 feature map f2, an H/16\u0026times;W/16\u0026times;C3 feature map f3, and an H/32\u0026times;W/32\u0026times;C4 feature map f4, respectively.\u003c/p\u003e\n\u003cp\u003eTo reduce the computational load of the model, PVT introduces SRA to replace the Multi-Head Attention (MHA) module in Transformers\u003csup\u003e[5]\u003c/sup\u003e. Similar to MHA, SRA takes a query, key, and value as inputs and outputs a refined feature vector\u003csup\u003e[15]\u003c/sup\u003e. Unlike MHA, SRA reduces the spatial dimensions (i.e., width and height) of the key and value before performing self-attention calculations, significantly decreasing the attention computation burden and memory usage\u003csup\u003e[14]\u003c/sup\u003e\u003csup\u003e[16]\u003c/sup\u003e. The structures of the MHA and SRA modules are depicted in Figure 2.\u003c/p\u003e\n\u003cp\u003ePVT encompasses four variant models: PVT-Tiny, PVT-Small, PVT-Medium, and PVT-Large.
Considering computational cost and complexity, this paper adopts the YOLOv8s model as the baseline and utilizes PVT-Small as the feature extraction network, replacing the stacked convolutional block structure in YOLOv8s\u003csup\u003e[8]\u003c/sup\u003e\u003csup\u003e[17]\u003c/sup\u003e.\u003c/p\u003e\n\u003cp\u003eAddressing the issue of poor object detection accuracy in complex backgrounds for the YOLOv8s model\u003csup\u003e[18]\u003c/sup\u003e\u003csup\u003e[19]\u003c/sup\u003e, this paper aims to enhance the feature extraction capability of the YOLOv8s model\u0026apos;s feature extraction network. We propose substituting the stacked convolutional block structure in YOLOv8s with PVT. This approach leverages PVT\u0026apos;s pre-trained weights to better initialize model parameters and benefits from the Transformer model\u0026apos;s robust feature modeling capabilities, enabling multi-dimensional feature extraction and fusion, thereby strengthening the model\u0026apos;s image feature extraction ability. The overall structure of the proposed model is shown in Figure 3 below. We replace the original convolutional block structure with PVT in the Backbone stage, while maintaining consistency with YOLOv8s in the Head and Prediction stages.\u003c/p\u003e"},{"header":"Experimental dataset","content":"\u003cp\u003eMS COCO (Microsoft Common Objects in Context) is a large-scale, publicly available benchmark dataset for computer vision tasks, including object detection, instance segmentation, image captioning, and more. It was released by Microsoft Research to advance research and development in computer vision tasks. MS COCO has three versions released in 2014, 2015, and 2017, with the 2017 version being a significant update that further expanded its scale and annotation quality. 
The MS COCO 2017 dataset consists of hundreds of thousands of high-resolution images and covers 80 common object categories, including people, animals, furniture, electronic devices, vehicles, and more. The dataset is divided into training, validation, and test sets, with 118,287 images in the training set, 5,000 images in the validation set, and 5,000 images in the test set, used for model training, performance evaluation, and image testing, respectively.\u003c/p\u003e\n\u003cp\u003eThe self-built dataset for fall detection of elderly individuals in this paper, which includes images captured in the Smart Elderly Care Laboratory of Guiyang Healthcare Vocational University and other locations, totals 6490 images. Specifically, the training set contains 4401 images, the validation set contains 1089 images, and the test set contains 1000 images. We used the LabelImg tool for annotation, and some annotated images from the training set are shown in Figure 4.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eExperimental equipment and environment\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis paper uses Ubuntu 20.04 as the experimental platform, with an Intel Xeon Gold 6330 CPU, an NVIDIA GeForce RTX 3090 GPU, and a total system memory of 64 GB. In terms of software configuration, the GPU driver version is 525.60.11, and the versions of CUDA (Compute Unified Device Architecture) and cuDNN (NVIDIA CUDA\u0026reg; Deep Neural Network library) are 11.8 and 8.7.0, respectively. The Python interpreter version is 3.8, and the version of the deep learning framework PyTorch is 2.0.1. This paper uses version 8.1.18 of the Ultralytics library to reproduce and improve the YOLOv8s model.
Additionally, this paper uses several other libraries: OpenCV 4.7.0.72, NumPy 1.21.6, Pillow 9.5.0, Matplotlib 3.8.2, and tqdm 4.65.0.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eExperimental hyperparameter setting\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe experimental hyperparameters were configured as follows. The batch size for all experiments was set to 64, and all models were trained for 100 epochs. The optimizer was Stochastic Gradient Descent (SGD), with an initial learning rate of 0.01 that decayed gradually to a final learning rate of 0.0001. The SGD momentum was set to 0.937, and the weight decay of the optimizer was set to 0.0005. For data augmentation during training, the adjustment ranges for hue, saturation, and brightness were set to 0.015, 0.7, and 0.4, respectively. The probability of left-right flipping and the gain for scale variation were both set to 0.5 to increase the diversity of the training data. We also employed mosaic augmentation. For the loss function, the Complete Intersection over Union (CIoU) Loss was used for the bounding box loss, with a weight coefficient of 7.5. The weight coefficient for the Distribution Focal Loss (DFL Loss) was set to 1.5, and the weight coefficient for the classification loss was set to 0.5.\u003c/p\u003e"},{"header":"Results","content":"\u003cp\u003eFigure \u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e shows the metric curves obtained by reproducing the YOLOv8s model on the MS COCO dataset. The figure shows that the training loss decreases consistently and converges to a small value.
In object detection tasks, we mainly focus on the metrics of Precision, Recall, mAP50, and mAP50-95. The mAP50 metric is the mean, over all categories, of the average precision computed at an Intersection over Union (IoU) threshold of 0.5 for the object bounding boxes. The mAP50-95 metric is the same quantity averaged over the ten IoU thresholds from 0.5 to 0.95 in steps of 0.05. Higher mAP50 and mAP50-95 values indicate stronger detection performance. YOLOv8s achieves 67.3%, 54.8%, 59.9%, and 43.9% in Precision, Recall, mAP50, and mAP50-95, respectively. Compared to YOLOv5, YOLOv6, and YOLOv7 models with the same number of parameters, YOLOv8 demonstrates significant performance improvements with only 100 epochs of training, while maintaining a relatively fast detection speed. Figure\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003e shows the metric results of our method on the MS COCO 2017 dataset. Our method achieves 69.4%, 56.0%, 61.7%, and 45.2% in Precision, Recall, mAP50, and mAP50-95, respectively, which are 2.1, 1.2, 1.8, and 1.3 percentage points higher than the YOLOv8s model. We attribute this improvement mainly to the stronger PVT model used in the feature extraction stage, which can more effectively learn and represent object features.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eFigure \u003cspan refid=\"Fig7\" class=\"InternalRef\"\u003e7\u003c/span\u003e shows the metric curves of the YOLOv8s model on the self-built elderly fall detection dataset in this paper. On the validation set, the bounding box losses of the YOLOv8s model, CIoU Loss and DFL Loss, converge to 1.33 and 1.206 after 100 epochs, and the classification loss converges to 0.488.
In terms of detection metrics, YOLOv8s achieves 94.2%, 94.6%, 94.8%, and 64.9% in Precision, Recall, mAP50, and mAP50-95, respectively. Figure\u0026nbsp;\u003cspan refid=\"Fig8\" class=\"InternalRef\"\u003e8\u003c/span\u003e shows the metric curves of our method on the self-built fall detection dataset. After 100 epochs of training, the CIoU Loss, DFL Loss, and classification loss on the training set show a stable and rapid downward trend, ultimately converging to 0.76, 0.989, and 0.356, respectively. On the validation set, these three losses decrease to 1.214, 1.138, and 0.429, respectively. Our method achieves 97.1%, 96.9%, 97.3%, and 66.8% in Precision, Recall, mAP50, and mAP50-95, respectively. Compared to the YOLOv8s model, these are improvements of 2.9, 2.3, 2.5, and 1.9 percentage points, respectively.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eFigure \u003cspan refid=\"Fig9\" class=\"InternalRef\"\u003e9\u003c/span\u003e shows the detection results of our method on the self-built elderly fall detection test set. The figure shows that our method accurately detects fall targets in different scenarios with high confidence scores.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e"},{"header":"Discussion","content":"\u003cp\u003eThe Pyramid Vision Transformer (PVT), a deep learning model that combines the Transformer architecture with a pyramid structure, excels in feature extraction and multi-scale processing. This paper explores strategies for incorporating the PVT architecture into the YOLOv8 algorithm and analyzes its potential advantages and challenges\u003csup\u003e[\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e][\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e]\u003c/sup\u003e.
The original backbone network of YOLOv8 typically employs Convolutional Neural Networks (CNNs), such as CSPDarknet\u003csup\u003e[\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e]\u003c/sup\u003e. To introduce the multi-scale feature extraction capabilities of PVT, we can replace the backbone network of YOLOv8 with the PVT architecture. Through its pyramid structure design, PVT can utilize feature maps of different scales at multiple stages, thereby capturing feature information of targets of different sizes in the image. This design gives PVT a significant advantage in handling multi-scale targets. After adopting PVT as the backbone network, we need to optimize the feature fusion network to ensure effective transmission and integration of multi-scale features. YOLOv8 commonly uses structures such as Feature Pyramid Network (FPN)\u003csup\u003e[\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e]\u003c/sup\u003e or Path Aggregation Network (PAN) \u003csup\u003e[\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e]\u003c/sup\u003efor feature fusion. We can adjust the parameters and configurations of these structures based on the characteristics of the feature maps output by PVT to achieve more efficient feature fusion\u003csup\u003e[\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e]\u003c/sup\u003e. The design of YOLOv8's detection head is usually closely related to the backbone network\u003csup\u003e[\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e]\u003c/sup\u003e. After introducing PVT as the backbone network, we need to adapt the detection head accordingly\u003csup\u003e[\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e]\u003c/sup\u003e. 
This includes adjusting the input channel numbers of the detection head and, where needed, the loss functions (the YOLOv8 detection head is anchor-free, so no anchor box settings are involved) to ensure that the detection head can fully utilize the multi-scale features extracted by PVT and achieve more accurate object detection.\u003c/p\u003e \u003cp\u003eThe pyramid structure design of PVT enables it to capture feature information at different scales in images, which is crucial for handling multi-scale object detection tasks. Incorporating PVT into YOLOv8 can therefore enhance the algorithm's ability to detect multi-scale objects. The Transformer architecture excels in feature extraction, but its computational complexity is usually high. PVT effectively reduces computational costs through designs such as progressive feature pyramids and spatial-reduction attention (SRA) layers\u003csup\u003e[\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e][\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e][\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e]\u003c/sup\u003e. Using PVT as the backbone network of YOLOv8 can maintain high feature extraction efficiency while keeping the computational cost of self-attention manageable\u003csup\u003e[\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e]\u003c/sup\u003e. Both PVT and YOLOv8 are deep learning-based models that can learn rich feature representations during training. Combining the two can leverage their respective strengths to form stronger feature extraction and detection capabilities. This combination helps improve the model's generalization ability, allowing it to perform well in different scenarios\u003csup\u003e[\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e]\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eExperimental results on the benchmark dataset and the self-built dataset support the effectiveness of incorporating the PVT architecture.
Compared to the original YOLOv8, the improved algorithm achieves significant improvements in mean average precision (mAP)\u003csup\u003e[\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e],\u003c/sup\u003e especially in the detection performance of small objects and complex backgrounds\u003csup\u003e[\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e]\u003c/sup\u003e. This result indicates that the introduction of the PVT architecture indeed enhances the feature extraction and expression capabilities of YOLOv8\u003csup\u003e[\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e33\u003c/span\u003e]\u003c/sup\u003e. Although the PVT structure is relatively complex, we have successfully maintained the real-time performance of the improved algorithm\u003csup\u003e[\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e]\u003c/sup\u003e by optimizing model parameters and inference strategies. This characteristic is crucial for real-time object detection systems in practical applications.\u003c/p\u003e \u003cp\u003eDespite its impressive performance in multiple aspects, YOLOv8 with the PVT architecture also has some limitations\u003csup\u003e[\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e]\u003c/sup\u003e. Firstly, due to the complexity of the PVT structure, model training requires more computational resources and time. Secondly, there is still room for improvement in detecting extremely small objects and objects under heavy occlusion\u003csup\u003e[\u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e36\u003c/span\u003e]\u003c/sup\u003e. Future research can further explore how to optimize the PVT structure to enhance performance in these aspects\u003csup\u003e[\u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e37\u003c/span\u003e]\u003c/sup\u003e. 
To address the current limitations of the PVT structure, future work can explore more efficient Transformer variants or optimization strategies to further reduce computational complexity\u003csup\u003e[\u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e38\u003c/span\u003e]\u003c/sup\u003e and improve feature extraction capabilities\u003csup\u003e[\u003cspan citationid=\"CR39\" class=\"CitationRef\"\u003e39\u003c/span\u003e]\u003c/sup\u003e. With the rise of multimodal learning, future attempts can be made to fuse PVT with other types of sensor data (such as LiDAR, radar, etc.)\u003csup\u003e[\u003cspan citationid=\"CR40\" class=\"CitationRef\"\u003e40\u003c/span\u003e]\u003c/sup\u003e to further enhance the robustness and accuracy\u003csup\u003e[\u003cspan citationid=\"CR41\" class=\"CitationRef\"\u003e41\u003c/span\u003e]\u003c/sup\u003e of object detection systems. For resource-constrained application scenarios such as mobile and embedded devices, future research can investigate how to design a lightweight version of YOLOv8 with PVT to meet the real-time and accuracy requirements of these scenarios\u003csup\u003e[\u003cspan citationid=\"CR42\" class=\"CitationRef\"\u003e42\u003c/span\u003e]\u003c/sup\u003e.\u003c/p\u003e"},{"header":"Conclusion","content":"\u003cp\u003eThe proposed improved YOLOv8 algorithm fused with PVT effectively enhances the accuracy and robustness of object detection, especially when dealing with small and dense objects. 
This improvement provides a new research direction in the field of real-time object detection and offers powerful technical support for complex scene detection in practical applications.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eData availability\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe datasets used or analysed during the current study are available from the corresponding author on reasonable request.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthors\u0026rsquo; contributions\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eZD and SY contributed equally to this work. ZD and SY contributed to the\u0026nbsp;conceptualization, methodology, software, data curation, writing - original draft preparation, and writing - reviewing and editing. SY\u0026nbsp;contributed to the writing - reviewing and editing, supervision, project administration, and funding acquisition. YX contributed to the\u0026nbsp;investigation, data curation, and resources.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFunding\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis research was supported by the doctoral research launch project of Guiyang Healthcare Vocational University.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCompeting interests\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eEthics declarations\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eEthics approval and consent to participate\u003c/p\u003e\n\u003cp\u003eAll procedures performed in the study involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards.
All individuals and/or their parents provided informed consent to participate in this study, and approval was provided by the Research Ethics Committee of\u0026nbsp;Guiyang Healthcare Vocational University.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eStatement\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eAll subjects and/or their legal guardian(s) agree to the publication of identifying information/images in an online open-access publication.\u0026nbsp;\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eRedmon, J., Divvala, S., Girshick, R. \u0026amp; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 779\u0026ndash;788). (2016).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRedmon, J. \u0026amp; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 7263\u0026ndash;7271). (2017).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBochkovskiy, A., Wang, C. Y. \u0026amp; Liao, H. Y. M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv preprint arXiv:2004.10934. (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eJocher, G., Chaurasia, A., Qiu, J. \u0026amp; Stoken, A. YOLOv5. GitHub repository. (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWang, W. et al. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. In International Conference on Computer Vision (ICCV). (2021).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLin, T. Y. et al. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 2117\u0026ndash;2125). (2017).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDosovitskiy, A. et al.
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations (ICLR). (2021).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLiu, Z. et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 10012\u0026ndash;10022). (2021).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCarion, N. et al. End-to-End Object Detection with Transformers. In European Conference on Computer Vision (ECCV) (pp. 213\u0026ndash;229). (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGe, Z., Liu, S., Wang, F., Li, Z. \u0026amp; Sun, J. YOLOX: Exceeding YOLO Series in 2021. (2021). arXiv preprint arXiv:2107.08430.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHuang, Z., Wang, X. \u0026amp; Li, L. J. CrossViT: Cross-Attention Vision Transformer for Image Classification. arXiv preprint arXiv:2007.00666. (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHe, K., Zhang, X., Ren, S. \u0026amp; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 770\u0026ndash;778). (2016).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChu, X., Wu, Y. \u0026amp; Liu, X. TokenLearner: What Can 8.4 Billion Tokens Do for Visual Recognition? In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (pp. 5117\u0026ndash;5127). (2021).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLin, J., Li, J., Wang, Z., Xu, M. \u0026amp; Zhang, Z. Simplified Self-Attention Mechanisms in Vision Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 5698\u0026ndash;5708). (2022).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eVaswani, A. et al. 
Attention is All You Need. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS) (pp. 5998\u0026ndash;6008). (2017).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCheng, B. \u0026amp; Liu, X. Adaptive Attention: A New Mechanism for Transformer Models. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, No. 05, pp. 8620\u0026ndash;8627). (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLiu, S., Qi, L., Qin, H., Shi, J. \u0026amp; Jia, J. Switchable Atrous Convolution for Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 6878\u0026ndash;6887). (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhang, Z., Li, M. \u0026amp; Qi, X. Replacing Convolutional Neural Networks with Transformer Networks for Image Classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 1450\u0026ndash;1459). (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChen, J., Yu, K., Xie, L. \u0026amp; Zhang, X. Efficient and Robust Object Detection with Attention Mechanisms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 1245\u0026ndash;1255). (2021).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRedmon, J. \u0026amp; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv preprint arXiv:1804.02767. (2018).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhu, X., Lu, L., Li, B., Dai, J. \u0026amp; Wang, X. Deformable DETR: Deformable Transformers for End-to-End Object Detection. In International Conference on Learning Representations (ICLR). (2021).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWang, C. \u0026amp; Xu, Z. CSPDarknet: A New Backbone Network for Object Detection. 
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 5294\u0026ndash;5303). (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLiu, S., Qi, L., Qin, H., Shi, J. \u0026amp; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 8759\u0026ndash;8768). (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhou, X., Wang, D. \u0026amp; Zhu, J. Objects as Points. arXiv preprint arXiv:2006.05987. (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhang, Y., Li, M. \u0026amp; Qi, X. A Survey on Backbone Networks for Object Detection. \u003cem\u003eJ. Comput. Vis. Res.\u003c/em\u003e \u003cb\u003e36\u003c/b\u003e (2), 109\u0026ndash;125 (2021).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eXie, E. et al. Multiscale Vision Transformers. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 2860\u0026ndash;2869). (2021).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCao, Y. \u0026amp; Yang, J. A Survey on Vision Transformers. arXiv preprint arXiv:2108.10654. (2021).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChen, L. \u0026amp; Wu, J. Efficient Attention Mechanism in Transformers for Vision Tasks. \u003cem\u003eIEEE Trans. Neural Networks Learn. Syst.\u003c/em\u003e \u003cb\u003e32\u003c/b\u003e (5), 1804\u0026ndash;1816 (2021).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLi, X., Xie, E., Wang, C., Zhang, Z. \u0026amp; Fan, D. Vision Transformers: A Survey of Methods and Applications. arXiv preprint arXiv:2205.12476. (2022).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhang, H., Li, H. \u0026amp; Lin, H. Enhancing YOLO with Transformers for Improved Object Detection Performance. 
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 1479\u0026ndash;1488). (2022).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWu, T. \u0026amp; Dong, Y. YOLO-SE: Improved YOLOv8 for remote sensing object detection and recognition. \u003cem\u003eAppl. Sci.\u003c/em\u003e \u003cb\u003e13\u003c/b\u003e (24), 12977 (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLiu, Y., Sun, P., Wergeles, N. \u0026amp; Shang, Y. A survey and performance evaluation of deep learning methods for small object detection. \u003cem\u003eExpert Syst. Appl.\u003c/em\u003e \u003cb\u003e172\u003c/b\u003e, 114602 (2021).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMa, N., Su, Y., Yang, L., Li, Z. \u0026amp; Yan, H. Wheat Seed Detection and Counting Method Based on Improved YOLOv8 Model. \u003cem\u003eSensors\u003c/em\u003e. \u003cb\u003e24\u003c/b\u003e (5), 1654 (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eYao, J. et al. A real-time detection algorithm for Kiwifruit defects based on YOLOv5. \u003cem\u003eElectronics\u003c/em\u003e. \u003cb\u003e10\u003c/b\u003e (14), 1711 (2021).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSwathi, Y. \u0026amp; Challa, M. YOLOv8: Advancements and Innovations in Object Detection. In International Conference on Smart Computing and Communication (pp. 1\u0026ndash;13). Singapore: Springer Nature Singapore. (2024), January.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLin, Y., Zhang, J. \u0026amp; Huang, J. Centralised visual processing center for remote sensing target detection. \u003cem\u003eSci. Rep.\u003c/em\u003e \u003cb\u003e14\u003c/b\u003e (1), 17021 (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMaghrabie, H. M. et al. Building-integrated photovoltaic/thermal (BIPVT) systems: Applications and challenges. \u003cem\u003eSustain. Energy Technol. 
Assess.\u003c/em\u003e \u003cb\u003e45\u003c/b\u003e, 101151 (2021).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSo, D. et al. Searching for efficient transformers for language modeling. \u003cem\u003eAdv. Neural. Inf. Process. Syst.\u003c/em\u003e \u003cb\u003e34\u003c/b\u003e, 6010\u0026ndash;6022 (2021).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZebari, R., Abdulazeez, A., Zeebaree, D., Zebari, D. \u0026amp; Saeed, J. A comprehensive review of dimensionality reduction techniques for feature selection and feature extraction. \u003cem\u003eJ. Appl. Sci. Technol. Trends\u003c/em\u003e. \u003cb\u003e1\u003c/b\u003e (1), 56\u0026ndash;70 (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHasan, M. et al. LiDAR-based detection, tracking, and property estimation: A contemporary review. \u003cem\u003eNeurocomputing\u003c/em\u003e. \u003cb\u003e506\u003c/b\u003e, 393\u0026ndash;405 (2022).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhang, Y., Hou, J. \u0026amp; Yuan, Y. A comprehensive study of the robustness for lidar-based 3d object detectors against adversarial attacks. \u003cem\u003eInt. J. Comput. Vision\u003c/em\u003e. \u003cb\u003e132\u003c/b\u003e (5), 1592\u0026ndash;1624 (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZamri, F. N. M. et al. (2024). Enhanced Small Drone Detection using Optimized YOLOv8 with Attention Mechanisms. 
IEEE Access.\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"","lastPublishedDoi":"10.21203/rs.3.rs-4987159/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-4987159/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"Addressing the issue of poor target detection accuracy in complex backgrounds with the YOLOv8s model, this chapter proposes an improved YOLOv8s model that incorporates the Pyramid Vision Transformer (PVT). Specifically, to enhance the feature extraction capabilities of the base module, this paper proposes using PVT in the Backbone stage of YOLOv8s to replace the previous basic convolutional feature extraction blocks. 
This structure allows the model to process images at different resolution levels, thereby more effectively capturing details and contextual information.","manuscriptTitle":"Improvement of YOLOv8 algorithm through integration of Pyramid Vision Transformer architecture","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2024-10-22 14:55:06","doi":"10.21203/rs.3.rs-4987159/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"abecbb94-0cdb-40fa-893f-b63fa765254d","owner":[],"postedDate":"October 22nd, 2024","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[{"id":38705434,"name":"Physical sciences/Mathematics and computing/Computer science"},{"id":38705435,"name":"Physical sciences/Mathematics and computing/Information technology"}],"tags":[],"updatedAt":"2025-02-20T04:23:43+00:00","versionOfRecord":[],"versionCreatedAt":"2024-10-22 14:55:06","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-4987159","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-4987159","identity":"rs-4987159","version":["v1"]},"buildId":"qtupq5eGEP_6zYnWcrvyt","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}



Source provenance

europepmc: last seen: 2026-05-20T01:45:00.602351+00:00