A Cloud-Edge Collaborative Model Training Framework for Assisted Classification of Middle Ear Diseases Based on Ultra-High-Resolution Temporal Bone CT Images | Research Square

Ting Wu, Yu Tang, Zigang Che, Jiangjiang Zhao, Jue Wang, Yanfeng Wu, and 2 more

This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-5414065/v1 This work is licensed under a CC BY 4.0 License.

Abstract

Objective

Cholesteatoma and otitis media are two of the most common middle ear diseases, of which the treatment principles are different, making the differentiation between them of significant importance. Both chronic suppurative otitis media (CSOM) and middle ear cholesteatoma (MEC) can appear on CT images as low-density soft tissue-like masses partially filling the middle ear and mastoid cavities.
However, typical CT imaging of MEC may show progressive destruction of auditory structures and adjacent cranial bones. Compared to high-resolution CT (HRCT), ultra-high-resolution CT (U-HRCT) offers inherent continuity and a more detailed display of the fine structures of the middle ear. This study proposes a "cloud-edge" collaborative training framework for middle ear disease classification that exploits temporal bone U-HRCT imaging data. By integrating the YOLO recognition algorithm, this framework aims to achieve auxiliary classification of MEC and CSOM based on U-HRCT images.

Design

In the cloud-edge collaborative framework, the edge devices acquire U-HRCT imaging data and perform auxiliary classification of middle ear diseases using image recognition and inference techniques. The imaging data collected by the edge devices are transmitted to the cloud, where a unified model training process is executed, and the model containers are then deployed to the edge devices for future auxiliary diagnosis. The framework employed the Mixup and Mosaic methods for data augmentation to enhance model robustness and improve generalization performance. Object detection models of the You Only Look Once (YOLO) family were used, and the final model was selected based on their performance.

Results

This study found that the cloud-edge collaborative framework can effectively classify temporal bone U-HRCT imaging data for MEC and CSOM. In testing, the framework successfully collected real CT image data, performed data processing, and conducted model training as designed. Multiple models were trained, and their detection ability was assessed with selected metrics, allowing model selection to trade off computation time against accuracy. The selected model was then deployed to the edge, where it performed auxiliary classification tasks on the edge device.
Conclusions

This study discussed the significance of temporal bone U-HRCT imaging in the diagnosis of CSOM and MEC and proposed a cloud-edge collaborative model training framework for auxiliary classification from U-HRCT imaging data. This approach maximizes the utility of the data, fully leverages the diversity of image recognition algorithms, and ensures a high level of accuracy in classification.

Health sciences/Medical research; Physical sciences/Mathematics and computing

1. Introduction

More than three years have passed since the outbreak of the COVID-19 pandemic, and its impact is still felt across the global healthcare industry, which urgently needs to reduce costs and improve the accessibility of medical services while facing challenges such as the uneven distribution of local medical resources and a shortage of skilled medical personnel. These challenges have prompted healthcare systems to adopt emerging technologies to fill these gaps. By promoting the application of advanced technologies such as cloud computing, the Internet of Things (IoT), and artificial intelligence (AI), healthcare services are evolving towards an information-enabled, intelligent model. The construction of an “Internet + Medical System” can effectively alleviate the growing demand for medical services and the shortage of medical resources.

Hearing impairment has long been a significant ear condition affecting individuals across all age groups. Based on the location of the lesion, hearing loss can be classified into three main types: sensorineural hearing loss, conductive hearing loss, and mixed hearing loss. Of the three types, conductive hearing loss accounts for a major proportion. Conductive hearing loss is caused by lesions in the outer ear, middle ear, and inner ear, among which middle ear lesions are a primary focus of research.
The middle ear is in the petrous part of the temporal bone and consists of four sections: the tympanic cavity, eustachian tube, tympanic antrum, and mastoid process. The main function of the middle ear is to transmit sound energy from the air in the external auditory canal to the lymphatic fluid in the cochlea. This energy conversion, from gas to liquid, is achieved through the vibration of the tympanic membrane and the ossicular chain (Luers & Hüttenbrink, 2016). This specific anatomical structure renders middle ear diseases accounting for the highest proportion of ear disorders, which not only directly affect hearing ability but may also lead to serious intracranial and extracranial complications such as peripheral facial palsy and brain abscesses (Cacco, et al., 2022). Chronic suppurative otitis media (CSOM) and middle ear cholesteatoma (MEC) are the most common middle ear diseases. CSOM is a chronic suppurative inflammation of the mucosa, periosteum, or even the bone tissue of the middle ear. The disease is often not confined to the tympanic cavity but invades the tympanic antrum, mastoid process, and eustachian tube. It is clinically characterized by long-term intermittent or persistent purulent discharge, tympanic membrane perforation, with or without hearing loss. It is globally one of the leading causes of preventable hearing loss in both children and adults (Bhutta, Leach, & Brennan-Jones, 2024 ). MEC is a cystic structure in the middle ear, rather than a true tumor, caused by the inward growth of squamous epithelium from the tympanic membrane towards the middle ear. The cyst contains desquamated squamous epithelium and keratinous material, which can erode the surrounding bone. Apart from hearing loss, it can cause serious intracranial and extracranial complications (Gilberto, et al., 2020). Both CSOM and MEC are featured with recurrent ear discharge, hearing loss and tinnitus at varying levels, making them similar in presentation. 
However, the treatment principles for the two differ. CSOM treatment aims at containing infection, ensuring drainage, removing lesions, restoring hearing ability, and eliminating the underlying cause (Silverstein, 1972 ). For MEC, the first choice is to remove the lesioned tissue, prevent complications via surgery and to reconstruct the sound transmission structures while keeping the middle ear dry (Kuo, et al., 2015). Therefore, differential recognition between the two is crucial for clinical diagnosis and prognosis. High-resolution computed tomography (HRCT) of the temporal bone is currently a relatively accurate imaging method for diagnosing middle ear diseases, as it can clearly reveal the internal anatomical structures and lesions of the middle ear (Baba, et al., 2022). However, the typical slice thickness of HRCT is 0.625 mm, and the reconstructed slice thickness is often 1.0 mm, restraining its ability to detect small or early-stage lesions, which may lead to missed or misdiagnosed middle ear diseases (Xu, et al., 2023). At present, ultra-high-resolution computed tomography (U-HRCT) specifically designed for otology demonstrated significant advantages in displaying the fine bone anatomy of the temporal bone. 10-µm level U-HRCT can reach a resolution of up to 0.05 mm, providing clear visualization of the fine structures in the middle ear, surrounding bones and even the state of ligaments. Compared to HRCT, its image quality is significantly improved. U-HRCT has shown great advancement in displaying the detailed bone anatomy of the temporal bone. Currently, computer-aided decision support models involving various machine learning algorithms (Zeng, et al., 2022; Sundgaard, et al., 2021; Zeng, et al., 2021) and convolutional neural networks (CNN) have been applied for middle ear disease detection using tympanic membrane images. Yan-Mei Wang, et al. 
proposed a deep learning framework to extract regions from temporal bone CT slices for CSOM and MEC diagnosis (Wang, et al., 2020). In the meantime, a middle ear disease detection model based on a 3D CNN using temporal bone CT images has been published (Su, et al., 2022). Both CSOM and MEC appear on CT images as soft tissue-like low-density masses partially filling the middle ear and mastoid cavities. However, typical CT images of MEC may also show progressive destruction of auditory structures and adjacent cranial bones, such as erosion of the superior shield plate of the tympanic cavity or enlargement of the tympanic antrum opening. Regular HRCT may miss these features, leading to misdiagnosis and inappropriate treatment. U-HRCT shows great potential for the early detection of fine bone anatomy and small lesions in the temporal bone, but at present there is still no research on combining U-HRCT with artificial intelligence to differentiate between the most common middle ear diseases, CSOM and MEC.

To address this, this study analyzes the U-HRCT imaging characteristics of CSOM and MEC. By leveraging the innate continuity of U-HRCT image data and its ability to display fine structures in the middle ear, this study proposes a "cloud-edge" collaborative training framework for middle ear diseases that integrates the YOLO recognition algorithm to classify U-HRCT images for MEC and CSOM. The research aims to use computer vision recognition technology to alleviate the heavy workload of image interpretation, which is constrained by expert experience. It seeks to establish a new system that promotes the standardization of high-quality diagnostic technologies, reduces the workload and costs associated with middle ear disease diagnosis and treatment, and facilitates the accumulation of high-quality medical resources and the sharing of intelligent image recognition technology.

2. Materials and Methods

2.1 Data acquisition

This study collected the medical records and temporal bone U-HRCT imaging data of patients who underwent middle ear surgery at the Department of Otolaryngology of Nanjing Tongren Hospital. The medical records were screened based on pathology, medical history, ear examinations, audiograms, and imaging results of the operated ear. Patients with congenital middle ear malformations, those undergoing repeated surgeries, and those with acute or chronic secretory otitis media were excluded. Eventually, the study included 400 ears of 205 patients. The medical records of all patients were independently reviewed by two otolaryngologists, each with more than 15 years of experience and the title of Associate Chief Physician, and a unanimous diagnosis was reached. The study was approved by the Medical Ethics Committee of Nanjing Tongren Hospital (Approval No. 2024-03-006-k001). All methods were performed in accordance with the relevant guidelines and regulations. Given the retrospective nature of the study, the informed consent process was waived.

The imaging was conducted using the Ultra3D U-HRCT equipment from Beijing LargeV Instrument Co., Ltd., in small-field-of-view and high-definition mode, with the scanning parameters listed in Table 1.

Table 1 Imaging equipment parameters

| Number | Parameter | Value/Range |
|---|---|---|
| 1 | Voltage | 100~110 kV |
| 2 | Current | 140~180 mAs |
| 3 | Image matrix dimension | 650×650 |
| 4 | Scanning range | 7 cm × 4 cm |
| 5 | Reconstruction range | 65 mm × 65 mm |
| 6 | Slice thickness | 0.1 mm |
| 7 | Slice interval | 0.1 mm |

The scanning range of the ear imaging covers from the superior part of the semicircular canal to the mastoid antrum, extending outwards to the tympanic part of the temporal bone and inward to the petrous apex of the temporal bone. A total of 4000 high-resolution axial U-HRCT images of the temporal bone were obtained.
This dataset involves 70 patients with MEC (38 male and 32 female), 135 patients with CSOM (53 male and 82 female), and 20 control subjects in normal condition (10 male and 10 female). The ages of the patients ranged from 10 to 70 years. Patient information is summarized in Table 2.

Table 2 Patient information summarization

| | Male | Female |
|---|---|---|
| MEC | 38 | 32 |
| CSOM | 53 | 82 |
| Normal | 10 | 10 |
| Total (MEC and CSOM patients) | 91 | 114 |

For each patient, we selected approximately 10 to 20 CT images that clearly displayed well-determined lesions. In total, we used 2295 CT images of CSOM, 1305 CT images of MEC, and 400 normal CT images for analysis.

2.2 Cloud-edge collaborative training framework-based disease classification

In the proposed "cloud-edge" collaborative training framework for middle ear diseases, the edge devices are physically distributed and used as auxiliary classification tools for otolaryngologists or radiologists. These edge devices access imaging data from actual patients and use image recognition inference models to assist in classifying middle ear diseases. The cloud consists of high-performance processing devices to which the imaging data from the edge devices are transmitted for a unified training process. The trained models are then deployed to the edge devices for diagnosis and treatment support. Within this framework, cloud-edge collaboration enables data labeling and model optimization. The overall architecture is illustrated in Fig. 1.

Deep learning has demonstrated remarkable performance in image recognition (Wu, Liu, & Liu, 2019). The mainstream deep learning image recognition methods can be divided into two categories: two-stage methods based on region extraction and single-stage detection methods.
The two-stage method represented by Regions with Convolutional Neural Networks (RCNN) first uses a region extractor to generate candidate object regions, and then employs deep neural networks for feature extraction and classification in each region (Girshick, Donahue, Darrell, & Malik, 2013 ). The single-stage detection method represented by You Only Look Once (YOLO) algorithms directly extract location and class information from the image, allowing for faster detection (Kim, Sung, & Park, 2020 ). Given the real-time requirements, functionality, and model generalizability of the cloud-edge collaborative training framework, this study adopted the YOLO algorithms as the model for identifying middle ear diseases. The YOLO algorithms are among the most commonly used deep learning algorithms for object detection (Kim, Sung, & Park, 2020 ). The first version was accomplished by Joseph Redmon in 2016, and after numerous optimizations and innovations, the most often used versions today are YOLOv8 (Jocher, Chaurasia, & Qiu, 2023 ), YOLOv9 (Wang, Yeh, & Liao, YOLOv9, 2024), and YOLOv10 (Wang, et al., 2024 ). The principal idea of YOLO is to divide the image into grid cells, and then practice simultaneously the prediction of the bounding boxes and the class probabilities for each grid cell using a multi-layer convolutional neural network. As shown in Fig. 2 , the architecture of YOLOv8 consists of a backbone, neck, and head (Jocher, Chaurasia, & Qiu, 2023 ). The backbone of YOLOv8 utilizes the Cross Stage Partial (CSPNet) architecture (Wang, et al., CSPNet, 2019), which connects different layers in the deep learning network. It divides the input feature map of a layer into two parts: one proceeds to the subsequent neural network layer, while the other part bypasses that segment of the network and merges directly with the output from the previous layer that has passed through the neural network. 
CSPNet guarantees gradient propagation during the model training process, enhances the stability of the training, reducing computation workload and memory usage. The combination with ELAN (Wang, Liao, & Yeh, Designing Network Design Strategies Through Gradient Path Analysis, 2022) further optimizes computational resource utilization and object detection performance. Additionally, YOLOv8 employs a better optimized version of spatial pyramid pooling— Spatial Pyramid Pooling - Fast (SPPF), along with the SiLU as activation function, improved loss functions, and training strategies to enhance efficiency and accuracy. Building on YOLOv8, YOLOv9 addresses the problem of information loss during the feed-forward process, which affects the convergence in deep learning process. It introduces Programmable Gradient Information (PGI) and Generalized ELAN (GELAN), and improves the architecture of the image detection network (Wang, Liao, & Yeh, Designing Network Design Strategies Through Gradient Path Analysis, 2022). YOLOv10 employs a new holistic efficiency-accuracy driven model design strategy, enhancing the architectural design and utilizing an NMS-free (Non-Maximum Suppression-free) data post-processing method (Wang, et al., 2024 ). Both YOLOv9 and YOLOv10 show performance improvement over YOLOv8 when tested on the MS COCO dataset. Each YOLO version offers different sizes of sub-versions. YOLOv8 includes five versions: YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x; YOLOv9 has five versions: YOLOv9t, YOLOv9s, YOLOv9m, YOLOv9c, and YOLOv9e; YOLOv10 offers six versions: YOLOv10n, YOLOv10s, YOLOv10m, YOLOv10b, YOLOv10l, and YOLOv10x. YOLOv8m, YOLOv9m, and YOLOv10m are suitable for general-purpose object detection. This study tested the performance of YOLOv8m, YOLOv9m, and YOLOv10m on the dataset with the cloud-edge collaborative training framework, providing a reference for the model selection phase when constructing the cloud-edge collaborative training framework. 
Table 3 displays the features of these models.

Table 3 Number of parameters for each model and the testing result on MS COCO (Wang, et al., 2024)

| Model | mAP50-95 (%) | Params (M) | FLOPs (G) |
|---|---|---|---|
| YOLOv8m | 50.6 | 25.9 | 78.9 |
| YOLOv9m | 51.1 | 20.0 | 76.3 |
| YOLOv10m | 51.1 | 16.46 | 63.4 |

The framework proposed in this paper receives U-HRCT images from edge devices and then gathers the data in the cloud for training. The system leverages the reliable data acquisition capabilities of U-HRCT devices, precise inference analysis at the edge, and efficient collaborative capabilities in the cloud to address two challenges: the fragmentation of multi-level information in the healthcare system, and the limited interdepartmental diagnostic capability for middle ear diseases, which is often constrained by expert experience. This method filters axial data containing the middle ear structural features from the patient's temporal bone U-HRCT scans and uses the YOLO algorithm to build an auxiliary diagnostic model, facilitating computation-intensive, low-latency diagnoses.

2.3 Experiment design

2.3.1 Data augmentation

The experiment employs the Mixup and Mosaic data augmentation algorithms to enhance the robustness of the model, thereby improving its generalization ability.

Mixup was collaboratively proposed by MIT and Facebook in 2018. It is performed during the dataset loading phase and convexly combines samples and their labels to create new training samples. Unlike conventional data augmentation methods, Mixup alters both the samples and the labels simultaneously, operating within a batch by mixing each sample in the batch with a randomly selected image from that batch.
The formula is as follows:

$$x = \lambda x_i + (1 - \lambda) x_j$$

$$y = \lambda y_i + (1 - \lambda) y_j$$

Here, \((x_i, y_i)\) and \((x_j, y_j)\) are two randomly selected feature-target vectors from the training dataset, while \(\lambda\) is a random number drawn from a given Beta distribution. Thus, Mixup provides continuous data samples between different data categories, directly expanding the distribution of the given training set and making the network perform better during the testing phase.

Mosaic augmentation involves stitching images together through random scaling, random cropping, and random arrangement. This technique enriches the detection dataset by increasing the presence of small objects, thereby improving the network's robustness. The steps are as follows:

1. Create a new mosaic canvas and randomly generate a point on the canvas.
2. Select four images around the random point and incorporate parts of these images into the canvas.

The augmentation process is shown in Fig. 3. The four colors represent the four sample images. The parts that fall outside the canvas are discarded. The stitching process for the bottom left and bottom right images follows the same mechanism as that for the top left and top right images.

2.3.2 Data labeling and model training

Considering the performance of the cloud-edge collaborative architecture, this study selected three models for experimental comparison: YOLOv8m, YOLOv9m, and YOLOv10m. Data annotation was in Pascal VOC format, with the dataset divided into training, validation, and testing sets in a 50%, 25%, 25% ratio. The data labels included three categories: normal, CSOM, and MEC. Subsequently, transfer learning and training were carried out using the aforementioned models.
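As a concrete illustration of the Mixup formula in Section 2.3.1, the following is a minimal sketch in NumPy, not the implementation used in the study. The function name `mixup`, the Beta parameter `alpha=0.2`, and the toy 4×4 arrays are assumptions made for the example. It blends one-hot class vectors exactly as the formula states; detection frameworks commonly keep the bounding boxes of both source images instead of blending class vectors.

```python
import numpy as np


def mixup(x_i, y_i, x_j, y_j, alpha=0.2, rng=None):
    """Convexly combine two samples and their labels (hypothetical helper)."""
    rng = rng if rng is not None else np.random.default_rng()
    # lambda ~ Beta(alpha, alpha); alpha is an assumed value, not from the paper.
    lam = rng.beta(alpha, alpha)
    x = lam * x_i + (1.0 - lam) * x_j
    y = lam * y_i + (1.0 - lam) * y_j
    return x, y, lam


# Toy data: two 4x4 "images" with one-hot labels over (normal, CSOM, MEC).
rng = np.random.default_rng(0)
img_a, img_b = np.zeros((4, 4)), np.ones((4, 4))
lab_a = np.array([0.0, 1.0, 0.0])  # CSOM
lab_b = np.array([0.0, 0.0, 1.0])  # MEC
mixed_img, mixed_lab, lam = mixup(img_a, lab_a, img_b, lab_b, rng=rng)
print(lam, mixed_lab)
```

Because the mixed label is a convex combination, its entries still sum to 1, which is what lets Mixup fill in samples between categories.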
The initial weights were those provided by the Ultralytics library for models pre-trained on the COCO dataset (Lin, et al., 2014). Training ran for 100 epochs with a learning rate of 0.01, Stochastic Gradient Descent (SGD) as the optimizer, and a batch size of 16. The overall process adopted the automatic mixed precision approach.

The study evaluated the inference performance of the models on the test set from the perspectives of localization performance and classification performance. Localization performance is described by Intersection over Union (IoU), which measures the overlap between the predicted bounding box and the ground truth bounding box, as defined below:

$$\text{IoU} = \frac{\text{area}(\text{ground truth bounding box} \cap \text{predicted bounding box})}{\text{area}(\text{ground truth bounding box} \cup \text{predicted bounding box})}$$

The higher the IoU, the more closely the localization result aligns with the ground truth. Classification performance is represented by precision, recall, and the F1 score.
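The IoU definition above can be sketched for axis-aligned boxes as follows. This is an illustrative example rather than the evaluation code used in the study; the function name `iou` and the sample coordinates are assumptions, and boxes are taken as `(x_min, y_min, x_max, y_max)`, the corner convention used by Pascal VOC annotations.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the intersection rectangle.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Width and height are clamped at zero when the boxes do not overlap.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


ground_truth = (10, 10, 50, 50)  # hypothetical lesion annotation
prediction = (30, 30, 70, 70)    # hypothetical model output
print(iou(ground_truth, prediction))  # 400 / 2800, about 0.143
```

In this example the overlap is a 20×20 square, so the prediction would not count as a match at an IoU threshold of 50%.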
The calculation methods for precision and recall are defined as follows:

$$\text{Precision} = \frac{T_P}{T_P + F_P}$$

$$\text{Recall} = \frac{T_P}{T_P + F_N}$$

In this study, when the IoU of the predicted result is greater than \(\text{IoU}^{\text{threshold}}\), the prediction is classified as a true positive \(T_P\); otherwise, it is considered a false positive \(F_P\). A ground-truth object that no prediction matches counts as a false negative \(F_N\). Precision describes the proportion of correctly predicted positive samples among all predicted positive results, so higher precision means more reliable predictions for positive samples. Recall refers to the proportion of correctly identified positive samples among all actual positive samples, with a higher recall indicating a better recognition rate for positive samples. The F-score combines precision and recall and is a commonly used metric to assess the predictive performance of the model (Goutte & Gaussier, 2005):

$$\text{F-score} = \frac{(1 + \beta^2) \times \text{Precision} \times \text{Recall}}{\beta^2 \times \text{Precision} + \text{Recall}}$$

Let \(\beta = 1\), then we have the F1-score:

$$\text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

Recall and precision vary with changes in the threshold.
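To make these definitions concrete, here is a short sketch that computes precision, recall, and the F-score from raw counts. It uses the standard F-beta form, whose denominator is β² × Precision + Recall and which reduces to the F1 formula when β = 1. The function name and the counts in the example are hypothetical and are not taken from the study's results.

```python
def precision_recall_fbeta(tp, fp, fn, beta=1.0):
    """Return (precision, recall, F-beta) from detection counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = beta ** 2 * precision + recall
    fbeta = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
    return precision, recall, fbeta


# Hypothetical counts: 90 matched detections, 10 spurious, 30 missed objects.
p, r, f1 = precision_recall_fbeta(tp=90, fp=10, fn=30)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

Raising the confidence threshold usually removes spurious detections (fewer false positives, higher precision) at the cost of more missed objects (more false negatives, lower recall), which is the trade-off the P-R curve traces out.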
Based on different thresholds, a P-R curve is plotted with recall and precision as coordinates (Boyd, Eng, & Page, 2013). The area enclosed by this curve and the axes is referred to as Average Precision (AP), which is an important metric for evaluating the overall performance of object detection algorithms. The mean Average Precision (mAP) is defined as the average of AP across all categories at a specific IoU threshold. In this study, mAP was calculated at an IoU threshold of 50% (mAP50) and also averaged over IoU thresholds in the range \([50\%, 95\%]\) (mAP50-95).

The experiments were conducted using Google Colab services, with the following machine configuration: Intel(R) Xeon(R) CPU @ 2.20GHz (8 cores), NVIDIA Tesla T4 (15102 MB of graphics memory), operating system Linux 6.1.85, CUDA version 12.2, Python version 3.10.12, PyTorch version 2.3.1 + cu121, and Ultralytics version 8.2.82.

3. Results

This research collected and compared the inference results of the YOLO algorithm family. We adopted processing time, recall, precision, F1-score, mAP50, and mAP50-95 as performance indicators for result analysis. Table 4 compares the three models, and Fig. 4 shows their P-R curves on the test set. Table 5, Table 6, and Table 7 present the prediction results of each model for normal, CSOM, and MEC cases and for all cases overall.
Table 4 YOLO model family prediction result comparison

| Metric | YOLOv8m | YOLOv9m | YOLOv10m |
|---|---|---|---|
| Pre-processing time (ms) | 3.1 | 3.2 | 3.1 |
| Inference time (ms) | 29.9 | 29.5 | 33.8 |
| Post-processing time (ms) | 7.5 | 12.6 | 4.2 |
| Precision | 0.957 | 0.917 | 0.952 |
| Recall | 0.967 | 0.95 | 0.872 |
| F1-score | 0.962 | 0.933 | 0.911 |
| mAP50 | 0.976 | 0.958 | 0.926 |
| mAP50-95 | 0.956 | 0.949 | 0.915 |

Table 5 Prediction Results of YOLOv8m

| Class | Precision | Recall | F1-score | mAP50 | mAP50-95 |
|---|---|---|---|---|---|
| Normal | 1 | 0.900 | 0.947 | 0.950 | 0.928 |
| CSOM | 0.871 | 1 | 0.931 | 0.984 | 0.984 |
| MEC | 1 | 1 | 1 | 0.995 | 0.957 |
| All | 0.957 | 0.967 | 0.962 | 0.976 | 0.956 |

Table 6 Prediction Results of YOLOv9m

| Class | Precision | Recall | F1-score | mAP50 | mAP50-95 |
|---|---|---|---|---|---|
| Normal | 1 | 0.85 | 0.933 | 0.925 | 0.913 |
| CSOM | 0.75 | 1 | 0.857 | 0.955 | 0.955 |
| MEC | 1 | 1 | 1 | 0.995 | 0.980 |
| All | 0.917 | 0.95 | 0.933 | 0.958 | 0.949 |

Table 7 Prediction Results of YOLOv10m

| Class | Precision | Recall | F1-score | mAP50 | mAP50-95 |
|---|---|---|---|---|---|
| Normal | 1 | 0.95 | 0.974 | 0.975 | 0.953 |
| CSOM | 0.857 | 0.667 | 0.75 | 0.809 | 0.809 |
| MEC | 1 | 1 | 1 | 0.995 | 0.983 |
| All | 0.952 | 0.872 | 0.911 | 0.926 | 0.915 |

Prediction results on CSOM, MEC, and normal cases are shown in Fig. 5, Fig. 6, and Fig. 7, respectively.

4. Discussion

Among the three models, as shown in Table 4, YOLOv9m has a preprocessing time 0.1 ms longer than that of both YOLOv8m and YOLOv10m. Yet YOLOv9m has the shortest inference time, with YOLOv8m 0.4 ms slower, while YOLOv10m's inference time is longer than that of the other two models. In the postprocessing phase, thanks to its optimization techniques, YOLOv10m reduces the postprocessing time significantly, to an average of 4.2 ms, which compensates for its lag in the inference stage.

In terms of overall predictive performance, YOLOv8m outperforms both YOLOv9m and YOLOv10m in average precision, recall, mAP50, and mAP50-95 across the three recognition tasks. Both average precision and recall for YOLOv8m exceed 95%. YOLOv9m has an average recall of 95% but a relatively lower precision. YOLOv10m's average precision is close to that of YOLOv8m, but its recall is lower.
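To make the timing comparison concrete, the per-stage times reported in Table 4 can be summed into an end-to-end figure for each model. The snippet below is simple arithmetic on the published values; the variable names are illustrative, and treating the three stages as strictly sequential is an assumption.

```python
# Pre-processing, inference, and post-processing times in ms, from Table 4.
stage_times_ms = {
    "YOLOv8m": (3.1, 29.9, 7.5),
    "YOLOv9m": (3.2, 29.5, 12.6),
    "YOLOv10m": (3.1, 33.8, 4.2),
}

# Sum the three stages and round to the table's one-decimal precision.
totals = {model: round(sum(times), 1) for model, times in stage_times_ms.items()}
for model, total in totals.items():
    print(f"{model}: {total} ms")
```

Under that assumption the totals are 40.5 ms for YOLOv8m, 45.3 ms for YOLOv9m, and 41.1 ms for YOLOv10m, so YOLOv10m's short postprocessing brings it to within 0.6 ms of YOLOv8m despite its slower inference.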
Regarding classification prediction results, YOLOv10m excels in predicting normal and MEC cases, achieving the highest precision, recall, mAP50, and mAP50-95 among the tested models. However, its performance in identifying CSOM is significantly lower than that of YOLOv8m and YOLOv9m, which adversely affects the overall evaluation of YOLOv10m. As demonstrated in Fig. 5, Fig. 6, and Fig. 7, all three models can effectively locate targets but exhibit misclassifications to different degrees. For instance, YOLOv10m failed to recognize CSOM in Fig. 5c. Fig. 6 indicates that all three models perform well in identifying MEC. Misclassifications in identifying normal cases occur across all models, as shown in Fig. 7.

Although YOLOv8m produces a higher proportion of false positives and YOLOv10m gives better predictions for normal and MEC cases, the architecture proposed in this study serves as an auxiliary classification tool: false positives can be further filtered by professionals, whereas false negatives risk delaying diagnosis. In summary, YOLOv8m demonstrates superior accuracy across the comprehensive recognition metrics, with acceptable computational resource requirements, compared to YOLOv9m and YOLOv10m.

Given the cloud-edge collaborative architecture, training for all three models occurs in the cloud, allowing them to be trained and tested simultaneously before deployment. The models can be quantized and packaged as containers for inference on incoming data at the edge. As the dataset expands, the cloud can iteratively train these models and update the edge models as needed.

5. Conclusions

This study explores the significance of temporal bone U-HRCT imaging in the diagnosis of CSOM and MEC, proposing a framework for assisting classification using a "cloud-edge" collaborative training architecture. In this framework, multiple image recognition models were trained and tested in the cloud, and their results were compared.
Experimental results show that model performance can be assessed using indicators such as precision, recall, and mAP. By leveraging the "cloud-edge" collaborative training architecture, image recognition models can be trained simultaneously in the cloud, and the best-performing models can be deployed to the edge devices. This approach maximizes data utilization and fully explores the diversity of image recognition algorithms, ensuring high target recognition accuracy on edge devices.

Currently, large vision models like Vision Transformers (ViT) offer better recognition capabilities; however, because U-HRCT equipment has been on the market for a relatively short time, the sample size is limited, making it difficult to train these large models effectively. In the future, efforts will focus on continuously expanding the dataset and on training and testing large models to harness their detailed and precise features. Comparative studies on the performance difference between large models and the existing models will be conducted to explore the application of large models in U-HRCT image processing.

Abbreviations

CSOM = chronic suppurative otitis media
MEC = middle ear cholesteatoma
HRCT = high-resolution computed tomography
U-HRCT = ultra-high-resolution computed tomography
YOLO = You Only Look Once
IoU = Intersection over Union
AP = average precision
mAP = mean average precision

Declarations

Author Contribution

Ting Wu and Yu Tang contributed equally to this work and share the first authorship. All authors contributed to the study conception and design. Material preparation, data collection, and analysis were performed by T.W., Y.T. and Z.G.C. The first draft of the manuscript was written by T.W., Y.T. and W.M. J.J.Z. and S.B.H. designed and performed the experiments and provided critical comments. Y.W. and J.W. drew the figures. All authors commented on the previous versions of the manuscript. All authors read and approved the final manuscript.
Data Availability
The data that support the findings of this study are available from Nanjing Tongren Hospital, School of Medicine, Southeast University, but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Data are, however, available from the authors upon reasonable request and with permission of Nanjing Tongren Hospital, School of Medicine, Southeast University. The data from this study can be acquired by contacting Ting Wu ([email protected]).
References
Baba, A., Kurokawa, R., Kurokawa, M., Ota, Y., Matsushima, S., Fukuda, T., . . . Ojiri, H. (2022, June). Preoperative prediction for mastoid extension of middle ear cholesteatoma using temporal subtraction serial HRCT studies. European Radiology, 32, 3631–3638. doi:10.1007/s00330-021-08453-0
Bhutta, M. F., Leach, A. J., & Brennan-Jones, C. G. (2024, May 25). Chronic suppurative otitis media. Lancet (London, England), 403, 2339–2348. doi:10.1016/S0140-6736(24)00259-9
Boyd, K., Eng, K. H., & Page, C. D. (2013). Area under the Precision-Recall Curve: Point Estimates and Confidence Intervals. In C. Salinesi, M. C. Norrie, & Ó. Pastor (Eds.), Advanced Information Systems Engineering (Vol. 7908, pp. 451–466). Berlin, Heidelberg: Springer Berlin Heidelberg. doi:10.1007/978-3-642-40994-3_29
Cacco, T., Africano, S., Gaglio, G., Carmisciano, L., Piccirillo, E., Castello, E., & Peretti, G. (2022, February). Correlation between peri-operative complication in middle ear cholesteatoma surgery using STAMCO, ChOLE, and SAMEO-ATO classifications. European archives of oto-rhino-laryngology: official journal of the European Federation of Oto-Rhino-Laryngological Societies (EUFOS): affiliated with the German Society for Oto-Rhino-Laryngology - Head and Neck Surgery, 279, 619–626. doi:10.1007/s00405-021-06679-8
Gilberto, N., Custódio, S., Colaço, T., Santos, R., Sousa, P., & Escada, P. (2020, April). Middle ear congenital cholesteatoma: systematic review, meta-analysis and insights on its pathogenesis. European archives of oto-rhino-laryngology: official journal of the European Federation of Oto-Rhino-Laryngological Societies (EUFOS): affiliated with the German Society for Oto-Rhino-Laryngology - Head and Neck Surgery, 277, 987–998. doi:10.1007/s00405-020-05792-4
Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2013). Rich feature hierarchies for accurate object detection and semantic segmentation. arXiv. doi:10.48550/ARXIV.1311.2524
Goutte, C., & Gaussier, E. (2005). A Probabilistic Interpretation of Precision, Recall and F-Score, with Implication for Evaluation. In D. E. Losada, & J. M. Fernández-Luna (Eds.), Advances in Information Retrieval (Vol. 3408, pp. 345–359). Berlin, Heidelberg: Springer Berlin Heidelberg. doi:10.1007/978-3-540-31865-1_25
Jocher, G., Chaurasia, A., & Qiu, J. (2023). Ultralytics YOLOv8. Retrieved from https://github.com/ultralytics/ultralytics
Kim, J.-a., Sung, J.-Y., & Park, S.-h. (2020, November). Comparison of Faster-RCNN, YOLO, and SSD for Real-Time Vehicle Type Recognition. 2020 IEEE International Conference on Consumer Electronics - Asia (ICCE-Asia) (pp. 1–4). doi:10.1109/ICCE-Asia49877.2020.9277040
Kuo, C.-L., Shiao, A.-S., Yung, M., Sakagami, M., Sudhoff, H., Wang, C.-H., . . . Lien, C.-F. (2015). Updates and knowledge gaps in cholesteatoma research. BioMed Research International, 2015, 854024. doi:10.1155/2015/854024
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., . . . Zitnick, C. L. (2014). Microsoft COCO: Common Objects in Context. In D. Fleet, T. Pajdla, B. Schiele, & T. Tuytelaars (Eds.), Computer Vision – ECCV 2014 (Vol. 8693, pp. 740–755). Cham: Springer International Publishing. doi:10.1007/978-3-319-10602-1_48
Luers, J. C., & Hüttenbrink, K.-B. (2016, February). Surgical anatomy and pathology of the middle ear. Journal of Anatomy, 228, 338–353. doi:10.1111/joa.12389
Silverstein, H. (1972, August 10). Surgery for chronic suppurative otitis media. The New England Journal of Medicine, 287, 287–290. doi:10.1056/NEJM197208102870607
Su, R., Song, J., Wang, Z., Mao, S., Mao, Y., Wu, X., & Hou, M. (2022, August 28). Application of high resolution computed tomography image assisted classification model of middle ear diseases based on 3D-convolutional neural network. Zhong Nan Da Xue Xue Bao. Yi Xue Ban = Journal of Central South University. Medical Sciences, 47, 1037–1048. doi:10.11817/j.issn.1672-7347.2022.210704
Sundgaard, J. V., Harte, J., Bray, P., Laugesen, S., Kamide, Y., Tanaka, C., . . . Christensen, A. N. (2021, July). Deep metric learning for otitis media classification. Medical Image Analysis, 71, 102034. doi:10.1016/j.media.2021.102034
Wang, A., Chen, H., Liu, L., Chen, K., Lin, Z., Han, J., & Ding, G. (2024). YOLOv10: Real-Time End-to-End Object Detection. arXiv. doi:10.48550/ARXIV.2405.14458
Wang, C.-Y., Liao, H.-Y. M., & Yeh, I.-H. (2022). Designing Network Design Strategies Through Gradient Path Analysis. arXiv. doi:10.48550/ARXIV.2211.04800
Wang, C.-Y., Liao, H.-Y. M., Yeh, I.-H., Wu, Y.-H., Chen, P.-Y., & Hsieh, J.-W. (2019). CSPNet: A New Backbone that can Enhance Learning Capability of CNN. arXiv. doi:10.48550/ARXIV.1911.11929
Wang, C.-Y., Yeh, I.-H., & Liao, H.-Y. M. (2024). YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv. doi:10.48550/ARXIV.2402.13616
Wang, Y.-M., Li, Y., Cheng, Y.-S., He, Z.-Y., Yang, J.-M., Xu, J.-H., . . . Ren, D.-D. (2020). Deep Learning in Automated Region Proposal and Diagnosis of Chronic Otitis Media Based on Computed Tomography. Ear and Hearing, 41, 669–677. doi:10.1097/AUD.0000000000000794
Wu, H., Liu, Q., & Liu, X. (2019). A Review on Deep Learning Approaches to Image Classification and Object Segmentation. Computers, Materials & Continua, 60, 575–597. doi:10.32604/cmc.2019.03595
Xu, N., Ding, H., Tang, R., Li, X., Zhang, Z., Lv, H., . . . Zhao, P. (2023, November 28). Comparative study of the sensitivity of ultra-high-resolution CT and high-resolution CT in the diagnosis of isolated fenestral otosclerosis. Insights into Imaging, 14, 211. doi:10.1186/s13244-023-01562-y
Zeng, J., Kang, W., Chen, S., Lin, Y., Deng, W., Wang, Y., . . . Cai, Y. (2022, July 1). A Deep Learning Approach to Predict Conductive Hearing Loss in Patients With Otitis Media With Effusion Using Otoscopic Images. JAMA Otolaryngology–Head & Neck Surgery, 148, 612–620. doi:10.1001/jamaoto.2022.0900
Zeng, X., Jiang, Z., Luo, W., Li, H., Li, H., Li, G., . . . Li, Z. (2021, May 25). Efficient and accurate identification of ear diseases using an ensemble deep learning model. Scientific Reports, 11, 10839. doi:10.1038/s41598-021-90345-w
Additional Declarations
No competing interests reported.
{"props":{"pageProps":{"initialData":{"identity":"rs-5414065","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Article","associatedPublications":[],"authors":[{"id":383019123,"identity":"a8310746-3c3f-4285-b12d-69ca15250f64","order_by":0,"name":"Ting Wu","email":"","orcid":"","institution":"Southeast University","correspondingAuthor":false,"prefix":"","firstName":"Ting","middleName":"","lastName":"Wu","suffix":""},{"id":383019124,"identity":"2c17372f-7588-4b21-89ea-3aa1b00819b7","order_by":1,"name":"Yu Tang","email":"","orcid":"","institution":"Southeast University","correspondingAuthor":false,"prefix":"","firstName":"Yu","middleName":"","lastName":"Tang","suffix":""},{"id":383019125,"identity":"ea1ee43f-ab37-466f-8c15-7fd9f6dc1713","order_by":2,"name":"Zigang Che","email":"","orcid":"","institution":"Southeast University","correspondingAuthor":false,"prefix":"","firstName":"Zigang","middleName":"","lastName":"Che","suffix":""},{"id":383019126,"identity":"81379efc-4e40-437f-a5ff-10d34c917201","order_by":3,"name":"Jiangjiang Zhao","email":"","orcid":"","institution":"Southeast University","correspondingAuthor":false,"prefix":"","firstName":"Jiangjiang","middleName":"","lastName":"Zhao","suffix":""},{"id":383019127,"identity":"d84a124d-3162-4186-823d-480329e5d214","order_by":4,"name":"Jue Wang","email":"","orcid":"","institution":"Politecnico di Milano","correspondingAuthor":false,"prefix":"","firstName":"Jue","middleName":"","lastName":"Wang","suffix":""},{"id":383019128,"identity":"cb755fee-d6eb-45b1-9031-5f7166a143e1","order_by":5,"name":"Yanfeng Wu","email":"","orcid":"","institution":"University of Science and
Technology of China","correspondingAuthor":false,"prefix":"","firstName":"Yanfeng","middleName":"","lastName":"Wu","suffix":""},{"id":383019129,"identity":"be4f7a8e-d3d7-49e1-a40e-86bf96161f96","order_by":6,"name":"Wei Meng","email":"","orcid":"","institution":"Southeast University","correspondingAuthor":false,"prefix":"","firstName":"Wei","middleName":"","lastName":"Meng","suffix":""},{"id":383019130,"identity":"07e08a49-1298-407a-bbc8-916a23e5cfd7","order_by":7,"name":"Shuangba He","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAAsklEQVRIiWNgGAWjYBACAwYGNoYPQMwgQYoWxhkMbBKkaWHmAdlBtBZz9vZnj2138NWZz25gfFzxiwgtlj1nzI1zz7BJyNw5wGx4to8Yh93IYZPObWOTkJBIYJNs7CFGy/3nz6QtSdNyg8FMmhGmpeEHMVrO5Jgb9raxSc6QSGw2bGwgRsvx488e/Gw7xi8hkXzwYcMfIrRAwTEgZmxgYGwjXksNlCbBllEwCkbBKBg5AACqAzG9+hF53wAAAABJRU5ErkJggg==","orcid":"","institution":"Southeast University","correspondingAuthor":true,"prefix":"","firstName":"Shuangba","middleName":"","lastName":"He","suffix":""}],"badges":[],"createdAt":"2024-11-08 06:23:36","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-5414065/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-5414065/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":71925861,"identity":"4ce91422-dd9c-4bde-914f-6e415670e2a8","added_by":"auto","created_at":"2024-12-19 18:45:27","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":328449,"visible":true,"origin":"","legend":"\u003cp\u003eCloud-edge collaborative training framework\u003c/p\u003e","description":"","filename":"1.png","url":"https://assets-eu.researchsquare.com/files/rs-5414065/v1/b62b14c410f8847629bfe27e.png"},{"id":71925864,"identity":"f02cd1eb-3f51-4479-98d8-27b7d8700ef6","added_by":"auto","created_at":"2024-12-19 18:45:27","extension":"png","order_by":2,"title":"Figure 
2","display":"","copyAsset":false,"role":"figure","size":553678,"visible":true,"origin":"","legend":"\u003cp\u003e\u0026nbsp;YOLOv8 architecture\u003c/p\u003e","description":"","filename":"2.png","url":"https://assets-eu.researchsquare.com/files/rs-5414065/v1/5bee0291c19b0975a23a30fb.png"},{"id":71925859,"identity":"3cd2eac5-73f1-4f81-a808-5eb4c00581cd","added_by":"auto","created_at":"2024-12-19 18:45:27","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":182755,"visible":true,"origin":"","legend":"\u003cp\u003eMosaic augmentation procedures\u003c/p\u003e","description":"","filename":"3.png","url":"https://assets-eu.researchsquare.com/files/rs-5414065/v1/a67efdd5211bd091a075a70a.png"},{"id":71926377,"identity":"0e772dae-ad56-492e-83f8-938c008e7ffa","added_by":"auto","created_at":"2024-12-19 18:53:27","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":76482,"visible":true,"origin":"","legend":"\u003cp\u003eYOLO model family P-R diagram on testing set\u003c/p\u003e","description":"","filename":"4.png","url":"https://assets-eu.researchsquare.com/files/rs-5414065/v1/349b7dc582e6ded2c4d70bf0.png"},{"id":71926378,"identity":"4a85fdd8-b9b7-4d6c-b2d4-12ffaea4d814","added_by":"auto","created_at":"2024-12-19 18:53:27","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":738381,"visible":true,"origin":"","legend":"\u003cp\u003ePrediction results on CSOM\u003c/p\u003e","description":"","filename":"5.png","url":"https://assets-eu.researchsquare.com/files/rs-5414065/v1/275e3852f3bb9e729e32e8f8.png"},{"id":71925863,"identity":"aaec751c-b680-4905-8dc3-421f996510e5","added_by":"auto","created_at":"2024-12-19 18:45:27","extension":"png","order_by":6,"title":"Figure 6","display":"","copyAsset":false,"role":"figure","size":822281,"visible":true,"origin":"","legend":"\u003cp\u003ePrediction results on 
MEC\u003c/p\u003e","description":"","filename":"6.png","url":"https://assets-eu.researchsquare.com/files/rs-5414065/v1/8e10a10a5b03437a317a927b.png"},{"id":71925865,"identity":"57705159-cdfa-4d47-9cb5-53290e6f5a68","added_by":"auto","created_at":"2024-12-19 18:45:27","extension":"png","order_by":7,"title":"Figure 7","display":"","copyAsset":false,"role":"figure","size":707504,"visible":true,"origin":"","legend":"\u003cp\u003ePrediction results on normal cases\u003c/p\u003e","description":"","filename":"7.png","url":"https://assets-eu.researchsquare.com/files/rs-5414065/v1/7dd5d0f85fbfbd60b5c08c11.png"},{"id":72182943,"identity":"9c9bda12-3259-45be-903e-f61aed863058","added_by":"auto","created_at":"2024-12-23 12:54:11","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":4077894,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-5414065/v1/74ca999c-496c-4f57-8132-8dcae993f81c.pdf"}],"financialInterests":"No competing interests reported.","formattedTitle":"A Cloud-Edge Collaborative Model Training Framework for Assisted Classification of Middle Ear Diseases Based on Ultra-High-Resolution Temporal Bone CT Images","fulltext":[{"header":"1. Introduction","content":"\u003cp\u003eMore than three years have passed since the outbreak of the COVID-19 pandemic, and its impact is still influencing the global healthcare industry, which is in urgent need of cost reduction and improvement in the medical service accessibility, while facing challenges such as uneven distribution of local medical resources and shortage of skilled medical personnel. These challenges have prompted healthcare systems to adopt emerging technologies to fill these gaps. 
By promoting the application of advanced technologies such as cloud computing, the Internet of Things (IoT), and artificial intelligence (AI), healthcare services are evolving towards an information-enabled intelligent outlook. The construction of \u0026ldquo;Internet\u0026thinsp;+\u0026thinsp;Medical System\u0026rdquo; can effectively alleviate the growing demand for medical services and the shortage of medical resources.\u003c/p\u003e \u003cp\u003eHearing impairment has long been a significant ear condition affecting individuals across all age groups. Based on the location of the lesion, hearing loss can be classified into three main types: sensorineural hearing loss, conductive hearing loss, and mixed hearing loss. Out of all three types, the conductive hearing loss comprises a major proportion. The cause of conductive hearing loss is the lesions in the outer ear, middle ear, and inner ear, among which the middle ear lesions are a primary focus of research. The middle ear is in the petrous part of the temporal bone and consists of four sections: the tympanic cavity, eustachian tube, tympanic antrum, and mastoid process. The main function of the middle ear is to transmit sound energy from the air in the external auditory canal to the lymphatic fluid in the cochlea. This energy conversion, from gas to liquid, is achieved through the vibration of the tympanic membrane and the ossicular chain (Luers \u0026amp; H\u0026uuml;ttenbrink, 2016). This specific anatomical structure renders middle ear diseases accounting for the highest proportion of ear disorders, which not only directly affect hearing ability but may also lead to serious intracranial and extracranial complications such as peripheral facial palsy and brain abscesses (Cacco, et al., 2022). Chronic suppurative otitis media (CSOM) and middle ear cholesteatoma (MEC) are the most common middle ear diseases. 
CSOM is a chronic suppurative inflammation of the mucosa, periosteum, or even the bone tissue of the middle ear. The disease is often not confined to the tympanic cavity but invades the tympanic antrum, mastoid process, and eustachian tube. It is clinically characterized by long-term intermittent or persistent purulent discharge, tympanic membrane perforation, with or without hearing loss. It is globally one of the leading causes of preventable hearing loss in both children and adults (Bhutta, Leach, \u0026amp; Brennan-Jones, \u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2024\u003c/span\u003e). MEC is a cystic structure in the middle ear, rather than a true tumor, caused by the inward growth of squamous epithelium from the tympanic membrane towards the middle ear. The cyst contains desquamated squamous epithelium and keratinous material, which can erode the surrounding bone. Apart from hearing loss, it can cause serious intracranial and extracranial complications (Gilberto, et al., 2020). Both CSOM and MEC are featured with recurrent ear discharge, hearing loss and tinnitus at varying levels, making them similar in presentation. However, the treatment principles for the two differ. CSOM treatment aims at containing infection, ensuring drainage, removing lesions, restoring hearing ability, and eliminating the underlying cause (Silverstein, \u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e1972\u003c/span\u003e). For MEC, the first choice is to remove the lesioned tissue, prevent complications via surgery and to reconstruct the sound transmission structures while keeping the middle ear dry (Kuo, et al., 2015). 
Therefore, differential recognition between the two is crucial for clinical diagnosis and prognosis.\u003c/p\u003e \u003cp\u003eHigh-resolution computed tomography (HRCT) of the temporal bone is currently a relatively accurate imaging method for diagnosing middle ear diseases, as it can clearly reveal the internal anatomical structures and lesions of the middle ear (Baba, et al., 2022). However, the typical slice thickness of HRCT is 0.625 mm, and the reconstructed slice thickness is often 1.0 mm, restraining its ability to detect small or early-stage lesions, which may lead to missed or misdiagnosed middle ear diseases (Xu, et al., 2023). At present, ultra-high-resolution computed tomography (U-HRCT) specifically designed for otology demonstrated significant advantages in displaying the fine bone anatomy of the temporal bone. 10-\u0026micro;m level U-HRCT can reach a resolution of up to 0.05 mm, providing clear visualization of the fine structures in the middle ear, surrounding bones and even the state of ligaments. Compared to HRCT, its image quality is significantly improved. U-HRCT has shown great advancement in displaying the detailed bone anatomy of the temporal bone.\u003c/p\u003e \u003cp\u003eCurrently, computer-aided decision support models involving various machine learning algorithms (Zeng, et al., 2022; Sundgaard, et al., 2021; Zeng, et al., 2021) and convolutional neural networks (CNN) have been applied for middle ear disease detection using tympanic membrane images. Yan-Mei Wang, et al. proposed a deep learning framework to extract regions from temporal bone CT slices for CSOM and MEC diagnosis (Wang, et al., 2020). In the meantime, a middle ear disease detection model based on 3D CNN using temporal bone CT images is published (Su, et al., 2022). Both CSOM and MEC appear on CT images as soft tissue-like low-density masses partially filling the middle ear and mastoid cavities. 
However, typical CT images of MEC may also show progressive destruction of auditory structures and adjacent cranial bones, such as erosion of superior shield plate of the tympanic cavity or enlargement of the tympanic antrum opening. Regular HRCT may miss these features, leading to misdiagnosis and inappropriate treatment. U-HRCT shows great potential for the early detection of fine bone anatomy and small lesions in the temporal bone, but at present there is still no research on combining U-HRCT with artificial intelligence to differentiate between the most common middle ear diseases, CSOM and MEC.\u003c/p\u003e \u003cp\u003eTo address this, this study analyzes the U-HRCT imaging characteristics of CSOM and MEC. By leveraging the innate continuity of U-HRCT image data and its ability to display fine structures in the middle ear, this study proposes a \"cloud-edge\" collaborative training framework for middle ear diseases that integrates the Yolo recognition algorithm to realize the classification of U-HRCT images for MEC and CSOM. The research aims to alleviate the heavy workload of image interpretation, which is constrained by expert experience, using computer vision recognition technology, establishing a new system to promote the standardization of high-quality diagnostic technologies, reduce the workload and costs associated with middle ear disease diagnosis and treatment, and facilitate the accumulation of high-quality medical resources and the sharing of intelligent image recognition technology.\u003c/p\u003e"},{"header":"2. Materials and Methods","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e\n \u003ch2\u003e2.1 Data acquirement\u003c/h2\u003e\n \u003cp\u003eThis study collected the medical records and temporal bone U-HRCT imaging data of patients who underwent middle ear surgery at the Department of Otolaryngology of Nanjing Tongren Hospital. 
The screening process for the medical records was based on pathology, medical history, ear examinations, audiograms, and imaging results of the operated ear. Patients with congenital middle ear malformations, those undergoing repeated surgeries, and those with acute or chronic secretory otitis media were excluded. Eventually, the study included 400 ears of 205 patients in this experimental research. The medical records of all patients were independently reviewed by two otolaryngologists, each with more than 15 years of experience and the title of Associate Chief Physician, and a unanimous diagnosis was achieved. The study was approved by the Medical Ethics Committee of Nanjing Tongren Hospital (Approval No. 2024-03-006-k001). all methods were performed in accordance with the relevant guidelines and regulations. Given the retrospective nature of the study, the informed consent process was waived.\u003c/p\u003e\n \u003cp\u003eThe imaging experiment was conducted using the Ultra3D U-HRCT equipment from Beijing LargeV Instrument Co., Ltd. 
The imaging was performed in small-field-of-view and high-definition mode, with the scanning parameters listed in Table \u003cspan class=\"InternalRef\"\u003e1\u003c/span\u003e.\u003c/p\u003e\n \u003cp\u003e\u003c/p\u003e\u0026nbsp;\u003ctable id=\"Tab1\" border=\"1\"\u003e\n \u003ccaption language=\"En\"\u003e\n \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e\n \u003cdiv class=\"CaptionContent\"\u003e\n \u003cp\u003eImaging equipment parameters\u003c/p\u003e\n \u003c/div\u003e\n \u003c/caption\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eNumber\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eParameter\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eValue/Range\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eVoltage\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e100\u0026thinsp;~\u0026thinsp;110 kV\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCurrent\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e140\u0026thinsp;~\u0026thinsp;180mAs\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eImage matrix dimension\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e650\u0026times;650\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e4\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eScanning 
range\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e7 cm\u0026times;4 cm\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e5\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eReconstruction range\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e65mm\u0026times;65mm\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e6\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eSlice thickness\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.1mm\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e7\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eSlice interval\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.1mm\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n \u003cp\u003e\u003c/p\u003e\n \u003cp\u003eThe scanning range of the ear imaging covers from the superior part of the semicircular canal to the mastoid antrum, extending outwards to the tympanic part of the temporal bone and inward to the petrous apex of the temporal bone. A total of 4000 high-resolution axial U-HRCT images of the temporal bone were obtained. This dataset involves 70 patients with MEC (38 male patients and 32 female patients), 135 patients with CSOM (53 male patients and 82 female patients), and 20 control subjects in normal condition (10 males and 10 females). The ages of the collected patients ranged from 10 to 70 years. 
The summarization of the patient information is given in Table \u003cspan class=\"InternalRef\"\u003e2\u003c/span\u003e.\u003c/p\u003e\n \u003cp\u003e\u003c/p\u003e\u0026nbsp;\u003ctable id=\"Tab2\" border=\"1\"\u003e\n \u003ccaption language=\"En\"\u003e\n \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e\n \u003cdiv class=\"CaptionContent\"\u003e\n \u003cp\u003ePatient information summarization\u003c/p\u003e\n \u003c/div\u003e\n \u003c/caption\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\"\u003e\u0026nbsp;\u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eMale\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eFemale\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eMEC\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e38\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e32\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCSOM\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e53\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e82\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNormal\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e10\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e10\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eTotal\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e91\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e114\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n 
\u003cp\u003e\u003c/p\u003e\n \u003cp\u003eFor each patient, we selected approximately 10 to 20 CT images that clearly displayed well-determined lesions. Eventually we used 2295 CT images of CSOM, 1305 CT images of MEC, and 400 normal CT images for analysis.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec4\" class=\"Section2\"\u003e\n \u003ch2\u003e2.2 Cloud-edge collaborative training framework-based disease classification\u003c/h2\u003e\n \u003cp\u003eIn the proposed \u0026quot;cloud-edge\u0026quot; collaborative training framework for middle ear diseases, the edge devices are physically distributed and used as auxiliary classification tools for otolaryngologists or radiologists. These edge devices access imaging data from actual patients and utilize image recognition inference models to assist in classifying middle ear diseases. The cloud consists of high-performance processing devices to which the imaging data from the edge devices is transmitted for a unified training process. The trained models are then deployed to the edge devices for diagnosis and treatment support. Within this framework, cloud-edge collaboration enables data labeling and model optimization. The overall architecture is illustrated in Fig. \u003cspan class=\"InternalRef\"\u003e1\u003c/span\u003e.\u003c/p\u003e\n \u003cp\u003eDeep learning has demonstrated remarkable performance in image recognition (Wu, Liu, \u0026amp; Liu, \u003cspan class=\"CitationRef\"\u003e2019\u003c/span\u003e). The mainstream deep learning image recognition methods can be divided into two categories: two-stage methods based on region extraction and single-stage detection methods. 
The two-stage method represented by Regions with Convolutional Neural Networks (RCNN) first uses a region extractor to generate candidate object regions, and then employs deep neural networks for feature extraction and classification in each region (Girshick, Donahue, Darrell, \u0026amp; Malik, \u003cspan class=\"CitationRef\"\u003e2013\u003c/span\u003e). The single-stage detection method represented by You Only Look Once (YOLO) algorithms directly extract location and class information from the image, allowing for faster detection (Kim, Sung, \u0026amp; Park, \u003cspan class=\"CitationRef\"\u003e2020\u003c/span\u003e). Given the real-time requirements, functionality, and model generalizability of the cloud-edge collaborative training framework, this study adopted the YOLO algorithms as the model for identifying middle ear diseases.\u003c/p\u003e\n \u003cp\u003eThe YOLO algorithms are among the most commonly used deep learning algorithms for object detection (Kim, Sung, \u0026amp; Park, \u003cspan class=\"CitationRef\"\u003e2020\u003c/span\u003e). The first version was accomplished by Joseph Redmon in 2016, and after numerous optimizations and innovations, the most often used versions today are YOLOv8 (Jocher, Chaurasia, \u0026amp; Qiu, \u003cspan class=\"CitationRef\"\u003e2023\u003c/span\u003e), YOLOv9 (Wang, Yeh, \u0026amp; Liao, YOLOv9, 2024), and YOLOv10 (Wang, et al., \u003cspan class=\"CitationRef\"\u003e2024\u003c/span\u003e). The principal idea of YOLO is to divide the image into grid cells, and then practice simultaneously the prediction of the bounding boxes and the class probabilities for each grid cell using a multi-layer convolutional neural network. As shown in Fig. \u003cspan class=\"InternalRef\"\u003e2\u003c/span\u003e, the architecture of YOLOv8 consists of a backbone, neck, and head (Jocher, Chaurasia, \u0026amp; Qiu, \u003cspan class=\"CitationRef\"\u003e2023\u003c/span\u003e). 
The backbone of YOLOv8 utilizes the Cross Stage Partial Network (CSPNet) architecture (Wang, et al., CSPNet, 2019), which connects different layers in the deep learning network. It divides the input feature map of a layer into two parts: one part proceeds through the subsequent neural network layers, while the other part bypasses that segment of the network and is merged directly with the output of the part that has passed through it. CSPNet preserves gradient propagation during model training, improves training stability, and reduces the computational workload and memory usage. The combination with ELAN (Wang, Liao, \u0026amp; Yeh, Designing Network Design Strategies Through Gradient Path Analysis, 2022) further optimizes computational resource utilization and object detection performance. Additionally, YOLOv8 employs an optimized version of spatial pyramid pooling, Spatial Pyramid Pooling - Fast (SPPF), along with the SiLU activation function, improved loss functions, and training strategies to enhance efficiency and accuracy.\u003c/p\u003e\n \u003cp\u003eBuilding on YOLOv8, YOLOv9 addresses the problem of information loss during the feed-forward process, which affects convergence during training. It introduces Programmable Gradient Information (PGI) and Generalized ELAN (GELAN), and improves the architecture of the image detection network (Wang, Liao, \u0026amp; Yeh, Designing Network Design Strategies Through Gradient Path Analysis, 2022). YOLOv10 employs a new holistic efficiency-accuracy driven model design strategy, enhancing the architectural design and utilizing an NMS-free (Non-Maximum Suppression-free) post-processing method (Wang, et al., \u003cspan class=\"CitationRef\"\u003e2024\u003c/span\u003e). Both YOLOv9 and YOLOv10 show improved performance over YOLOv8 when tested on the MS COCO dataset.\u003c/p\u003e\n \u003cp\u003eEach YOLO version is offered in several model sizes. 
YOLOv8 includes five versions: YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x; YOLOv9 has five versions: YOLOv9t, YOLOv9s, YOLOv9m, YOLOv9c, and YOLOv9e; YOLOv10 offers six versions: YOLOv10n, YOLOv10s, YOLOv10m, YOLOv10b, YOLOv10l, and YOLOv10x. YOLOv8m, YOLOv9m, and YOLOv10m are suitable for general-purpose object detection. This study tested the performance of YOLOv8m, YOLOv9m, and YOLOv10m on the dataset within the cloud-edge collaborative training framework, providing a reference for model selection when constructing such a framework. Table \u003cspan class=\"InternalRef\"\u003e3\u003c/span\u003e summarizes the characteristics of these models.\u003c/p\u003e\n \u003cp\u003e\u003c/p\u003e\u0026nbsp;\u003ctable id=\"Tab3\" border=\"1\"\u003e\n \u003ccaption language=\"En\"\u003e\n \u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e\n \u003cdiv class=\"CaptionContent\"\u003e\n \u003cp\u003eNumber of parameters, FLOPs, and test results on MS COCO for each model (Wang, et al., \u003cspan class=\"CitationRef\"\u003e2024\u003c/span\u003e)\u003c/p\u003e\n \u003c/div\u003e\n \u003c/caption\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eModel\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003emAP50-95 (%)\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eParams (M)\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eFLOPs (G)\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eYOLOv8m\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e50.6\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e25.9\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e78.9\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd 
align=\"left\"\u003e\n \u003cp\u003eYOLOv9m\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e51.1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e20.0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e76.3\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eYOLOv10m\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e51.1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e16.46\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e63.4\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n \u003cp\u003e\u003c/p\u003e\n \u003cp\u003eThe framework proposed in this paper receives U-HRCT images from edge devices and then aggregates the data in the cloud for training. The system leverages the reliable data acquisition capabilities of U-HRCT devices, precise inference analysis at the edge, and efficient collaborative capabilities in the cloud to address two challenges: the fragmentation of multi-level information in the healthcare system, and the limited interdepartmental diagnostic capability for middle ear diseases, which is often constrained by expert experience. 
This method extracts the axial data containing the middle ear structural features from the patient\u0026apos;s temporal bone U-HRCT scans and utilizes the YOLO algorithm to build an auxiliary diagnostic model, facilitating low-latency diagnosis despite the computation-intensive workload.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec5\" class=\"Section2\"\u003e\n \u003ch2\u003e2.3 Experiment design\u003c/h2\u003e\n \u003cdiv id=\"Sec6\" class=\"Section3\"\u003e\n \u003ch2\u003e2.3.1 Data augmentation\u003c/h2\u003e\n \u003cp\u003eThe experiment employs the Mixup and Mosaic data augmentation algorithms to enhance the robustness of the model, thereby improving its generalization ability.\u003c/p\u003e\n \u003cp\u003eMixup was proposed jointly by researchers at MIT and Facebook in 2018. It is performed during the dataset loading phase and forms convex combinations of samples and of their labels to create new training samples. Unlike conventional data augmentation methods, Mixup alters both the samples and the labels simultaneously. It operates within a batch, mixing one batch of data with a randomly selected image from that batch. 
The formula is as follows:\u003c/p\u003e\n \u003cdiv id=\"Equa\" class=\"Equation\"\u003e\n \u003cdiv class=\"mathdisplay\" id=\"FileID_Equa\" name=\"EquationSource\"\u003e$$\\:x=\\lambda\\:{x}_{i}+(1-\\lambda\\:){x}_{j}$$\u003c/div\u003e\n \u003c/div\u003e\n \u003cdiv id=\"Equb\" class=\"Equation\"\u003e\n \u003cdiv class=\"mathdisplay\" id=\"FileID_Equb\" name=\"EquationSource\"\u003e$$\\:y=\\lambda\\:{y}_{i}+(1-\\lambda\\:){y}_{j}$$\u003c/div\u003e\n \u003c/div\u003e\n \u003cp\u003eHere, \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:({x}_{i},\\:{y}_{i})\\)\u003c/span\u003e\u003c/span\u003e and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:({x}_{j},\\:{y}_{j})\\)\u003c/span\u003e\u003c/span\u003e are two randomly selected feature-target vectors from the training dataset, while \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\lambda\\:\\)\u003c/span\u003e\u003c/span\u003e is a random number drawn from a given Beta distribution. Thus, Mixup provides continuous data samples between different data categories, directly expanding the distribution of the given training set and making the network perform better during the testing phase.\u003c/p\u003e\n \u003cp\u003eMosaic augmentation involves stitching images together through random scaling, random cropping, and random arrangement. This technique enriches the detection dataset by increasing the presence of small objects, thereby improving the network\u0026apos;s robustness. The steps are as follows:\u003c/p\u003e\n \u003c/div\u003e\n\u003c/div\u003e\n\u003cp\u003e1. Create a new mosaic canvas and randomly generate a point on the canvas.\u003c/p\u003e\n\u003cp\u003e\u003cspan\u003e\u003c/span\u003e\u003c/p\u003e\n\u003cp\u003e2. 
Place four images around the random point and incorporate the parts of these images that fall within the canvas.\u003c/p\u003e\n\u003cp\u003e\u003c/p\u003e\n\u003cp\u003eThe augmentation process is shown in Fig. \u003cspan class=\"InternalRef\"\u003e3\u003c/span\u003e.\u003c/p\u003e\n\u003cp\u003eThe four colors represent the four sample images. The parts that fall outside the canvas are discarded. The stitching process for the bottom left and bottom right images follows the same mechanism as that for the top left and top right images.\u003c/p\u003e\n\u003cdiv id=\"Sec8\" class=\"Section2\"\u003e\n \u003ch2\u003e2.3.2 Data labeling and model training\u003c/h2\u003e\n \u003cp\u003eConsidering the performance of the cloud-edge collaborative architecture, this study selected three models for experimental comparison: YOLOv8m, YOLOv9m, and YOLOv10m. Annotations were stored in Pascal VOC format, and the dataset was divided into training, validation, and test sets in a 50%/25%/25% ratio. The data labels included three categories: normal, CSOM, and MEC. Transfer learning was then performed using the aforementioned models. The initial weights were those of models pre-trained on the COCO dataset (Lin, et al., 2014), as provided by the Ultralytics library. Training ran for 100 epochs with a learning rate of 0.01, the Stochastic Gradient Descent (SGD) optimizer, and a batch size of 16. Automatic mixed precision was used throughout training.\u003c/p\u003e\n \u003cp\u003eThe study evaluated the inference performance of the models on the test set in terms of localization performance and classification performance. 
Localization performance is described by Intersection over Union (IoU), the ratio of the area of overlap between the predicted bounding box and the ground truth bounding box to the area of their union, as defined below:\u003c/p\u003e\n \u003cdiv id=\"Equc\" class=\"Equation\"\u003e\n \u003cdiv class=\"mathdisplay\" id=\"FileID_Equc\" name=\"EquationSource\"\u003e$$\\text{IoU}=\\frac{\\text{ground truth bounding box}\\:\\cap\\:\\text{predicted bounding box}}{\\text{ground truth bounding box}\\:\\cup\\:\\text{predicted bounding box}}$$\u003c/div\u003e\n \u003c/div\u003e\n \u003cp\u003eThe higher the IoU, the more closely the localization result aligns with the ground truth.\u003c/p\u003e\n \u003cp\u003eClassification performance is represented by precision, recall, and the F1 score. 
Precision and recall are defined as follows:\u003c/p\u003e\n \u003cdiv id=\"Equd\" class=\"Equation\"\u003e\n \u003cdiv class=\"mathdisplay\" id=\"FileID_Equd\" name=\"EquationSource\"\u003e$$\\text{Precision}=\\frac{{T}_{P}}{{T}_{P}+{F}_{P}}$$\u003c/div\u003e\n \u003c/div\u003e\n \u003cdiv id=\"Eque\" class=\"Equation\"\u003e\n \u003cdiv class=\"mathdisplay\" id=\"FileID_Eque\" name=\"EquationSource\"\u003e$$\\text{Recall}=\\frac{{T}_{P}}{{T}_{P}+{F}_{N}}$$\u003c/div\u003e\n \u003c/div\u003e\n \u003cp\u003eIn this study, when the IoU of the predicted result is greater than the \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\text{IoU}^{\\text{threshold}}\\)\u003c/span\u003e\u003c/span\u003e, the prediction is classified as a true positive \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{T}_{P}\\)\u003c/span\u003e\u003c/span\u003e; otherwise, it is considered a false positive \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{F}_{P}\\)\u003c/span\u003e\u003c/span\u003e. A ground truth object with no matching prediction counts as a false negative \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{F}_{N}\\)\u003c/span\u003e\u003c/span\u003e. Precision is the proportion of correct predictions among all positive predictions, so higher precision means that positive predictions are more reliable. 
Recall is the proportion of correctly identified positive samples among all actual positive samples, so higher recall indicates a better recognition rate for positive samples.\u003c/p\u003e\n \u003cp\u003eThe F-score combines precision and recall and is a commonly used metric to assess the predictive performance of the model (Goutte \u0026amp; Gaussier, \u003cspan class=\"CitationRef\"\u003e2005\u003c/span\u003e):\u003c/p\u003e\n \u003cdiv id=\"Equf\" class=\"Equation\"\u003e\n \u003cdiv class=\"mathdisplay\" id=\"FileID_Equf\" name=\"EquationSource\"\u003e$$\\text{F-score}=\\frac{(1+{\\beta}^{2})\\times\\text{Precision}\\times\\text{Recall}}{{\\beta}^{2}\\times\\text{Precision}+\\text{Recall}}$$\u003c/div\u003e\n \u003c/div\u003e\n \u003cp\u003eSetting \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\beta\\:}=1\\)\u003c/span\u003e\u003c/span\u003e gives the F1-score:\u003c/p\u003e\n \u003cdiv id=\"Equg\" class=\"Equation\"\u003e\n \u003cdiv class=\"mathdisplay\" id=\"FileID_Equg\" name=\"EquationSource\"\u003e$$\\text{F1}=\\frac{2\\times\\text{Precision}\\times\\text{Recall}}{\\text{Precision}+\\text{Recall}}$$\u003c/div\u003e\n \u003c/div\u003e\n \u003cp\u003eRecall and precision vary with the confidence threshold. By sweeping this threshold, a P-R curve is plotted with recall and precision as coordinates (Boyd, Eng, \u0026amp; Page, \u003cspan class=\"CitationRef\"\u003e2013\u003c/span\u003e). 
The area enclosed by this curve and the axes is referred to as Average Precision (AP), which is an important metric for evaluating the overall performance of object detection algorithms. The mean Average Precision (mAP) is defined as the average of AP across all categories at a specific IoU threshold. In this study, mAP was calculated at an IoU threshold of 50% (mAP50) and also averaged over IoU thresholds in the range \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:[50\\%,95\\%]\\)\u003c/span\u003e\u003c/span\u003e (mAP50-95).\u003c/p\u003e\n \u003cp\u003eThe experiments were conducted using Google Colab services, with the following machine configuration: Intel(R) Xeon(R) CPU @ 2.20 GHz (8 cores), NVIDIA Tesla T4 (15102 MB of graphics memory), operating system Linux 6.1.85, CUDA version 12.2, Python version 3.10.12, PyTorch version 2.3.1\u0026thinsp;+\u0026thinsp;cu121, and Ultralytics version 8.2.82.\u003c/p\u003e\n\u003c/div\u003e"},{"header":"3. Results","content":"\u003cp\u003eThis study collected and compared the inference results of the three YOLO models. We adopted processing time, recall, precision, F1-score, mAP50, and mAP50-95 as performance indicators. Table\u0026nbsp;\u003cspan refid=\"Tab4\" class=\"InternalRef\"\u003e4\u003c/span\u003e compares the three models, and Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e shows their P-R curves on the test set. 
Table\u0026nbsp;\u003cspan refid=\"Tab5\" class=\"InternalRef\"\u003e5\u003c/span\u003e, Table\u0026nbsp;\u003cspan refid=\"Tab6\" class=\"InternalRef\"\u003e6\u003c/span\u003e and Table\u0026nbsp;\u003cspan refid=\"Tab7\" class=\"InternalRef\"\u003e7\u003c/span\u003e present the prediction results of each model for normal, CSOM, and MEC cases and for all cases combined.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab4\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 4\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eYOLO model family prediction result comparison\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"4\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eModel\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eYOLOv8m\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eYOLOv9m\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eYOLOv10m\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003ePre-processing time (ms)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e3.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e3.2\u003c/p\u003e \u003c/td\u003e \u003ctd 
align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e3.1\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eInference time (ms)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e29.9\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e29.5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e33.8\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003ePost-processing time (ms)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e7.5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e12.6\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e4.2\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003ePrecision\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.957\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.917\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.952\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eRecall\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.967\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.95\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.872\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eF1-score\u003c/p\u003e 
\u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.962\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.933\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.911\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003emAP50\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.976\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.958\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.926\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003emAP50-95\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.956\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.949\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.915\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab5\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 5\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003ePrediction Results of YOLOv8m\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"6\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" 
colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eClass\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003ePrecision\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eRecall\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eF1-score\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003emAP50\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003emAP50-95\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eNormal\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.900\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.947\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.950\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.928\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCSOM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0.871\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.931\u003c/p\u003e \u003c/td\u003e 
\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.984\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.984\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMEC\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.995\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.957\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eAll\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0.957\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.967\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.962\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.976\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.956\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab6\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 6\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003ePrediction Results of YOLOv9m\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"6\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" 
colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eClass\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003ePrecision\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eRecall\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eF1-score\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003emAP50\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003emAP50-95\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eNormal\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.85\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.933\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.925\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.913\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCSOM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0.75\u003c/p\u003e \u003c/td\u003e 
\u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.857\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.955\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.955\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMEC\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.995\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.980\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eAll\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0.917\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.95\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.933\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.958\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.949\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab7\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 7\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003ePrediction Results of 
YOLOv10m\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"6\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eClass\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003ePrecision\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eRecall\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eF1-score\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003emAP50\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003emAP50-95\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eNormal\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.95\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.974\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.975\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.953\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" 
colname=\"c1\"\u003e \u003cp\u003eCSOM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0.857\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.667\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.75\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.809\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.809\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMEC\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.995\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.983\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eAll\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0.952\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.872\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.911\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.926\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.915\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003ePrediction results on CSOM, MEC and normal cases are shown correspondingly in 
Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e, Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003e and Fig.\u0026nbsp;\u003cspan refid=\"Fig7\" class=\"InternalRef\"\u003e7\u003c/span\u003e.\u003c/p\u003e "},{"header":"4. Discussion","content":"\u003cp\u003eAmong the three models, as shown in Table\u0026nbsp;\u003cspan refid=\"Tab4\" class=\"InternalRef\"\u003e4\u003c/span\u003e, YOLOv9m has a preprocessing time 0.1 ms longer than that of both YOLOv8m and YOLOv10m. Yet YOLOv9m has the shortest inference time, with YOLOv8m 0.4 ms slower, and YOLOv10m has the longest inference time of the three. In the postprocessing phase, thanks to its optimization techniques, YOLOv10m reduces postprocessing time substantially, to an average of 4.2 ms, which compensates for the lag in the preprocessing and inference stages.\u003c/p\u003e \u003cp\u003eIn terms of overall predictive performance, YOLOv8m outperforms both YOLOv9m and YOLOv10m in average precision, recall, mAP50, and mAP50-95 across the three recognition tasks. Both average precision and recall for YOLOv8m exceed 95%. YOLOv9m has an average recall of 95% but a relatively lower precision. YOLOv10m's average precision is close to that of YOLOv8m, but its recall is lower.\u003c/p\u003e \u003cp\u003eRegarding classification prediction results, YOLOv10m excels in predicting normal and MEC cases, achieving the highest precision, recall, mAP50, and mAP50-95 among the tested models. 
However, its performance in identifying CSOM is significantly lower than that of YOLOv8m and YOLOv9m, which lowers the overall evaluation of YOLOv10m.\u003c/p\u003e \u003cp\u003eAs demonstrated in Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e, Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003e and Fig.\u0026nbsp;\u003cspan refid=\"Fig7\" class=\"InternalRef\"\u003e7\u003c/span\u003e, all three models can effectively locate targets but misclassify them to different degrees. For instance, YOLOv10m failed to recognize CSOM in Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003ec. Figure\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003e indicates that all three models perform well in identifying MEC. Misclassifications in identifying normal cases occur across all models, as shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig7\" class=\"InternalRef\"\u003e7\u003c/span\u003e.\u003c/p\u003e \u003cp\u003eAlthough YOLOv8m has a higher proportion of false positives and YOLOv10m gives better predictions for normal and MEC cases, the architecture proposed in this study serves as an auxiliary classification tool. False positives can be filtered out by professionals on review, which is preferable to the diagnostic delays caused by false negatives.\u003c/p\u003e \u003cp\u003eIn summary, YOLOv8m outperforms YOLOv9m and YOLOv10m on the overall recognition metrics and in computational resource requirements. Because of the cloud-edge collaborative architecture, all three models are trained in the cloud, where they can be trained and tested simultaneously before deployment. The models can be quantized and packaged as containers for inference on incoming data at the edge. As the dataset expands, the cloud can iteratively retrain these models and update the edge models as needed.\u003c/p\u003e"},{"header":"5. 
Conclusions","content":"\u003cp\u003eThis study explores the significance of temporal bone U-HRCT imaging in the diagnosis of CSOM and MEC and proposes a framework for assisted classification using a \"cloud-edge\" collaborative training architecture. In this framework, multiple image recognition models were trained and tested in the cloud, and their results were compared. Experimental results show that model performance can be assessed using indicators such as precision, recall, and mAP. By leveraging the \"cloud-edge\" collaborative training architecture, image recognition models can be trained simultaneously in the cloud, and the best-performing models can be deployed to the edge devices. This approach maximizes data utilization and fully explores the diversity of image recognition algorithms, ensuring high target recognition accuracy on edge devices.\u003c/p\u003e \u003cp\u003eCurrently, large vision models like Vision Transformers (ViT) offer better recognition capabilities; however, because U-HRCT equipment has been on the market for a relatively short time, the sample size is limited, making it difficult to train these large models effectively. Future work will focus on continuously expanding the dataset so that large models, with their detailed and precise features, can be trained and tested. 
Comparative studies of the performance differences between large models and the existing models will be conducted to explore the application of large models in U-HRCT image processing.\u003c/p\u003e"},{"header":"Abbreviations","content":"\u003cp\u003eCSOM = chronic suppurative otitis media\u003c/p\u003e\n\u003cp\u003eMEC = middle ear cholesteatoma\u003c/p\u003e\n\u003cp\u003eHRCT = high-resolution computed tomography\u003c/p\u003e\n\u003cp\u003eU-HRCT = ultra-high-resolution computed tomography\u003c/p\u003e\n\u003cp\u003eYOLO = You Only Look Once\u003c/p\u003e\n\u003cp\u003eIoU = Intersection over Union\u003c/p\u003e\n\u003cp\u003eAP = average precision\u003c/p\u003e\n\u003cp\u003emAP = mean average precision\u003c/p\u003e"},{"header":"Declarations","content":"\u003ch2\u003eAuthor Contribution\u003c/h2\u003e\u003cp\u003eTing Wu and Yu Tang contributed equally to this work and share first authorship. All authors contributed to the study conception and design. Material preparation, data collection, and analysis were performed by T.W., Y.T. and Z.G.C. The first draft of the manuscript was written by T.W., Y.T. and W.M. J.J.Z. and S.B.H. designed and performed the experiments and provided critical comments. Y.W. and J.W. drew the figures. All authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.\u003c/p\u003e\u003ch2\u003eData Availability\u003c/h2\u003e\u003cp\u003eThe data that support the findings of this study are available from Nanjing Tongren Hospital, School of Medicine, Southeast University, but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Data are, however, available from the authors upon reasonable request and with permission of Nanjing Tongren Hospital, School of Medicine, Southeast University. The data from this study can be obtained by contacting Ting Wu (
[email protected]).\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n \u003cli\u003eBaba, A., Kurokawa, R., Kurokawa, M., Ota, Y., Matsushima, S., Fukuda, T., . . . Ojiri, H. (2022, June). Preoperative prediction for mastoid extension of middle ear cholesteatoma using temporal subtraction serial HRCT studies. \u003cem\u003eEuropean Radiology, 32\u003c/em\u003e, 3631\u0026ndash;3638. doi:10.1007/s00330-021-08453-0\u003c/li\u003e\n \u003cli\u003eBhutta, M. F., Leach, A. J., \u0026amp; Brennan-Jones, C. G. (2024, May 25). Chronic suppurative otitis media. \u003cem\u003eLancet (London, England), 403\u003c/em\u003e, 2339\u0026ndash;2348. doi:10.1016/S0140-6736(24)00259-9\u003c/li\u003e\n \u003cli\u003eBoyd, K., Eng, K. H., \u0026amp; Page, C. D. (2013). Area under the Precision-Recall Curve: Point Estimates and Confidence Intervals. In C. Salinesi, M. C. Norrie, \u0026amp; \u0026Oacute;. Pastor (Eds.), \u003cem\u003eAdvanced Information Systems Engineering\u003c/em\u003e (Vol. 7908, pp. 451\u0026ndash;466). Berlin, Heidelberg: Springer Berlin Heidelberg. doi:10.1007/978-3-642-40994-3_29\u003c/li\u003e\n \u003cli\u003eCacco, T., Africano, S., Gaglio, G., Carmisciano, L., Piccirillo, E., Castello, E., \u0026amp; Peretti, G. (2022, February). Correlation between peri-operative complication in middle ear cholesteatoma surgery using STAMCO, ChOLE, and SAMEO-ATO classifications. \u003cem\u003eEuropean archives of oto-rhino-laryngology: official journal of the European Federation of Oto-Rhino-Laryngological Societies (EUFOS): affiliated with the German Society for Oto-Rhino-Laryngology - Head and Neck Surgery, 279\u003c/em\u003e, 619\u0026ndash;626. doi:10.1007/s00405-021-06679-8\u003c/li\u003e\n \u003cli\u003eGilberto, N., Cust\u0026oacute;dio, S., Cola\u0026ccedil;o, T., Santos, R., Sousa, P., \u0026amp; Escada, P. (2020, April). Middle ear congenital cholesteatoma: systematic review, meta-analysis and insights on its pathogenesis. 
\u003cem\u003eEuropean archives of oto-rhino-laryngology: official journal of the European Federation of Oto-Rhino-Laryngological Societies (EUFOS): affiliated with the German Society for Oto-Rhino-Laryngology - Head and Neck Surgery, 277\u003c/em\u003e, 987\u0026ndash;998. doi:10.1007/s00405-020-05792-4\u003c/li\u003e\n \u003cli\u003eGirshick, R., Donahue, J., Darrell, T., \u0026amp; Malik, J. (2013). Rich feature hierarchies for accurate object detection and semantic segmentation. \u003cem\u003eRich feature hierarchies for accurate object detection and semantic segmentation\u003c/em\u003e. arXiv. doi:10.48550/ARXIV.1311.2524\u003c/li\u003e\n \u003cli\u003eGoutte, C., \u0026amp; Gaussier, E. (2005). A Probabilistic Interpretation of Precision, Recall and F-Score, with Implication for Evaluation. In D. E. Losada, \u0026amp; J. M. Fern\u0026aacute;ndez-Luna (Eds.), \u003cem\u003eAdvances in Information Retrieval\u003c/em\u003e (Vol. 3408, pp. 345\u0026ndash;359). Berlin, Heidelberg: Springer Berlin Heidelberg. doi:10.1007/978-3-540-31865-1_25\u003c/li\u003e\n \u003cli\u003eJocher, G., Chaurasia, A., \u0026amp; Qiu, J. (2023). Ultralytics YOLOv8. \u003cem\u003eUltralytics YOLOv8\u003c/em\u003e. Retrieved from https://github.com/ultralytics/ultralytics\u003c/li\u003e\n \u003cli\u003eKim, J.-a., Sung, J.-Y., \u0026amp; Park, S.-h. (2020, November). Comparison of Faster-RCNN, YOLO, and SSD for Real-Time Vehicle Type Recognition. \u003cem\u003e2020 IEEE International Conference on Consumer Electronics - Asia (ICCE-Asia)\u003c/em\u003e, (pp. 1\u0026ndash;4). doi:10.1109/ICCE-Asia49877.2020.9277040\u003c/li\u003e\n \u003cli\u003eKuo, C.-L., Shiao, A.-S., Yung, M., Sakagami, M., Sudhoff, H., Wang, C.-H., . . . Lien, C.-F. (2015). Updates and knowledge gaps in cholesteatoma research. \u003cem\u003eBioMed Research International, 2015\u003c/em\u003e, 854024. 
doi:10.1155/2015/854024\u003c/li\u003e\n \u003cli\u003eLin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., . . . Zitnick, C. L. (2014). Microsoft COCO: Common Objects in Context. In D. Fleet, T. Pajdla, B. Schiele, \u0026amp; T. Tuytelaars (Eds.), \u003cem\u003eComputer Vision \u0026ndash; ECCV 2014\u003c/em\u003e (Vol. 8693, pp. 740\u0026ndash;755). Cham: Springer International Publishing. doi:10.1007/978-3-319-10602-1_48\u003c/li\u003e\n \u003cli\u003eLuers, J. C., \u0026amp; H\u0026uuml;ttenbrink, K.-B. (2016, February). Surgical anatomy and pathology of the middle ear. \u003cem\u003eJournal of Anatomy, 228\u003c/em\u003e, 338\u0026ndash;353. doi:10.1111/joa.12389\u003c/li\u003e\n \u003cli\u003eSilverstein, H. (1972, August 10). Surgery for chronic suppurative otitis media. \u003cem\u003eThe New England Journal of Medicine, 287\u003c/em\u003e, 287\u0026ndash;290. doi:10.1056/NEJM197208102870607\u003c/li\u003e\n \u003cli\u003eSu, R., Song, J., Wang, Z., Mao, S., Mao, Y., Wu, X., \u0026amp; Hou, M. (2022, August 28). Application of high resolution computed tomography image assisted classification model of middle ear diseases based on 3D-convolutional neural network. \u003cem\u003eZhong Nan Da Xue Xue Bao. Yi Xue Ban = Journal of Central South University. Medical Sciences, 47\u003c/em\u003e, 1037\u0026ndash;1048. doi:10.11817/j.issn.1672-7347.2022.210704\u003c/li\u003e\n \u003cli\u003eSundgaard, J. V., Harte, J., Bray, P., Laugesen, S., Kamide, Y., Tanaka, C., . . . Christensen, A. N. (2021, July). Deep metric learning for otitis media classification. \u003cem\u003eMedical Image Analysis, 71\u003c/em\u003e, 102034. doi:10.1016/j.media.2021.102034\u003c/li\u003e\n \u003cli\u003eWang, A., Chen, H., Liu, L., Chen, K., Lin, Z., Han, J., \u0026amp; Ding, G. (2024). YOLOv10: Real-Time End-to-End Object Detection. \u003cem\u003eYOLOv10: Real-Time End-to-End Object Detection\u003c/em\u003e. arXiv. 
doi:10.48550/ARXIV.2405.14458\u003c/li\u003e\n \u003cli\u003eWang, C.-Y., Liao, H.-Y. M., \u0026amp; Yeh, I.-H. (2022). Designing Network Design Strategies Through Gradient Path Analysis. \u003cem\u003eDesigning Network Design Strategies Through Gradient Path Analysis\u003c/em\u003e. arXiv. doi:10.48550/ARXIV.2211.04800\u003c/li\u003e\n \u003cli\u003eWang, C.-Y., Liao, H.-Y. M., Yeh, I.-H., Wu, Y.-H., Chen, P.-Y., \u0026amp; Hsieh, J.-W. (2019). CSPNet: A New Backbone that can Enhance Learning Capability of CNN. \u003cem\u003eCSPNet: A New Backbone that can Enhance Learning Capability of CNN\u003c/em\u003e. arXiv. doi:10.48550/ARXIV.1911.11929\u003c/li\u003e\n \u003cli\u003eWang, C.-Y., Yeh, I.-H., \u0026amp; Liao, H.-Y. M. (2024). YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. \u003cem\u003eYOLOv9: Learning What You Want to Learn Using Programmable Gradient Information\u003c/em\u003e. arXiv. doi:10.48550/ARXIV.2402.13616\u003c/li\u003e\n \u003cli\u003eWang, Y.-M., Li, Y., Cheng, Y.-S., He, Z.-Y., Yang, J.-M., Xu, J.-H., . . . Ren, D.-D. (2020). Deep Learning in Automated Region Proposal and Diagnosis of Chronic Otitis Media Based on Computed Tomography. \u003cem\u003eEar and Hearing, 41\u003c/em\u003e, 669\u0026ndash;677. doi:10.1097/AUD.0000000000000794\u003c/li\u003e\n \u003cli\u003eWu, H., Liu, Q., \u0026amp; Liu, X. (2019). A Review on Deep Learning Approaches to Image Classification and Object Segmentation. \u003cem\u003eComputers, Materials \u0026amp; Continua, 60\u003c/em\u003e, 575\u0026ndash;597. doi:10.32604/cmc.2019.03595\u003c/li\u003e\n \u003cli\u003eXu, N., Ding, H., Tang, R., Li, X., Zhang, Z., Lv, H., . . . Zhao, P. (2023, November 28). Comparative study of the sensitivity of ultra-high-resolution CT and high-resolution CT in the diagnosis of isolated fenestral otosclerosis. \u003cem\u003eInsights into Imaging, 14\u003c/em\u003e, 211. 
doi:10.1186/s13244-023-01562-y\u003c/li\u003e\n \u003cli\u003eZeng, J., Kang, W., Chen, S., Lin, Y., Deng, W., Wang, Y., . . . Cai, Y. (2022, July 1). A Deep Learning Approach to Predict Conductive Hearing Loss in Patients With Otitis Media With Effusion Using Otoscopic Images. \u003cem\u003eJAMA otolaryngology\u0026ndash; head \u0026amp; neck surgery, 148\u003c/em\u003e, 612\u0026ndash;620. doi:10.1001/jamaoto.2022.0900\u003c/li\u003e\n \u003cli\u003eZeng, X., Jiang, Z., Luo, W., Li, H., Li, H., Li, G., . . . Li, Z. (2021, May 25). Efficient and accurate identification of ear diseases using an ensemble deep learning model. \u003cem\u003eScientific Reports, 11\u003c/em\u003e, 10839. doi:10.1038/s41598-021-90345-w\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"","lastPublishedDoi":"10.21203/rs.3.rs-5414065/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-5414065/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003ch2\u003eObjective\u003c/h2\u003e \u003cp\u003eCholesteatoma and otitis media are two of the most common middle ear diseases, of which the treatment principles are different, making the differentiation between them of significant importance. Both chronic suppurative otitis media (CSOM) and middle ear cholesteatoma (MEC) can appear on CT images as low-density soft tissue-like masses partially filling the middle ear and mastoid cavities. However, typical CT imaging of MEC may show progressive destruction of auditory structures and adjacent cranial bones. Compared to high-resolution CT (HRCT), ultra-high-resolution CT (U-HRCT) offers inherent continuity and a more detailed display of the fine structures of the middle ear. This study proposes a \"cloud-edge\" collaborative training framework for middle ear disease classification that exploits temporal bone U-HRCT imaging data. By integrating the YOLO recognition algorithm, this framework aims to achieve auxiliary classification of MEC and CSOM based on U-HRCT images.\u003c/p\u003e\u003ch2\u003eDesign:\u003c/h2\u003e \u003cp\u003eIn the cloud-edge collaborative framework, the edge devices acquire U-HRCT imaging data and perform auxiliary classification of middle ear diseases using image recognition and inference techniques. 
The imaging data collected by the edge devices are transmitted to the cloud, where a unified model training process is executed, and the model containers are then deployed to the edge devices for future auxiliary diagnosis. The framework employed Mixup and Mosaic methods for data augmentation to enhance model robustness and improve generalization performance. The object detection models of the You Only Look Once (YOLO) family was used, and the final model selection was made based on their performance.\u003c/p\u003e\u003ch2\u003eResults\u003c/h2\u003e \u003cp\u003eThis study found that this cloud-edge collaborative framework can effectively classify temporal bone U-HRCT imaging data for MEC and CSOM. In the test set, the framework successfully collected real CT image data, performed data processing and conducted model training as designed. Eventually, multiple models were trained, with different levels of detection ability assessed by selected metrics, allowing for trade-offs in model selection considering computation time and accuracy. The selected model was then deployed to the edge, where they performed auxiliary classification tasks at the edge device.\u003c/p\u003e\u003ch2\u003eConclusions\u003c/h2\u003e \u003cp\u003eThis study discussed the significance of temporal bone U-HRCT imaging in the diagnosis of CSOM and MEC and proposed a cloud-edge collaborative model training framework for auxiliary classification from U-HRCT imaging data. 
This approach maximizes the utility of the data, fully leverages the diversity of image recognition algorithms, and ensures a high level of accuracy in classification.\u003c/p\u003e","manuscriptTitle":"A Cloud-Edge Collaborative Model Training Framework for Assisted Classification of Middle Ear Diseases Based on Ultra-High-Resolution Temporal Bone CT Images","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2024-12-19 18:45:22","doi":"10.21203/rs.3.rs-5414065/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"d229cda1-2c9d-4d4a-b2be-b4a7041f1d40","owner":[],"postedDate":"December 19th, 2024","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[{"id":40796625,"name":"Health sciences/Medical research"},{"id":40796626,"name":"Physical sciences/Mathematics and computing"}],"tags":[],"updatedAt":"2024-12-23T12:53:55+00:00","versionOfRecord":[],"versionCreatedAt":"2024-12-19 18:45:22","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-5414065","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-5414065","identity":"rs-5414065","version":["v1"]},"buildId":"qtupq5eGEP_6zYnWcrvyt","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}