Tooth-to-white spot lesion YOLO: a novel model for white spot lesion detection

Research Article | Research Square

Hau Man Chung, Jingjing Ke, Mengdan Zhang, Lixian Kong, Junming Zheng, and 1 more

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-7058696/v1
This work is licensed under a CC BY 4.0 License.
Status: Published. Journal publication published 09 Oct, 2025; the published version appears in BMC Oral Health.

Abstract

Background: To develop a new deep learning model for detecting white spot lesions (WSLs), which are commonly observed in patients undergoing orthodontic treatment, and assess its accuracy. Methods: A total of 653 intra-oral photographs of WSLs were collected and annotated.
Our novel model, tooth-to-WSL You Only Look Once (TW-YOLO), and the original YOLOv5 model were fine-tuned and evaluated, with 457 photographs used for training, 130 for validation, and 66 for external testing. Cohen's kappa coefficient between model prediction and orthodontist annotation was used as the primary evaluation metric; mean average precision (mAP@0.5:0.95), average precision (AP@0.5), F1 score, and accuracy were also evaluated. The score-CAM technique was used for explainability analysis. Results: Cohen's kappa coefficient values were 0.76 and 0.62 for TW-YOLO and YOLOv5, respectively. The mAP@0.5:0.95 was 0.51 for TW-YOLO and 0.45 for YOLOv5. Explainability analysis suggested that the TW-YOLO model could implicitly learn the distribution pattern of WSLs by shifting more attention toward these regions. Conclusion: The novel TW-YOLO model demonstrated not only improved accuracy but also the potential to be applied in other related dentistry studies.

Keywords: white spot lesions, object detection, deep learning, explainability analysis

Introduction

The stage before cavitation in the development of dental caries is called a white spot lesion (WSL). It is characterized by areas of subsurface demineralization formed under an intact enamel surface. WSLs alter the translucency of the enamel, so the affected areas appear opaque white. The reported incidence of WSLs varies widely, but on average, such decalcifications are found in 30–70% of patients during orthodontic treatment[1]. This high incidence calls for attention from both patients and practitioners. Early detection of WSLs during orthodontic treatment would allow preventive measures to be implemented and the demineralization process to be controlled before it progresses further[2].
Various techniques are used to detect WSLs, including visual inspection, photography, light-induced fluorescence, and quantitative laser analysis. The enamel decalcification index (EDI), a visual inspection tool designed to categorize and assess the presence and severity of enamel defects[ 3 ], not only has an accuracy comparable to that of light-induced fluorescence[ 4 ] but also is easier to implement and can sometimes be used for patient self-monitoring, which is helpful for early detection[ 5 ]. However, using the EDI requires professional training, which limits its applicability in patient self-monitoring. This method is also time-consuming, which makes it impractical for large-scale screening and dynamic monitoring during orthodontic processes. Recent advances in the field of deep learning have shown promising potential for streamlining the routine work of dental caregivers and empowering patient self-monitoring[ 6 ]. Many studies have evaluated the classification ability (the ability to differentiate between images of teeth with and without lesions) of deep learning models. Askar et al. demonstrated that SqueezeNet, a convolutional neural network, performs this task with great accuracy[ 7 ]. Determining the size and location of these lesions, which usually involves object detection and semantic segmentation, is the next step. This is especially relevant in the prevention and management of WSLs because location and size influence esthetic outcomes. Studies indicate that deep learning models can localize and identify caries in bitewing X-rays with a recall of 0.727 and an F1 score of 0.687[ 8 , 9 ]. Furthermore, Casalegno et al. used a U-Net–like network architecture for caries segmentation in near-infrared transillumination (TI) images, achieving a mean intersection over union (mIOU) of 72.7% and an area under the ROC curve (AUC) of 85.6%[ 10 ]. 
These findings collectively demonstrate that deep learning models exhibit not only high processing speed but also substantial diagnostic accuracy in caries detection. Compared with the images analyzed in these studies, in which the lesions occupy most of the image, intra-oral photographs are usually much larger, and the WSLs occupy a much smaller proportion of the frame. As demonstrated by Ozsunkar et al.[11], applying YOLOv5 to detect WSLs in intra-oral photographs end to end, by directly downsizing the images before detection, yields poor accuracy. Therefore, a novel deep learning model is needed for this specific task. Inspired by the task partitioning paradigm[12] and the sliding window strategy[13, 14], we developed a tooth-to-WSL You Only Look Once (TW-YOLO) model and compared its accuracy metrics with those of YOLOv5. To the best of our knowledge, our study is the first to apply explainability analysis to object detection models in dental medicine to gain a better understanding of the mechanisms behind model decision-making.

Materials and Methods

Data collection

Anonymized intra-oral photographs of orthodontic patients with WSLs were acquired from image archives in the Orthodontics Department of Foshan Stomatological Hospital, Foshan University. The protocol of the current study was approved by the Ethics Committee of the Foshan Stomatological Hospital, Foshan University (2024-FSKQ-LW-002). All patients signed informed consent at the beginning of their treatment sessions for their anonymized image data to be used for medical research purposes. In total, 653 anonymized intra-oral photographs were collected, a number greater than those in previous studies[7, 15, 16]. All intra-oral photographs were taken from patients receiving fixed-appliance orthodontic treatment, either before the treatment commenced or after appliance removal, with a digital reflex camera (Canon EOS 60D, Canon Corp., Tokyo, Japan).
The image resolution was approximately 4000 by 3000 pixels. Tooth surfaces were cleaned and dried prior to intra-oral photography.

Image annotation and data augmentation

All image data were first annotated manually by two senior orthodontic specialists with LabelImg[17] by drawing bounding boxes around WSLs. The specialists first annotated 10 cases together to reach a consensus, and then each specialist continued the annotation work independently. Finally, only regions selected by both orthodontic specialists were kept as annotations. Bounding boxes around individual teeth were drawn by a junior researcher. The 653 intra-oral photographs were split into three groups (Fig. 1): 457 images for training, 130 for validation, and 66 for external testing. Following a previous study[18], all images and their annotation data were first randomly split into train-validation (90% of all images) and external testing (10% of all images) datasets. The images in the train-validation dataset were used for model training and validation, and the images in the external testing dataset were used for performance evaluation. The images in the external testing dataset were never seen by the deep learning models during training, so memorization of training images[19] could not inflate the reported performance. Data augmentation techniques, including random cropping and scaling, image rotation, and affine transformations, were employed to expand the effective dataset scale and enhance the spatial robustness of the deep learning model. The minimum sample size was estimated based on the approach outlined in a previous study[20]. Based on a pilot study, we estimated Cohen's kappa coefficient values of 0.6 and 0.77 for YOLOv5 and TW-YOLO, respectively. On average, WSLs covered 15% of the tooth enamel surfaces in the photographs. Given these parameters, at least 505 teeth with WSLs were needed. The external testing dataset included 548 teeth with WSLs, satisfying the minimum sample size requirement.
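The three-way split described above (457 training, 130 validation, and 66 external testing images out of 653) can be illustrated with a short sketch. The text does not say how the train-validation images were divided further or which random seed was used, so the single shuffle, the seed value, and the image identifiers below are assumptions made purely for illustration.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(
    image_ids: Sequence[str],
    n_train: int = 457,
    n_val: int = 130,
    n_test: int = 66,
    seed: int = 0,  # hypothetical seed; the study does not report one
) -> Tuple[List[str], List[str], List[str]]:
    """Randomly split image IDs into train / validation / external test sets."""
    ids = list(image_ids)
    if len(ids) != n_train + n_val + n_test:
        raise ValueError("split sizes must add up to the number of images")
    rng = random.Random(seed)
    rng.shuffle(ids)
    # Hold out the external test images first, then divide the remainder.
    test = ids[:n_test]
    val = ids[n_test:n_test + n_val]
    train = ids[n_test + n_val:]
    return train, val, test


# Hypothetical identifiers standing in for the 653 photographs.
image_ids = [f"photo_{i:04d}" for i in range(653)]
train_ids, val_ids, test_ids = split_dataset(image_ids)
```

Holding out the external test images before any other step keeps them isolated from every later training or model-selection decision, which is the property the text relies on.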
TW-YOLO model architecture

To mitigate the excessive downsizing that conventional image preprocessing imposes on high-resolution intra-oral photographs, we developed the novel TW-YOLO model (Fig. 2). First, intra-oral photographs (approximately 4000 × 3000 pixels) undergo standard proportional resizing to fit within a 640 × 640 pixel square. The resized images are then processed by a YOLOv5s network to localize teeth. Non-maximum suppression (NMS) is subsequently applied to retain only non-overlapping predicted bounding boxes. Based on the bounding boxes for the teeth, the image region encompassing all detected teeth is calculated. This region is then cropped out of the original, full-resolution image. Within this cropped input, tiled image extraction is performed using sliding 640 × 640 pixel windows. Adjacent windows overlap by 50 pixels so that lesions lying on a window border are not missed. Next, all extracted image tiles, along with the entire cropped input, are processed individually by the YOLOv5l network to detect WSLs. This allows localized detection within the tiles and holistic detection on the cropped input to run side by side. During the training phase, each extracted tile is treated as an independent training sample. For the prediction phase, the detection results from all processed tiles and the cropped input are mapped back onto the coordinate space of the original, uncropped image. A final NMS step is applied to eliminate significantly overlapping bounding boxes, retaining the post-processed detections.

Fine-tuning of YOLO and our TW-YOLO model

YOLO models have parameter counts in the millions (YOLOv5s: 7.2 million; YOLOv5l: 46.5 million), so training such models from scratch is impractical. Therefore, we initialized both models using pretrained weights (trained on the Common Objects in Context [COCO] dataset) as starting points for fine-tuning.
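The tiling and merging steps of the TW-YOLO architecture section can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, the tooth boxes are assumed to be already scaled back to full-resolution coordinates, the last window along each axis is assumed to be shifted back so that it ends at the crop border, and the 0.5 IOU threshold for the final NMS is a placeholder because the text does not report one.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def union_box(boxes: List[Box]) -> Box:
    """Smallest rectangle that contains every detected tooth box."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


def window_starts(length: int, tile: int = 640, overlap: int = 50) -> List[int]:
    """Start offsets of sliding windows along one axis.

    Assumption: the final window is shifted back to end at the border.
    """
    if length <= tile:
        return [0]
    stride = tile - overlap
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)
    return starts


def make_tiles(crop: Box, tile: int = 640, overlap: int = 50) -> List[Box]:
    """Tile rectangles, in original-image coordinates, covering the crop."""
    x0, y0, x1, y1 = (int(v) for v in crop)
    tiles = []
    for ty in window_starts(y1 - y0, tile, overlap):
        for tx in window_starts(x1 - x0, tile, overlap):
            tiles.append((x0 + tx, y0 + ty,
                          min(x0 + tx + tile, x1), min(y0 + ty + tile, y1)))
    return tiles


def to_original(box: Box, tile: Box) -> Box:
    """Shift a box predicted in tile coordinates back to the full image."""
    return (box[0] + tile[0], box[1] + tile[1],
            box[2] + tile[0], box[3] + tile[1])


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(dets: List[Tuple[Box, float]], iou_thr: float = 0.5) -> List[Tuple[Box, float]]:
    """Greedy NMS: keep the highest-scoring box among overlapping ones."""
    kept: List[Tuple[Box, float]] = []
    for box, score in sorted(dets, key=lambda d: d[1], reverse=True):
        if all(iou(box, k[0]) < iou_thr for k in kept):
            kept.append((box, score))
    return kept


# Example with two made-up tooth boxes in a 4000 x 3000 photograph.
teeth = [(1000, 900, 1400, 1500), (2600, 1000, 3000, 1600)]
crop = union_box(teeth)
tiles = make_tiles(crop)

# The same lesion seen in two overlapping tiles (scores are invented).
dets = [
    (to_original((100, 200, 160, 250), tiles[1]), 0.9),
    (to_original((100, 140, 160, 190), tiles[5]), 0.7),
]
merged = nms(dets)
```

In the example, both tile-level detections map to the same full-image box, so the final NMS keeps only the higher-scoring one. This is the situation the 50-pixel overlap creates by design.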
Transfer learning from COCO-pretrained weights leveraged the generic feature extraction capabilities acquired from broad image data while incorporating domain-specific knowledge pertinent to orthodontic WSL detection. Fine-tuning of the YOLO network was performed by training models for 500 epochs with adaptive optimization hyperparameters. The loss function, comprising object detection loss (computed via intersection over union [IOU]) and classification loss (computed via cross-entropy), was evaluated in each epoch and backpropagated to update model weights.

Model performance evaluation

After training, model performance was evaluated using the testing dataset. The primary evaluation metric was the pixel-wise Cohen's kappa coefficient. We adopted this metric to evaluate the agreement between our orthodontists and the models regarding the boundaries of the WSLs. Other accuracy metrics included overall mean average precision (mAP@0.5:0.95), which was first introduced by the COCO detection challenge and has since become the most common evaluation metric for object detection accuracy. Average precision at the 0.5 IOU threshold (AP@0.5) and F1 score were chosen as secondary accuracy metrics. The evaluation metrics, including IOU, precision (P), recall (R), mAP@0.5:0.95, AP@0.5, and F1 score, were calculated using methods outlined in prior research[7, 21]. The prediction time for each image was also tracked and compared. Note that in an object detection task, a true negative would be a background region on which no detection is made; such regions cannot be counted, so true negatives are not applicable. Consequently, neither the ROC curve nor the area under the curve (AUC) can be calculated. Average precision, on the other hand, whether mAP@0.5:0.95 or AP@0.5, is by definition the area under the precision-recall curve. A more detailed description of the calculation algorithms for metrics in our study is included in the Supplementary Information.
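As a rough illustration of the metrics in the model performance evaluation section, the sketch below computes a pixel-wise Cohen's kappa from two sets of bounding boxes, along with precision, recall, and F1 from detection counts. The exact algorithm is described only in the Supplementary Information, so rasterizing boxes into binary masks and comparing them pixel by pixel is an assumption here, and all function names are hypothetical.

```python
from typing import List, Sequence, Tuple

IntBox = Tuple[int, int, int, int]  # (x1, y1, x2, y2), x2 and y2 exclusive


def rasterize(boxes: Sequence[IntBox], width: int, height: int) -> List[List[bool]]:
    """Binary mask that is True wherever any box covers the pixel."""
    mask = [[False] * width for _ in range(height)]
    for x1, y1, x2, y2 in boxes:
        for y in range(max(0, y1), min(height, y2)):
            for x in range(max(0, x1), min(width, x2)):
                mask[y][x] = True
    return mask


def pixelwise_kappa(pred_boxes: Sequence[IntBox], true_boxes: Sequence[IntBox],
                    width: int, height: int) -> float:
    """Cohen's kappa between predicted and annotated lesion masks."""
    pred = rasterize(pred_boxes, width, height)
    true = rasterize(true_boxes, width, height)
    tp = fp = fn = tn = 0
    for y in range(height):
        for x in range(width):
            if pred[y][x] and true[y][x]:
                tp += 1
            elif pred[y][x]:
                fp += 1
            elif true[y][x]:
                fn += 1
            else:
                tn += 1
    n = width * height
    p_observed = (tp + tn) / n
    # Agreement expected by chance from the marginal frequencies.
    p_expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both masks are constant and identical
    return (p_observed - p_expected) / (1.0 - p_expected)


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 from detection counts (no true negatives)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy 10 x 10 image: the prediction misses one row of the annotated box.
kappa = pixelwise_kappa([(0, 0, 5, 4)], [(0, 0, 5, 5)], width=10, height=10)

# Counts taken from the Results section (TW-YOLO, then YOLOv5l).
tw_p, tw_r, tw_f1 = precision_recall_f1(tp=670, fp=223, fn=156)
y5_p, y5_r, y5_f1 = precision_recall_f1(tp=608, fp=252, fn=218)
```

Pixel-wise kappa does have true negatives (background pixels), which is why it can be computed even though box-level detection metrics cannot use them.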
Explainability analysis with score-CAM

Gradient-weighted class activation mapping (grad-CAM) uses the gradients of any target concept flowing into the final convolutional layer to produce a coarse localization map that highlights important regions in the image for predicting the concept[22]. It is widely used in explainability analysis of classification problems[7, 23, 24]. However, for object detection problems, class discrimination and target localization are equally important. While the prediction score can be used for classification explainability, to better understand how deep learning models draw bounding boxes, we investigated the IOU between the predictions from the models and the annotations from the orthodontic specialists. Calculating gradients for this purpose, however, is difficult. Score-based class activation mapping (score-CAM), proposed by Wang et al.[25], is a gradient-free localization mapping method and was therefore suitable for our analysis. We adopted Gildenblat's[26] CAM method library for score-CAM analysis. For the target layer, we chose the backbone module from RetinaNet and FasterRCNN and the C2f module at the back of the Detection Model module from YOLO.

Statistical analysis

For continuous variables in our study, non-parametric hypothesis tests were chosen because the data were not normally distributed. For categorical variables, the \(\chi^2\) test was used.

Results

Dataset features

In total, 653 intra-oral photographs with 12,216 teeth and 8392 WSL bounding box annotations were included in our dataset. For the external testing subset, 1252 teeth (548 teeth presented with WSLs) and 842 WSLs were included. WSLs covered 14.8% of the area of the affected tooth crown (95% CI: 2.1–31.2%) in all photographs. With TW-YOLO, slices of intra-oral photographs were analyzed at their original resolution, and the median area for WSL bounding boxes was 2898 px (IQR: 1895–5080).
The YOLOv5l model received downsized images, and the median area of the WSL bounding boxes decreased to 1135 px (IQR: 364–2423), significantly smaller than that in TW-YOLO. By the COCO dataset standard[27], only four WSL bounding boxes were small (area < 1024 pixels, i.e. smaller than 32 × 32 pixels) during the prediction phase of the TW-YOLO model, whereas 395 were small when YOLOv5l was evaluated. The standard resizing for YOLOv5l therefore not only decreased the size of all WSLs (p < 0.001) but also significantly increased the proportion of small bounding boxes (\(\chi^2\) = 795, p < 0.001). Heatmaps of WSLs (Fig. 3Aa) and tooth bounding boxes (Fig. 3Ab) showed that both were distributed in the central part of the photographs. For WSLs, the upper part was lighter than the lower part, suggesting that WSLs were primarily found in the upper jaw. The heatmap of the position of each WSL relative to its corresponding tooth (Fig. 3B) showed that most WSLs were present in the peripheral area, especially the upper and lower parts, which correspond to the peri-gingival area.

Comparison of model performance

The YOLOv5s model shares a similar architecture with the YOLOv5l model but is structurally simpler and contains fewer parameters. For images resized to a 640 × 640 resolution, YOLOv5s performs rapid detection (8 ns per image). Despite its efficiency, it maintains high accuracy, achieving an mAP of 0.95 at an IOU threshold of 0.5 (mAP@0.5) and an mAP of 0.73 across the full range of 0.5–0.95 IOU thresholds (mAP@0.5:0.95). The YOLOv5l model demonstrates superior accuracy compared with YOLOv5s, with an mAP@0.5 of 0.98 and an mAP@0.5:0.95 of 0.8, though it requires a longer inference time, at 12 ns per image. For enamel WSLs, the precision-recall curves (Fig. 4A) and F1-IOU curves (Fig. 4B) were obtained using YOLOv5l and TW-YOLO.
The pixel-wise Cohen's kappa coefficient was 0.76 for TW-YOLO and 0.62 for YOLOv5l. For secondary performance metrics, standard image resizing followed by YOLOv5l detection achieved an mAP@0.5 of 0.69 and an mAP@0.5:0.95 of 0.45. With TW-YOLO, the detection accuracy improved by approximately 10%, with an mAP@0.5 of 0.78 and an mAP@0.5:0.95 of 0.51. Among all 826 WSLs, TW-YOLO yielded 670 true positives (TPs), 156 false negatives (FNs), and 223 false positives (FPs), whereas YOLOv5l yielded 608 TPs, 218 FNs, and 252 FPs. The average inference time was 12 ns per image for YOLOv5l and 73 ns per image for TW-YOLO.

Explainability analysis

The score-CAM result for WSL detection (Fig. 5) demonstrated that while the YOLOv5l model mainly focused on teeth, considerable attention was diverted to the lips, oral mucosa, and cheek retractors when the standard resize approach was adopted (Fig. 5A2). In comparison, TW-YOLO retained more detail within the images (Fig. 5A3, 4), and the model's attention was concentrated on the peripheral region of the tooth labial surface (Fig. 5A4), which coincided with the WSL distribution pattern (Fig. 3B). These features consequently resulted in more precise predictions. Comparing the score-CAM results of TW-YOLO and YOLOv5l on image slices at the original resolution offered further insight into the superior performance of TW-YOLO. For YOLOv5l, the model's attention was concentrated on the tip of the tooth and the peri-gingival area, leading to more FNs (Fig. 5B2). For TW-YOLO, however, attention was more evenly distributed along the peripheral area and more concentrated on salient features (Fig. 5B3).

Discussion

Image data, as a central part of the patient record, play an essential role in the diagnostic and treatment workflow in orthodontic practice.
The reading of these image data, however, still depends heavily on orthodontists' manual work. This problem has become more evident as the amount of dental medical image data has grown exponentially. The development of artificial intelligence, especially in the field of deep learning, has introduced novel tools to increase dental image processing efficiency. Previous studies have shown that deep learning models performed well in cephalometric landmark detection[28] and dental implant system classification[29]. These models could also detect and evaluate the severity of periodontitis[30] and periodontal bone loss[31]. In our study, we developed TW-YOLO, a novel network architecture specifically designed for detecting WSLs in intra-oral photographs. It demonstrated significantly stronger agreement with orthodontists' annotations than YOLOv5l, achieving an approximately 10% improvement in detection accuracy. TW-YOLO outperformed YOLOv5l across all key metrics, with higher mAP@0.5, mAP@0.5:0.95, and TPs, as well as fewer FNs and FPs. Although its inference time rose to an average of 6× that of YOLOv5l because of its dual-network architecture, processing remained substantially faster than manual clinician annotation. Several factors contributed to this superior performance. First, models adopted for fine-tuning were mainly pretrained with images with prominent subjects (target objects occupying > 10% of the image area). Thus, fine-tuning tends to yield superior results when the target objects in the dataset maintain relatively large proportions[8, 10, 32]. The approach developed by Askar et al. for WSL detection[7] involved cropping intra-oral photographs into slices of individual tooth size; this preserves the original resolution while reformulating detection into a classification task. Their method achieved promising outcomes (R: 0.58–0.66; P: 0.67; AUC: 0.86).
Nevertheless, the clinical utility of this approach remains limited by its reliance on labor-intensive manual cropping. By contrast, Ozsunkar et al. directly downsized intra-oral images to 640 × 320 pixels before fine-tuning YOLOv5x, yielding suboptimal performance: a mAP of 0.454 at the 0.5 IOU threshold, detecting only 52% of WSLs and producing 133 TPs, 82 FNs, and 36 FPs[11]. As demonstrated in our study, simply downsizing intra-oral photographs significantly increases the proportion of small bounding boxes (0–1024 pixels), and detection of small targets is a well-recognized challenge in the object detection field[33, 34]. The suboptimal performance of the YOLO model observed in both the Ozsunkar study and our study is primarily attributable to the increase in the number of small objects resulting from image downscaling. Rich enamel textural information can be extracted from intra-oral photographs taken by a single-lens reflex camera. Multiple studies have shown that WSLs can be reliably assessed through meticulous, tooth-by-tooth examination of intra-oral photographs[3, 10]. This evidence aligns with our finding that tiled detection substantially improved accuracy by preserving critical details often lost after resolution reduction. Compared with generic tiling approaches that cut entire images into slices for sequential detection[13, 14], our method leverages the unique characteristics of intra-oral photographs by first detecting the location of all teeth and then performing slicing and detection in the cropped region. In this way, the area to be searched is reduced, saving detection time. Furthermore, from the perspective of score-CAM, the TW-YOLO model not only suppressed irrelevant areas (e.g. nostrils, lips, and gingival areas; Fig. 5) but also recognized salient features and focused on the distribution pattern of WSLs in our dataset.
Applying this sliding window strategy in both the training and prediction phases is more helpful than applying it during the prediction phase alone. As demonstrated in Fig. 5B, when sliced images at the original resolution are used as input, the YOLOv5 model can concentrate its attention on the gingival margins and incisal edges of tooth surfaces; however, many salient details are still omitted, as most of the score-CAM overlay remains blue (Fig. 5B2). TW-YOLO, by contrast, picked up more salient features (Fig. 5B3), which enhanced its accuracy. The finding that the TW-YOLO network could implicitly learn the distribution pattern of WSLs warrants further investigation.

There are some limitations to our study. The size of our dataset was relatively small, and most photographs were taken after fixed appliance removal. With more image data, the fine-tuning of models could achieve better results. Additionally, integrating an attention mechanism would allow WSL distribution patterns to be taught to the models explicitly, which would be more efficient than relying on the model to acquire this understanding implicitly over time.

Conclusions

The novel TW-YOLO model demonstrated higher accuracy than YOLOv5l and substantial agreement with orthodontists' annotations (Cohen's kappa: 0.76). It enhanced detection precision by reducing resolution degradation and concentrating on the key features of the tooth surface. Explainability analysis provided a better understanding of how these models perform in WSL detection and also indicated directions to explore for further improvements.
Abbreviations

YOLO: You Only Look Once network
TW-YOLO: tooth-to-WSL YOLO model
WSL: white spot lesion
mAP: mean average precision
AP: average precision
CAM: class activation mapping
COCO: Common Objects in Context
IOU: intersection over union

Declarations

Ethics approval and consent to participate
The current study complies with the Declaration of Helsinki, and the protocol was approved by the Ethics Committee of the Foshan Stomatological Hospital, Foshan University (2024-FSKQ-LW-002, approval date: 03-14-2024). Informed consent to participate was obtained from all of the participants included in this study.

Consent for publication
Not applicable.

Availability of data and materials
The datasets used and analyzed during the current study are available from the corresponding author upon reasonable request.

Competing interests
The authors declare that they have no competing interests.

Funding
The present study was supported by grants from the National Natural Science Foundation of China (81800961), Guangdong Basic and Applied Basic Research Foundation - Natural Science Fund Project (2025A1515010904), and International Orthodontics Foundation Young Research (IOF2022Y06).

Authors' contributions
Hau Man Chung: Conceptualization, Methodology, Software, Formal analysis, Investigation, Data curation, Writing – original draft, Visualization. Jingjing Ke: Resources, Data collection, Validation, Investigation. Mengdan Zhang: Formal analysis, Software, Validation, Visualization, Writing – review & editing. Lixian Kong: Data annotation, Resources, Validation. Junming Zheng: Methodology, Software, Validation. Lusai Xiang: Conceptualization, Supervision, Project administration, Funding acquisition, Writing – review & editing, Clinical expertise. All authors read and approved the final manuscript.

Acknowledgements
Not applicable.

References

Julien KC, Buschang PH, Campbell PM. Prevalence of white spot lesion formation during orthodontic treatment. Angle Orthod. 2013;83:641–7.
Lopatiene K, Borisovaite M, Lapenaite E. Prevention and Treatment of White Spot Lesions During and After Treatment with Fixed Orthodontic Appliances: a Systematic Literature Review. J Oral Maxillofacial Res. 2016;7. Elcock C, Lath DL, Luty JD, Gallagher MG, Abdellatif A, Bäckman B, et al. The new Enamel Defects Index: testing and expansion. Eur J Oral Sci. 2006;114(Suppl 1):35–8. discussion 39–41, 379. Chapman JA, Roberts WE, Eckert GJ, Kula KS, González-Cabezas C. Risk factors for incidence and severity of white spot lesions during treatment with fixed orthodontic appliances. Am J Orthod Dentofac Orthop. 2010;138:188–94. Chauncey RT, Yu Q, Armbruster PC, Ballard RW. A survey of white spot lesion prevention and resolution in the US dental school curricula. J Dent Educ. 2023;87:1552–8. Batra P, Tagra H, Katyal S. Artificial Intelligence in Teledentistry. Discoveries (Craiova). 2022;10:153. Askar H, Krois J, Rohrer C, Mertens S, Elhennawy K, Ottolenghi L et al. Detecting white spot lesions on dental photography using deep learning: A pilot study. J Dent. 2021;107 December 2020. Panyarak W, Wantanajittikul K, Charuakkra A, Prapayasatok S, Suttapak W. Enhancing Caries Detection in Bitewing Radiographs Using YOLOv7. J Digit Imaging. 2023;36:2635–47. Bayraktar Y, Ayan E. Diagnosis of interproximal caries lesions with deep convolutional neural network in digital bitewing radiographs. Clin Oral Invest. 2022;26:623–32. Casalegno F, Newton T, Daher R, Abdelaziz M, Lodi-Rizzini A, Schürmann F, et al. Caries Detection with Near-Infrared Transillumination Using Deep Learning. J Dent Res. 2019;98:1227–33. Ozsunkar PS, Özen DÇ, Abdelkarim AZ, Duman S, Uğurlu M, Demİr MR, et al. Detecting white spot lesions on post-orthodontic oral photographs using deep learning based on the YOLOv5x algorithm: a pilot study. BMC Oral Health. 2024;24:490. Lin H, Shi Z, Zou Z. Fully Convolutional Network With Task Partitioning for Inshore Ship Detection in Optical Remote Sensing Images. 
IEEE Geosci Remote Sens Lett. 2017;14:1665–9. Bruegger J, Catana DI, Macovaz V, Valdenegro-Toro M, Sabatelli M, Zullich M. Large-image Object Detection for Fine-grained Recognition of Punches Patterns in Medieval Panel Painting. 2025. Akyon FC, Altinuc SO, Temizel A. Slicing Aided Hyper Inference and Fine-tuning for Small Object Detection. In: 2022 IEEE International Conference on Image Processing (ICIP). 2022. pp. 966–70. Kühnisch J, Meyer O, Hesenius M, Hickel R, Gruhn V. Caries Detection on Intraoral Images Using Artificial Intelligence. J Dent Res. 2022;101:158–65. Tareq A, Faisal MI, Islam MS, Rafa NS, Chowdhury T, Ahmed S, et al. Visual Diagnostics of Dental Caries through Deep Learning of Non-Standardised Photographs Using a Hybrid YOLO Ensemble and Transfer Learning Model. Int J Environ Res Public Health. 2023;20:5351. GitHub - HumanSignal/labelImg. LabelImg is now part of the Label Studio community. The popular image annotation tool created by Tzutalin is no longer actively being developed, but you can check out Label Studio, the open source data labeling tool for images, text, hypertext, audio, video and time-series data. GitHub. https://github.com/HumanSignal/labelImg . Accessed 24 May 2025. Batista GEAPA, Prati RC, Monard MC. A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor Newsl. 2004;6:20–9. Arpit D, Jastrzębski S, Ballas N, Krueger D, Bengio E, Kanwal MS et al. A Closer Look at Memorization in Deep Networks. In: Proceedings of the 34th International Conference on Machine Learning. PMLR; 2017. pp. 233–42. Rotondi MA, Donner A. A confidence interval approach to sample size estimation for interobserver agreement studies with multiple raters and outcomes. J Clin Epidemiol. 2012;65:778–84. Litjens G, Kooi T, Bejnordi BE, Setio AAA, Ciompi F, Ghafoorian M, et al. A survey on deep learning in medical image analysis. Med Image Anal. 2017;42:60–88. 
Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization. Int J Comput Vis. 2020;128:336–59. Ma X, Ferguson EC, Jiang X, Savitz SI, Shams S. A multitask deep learning approach for pulmonary embolism detection and identification. Sci Rep. 2022;12:13087. Nayak T, Chadaga K, Sampathila N, Mayrose H, Gokulkrishnan N, Bairy GM, et al. Deep learning based detection of monkeypox virus using skin lesion images. Med Nov Technol Devices. 2023;18:100243. Wang H, Wang Z, Du M, Yang F, Zhang Z, Ding S et al. Score-CAM: Score-Weighted Visual Explanations for Convolutional Neural Networks. 2020. Gildenblat J. contributors. PyTorch library for CAM methods. 2021. Lin T-Y, Maire M, Belongie S, Bourdev L, Girshick R, Hays J et al. Microsoft COCO: Common Objects in Context. 2015. Park J-H, Hwang H-W, Moon J-H, Yu Y, Kim H, Her S-B, et al. Automated identification of cephalometric landmarks: Part 1—Comparisons between the latest deep-learning methods YOLOV3 and SSD. Angle Orthod. 2019;89:903–9. Jang WS, Kim S, Yun PS, Jang HS, Seong YW, Yang HS, et al. Accurate detection for dental implant and peri-implant tissue by transfer learning of faster R-CNN: a diagnostic accuracy study. BMC Oral Health. 2022;22:591. Chang J, Chang M-F, Angelov N, Hsu C-Y, Meng H-W, Sheng S, et al. Application of deep machine learning for the radiographic diagnosis of periodontitis. Clin Oral Investig. 2022;26:6629–37. Krois J, Ekert T, Meinhold L, Golla T, Kharbot B, Wittemeier A, et al. Deep Learning for the Radiographic Detection of Periodontal Bone Loss. Sci Rep. 2019;9:8495. Çelik B, Savaştaer EF, Kaya HI, Çelik ME. The role of deep learning for periapical lesion detection on panoramic radiographs. Dentomaxillofac Radiol. 2023;52:20230118. Hu B, Liu Y, Chu P, Tong M, Kong Q. Small Object Detection via Pixel Level Balancing With Applications to Blood Cell Detection. Front Physiol. 2022;13. Uzkent B, Yeh C, Ermon S. 
Efficient Object Detection in Large Images Using Deep Reinforcement Learning. In: 2020 IEEE Winter Conference on Applications of Computer Vision (WACV). Snowmass Village, CO, USA: IEEE; 2020. pp. 1813–22. Additional Declarations No competing interests reported. Supplementary Files Supplementarymaterial.docx Status: Published Journal Publication published 09 Oct, 2025 Read the published version in BMC Oral Health → Version 1 posted Editorial decision: Revision requested 05 Aug, 2025 Reviews received at journal 05 Aug, 2025 Reviews received at journal 23 Jul, 2025 Reviewers agreed at journal 18 Jul, 2025 Reviewers agreed at journal 16 Jul, 2025 Reviewers invited by journal 15 Jul, 2025 Editor assigned by journal 15 Jul, 2025 Editor invited by journal 14 Jul, 2025 Submission checks completed at journal 12 Jul, 2025 First submitted to journal 12 Jul, 2025
© Research Square 2026 | ISSN 2693-5015 (online) {"props":{"pageProps":{"initialData":{"identity":"rs-7058696","acceptedTermsAndConditions":true,"allowDirectSubmit":false,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":486887254,"identity":"03404084-5fad-477a-bd18-b7ed8781e08e","order_by":0,"name":"Hau Man Chung","email":"","orcid":"","institution":"Guanghua School of Stomatology, Hospital of Stomatology, Sun Yat-sen University, Guangdong Provincial Key Laboratory of Stomatology","correspondingAuthor":false,"prefix":"","firstName":"Hau","middleName":"Man","lastName":"Chung","suffix":""},{"id":486887255,"identity":"e5e10685-e291-44d7-b720-7fb43b3dba78","order_by":1,"name":"Jingjing Ke","email":"","orcid":"","institution":"Guanghua School of Stomatology, Hospital of Stomatology, Sun Yat-sen University, Guangdong Provincial Key Laboratory of Stomatology","correspondingAuthor":false,"prefix":"","firstName":"Jingjing","middleName":"","lastName":"Ke","suffix":""},{"id":486887256,"identity":"24ad4ed8-2e46-4abf-8442-cc64bd35d57a","order_by":2,"name":"Mengdan Zhang","email":"","orcid":"","institution":"Guanghua School of Stomatology, Hospital of Stomatology, Sun Yat-sen University, Guangdong Provincial Key Laboratory of Stomatology","correspondingAuthor":false,"prefix":"","firstName":"Mengdan","middleName":"","lastName":"Zhang","suffix":""},{"id":486887257,"identity":"12f0aa25-1cbb-4f04-98d5-ce451d47e33b","order_by":3,"name":"Lixian Kong","email":"","orcid":"","institution":"Guanghua School of Stomatology, Hospital of Stomatology, Sun Yat-sen University, Guangdong Provincial Key Laboratory of
Stomatology","correspondingAuthor":false,"prefix":"","firstName":"Lixian","middleName":"","lastName":"Kong","suffix":""},{"id":486887258,"identity":"59614aef-bf84-44eb-9390-481848eab62d","order_by":4,"name":"Junming Zheng","email":"","orcid":"","institution":"Foshan University","correspondingAuthor":false,"prefix":"","firstName":"Junming","middleName":"","lastName":"Zheng","suffix":""},{"id":486887259,"identity":"72c350db-a3a6-4c88-bc5c-2d82ce13ed4b","order_by":5,"name":"Lusai Xiang","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAABCUlEQVRIiWNgGAWjYDACZhBRwCDHwMDYeAAmKEFYiwGDMVBLA5FaGCBaEhuAFHFaDI4zP3v4xaAufW37YaAtfw7bGxxgPnibh8EuD5cWyWY2c2MZg8O5284kNhxgbDucuOEAW7I1D0NyMS4t/MwMZtISBgdytx0AaWk4nGBwgMdMmofhANip2AAbM/s3oJa6dLPzD2EO4/+GVws/M4+Z5AcD5gSzG0BbGNgOM244wMOGV4tkM0+ZNIPBYcNtN4C2JLalJ848zGZsOccgGacWg/PHt0n+qKiTNzuf/vDBhz/W9nzHmx/eeFNhh1MLCDDzwFgJDM3wyMULGH8g2HX4lY6CUTAKRsGIBAAjKliT76LfGQAAAABJRU5ErkJggg==","orcid":"","institution":"Guanghua School of Stomatology, Hospital of Stomatology, Sun Yat-sen University, Guangdong Provincial Key Laboratory of Stomatology","correspondingAuthor":true,"prefix":"","firstName":"Lusai","middleName":"","lastName":"Xiang","suffix":""}],"badges":[],"createdAt":"2025-07-06 15:23:03","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-7058696/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-7058696/v1","draftVersion":[],"editorialEvents":[{"content":"https://doi.org/10.1186/s12903-025-06936-w","type":"published","date":"2025-10-09T15:57:47+00:00"}],"editorialNote":"","failedWorkflow":false,"files":[{"id":87321275,"identity":"87e8fd10-d50a-48b9-bbe7-47dd72446ef2","added_by":"auto","created_at":"2025-07-22 16:35:52","extension":"jpeg","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":73466,"visible":true,"origin":"","legend":"\u003cp\u003eWorkflow of model finetuning and 
evaluation\u003c/p\u003e","description":"","filename":"floatimage1.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-7058696/v1/a1e72e02be78160b040247cc.jpeg"},{"id":87322177,"identity":"16eb3bbd-a957-46b3-aa41-ae148ba6ec83","added_by":"auto","created_at":"2025-07-22 16:43:52","extension":"jpeg","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":79674,"visible":true,"origin":"","legend":"\u003cp\u003eScheme chart of the tooth-to-WSL YOLO model\u003c/p\u003e","description":"","filename":"floatimage2.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-7058696/v1/dcef11b2a6c32d03fc08746f.jpeg"},{"id":87321267,"identity":"90528cac-c6a0-4f3c-9454-ff5ccf9ccdd9","added_by":"auto","created_at":"2025-07-22 16:35:52","extension":"jpeg","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":47617,"visible":true,"origin":"","legend":"\u003cp\u003eAnalysis of dataset features. A: Heatmap of locations of white spot lesion bounding boxes (a) and teeth (b). B: Heatmap of the relative location of a white spot lesion to the tooth on which it was observed.\u003c/p\u003e","description":"","filename":"floatimage3.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-7058696/v1/0970868b1e8bd431860e8727.jpeg"},{"id":87321277,"identity":"ee1528a8-c3b7-4778-a375-a05fa4f7d069","added_by":"auto","created_at":"2025-07-22 16:35:52","extension":"jpeg","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":39375,"visible":true,"origin":"","legend":"\u003cp\u003eComparison of performance metrics between TW-YOLO and YOLO. A: Precision recall curve at 0.5 IOU threshold for white spot lesions. 
B: F1 score over IOU threshold curve for white spot lesions.\u003c/p\u003e","description":"","filename":"floatimage4.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-7058696/v1/062391f53032382eb12c8b40.jpeg"},{"id":87322182,"identity":"05217933-f7df-413d-bdc3-5e1bf2fafeff","added_by":"auto","created_at":"2025-07-22 16:43:52","extension":"jpeg","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":90398,"visible":true,"origin":"","legend":"\u003cp\u003eExplainability analysis of models. A: Example image from our dataset (A1) and detection result, as well as overlayed score-CAM generated with YOLOv5l (A2). A regional crop of the intra-oral photograph (A3) and detection result, as well as overlayed score-CAM by TW-YOLO (A4), are also displayed to demonstrate differences in attention distribution between the models. B: Intra-oral photograph (B1) along with detection and score-CAM overlay generated by YOLOv5l trained with the standard resize approach (B2). 
The detection result and score-CAM overlay generated by TW-YOLO are shown in B3.\u003c/p\u003e","description":"","filename":"floatimage5.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-7058696/v1/185c92764f9a9a3fdfb3f106.jpeg"},{"id":93597664,"identity":"9b0f2103-ab18-4b2a-b3e2-f4cdbb76b448","added_by":"auto","created_at":"2025-10-15 14:18:36","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":961404,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-7058696/v1/4bb604a6-8f7a-48d1-8add-296d7e9b56e6.pdf"},{"id":87322583,"identity":"b321c77b-4084-4ac6-8c11-bc0380e490b1","added_by":"auto","created_at":"2025-07-22 16:51:52","extension":"docx","order_by":0,"title":"","display":"","copyAsset":false,"role":"supplement","size":153682,"visible":true,"origin":"","legend":"","description":"","filename":"Supplementarymaterial.docx","url":"https://assets-eu.researchsquare.com/files/rs-7058696/v1/de5fe1c17fb11697165dd5f6.docx"}],"financialInterests":"No competing interests reported.","formattedTitle":"Tooth-to-white spot lesion YOLO: a novel model for white spot lesion detection","fulltext":[{"header":"Introduction","content":"\u003cp\u003eThe stage before cavitation in the development of dental caries is called white spot lesion (WSL). It is characterized by subsurface demineralization areas formed under an intact enamel surface. WSLs manifest as alterations of the translucent feature of the enamel, and the color of these areas appears opaque white. The reported incidence of WSLs is widely variable, but on average, such decalcifications are found in 30\u0026ndash;70% of patients during orthodontic treatment[\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e]. The high incidence of WSLs necessitates attention from patients and practitioners. 
Early detection of WSLs during orthodontic treatment would allow the implementation of preventive measures and control of the demineralization process before it further progresses[\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e]. Various techniques are used to detect WSLs, including visual inspection, photography, light-induced fluorescence, and quantitative laser analysis. The enamel decalcification index (EDI), a visual inspection tool designed to categorize and assess the presence and severity of enamel defects[\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e], not only has an accuracy comparable to that of light-induced fluorescence[\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e] but also is easier to implement and can sometimes be used for patient self-monitoring, which is helpful for early detection[\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e]. However, using the EDI requires professional training, which limits its applicability in patient self-monitoring. This method is also time-consuming, which makes it impractical for large-scale screening and dynamic monitoring during orthodontic processes.\u003c/p\u003e\u003cp\u003eRecent advances in the field of deep learning have shown promising potential for streamlining the routine work of dental caregivers and empowering patient self-monitoring[\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e]. Many studies have evaluated the classification ability (the ability to differentiate between images of teeth with and without lesions) of deep learning models. Askar et al. demonstrated that SqueezeNet, a convolutional neural network, performs this task with great accuracy[\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e]. Determining the size and location of these lesions, which usually involves object detection and semantic segmentation, is the next step. 
This is especially relevant in the prevention and management of WSLs because location and size influence esthetic outcomes.\u003c/p\u003e\u003cp\u003eStudies indicate that deep learning models can localize and identify caries in bitewing X-rays with a recall of 0.727 and an F1 score of 0.687[\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e, \u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e]. Furthermore, Casalegno et al. used a U-Net\u0026ndash;like network architecture for caries segmentation in near-infrared transillumination (TI) images, achieving a mean intersection over union (mIOU) of 72.7% and an area under the ROC curve (AUC) of 85.6%[\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e]. These findings collectively demonstrate that deep learning models exhibit not only high processing speed but also substantial diagnostic accuracy in caries detection.\u003c/p\u003e\u003cp\u003eCompared with the images analyzed in these studies, in which the lesions comprise the majority of the image, intra-oral photographs are usually much larger, and the WSLs comprise a much smaller proportion. As demonstrated in the study by Ozsunkar et al.[\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e], applying YOLOv5 to detect WSLs in intra-oral photographs in an end-to-end approach by directly down-sizing images before detection yields poor accuracy. Therefore, a novel deep learning model is needed for this specific task. Inspired by the task partitioning paradigm[\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e] and the sliding windows strategy[\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e, \u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e], we developed a tooth-to-WSL You Only Look Once (TW-YOLO) model and compared its accuracy metrics with those of YOLO. 
To the best of our knowledge, our study is the first to implement explainability analysis on object detection models within the domain of dental medicine to gain a better understanding of the mechanisms behind model decision-making.\u003c/p\u003e"},{"header":"Materials and Methods","content":"\u003cp\u003e\u003cb\u003eData collection\u003c/b\u003e\u003c/p\u003e\u003cp\u003eAnonymized intra-oral photographs of orthodontic patients with WSLs were acquired from image archives in the Orthodontics Department of Foshan Stomatological Hospital, Foshan University. The protocol of the current study was approved by the Ethics Committee of the Foshan Stomatological Hospital, Foshan University (2024-FSKQ-LW-002). All patients signed an informed consent form at the beginning of their treatment allowing their anonymized image data to be used for medical research purposes. In total, 653 anonymized intra-oral photographs were collected, a number greater than those in previous studies[\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e, \u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e, \u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eAll intra-oral photographs were taken of patients receiving fixed-appliance orthodontic treatment, either before the treatment commenced or after appliance removal, with a digital single-lens reflex camera (Canon EOS 60D, Canon Corp., Tokyo, Japan). The image resolution was approximately 4000 by 3000 pixels. Tooth surfaces were cleaned and dried prior to intra-oral photography.\u003c/p\u003e\u003cp\u003e\u003cb\u003eImage annotation and data augmentation\u003c/b\u003e\u003c/p\u003e\u003cp\u003eAll image data were first annotated manually by two senior orthodontic specialists with LabelImg[\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e] by drawing bounding boxes around WSLs. 
The specialists first annotated 10 cases together, reaching a consensus, and then each specialist continued the annotation work independently. Finally, only regions selected by both orthodontic specialists were kept as annotations. Bounding boxes around individual teeth were drawn by a junior researcher.\u003c/p\u003e\u003cp\u003eThe 653 intra-oral photographs were split into three groups (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e): 457 images were used for training; 130, for validation; and 66, for external testing. Following a previous study[\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e], all images and corresponding annotation data were randomly split into train-validation (90% of all images) and external testing (10% of all images) datasets. The images in the train-validation dataset were used for model training and validation, and the images in the external testing dataset were used for performance evaluation. The images in the external testing dataset were never seen by the deep learning models during training, so that the performance estimates would not be inflated by memorization[\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e].\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eData augmentation techniques, including random cropping and scaling, image rotation, and affine transformations, were employed to expand the effective dataset scale and enhance the spatial robustness of the deep learning model.\u003c/p\u003e\u003cp\u003eThe minimum sample size was estimated based on the approach outlined in a previous study[\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e]. Based on a pilot study, we estimated that Cohen's kappa coefficient values were 0.6 and 0.77 for YOLOv5 and TW-YOLO, respectively. On average, WSLs covered 15% of the tooth enamel surfaces in the photographs. Given these parameters, at least 505 teeth with WSLs were needed. 
The external testing dataset included 548 teeth with WSLs, satisfying the minimum sample size requirement.\u003c/p\u003e\u003cp\u003e\u003cb\u003eTW-YOLO model architecture\u003c/b\u003e\u003c/p\u003e\u003cp\u003eTo mitigate the issue of excessive downsizing in conventional image preprocessing of high-resolution intra-oral photographs, we developed the novel TW-YOLO model (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e). First, intra-oral photographs (approximately 4000 \u0026times; 3000 pixels) undergo standard proportional resizing to fit within a 640 \u0026times; 640 pixel square. The resized images are then processed by a YOLOv5s network to localize teeth. Non-maximum suppression (NMS) is subsequently applied to retain only non-overlapping predicted bounding boxes. Based on the bounding boxes for the teeth, the image region encompassing all detected teeth is calculated. This region is then cropped out of the original, full-resolution image. Within this cropped input, tiled image extraction is performed using sliding 640 \u0026times; 640 pixel windows. Adjacent windows overlap by 50 pixels to prevent lesion oversight.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eNext, all extracted image tiles along with the entire cropped input are processed individually by the YOLOv5l network to detect WSLs. This facilitates concurrent localized detection within the tiles and holistic detection on the cropped input. During the training phase, each extracted tile is treated as an independent training sample. For the prediction phase, the detection results from all processed tiles and the cropped input are mapped back onto the coordinate space of the original, uncropped image. 
A final NMS step is applied to eliminate significantly overlapping bounding boxes, retaining the post-processed detections.\u003c/p\u003e\u003cp\u003e\u003cb\u003eFine-tuning of YOLO and our TW-YOLO model\u003c/b\u003e\u003c/p\u003e\u003cp\u003eYOLO models possess parameter counts in the millions (YOLOv5s: 7.2\u0026nbsp;million; YOLOv5l: 46.5\u0026nbsp;million), so training such models from scratch is impractical. Therefore, we initialized both models using pretrained weights (trained on the common objects in context [COCO] dataset) as starting points for fine-tuning. This transfer learning strategy leveraged the generic feature extraction capabilities acquired from broad image data while incorporating domain-specific knowledge pertinent to orthodontic WSL detection.\u003c/p\u003e\u003cp\u003eFine-tuning of the YOLO network was performed by training models for 500 epochs with adaptive optimization hyperparameters. The loss function, comprising object detection loss (computed via intersection over union [IOU]) and classification loss (computed via Cross-Entropy), was evaluated in each epoch and backpropagated to update model weights.\u003c/p\u003e\u003cp\u003e\u003cb\u003eModel performance evaluation\u003c/b\u003e\u003c/p\u003e\u003cp\u003eAfter training, model performance was evaluated using the testing dataset. The primary evaluation metric was the pixel-wise Cohen\u0026rsquo;s kappa coefficient. We adopted this metric to evaluate the agreement between our orthodontists and the models regarding the boundaries of the WSLs. Other accuracy metrics included overall mean average precision (
mAP@0.5:0.95), which was first introduced by the COCO detection challenge and has since become the most common evaluation metric for object detection accuracy. Average precision at the 0.5 IOU threshold (
mAP@0.5) and F1 score were chosen as secondary accuracy metrics. The evaluation metrics, including IOU, precision (P), recall (R),
mAP@0.5:0.95,
mAP@0.5, and F1 score, were calculated using methods outlined in prior research[\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e, \u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e]. The prediction time for each image was also recorded and compared.\u003c/p\u003e\u003cp\u003eNote that in object detection tasks, a true negative (background with no detection overlapping it) cannot be meaningfully counted and is therefore not applicable. Thus, neither the ROC curve nor the area under the curve (AUC) can be calculated. By contrast, average precision, whether the
mAP@0.5:0.95 or
mAP@0.5, is by definition derived from the area under the precision-recall curve. A more detailed description of the calculation algorithms for metrics in our study is included in the Supplementary Information.\u003c/p\u003e\u003cp\u003e\u003cb\u003eExplainability analysis with score-CAM\u003c/b\u003e\u003c/p\u003e\u003cp\u003eGradient-weighted class activation mapping (grad-CAM) uses the gradients of any target concept flowing into the final convolutional layer to produce a coarse localization map that highlights important regions in the image for predicting the concept[\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e]. It is widely used in explainability analysis of classification problems[\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e, \u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e, \u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e]. However, for object detection problems, class discrimination and target localization are equally important. While the prediction score can be used for classification explainability, to better understand how deep learning models draw bounding boxes, we investigated the IOU ratio between the predictions from the models and the annotations from the orthodontic specialists. Calculating gradients with respect to this IOU, however, is difficult. To avoid this problem, we used score-based class activation mapping (score-CAM), proposed by Wang et al.[\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e], which is a gradient-free localization mapping method suitable for our analysis.\u003c/p\u003e\u003cp\u003eWe adopted Gildenblat's[\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e] CAM method library for score-CAM analysis. 
For the target layer, we chose the backbone module from RetinaNet and FasterRCNN and the C2f module at the back of the Detection Model module from YOLO.\u003c/p\u003e\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e\u003ch2\u003eStatistical analysis\u003c/h2\u003e\u003cp\u003eFor continuous variables in our study, non-parametric hypothesis tests were chosen due to non-normal distribution. For categorical variables, the\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\text{}{\\text{\u0026chi;}}^{\\text{2}}\\text{}\\)\u003c/span\u003e\u003c/span\u003etest was used.\u003c/p\u003e\u003c/div\u003e"},{"header":"Results","content":"\u003cp\u003e\u003cb\u003eDataset features\u003c/b\u003e\u003c/p\u003e\u003cp\u003eIn total, 653 intra-oral photographs with 12,216 teeth and 8392 WSL bounding box annotations were included in our dataset. For the external testing subset, 1252 teeth (548 teeth presented with WSLs) and 842 WSLs were included. WSLs covered 14.8% of the area of the affected tooth crown (95% CI: 2.1\u0026ndash;31.2%) in all photographs. With TW-YOLO, slices of intra-oral photographs were analyzed at their original resolution, and the median area for WSL bounding boxes was 2898 px (IQR: 1895\u0026ndash;5080). The YOLOv5l model received downsized images, and the median area for WSLs decreased to 1135 px (IQR: 364\u0026ndash;2423), significantly smaller than that in TW-YOLO. Judging by the standard from the COCO dataset[\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e], only four WSL bounding boxes had a small size (\u0026lt;\u0026thinsp;1024 pixels) during the prediction phase of the TW-YOLO model, whereas 395 had a small size when YOLOv5l was evaluated. 
It is obvious that the standard resize treatment for YOLOv5l not only decreased the size of all WSLs (p\u0026thinsp;\u0026lt;\u0026thinsp;0.001) but also significantly increased the proportion of the small size bounding boxes (\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{\u0026chi;}}^{\\text{2}}\\)\u003c/span\u003e\u003c/span\u003e = 795, p\u0026thinsp;\u0026lt;\u0026thinsp;0.001).\u003c/p\u003e\u003cp\u003eHeatmaps of WSLs (Fig.\u0026nbsp;3Aa) and tooth bounding boxes (Fig.\u0026nbsp;3Ab) demonstrated their distribution in the central part of the photographs. For WSLs, the upper part was lighter than the lower part, suggesting that WSLs were primarily found in the upper jaw. The heatmap for the relative position of the WSLs to that of the corresponding tooth (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eB) showed that most WSLs were present on the peripheral area, especially on the upper and lower parts, which correspond to the peri-gingival area.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003e\u003cb\u003eComparison of model performance\u003c/b\u003e\u003c/p\u003e\u003cp\u003eThe YOLOv5s model shares a similar architecture to the YOLOv5l model but is structurally simpler and contains fewer parameters. For images resized to a 640 \u0026times; 640 resolution, YOLOv5s performs rapid detection (8 ns per image). Despite its efficiency, it maintains high accuracy, achieving an mAP of 0.95 at an IOU threshold of 0.5 (
mAP@0.5) and an mAP of 0.73 across the full range of 0.5\u0026ndash;0.95 IOU thresholds (
mAP@0.5:0.95). The YOLOv5l model demonstrates superior accuracy compared with YOLOv5s, with an
mAP@0.5 of 0.98 and an
mAP@0.5:0.95 of 0.8, though it requires a longer inference time, at 12 ns per image.\u003c/p\u003e\u003cp\u003eFor enamel WSLs, the precision-recall curves (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eA) and F1-IOU curves (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eB) were obtained using YOLOv5l and TW-YOLO. The pixel-wise Cohen's kappa coefficient was 0.76 for TW-YOLO and 0.62 for YOLOv5l. For secondary performance metrics, when applying standard image resizing followed by YOLOv5l detection, the model achieved an
mAP@0.5 of 0.69 and an
mAP@0.5:0.95 of 0.45. With TW-YOLO, however, the detection accuracy improved by approximately 10%, with an
mAP@0.5 of 0.78 and an
mAP@0.5:0.95 of 0.51. Among all 826 WSLs, TW-YOLO yielded 670 true positives (TPs), 156 false negatives (FNs), and 223 false positives (FPs). YOLOv5l yielded 608 TPs, 218 FNs, and 252 FPs. The average inference time was 12 ns per image for YOLOv5l and 73 ns per image for TW-YOLO.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003e\u003cb\u003eExplainability analysis\u003c/b\u003e\u003c/p\u003e\u003cp\u003eThe score-CAM result for WSL detection (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e) demonstrated that while the YOLOv5l model mainly focused on teeth, considerable attention was diverted to the lips, oral mucosa, and cheek retractors when the standard resize approach was adopted (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003eA2). In comparison, TW-YOLO retained more detail within the images (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003eA3, A4), and the model's attention was concentrated on the peripheral region of the tooth labial surface (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003eA4), which coincided with the WSL distribution pattern (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eB). These features likely contributed to its more precise predictions.\u003c/p\u003e\u003cp\u003eComparing the score-CAM results for TW-YOLO and YOLOv5l on image slices at the original resolution offered further insight into the superior performance of TW-YOLO. For YOLOv5l (Fig.\u0026nbsp;5B2), the model\u0026rsquo;s attention was concentrated on the tip of the tooth and the peri-gingival area, leading to more FNs. 
However, for TW-YOLO, attention was more evenly distributed along the peripheral area and more concentrated on salient features (Fig.\u0026nbsp;5B3).\u003c/p\u003e\u003cp\u003e\u003c/p\u003e"},{"header":"Discussion","content":"\u003cp\u003eImage data, as a central part of the patient record, play an essential role in the diagnostic and treatment workflow in orthodontic practice. The reading of these image data, however, still heavily depends on orthodontists\u0026rsquo; manual work. This problem has become more evident as the amount of dental medical image data has grown exponentially. The development of artificial intelligence, especially in the field of deep learning, has introduced novel tools to increase dental image processing efficiency. Previous studies have shown that deep learning models performed well in cephalometric landmark detection[\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e] and dental implant systems classification[\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e]. These models could also detect and evaluate the severity of periodontitis[\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e] and periodontal bone loss[\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eIn our study, we developed TW-YOLO, a novel network architecture specifically designed for detecting WSLs in intra-oral photographs. It demonstrated significantly stronger agreement with orthodontists' annotations than YOLOv5l, achieving an approximately 10% improvement in detection accuracy. TW-YOLO comprehensively outperformed YOLOv5l across all key metrics, including increased
mAP@0.5,
mAP@0.5:0.95, and TPs, as well as reduced FNs and FPs. Although its average inference time was approximately 6\u0026times; that of YOLOv5l because of its dual-network architecture, processing remained substantially faster than manual clinician annotation.\u003c/p\u003e\u003cp\u003eSeveral factors contributed to this superior performance. First, models adopted for fine-tuning were mainly pretrained on images with prominent subjects (target objects occupying\u0026thinsp;\u0026gt;\u0026thinsp;10% of the image area). Thus, fine-tuning tends to yield superior results when the target objects in the dataset maintain relatively large proportions[\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e, \u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e, \u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e]. The approach developed by Askar et al. for WSL detection[\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e] involved cropping intra-oral photographs into slices of individual tooth size; this preserves the original resolution while reformulating detection into a classification task. Their method achieved promising outcomes (R: 0.58\u0026ndash;0.66; P: 0.67; AUC: 0.86). Nevertheless, the clinical utility of this approach remains limited by its reliance on labor-intensive manual cropping. By contrast, \u0026Ouml;zsunkar et al. directly downsized intra-oral images to 640 \u0026times; 320 pixels before fine-tuning YOLOv5x, yielding suboptimal performance: an mAP of 0.454 at the 0.5 IOU threshold, detecting only 52% of WSLs and producing 133 TPs, 82 FNs, and 36 FPs[\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e]. 
As demonstrated in our study, simply downsizing intra-oral photographs significantly increases the proportion of small-size bounding boxes (0\u0026ndash;1024 pixels), and detection of small-size targets is a well-recognized challenge in the object detection field[\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e33\u003c/span\u003e, \u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e]. The suboptimal performance of the YOLO model observed in both the study by \u0026Ouml;zsunkar et al. and our study is primarily attributable to the increase in the number of small-size objects resulting from image downscaling.\u003c/p\u003e\u003cp\u003eRich enamel textural information can be extracted from intra-oral photographs taken by a single-lens reflex camera. Multiple studies have shown that WSLs can be reliably assessed through meticulous, tooth-by-tooth examination of intra-oral photographs[\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e, \u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e]. This evidence aligns with our finding that tiled detection substantially improved accuracy by preserving critical details often lost after resolution reduction.\u003c/p\u003e\u003cp\u003eCompared with other generic tiling approaches that cut entire images into slices for sequential detection[\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e, \u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e], our method leverages the unique characteristics of intra-oral photographs by first detecting the location of all teeth and then performing slicing and detection in the cropped region. In this way, the area to be detected is reduced, saving detection time. Furthermore, from the perspective of score-CAM, the TW-YOLO model not only suppressed irrelevant areas (e.g.
nostrils, lips, and gingival areas; Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e) but also recognized salient features and focused on the distribution pattern of WSLs in our dataset. Applying this slicing-window strategy in both the training and prediction phases is more effective than applying it only during the prediction phase. As demonstrated in Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003eB, when sliced images at the original resolution are used as input, the YOLOv5 model can concentrate its attention on gingival margins and incisal edges of tooth surfaces; however, this approach results in the omission of many salient details, as most of the score-CAM overlay is still colored blue (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003eB2). TW-YOLO, however, picked up more salient features (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003eB3), thus enhancing its accuracy. The finding that the TW-YOLO network could implicitly learn the distribution pattern of WSLs warrants further investigation.\u003c/p\u003e\u003cp\u003eThere are some limitations to our study. The size of our dataset was relatively small, and most photographs were taken after fixed aligner removal. With more image data, the fine-tuning of models could achieve better results. Additionally, the integration of an attention mechanism would allow for WSL distribution patterns to be explicitly taught to the models, which would be a more efficient approach than relying on the model to implicitly acquire this understanding over time.\u003c/p\u003e"},{"header":"Conclusions","content":"\u003cp\u003eThe novel TW-YOLO model not only demonstrated improved accuracy but also showed substantial agreement with orthodontists' annotations (Cohen's kappa, 0.76). 
It enhanced the detection precision by effectively reducing the resolution degradation and concentrating on the key features of the tooth surface. Explainability analysis provided a better understanding of how these models perform in WSL detection and also indicated directions to explore for further improvements.\u003c/p\u003e"},{"header":"Abbreviations","content":"\u003cdiv class=\"DefinitionList\"\u003e\u003cdiv class=\"DefinitionListEntry\"\u003e\u003cdiv class=\"Term\"\u003eYOLO\u003c/div\u003e\u003cdiv class=\"Description\"\u003e\u003cp\u003eYou Only Look Once Network\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e\u003cdiv class=\"DefinitionListEntry\"\u003e\u003cdiv class=\"Term\"\u003eTW-YOLO\u003c/div\u003e\u003cdiv class=\"Description\"\u003e\u003cp\u003etooth-to-WSL YOLO model\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e\u003cdiv class=\"DefinitionListEntry\"\u003e\u003cdiv class=\"Term\"\u003eWSL\u003c/div\u003e\u003cdiv class=\"Description\"\u003e\u003cp\u003ewhite spot lesion\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e\u003cdiv class=\"DefinitionListEntry\"\u003e\u003cdiv class=\"Term\"\u003emAP\u003c/div\u003e\u003cdiv class=\"Description\"\u003e\u003cp\u003emean average precision\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e\u003cdiv class=\"DefinitionListEntry\"\u003e\u003cdiv class=\"Term\"\u003eAP\u003c/div\u003e\u003cdiv class=\"Description\"\u003e\u003cp\u003eaverage precision\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e\u003cdiv class=\"DefinitionListEntry\"\u003e\u003cdiv class=\"Term\"\u003eCAM\u003c/div\u003e\u003cdiv class=\"Description\"\u003e\u003cp\u003eclass activation mapping\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e\u003cdiv class=\"DefinitionListEntry\"\u003e\u003cdiv class=\"Term\"\u003eCOCO\u003c/div\u003e\u003cdiv class=\"Description\"\u003e\u003cp\u003eCommon Objects in Context\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e\u003cdiv class=\"DefinitionListEntry\"\u003e\u003cdiv class=\"Term\"\u003eIOU\u003c/div\u003e\u003cdiv
class=\"Description\"\u003e\u003cp\u003eintersection over union\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e\u003c/div\u003e"},{"header":"Declarations","content":"\u003ch3\u003eEthics approval and consent to participate\u003c/h3\u003e\n\u003cp\u003eThe current study is in compliance with the Helsinki Declaration, and the protocol was approved by the Ethics Committee of the Foshan Stomatological Hospital, Foshan University (2024-FSKQ-LW-002, approval date: 03-14-2024). Informed consent to participate was obtained from all of the participants included in this study.\u003c/p\u003e\n\u003ch3\u003eConsent for publication\u003c/h3\u003e\n\u003cp\u003eNot applicable.\u003c/p\u003e\n\u003ch3\u003eAvailability of data and materials\u003c/h3\u003e\n\u003cp\u003eThe datasets used and analyzed during the current study are available from the corresponding author upon reasonable request.\u003c/p\u003e\n\u003ch3\u003eCompeting interests\u003c/h3\u003e\n\u003cp\u003eThe authors declare that they have no competing interests.\u003c/p\u003e\n\u003ch3\u003eFunding\u003c/h3\u003e\n\u003cp\u003eThe present study was supported by grants from the National Natural Science Foundation of China (81800961), Guangdong Basic and Applied Basic Research Foundation\u0026mdash;Natural Science Fund Project (2025A1515010904), and International Orthodontics Foundation Young Research (IOF2022Y06).\u003c/p\u003e\n\u003ch3\u003eAuthors\u0026rsquo; contributions\u003c/h3\u003e\n\u003cp\u003e\u003cstrong\u003eHau Man Chung:\u003c/strong\u003e Conceptualization, Methodology, Software, Formal analysis, Investigation, Data curation, Writing \u0026ndash; original draft, Visualization.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eJingjing Ke:\u003c/strong\u003e Resources, Data collection, Validation, Investigation.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eMengdan Zhang:\u003c/strong\u003e Formal analysis, 
Software, Validation, Visualization, Writing \u0026ndash; review \u0026amp; editing.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eLixian Kong:\u003c/strong\u003e Data annotation, Resources, Validation.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eJunming Zheng:\u003c/strong\u003e Methodology, Software, Validation.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eLusai Xiang:\u003c/strong\u003e Conceptualization, Supervision, Project administration, Funding acquisition, Writing \u0026ndash; review \u0026amp; editing, Clinical expertise.\u003c/p\u003e\n\u003cp\u003eAll authors read and approved the final manuscript.\u003c/p\u003e\n\u003ch3\u003eAcknowledgements\u003c/h3\u003e\n\u003cp\u003eNot applicable.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eJulien KC, Buschang PH, Campbell PM. Prevalence of white spot lesion formation during orthodontic treatment. Angle Orthod. 2013;83:641\u0026ndash;7.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLopatiene K, Borisovaite M, Lapenaite E. Prevention and Treatment of White Spot Lesions During and After Treatment with Fixed Orthodontic Appliances: a Systematic Literature Review. J Oral Maxillofacial Res. 2016;7.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eElcock C, Lath DL, Luty JD, Gallagher MG, Abdellatif A, B\u0026auml;ckman B, et al. The new Enamel Defects Index: testing and expansion. Eur J Oral Sci. 2006;114(Suppl 1):35\u0026ndash;8. discussion 39\u0026ndash;41, 379.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eChapman JA, Roberts WE, Eckert GJ, Kula KS, Gonz\u0026aacute;lez-Cabezas C. Risk factors for incidence and severity of white spot lesions during treatment with fixed orthodontic appliances. Am J Orthod Dentofac Orthop. 2010;138:188\u0026ndash;94.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eChauncey RT, Yu Q, Armbruster PC, Ballard RW. 
A survey of white spot lesion prevention and resolution in the US dental school curricula. J Dent Educ. 2023;87:1552\u0026ndash;8.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eBatra P, Tagra H, Katyal S. Artificial Intelligence in Teledentistry. Discoveries (Craiova). 2022;10:153.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eAskar H, Krois J, Rohrer C, Mertens S, Elhennawy K, Ottolenghi L et al. Detecting white spot lesions on dental photography using deep learning: A pilot study. J Dent. 2021;107 December 2020.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003ePanyarak W, Wantanajittikul K, Charuakkra A, Prapayasatok S, Suttapak W. Enhancing Caries Detection in Bitewing Radiographs Using YOLOv7. J Digit Imaging. 2023;36:2635\u0026ndash;47.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eBayraktar Y, Ayan E. Diagnosis of interproximal caries lesions with deep convolutional neural network in digital bitewing radiographs. Clin Oral Invest. 2022;26:623\u0026ndash;32.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eCasalegno F, Newton T, Daher R, Abdelaziz M, Lodi-Rizzini A, Sch\u0026uuml;rmann F, et al. Caries Detection with Near-Infrared Transillumination Using Deep Learning. J Dent Res. 2019;98:1227\u0026ndash;33.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eOzsunkar PS, \u0026Ouml;zen D\u0026Ccedil;, Abdelkarim AZ, Duman S, Uğurlu M, Demİr MR, et al. Detecting white spot lesions on post-orthodontic oral photographs using deep learning based on the YOLOv5x algorithm: a pilot study. BMC Oral Health. 2024;24:490.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLin H, Shi Z, Zou Z. Fully Convolutional Network With Task Partitioning for Inshore Ship Detection in Optical Remote Sensing Images. IEEE Geosci Remote Sens Lett. 
2017;14:1665\u0026ndash;9.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eBruegger J, Catana DI, Macovaz V, Valdenegro-Toro M, Sabatelli M, Zullich M. Large-image Object Detection for Fine-grained Recognition of Punches Patterns in Medieval Panel Painting. 2025.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eAkyon FC, Altinuc SO, Temizel A. Slicing Aided Hyper Inference and Fine-tuning for Small Object Detection. In: 2022 IEEE International Conference on Image Processing (ICIP). 2022. pp. 966\u0026ndash;70.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eK\u0026uuml;hnisch J, Meyer O, Hesenius M, Hickel R, Gruhn V. Caries Detection on Intraoral Images Using Artificial Intelligence. J Dent Res. 2022;101:158\u0026ndash;65.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eTareq A, Faisal MI, Islam MS, Rafa NS, Chowdhury T, Ahmed S, et al. Visual Diagnostics of Dental Caries through Deep Learning of Non-Standardised Photographs Using a Hybrid YOLO Ensemble and Transfer Learning Model. Int J Environ Res Public Health. 2023;20:5351.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eGitHub - HumanSignal/labelImg. LabelImg is now part of the Label Studio community. The popular image annotation tool created by Tzutalin is no longer actively being developed, but you can check out Label Studio, the open source data labeling tool for images, text, hypertext, audio, video and time-series data. GitHub. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://github.com/HumanSignal/labelImg\u003c/span\u003e\u003cspan address=\"https://github.com/HumanSignal/labelImg\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. Accessed 24 May 2025.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eBatista GEAPA, Prati RC, Monard MC. A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor Newsl. 
2004;6:20\u0026ndash;9.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eArpit D, Jastrzębski S, Ballas N, Krueger D, Bengio E, Kanwal MS et al. A Closer Look at Memorization in Deep Networks. In: Proceedings of the 34th International Conference on Machine Learning. PMLR; 2017. pp. 233\u0026ndash;42.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eRotondi MA, Donner A. A confidence interval approach to sample size estimation for interobserver agreement studies with multiple raters and outcomes. J Clin Epidemiol. 2012;65:778\u0026ndash;84.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLitjens G, Kooi T, Bejnordi BE, Setio AAA, Ciompi F, Ghafoorian M, et al. A survey on deep learning in medical image analysis. Med Image Anal. 2017;42:60\u0026ndash;88.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eSelvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization. Int J Comput Vis. 2020;128:336\u0026ndash;59.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eMa X, Ferguson EC, Jiang X, Savitz SI, Shams S. A multitask deep learning approach for pulmonary embolism detection and identification. Sci Rep. 2022;12:13087.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eNayak T, Chadaga K, Sampathila N, Mayrose H, Gokulkrishnan N, Bairy GM, et al. Deep learning based detection of monkeypox virus using skin lesion images. Med Nov Technol Devices. 2023;18:100243.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eWang H, Wang Z, Du M, Yang F, Zhang Z, Ding S et al. Score-CAM: Score-Weighted Visual Explanations for Convolutional Neural Networks. 2020.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eGildenblat J. contributors. PyTorch library for CAM methods. 2021.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLin T-Y, Maire M, Belongie S, Bourdev L, Girshick R, Hays J et al. 
Microsoft COCO: Common Objects in Context. 2015.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003ePark J-H, Hwang H-W, Moon J-H, Yu Y, Kim H, Her S-B, et al. Automated identification of cephalometric landmarks: Part 1\u0026mdash;Comparisons between the latest deep-learning methods YOLOV3 and SSD. Angle Orthod. 2019;89:903\u0026ndash;9.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eJang WS, Kim S, Yun PS, Jang HS, Seong YW, Yang HS, et al. Accurate detection for dental implant and peri-implant tissue by transfer learning of faster R-CNN: a diagnostic accuracy study. BMC Oral Health. 2022;22:591.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eChang J, Chang M-F, Angelov N, Hsu C-Y, Meng H-W, Sheng S, et al. Application of deep machine learning for the radiographic diagnosis of periodontitis. Clin Oral Investig. 2022;26:6629\u0026ndash;37.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eKrois J, Ekert T, Meinhold L, Golla T, Kharbot B, Wittemeier A, et al. Deep Learning for the Radiographic Detection of Periodontal Bone Loss. Sci Rep. 2019;9:8495.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003e\u0026Ccedil;elik B, Savaştaer EF, Kaya HI, \u0026Ccedil;elik ME. The role of deep learning for periapical lesion detection on panoramic radiographs. Dentomaxillofac Radiol. 2023;52:20230118.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eHu B, Liu Y, Chu P, Tong M, Kong Q. Small Object Detection via Pixel Level Balancing With Applications to Blood Cell Detection. Front Physiol. 2022;13.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eUzkent B, Yeh C, Ermon S. Efficient Object Detection in Large Images Using Deep Reinforcement Learning. In: 2020 IEEE Winter Conference on Applications of Computer Vision (WACV). Snowmass Village, CO, USA: IEEE; 2020. pp. 
1813\u0026ndash;22.\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":true,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"bmc-oral-health","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"ohea","sideBox":"Learn more about [BMC Oral Health](http://bmcoralhealth.biomedcentral.com/)","snPcode":"","submissionUrl":"https://www.editorialmanager.com/ohea/default.aspx","title":"BMC Oral Health","twitterHandle":"BMC_series","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"em","reportingPortfolio":"BMC Series","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"White spot lesions, object detection, deep learning, explainability analysis","lastPublishedDoi":"10.21203/rs.3.rs-7058696/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-7058696/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003e\u003cstrong\u003eBackground:\u003c/strong\u003e To develop a new deep learning model for detecting white spot lesions (WSLs), which are commonly observed in patients undergoing orthodontic treatment, and assess its accuracy.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eMethods\u003c/strong\u003e: A total of 653 intra-oral photographs of WSLs were collected and annotated. Our novel model, tooth-to-WSL You Only Look Once (TW-YOLO), and the original YOLOv5 model were fine-tuned and evaluated, with 457 photographs used for training; 130, for validation; and 66, for external testing. Cohen's kappa coefficient between model prediction and orthodontist annotation was used as the primary evaluation metric, and mean average precision (
mAP@0.5:0.95), average precision (
AP@0.5), F1 score, and accuracy were also evaluated. The score-CAM technique was used for explainability analysis.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eResults\u003c/strong\u003e: Cohen's kappa coefficient values were 0.76 and 0.62 for TW-YOLO and YOLOv5, respectively. The
mAP@0.5:0.95 was 0.51 for TW-YOLO and 0.45 for YOLOv5. Explainability analysis suggested that the TW-YOLO model could implicitly learn the distribution pattern of WSLs by shifting more attention toward these regions.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConclusion\u003c/strong\u003e: The novel TW-YOLO model demonstrated not only improved accuracy but also the potential to be applied in other related dentistry studies.\u003c/p\u003e","manuscriptTitle":"Tooth-to-white spot lesion YOLO: a novel model for white spot lesion detection","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-07-22 16:35:48","doi":"10.21203/rs.3.rs-7058696/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Revision requested","date":"2025-08-05T12:05:31+00:00","index":"","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-08-05T09:09:35+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-07-23T10:57:57+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"183061940989284431979297101300324023340","date":"2025-07-19T03:26:28+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"63749123367211554396400547513678112806","date":"2025-07-16T09:51:16+00:00","index":"hide","fulltext":""},{"type":"reviewersInvited","content":"","date":"2025-07-15T15:54:30+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2025-07-15T15:53:32+00:00","index":"","fulltext":""},{"type":"editorInvited","content":"","date":"2025-07-14T09:06:05+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2025-07-12T04:40:46+00:00","index":"","fulltext":""},{"type":"submitted","content":"BMC Oral Health","date":"2025-07-12T04:38:09+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"bmc-oral-health","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"ohea","sideBox":"Learn more about [BMC Oral Health](http://bmcoralhealth.biomedcentral.com/)","snPcode":"","submissionUrl":"https://www.editorialmanager.com/ohea/default.aspx","title":"BMC Oral Health","twitterHandle":"BMC_series","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"em","reportingPortfolio":"BMC Series","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"70cd2f09-5a46-4ea2-9a76-724ae97ce1ff","owner":[],"postedDate":"July 22nd, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"published-in-journal","subjectAreas":[],"tags":[],"updatedAt":"2025-10-13T16:04:20+00:00","versionOfRecord":{"articleIdentity":"rs-7058696","link":"https://doi.org/10.1186/s12903-025-06936-w","journal":{"identity":"bmc-oral-health","isVorOnly":false,"title":"BMC Oral Health"},"publishedOn":"2025-10-09 15:57:47","publishedOnDateReadable":"October 9th, 2025"},"versionCreatedAt":"2025-07-22 16:35:48","video":"","vorDoi":"10.1186/s12903-025-06936-w","vorDoiUrl":"https://doi.org/10.1186/s12903-025-06936-w","workflowStages":[]},"version":"v1","identity":"rs-7058696","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-7058696","identity":"rs-7058696","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}