SpineScan: a deep learning model for lumbar spine MRI annotation and Pfirrmann grading assessment

Research Article (Research Square preprint)

Aleksandr Minin (Lomonosov Moscow State University), Olga Leonova (Central Scientific Research Institute of Traumatology and Orthopedics, corresponding author), Aleksandr Krutko (Central Scientific Research Institute of Traumatology and Orthopedics), Elizaveta Elgaeva (Institute of Cytology and Genetics), Denis Antonets (Lomonosov Moscow State University), Dmitriy Shtokalo (Lomonosov Moscow State University), Yakov Tsepilov (Institute of Cytology and Genetics)

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-6914052/v1
This work is licensed under a CC BY 4.0 License.
Status: Published. Journal publication published 03 Nov, 2025. Read the published version in European Spine Journal (https://doi.org/10.1007/s00586-025-09537-x).
Version 1 posted 13

Abstract

Purpose
While recent advances in deep learning have enabled automated Pfirrmann grading of intervertebral disc degeneration (IDD), many models remain inaccessible due to proprietary restrictions. This study aimed to develop and validate a convolutional neural network (CNN) for automated Pfirrmann grading using a diverse clinical dataset, and to compare our model’s performance with previously published results.

Methods
We trained a CNN-based model using the YOLOv8x architecture on two datasets: a well-curated Russian lumbar disc degeneration cohort (RuDDS) and an open-access dataset, totaling 484 lumbar MRI scans. Ground truth grading was provided by expert radiologists. The model was designed to simultaneously detect intervertebral discs and classify degeneration grades from single MRI slices. Performance was evaluated using standard metrics, including precision, recall, and mean average precision (mAP) across Pfirrmann grades I to V.

Results
Our model achieved a predictive accuracy between 0.78 and 0.82 depending on lumbar level. The highest performance was observed for Grade IV discs (mAP50 = 0.872), while performance for Grade V was lower (mAP50-95 = 0.525), likely due to poor contrast and indistinct boundaries in highly degenerated discs. Overall, the model demonstrated a precision of 0.75 and a recall of 0.808. Comparison with previous studies showed that our results are consistent with expert-level performance.

Conclusions
The developed model shows strong potential for automated grading of lumbar disc degeneration and performs comparably to expert radiologists in most cases. Our findings support the clinical applicability of AI-assisted grading systems while emphasizing the need for standardized imaging and evaluation protocols.

Keywords: disc degeneration, lumbar spine, MRI, deep learning, convolutional neural network

Figures
Figure 1. Variability in MRI scan quality across the dataset.
Figure 2. MRI image annotation using bounding boxes.
Figure 3. Training dynamics and evaluation metrics.
Figure 4. Confusion matrix for model evaluation.

Introduction

Signs of intervertebral disc degeneration, such as decreased signal intensity, structural inhomogeneity and reduced disc height, are identified on MRI [1]. Automated MRI analysis is key to improving the diagnostic value of MRI by providing a more objective and quantitative interpretation of images. Thanks to recent advances in machine learning and artificial intelligence, a few models for automated MRI analysis have been developed. Some of these models focus solely on spinal segmentation [2, 3].
Others also perform grading of disc degeneration on the Pfirrmann scale [4–8].

The Pfirrmann classification is the one most widely used by clinicians and radiologists due to its simplicity and convenience, though it has some well-known limitations [9]. Inter-observer agreement for this classification is fairly high, with inter-observer coefficients ranging from 0.69 to 0.81 [1], and intra-observer agreement is excellent, with an ICC of 0.86 (0.83–0.89) [10] and 0.84–0.90 [1]. The average prediction accuracy of the automated grading models is significantly higher than that of medical experts, demonstrating the high potential of such systems to assist clinicians in medical decision making.

However, all grading models have limitations in terms of sharing prediction algorithms due to intellectual property concerns, making it difficult to compare or combine different models. This highlights the need to obtain similar models in other cohorts.

Given the high predictive value reported for these models, we set out to develop a comparable model on similar data and to replicate the reported prediction accuracy. Previously, we recruited a well-curated, disease-oriented Russian disc degeneration study (RuDDS) cohort of patients from two Russian medical centers to facilitate omics studies of lumbar disc degeneration disease [11]. Each patient has high-quality lumbar MRI scans (1.5 Tesla) and information about different parameters of disc degeneration (e.g. osteophytes, Pfirrmann score, Jarosh score and herniation) assessed by a radiologist.

The purpose of this study was to develop and assess a convolutional neural network (CNN) model to detect lumbar intervertebral discs and to grade disc degeneration according to the Pfirrmann grading system.

Materials and Methods

Data set and annotation

We used two MRI data sets.
The first dataset included 243 MRI images of the lumbar spine from the RuDDS study [11], which contains MRI scans of symptomatic patients with degenerative spine diseases. The second dataset consisted of 241 lumbar spine MRI images selected randomly from a publicly available archive [12]. To facilitate an objective assessment of model quality, the combined dataset was partitioned into training (363 MRI studies) and validation (121 MRI studies) subsets. Disc degeneration was assessed using the 5-grade Pfirrmann classification, where grade 1 indicates a normal disc and grade 5 indicates a severely degenerated disc [1]. The assessment and segmentation were conducted by clinicians with over 10 years of experience (OL and AK).

Image Preprocessing

Sagittal T2-weighted MRI images were used and preprocessed to ensure uniform intensity scales across the dataset. First, pixel intensities were clipped to the range between the 1st and 99th percentiles of the image data, mitigating the impact of extreme intensity outliers. The clipped intensities were then linearly scaled to span the full grayscale range from 0 to 255, as expected by conventional image processing algorithms used in computer vision. This normalization ensures consistency in the data, which is essential for the subsequent analytical stages involving deep learning models.

Model description

Given the challenges in distinguishing degenerated discs using traditional segmentation approaches, we decided to use a detection-based approach. For this purpose, we adopted the YOLOv8 model from the Ultralytics library (https://pypi.org/project/ultralytics/, version 8.1.5), which is effective in real-time object detection tasks (https://arxiv.org/abs/1506.02640, https://arxiv.org/abs/2004.10934).
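
The intensity normalization described under Image Preprocessing can be illustrated with a minimal NumPy sketch. It is not the authors' code: the function name `normalize_intensity`, the per-image computation of percentiles, and the handling of constant images are assumptions made for illustration.

```python
import numpy as np


def normalize_intensity(image, low_pct=1.0, high_pct=99.0):
    """Clip an image to the given percentiles and rescale it to 0..255 (uint8).

    Hypothetical helper: percentiles are computed per image, which the
    text does not state explicitly.
    """
    img = np.asarray(image, dtype=np.float64)
    lo, hi = np.percentile(img, [low_pct, high_pct])
    if hi <= lo:
        # Constant (or near-constant) image: there is no range to stretch.
        return np.zeros(img.shape, dtype=np.uint8)
    clipped = np.clip(img, lo, hi)              # suppress extreme outliers
    scaled = (clipped - lo) / (hi - lo) * 255.0  # linear stretch to 0..255
    return np.round(scaled).astype(np.uint8)
```

Clipping before scaling means a single very bright pixel cannot compress the rest of the image into a narrow band of gray levels.
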
Employing YOLOv8 enables us to address the limitations of previous segmentation methods, facilitating more reliable identification of spinal conditions across our dataset. This approach aims to improve the accuracy of our analyses and streamline the workflow for clinical assessments, where time and precision are of paramount importance. The maximum number of detected objects per image was restricted to five (max_det: 5), so that the model focused on identifying a fixed set of critical structures.

Data Augmentation

Due to the relatively small sample size in our study, we applied a series of random data augmentation techniques to prevent overfitting. These included horizontal flipping with a probability of 0.5, scaling by a factor of 0.2, rotation within a range of -45° to +45°, and translation by up to 0.1 of the image dimensions. Examples of these augmentations are shown in Figure 1, Supplementary files. We preserved the original hue, saturation, and value settings to maintain the inherent brightness and contrast of the MRI images, which are essential for accurate medical diagnosis. This is particularly important for discs graded 1-3, where differences in hue and tissue homogeneity are more indicative of disc health than shape changes alone. In this way, the augmentations introduce variability without distorting critical diagnostic features.

Optimization Strategy

To prioritize localization accuracy before refining class discrimination, the box loss gain was increased to 10.0 (box: 10.0), while the classification loss gain was set to a moderate level of 0.5 (cls: 0.5). This approach emphasized precise disc localization first, establishing a foundation for subsequent improvements in Pfirrmann grade classification. Additionally, the distribution focal loss gain was set to 1.5 (dfl: 1.5).
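
The settings reported above can be collected into a set of Ultralytics training overrides, as in the sketch below. The mapping is an assumption: in particular, interpreting "preserved hue, saturation, and value" as zero HSV augmentation gains is ours, and the dataset file name `spine.yaml` is hypothetical. The authors' actual configuration is the published args.yaml and may differ.

```python
# Hypothetical mapping of the reported settings onto Ultralytics training
# arguments; the published args.yaml is the authoritative configuration.
TRAIN_OVERRIDES = {
    "max_det": 5,      # at most five detections per image (five lumbar discs)
    "fliplr": 0.5,     # horizontal flip probability
    "scale": 0.2,      # random scaling gain
    "degrees": 45.0,   # random rotation within +/- 45 degrees
    "translate": 0.1,  # random translation, fraction of image size
    "hsv_h": 0.0,      # assumption: no hue augmentation
    "hsv_s": 0.0,      # assumption: no saturation augmentation
    "hsv_v": 0.0,      # assumption: no value (brightness) augmentation
    "box": 10.0,       # box loss gain
    "cls": 0.5,        # classification loss gain
    "dfl": 1.5,        # distribution focal loss gain
}


def train_sketch(data_yaml="spine.yaml", weights="yolov8x.pt"):
    """Start training with the overrides above (requires `ultralytics`)."""
    from ultralytics import YOLO  # imported lazily so the dict is usable alone

    model = YOLO(weights)
    return model.train(data=data_yaml, **TRAIN_OVERRIDES)
```

Keeping the overrides in one dictionary makes it easy to compare them against the published args.yaml before running anything.
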
Key evaluation metrics, including Precision (P), Recall (R), mean Average Precision at IoU=0.5 (mAP50), and mean Average Precision across IoU thresholds from 0.5 to 0.95 (mAP50-95), were tracked to assess the model’s ability to balance detection accuracy and classification performance. These metrics varied during the early phases of training and stabilized as training progressed, indicating a successful learning process on the heterogeneous MRI dataset.

In sum, these modifications (args.yaml configuration, https://doi.org/10.6084/m9.figshare.29322854.v1) were implemented to accommodate the anatomical constraints, clinical demands, and data-driven nuances intrinsic to lumbar spine MRI analysis, thereby improving the YOLOv8x model’s capacity to detect and characterize intervertebral disc degeneration.

Results

Dataset

A total of 484 patients were included in this study, with their MRI data sourced from two distinct datasets. The dataset comprises a diverse collection of MRI images with significant variations in quality and pixel intensity values, as illustrated in Figure 1. The image dimensions also vary considerably, with heights ranging from 256 to 1068 pixels and widths from 240 to 1008 pixels. We evaluated slice averaging strategies for model performance (see Supplementary notes, 1) and decided to use the central ± 2 slice approach. The intervertebral discs at the L1-L2, L2-L3, L3-L4, L4-L5, and L5-S1 levels were delineated using bounding boxes, as shown in Figure 2.

Analysis of Pfirrmann Grade Distribution Across Intervertebral Discs

Analysis of the MRI scans revealed an imbalance in the distribution of Pfirrmann grades among the intervertebral discs (Figure 2, Supplementary files).
Specifically, the dataset included 379 discs classified as grade I, 699 as grade II, 519 as grade III, 691 as grade IV, and 132 as grade V.

Deep Learning Training

Training Process Monitoring

Throughout the training process, performance was continuously monitored using a variety of loss and metric curves, providing key insights into the model’s learning trajectory (Figure 3). The evolution of train/box_loss, train/cls_loss, and train/dfl_loss demonstrated the model’s progressive improvement in disc localization, classification, and bounding box refinement. These losses gradually decreased over the course of training, suggesting effective model adaptation. The corresponding validation losses (val/box_loss, val/cls_loss, and val/dfl_loss) confirmed that these improvements generalized beyond the training data, indicating the model’s ability to perform reliably on unseen MRI scans.

Model Performance Evaluation

Table 1 presents the performance metrics of the YOLOv8x model on the validation set, including Precision (P), Recall (R), mean Average Precision at IoU=0.5 (mAP50), and the averaged mAP across IoU thresholds from 0.5 to 0.95 (mAP50-95) for all Pfirrmann grades of disc degeneration. The overall results indicate a well-balanced detection and classification system, with P = 0.75 and R = 0.808, reflecting robust detection capabilities across the dataset.

Table 1. Model performance metrics across Pfirrmann grades

Class      Instances  P      R      mAP50  mAP50-95
all        605        0.75   0.808  0.792  0.667
Grade I    102        0.763  0.794  0.822  0.735
Grade II   173        0.732  0.821  0.785  0.707
Grade III  120        0.603  0.792  0.718  0.634
Grade IV   170        0.831  0.839  0.872  0.735
Grade V    40         0.819  0.793  0.764  0.525

Discs graded as I and IV show particularly high performance: mAP50 = 0.872 for Grade IV and mAP50 = 0.822 for Grade I. Conversely, Grades II and III present more challenges. Although Recall remains high for these grades, signifying that most discs are detected, Precision and mAP50-95 are lower in comparison.
The mAP50-95 values for Grade II (0.707) and Grade III (0.634) reflect this uncertainty, particularly as stricter IoU criteria reduce the model’s ability to consistently localize the discs.

Grade V discs, representing the most severe degeneration, showed a different pattern. Despite relatively high Precision (0.819) and Recall (0.793), the mAP50-95 drops markedly to 0.525. This decline suggests that although the model detects severely degenerated discs well under moderate IoU thresholds, it struggles with precise localization and consistent performance as the IoU threshold increases.

The confusion matrix (Figure 4) shows that the model correctly classifies most disc grades, as reflected by the high diagonal values: Grade I (0.75), Grade II (0.75), Grade III (0.75), Grade IV (0.77), and Grade V (0.82). Misclassifications occur primarily between adjacent grades, particularly between intermediate degeneration stages (Grades II and III) and between severe stages (Grades IV and V). There are also some difficulties in distinguishing disc structures from background noise. Given the substantial inter-observer agreement (0.69–0.81) and excellent intra-observer reliability (ICC 0.84–0.90), these misclassifications are likely attributable to the inherent complexity and subjectivity of grading disc degeneration rather than to errors in the model’s performance. This highlights the need for further refinement of both model training and grading protocols to improve classification accuracy in borderline cases.

Web Service

To facilitate the practical application of our trained model, we developed SpineScan (spine-scan.science.nprog.ru), a web-based service for automated analysis of lumbar MRI scans. SpineScan provides clinicians with an intuitive platform for evaluating spinal health through image processing and machine learning techniques (for a detailed description, see Supplementary notes, 2).
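
As a cross-check on Table 1, the short sketch below recomputes the "all" row from the per-grade rows. It assumes that the "all" row is the unweighted (macro) mean over the five grades; the text does not state this, but the reported numbers are consistent with it. The names `TABLE_1`, `REPORTED_ALL` and `macro_average` are illustrative.

```python
# Per-grade validation metrics copied from Table 1:
# (instances, P, R, mAP50, mAP50-95)
TABLE_1 = {
    "Grade I": (102, 0.763, 0.794, 0.822, 0.735),
    "Grade II": (173, 0.732, 0.821, 0.785, 0.707),
    "Grade III": (120, 0.603, 0.792, 0.718, 0.634),
    "Grade IV": (170, 0.831, 0.839, 0.872, 0.735),
    "Grade V": (40, 0.819, 0.793, 0.764, 0.525),
}
REPORTED_ALL = (605, 0.75, 0.808, 0.792, 0.667)


def macro_average(table):
    """Return total instances and the unweighted mean of each metric column."""
    columns = list(zip(*table.values()))
    total_instances = sum(columns[0])
    means = [round(sum(col) / len(col), 3) for col in columns[1:]]
    return total_instances, means


if __name__ == "__main__":
    total, means = macro_average(TABLE_1)
    print("instances:", total, "reported:", REPORTED_ALL[0])
    for name, mean, reported in zip(
        ("P", "R", "mAP50", "mAP50-95"), means, REPORTED_ALL[1:]
    ):
        print(f"{name}: macro mean {mean}, reported {reported}")
```

Because the averages are unweighted, the 40 Grade V discs count as much toward the overall figures as the 173 Grade II discs, which is worth keeping in mind when reading the overall mAP50-95.
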

Discussion

Here we developed an automated system and web service for classifying MRI grades of disc degeneration, based on two datasets (RuDDS and an open-access dataset) comprising 484 lumbar scans. The predictive value of our model was 0.78–0.82 depending on the lumbar level. These results are comparable with those of experts, and the model can be used in practice. This is especially relevant when a large amount of MRI data must be analyzed.

Many current studies are aimed at using CNNs for Pfirrmann grading, and in many of them the CNN models achieve results comparable to those of experts. For example, the SpineNet project and its updated version, working with the Twins UK, Genodisc and other databases, showed the largest improvements in Pfirrmann grading accuracy, from 71.0% to 73.0% [13]. This model provides radiological grading that incorporates context from multiple vertebrae and sequences, as a real radiologist would. Similar results were reported by Nikpasand et al. [14]: for IVD images, the CNN-generated Pfirrmann scores agreed with the lead grader on 78% of the images, which was significantly better than the human graders. The Fleiss kappa statistic for the CNN was 0.68, which was, again, much higher than that between the human graders and indicative of substantial agreement. Baur et al. [15], using a combination of a graph neural network and a convolutional neural network, reported moderate inter-rater agreement for the Pfirrmann grading system, with Cohen’s kappa values in the range 0.455–0.565. These authors also see promise in 3D models.

Liawrungrueang et al. [6, 16] reported impressive results, with their deep CNN model detecting and classifying lumbar IDD with over 95% accuracy: Grade I, 0.98; Grade II, 1.0; Grade III, 0.99; Grade IV, 0.99; Grade V, 1.0. We used the same model (YOLOv8) and a comparably sized training set, but were unable to replicate these results.
Our model achieved a predictive accuracy of 0.78–0.82 depending on the lumbar level. We tested various training settings and sample sizes, but these changes had little effect on prediction quality. While our results are consistent with other studies, we remain cautious about the accuracy estimates reported by Liawrungrueang et al.

This work has several limitations. First, the heterogeneity of MRI images, which often results from differences in MRI machines and imaging protocols across medical facilities, presents significant challenges in standardizing data for analysis. However, it also offers an opportunity to develop robust deep learning models capable of generalizing across a wide range of imaging conditions.

Second, our model does not separate disc segmentation from grading. In some cases, discs with a Pfirrmann grade of 5 are so severely degenerated that their height is nearly absent and adjacent vertebrae appear fused. Under these conditions, discs are not visually distinguishable, making accurate segmentation impractical. Furthermore, methods that rely on CSF pixel intensity, such as those described in [2], are not applicable to these images. This highlights the need for adaptive or alternative analytical methods to handle the significant variability in our dataset.

Another limitation is that clinicians assess disc degeneration by integrating impressions from multiple MRI slices, while our model analyzes only a single slice. Disc signal intensity and degeneration can vary between slices. We believe there is strong potential in using 2.5D or 3D models to address the limitations of our current 2D approach.

Conclusion

The YOLOv8x model demonstrated robust performance in detecting and classifying lumbar intervertebral discs across a diverse MRI dataset.
With an overall Precision of 0.75 and Recall of 0.808, the model effectively identifies discs with mild to moderate degeneration (Grades I-IV), particularly excelling with Grade IV, which achieved the highest mAP50 of 0.872. However, challenges arose with the most severely degenerated discs (Grade V), where localization accuracy decreased at higher IoU thresholds (mAP50-95 = 0.525), likely due to low contrast and irregular boundaries. Misclassifications, primarily between adjacent grades, can be attributed to the inherent complexity of grading disc degeneration, a factor supported by substantial inter- and intra-observer agreement. These findings underscore the need for further model refinement and improved grading protocols to enhance performance, especially for borderline cases. The YOLOv8x model offers a solid foundation for clinical applications, with continued advancements necessary to improve accuracy, particularly in the classification of severe degeneration.

Declarations

Author Contribution
AM, OL and YT contributed to the study concept. Material preparation was performed by OL and AK; data collection and analysis were performed by AM, OL, AK and YT. Methodology, including model selection and discussions, was performed by AM, EE, DA and DS. AM, OL and YT drafted the manuscript. All authors critically reviewed and approved the final manuscript. AM and OL contributed equally.

Data Availability
The YOLOv8x model configuration (args.yaml), implemented to accommodate the anatomical constraints, clinical demands, and data-driven nuances intrinsic to lumbar spine MRI analysis, is available at https://doi.org/10.6084/m9.figshare.29322854.v1. SpineScan (spine-scan.science.nprog.ru) is a web-based service designed for automated analysis of lumbar MRI.

References

1. Pfirrmann CW, Metzdorf A, Zanetti M, Hodler J, Boos N. Magnetic resonance classification of lumbar intervertebral disc degeneration. Spine (Phila Pa 1976). 2001 Sep 1;26(17):1873–8. DOI: 10.1097/00007632-200109010-00011
2. van der Graaf JW, van Hooff ML, Buckens CFM, Rutten M, van Susante JLC, Kroeze RJ, et al. Lumbar spine segmentation in MR images: a dataset and a public benchmark. Sci Data. 2024 Mar 2;11(1):264. DOI: 10.1038/s41597-024-03090-w
3. Natalia F, Meidia H, Afriliana N, Al-Kafri AS, Sudirman S, Simpson A, et al. Development of Ground Truth Data for Automatic Lumbar Spine MRI Image Segmentation. In: 2018 IEEE 20th International Conference on High Performance Computing and Communications; IEEE 16th International Conference on Smart City; IEEE 4th International Conference on Data Science and Systems (HPCC/SmartCity/DSS). IEEE; 2018. p. 1449–54. DOI: 10.1109/HPCC/SmartCity/DSS.2018.00239
4. Zheng H-D, Sun Y-L, Kong D-W, Yin M-C, Chen J, Lin Y-P, et al. Deep learning-based high-accuracy quantitation for lumbar intervertebral disc degeneration from MRI. Nat Commun. 2022 Feb 11;13(1):841. DOI: 10.1038/s41467-022-28387-5
5. Jamaludin A, Kadir T, Zisserman A, McCall I, Williams FMK, Lang H, et al. ISSLS PRIZE in Clinical Science 2023: comparison of degenerative MRI features of the intervertebral disc between those with and without chronic low back pain. An exploratory study of two large female populations using automated annotation. Eur Spine J. 2023 May 30;32(5):1504–16. DOI: 10.1007/s00586-023-07604-9
6. Liawrungrueang W, Kim P, Kotheeranurak V, Jitpakdee K, Sarasombath P. Automatic Detection, Classification, and Grading of Lumbar Intervertebral Disc Degeneration Using an Artificial Neural Network Model. Diagnostics (Basel, Switzerland). 2023 Feb 10;13(4). DOI: 10.3390/diagnostics13040663
7. Niemeyer F, Galbusera F, Tao Y, Kienle A, Beer M, Wilke H-J. A Deep Learning Model for the Accurate and Reliable Classification of Disc Degeneration Based on MRI Data. Invest Radiol. 2021 Feb 1;56(2):78–85. DOI: 10.1097/RLI.0000000000000709
8. Natalia F, Sudirman S, Ruslim D, Al-Kafri A. Lumbar spine MRI annotation with intervertebral disc height and Pfirrmann grade predictions. PLoS One. 2024;19(5):e0302067. DOI: 10.1371/journal.pone.0302067
9. Wang YXJ. Several concerns on grading lumbar disc degeneration on MR image with Pfirrmann criteria. J Orthop Transl. 2022 Jan;32:101–2. DOI: 10.1016/j.jot.2021.12.003
10. Urrutia J, Besa P, Campos M, Cikutovic P, Cabezon M, Molina M, et al. The Pfirrmann classification of lumbar intervertebral disc degeneration: an independent inter- and intra-observer agreement assessment. Eur Spine J. 2016;25(9):2728–33.
11. Leonova ON, Elgaeva EE, Golubeva TS, Peleganchuk AV, Krutko AV, Aulchenko YS, et al. A protocol for recruiting and analyzing the disease-oriented Russian disc degeneration study (RuDDS) biobank for functional omics studies of lumbar disc degeneration. Abdelbasset WK, editor. PLoS One. 2022 May 13;17(5):e0267384. DOI: 10.1371/journal.pone.0267384
12. Sudirman S, Al Kafri A, Natalia F, Meidia H, Afriliana N, Al-Rashdan W, et al. Lumbar Spine MRI Dataset. 2019. https://data.mendeley.com/datasets/k57fr854j2/2
13. Windsor R, Jamaludin A, Kadir T, Zisserman A. Automated detection, labelling and radiological grading of clinical spinal MRIs. Sci Rep. 2024 Jul 1;14(1):14993. DOI: 10.1038/s41598-024-64580-w
14. Nikpasand M, Middendorf JM, Ella VA, Jones KE, Ladd B, Takahashi T, et al. Automated magnetic resonance imaging-based grading of the lumbar intervertebral disc and facet joints. JOR Spine. 2024 Sep;7(3):e1353. DOI: 10.1002/jsp2.1353
15. Baur D, Bieck R, Berger J, Schöfer P, Stelzner T, Neumann J, et al. Automated Three-Dimensional Imaging and Pfirrmann Classification of Intervertebral Disc Using a Graphical Neural Network in Sagittal Magnetic Resonance Imaging of the Lumbar Spine. J Imaging Informatics Med. 2024 Sep 12. DOI: 10.1007/s10278-024-01251-2
16. Liawrungrueang W, Cholamjiak W, Sarasombath P, Jitpakdee K, Kotheeranurak V. Artificial Intelligence Classification for Detecting and Grading Lumbar Intervertebral Disc Degeneration. Spine Surg Relat Res. 2024 Nov 27;8(6):552–9. DOI: 10.22603/ssrr.2024-0154

Additional Declarations
No competing interests reported.

Supplementary Files
SpineScanSupplementarynotes.docx
SpineScanSupplementaryfiles.docx

Editorial decision: Revision requested 02 Sep, 2025
Reviews received at journal 29 Aug, 2025
Reviewers agreed at journal 27 Aug, 2025
Reviewers agreed at journal 25 Aug, 2025
Reviewers agreed at journal 08 Aug, 2025
Reviews received at journal 28 Jun, 2025
Reviews received at journal 27 Jun, 2025
Reviewers agreed at journal 26 Jun, 2025
Reviewers agreed at journal 25 Jun, 2025
Reviewers invited by journal 25 Jun, 2025
Editor assigned by journal 23 Jun, 2025
Submission checks completed at journal 23 Jun, 2025
First submitted to journal 17 Jun, 2025
Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-6914052","acceptedTermsAndConditions":true,"allowDirectSubmit":false,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":477645733,"identity":"b18bca9c-243e-4833-a12e-21af922f2e2d","order_by":0,"name":"Aleksandr Minin","email":"","orcid":"","institution":"Lomonosov Moscow State University","correspondingAuthor":false,"prefix":"","firstName":"Aleksandr","middleName":"","lastName":"Minin","suffix":""},{"id":477645734,"identity":"ab76aedd-8a38-4a06-be9b-3a7f8ea333c7","order_by":1,"name":"Olga Leonova","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAA0UlEQVRIiWNgGAWjYFAC5gYIzQ6iDSyI0cII1cJzAKRFghQtEglgkrAGfunGxgcfd9gk9s98fnXDjwIJBv727gS8WiTnHGw2nHkmLXHG7Zyymz1Ah0mcObsBrxaDG4lt0rxth40Zbuek3eABajGQyCWopf03SIv8zTNpN/8QqaWNGahFzuAG+7HbRNkiOSOxWXJmW5qc4ZkcttsyBhI8BP3CL5F88MPHNhseuePHn91888dGjr+9F78WJMBjACaJVQ4C7A9IUT0KRsEoGAUjCAAAVbRIjlLnBO8AAAAASUVORK5CYII=","orcid":"","institution":"Central Scientific Research Institute of Traumatology and Orthopedics","correspondingAuthor":true,"prefix":"","firstName":"Olga","middleName":"","lastName":"Leonova","suffix":""},{"id":477645735,"identity":"4e879554-f4fc-4cfd-a4d9-1a2d93d269e0","order_by":2,"name":"Aleksandr Krutko","email":"","orcid":"","institution":"Central Scientific Research Institute of Traumatology and Orthopedics","correspondingAuthor":false,"prefix":"","firstName":"Aleksandr","middleName":"","lastName":"Krutko","suffix":""},{"id":477645736,"identity":"8e13c7f7-f18f-4ff2-a2e8-cb381e5c71a9","order_by":3,"name":"Elizaveta 
Elgaeva","email":"","orcid":"","institution":"Institute of Cytology and Genetics","correspondingAuthor":false,"prefix":"","firstName":"Elizaveta","middleName":"","lastName":"Elgaeva","suffix":""},{"id":477645737,"identity":"9d381e57-8c68-4e44-aee6-7186a890d7af","order_by":4,"name":"Denis Antonets","email":"","orcid":"","institution":"Lomonosov Moscow State University","correspondingAuthor":false,"prefix":"","firstName":"Denis","middleName":"","lastName":"Antonets","suffix":""},{"id":477645738,"identity":"67778708-91e4-4050-9803-4bbb749a40bb","order_by":5,"name":"Dmitriy Shtokalo","email":"","orcid":"","institution":"Lomonosov Moscow State University","correspondingAuthor":false,"prefix":"","firstName":"Dmitriy","middleName":"","lastName":"Shtokalo","suffix":""},{"id":477645739,"identity":"85fd9fe4-4d1d-4eed-b205-8fdc4e91b24c","order_by":6,"name":"Yakov Tsepilov","email":"","orcid":"","institution":"Institute of Cytology and Genetics","correspondingAuthor":false,"prefix":"","firstName":"Yakov","middleName":"","lastName":"Tsepilov","suffix":""}],"badges":[],"createdAt":"2025-06-17 11:53:24","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-6914052/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-6914052/v1","draftVersion":[],"editorialEvents":[{"content":"https://doi.org/10.1007/s00586-025-09537-x","type":"published","date":"2025-11-03T15:56:59+00:00"}],"editorialNote":"","failedWorkflow":false,"files":[{"id":85742476,"identity":"0aaae51c-a252-4a9b-accf-53926be3f6e7","added_by":"auto","created_at":"2025-07-01 09:06:55","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":588388,"visible":true,"origin":"","legend":"\u003cp\u003eVariability in MRI scan quality across the 
dataset.\u003c/p\u003e","description":"","filename":"1.png","url":"https://assets-eu.researchsquare.com/files/rs-6914052/v1/d7acaee36e29f9895e9e75dd.png"},{"id":85742479,"identity":"fe92ae16-f8a8-4320-bab4-769297794372","added_by":"auto","created_at":"2025-07-01 09:06:55","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":383182,"visible":true,"origin":"","legend":"\u003cp\u003eMRI Image Annotation Using Bounding Boxes\u003c/p\u003e","description":"","filename":"2.png","url":"https://assets-eu.researchsquare.com/files/rs-6914052/v1/843665519d263023e48819e3.png"},{"id":85742483,"identity":"0b315f93-7619-4692-8e3b-eb9df958acd3","added_by":"auto","created_at":"2025-07-01 09:06:55","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":154300,"visible":true,"origin":"","legend":"\u003cp\u003eTraining Dynamics and Evaluation Metrics\u003c/p\u003e","description":"","filename":"3.png","url":"https://assets-eu.researchsquare.com/files/rs-6914052/v1/cd6d2be5e77f82e75c90fcd3.png"},{"id":85742477,"identity":"c5d930d6-3094-4d6f-93c7-c3b17c8e0573","added_by":"auto","created_at":"2025-07-01 09:06:55","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":49483,"visible":true,"origin":"","legend":"\u003cp\u003eConfusion Matrix for Model Evaluation\u003c/p\u003e","description":"","filename":"4.png","url":"https://assets-eu.researchsquare.com/files/rs-6914052/v1/2e61338ca8d48a2ac19e1e18.png"},{"id":95563995,"identity":"493889ba-2697-435e-b58a-9f607e076365","added_by":"auto","created_at":"2025-11-10 
16:06:15","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":2092582,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-6914052/v1/8137553e-b6e0-4dc9-806d-4893f63df9d6.pdf"},{"id":85742481,"identity":"a97a6201-2359-45a3-8335-869115e4597f","added_by":"auto","created_at":"2025-07-01 09:06:55","extension":"docx","order_by":0,"title":"","display":"","copyAsset":false,"role":"supplement","size":645360,"visible":true,"origin":"","legend":"","description":"","filename":"SpineScanSupplementarynotes.docx","url":"https://assets-eu.researchsquare.com/files/rs-6914052/v1/8edfa48ee556c657f0f3259d.docx"},{"id":85745051,"identity":"1a92b13a-b237-4f38-841d-b9b9b7adff79","added_by":"auto","created_at":"2025-07-01 09:22:55","extension":"docx","order_by":1,"title":"","display":"","copyAsset":false,"role":"supplement","size":4545238,"visible":true,"origin":"","legend":"","description":"","filename":"SpineScanSupplementaryfiles.docx","url":"https://assets-eu.researchsquare.com/files/rs-6914052/v1/b1aa9f433e5061337b6c4936.docx"}],"financialInterests":"No competing interests reported.","formattedTitle":"SpineScan: a deep learning model for lumbar spine MRI annotation and Pfirrmann grading assessment","fulltext":[{"header":"Introduction","content":"\u003cp\u003eIntervertebral disc degeneration as decreased signal intensity, structural inhomogeneity and reduction in disc height, are identified by MRI [\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e]. Automated MRI analysis is key to improving the diagnostic value of MRI by providing a more objective and quantitative interpretation of images. Thanks to recent advances in machine learning and artificial intelligence, a few models for automated MRI analysis have been developed. 
Some of these models focus solely on spinal segmentation [\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e, \u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e]. Others also perform grading of the disc degeneration on the Pfirrmann scale [\u003cspan additionalcitationids=\"CR5 CR6 CR7\" citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eThe Pfirrmann classification is the one most widely used by clinicians and radiologists due to its simplicity and convenience, though it has some well-known limitations [\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e]. The inter-observer agreement for this classification is fairly high, with inter-observer coefficients ranging from 0.69 to 0.81 [\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e], and intra-observer agreement is excellent, with ICCs of 0.86 (0.83\u0026ndash;0.89) [\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e] and 0.84\u0026ndash;0.90 [\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e]. The average prediction accuracy of the automated grading models is significantly higher than that of medical experts, demonstrating the high potential of such systems to assist clinicians in medical decision making.\u003c/p\u003e \u003cp\u003eHowever, all grading models have limitations in terms of sharing prediction algorithms due to intellectual property concerns, making it difficult to compare or combine different models. This highlights the need to obtain similar models in other cohorts.\u003c/p\u003e \u003cp\u003eGiven the high predictive value reported for these models, we sought to develop a comparable model on similar data and to replicate the reported prediction accuracy. 
Previously, we recruited a well-curated disease-oriented Russian disc degeneration study (RuDDS) cohort of patients from two Russian medical centers to facilitate the omics studies of lumbar disc degeneration disease [\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e]. Each patient has high-quality lumbar MRI scans (1.5 Tesla), and information about different parameters (e.g. osteophytes, Pfirrmann score, Jarosh score and herniation) of disc degeneration assessed by a radiologist.\u003c/p\u003e \u003cp\u003eThe purpose of this study was to develop and assess a convolutional neural network (CNN) model to detect intervertebral lumbar discs and to grade the disc degeneration according to the Pfirrmann grading system.\u003c/p\u003e"},{"header":"Materials and Methods","content":"\u003ch2\u003e\u003cem\u003eData set and annotation\u003c/em\u003e\u003c/h2\u003e\n\u003cp\u003eWe used two MRI data sets. The first dataset included 243 MRI images of the lumbar spine from the RuDDS study [11], which contains MRI scans of symptomatic patients with degenerative spine diseases. The second dataset comprised 241 lumbar spine MRI images selected randomly from a publicly available archive [12]. To facilitate an objective assessment of model quality, the combined dataset was partitioned into training (363 MRI studies) and validation (121 MRI studies) subsets.\u003c/p\u003e\n\u003cp\u003eThe disc degeneration was assessed using the 5-grade Pfirrmann classification, where grade 1 indicates a normal disc and grade 5 indicates a severely degenerated disc [1]. The assessment and segmentation were conducted by experienced clinicians with over 10 years of experience (OL and AK).\u003c/p\u003e\n\u003ch2\u003e\u003cem\u003eImage Preprocessing\u003c/em\u003e\u003c/h2\u003e\n\u003cp\u003eSagittal T2-weighted MRI images were utilized and preprocessed to ensure uniform intensity scales across the dataset. 
Initially, pixel intensities are confined to a specified range determined by the 1st and 99th percentiles of the image data, effectively mitigating the impact of extreme intensity outliers. Subsequently, these intensities are linearly scaled to span the full grayscale spectrum, ranging from 0 to 255, thereby facilitating optimal utilization by conventional image processing algorithms commonly used in computer vision. This normalization ensures consistency in the data, which is essential for the subsequent analytical stages involving deep learning models.\u003c/p\u003e\n\u003ch2\u003e\u003cem\u003eModel description\u003c/em\u003e\u003c/h2\u003e\n\u003cp\u003eGiven the challenges in distinguishing degenerated discs using traditional segmentation approaches, we decided to use a detection-based approach. For this purpose, we have adopted the YOLOv8 model from the Ultralytics library (https://pypi.org/project/ultralytics/, version 8.1.5), that is effective in real-time object detection tasks (https://arxiv.org/abs/1506.02640, https://arxiv.org/abs/2004.10934). Employing YOLOv8 enables us to address the limitations of previous segmentation methods, facilitating more reliable identification of spinal conditions across our dataset. This approach aims to enhance the accuracy of our analyses and streamline the workflow for clinical assessments, where time and precision are of paramount importance. The maximum number of detected objects per image was restricted to five (max det: 5). This limitation ensured that the model focused on identifying a fixed set of critical structures.\u003c/p\u003e\n\u003ch2\u003e\u003cem\u003eData Augmentation\u003c/em\u003e\u003c/h2\u003e\n\u003cp\u003eDue to the relatively small sample size in our study, we implemented a series of random data augmentation techniques to prevent overfitting. 
These techniques included horizontal flipping applied with a probability of 0.5, scaling by a factor of 0.2, rotation within a range of -45\u0026deg; to +45\u0026deg;, and translation by up to 0.1 of the image dimensions. Examples of these augmentation techniques are illustrated in \u003cem\u003eFigure 1,\u003c/em\u003e \u003cem\u003eSupplementary files\u003c/em\u003e. We preserved the original hue, saturation, and value settings to maintain the inherent brightness and contrast of the MRI images, which are essential for accurate medical diagnosis. This preservation is particularly crucial for discs graded 1-3, as the differences in hue and tissue homogeneity are more indicative of disc health than mere shape changes. Such detailed attention to the original image characteristics ensures that our augmentations introduce variability without distorting critical diagnostic features.\u003c/p\u003e\n\u003ch2\u003e\u003cem\u003eOptimization Strategy\u003c/em\u003e\u003c/h2\u003e\n\u003cp\u003eTo prioritize localization accuracy before refining class discrimination, the box loss gain was increased to 10.0 (box: 10.0), while the classification loss gain was set to a moderate level of 0.5 (cls: 0.5). This phased optimization approach initially emphasized precise disc localization, establishing a robust foundation for subsequent enhancements in Pfirrmann grade classification. Additionally, the distribution focal loss parameter was adjusted to 1.5 (dfl: 1.5).\u003c/p\u003e\n\u003cp\u003eKey evaluation metrics, including Precision (P), Recall (R), mean Average Precision at IoU=0.5 (mAP50), and mean Average Precision across IoU thresholds from 0.5 to 0.95 (mAP50-95), were tracked to assess the model\u0026rsquo;s ability to balance detection accuracy and classification performance. 
While variability in these metrics was observed during the early phases of training, they stabilized as the model matured, indicating a successful learning process that effectively addressed the challenges posed by the heterogeneous MRI dataset.\u003c/p\u003e\n\u003cp\u003eIn sum, these carefully tailored modifications (args.yaml configuration https://doi.org/10.6084/m9.figshare.29322854.v1 ) were implemented to accommodate the anatomical constraints, clinical demands, and data-driven nuances intrinsic to lumbar spine MRI analysis, thereby enhancing the YOLOv8x model\u0026rsquo;s capacity to accurately detect and characterize intervertebral disc degeneration.\u003c/p\u003e"},{"header":"Results","content":"\u003ch2\u003e\u003cem\u003eDataset\u003c/em\u003e\u003c/h2\u003e\n\u003cp\u003eA total of 484 patients were included in this study, with their MRI data sourced from two distinct datasets. The dataset utilized in this study comprises a diverse collection of MRI images, characterized by significant variations in quality and pixel intensity values, vividly illustrated in Figure 1. Additionally, the dimensions of the images within the dataset exhibit considerable variability, with heights ranging from 256 to 1068 pixels and widths from 240 to 1008 pixels. 
We evaluated the slice averaging strategies for model performance (see \u003cem\u003eSupplementary notes, 1\u003c/em\u003e) and decided to use central \u0026plusmn; 2 Slice approach.\u003c/p\u003e\n\u003cp\u003eThe intervertebral discs at the L1-L2, L2-L3, L3-L4, L4-L5, and L5-S1 levels were precisely delineated using bounding boxes, as demonstrated in Figure 2.\u003c/p\u003e\n\u003ch2\u003e\u003cem\u003eAnalysis of Pfirrmann Grade Distribution Across Intervertebral Discs\u003c/em\u003e\u003c/h2\u003e\n\u003cp\u003eUpon analysis of the MRI scans,\u0026nbsp;an imbalance was observed in the distribution of Pfirrmann grades among the intervertebral discs (\u003cem\u003eFigure 2,\u003c/em\u003e \u003cem\u003eSupplementary files\u003c/em\u003e). Specifically, the dataset included 379 discs classified as grade I, 699 as grade II, 519 as grade III, 691 as grade IV, and 132 as grade V.\u003c/p\u003e\n\u003ch2\u003e\u003cstrong\u003e\u003cem\u003eDeep Learning Training\u003c/em\u003e\u003c/strong\u003e\u003c/h2\u003e\n\u003ch2\u003e\u003cem\u003eTraining Process Monitoring\u003c/em\u003e\u003c/h2\u003e\n\u003cp\u003eThroughout the training process, performance was continuously monitored using a variety of loss and metric curves, providing key insights into the model\u0026rsquo;s learning trajectory (Figure 3). The evolution of train/box_loss, train/cls_loss, and train/dfl_loss demonstrated the model\u0026rsquo;s progressive improvement in disc localization, classification, and bounding box refinement. These losses gradually decreased over the course of training, suggesting effective model adaptation. 
Corresponding validation metrics (val/box_loss, val/cls_loss, and val/dfl_loss) confirmed that these improvements were generalizing beyond the training data, indicating the model\u0026rsquo;s ability to perform reliably on unseen MRI scans.\u003c/p\u003e\n\u003ch2\u003e\u003cem\u003eModel Performance Evaluation\u003c/em\u003e\u003c/h2\u003e\n\u003cp\u003eTable 1 presents the performance metrics of the YOLOv8x model on the validation set, including Precision (P), Recall (R), mean Average Precision at IoU=0.5 (mAP50), and the averaged mAP across IoU thresholds from 0.5 to 0.95 (mAP50-95) for all Pfirrmann grades of disc degeneration. The overall results indicate a well-balanced detection and classification system, with P = 0.75 and R = 0.808, reflecting robust detection capabilities across the dataset.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eTable 1.\u003c/strong\u003e Model performance metrics across Pfirrmann grades\u003c/p\u003e\n\u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\" width=\"100%\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 18px;\"\u003e\n \u003cp\u003eClass\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 19px;\"\u003e\n \u003cp\u003eInstances\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 12px;\"\u003e\n \u003cp\u003eP\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 12px;\"\u003e\n \u003cp\u003eR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 15px;\"\u003e\n \u003cp\u003emAP50\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 21px;\"\u003e\n \u003cp\u003emAP50-95\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 18px;\"\u003e\n \u003cp\u003eall\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 19px;\"\u003e\n \u003cp\u003e605\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" 
style=\"width: 12px;\"\u003e\n \u003cp\u003e0.75\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 12px;\"\u003e\n \u003cp\u003e0.808\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 15px;\"\u003e\n \u003cp\u003e0.792\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 21px;\"\u003e\n \u003cp\u003e0.667\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 18px;\"\u003e\n \u003cp\u003eGrade I\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 19px;\"\u003e\n \u003cp\u003e102\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 12px;\"\u003e\n \u003cp\u003e0.763\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 12px;\"\u003e\n \u003cp\u003e0.794\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 15px;\"\u003e\n \u003cp\u003e0.822\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 21px;\"\u003e\n \u003cp\u003e0.735\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 18px;\"\u003e\n \u003cp\u003eGrade II\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 19px;\"\u003e\n \u003cp\u003e173\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 12px;\"\u003e\n \u003cp\u003e0.732\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 12px;\"\u003e\n \u003cp\u003e0.821\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 15px;\"\u003e\n \u003cp\u003e0.785\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 21px;\"\u003e\n \u003cp\u003e0.707\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 18px;\"\u003e\n \u003cp\u003eGrade III\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 19px;\"\u003e\n 
\u003cp\u003e120\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 12px;\"\u003e\n \u003cp\u003e0.603\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 12px;\"\u003e\n \u003cp\u003e0.792\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 15px;\"\u003e\n \u003cp\u003e0.718\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 21px;\"\u003e\n \u003cp\u003e0.634\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 18px;\"\u003e\n \u003cp\u003eGrade IV\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 19px;\"\u003e\n \u003cp\u003e170\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 12px;\"\u003e\n \u003cp\u003e0.831\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 12px;\"\u003e\n \u003cp\u003e0.839\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 15px;\"\u003e\n \u003cp\u003e0.872\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 21px;\"\u003e\n \u003cp\u003e0.735\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 18px;\"\u003e\n \u003cp\u003eGrade V\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 19px;\"\u003e\n \u003cp\u003e40\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 12px;\"\u003e\n \u003cp\u003e0.819\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 12px;\"\u003e\n \u003cp\u003e0.793\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 15px;\"\u003e\n \u003cp\u003e0.764\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 21px;\"\u003e\n \u003cp\u003e0.525\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003eDiscs graded as I and IV show particularly high performance \u0026ndash; 
mAP50=0.872 for Grade IV and mAP50=0.822 for Grade I.\u003c/p\u003e\n\u003cp\u003eConversely, Grades II and III present more challenges. Although Recall remains high for these grades, signifying that most discs are detected, Precision and mAP50-95 are lower in comparison. The mAP50-95 for Grade II (0.707) and Grade III (0.634) reflects this uncertainty, particularly as stricter IoU criteria reduce the model\u0026rsquo;s ability to consistently localize the discs.\u003c/p\u003e\n\u003cp\u003eGrade V discs, representing the most severe degeneration, showed an interesting trend. Despite relatively high Precision (0.819) and Recall (0.793), the mAP50-95 significantly drops to 0.525. This decline suggests that although the model is adept at detecting severely degenerated discs under moderate IoU thresholds, it struggles with precise localization and consistent performance as the IoU threshold increases.\u003c/p\u003e\n\u003cp\u003eThe Confusion Matrix (Figure 4) demonstrates that the model successfully classifies most disc grades, as reflected by the high diagonal values: Grade I (0.75), Grade II (0.75), Grade III (0.75), Grade IV (0.77), and Grade V (0.82). However, misclassifications primarily occur between adjacent grades, particularly between intermediate degeneration stages (Grades II and III) and severe degeneration (Grades IV and V). Additionally, there are some challenges in distinguishing disc structures from background noise.\u003c/p\u003e\n\u003cp\u003eGiven the substantial inter-observer agreement (0.69\u0026ndash;0.81) and excellent intra-observer reliability (ICC 0.84\u0026ndash;0.90), these misclassifications are likely attributable to the inherent complexity and subjectivity in grading disc degeneration, rather than errors in the model\u0026rsquo;s performance. 
This highlights the need for further refinement in both model training and grading protocols to improve classification accuracy in borderline cases.\u003c/p\u003e\n\u003ch2\u003e\u003cem\u003eWeb Service\u003c/em\u003e\u003c/h2\u003e\n\u003cp\u003eTo facilitate the practical application of our trained model, we developed SpineScan (spine-scan.science.nprog.ru), a web-based service designed for automated analysis of lumbar MRI scans. SpineScan provides clinicians with an intuitive platform for evaluating spinal health through advanced image processing and machine learning techniques (for a detailed description, see \u003cem\u003eSupplementary\u003c/em\u003e\u003cem\u003e\u0026nbsp;notes, 2\u003c/em\u003e).\u003c/p\u003e"},{"header":"Discussion","content":"\u003cp\u003eHere we developed an automated system and web service for classifying MRI grades of disc degeneration, based on two clinical datasets (RuDDS and an open-access dataset) totaling 484 lumbar scans. The predictive value of our model was 0.78\u0026ndash;0.82 depending on the lumbar level; these results are comparable with those of experts, and the model can be used in practice. This is especially relevant when it is necessary to analyze a large amount of MRI data.\u003c/p\u003e \u003cp\u003eThere are currently many studies on the use of CNNs for Pfirrmann grading. In many of these studies, CNN models achieve results comparable to those of experts. For example, the SpineNet project and its updated version, working with the Twins UK, Genodisc and other databases, showed the largest improvements in the Pfirrmann grading accuracy from 71.0\u0026ndash;73.0% [\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e]. 
This model provides radiological grading that incorporates context from multiple vertebrae and sequences, as a radiologist would.\u003c/p\u003e \u003cp\u003eSimilar results were reported by Nikpasand M et al [\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e]: for the IVD images, the Pfirrmann scores generated by their CNN agreed with the lead grader on 78% of the images, which was significantly better than the human graders. The Fleiss kappa statistic for the CNN was 0.68, which was, again, much higher than that between the human graders and indicative of substantial agreement.\u003c/p\u003e \u003cp\u003eBaur D et al [\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e], using a combined graph neural network and convolutional neural network, reported moderate inter-rater agreement for the Pfirrmann grading system, with Cohen\u0026rsquo;s kappa values in the range 0.455\u0026ndash;0.565. These authors also see promise in 3D models.\u003c/p\u003e \u003cp\u003eLiawrungrueang W et al [\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e, \u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e] reported impressive results, with their deep CNN model detecting and classifying lumbar IDD with over 95% accuracy: Grade I \u0026ndash; 0.98, Grade II \u0026ndash; 1.0, Grade III \u0026ndash; 0.99, Grade IV \u0026ndash; 0.99, Grade V \u0026ndash; 1.0. We used the same model (YOLOv8) and a comparably sized training set, but were unable to replicate these results. Our model achieved a predictive accuracy of 0.78\u0026ndash;0.82 depending on the lumbar level. We tested various training settings and sample sizes, but these changes had little effect on prediction quality. 
While our results are consistent with other studies, we remain cautious about the reported accuracy estimates in the paper by Liawrungrueang W et al.\u003c/p\u003e \u003cp\u003eThis work has several limitations. First, the heterogeneity of MRI images presents significant challenges in standardizing data for analysis. This heterogeneity often results from differences in MRI machines and imaging protocols across medical facilities. However, it also offers an opportunity to develop robust deep learning models capable of generalizing across a wide range of imaging conditions.\u003c/p\u003e \u003cp\u003eSecond, our model does not separate disc segmentation from grading. In some cases, discs with a Pfirrmann grade of 5 are so severely degenerated that their height is nearly absent, and adjacent vertebrae appear fused. Under these conditions, discs are not visually distinguishable, making accurate segmentation impractical. Furthermore, methods that rely on cerebrospinal fluid (CSF) pixel intensity, such as those described in [\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e], are not applicable to these images. This highlights the need for adaptive or alternative analytical methods to handle the significant variability in our dataset.\u003c/p\u003e \u003cp\u003eAnother limitation is that clinicians assess disc degeneration by integrating impressions from multiple MRI slices, while our model analyzes only a single slice. Disc signal intensity and degeneration can vary between slices. We believe there is strong potential in using 2.5D or 3D models to address the limitations of our current 2D approach.\u003c/p\u003e"},{"header":"Conclusion","content":"\u003cp\u003eThe YOLOv8x model demonstrated robust performance in detecting and classifying lumbar intervertebral discs across a diverse MRI dataset. 
With an overall Precision of 0.75 and Recall of 0.808, the model effectively identifies discs with mild to moderate degeneration (Grades I-IV), particularly excelling with Grade IV, which achieved the highest mAP50 of 0.872. However, challenges arose with the most severely degenerated discs (Grade V), where localization accuracy decreased at higher IoU thresholds (mAP50-95\u0026thinsp;=\u0026thinsp;0.525), likely due to low contrast and irregular boundaries. Misclassifications, primarily between adjacent grades, can be attributed to the inherent complexity of grading disc degeneration, a factor supported by substantial inter- and intra-observer agreement. These findings underscore the need for further model refinement and improved grading protocols to enhance performance, especially for borderline cases. The YOLOv8x model offers a solid foundation for clinical applications, with continued advancements necessary to improve accuracy, particularly in the classification of severe degeneration.\u003c/p\u003e"},{"header":"Declarations","content":"\u003ch2\u003eAuthor Contribution\u003c/h2\u003e\u003cp\u003eAM, OL and YT contributed to the study concept. Material preparation was performed by OL and AK; data collection and analysis were performed by AM, OL, AK and YT. Methodology, including models selection and discussions, was performed by AM, EE, DA, DS. AM, OL and YT drafted the manuscript. 
All authors critically reviewed and approved the final manuscript. AM and OL contributed equally.\u003c/p\u003e\u003ch2\u003eData Availability\u003c/h2\u003e\u003cp\u003eThe YOLOv8x model configuration (args.yaml), adapted to the anatomical constraints, clinical demands, and data-driven nuances intrinsic to lumbar spine MRI analysis, is available at https://doi.org/10.6084/m9.figshare.29322854.v1. SpineScan (spine-scan.science.nprog.ru) is a web-based service designed for automated analysis of lumbar MRI.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003ePfirrmann CW, Metzdorf A, Zanetti M, Hodler J, Boos N. Magnetic resonance classification of lumbar intervertebral disc degeneration. Spine (Phila Pa 1976). 2001 Sep 1;26(17):1873\u0026ndash;8. DOI: 10.1097/00007632-200109010-00011\u003c/li\u003e\n\u003cli\u003evan der Graaf JW, van Hooff ML, Buckens CFM, Rutten M, van Susante JLC, Kroeze RJ, et al. Lumbar spine segmentation in MR images: a dataset and a public benchmark. Sci data. 2024 Mar 2;11(1):264. DOI: 10.1038/s41597-024-03090-w\u003c/li\u003e\n\u003cli\u003eNatalia F, Meidia H, Afriliana N, Al-Kafri AS, Sudirman S, Simpson A, et al. Development of Ground Truth Data for Automatic Lumbar Spine MRI Image Segmentation. In: 2018 IEEE 20th International Conference on High Performance Computing and Communications; IEEE 16th International Conference on Smart City; IEEE 4th International Conference on Data Science and Systems (HPCC/SmartCity/DSS). IEEE; 2018. p. 1449\u0026ndash;54. DOI: 10.1109/HPCC/SmartCity/DSS.2018.00239\u003c/li\u003e\n\u003cli\u003eZheng H-D, Sun Y-L, Kong D-W, Yin M-C, Chen J, Lin Y-P, et al. Deep learning-based high-accuracy quantitation for lumbar intervertebral disc degeneration from MRI. Nat Commun. 2022 Feb 11;13(1):841. DOI: 10.1038/s41467-022-28387-5\u003c/li\u003e\n\u003cli\u003eJamaludin A, Kadir T, Zisserman A, McCall I, Williams FMK, Lang H, et al. 
ISSLS PRIZE in Clinical Science 2023: comparison of degenerative MRI features of the intervertebral disc between those with and without chronic low back pain. An exploratory study of two large female populations using automated annotation. Eur Spine J. 2023 May 30;32(5):1504\u0026ndash;16. DOI: 10.1007/s00586-023-07604-9\u003c/li\u003e\n\u003cli\u003eLiawrungrueang W, Kim P, Kotheeranurak V, Jitpakdee K, Sarasombath P. Automatic Detection, Classification, and Grading of Lumbar Intervertebral Disc Degeneration Using an Artificial Neural Network Model. Diagnostics (Basel, Switzerland). 2023 Feb 10;13(4). DOI: 10.3390/diagnostics13040663\u003c/li\u003e\n\u003cli\u003eNiemeyer F, Galbusera F, Tao Y, Kienle A, Beer M, Wilke H-J. A Deep Learning Model for the Accurate and Reliable Classification of Disc Degeneration Based on MRI Data. Invest Radiol. 2021 Feb 1;56(2):78\u0026ndash;85. DOI: 10.1097/RLI.0000000000000709\u003c/li\u003e\n\u003cli\u003eNatalia F, Sudirman S, Ruslim D, Al-Kafri A. Lumbar spine MRI annotation with intervertebral disc height and Pfirrmann grade predictions. PLoS One. 2024;19(5):e0302067. DOI: 10.1371/journal.pone.0302067\u003c/li\u003e\n\u003cli\u003eWang YXJ. Several concerns on grading lumbar disc degeneration on MR image with Pfirrmann criteria. J Orthop Transl. 2022 Jan;32:101\u0026ndash;2. DOI: 10.1016/j.jot.2021.12.003\u003c/li\u003e\n\u003cli\u003eUrrutia J, Besa P, Campos M, Cikutovic P, Cabezon M, Molina M, et al. The Pfirrmann classification of lumbar intervertebral disc degeneration: an independent inter- and intra-observer agreement assessment. Eur Spine J. 2016;25(9):2728\u0026ndash;33. \u003c/li\u003e\n\u003cli\u003eLeonova ON, Elgaeva EE, Golubeva TS, Peleganchuk A V., Krutko A V., Aulchenko YS, et al. A protocol for recruiting and analyzing the disease-oriented Russian disc degeneration study (RuDDS) biobank for functional omics studies of lumbar disc degeneration. Abdelbasset WK, editor. PLoS One. 2022 May 13;17(5):e0267384. 
DOI: 10.1371/journal.pone.0267384\u003c/li\u003e\n\u003cli\u003eSudirman S, Al Kafri A, Natalia F, Meidia H, Afriliana N, Al-Rashdan W, et al. Lumbar Spine MRI Dataset, https://data.mendeley.com/datasets/k57fr854j2/2. 2019. \u003c/li\u003e\n\u003cli\u003eWindsor R, Jamaludin A, Kadir T, Zisserman A. Automated detection, labelling and radiological grading of clinical spinal MRIs. Sci Rep. 2024 Jul 1;14(1):14993. DOI: 10.1038/s41598-024-64580-w\u003c/li\u003e\n\u003cli\u003eNikpasand M, Middendorf JM, Ella VA, Jones KE, Ladd B, Takahashi T, et al. Automated magnetic resonance imaging-based grading of the lumbar intervertebral disc and facet joints. JOR spine. 2024 Sep;7(3):e1353. DOI: 10.1002/jsp2.1353\u003c/li\u003e\n\u003cli\u003eBaur D, Bieck R, Berger J, Sch\u0026ouml;fer P, Stelzner T, Neumann J, et al. Automated Three-Dimensional Imaging and Pfirrmann Classification of Intervertebral Disc Using a Graphical Neural Network in Sagittal Magnetic Resonance Imaging of the Lumbar Spine. J Imaging Informatics Med. 2024 Sep 12; DOI: 10.1007/s10278-024-01251-2\u003c/li\u003e\n\u003cli\u003eLiawrungrueang W, Cholamjiak W, Sarasombath P, Jitpakdee K, Kotheeranurak V. Artificial Intelligence Classification for Detecting and Grading Lumbar Intervertebral Disc Degeneration. Spine Surg Relat Res. 2024 Nov 27;8(6):552\u0026ndash;9. DOI: 10.22603/ssrr.2024-0154\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":true,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
Version 1 of this preprint was posted on July 1st, 2025. The version of record was published in European Spine Journal on November 3rd, 2025: https://doi.org/10.1007/s00586-025-09537-x