KOM: A Multi-Agent Artificial Intelligence System for Precision Management of Knee Osteoarthritis (KOA)

Research Article, Research Square

Weizhi Liu, Xi Chen, Zekun Jiang, Liang Zhao, Kunyuan Jiang, Ruisi Tang, and 7 more

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-8147049/v1
This work is licensed under a CC BY 4.0 License.
Status: Posted (Version 1)

Abstract

Knee osteoarthritis (KOA) affects more than 600 million individuals globally and is associated with significant pain, functional impairment, and disability. While personalized multidisciplinary interventions have the potential to slow disease progression and enhance quality of life, they typically require substantial medical resources and expertise, making them difficult to implement in resource-limited settings.
To address this challenge, we developed KOM, a multi-agent system designed to automate KOA evaluation, risk prediction, and treatment prescription. This system assists clinicians in performing essential tasks across the KOA care pathway and supports the generation of tailored management plans based on individual patient profiles, disease status, risk factors, and contraindications. In benchmark experiments, KOM outperformed several general-purpose large language models in imaging analysis and prescription generation. A randomized three-arm simulation study further showed that collaboration between KOM and clinicians reduced total diagnostic and planning time by 38.5% and improved treatment quality compared with either approach used independently. These findings indicate that KOM could help facilitate automated KOA management and, when integrated into clinical workflows, has the potential to enhance care efficiency. The modular architecture of KOM may also offer valuable insights for developing AI-assisted management systems for other chronic conditions.

Keywords: Knee osteoarthritis (KOA), automated evaluation, risk prediction, treatment prescription, multi-agent system, personalized management, clinical decision support, imaging analysis, clinician–AI collaboration, care efficiency

Introduction

KOA affects approximately 600 million people worldwide and is characterized by progressive pain and functional deterioration that often ultimately necessitates total knee replacement [1-3]. Timely and appropriate intervention is essential for slowing structural deterioration, alleviating symptoms, and improving functional outcomes [4]. However, delivering multidisciplinary, personalized KOA management to large patient groups remains challenging, particularly in resource-limited healthcare systems [3].
Advances in AI have created new opportunities for KOA management through automated analysis and risk prediction [5-11]. AI-driven radiographic techniques, particularly convolutional neural networks, now facilitate the automated analysis of osteoarthritis images [5-7]. In parallel, prediction models integrating clinical and imaging data have been developed to estimate the risk of structural progression and functional decline in KOA patients [8-10]. However, these studies have not directly demonstrated their impact on the complete KOA management workflow.

Moreover, the principal challenges in complete KOA management stem from the intensive nature of patient interactions and treatment planning. First, completing the patient assessment process requires substantial clinical resources. In clinical workflows, multiple rounds of communication between clinicians and patients are typically needed to collect a full medical history. This process consumes significant time from clinicians who already face heavy workloads, and the repeated execution of these time-consuming procedures may lead to the omission of critical information, potentially resulting in serious medical events. Second, effective disease management requires personalized intervention plans based on the current disease status, quantified key progression risk factors, contraindications, and patient-specific needs. Without AI assistance, it is often difficult to quickly determine which risk factors matter most when several are present, and it is equally challenging to formulate an individualized KOA management plan from the patient's information within a short period.

To address these limitations, we have explored several solutions. Our early work developed structured prompting techniques to enhance LLM performance for osteoarthritis queries [12].
An osteoarthritis agent (DocOA) used LLMs with Retrieval-Augmented Generation (RAG) to access guideline knowledge and generate guideline-compliant treatment recommendations [13]. Subsequently, a multi-agent system was developed for complex clinical cases, in which agents simulate different specialists in multidisciplinary team discussions [14]. These incremental developments provided preliminary insights but yielded limited gains in overall KOA care quality and efficiency, and they did not alleviate the clinical burden of KOA management.

Nevertheless, these findings motivated the development of KOM, a multi-agent AI system designed to support multiple components of KOA management, including patient interaction and assessment, risk prediction, and individualized treatment planning. KOM is implemented using a modular and extensible architecture designed to support future adaptation to other chronic diseases requiring complex, longitudinal management. The system consists of three specialized agents integrating LLMs, a ResNet architecture, and other machine learning algorithms. Specifically, it comprises:

1. An Assessment Agent that interacts with patients, processes multimodal data, analyzes radiological images, and generates structured evaluation reports.
2. A Risk Agent that extracts patient-specific progression risk factors, predicts individualized KOA progression, and generates risk reports.
3. A Therapy Agents Group, a set of domain-specific agents that maintain evidence-based medical knowledge and collectively simulate multidisciplinary team (MDT) discussions to generate personalized management plans.

Following the clinical workflow, KOM completes the process from data collection to management planning.
Specifically, the Assessment Agent collects and evaluates a patient’s demographic information, present and past medical history, personal history, imaging data, psychological state, nutritional status, physical activity level, socioeconomic condition, and treatment preferences, and automatically generates a structured electronic medical report. Subsequently, the Risk Agent extracts relevant clinical indicators to predict short- and mid-term structural and symptomatic progression of KOA, and identifies quantified risk factors that enable data-driven risk stratification and disease monitoring. Based on the medical report and risk report from the previous two agents, the Therapy Agents Group then simulates an MDT decision-making process to generate intervention plans.

Each agent was evaluated against general-purpose language models and algorithmic baselines to assess performance. Furthermore, we conducted a three-arm comparative study to assess the system's clinical utility in controlled simulated environments. The comparative study included three groups: an independent KOM deployment (KOM alone), a KOM-clinician collaboration (KOM plus clinicians), and conventional clinician practice (clinicians alone). This study describes the development, validation, and clinical evaluation of KOM and examines its potential role in enhancing the quality and efficiency of KOA management.

Results

Development of the Knee Osteoarthritis Management (KOM) System

KOM is an interactive multi-agent system that supports clinicians in three areas of KOA management: patient information collection and assessment, disease trajectory prediction and etiology analysis, and individualized treatment planning (Figure 1). The KOM architecture features three specialized agents:

1. Assessment Agent: Features multi-round patient interaction capabilities, image analysis functionality, and summary report generation.
It collects and evaluates a patient’s demographic information, present and past medical history, personal history, imaging data, psychological state, nutritional status, physical activity level, socioeconomic condition, and treatment preferences. It analyzes knee radiographs to classify the severity of KOA and identify specific features, including osteophyte formation and alterations in joint space. Finally, the Assessment Agent generates a structured evaluation report (Figure 2a).

2. Risk Agent: Predicts knee osteoarthritis progression and identifies individual risk factors that contribute to disease advancement. It extracts parameters from the evaluation report to forecast KOOS (Knee Injury and Osteoarthritis Outcome Score) subscale scores and KL (Kellgren–Lawrence) radiographic grades over the following four years. Finally, the Risk Agent generates a structured risk report (Figure 3a).

3. Therapy Agents Group: Composed of specialist agents from different medical disciplines, each with a specialized knowledge base. These agents simulate clinical physicians in multidisciplinary team discussions and develop multidisciplinary treatment plans based on the evaluation report and risk report (Figure 4a).

The workflow of KOM is illustrated in Figure 1b. The system offers two interaction modalities: sequential progression through the complete pathway, or independent access to individual agents with manual data input. This flexibility accommodates diverse clinical workflows across healthcare settings.

Assessment Agent

The Assessment Agent collects patient information through structured dialogues and performs radiographic image analysis to generate evaluation reports. Its workflow is organized into three stages: information collection, treatment willingness confirmation, and radiographic analysis (Figure 2a). The agent integrates two key components: a patient–agent interaction module and an X-ray analysis module.
The radiographic analysis output is incorporated into the patient history, producing a unified and structured clinical record.

For the patient–agent interaction module, we used Qwen-Max as the foundation model. We developed a structured prompt through systematic prompt engineering to support multi-turn conversations between the agent and patients, which enabled information collection and the generation of summary reports. The model was configured with a temperature of 0.8 and accessed via its application programming interface (API). Three physicians evaluated 100 simulated patient–agent interactions, in which physicians acted as patients interacting with the Assessment Agent. They rated the interaction process and the final summary report on four metrics, each on a scale of 1 to 5, where 1 indicated no compliance and 5 indicated complete compliance. The results were as follows (Figure 2g): field completeness (4.03 ± 0.21), logical consistency (3.99 ± 0.16), medical accuracy (4.37 ± 0.47), and readability (4.00 ± 0.23).

The X-ray analysis module was developed using the Osteoarthritis Initiative (OAI) dataset, which contains longitudinal bilateral knee radiographs from 4,796 cases at baseline, 2-year follow-up, and 4-year follow-up, for a total of 12,719 bilateral anterior-posterior knee X-rays. The module performs KOA severity grading, detects bilateral osteophytes, and assesses bilateral joint space. The workflow begins with knee center localization to determine the region of interest (ROI), followed by identification of the left and right knee joints, and then osteophyte detection and joint space analysis for each joint.
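The ROI step of this workflow can be illustrated with a short sketch. It is not the authors' implementation: the function names `crop_roi` and `split_knees`, the fixed square ROI size, and the rule of ordering the two knees by image column are all assumptions made for illustration, and the keys describe position in the image rather than anatomical side.

```python
from typing import Dict, Sequence, Tuple

import numpy as np


def crop_roi(image: np.ndarray, center: Tuple[int, int], size: int) -> np.ndarray:
    """Crop a size x size window around (row, col), shifted to stay inside the image."""
    height, width = image.shape[:2]
    if size > height or size > width:
        raise ValueError("ROI size exceeds image dimensions")
    half = size // 2
    # Clamp the top-left corner so the window never leaves the image.
    top = min(max(center[0] - half, 0), height - size)
    left = min(max(center[1] - half, 0), width - size)
    return image[top:top + size, left:left + size]


def split_knees(
    image: np.ndarray, centers: Sequence[Tuple[int, int]], size: int
) -> Dict[str, np.ndarray]:
    """Crop one ROI per knee, ordering the two centers by image column (assumption)."""
    if len(centers) != 2:
        raise ValueError("expected exactly two knee centers")
    first, second = sorted(centers, key=lambda c: c[1])
    return {
        "image_left": crop_roi(image, first, size),
        "image_right": crop_roi(image, second, size),
    }


# Synthetic 100 x 200 "radiograph" with two hypothetical predicted centers.
image = np.arange(100 * 200).reshape(100, 200)
rois = split_knees(image, [(50, 150), (50, 40)], size=64)
print(rois["image_left"].shape, rois["image_right"].shape)
```

Mapping the image-left and image-right crops to the patient's left and right knees depends on the acquisition and display convention, which a real pipeline would need to read from the image metadata.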
To implement these functionalities, we developed a series of algorithms trained on a randomly selected subset of the OAI dataset, which human experts had calibrated before model training. The knee center localization algorithm used a U-Net architecture trained on 200 labeled images. Over 40 training epochs, the validation metrics improved: the loss decreased from 0.7576 to 0.3804, the IoU increased from 0.0016 to 0.5790, and the center point error fell from 146.02 to 4.83 pixels (Figure 2d). For severity classification, a ResNet model was trained on balanced class distributions. Using an 8:1:1 data partition with 5-fold cross-validation, the model achieved an overall accuracy of 80.8% (Figure 2e). The None/Doubt class showed the highest accuracy (90.7%), and confusion occurred primarily between the Moderate and Mild grades.

We also developed ten specialized models for extracting radiographic features from distinct anatomical regions, including the medial and lateral joint spaces and the medial and lateral aspects of both the femoral and tibial surfaces. The lateral joint space narrowing classification model achieved an accuracy of 89.8%, surpassing the medial joint space narrowing model (77.1%). Classification accuracy for subchondral sclerosis varied by region: 56.4% for the medial tibial plateau, 83.1% for the lateral tibial plateau, 60.6% for the medial femoral condyle, and 85.3% for the lateral femoral condyle, so accuracy was higher in the lateral compartment than in the medial compartment. Osteophyte detection models performed more consistently across anatomical quadrants, with accuracy ranging from 78.5% to 95.5%. Gradient-weighted Class Activation Mapping (Grad-CAM) analysis enhanced model interpretability for the classification tasks (Figure 2f). The model design and training processes are detailed in Supplementary Graphs S2.
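The two localization metrics reported above, IoU and center point error, can be computed from binary masks as in the sketch below. The source does not state how the knee center is derived from the U-Net output, so taking the centroid of the predicted mask is an assumption, as are the function names.

```python
from typing import Tuple

import numpy as np


def iou(pred: np.ndarray, target: np.ndarray) -> float:
    """Intersection over union of two binary masks."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return float(np.logical_and(pred, target).sum() / union)


def mask_centroid(mask: np.ndarray) -> Tuple[float, float]:
    """Centroid (row, col) of the foreground pixels; assumed definition of the center."""
    rows, cols = np.nonzero(mask)
    if rows.size == 0:
        raise ValueError("mask has no foreground pixels")
    return float(rows.mean()), float(cols.mean())


def center_error(pred: np.ndarray, target: np.ndarray) -> float:
    """Euclidean distance in pixels between predicted and reference centroids."""
    pred_row, pred_col = mask_centroid(pred)
    ref_row, ref_col = mask_centroid(target)
    return float(np.hypot(pred_row - ref_row, pred_col - ref_col))


# Toy example: the prediction is the reference square shifted down by two rows.
target = np.zeros((10, 10), dtype=np.uint8)
target[2:6, 2:6] = 1
pred = np.zeros((10, 10), dtype=np.uint8)
pred[4:8, 2:6] = 1
print(round(iou(pred, target), 4), center_error(pred, target))  # 0.3333 2.0
```

In this toy case the masks share 8 of 24 covered pixels, which gives an IoU of one third, while the centroids differ by exactly two pixels. The example shows that a small center error can coexist with a modest IoU, which is consistent with the pattern of the validation figures above.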
Comparative Evaluation of Radiographic Performance

The Assessment Agent was benchmarked against five leading vision-language models (Google Gemini 2.0 Pro, GPT-4o, Claude 3.7, QwenMax VL, and LLaMA 3.2 90B Vision Instruct) using 500 bilateral knee radiographs from the OAI dataset that were excluded from the training set. The evaluation covered KOA severity grading and the detection of OA presence. For KOA severity grading (Figure 2b, c, h), the Assessment Agent achieved 77.16% accuracy, outperforming Gemini 2.0 Pro (34.50%). In OA presence detection, the Assessment Agent attained an accuracy of 82.22% compared to Gemini's 76.66%. The Assessment Agent maintained its performance across anatomical locations, with 75.38% and 78.95% accuracy for left and right KOA severity grading, respectively, and 84.09% and 80.35% accuracy for left and right knee OA detection. All competing models showed lower accuracy (below 65%) on these tasks. Moreover, the Assessment Agent demonstrated diagnostic capability across different levels of disease severity (64.68%-82.16% accuracy across classifications). In contrast, competing models often achieved high accuracy for None/Doubt cases but performed poorly (<20%) on Mild or higher grades (Figure 2c), which may constrain their applicability in similar clinical tasks. Detailed metrics for this task are available in Supplementary Graphs S2.

Risk Agent

The Risk Agent predicts the functional and radiographic outcomes of KOA at 1 year and 4 years of follow-up. It also identifies patient-specific risk factors. At the 1-year follow-up (V01), all six machine learning models demonstrated predictive ability across KOOS subscores (Figure 3d). The strongest results were obtained for right knee symptoms and quality of life, with correlation coefficients approaching 0.74 and explained variance (R²) values exceeding 0.50 in the best-performing models.
ElasticNet provided relatively stable performance across multiple KOOS subscores, achieving R² values up to 0.58 with relatively low mean absolute errors. Random Forest and Gradient Boosting also performed well, particularly in predicting pain and function-related scores. In contrast, SVR and LightGBM showed less consistent results. At the 4-year follow-up (V06), predictive accuracy declined across all KOOS subscores. The best-performing models reached correlation values of 0.65–0.69 with R² values between 0.30 and 0.46, notably lower than at V01. Random Forest, Gradient Boosting, and ElasticNet maintained relatively stable performance across tasks, while SVR and LightGBM again produced weaker predictions. Despite the decline, predictions of the quality-of-life and pain subscores retained moderate correlation values, whereas sports and recreation scores showed the lowest stability.

For the KL grade prediction tasks (Figure 3b, c), eight algorithms were evaluated. At V01, ensemble-based models achieved the highest predictive performance. For the left knee, AdaBoost reached the best overall results (accuracy = 0.910, F1 = 0.908, AUC = 0.965). Random Forest (accuracy = 0.902, AUC = 0.971) and XGBoost (accuracy = 0.897, AUC = 0.965) also performed strongly. For the right knee, LightGBM (accuracy = 0.910, AUC = 0.962) and XGBoost (accuracy = 0.908, AUC = 0.967) achieved the highest classification accuracy, while Random Forest remained highly competitive (accuracy = 0.900, AUC = 0.972). Across both knees, all ensemble approaches produced AUC values above 0.96, indicating robust discriminative capacity. At V06, predictive performance declined compared with V01. For the left knee, Gradient Boosting, XGBoost, and LightGBM produced the most consistent results, with accuracies ranging from 0.75 to 0.76 and AUC values close to 0.92.
For the right knee, XGBoost yielded the best balance of metrics (accuracy = 0.765, AUC = 0.922), while Random Forest also performed well (accuracy = 0.762, AUC = 0.932). Other algorithms, including SVM, neural networks, and KNN, exhibited weaker performance at both time points.

Following functional outcome prediction, individualized risk factors were identified using SHAP analysis, which enhanced interpretability by quantifying the contribution of each feature to the prediction. In a representative case, the predicted KOOS symptom score (72.18) fell below the cohort mean (75.50). The primary negative contributors were osteophytes in the baseline left knee X-ray (-1.08) and a diminished KOOS pain score (-1.02), suggesting that osteophytes and pain are the individualized progression risks for this patient. Although this patient also showed below-average peak knee extension torque, this parameter contributed positively (+1.19), suggesting that residual muscle strength may protect against symptomatic progression. A detailed description of the model design, training procedures, and experimental results is provided in Supplementary Graphs S3 and Supplementary Tables T1 and T2.

Therapy Agents Group

A multi-agent cluster was developed to conduct multi-agent conversations that mimic the multidisciplinary team (MDT) approach adopted in clinical practice and to formulate patient-specific, multidisciplinary management plans. The cluster receives the evaluation report and risk report generated in the previous stages, engages in discussion, and generates the final personalized management plan. The cluster comprises multiple agents functioning as specialists, including an Exercise Rehabilitation agent, an Orthopedic agent, a Psycho-Nutrition agent, and a Clinical Decision agent. The Clinical Decision Agent functions as the coordinator and integrator within the cluster.
It synthesizes the recommendations provided by the other specialist agents, resolves conflicts where their suggestions diverge, and applies evidence-based clinical guidelines to ensure the final plan is coherent, feasible, and clinically appropriate. In addition, the Clinical Decision Agent prioritizes interventions based on patient risk factors, comorbidities, and treatment preferences, thereby generating a plan that aligns with real-world clinical decision-making standards.

Each agent was furnished with domain-specific medical data. Qwen-Max was used as the base model for all agents, with the temperature set to 0.8, and was accessed via API. Prompt engineering was used to instruct each agent to act as a clinical specialist, engage in active discussion with the other agents, and develop a personalized management plan for the given KOA patient. Each agent is equipped with a retrieval-augmented generation tool that draws on medical data from its own knowledge base to generate and revise the management plan.

We curated six specialized medical databases, each derived from authoritative clinical guidelines and peer-reviewed articles indexed in the Medline database (Figure 4a). All databases were constructed through a structured pipeline of literature search, eligibility screening, data extraction, and knowledge structuring. Each agent within the multi-agent cluster is paired with its corresponding evidence database to generate patient-specific recommendations:

KOM Agent 1 – Nutrition and Psychology: linked to the psychological database (210 entries) and the nutrition database (349 entries), enabling the generation of individualized psychological counseling and nutritional prescriptions.
KOM Agent 2 – Medication and Surgery: connected to the surgical evidence database (1,549 entries) to determine surgical indications and medication strategies based on established osteoarthritis guidelines.
KOM Agent 3 – Exercise Prescription: supported by the rehabilitation database (934 entries) and the exercise database (975 entries), which provide evidence-based exercise regimens tailored to the patient’s KOA severity and physical capacity.
KOM Agent 4 – Clinical Decision and Summary: serves as the coordinator and synthesizer, using a shared guideline database to integrate the recommendations of all other agents, resolve conflicts, and formulate a coherent, evidence-based management plan.

The guideline database is also accessible to all agents, so that each recommendation can be aligned with up-to-date clinical practice standards.

We evaluated the quality of the generated treatment plans using retrospective clinical data from 250 patients with knee osteoarthritis treated at West China Hospital. For each case, we conducted expert evaluations of the generated prescriptions and calculated their similarity scores against gold-standard prescriptions. We also benchmarked KOM against leading general-purpose AI models, including GPT-4o, GPT-4o-mini, DeepSeekR1, Claude 3.7 Sonnet, QwenMax, Qwen2.5-14B, and Gemini 2.0 Pro (Figure 4a). Additionally, we benchmarked a single-agent RAG implementation against our agent group with domain-specific databases and collaborative decision-making.

For lexical and semantic similarity analysis, we employed three established metrics: BLEU, ROUGE-L, and a BERT-based score. KOM achieved the highest BLEU score (0.0191), outperforming GPT-4o (0.0064) and Qwen2.5-14B (0.0083). Similarly, KOM led in ROUGE-L with 0.2905, higher than QwenMax (0.1244) and DeepSeekR1 (0.1031). The BERT-based scores showed narrower differences, with KOM (0.8069) performing comparably to GPT-4o-mini (0.8156) and GPT-4o (0.8122), indicating similar semantic consistency across models (Figure 4g).

The expert evaluation involved three specialists in orthopedics and sports medicine who independently rated prescriptions across seven dimensions on a 1–5 scale.
KOM achieved the highest composite score (29.63 ± 1.33), outperforming the next-best model, DeepSeekR1 (26.03), by 3.60 points (Figure 4b, c). KOM received higher mean ratings across all evaluated dimensions, including completeness (4.408), personalization (4.380), and safety (4.366). Although nutritional guidance was the lowest-scoring dimension across all models, KOM maintained its leading position with a score of 3.903 (Figure 4d, e). Z-score normalization (Figure 4f) highlighted KOM’s relative strengths in completeness (+0.67), personalization (+0.73), and safety (+0.66), with moderate performance in evidence-based practice (+0.48) and feasibility (+0.06). Other models showed specific relative advantages in individual domains: G4M in evidence-based practice (+0.99), GPT-4o-mini in completeness (+0.76) and safety (+0.87), QwenMax in safety (+1.13), DeepSeekR1 in personalization (+1.10), and Gemini 2.0 Pro in completeness (+1.10). Across all models, exercise design and nutritional advice consistently emerged as weaker areas. The model design, training processes, and experimental results are presented in detail in Figure 4 and Supplementary Graphs S4.

Clinical Evaluation of the KOM System

To evaluate the effectiveness of the KOM system in clinical practice, we conducted an end-to-end simulation using 50 KOA cases from West China Hospital (Figure 5a). The clinical evaluation included three operating conditions (Figure 5b): (i) physicians performing the assessment and treatment planning process alone; (ii) the KOM system performing the process autonomously; and (iii) physicians collaborating with the KOM system, where the system performs the X-ray evaluation and treatment planning and the physicians can supervise and modify the reports it generates at each stage. The quality of the X-ray assessment, the quality of the management plan, and the total processing time were evaluated.
The approval rate of radiographic grading was defined as the proportion of knee images correctly classified according to severity grading within the final cohort of 50 cases. Expert evaluation of the KOA grading results showed that image classifications generated independently by ten physicians achieved approval rates ranging from 42.0% to 66.0%, with a mean of 56.0%. When the physicians were assisted by the KOM system, approval rates increased to between 90.0% and 96.0%, with a mean of 93.0%. Under the fully automated KOM-only condition, approval rates ranged from 72.0% to 82.0%, with a mean of 77.4% (Figure 5e).

Expert evaluation was conducted on the prescriptions for the same 50 cases, with each prescription assessed across seven clinical criteria: clinical evidence, completeness, exercise prescription standardization, nutritional prescription standardization, safety, personalization, and accessibility. The aggregate average score was 3.63 for MS (physicians), 4.56 for KOM, and 4.43 for the collaboration group (MS+KOM) (Figure 5d). For content completeness, the MS+KOM group scored 4.73, compared with 4.63 for KOM and 4.01 for MS. For personalization, the collaboration group and the KOM group performed comparably, with scores exceeding 4.80. Similarly, for safety, both groups maintained scores above 4.80. For exercise prescription quality, the scores were 3.13 (MS), 4.11 (KOM), and 4.10 (MS+KOM), highlighting the benefit of AI augmentation. For nutritional guidance, the scores were 3.30 (MS), 3.93 (KOM), and 3.97 (MS+KOM). For accessibility and feasibility, the collaboration group scored 4.10, lower than KOM’s 4.59. For adherence to evidence-based practice, the collaboration group scored 4.41, while KOM scored 4.63.

Quantitative prescription similarity metrics showed BLEU scores of 0.0065 (MS), 0.0455 (KOM), and 0.0500 (collaboration group).
ROUGE-L scores were 0.1126 (MS), 0.2590 (KOM), and 0.2340 (collaboration group). BERT-based scores were 0.8021 (MS), 0.7996 (KOM), and 0.8116 (collaboration group), indicating the closest semantic alignment with reference standards in the human-AI collaborative condition (Figure 5f).

For the complete clinical workflow, the MS group required 586 ± 56 seconds per case. In contrast, the collaboration group completed identical tasks in 361 ± 42 seconds (Figure 5c), demonstrating a 38.5% reduction in processing time.

Discussion

Main Finding

This study introduces the KOM system, the first evaluated multi-agent system for KOA. The Assessment Agent acquires patient-related information and treatment goals through interactive dialogue and analyzes knee radiographs to classify KOA severity in accordance with established clinical criteria. The Risk Agent forecasts functional and radiographic outcomes at one- and four-year follow-up while identifying patient-specific risk factors to inform intervention planning. The Therapy Agents Group, designed to simulate multidisciplinary team discussions, generates evidence-based, personalized management plans by integrating domain-specific knowledge from rehabilitation, exercise, surgical, psychological, and nutritional specialties.

Evaluation results indicate that the Assessment Agent outperforms general-purpose AI models across assessment parameters. The Risk Agent predicted functional and radiographic outcomes at 1-year and 4-year follow-ups, with higher accuracy at 1 year than at 4 years. The Therapy Agents Group developed evidence-based, individualized management plans that were of higher quality than those generated by current language models across multiple evaluation metrics. In a clinical evaluation study comprising three groups (physicians alone, KOM alone, and doctoral trainee-KOM collaboration), the collaboration group demonstrated significant advantages.
Expert reviewers approved 93.0% of radiographic grading results from the doctoral trainee-KOM collaboration compared to 53.8% for physicians alone and 77.4% for KOM alone. Quality assessment across seven clinical criteria revealed strong performance in the collaborative condition, particularly in content completeness, personalization, and safety considerations. Notably, the doctoral trainee-KOM collaboration resulted in a 38.5% reduction in processing time compared to physicians working independently. These findings suggest that KOM may serve as a useful clinical decision support tool that can enhance both the quality and efficiency of KOA management while providing a methodological framework for developing similar systems for other chronic degenerative disorders.

Assessment Agent via LLM-DL Hybrid Architecture

The Assessment Agent of the KOM system employs a hybrid architecture that integrates deep learning-based image interpretation with a prompt-optimized LLM. The system uses a ResNet-based convolutional neural network trained on more than 12,000 standardized bilateral anteroposterior knee radiographs from the OAI database. This enables classification of KOA severity, alongside assessment of medial and lateral joint space narrowing, osteophyte presence, and subchondral bone sclerosis. While general-purpose vision-language models (VLMs) such as GPT-4V demonstrate zero-shot generalization capabilities in open-domain tasks, our findings reveal significant limitations when they are applied to specialized clinical imaging interpretation [15-18]. In our evaluation, these models demonstrated inadequate accuracy and consistency in grading KOA severity under zero-shot conditions. Traditional machine learning techniques, including gradient-boosted trees and convolutional neural networks trained on curated OA datasets, have repeatedly shown performance comparable to expert readers in radiographic grading [19-21].
For patient information collection, LLMs have demonstrated effectiveness in generating structured clinical narratives and supporting interactive clinical workflows 22 . Recognizing these strengths, we developed a hybrid LLM-DL framework that produces a flexible, interpretable, and clinically applicable workflow for case generation. This approach aligns with recent developments in hybrid architectures, such as DeepDR-LLM, which combined image-based transformers with LLMs fine-tuned on 370,000 real-world diabetes management records to generate personalized recommendations 23 . These results suggest that LLM-DL architectures provide an adaptable and controllable solution for structured documentation in medical domains with defined task parameters and standardized inputs.

Risk Agent with Etiological Analysis

The Risk Agent in KOM forecasts both symptomatic and structural trajectories of KOA using supervised machine learning algorithms. This component models the temporal changes in KOOS subdomains and KL grades at 1-year and 4-year intervals. The model was trained on 31 multimodal features, including demographics, baseline KOOS scores, and radiographic measurements such as joint space width, osteophyte presence, and sclerosis grades. By generating personalized risk projections over clinically meaningful timeframes, this approach may help address an existing gap in KOA management, which is especially valuable given the heterogeneous progression of the disease. The model captures the known discordance between subjective symptoms and objective structural changes in KOA 19 – 21 . Jointly modeling function and structure enables improved risk stratification, facilitating the development of adaptive therapeutic strategies. The system demonstrates generalization across different time horizons, suggesting a degree of temporal stability in the evaluated tasks.
While recent research explores molecular biomarkers, genomics, and metabolomics for predicting KOA progression 10 , these approaches face significant implementation barriers in clinical settings. Such methods often require invasive procedures, costly assays, or surgical specimens 24 . Furthermore, their availability in routine clinical environments remains limited, and center-specific biases and demographic variations frequently compromise their generalizability 22 , 25 . Therapy Agents Group via Multi-Agent Collaboration The third core component of the KOM system utilizes a multi-agent architecture that generates personalized intervention plans tailored to individual patient clinical presentations and etiologies. This framework integrates specialized agents focused on distinct therapeutic domains, including exercise prescription, pharmacological and surgical interventions, nutritional planning, and psychological support. These domain-specific agents operate independently while being coordinated by a central clinical agent that synthesizes their outputs into a cohesive, individualized treatment strategy. This multi-agent approach represents a departure from monolithic language models toward a modular design that enhances transparency, domain expertise, and decision quality. Comparative evaluations against leading language models demonstrated that KOM delivered superior performance across key clinical metrics, including recommendation accuracy, personalization, and actionability. The performance differential was most significant in complex cases involving multiple comorbidities or atypical presentations, where general-purpose models typically produced either overly generic or inconsistent recommendations. Our investigation of RAG with prompt engineering for incorporating external medical literature yielded limited performance improvements. 
In several evaluation categories, the RAG-enhanced model underperformed compared to its base language model, particularly in terms of clinical adaptability and semantic coherence. This limitation likely results from contextual inconsistencies caused by semantic drift in retrieved documents, a recognized challenge in current RAG implementations for specialized clinical reasoning tasks 26 – 28 . In contrast, the KOM multi-agent framework demonstrated effective capabilities in problem decomposition, domain-specific reasoning, and iterative refinement. These strengths align with findings from other domains: decentralized multi-agent reinforcement learning has shown improved task execution and generalization in complex network systems 29 . At the same time, Meta AI's Cicero system achieved expert-level performance in strategic gameplay through coordinated decision-making among specialized agents 30 . Enhancing Clinical Process through AI-Clinician Collaboration Beyond its core capabilities, we evaluated the KOM system in collaborative scenarios with physicians to simulate real-world clinical workflows. Physicians used KOM for image interpretation and treatment planning assistance, with comparative analyses revealing that these AI-physician partnerships performed better within our evaluation settings than either component used independently across diagnostic accuracy, workflow efficiency, and treatment quality metrics. This collaborative approach exemplifies the human-AI symbiosis paradigm in healthcare, where AI systems function as intelligent assistants that enhance decision quality while reducing cognitive load. Recent research strongly supports this collaborative model. Goh et al. 31 demonstrated that emergency physicians using an LLM achieved 15% higher diagnostic accuracy and greater decision consistency compared to unaided controls. 
In a multicenter study, clinicians using GPT-4 for patient case evaluation reported 20% faster processing times and a 12% improvement in review consistency. Similarly, Ayers et al. 32 demonstrated that a fine-tuned chatbot matched primary care physicians in terms of completeness and empathy when addressing common health questions. These findings underscore the practical value of AI-human collaboration, particularly in educational settings where structured guidance is crucial. By enabling physicians to explore complex reasoning with AI support, KOM functioned in this study as a decision-support tool that helps develop diagnostic reasoning and decision-making skills. We anticipate that AI-assisted clinical education will become fundamental to modern medical training. As language models improve in contextual understanding and interfaces become more intuitive, systems like KOM will evolve into intelligent learning partners, potentially transforming how clinical reasoning is taught and assessed 33 – 35 .

Limitation

Despite these strengths, several limitations exist. First, while the Assessment Agent performed well in retrospective testing, clinical implementation requires prospective validation with doctoral trainee oversight. Second, KOM focuses solely on knee osteoarthritis and may generate inappropriate outputs for other knee conditions; future versions should include a preliminary classifier to distinguish KOA from non-KOA pathologies. Third, the progression prediction model could benefit from additional factors beyond the current 31 features, which themselves present data collection challenges; wearable integration and feature optimization could address these issues. Finally, the intervention component lacks advanced mechanisms for resolving conflicting recommendations across different therapeutic domains, currently relying on basic prioritization strategies rather than sophisticated conflict resolution.
Implication for Future Research In future research, we aim to enhance our agents while maintaining their clinical utility and effectiveness. Although our current integration of deep learning with large language models effectively addresses clinical needs, we seek to develop more compact models with higher integration capacity that could potentially function on mobile devices for real-time assessment. Regarding diagnostic evaluation, we plan to implement embedded technologies for disease monitoring, which could reduce dependence on traditional radiographic examinations that require specialized equipment, thereby streamlining the assessment process. For the prediction component, we plan to conduct further parameter analysis to identify more readily obtainable and clinically relevant variables. Regarding treatment recommendations, we intend to validate the efficacy of AI-generated management plans compared to conventional approaches through controlled clinical studies. Building upon our KOA management framework, this methodological approach can be adapted to other chronic conditions that require comprehensive management strategies. Conclusion This study presents KOM, a multi-agent artificial intelligence system for precision management of knee osteoarthritis. It successfully performs patient information collection, disease assessment, progression prediction, risk factor identification, and generates management plans. In our evaluation, it performed better than the general-purpose AI models tested on the specified setting tasks. The three-arm comparative study demonstrated that doctoral trainee-KOM collaboration achieved superior performance while reducing processing time. This study establishes a methodological foundation for developing scalable, evidence-based management strategies that may be adapted to address other chronic disorders. 
Methods

Ethical Considerations and Study Design

For this study, we retrospectively recruited 300 patients with KOA from West China Hospital, Sichuan University. Among them, 250 cases with complete baseline and follow-up data were included for retrospective validation of the treatment planning module. In addition, a subset of 50 de-identified cases was selected to construct a simulated cohort for prospective evaluation under controlled experimental conditions. The Ethics Committee of West China Hospital approved the study protocol (approval number: 23-2277). Radiographic data for deep learning model development were obtained exclusively from the publicly accessible OAI database 36 , an NIH-funded longitudinal observational study that provides standardized bilateral knee radiographs with corresponding clinical and demographic data; the OAI dataset is publicly licensed for research purposes and did not require additional approval. All chatbot queries were conducted between January and April 2025 in Chengdu, Sichuan, China. For performance evaluation, three senior orthopedic and sports medicine experts independently assessed model outputs under a double-masked design in which evaluators were unaware of model identities; no patients or public participants were involved. In addition to accuracy and consistency, we screened model outputs for potentially harmful, misleading, or biased responses (e.g., unsafe medication recommendations, diagnostic errors, or demographic bias) and found none. Apart from the OAI data, which remains under its original license, all other datasets and code used in this study are owned by the research team; however, our example code and study prompts have been made publicly available to promote transparency and reproducibility. In the development of KOM, we initially attempted to use a single agent to perform all tasks in the KOA care pathway.
However, this approach performed poorly, which led to the adoption of a multi-agent strategy in which each agent was specifically trained for a different task. Through iterative rounds of testing and development, the structure of the KOM system was finalized, which included the Assessment Agent, Risk Agent, and Therapy Agents Group.

Assessment Agent Functionality and Workflow

Upon accessing the KOM system, patients first indicate whether bilateral knee anteroposterior radiographs are available. When provided, the system automatically performs deep learning-based analysis of each knee, generating assessments of osteoarthritis severity, joint space narrowing, subchondral bone sclerosis, and osteophyte presence. Subsequently, the intelligent conversational interface collects structured patient information, encompassing demographics, chief complaints, medical and family histories, current treatments, and lifestyle factors such as physical activity, occupational loading, and prior joint injuries. It also gathers data on metabolic and hormonal status, psychological and nutritional health, and treatment preferences. The interface identifies and prompts for missing information to ensure complete data collection. Patients can request clarification on medical terminology or data requirements in real time. For unavailable clinical assessments such as the KOOS score, the Assessment Agent guides patients through the complete assessment protocol.

Development of the image analysis module

The imaging analysis models were trained on 12,719 bilateral knee radiographs obtained from the publicly available OAI database. The pipeline first performs joint localization using two independently trained U-Net models, which define the regions of interest for subsequent processing.
Multi-task image analysis is then conducted using eleven task-specific deep neural networks based on the ResNet architecture, enabling simultaneous classification of KOA severity, joint space narrowing, subchondral sclerosis, and osteophyte presence. Knee Joint Localization with an Enhanced UNet Pipeline To automatically identify the central regions of bilateral knees in radiographs, we implemented an optimized UNet 37 -based segmentation framework designed explicitly for high-contrast X-ray images. This localization process served as the foundation for all subsequent image analyses. We trained the model on 200 manually annotated anteroposterior radiographs, each paired with binary masks outlining knee centers. The architecture featured a five-layer UNet with skip connections, batch normalization, and ReLU activations. We initialized weights using Kaiming normalization to promote stable convergence. To address the significant foreground-background imbalance typical in knee radiographs, we employed an equally weighted hybrid loss function combining binary cross-entropy and Dice loss. During training, input images and masks underwent synchronized random flipping and resized cropping to maintain alignment. Performance evaluation utilized two metrics: bounding box intersection-over-union (IoU) and center point localization error. The Hungarian algorithm matched predicted and ground truth regions, ensuring unbiased metric computation. Training proceeded for 40 epochs using the Adam optimizer with cosine annealing learning rate scheduling, with model selection based on validation IoU performance. Radiographic Classification of KOA Features Using Transfer-Learned ResNet50 To facilitate multidimensional classification of KOA-related radiographic features, we implemented a ResNet 38 -50-based deep learning pipeline trained on localized anteroposterior (AP) knee radiographs. 
Following initial joint localization, each cropped image was processed through a modified ResNet-50 architecture, where the final classification layer was replaced with task-specific output heads.

Classification Targets and Dataset Construction

We curated a series of supervised classification tasks covering 11 clinically relevant radiographic features:

KOA severity
Joint space narrowing of the medial and lateral compartments
Subchondral bone sclerosis (medial femur, lateral femur, medial tibia, lateral tibia)
Osteophyte presence (binary detection across the same four compartments)

For each task, we constructed balanced datasets containing 500 images per severity level. The data was partitioned into training, validation, and testing sets using an 8:1:1 ratio to ensure a methodical approach to model development and evaluation.

KL Grading Adjustment and Label Consolidation

While the Kellgren-Lawrence (KL) grading system represents the established standard for radiographic KOA severity assessment 39 , model development revealed consistent difficulties in discriminating between KL grades 0 and 1. Expert review, conducted by two musculoskeletal radiologists and a senior orthopedic surgeon, confirmed that the visual differences between these grades were subtle and had minimal implications for treatment 40,41 . Consequently, we consolidated KL 0 (normal) and KL 1 (doubtful) into a unified None/Doubt category. The remaining KL grades were mapped as follows: KL 2 → Mild, KL 3 → Moderate, KL 4 → Severe. This restructuring produced a clinically meaningful and computationally effective four-class KOA severity framework:

None/Doubt (KL 0–1)
Mild (KL 2)
Moderate (KL 3)
Severe (KL 4)

To maintain compatibility with clinical standards, multimodal models were initially trained on the original five-grade KL classification and subsequently converted to the four-grade schema during downstream evaluation.
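The KL-to-severity consolidation described above amounts to a small lookup applied to five-grade predictions. The following is a minimal sketch of that conversion; the names `KL_TO_SEVERITY`, `kl_to_severity`, and `convert_predictions` are hypothetical and are not taken from the study's code, and only the mapping itself comes from the text.

```python
# Minimal sketch of the KL-to-severity consolidation described in the text.
# All identifiers here are hypothetical; only the mapping follows the paper.
KL_TO_SEVERITY = {
    0: "None/Doubt",  # KL 0 (normal)
    1: "None/Doubt",  # KL 1 (doubtful)
    2: "Mild",
    3: "Moderate",
    4: "Severe",
}


def kl_to_severity(kl_grade: int) -> str:
    """Map a Kellgren-Lawrence grade (0-4) to the four-class schema."""
    try:
        return KL_TO_SEVERITY[kl_grade]
    except KeyError:
        raise ValueError(f"KL grade must be an integer 0-4, got {kl_grade!r}") from None


def convert_predictions(kl_predictions):
    """Convert a sequence of five-grade KL predictions to severity labels."""
    return [kl_to_severity(kl) for kl in kl_predictions]
```

Applying the conversion after prediction, rather than retraining on merged labels, keeps the five-grade output available for comparison with standard KL reporting.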
All radiographs underwent standardized bone window processing (center: 300 HU, width: 1500 HU) to enhance bony structure contrast and minimize irrelevant soft-tissue variation. Images were normalized to the [0, 1] range, resized to 256 × 256 pixels, and randomly cropped to 224 × 224 pixels during training. Data augmentation included horizontal flipping, small-angle rotations, and brightness/contrast adjustments, applied synchronously to maintain label consistency. Model training was performed using a five-fold cross-validation approach. For each fold, we optimized the network using the Adam optimizer (initial learning rate: 1e-5) with a stepwise decay schedule (γ = 0.1 every seven epochs). Cross-entropy loss was used as the objective function. An early stopping mechanism (patience: 10 epochs; minimum delta: 1e-6) preserved the best-performing model state based on validation loss. Ground truth annotations from the OAI dataset were subjected to multi-level expert validation to ensure labeling accuracy and inter-rater consensus. To mitigate overfitting and maximize generalization, only center-cropped images were used during validation and testing. Model performance was evaluated using classification accuracy, confusion matrices, and area under the receiver operating characteristic curve (AUC-ROC) across all classes. These metrics were calculated per fold and averaged to obtain robust task-specific performance estimates. To enhance interpretability, Gradient-weighted Class Activation Mapping (Grad-CAM) was applied to visualize spatial attention maps on representative validation samples. These visualizations confirmed the appropriate model focus and identified anatomical regions that informed specific predictions. Training histories, including per-epoch loss, accuracy curves, confusion matrices, and ROC plots, were documented to ensure transparency and reproducibility. 
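The preprocessing window, learning-rate schedule, and early-stopping rule described above can be illustrated without any deep learning framework. The sketch below is written in plain Python under stated assumptions: the window is applied as a clip-and-rescale to [0, 1], the decay is applied at every seventh epoch boundary, and "improvement" means the validation loss drops by more than the minimum delta. The names `apply_bone_window`, `step_decay_lr`, and `EarlyStopping` are hypothetical and do not reproduce the authors' training code.

```python
def apply_bone_window(value, center=300.0, width=1500.0):
    """Clip an intensity to the window [center - width/2, center + width/2]
    and rescale it to [0, 1] (assumed interpretation of the windowing step)."""
    low = center - width / 2.0
    high = center + width / 2.0
    clipped = min(max(value, low), high)
    return (clipped - low) / (high - low)


def step_decay_lr(epoch, base_lr=1e-5, gamma=0.1, step_size=7):
    """Stepwise decay: multiply the learning rate by gamma every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)


class EarlyStopping:
    """Track the best validation loss and signal when to stop training."""

    def __init__(self, patience=10, min_delta=1e-6):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def update(self, epoch, val_loss):
        """Record one epoch; return True once patience is exhausted."""
        if val_loss < self.best_loss - self.min_delta:
            # Improvement: remember this epoch as the best model state.
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

In a real training loop, the epoch stored in `best_epoch` would identify which checkpoint to restore once `update` returns True.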
Development of KOA Patient Information Collection module

To facilitate standardized and clinically interpretable documentation of KOA patient profiles, we developed an Assessment Agent based on an LLM optimized through domain-specific prompt engineering. The agent was designed to function as a clinical intake assistant, conducting dynamic natural language dialogue to compile clinical profiles for each patient.

System Design and Implementation

The generation process began with a structured system prompt that established the documentation template and directed the model to gather information across 11 essential clinical domains: demographics, chief complaint and history of present illness, radiographic findings, past and family history, current treatment status, psychological well-being, nutritional condition, treatment goals and preferences, and available rehabilitation resources. To enhance clinical utility, the prompt incorporated mechanisms for real-time clarification of medical terminology. When simulated patients expressed uncertainty regarding specialized concepts (e.g., KOOS scoring or subchondral sclerosis), the LLM automatically provided context-appropriate explanations before continuing the assessment dialogue.

Evaluation Methodology

We evaluated the Assessment Agent using 100 simulated KOA patient scenarios from West China Hospital. For each simulation, a trained research assistant assumed a predefined patient persona and engaged in an interactive dialogue with the LLM. Each session proceeded until the model determined that sufficient clinical information had been collected, at which point it autonomously concluded the interview process. All generated outputs were anonymized and randomized for subsequent expert review. To assess the clinical quality of the generated structured cases, we conducted a blinded expert evaluation.
Three senior orthopedic and sports medicine physicians independently evaluated each case across four predetermined dimensions:

Field completeness: Assessment of whether all expected clinical fields contained adequate information
Logical consistency: Evaluation of internal coherence and clinical plausibility of the narrative
Medical accuracy: Assessment of appropriate and correct application of clinical terminology and judgment
Readability: Evaluation of clarity, fluency, and professional expression

Before formal assessment, all expert evaluators participated in a calibration session. This session included review and discussion of representative examples illustrating high-, medium-, and low-quality cases, followed by collaborative refinement of the scoring rubric. During the formal evaluation phase, each expert was blinded to both the origin and sequence of the cases, and no communication was permitted during individual assessments. Each dimension was rated on a 1-5 scale, with the final score for each case calculated by averaging ratings across the three evaluators.

Agent Benchmarking and Comparative Evaluation

To assess the diagnostic performance of the KOM system relative to current vision-language models, we conducted a benchmarking study comparing five leading multimodal large language models (MLLMs): Google Gemini 2.0 Pro, GPT-4o, Claude 3.7 Sonnet, QwenMax VL, and LLaMA 3.2 90B Vision Instruct. All models were evaluated under identical input and task conditions to ensure methodological consistency.

Evaluation Dataset and Protocol

The evaluation dataset consisted of 500 bilateral knee radiographs (1,000 knees in total) obtained from the publicly available OAI cohort. All samples were excluded from any previous training or fine-tuning of the KOM system or its constituent components.
For each case, a standardized input prompt was constructed to elicit three clinically relevant outputs:

Bilateral KOA severity grading (based on the Kellgren-Lawrence scale)
OA presence detection for each knee (binary classification)
Left knee localization (spatial discrimination to assess model orientation awareness)

To maintain consistency across model evaluations, all knee radiographs underwent preprocessing to ensure uniform viewing orientation (with the left knee positioned on the right side, facing the observer), and standardized diagnostic prompts were employed. Each model received identical image-text inputs and was evaluated based solely on its unmodified output without manual intervention.

Evaluation Metrics and Ground Truth

For KOA severity grading, the reference standard was derived from the revised OAI dataset and mapped into a four-class severity framework: None/Doubt (KL 0-1), Mild (KL 2), Moderate (KL 3), and Severe (KL 4). The KOM system was explicitly designed to predict KOA severity grades that align directly with this classification schema. All other multimodal agents were first prompted to produce KL grades, which were then converted into corresponding severity categories using the same KL-to-severity mapping protocol. For the OA presence detection task, models employed different output approaches. The KOM system internally predicted a KL severity grade for each knee, from which OA presence was derived using a predefined threshold: cases classified as KL 0 or 1 were categorized as "No OA," whereas KL ≥2 was designated as "OA present." In contrast, multimodal LLMs were prompted to directly determine whether OA was present or absent for each knee, without generating an explicit KL grade. To enable valid comparison, ground truth OA labels were derived using consistent KL-based thresholding: KL 0-1 → No OA, KL 2-4 → OA present. Model predictions were binarized accordingly, and accuracy was calculated separately for left and right knees.
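The thresholding and per-knee accuracy calculation described above can be sketched in a few lines. This is an illustrative example only: the record layout (`side`, `pred_oa`, `ref_kl`) and the function names are assumptions made for the sketch, not the study's evaluation code. For a KL-predicting system, `pred_oa` would be obtained by passing the predicted grade through the same threshold; for models that answer directly, it is the model's yes/no output.

```python
def oa_present(kl_grade: int) -> bool:
    """Binarize a KL grade: KL 0-1 -> No OA (False), KL 2-4 -> OA present (True)."""
    return kl_grade >= 2


def per_knee_accuracy(records):
    """Compute OA-detection accuracy separately for left and right knees.

    Each record is a dict with hypothetical keys:
      'side'    -> 'left' or 'right'
      'pred_oa' -> bool, the model's binarized OA prediction
      'ref_kl'  -> int, the reference KL grade (0-4)
    """
    correct = {"left": 0, "right": 0}
    total = {"left": 0, "right": 0}
    for rec in records:
        side = rec["side"]
        total[side] += 1
        # The reference label is derived with the same KL-based threshold.
        if rec["pred_oa"] == oa_present(rec["ref_kl"]):
            correct[side] += 1
    return {side: correct[side] / total[side] for side in total if total[side] > 0}
```

Deriving both the prediction and the reference label through one threshold function keeps the comparison between KL-predicting and directly answering models on the same footing.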
Risk Agent Functionality and Workflow

Upon activation of the symptom and radiographic prediction agent, patients provide baseline data encompassing 31 clinical parameters, including body mass index (BMI), age, body weight, KOOS subscale scores, and bilateral knee muscle strength measurements. These parameters can be populated automatically from the structured clinical documentation previously completed by the Assessment Agent or entered directly by the patient.

Symptom Prediction Submodule

To predict KOOS subscale outcomes at 1-year and 4-year follow-ups, regression models utilizing XGBoost 42 , LightGBM 43 , Random Forest 44 , Gradient Boosting 45 , Support Vector Regression (SVR 46 ), and Elastic Net 47 algorithms were developed. The input dataset comprised 31 patient parameters that underwent preprocessing, including removal of incomplete cases, categorical encoding, and z-score standardization. Model performance was evaluated using five-fold cross-validation with multiple metrics: R², mean squared error (MSE), mean absolute error (MAE), and Pearson correlation coefficient (Pearson r). Feature importance analyses were conducted and visualized via bar plots, residual analyses, and scatter plots to provide insights into model predictions. Quantitative evaluation results are presented in the Supplementary Graphs S3.

Radiographic Prediction Submodule

For radiographic outcome prediction, structured clinical data from the OAI dataset were utilized to forecast KL grades for both knees at 1-year and 4-year intervals, constituting four distinct prediction tasks. The dataset underwent stratified splitting (70% training, 30% validation) and class balancing to ensure uniform representation with 1,000 cases per KL grade. Eight machine learning algorithms were evaluated: XGBoost 42 , LightGBM 43 , Random Forest 44 , Gradient Boosting 45 , AdaBoost 48 , Support Vector Machine (SVM 49 ), K-Nearest Neighbors (KNN 50 ), and Multi-layer Perceptron (MLP).
Model robustness was assessed using 100 iterations of Monte Carlo cross-validation, with performance quantified through accuracy (ACC), weighted precision, recall, F1-score, and macro-area under the receiver operating characteristic curve (macro-AUC). Confusion matrices and ROC curves illustrating misclassification patterns and model discriminative capabilities are provided in the Supplementary Graphs S3. Risk Factor Analysis Submodule To produce individualized risk factor assessments, predictive models were enhanced with SHAP 51 analyses. These analyses produce interactive force plots that illustrate the relative contributions of key clinical parameters, such as BMI, body weight, age, and muscle strength, in predicting patient-specific osteoarthritis progression risk. Each prediction task utilizes the single best-performing model, allowing comparative analyses of risk factor contributions. This flexible analytical framework supports both consensus-driven and divergent risk assessments, providing clinicians with insights to tailor interventions to individual patient profiles. Therapy Agents Group Functionality and Workflow When entering the multidisciplinary Intervention Agent, patient profiles that integrate structured clinical data from the Assessment Agent and personalized predictive outcomes with risk factor analyses produced by the Risk Agent are used as foundational inputs. This Agent employs a collaborative multi-agent artificial intelligence architecture to generate tailored therapeutic recommendations across diverse clinical domains autonomously. The system incorporates specialized intelligent agents functioning as exercise prescriptionists, surgical and pharmacological interventionists, and nutritional and psychological specialists. Each agent independently develops structured, evidence-informed treatment recommendations within its domain of expertise. 
A clinical decision-making agent subsequently synthesizes these domain-specific interventions to produce a patient-specific management strategy. Throughout this process, the system prioritizes clinical applicability, scientific accuracy, patient safety, and individualized care. Detailed metrics for evaluating the clinical effectiveness of these interventions are described in subsequent sections. Knowledge Base Construction To ensure evidence-based therapeutic recommendations, extensive domain-specific knowledge bases were developed, encompassing five key intervention categories: exercise rehabilitation, surgical techniques, rehabilitation interventions, nutrition, and psychological therapies. A literature search conducted in the PubMed database initially identified 33,641 articles and clinical guidelines. Through article evaluation, these were refined to a core repository of 4,017 high-quality, peer-reviewed publications and internationally recognized clinical guidelines. Each selected document underwent targeted extraction of results and recommendations sections, excluding unrelated content to optimize relevance and retrieval accuracy. The annotated excerpts were organized into structured repositories optimized for RAG, thereby enhancing the accuracy and specificity of knowledge retrieval and agent-generated therapeutic recommendations. Complete details of literature selection criteria, evaluation methodologies, and knowledge base composition are provided within the supplementary materials. Development of Individual and Multi-Agent Architectures The multidisciplinary clinical recommendation system integrates four distinct intelligent agents, each leveraging the Qwen-Max large language model as a foundational engine, further optimized through targeted RAG techniques and tailored prompt engineering. 
Exercise Prescriptionist Agent: applies the FITT-VP framework (Frequency, Intensity, Time, Type, Volume, and Progression), dynamically customizing exercise regimens based on patient-specific clinical data and therapeutic objectives retrieved via RAG. Surgical and Medication Specialist Agent: Matches patient profiles against surgical guidelines and pharmacological guideline databases, providing detailed and precise intervention proposals with appropriate dosing, timing, and procedural specifications. Nutritional and Psychological Specialist Agent: Integrates nutritional prescriptions strictly adhering to the ABCMV principles (Adequacy, Balance, Calorie control, Moderation, Variety), supplemented by tailored psychological management strategies responsive to individual patient needs. Clinical Decision-Making Agent: Serves as the integration hub where outputs from the specialized agents converge. This agent critically evaluates and synthesizes the recommendations, optimizing outcomes along dimensions of accuracy, comprehensiveness, personalization, patient safety, and domain-specific professional standards. Through this architecture, the final integrated prescription is validated and tailored to each patient's unique clinical profile and therapeutic requirements. Clinical Validation with Real-World Patient Data We retrospectively assembled a cohort of 250 KOA patients from West China Hospital, Sichuan University. This dataset captured detailed baseline demographics, clinical examinations, radiographic findings, etiological diagnoses, patient treatment preferences, and institutional resource availability. 
Clinical management strategies were classified into five categories:

Conservative management (n = 73)
Total knee arthroplasty (n = 62)
Unicompartmental knee arthroplasty (n = 43)
Osteotomy (n = 39)
Arthroscopic surgery (n = 33)

Evaluation Protocol

For each patient case, the KOM system, GPT-4o, Claude 3.7, and five additional vision-language models independently generated treatment recommendations. To eliminate bias, all model outputs were de-identified and randomized into a single pool for evaluation. Three senior orthopedic experts, each with over ten years of specialized clinical experience, conducted fully blinded evaluations of all recommendations using a unified seven-dimensional rubric:

Evidence-based practice
Completeness
Exercise prescription
Nutrition prescription
Personalization
Accessibility and feasibility
Safety

Before formal scoring, the reviewers participated in a calibration workshop where they jointly reviewed five exemplar cases representing both exemplary and suboptimal recommendations. This process resolved scoring discrepancies and resulted in a detailed evaluation manual, ensuring harmonized threshold definitions and high inter-rater consistency. In parallel, standard clinical prescriptions were developed as benchmark references, with comparative linguistic similarity analyses (including BLEU, BERT, and ROUGE metrics) quantifying each model’s adherence to the gold-standard clinical treatment protocol.

Prospective Evaluation Using a Simulated Patient Cohort

Study Design and Participant Selection

We extracted de-identified records for 50 KOA patients from the West China Hospital database. Baseline weight-bearing radiographs and accompanying clinical data underwent independent review and confirmation by two senior orthopedic surgeons; only cases approved by both reviewers advanced to the simulated evaluation.
Twenty doctoral candidates without prior KOA-specific imaging or therapeutic training (age 25–35 years, ≤1 year clinical rotation) were randomized via a computerized draw application into two arms (n = 10 each):

1. "Physicians-only" group
2. "Physicians + KOM" group

Randomization was performed by an independent data manager using sealed electronic envelopes to ensure allocation concealment. Before case evaluation, all participants attended a single standardized training session that covered KOA radiographic interpretation and prescription formulation protocols.

Clinical Materials and Assessment Methodology

Corresponding radiographic data for each simulated case were drawn from the OAI, ensuring robust clinical validity and standardized imaging support. Radiographic grading accuracy was calculated as the proportion of cases for which the predicted KOA severity exactly matched the adjudicated reference grade in our corrected OAI database. We recorded the total time required for radiographic interpretation and prescription development tasks across all groups, quantitatively assessing efficiency gains attributable to KOM system integration.

Treatment Plan Evaluation

To evaluate clinical decision-making quality, we generated treatment plans for 50 simulated KOA patient cases across three cohorts:

1. KOM group (KOM runs three times per case)
2. Physicians’ group (three different physicians per case)
3. Collaboration group (three physicians using KOM)

This process resulted in a total of 450 de-identified plans. All plans were pooled, randomized, and stripped of origin labels to prevent evaluation bias.
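The pooling and blinding step described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name, group labels, ID format, and fixed seed are all hypothetical, and the plan text itself is omitted. It only shows how 50 cases × 3 cohorts × 3 plans yield 450 records, and how origin labels can be held in a separate key while reviewers see opaque IDs.

```python
import random


def build_blinded_pool(n_cases=50,
                       groups=("KOM", "Physicians", "Physicians+KOM"),
                       plans_per_group=3,
                       seed=0):
    """Pool all plans, shuffle them, and replace origin labels with opaque IDs.

    All names and defaults here are illustrative assumptions.
    """
    # One record per (case, group, replicate); plan text is left out of the sketch.
    pool = [
        {"case": case, "group": group, "replicate": rep}
        for case in range(1, n_cases + 1)
        for group in groups
        for rep in range(1, plans_per_group + 1)
    ]

    rng = random.Random(seed)  # fixed seed so the shuffle is reproducible
    rng.shuffle(pool)

    blinded = []  # what reviewers would see: no origin label
    key = {}      # held back separately, used to unblind after scoring
    for i, record in enumerate(pool, start=1):
        plan_id = f"P{i:03d}"
        blinded.append({"plan_id": plan_id, "case": record["case"]})
        key[plan_id] = record["group"]
    return blinded, key


blinded, key = build_blinded_pool()
print(len(blinded))  # 450
```

Keeping the unblinding key in a separate structure mirrors the idea that scorers never handle origin labels; in practice the key would be stored by someone outside the review team.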
Two senior orthopedic specialists, each with over a decade of specialized experience, independently scored every plan using a harmonized seven-dimensional rubric:

1. Evidence-based practice
2. Completeness
3. Exercise prescription
4. Nutrition prescription
5. Personalization
6. Accessibility and feasibility
7. Safety

Before formal review, the experts jointly examined five exemplar plans (representing both high and low quality), reconciled scoring discrepancies, and finalized a detailed evaluation manual to ensure consistent application of rating thresholds. To anchor assessments in best-practice care, a third senior specialist created gold-standard prescriptions for each case based on current clinical guidelines. Finally, we quantitatively compared each free-text plan against its benchmark using established textual similarity metrics (BLEU, BERT, and ROUGE), enabling an appraisal of each plan's coherence with expert protocols. Procedural steps, the full scoring rubric, and calibration details are provided in the Supplementary Methods.

Statistical Analysis

To standardize scores within each model across metrics, we applied row-wise z-score normalization to the model-by-metric mean matrix used for visualization (Fig. 4f). For model $i$ and metric $j$, with mean score $x_{ij}$, we computed

$$z_{ij} = \frac{x_{ij} - \mu_i}{\sigma_i},$$

where $\mu_i$ and $\sigma_i$ denote the mean and standard deviation of all metric scores for model $i$, respectively. This emphasizes the relative distribution of metrics within a model rather than between-model differences. The resulting z-score matrix was used for heatmap visualization and pattern analysis.

For statistical analysis of between-group differences, diagnostic accuracy and task completion time were treated as continuous variables and assessed for normality using the Shapiro-Wilk test. For normally distributed data, independent samples t-tests were used; for non-normally distributed data, the Mann-Whitney U test was employed for between-group comparisons.
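As an illustration of the row-wise z-score normalization used for the Fig. 4f heatmap, the sketch below standardizes each model's metric means against that model's own mean and standard deviation. It is a minimal example under stated assumptions: the text does not say whether the population or sample standard deviation was used (the population form is assumed here), and the function name, data layout, and scores are hypothetical.

```python
from statistics import mean, pstdev


def rowwise_zscore(matrix):
    """Standardize each model's metric means within that model (row-wise).

    `matrix` maps model name -> {metric name -> mean score}.
    Assumption: population SD (pstdev); the source does not specify which form.
    """
    normalized = {}
    for model, scores in matrix.items():
        values = list(scores.values())
        mu = mean(values)
        sigma = pstdev(values)
        if sigma == 0:
            # All metrics equal for this model: define every z-score as 0.
            normalized[model] = {metric: 0.0 for metric in scores}
        else:
            normalized[model] = {
                metric: (x - mu) / sigma for metric, x in scores.items()
            }
    return normalized


# Made-up scores for a single hypothetical model, for illustration only.
example = {"model_a": {"safety": 4.0, "completeness": 3.0, "personalization": 5.0}}
z = rowwise_zscore(example)
print({metric: round(v, 3) for metric, v in z["model_a"].items()})
# {'safety': 0.0, 'completeness': -1.225, 'personalization': 1.225}
```

Because each row is centered and scaled separately, the output shows which metrics are relatively strong or weak within one model; it says nothing about how models compare with each other.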
Treatment quality scores, being ordinal variables, were consistently analyzed using Mann-Whitney U tests. Statistical significance for all between-group comparisons was established at p < 0.05.

Declarations

Acknowledgements

Funding

This work was funded by the Youth Research Fund of Sichuan Science and Technology Planning Department, grant number 23NSFSC4894 (received by Xi Chen).

Author Contributions

W.L., X.C., and Z.J. are the main designers of the study. W.L. and X.C. are the main executors of the study. K.L. and J.L. contributed to the study by managing and supervising the revision work and providing critical feedback during the major revision process. H.Z., H.C., K.L., and W.L. served as consultants for computer science-related knowledge. H.C. and W.L. developed the code for this study and performed the model training. W.L., X.C., Z.J., L.Z., K.Z., R.T., L.W., and M.Y. evaluated the models' responses and prepared the test dataset. W.L., X.C., and Z.J. participated in drafting the manuscript. K.L. and J.L. provided overall guidance and supervision for the project. All authors have read and approved the final version of the manuscript.

Competing interests

The authors declare no competing interests.

Ethics declarations

This study was approved by the Ethics Committee of West China Hospital, Sichuan University (Approval No. 23-2277). All procedures complied with the Declaration of Helsinki and relevant national regulations, including China’s Personal Information Protection Law.

Code availability

The complete source code for this project is publicly accessible at https://github.com/jacobliuweizhi/KOM. A demonstration of the implementation is available through our interactive web interface at https://huggingface.co/spaces/Miemie123/Streamlit?page=Tailored+Therapy+Recommendation&start=1.

Data availability

The code developed for this study is available at https://github.com/jacobliuweizhi/KOM under the GNU Affero General Public License v3.0. W.L., H.Z., H.C., and K.L.
contributed to the code development and are responsible for maintaining the repository. Reference documents used in the RAG module are listed in the repository. Osteoarthritis-related imaging and clinical data used in this study are accessible through the OAI database (https://nda.nih.gov/oai), subject to data use agreements.

References

1. Steinmetz JD et al (2023) Global, regional, and national burden of osteoarthritis, 1990–2020 and projections to 2050: a systematic analysis for the Global Burden of Disease Study 2021. Lancet Rheumatol 5:e508–e522
2. Martel-Pelletier J, Boileau C, Pelletier J-P, Roughley PJ (2008) Cartilage in normal and osteoarthritis conditions. Best Pract Res Clin Rheumatol 22:351–384
3. Kloppenburg M, Namane M, Cicuttini F (2025) Osteoarthritis. Lancet 405:71–85. https://doi.org/10.1016/S0140-6736(24)02322-5
4. Bliddal H et al (2024) Once-Weekly Semaglutide in Persons with Obesity and Knee Osteoarthritis. N Engl J Med 391:1573–1583. https://doi.org/10.1056/NEJMoa2403664
5. Thomas KA et al (2020) Automated Classification of Radiographic Knee Osteoarthritis Severity Using Deep Neural Networks. Radiology: Artif Intell 2:e190065. https://doi.org/10.1148/ryai.2020190065
6. Leung K et al (2020) Prediction of Total Knee Replacement and Diagnosis of Osteoarthritis by Using Deep Learning on Knee Radiographs: Data from the Osteoarthritis Initiative. Radiology 296:584–593. https://doi.org/10.1148/radiol.2020192091
7. Norman B, Pedoia V, Majumdar S (2018) Use of 2D U-Net Convolutional Neural Networks for Automated Cartilage and Meniscus Segmentation of Knee MR Imaging Data to Determine Relaxometry and Morphometry. Radiology 288:177–185. https://doi.org/10.1148/radiol.2018172322
8. Tiulpin A et al (2019) Multimodal Machine Learning-based Knee Osteoarthritis Progression Prediction from Plain Radiographs and Clinical Data. Sci Rep 9:20038. https://doi.org/10.1038/s41598-019-56527-3
9. Castagno S, Birch M, van der Schaar M, McCaskie A (2024) Predicting rapid progression in knee osteoarthritis: a novel and interpretable automated machine learning approach, with specific focus on young patients and early disease. Ann Rheum Dis, ard-2024-225872. https://doi.org/10.1136/ard-2024-225872
10. Nielsen RL et al (2024) Data-driven identification of predictive risk biomarkers for subgroups of osteoarthritis using interpretable machine learning. Nat Commun 15:2817. https://doi.org/10.1038/s41467-024-46663-4
11. Du K et al (2025) Comparing Artificial Intelligence–Generated and Clinician-Created Personalized Self-Management Guidance for Patients With Knee Osteoarthritis: Blinded Observational Study. J Med Internet Res 27:e67830. https://doi.org/10.2196/67830
12. Wang L et al (2024) Prompt engineering in consistency and reliability with the evidence-based guideline for LLMs. npj Digit Med 7:41. https://doi.org/10.1038/s41746-024-01029-4
13. Chen X et al (2024) Evaluating and Enhancing Large Language Models' Performance in Domain-Specific Medicine: Development and Usability Study With DocOA. J Med Internet Res 26:e58158. https://doi.org/10.2196/58158
14. Chen X et al (2025) Enhancing diagnostic capability with multi-agents conversational large language models. npj Digit Med 8:159. https://doi.org/10.1038/s41746-025-01550-0
15. Radford A et al in International conference on machine learning. 8748–8763 (PMLR)
16. Li B et al (2025) LLaVA-RadZ: Can Multimodal Large Language Models Effectively Tackle Zero-shot Radiology Recognition? arXiv preprint arXiv:2503.07487
17. Jang J et al (2024) Significantly improving zero-shot X-ray pathology classification via fine-tuning pre-trained image-text encoders. Sci Rep 14:23199
18. Jiang Y, Chen C, Nguyen D, Mervak BM, Tan C (2024) GPT-4V cannot generate radiology reports yet. arXiv preprint arXiv:2407.12176
19. Pi S-W, Lee B-D, Lee MS, Lee HJ (2023) Ensemble deep-learning networks for automated osteoarthritis grading in knee X-ray images. Sci Rep 13:22887
20. Wang P et al (2023) Large language models are not fair evaluators. arXiv preprint
21. Swiecicki A et al (2021) Deep learning-based algorithm for assessment of knee osteoarthritis severity in radiographs matches performance of radiologists. Comput Biol Med 133:104334
22. Elnashar A, White J, Schmidt DC (2025) Enhancing structured data generation with GPT-4o evaluating prompt efficiency across prompt styles. Front Artif Intell 8:1558938
23. Li J et al (2024) Integrated image-based deep learning and language models for primary diabetes care. Nat Med 30:2886–2896. https://doi.org/10.1038/s41591-024-03139-8
24. Yu KKH et al (2025) Investigative needle core biopsies support multimodal deep-data generation in glioblastoma. Nat Commun 16:3957. https://doi.org/10.1038/s41467-025-58452-8
25. Sosinsky A et al (2024) Insights for precision oncology from the integration of genomic and clinical data of 13,880 tumors from the 100,000 Genomes Cancer Programme. Nat Med 30:279–289. https://doi.org/10.1038/s41591-023-02682-0
26. Zakka C et al (2024) Almanac — Retrieval-Augmented Language Models for Clinical Medicine. NEJM AI 1:AIoa2300068. https://doi.org/10.1056/AIoa2300068
27. Ceresa M et al (2025) Retrieval Augmented Generation Evaluation for Health Documents. arXiv preprint arXiv:2505.04680
28. Yang R et al (2025) Retrieval-augmented generation for generative artificial intelligence in health care. npj Health Syst 2:2. https://doi.org/10.1038/s44401-024-00004-1
29. Meta Fundamental AI Research Diplomacy Team (FAIR) et al (2022) Human-level play in the game of Diplomacy by combining language models with strategic reasoning. Science 378:1067–1074. https://doi.org/10.1126/science.ade9097
30. Ma C, Li A, Du Y, Dong H, Yang Y (2024) Efficient and scalable reinforcement learning for large-scale network control. Nat Mach Intell 6:1006–1020. https://doi.org/10.1038/s42256-024-00879-7
31. Goh E et al (2024) Large Language Model Influence on Diagnostic Reasoning: A Randomized Clinical Trial. JAMA Netw Open 7:e2440969. https://doi.org/10.1001/jamanetworkopen.2024.40969
32. Ayers JW et al (2023) Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum. JAMA Intern Med 183:589–596. https://doi.org/10.1001/jamainternmed.2023.1838
33. Gaber F et al (2025) Evaluating large language model workflows in clinical decision support for triage and referral and diagnosis. npj Digit Med 8:263. https://doi.org/10.1038/s41746-025-01684-1
34. Car J et al (2025) The Digital Health Competencies in Medical Education Framework: An International Consensus Statement Based on a Delphi Study. JAMA Netw Open 8:e2453131. https://doi.org/10.1001/jamanetworkopen.2024.53131
35. Tordjman M et al (2025) Comparative benchmarking of the DeepSeek large language model on medical tasks and clinical reasoning. Nat Med. https://doi.org/10.1038/s41591-025-03726-3
36. Osteoarthritis Initiative Investigators (2008) The Osteoarthritis Initiative: A Knee Health Study. https://nda.nih.gov/oai/
37. Ronneberger O, Fischer P, Brox T (2015) in Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5–9, proceedings, part III 18. 234–241 (Springer)
38. He K, Zhang X, Ren S, Sun J in Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778
39. Kellgren JH, Lawrence JS (1957) Radiological assessment of osteo-arthrosis. Ann Rheum Dis 16:494–502. https://doi.org/10.1136/ard.16.4.494
40. Kohn MD, Sassoon AA, Fernando ND (2016) Classifications in Brief: Kellgren-Lawrence Classification of Osteoarthritis. Clin Orthop Relat Res 474:1886–1893. https://doi.org/10.1007/s11999-016-4732-4
41. Tang S et al (2025) Osteoarthritis. Nat Rev Dis Primers 11:10. https://doi.org/10.1038/s41572-025-00594-6
42. Chen T, Guestrin C in Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. 785–794
43. Ke G et al (2017) LightGBM: A highly efficient gradient boosting decision tree. Adv Neural Inf Process Syst 30
44. Breiman L (2001) Random forests. Mach Learn 45:5–32
45. Friedman JH (2001) Greedy function approximation: a gradient boosting machine. Ann Stat, 1189–1232
46. Drucker H, Burges CJ, Kaufman L, Smola A, Vapnik V (1996) Support vector regression machines. Adv Neural Inf Process Syst 9
47. Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J Royal Stat Soc Ser B: Stat Methodol 67:301–320
48. Freund Y, Schapire RE in ICML. 148–156 (Citeseer)
49. Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20:273–297
50. Cover T, Hart P (1967) Nearest neighbor pattern classification. IEEE Trans Inf Theory 13:21–27
51. Lundberg SM, Lee S-I (2017) A unified approach to interpreting model predictions. Adv Neural Inf Process Syst 30
12:19:07","extension":"png","order_by":12,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":97839,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage5.png","url":"https://assets-eu.researchsquare.com/files/rs-8147049/v1/4b4a9c3bad6aea85348cf8ac.png"},{"id":96364827,"identity":"efe141fd-73b8-4276-acbb-81d827355c02","added_by":"auto","created_at":"2025-11-20 10:09:41","extension":"xml","order_by":13,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":156748,"visible":true,"origin":"","legend":"","description":"","filename":"rs81470490structuring.xml","url":"https://assets-eu.researchsquare.com/files/rs-8147049/v1/ae799eaa8c16f69adaf9dd46.xml"},{"id":96289335,"identity":"0f6f48e1-758e-4508-b2a3-fb8a534f361e","added_by":"auto","created_at":"2025-11-19 12:19:07","extension":"html","order_by":14,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":177720,"visible":true,"origin":"","legend":"","description":"","filename":"earlyproof.html","url":"https://assets-eu.researchsquare.com/files/rs-8147049/v1/a4a832b57e765d7dc57bba57.html"},{"id":96289316,"identity":"b47801e1-1b23-4679-8cf3-21100231a31e","added_by":"auto","created_at":"2025-11-19 12:19:06","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":355717,"visible":true,"origin":"","legend":"\u003cp\u003eTitle: Multi-Agent Framework for KOA Assessment, Risk Prediction, and Therapy Planning\u003c/p\u003e\n\u003cp\u003eCaption: Overview of the multi-agent framework for KOA management, clinical workflow, and performance benchmarking. The framework includes an Assessment Agent, Risk Agent, and Therapy Agents Group working together to evaluate symptoms and imaging, predict progression, and plan personalized therapy. The workflow involves assessment, risk prediction, and therapy planning phases. 
KOM showed higher performance than the evaluated baseline models and participants within the specific tasks and datasets assessed in this study.\u003c/p\u003e","description":"","filename":"floatimage1.png","url":"https://assets-eu.researchsquare.com/files/rs-8147049/v1/4c4179765c9edb31a9c0ab5f.png"},{"id":96289317,"identity":"3d7d8f4e-7516-49e8-8196-e4add4cb7d81","added_by":"auto","created_at":"2025-11-19 12:19:07","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":1007866,"visible":true,"origin":"","legend":"\u003cp\u003eTitle: Performance Evaluation and Benchmarking of the KOA Assessment Agent.\u003c/p\u003e\n\u003cp\u003eCaption: KOM outperforms major large models in KOA severity grading and OA detection across severity levels. Training and visualization demonstrate accurate joint localization and radiographic focus. KOM achieves better performance on multiple assessment tasks and high expert evaluation scores for information quality, with superior accuracy for both knees compared to competitors.\u003c/p\u003e","description":"","filename":"floatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-8147049/v1/37fafb653973a3894fdcce2c.png"},{"id":96364170,"identity":"59826c8b-347a-4305-acfe-f5593c895787","added_by":"auto","created_at":"2025-11-20 10:08:59","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":1035469,"visible":true,"origin":"","legend":"\u003cp\u003eTitle: Performance Evaluation and Predictive Modeling of the KOA Risk Agent. Caption: The framework integrates structured data for KOOS and KL grade prediction. Various machine learning models are compared using accuracy, confusion matrices, and radar plots across time points. 
Performance varies across models, with some showing stronger long-term prediction capabilities.\u003c/p\u003e","description":"","filename":"floatimage3.png","url":"https://assets-eu.researchsquare.com/files/rs-8147049/v1/24a9582ef45997bbb5891e88.png"},{"id":96364780,"identity":"e898fb32-59bf-48c2-95cf-2ea9b75bb037","added_by":"auto","created_at":"2025-11-20 10:09:38","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":743320,"visible":true,"origin":"","legend":"\u003cp\u003eTitle: Performance Evaluation and Comparative Analysis of the KOA Therapy Agents Group\u003c/p\u003e\n\u003cp\u003eCaption: KOM achieves the highest expert scores and rankings across seven clinical domains compared to multiple large models. Text similarity analysis shows KOM outputs most closely match reference prescriptions across BLEU, ROUGE, and BERT metrics.\u003c/p\u003e","description":"","filename":"floatimage4.png","url":"https://assets-eu.researchsquare.com/files/rs-8147049/v1/06cc8cb0476b253e61a58011.png"},{"id":96289318,"identity":"835e911a-1f71-46fd-bdfa-341aad3033b4","added_by":"auto","created_at":"2025-11-19 12:19:07","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":378310,"visible":true,"origin":"","legend":"\u003cp\u003eTitle: Evaluation of Human–AI Collaboration in KOA Diagnosis and Treatment Planning\u003c/p\u003e\n\u003cp\u003eCaption: Study design, physician–KOM collaboration workflow, diagnostic efficiency, clinical evaluation, and text similarity analysis across different groups. The study included 50 KOA patients (KL 0–4, n = 10 each) and compared three settings: physician-only (MS), KOM-only, and physician–KOM collaboration (MS+KOM). 
Panel (a) shows the study design; (b) illustrates the physician–KOM collaboration process; (c) presents diagnostic time comparisons, with the collaboration group achieving faster completion; (d) shows expert evaluations across seven clinical criteria, where the collaboration group achieved the highest scores; (e) compares grading accuracy, which improved from MS to KOM and was highest in MS+KOM; and (f) presents text similarity metrics (BLEU, ROUGE, BERT), showing that collaboration produced outputs closest to reference prescriptions.\u003c/p\u003e","description":"","filename":"floatimage5.png","url":"https://assets-eu.researchsquare.com/files/rs-8147049/v1/894f66ef8308bde3ea84db4f.png"},{"id":96453106,"identity":"95a3c8f9-e70e-4e19-9eaa-c66e51fa5f4c","added_by":"auto","created_at":"2025-11-21 09:58:09","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":3805189,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-8147049/v1/fa7134b5-10dd-4328-aa24-1b885d29e60b.pdf"}],"financialInterests":"The authors declare no competing interests.","formattedTitle":"\u003cp\u003e\u003cem\u003e\u003cstrong\u003eKOM: A Multi-Agent Artificial Intelligence System for Precision Management of Knee Osteoarthritis (KOA)\u003c/strong\u003e\u003c/em\u003e\u003c/p\u003e","fulltext":[{"header":"Introduction","content":"\u003cp\u003eKOA affects approximately 600 million people worldwide, characterized by progressive pain and functional deterioration that frequently necessitates total knee replacement in the end \u003csup\u003e1-3\u003c/sup\u003e. Timely and appropriate intervention is essential for slowing structural deterioration, alleviating symptoms, and improving functional outcomes\u003csup\u003e4\u003c/sup\u003e. 
However, delivering multidisciplinary personalized KOA management for large patient groups remains challenging, particularly in resource-limited healthcare systems\u003csup\u003e3\u003c/sup\u003e.\u003c/p\u003e\n\u003cp\u003eAdvances in AI have created new opportunities for KOA management through automated analysis and risk prediction\u003csup\u003e5-11\u003c/sup\u003e. AI-driven radiographic techniques, particularly convolutional neural networks, now facilitate the automated analysis of osteoarthritis images\u003csup\u003e5-7\u003c/sup\u003e. In parallel, prediction models integrating clinical and imaging data have been developed to estimate the risk of structural progression and functional decline in KOA patients\u003csup\u003e8-10\u003c/sup\u003e. However, these studies have not directly demonstrated their impact on the complete KOA management workflow.\u003c/p\u003e\n\u003cp\u003eMoreover, the principal challenges in complete KOA management stem from the intensive nature of patient interactions and treatment planning. First, completing the patient assessment process requires substantial clinical resources. In clinical workflows, multiple rounds of communication between clinicians and patients are typically needed to collect a full medical history. This process consumes significant time from clinicians, who are already facing heavy workloads, and the repeated execution of these time-consuming procedures may lead to the omission of critical information during interactions, potentially resulting in serious medical events. Second, effective disease management strategies require the formulation of personalized intervention plans based on the current disease status, quantified key progression risk factors, contraindications, and patient-specific needs. 
In the absence of AI assistance, it is often difficult to quickly determine key risk factors when multiple risk factors are present, and it is also challenging to formulate an individualized management plan for knee osteoarthritis based on the patient\u0026rsquo;s information within a short period.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eTo address these limitations, we have explored several solutions. Our early work developed structured prompting techniques to enhance LLM performance for osteoarthritis queries\u003csup\u003e12\u003c/sup\u003e. An osteoarthritis agent (DocOA) used LLMs with Retrieval-Augmented Generation (RAG) to access guideline knowledge and generate compliant treatment recommendations\u003csup\u003e13\u003c/sup\u003e. After that, a multi-agent system was developed for complex clinical cases, where agents simulate different specialists in multidisciplinary team discussions\u003csup\u003e14\u003c/sup\u003e. These incremental developments provided preliminary insights but demonstrated limited gains in overall KOA care quality and efficiency, still failing to alleviate the clinical burden of KOA management. However, these findings motivated the production of the KOM system,\u0026nbsp;a multi-agent AI system designed to support multiple components of KOA management, including patient interaction and assessment, risk prediction, and individualized treatment planning. KOM is implemented using a modular and extensible architecture designed to support future adaptation to other chronic diseases requiring complex, longitudinal management. The system consists of three specialized agents integrating LLMs, ResNet architecture, and other machine learning algorithms. Specifically, it comprises:\u003c/p\u003e\n\u003cp\u003e1. An Assessment Agent capable of interacting with patients, processing multimodal data, analyzing radiological images, and generating structured evaluation reports.\u003c/p\u003e\n\u003cp\u003e2. 
A Risk Agent designed to extract patient-specific progression risk factors, predict individualized KOA progression, and generate risk reports.\u003c/p\u003e\n\u003cp\u003e3. A Therapy Agents Group consisting of a set of domain-specific agents that maintain evidence-based medical knowledge and collectively simulate multidisciplinary team (MDT) discussions to generate personalized management plans.\u003c/p\u003e\n\u003cp\u003eFollowing the clinical workflow, KOM completes the process from data collection to management planning. Specifically, the Assessment Agent collects and evaluates a patient\u0026rsquo;s demographic information, present and past medical history, personal history, imaging data, psychological state, nutritional status, physical activity level, socioeconomic condition, and treatment preferences, and automatically generates a structured electronic medical report. Subsequently, the Risk Agent extracts relevant clinical indicators to predict short- and mid-term structural and symptomatic progression of KOA, and identifies quantified risk factors that enable data-driven risk stratification and disease monitoring. Based on the medical and risk reports from the previous two agents, the Therapy Agents Group further simulates an MDT decision-making process to generate intervention plans. Each agent was evaluated against general-purpose language models and algorithmic baselines to assess performance.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eFurthermore, we conducted a three-arm comparative study to assess the system\u0026apos;s clinical utility in controlled simulated environments. 
The comparative study included three groups: an independent KOM deployment (KOM alone), an integrated KOM-clinician collaboration (KOM plus clinicians), and traditional clinician practice (clinicians alone).\u003c/p\u003e\n\u003cp\u003e\u0026nbsp;This study describes the development, validation, and clinical evaluation of KOM and examines its potential role in enhancing the quality and efficiency of KOA management.\u003c/p\u003e"},{"header":"Results","content":"\u003cp\u003e\u003cstrong\u003eDevelopment of the Knee Osteoarthritis Management (KOM) System.\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eKOM is an interactive multi-agent system that supports clinicians in three areas for KOA management: patient information collection and assessment, disease trajectory prediction and etiology analysis, and individualized treatment planning (Figure 1).\u003c/p\u003e\n\u003cp\u003eThe KOM architecture features three specialized agents:\u003c/p\u003e\n\u003cp\u003e1.Assessment Agent: Features multi-round patient interaction capabilities, image analysis functionality, and summary report generation. It collects and evaluates a patient\u0026rsquo;s demographic information, present and past medical history, personal history, imaging data, psychological state, nutritional status, physical activity level, socioeconomic condition, and treatment preferences. It analyzes knee radiographs to classify the severity of KOA and identify specific features, including osteophyte formation and alterations in joint space. Finally, the Assessment Agent generates a structured evaluation report (Figure 2a).\u003c/p\u003e\n\u003cp\u003e2.Risk Agent: Capable of predicting knee osteoarthritis progression and identifying individual risk factors that contribute to disease advancement. It extracts parameters from the evaluation report to forecast KOOS (Knee Injury and Osteoarthritis Outcome Score) subscale scores and KL (Kellgren\u0026ndash;Lawrence) radiographic grading over the following four years. 
Finally, the Risk Agent generates a structured risk report (Figure 3a).\u003c/p\u003e\n\u003cp\u003e3.Therapy Agents Group: Composed of specialist agents from different medical disciplines, each with specialized knowledge bases. These agents simulate clinical physicians in multidisciplinary team discussions and develop multidisciplinary treatment plans based on evaluation report and risk report (Figure 4a).\u003c/p\u003e\n\u003cp\u003eThe workflow of KOM is illustrated in Figure 1b. The system offers two interaction modalities: sequential progression through the complete pathway or independent access to individual agents with manual data input. This flexibility accommodates diverse clinical workflows across healthcare settings.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAssessment Agent\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe Assessment Agent collects patient information through structured dialogues and performs radiographic image analysis to generate evaluation reports. Its workflow is organized into three stages: information collection, treatment willingness confirmation, and radiographic analysis (Figure 2a). The agent integrates two key components: a patient\u0026ndash;agent interaction module and an X-ray analysis module. The radiographic analysis output is seamlessly incorporated into the patient history, enabling a unified and structured clinical record.\u003c/p\u003e\n\u003cp\u003eFor the Patient-agent interaction module, this study implemented Qwen-Max as the foundation model. It developed a structured prompt through systematic engineering to facilitate multi-turn conversations between the agent and patients. This approach enabled the collection of information and the generation of summary reports. The model was configured with a temperature parameter of 0.8 and accessed via the application programming interface (API). 
Three human physicians evaluated 100 simulated patient-agent interactions, in which physicians acted as patients interacting with the assessment agent. The human physicians evaluated the interaction process and the final summary report, rating them on four metrics on a scale of 1 to 5, where 1 indicated no compliance and 5 indicated complete compliance. The results are as follows (Figure 2g): field completeness (4.03 \u0026plusmn; 0.21), logical consistency (3.99 \u0026plusmn; 0.16), medical accuracy (4.37 \u0026plusmn; 0.47), and readability (4.00 \u0026plusmn; 0.23).\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eThe X-ray analysis module was developed using the Osteoarthritis Initiative (OAI) dataset, which contains longitudinal bilateral knee radiographs from 4796 cases at their baseline evaluation, 2-year follow-up evaluation, and 4-year follow-up evaluation, resulting in a total of 12,719 bilateral anterior-posterior knee X-rays. The X-ray analysis module performs KOA severity grading, detects bilateral osteophytes, and assesses bilateral joint space. The workflow consists of initial knee center localization to determine the region of interest (ROI), followed by identification of both the left and right knee joints, with subsequent osteophyte detection and joint space analysis for each joint. To implement these functionalities, we developed a series of algorithms trained on a randomly selected subset of data from the OAI dataset, which human experts had calibrated before model training. The knee center localization algorithm utilized a U-Net architecture trained on 200 labeled images. After 40 training epochs, the validation metrics improved: the loss decreased from 0.7576 to 0.3804, the IoU increased from 0.0016 to 0.5790, and the center point error was reduced from 146.02 to 4.83 pixels (Figure 2d). For severity classification, a ResNet model was trained on balanced class distributions. 
Using an 8:1:1 data partition with 5-fold cross-validation, the model achieved an overall accuracy of 80.8% (Figure 2e). The None/Doubt class showed the highest accuracy (90.7%), with confusion primarily between Moderate and Mild grades. We also developed ten specialized models for extracting radiographic features from distinct anatomical regions, including the medial and lateral joint spaces, as well as the medial and lateral aspects of both femoral and tibial surfaces. The lateral joint space narrowing classification model achieved an accuracy of 89.8%, surpassing the medial joint space narrowing model (77.1%). Classification accuracy for subchondral sclerosis varied by region: medial tibial plateau (56.4%), lateral tibial plateau (83.1%), medial femoral condyle (60.6%), and lateral femoral condyle (85.3%). Osteophyte detection models maintained relatively consistent performance across all anatomical quadrants, with accuracy ranging from 78.5% to 95.5%. Gradient-weighted Class Activation Mapping (Grad-CAM) analysis enhanced model interpretability for classification tasks (Figure 2f). The relevant model design and training processes are detailed in the Supplementary Graphs S2.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eComparative Evaluation of Radiographic Performance\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe Assessment Agent was benchmarked against five leading vision-language models (Google Gemini 2.0 Pro, GPT-4o, Claude 3.7, QwenMax VL, and LLaMA 3.2 90B Vision Instruct) using 500 bilateral knee radiographs from the OAI dataset, which were excluded from the training set. The evaluation assessed KOA severity grading and the detection of OA presence. For KOA severity grading (Figure 2b, c, h), the Assessment Agent achieved 77.16% accuracy, outperforming Gemini 2.0 Pro (34.50%). 
In OA presence detection, the Assessment Agent attained an accuracy of 82.22% compared to Gemini\u0026apos;s 76.66%. The Assessment Agent maintained performance across anatomical locations with 75.38% and 78.95% accuracy on left and right KOA severity grading, respectively, and 84.09% and 80.35% accuracy for left and right knee OA detection. All competing models showed lower accuracy below 65% on these tasks.\u003c/p\u003e\n\u003cp\u003eMoreover, the Assessment Agent demonstrated diagnostic capability across different levels of disease severity (64.68%-82.16% accuracy across classifications). In contrast, competing models often achieved high accuracy for None/Doubt cases but performed poorly (\u0026lt;20%) on Mild or higher grades (Figure 2c), which may constrain their applicability in similar clinical tasks. Detailed metrics for this task are available in the Supplementary Graphs S2.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eRisk Agent\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe Risk Agent predicts the functional outcome and radiographic outcome of KOA at 1 year and 4 years of follow-up. It also identifies patient-specific risk factors.\u003c/p\u003e\n\u003cp\u003eAt the 1-year follow-up (V01), all six machine learning models demonstrated predictive ability across KOOS subscores (Figure 3d). The strongest results were obtained for right knee symptoms and quality of life, with correlation coefficients approaching 0.74 and explained variance (R\u0026sup2;) values exceeding 0.50 in the best-performing models. ElasticNet provided relatively stable performance across multiple KOOS subscores, achieving R\u0026sup2; values up to 0.58 with relatively low mean absolute errors. Random Forest and Gradient Boosting also performed well, particularly in predicting pain and function-related scores. In contrast, SVR and LightGBM showed less consistent results. At the 4-year follow-up (V06), predictive accuracy declined across all KOOS subscores. 
The best-performing models reached correlation values of 0.65\u0026ndash;0.69 with R\u0026sup2; values between 0.30 and 0.46, notably lower than at V01. Random Forest, Gradient Boosting, and ElasticNet maintained relatively stable performance across tasks, while SVR and LightGBM again produced weaker predictions. Despite the decline, the prediction of quality-of-life and pain subscores retained moderate correlation values, whereas sports and recreation scores showed the lowest stability.\u003c/p\u003e\n\u003cp\u003eFor the KL grade classification tasks (Figure 3b, c), eight algorithms were evaluated. At V01, ensemble-based models achieved the highest predictive performance. For the left knee, AdaBoost reached the best overall results (accuracy = 0.910, F1 = 0.908, AUC = 0.965). Random Forest (accuracy = 0.902, AUC = 0.971) and XGBoost (accuracy = 0.897, AUC = 0.965) also performed strongly. For the right knee, LightGBM (accuracy = 0.910, AUC = 0.962) and XGBoost (accuracy = 0.908, AUC = 0.967) achieved the highest classification accuracy, while Random Forest remained highly competitive (accuracy = 0.900, AUC = 0.972). Across both knees, all ensemble approaches produced AUC values above 0.96, indicating robust discriminative capacity. At V06, predictive performance declined compared with V01. For the left knee, Gradient Boosting, XGBoost, and LightGBM produced the most consistent results, with accuracies ranging from 0.75 to 0.76 and AUC values close to 0.92. For the right knee, XGBoost yielded the best balance of metrics (accuracy = 0.765, AUC = 0.922), while Random Forest also performed well (accuracy = 0.762, AUC = 0.932). Other algorithms, including SVM, neural networks, and KNN, exhibited weaker performance at both time points.\u003c/p\u003e\n\u003cp\u003eFollowing functional outcome prediction, individualized risk factors were identified using SHAP analysis, which enhanced interpretability by quantifying the contributions of each feature to the prediction. 
In a representative case, the predicted KOOS symptom score (72.18) fell below the cohort mean (75.50), with primary negative contributors including osteophytes in baseline left knee X-ray (-1.08), diminished KOOS pain score (-1.02), suggesting osteophytes and pain are the individualized progression risks for this patient. This patient also shows below-average peak knee extension torque; this parameter contributed positively (+1.19), suggesting residual muscle strength may protect against symptomatic progression. A detailed description of the model design, training procedures, and experimental results is provided in Supplementary Graphs S3 as well as Supplementary Tables T1 and T2.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eTherapy Agents Group\u003c/strong\u003e\u003cstrong\u003e\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eA multi-agent cluster was developed to facilitate multi-agent conversations, mimicking the Multi-Disciplinary Team (MDT) approach adopted in clinical practice, for formulating patient-specific, multidisciplinary management plans. The cluster of agents receives the evaluation report and risk report generated in previous stages, engages in discussion, and generates the final personalized management plan. The cluster comprised multiple agents functioning as specialists, including an Exercise Rehabilitation agent, an Orthopedic agent, a Psycho-Nutrition agent, and a Clinical Decision agent.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eThe Clinical Decision Agent functions as the coordinator and integrator within the cluster. It synthesizes the recommendations provided by the other specialist agents, resolves conflicts where their suggestions diverge, and applies evidence-based clinical guidelines to ensure the final plan is coherent, feasible, and clinically appropriate. 
In addition, the Clinical Decision Agent prioritizes interventions based on patient risk factors, comorbidities, and treatment preferences, thereby generating a plan that aligns with real-world clinical decision-making standards. Each agent was furnished with domain-specific medical data. Qwen-Max was utilized as the base model for all agents, with a temperature parameter set at 0.8. The model was accessed via API. Prompt engineering was conducted to instruct each agent to act as a clinical specialist, engage in active discussion with other agents, and develop a personalized management plan for the given KOA patient. Each agent is equipped with a retrieval-augmented generation tool that draws on medical data from its respective knowledge base to generate and revise the management plan. We curated six specialized medical databases, each derived from authoritative clinical guidelines and peer-reviewed articles indexed in the Medline database (Figure 4a). All databases were constructed through a structured pipeline of literature search, eligibility screening, data extraction, and knowledge structuring. 
Each agent within the multi-agent cluster is paired with its corresponding evidence database to generate patient-specific recommendations:\u003c/p\u003e\n\u003cp\u003eKOM Agent 1 \u0026ndash; Nutrition and Psychology: linked to the psychological database (210 entries) and nutrition database (349 entries), enabling the generation of individualized psychological counseling and nutritional prescriptions.\u003c/p\u003e\n\u003cp\u003eKOM Agent 2 \u0026ndash; Medication and Surgery: connected to the surgical evidence database (1,549 entries) to determine surgical indications and medication strategies based on established osteoarthritis guidelines.\u003c/p\u003e\n\u003cp\u003eKOM Agent 3 \u0026ndash; Exercise Prescription: supported by the rehabilitation database (934 entries) and Exercise database (975 entries), which provides evidence-based exercise regimens tailored to the patient\u0026rsquo;s KOA severity and physical capacity.\u003c/p\u003e\n\u003cp\u003eKOM Agent 4 \u0026ndash; Clinical Decision and Summary: serves as the coordinator and synthesizer, using a shared guideline database to integrate the recommendations of all other agents, resolve conflicts, and formulate a coherent, evidence-based management plan.\u003c/p\u003e\n\u003cp\u003eAdditionally, a guideline database is accessible to all agents, ensuring that each recommendation aligns with up-to-date clinical practice standards.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eWe evaluated the quality of generated treatment plans using retrospective clinical data from 250 patients with knee osteoarthritis treated at West China Hospital. For each case, we conducted expert evaluations of the generated prescriptions and calculated their similarity scores against gold-standard prescriptions to assess system performance. And we benchmarked KOM against leading general-purpose AI models, including GPT-4o, GPT-4o-mini, DeepSeekR1, Claude 3.7 Sonnet, QwenMax, Qwen2.5-14B, and Gemini 2.0 Pro (Figure 4a). 
\u0026nbsp;Additionally, we benchmarked a single-agent RAG implementation against our agents\u0026rsquo; group with domain-specific databases and collaborative decision-making.\u003c/p\u003e\n\u003cp\u003eFor lexical and semantic similarity analysis, we employed three established metrics. KOM achieved the highest BLEU score (0.0191), outperforming GPT-4o (0.0064) and Qwen2.5-14B (0.0083). Similarly, KOM led in ROUGE-L metrics with 0.2905, higher than QwenMax (0.1244) and DeepSeekR1 (0.1031). BERT evaluations showed narrower differences, with KOM (0.8069) performing comparably to GPT-4o-mini (0.8156) and GPT-4o (0.8122), demonstrating competitive semantic consistency across models (Figure 4g).\u003c/p\u003e\n\u003cp\u003eThe expert evaluation involved three specialists in orthopedics and sports medicine who independently rated prescriptions across seven dimensions on a 1\u0026ndash;5 scale. KOM achieved the highest composite score (29.63 \u0026plusmn; 1.33), outperforming the next-best model, DeepSeekR1 (26.03), by 3.60 points (Figure 4b, c). KOM received higher mean ratings across all evaluated dimensions, including completeness (4.408), personalization (4.380), and safety (4.366). Although nutritional guidance represented the lowest-scoring dimension across all models, KOM maintained its leading position with a score of 3.903 (Figure 4d, e).\u003c/p\u003e\n\u003cp\u003eZ-score normalization (Figure 4f) highlighted KOM\u0026rsquo;s relative strengths in completeness (+0.67), personalization (+0.73), and safety (+0.66), with moderate performance in evidence-based practice (+0.48) and feasibility (+0.06). Other models demonstrated specific advantages in individual domains: G4M in evidence-based practice (+0.99), GPT-4o-mini in completeness (+0.76) and safety (+0.87), QwenMax in safety (+1.13), DeepSeekR1 in personalization (+1.10), and Gemini 2.0 Pro in completeness (+1.10). 
Across all models, exercise design and nutritional advice consistently emerged as weaker areas of focus. The relevant model design, training processes, and experimental results are presented in detail in Figure 4 and the Supplementary Graphs S4.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eClinical Evaluation of the KOM System\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eTo evaluate the effectiveness of the KOM system in clinical practice, we conducted an end-to-end simulation using 50 cases of KOA from West China Hospital (Figure 5a). The clinical evaluation included three operating conditions: physicians performing the assessment and treatment planning process alone, the KOM system functioning autonomously and performing the process, and physicians collaborating with the KOM system, where the system performs the X-ray evaluation and treatment planning, and the physicians can supervise and modify the reports generated by the KOM system at each stage of the process (Figure 5b). The quality of X-ray assessment, the quality of the management plan, and the total processing time were evaluated.\u003c/p\u003e\n\u003cp\u003eThe approval rate of radiographic grading was defined as the proportion of knee images correctly classified according to severity grading within the final cohort of 50 cases. Expert evaluation of KOA grading results showed that image classifications generated independently by ten physicians achieved approval rates ranging from 42.0% to 66.0%, with a mean approval rate of 56.0%. When assisted by the KOM system, the approval rates increased, ranging from 90.0% to 96.0%, yielding a mean approval rate of 93.0%. Under the fully automated KOM-only condition, approval rates ranged from 72.0% to 82.0%, corresponding to a mean approval rate of 77.4% (Figure 5e). 
Expert evaluation was conducted on the same 50 prescriptions, with each prescription assessed across seven clinical criteria: clinical evidence, completeness, exercise prescription standardization, nutritional prescription standardization, safety, personalization, and accessibility. The aggregate average score was 3.63 for MS (Physicians), 4.56 for KOM, and 4.43 for the collaboration group (Figure 5d). Regarding content completeness, the MS+KOM group achieved a score of 4.73, compared with 4.63 for KOM and 4.01 for MS. For personalization, both the collaboration group and KOM group demonstrated comparable performance, with scores exceeding 4.80. Similarly, for safety, both groups maintained scores above 4.80. In exercise prescription quality, the scores were 3.13 (MS), 4.11 (KOM), and 4.10 (MS+KOM), highlighting the benefit of AI augmentation. For nutritional guidance, the scores were 3.30 (MS), 3.93 (KOM), and 3.97 (MS+KOM). In accessibility and feasibility, the collaboration group scored 4.10, lower than KOM\u0026rsquo;s 4.59. In adherence to evidence-based practice, the collaboration group achieved 4.41, while KOM scored 4.63.\u003c/p\u003e\n\u003cp\u003eQuantitative prescription similarity metrics demonstrated that BLEU scores were 0.0065 (MS), 0.0455 (KOM), and 0.0500 (collaboration group). ROUGE-L scores were 0.1126 (MS), 0.2590 (KOM), and 0.2340 (collaboration group). BERTs were 0.8021 (MS), 0.7996 (KOM), and 0.8116 (collaboration group), indicating superior semantic alignment with reference standards in the human-AI collaborative condition (Figure 5f).\u003c/p\u003e\n\u003cp\u003eFor the complete clinical workflow, the MS group required 586 \u0026plusmn; 56 seconds per case. 
In contrast, the collaboration group completed identical tasks in 361 \u0026plusmn; 42 seconds (Figure 5c), demonstrating a 38.5% reduction in processing time.\u0026nbsp;\u003c/p\u003e"},{"header":"Discussion","content":"\u003cdiv id=\"Sec9\" class=\"Section2\"\u003e\u003ch2\u003eMain Finding\u003c/h2\u003e\u003cp\u003eThis study introduces the KOM system, the first evaluated multi-agent system for KOA. The Assessment Agent acquires patient-related information and treatment goals through interactive dialogue and analyzes knee radiographs to classify KOA severity in accordance with established clinical criteria. The Risk Agent forecasts functional and radiographic outcomes at one- and four-year follow-up while identifying patient-specific risk factors to inform intervention planning. The Therapy Agents Group, designed to simulate multidisciplinary team discussions, generates evidence-based, personalized management plans by integrating domain-specific knowledge from rehabilitation, exercise, surgical, psychological, and nutritional specialties. Evaluation results indicate that the Assessment Agent demonstrates enhanced performance compared to general-purpose AI models across assessment parameters. The Risk Agent accurately predicted functional and radiographic outcomes at 1-year and 4-year follow-ups. The Therapy Agents Group developed evidence-based, individualized management plans that demonstrated higher quality than those generated by current language models across multiple evaluation metrics.\u003c/p\u003e\u003cp\u003eIn a clinical evaluation study comprising three groups (physicians alone, KOM alone, and doctoral trainee-KOM collaboration), the collaboration group demonstrated significant advantages. Expert reviewers approved 93.0% of treatment plans from the doctoral trainee-KOM collaboration compared to 53.8% for physicians alone and 77.4% for KOM alone. 
Quality assessment across seven clinical criteria revealed superior performance in the collaborative condition, particularly in content completeness, personalization, and safety considerations. Notably, the doctoral trainee-KOM collaboration resulted in a 38.5% reduction in processing time compared to physicians working independently.\u003c/p\u003e\u003cp\u003eThese findings suggest that KOM may serve as a useful clinical decision support tool that can enhance both the quality and efficiency of KOA management while providing a methodological framework for developing similar systems for other chronic degenerative disorders.\u003c/p\u003e\u003c/div\u003e\n\u003ch3\u003eAssessment Agent via LLM-DL Hybrid Architecture\u003c/h3\u003e\n\u003cp\u003eThe Assessment Agent of the KOM system employs a hybrid architecture that integrates deep learning-based image interpretation with prompt-optimized LLM. The system utilizes a ResNet-based convolutional neural network trained on more than 12,000 standardized bilateral anteroposterior knee radiographs from the OAI database. This enables classification of KOA severity, alongside accurate assessments of medial and lateral joint space narrowing, osteophyte presence, and subchondral bone sclerosis.\u003c/p\u003e\u003cp\u003eWhile general-purpose vision-language models (VLMs) such as GPT-4V demonstrate zero-shot generalization capabilities in open-domain tasks, our findings reveal significant limitations when applied to specialized clinical imaging interpretation\u003csup\u003e\u003cspan additionalcitationids=\"CR16 CR17\" citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e\u003c/sup\u003e. In our evaluation, these models demonstrated inadequate accuracy and consistency in grading KOA severity under zero-shot conditions. 
Traditional machine learning techniques, including gradient-boosted trees and convolutional neural networks trained on curated OA datasets, have repeatedly shown performance comparable to expert readers in radiographic grading\u003csup\u003e\u003cspan additionalcitationids=\"CR20\" citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e\u003c/sup\u003e. For patient information collection, LLMs have demonstrated effectiveness in generating structured clinical narratives and supporting interactive clinical workflows\u003csup\u003e\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e\u003c/sup\u003e. Recognizing these strengths, we developed a hybrid LLM-DL framework that produces a flexible, interpretable, and clinically applicable workflow for case generation. This approach aligns with recent developments in hybrid architectures, such as DeepDR-LLM, which combined image-based transformers with LLMs fine-tuned on 370,000 real-world diabetes management records to generate personalized recommendations\u003csup\u003e\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e\u003c/sup\u003e. These findings suggest that LLM-DL architectures provide an adaptable and controllable solution for structured documentation in medical domains with defined task parameters and standardized inputs.\u003c/p\u003e\u003cdiv id=\"Sec11\" class=\"Section2\"\u003e\u003ch2\u003eRisk Agent with Etiological Analysis\u003c/h2\u003e\u003cp\u003eThe Risk Agent in KOM forecasts both symptomatic and structural trajectories of KOA using supervised machine learning algorithms. This component models the temporal changes in KOOS subdomains and KL grades at 1-year and 4-year intervals. 
The model was trained on 31 multimodal features, including demographics, baseline KOOS scores, and radiographic measurements such as joint space width, osteophyte presence, and sclerosis grades. By generating personalized risk projections over clinically meaningful timeframes, this approach may help address an existing gap in KOA management, which is especially valuable given the heterogeneous progression of the disease. The model captures the known discordance between subjective symptoms and objective structural changes in KOA\u003csup\u003e\u003cspan additionalcitationids=\"CR20\" citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e\u003c/sup\u003e. Jointly modeling function and structure enables improved risk stratification, facilitating the development of adaptive therapeutic strategies. The system demonstrates generalization across different time horizons, suggesting a degree of temporal stability in the evaluated tasks.\u003c/p\u003e\u003cp\u003eWhile recent research explores molecular biomarkers, genomics, and metabolomics for predicting KOA progression\u003csup\u003e\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e\u003c/sup\u003e, these approaches face significant implementation barriers in clinical settings. Such methods often require invasive procedures, costly assays, or surgical specimens\u003csup\u003e\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e\u003c/sup\u003e. 
Furthermore, their availability in routine clinical environments remains limited, and center-specific biases and demographic variations frequently compromise their generalizability\u003csup\u003e\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e,\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec12\" class=\"Section2\"\u003e\u003ch2\u003eTherapy Agents Group via Multi-Agent Collaboration\u003c/h2\u003e\u003cp\u003eThe third core component of the KOM system utilizes a multi-agent architecture that generates personalized intervention plans tailored to individual patient clinical presentations and etiologies. This framework integrates specialized agents focused on distinct therapeutic domains, including exercise prescription, pharmacological and surgical interventions, nutritional planning, and psychological support. These domain-specific agents operate independently while being coordinated by a central clinical agent that synthesizes their outputs into a cohesive, individualized treatment strategy.\u003c/p\u003e\u003cp\u003eThis multi-agent approach represents a departure from monolithic language models toward a modular design that enhances transparency, domain expertise, and decision quality. Comparative evaluations against leading language models demonstrated that KOM delivered superior performance across key clinical metrics, including recommendation accuracy, personalization, and actionability. The performance differential was most significant in complex cases involving multiple comorbidities or atypical presentations, where general-purpose models typically produced either overly generic or inconsistent recommendations. Our investigation of RAG with prompt engineering for incorporating external medical literature yielded limited performance improvements. 
In several evaluation categories, the RAG-enhanced model underperformed compared to its base language model, particularly in terms of clinical adaptability and semantic coherence. This limitation likely results from contextual inconsistencies caused by semantic drift in retrieved documents, a recognized challenge in current RAG implementations for specialized clinical reasoning tasks\u003csup\u003e\u003cspan additionalcitationids=\"CR27\" citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e\u003c/sup\u003e. In contrast, the KOM multi-agent framework demonstrated effective capabilities in problem decomposition, domain-specific reasoning, and iterative refinement. These strengths align with findings from other domains: decentralized multi-agent reinforcement learning has shown improved task execution and generalization in complex network systems\u003csup\u003e\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e\u003c/sup\u003e, and Meta AI's Cicero system achieved expert-level performance in strategic gameplay through coordinated decision-making among specialized agents\u003csup\u003e\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec13\" class=\"Section2\"\u003e\u003ch2\u003eEnhancing Clinical Process through AI-Clinician Collaboration\u003c/h2\u003e\u003cp\u003eBeyond its core capabilities, we evaluated the KOM system in collaborative scenarios with physicians to simulate real-world clinical workflows. Physicians used KOM for image interpretation and treatment planning assistance, with comparative analyses revealing that these AI-physician partnerships performed better within our evaluation settings than either component used independently across diagnostic accuracy, workflow efficiency, and treatment quality metrics. 
This collaborative approach exemplifies the human-AI symbiosis paradigm in healthcare, where AI systems function as intelligent assistants that enhance decision quality while reducing cognitive load.\u003c/p\u003e\u003cp\u003eRecent research strongly supports this collaborative model. Goh et al. \u003csup\u003e31\u003c/sup\u003e demonstrated that emergency physicians using an LLM achieved 15% higher diagnostic accuracy and greater decision consistency compared to unaided controls. In a multicenter study, clinicians using GPT-4 for patient case evaluation reported 20% faster processing times and a 12% improvement in review consistency. Similarly, Ayers et al. \u003csup\u003e32\u003c/sup\u003e demonstrated that a fine-tuned chatbot matched primary care physicians in terms of completeness and empathy when addressing common health questions. These findings underscore the practical value of AI-human collaboration, particularly in educational settings where structured guidance is crucial. By enabling physicians to explore complex reasoning with AI support, KOM functioned in this study as a decision-support tool that may help develop diagnostic reasoning and decision-making skills.\u003c/p\u003e\u003cp\u003eWe anticipate that AI-assisted clinical education will become fundamental to modern medical training. As language models improve in contextual understanding and interfaces become more intuitive, systems like KOM will evolve into intelligent learning partners, potentially transforming how clinical reasoning is taught and assessed\u003csup\u003e\u003cspan additionalcitationids=\"CR34\" citationid=\"CR33\" class=\"CitationRef\"\u003e33\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec14\" class=\"Section2\"\u003e\u003ch2\u003eLimitations\u003c/h2\u003e\u003cp\u003eDespite these strengths, several limitations exist. 
First, while the Assessment Agent performed well in retrospective testing, clinical implementation requires prospective validation with doctoral trainee oversight. Second, KOM focuses solely on knee osteoarthritis and may generate inappropriate outputs for other knee conditions; future versions should include a preliminary classifier to distinguish KOA from non-KOA pathologies. Third, the progression prediction model could benefit from additional factors beyond the current 31 features, which themselves present data collection challenges; wearable integration and feature optimization could address these issues. Finally, the intervention component lacks advanced mechanisms for resolving conflicting recommendations across different therapeutic domains, currently relying on basic prioritization strategies rather than sophisticated conflict resolution.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec15\" class=\"Section2\"\u003e\u003ch2\u003eImplication for Future Research\u003c/h2\u003e\u003cp\u003eIn future research, we aim to enhance our agents while maintaining their clinical utility and effectiveness. Although our current integration of deep learning with large language models effectively addresses clinical needs, we seek to develop more compact models with higher integration capacity that could potentially function on mobile devices for real-time assessment. Regarding diagnostic evaluation, we plan to implement embedded technologies for disease monitoring, which could reduce dependence on traditional radiographic examinations that require specialized equipment, thereby streamlining the assessment process. For the prediction component, we plan to conduct further parameter analysis to identify more readily obtainable and clinically relevant variables. Regarding treatment recommendations, we intend to validate the efficacy of AI-generated management plans compared to conventional approaches through controlled clinical studies. 
Building upon our KOA management framework, this methodological approach can be adapted to other chronic conditions that require comprehensive management strategies.\u003c/p\u003e\u003c/div\u003e"},{"header":"Conclusion","content":"\u003cp\u003eThis study presents KOM, a multi-agent artificial intelligence system for precision management of knee osteoarthritis. It successfully performs patient information collection, disease assessment, progression prediction, risk factor identification, and management plan generation. In our evaluation, it outperformed the general-purpose AI models tested on the same tasks. The three-arm comparative study demonstrated that doctoral trainee-KOM collaboration achieved superior performance while reducing processing time. This study establishes a methodological foundation for developing scalable, evidence-based management strategies that may be adapted to address other chronic disorders.\u003c/p\u003e"},{"header":"Methods","content":"\u003cp\u003e\u003cstrong\u003eEthical Considerations and Study Design\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eFor this study, we retrospectively recruited 300 patients with KOA from West China Hospital, Sichuan University. Among them, 250 cases with complete baseline and follow-up data were included for retrospective validation of the treatment planning module. In addition, a subset of 50 de-identified cases was selected to construct a simulated cohort for prospective evaluation under controlled experimental conditions. The Ethics Committee of West China Hospital approved the study protocol (approval number: 23-2277). 
Radiographic data for deep learning model development were obtained exclusively from the publicly accessible OAI database \u003csup\u003e36\u003c/sup\u003e, an NIH-funded longitudinal observational study that provides standardized bilateral knee radiographs with corresponding clinical and demographic data; the OAI dataset is publicly licensed for research purposes and did not require additional approval.\u003c/p\u003e\n\u003cp\u003eAll chatbot queries were conducted between January and April 2025 in Chengdu, Sichuan, China. For performance evaluation, three senior orthopedic and sports medicine experts independently assessed model outputs under a double-masked design in which evaluators were unaware of model identities; no patients or public participants were involved. In addition to accuracy and consistency, we screened model outputs for potentially harmful, misleading, or biased responses (e.g., unsafe medication recommendations, diagnostic errors, or demographic bias) and found none. Apart from the OAI data, which remains under its original license, all other datasets and code used in this study are owned by the research team; however, our example code and study prompts have been made publicly available to promote transparency and reproducibility. In the development of KOM, we initially attempted to use a single agent to perform all tasks in the KOA care pathway. It performed poorly, which led to the adoption of a multi-agent strategy in which each agent was specifically trained for a different task. Through iterative rounds of testing and development, the structure of the KOM system was finalized, which included the Assessment Agent, Risk Agent, and Therapy Agents Group.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAssessment Agent\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eFunctionality and Workflow\u003c/p\u003e\n\u003cp\u003eUpon accessing the KOM system, patients first indicate whether bilateral knee anteroposterior radiographs are available. 
When provided, the system automatically performs deep learning-based analysis of each knee, generating assessments of osteoarthritis severity, joint space narrowing, subchondral bone sclerosis, and osteophyte presence.\u003c/p\u003e\n\u003cp\u003eSubsequently, the intelligent conversational interface collects structured patient information, encompassing demographics, chief complaints, medical and family histories, current treatments, and lifestyle factors such as physical activity, occupational loading, and prior joint injuries. It also gathers data on metabolic and hormonal status, psychological and nutritional health, and treatment preferences. The interface identifies and prompts for missing information to ensure complete data collection. Patients can request clarification on medical terminology or data requirements in real-time. For unavailable clinical assessments such as the KOOS score, the Assessment Agent guides patients through the complete assessment protocol.\u003c/p\u003e\n\u003cp\u003eDevelopment of the image analysis module\u003c/p\u003e\n\u003cp\u003eThe imaging analysis models were trained on 12,719 bilateral knee radiographs obtained from the publicly available OAI database. The pipeline first performs joint localization using two independently trained U-Net models, which define the regions of interest for subsequent processing. Multi-task image analysis is then conducted using eleven task-specific deep neural networks based on the ResNet architecture, enabling simultaneous classification of KOA severity, joint space narrowing, subchondral sclerosis, and osteophyte presence.\u003c/p\u003e\n\u003cp\u003eKnee Joint Localization with an Enhanced UNet Pipeline\u003c/p\u003e\n\u003cp\u003eTo automatically identify the central regions of bilateral knees in radiographs, we implemented an optimized UNet\u003csup\u003e37\u003c/sup\u003e-based segmentation framework designed explicitly for high-contrast X-ray images. 
This localization process served as the foundation for all subsequent image analyses.\u003c/p\u003e\n\u003cp\u003eWe trained the model on 200 manually annotated anteroposterior radiographs, each paired with binary masks outlining knee centers. The architecture featured a five-layer UNet with skip connections, batch normalization, and ReLU activations. We initialized weights using Kaiming initialization to promote stable convergence. To address the significant foreground-background imbalance typical in knee radiographs, we employed an equally weighted hybrid loss function combining binary cross-entropy and Dice loss. During training, input images and masks underwent synchronized random flipping and resized cropping to maintain alignment. Performance evaluation utilized two metrics: bounding box intersection-over-union (IoU) and center point localization error. The Hungarian algorithm matched predicted and ground truth regions, ensuring unbiased metric computation. Training proceeded for 40 epochs using the Adam optimizer with cosine annealing learning rate scheduling, with model selection based on validation IoU performance.\u003c/p\u003e\n\u003cp\u003eRadiographic Classification of KOA Features Using Transfer-Learned ResNet50\u003c/p\u003e\n\u003cp\u003eTo facilitate multidimensional classification of KOA-related radiographic features, we implemented a ResNet\u003csup\u003e38\u003c/sup\u003e-50-based deep learning pipeline trained on localized anteroposterior (AP) knee radiographs. 
Following initial joint localization, each cropped image was processed through a modified ResNet-50 architecture, where the final classification layer was replaced with task-specific output heads.\u003c/p\u003e\n\u003cp\u003eClassification Targets and Dataset Construction\u003c/p\u003e\n\u003cp\u003eWe curated a series of supervised classification tasks covering 11 clinically relevant radiographic features: KOA severity; Joint space narrowing of the medial and lateral compartments; Subchondral bone sclerosis (medial femur, lateral femur, medial tibia, lateral tibia); Osteophyte presence (binary detection across the same four compartments). For each task, we constructed balanced datasets containing 500 images per severity level. The data was partitioned into training, validation, and testing sets using an 8:1:1 ratio to ensure a methodical approach to model development and evaluation.\u003c/p\u003e\n\u003cp\u003eKL Grading Adjustment and Label Consolidation\u003c/p\u003e\n\u003cp\u003eWhile the Kellgren-Lawrence (KL) grading system represents the established standard for radiographic KOA severity assessment\u003csup\u003e39\u003c/sup\u003e, model development revealed consistent difficulties in discriminating between KL grades 0 and 1.\u0026nbsp;Expert review, conducted by two musculoskeletal radiologists and a senior orthopedic surgeon, confirmed that the visual differences between these grades were subtle and had minimal implications for treatment\u003csup\u003e40,41\u003c/sup\u003e. Consequently, we consolidated KL 0 (normal) and KL 1 (doubtful) into a unified None/Doubt category. The remaining KL grades were mapped as follows: KL 2 \u0026rarr; Mild, KL 3 \u0026rarr; Moderate, KL 4 \u0026rarr; Severe. 
This restructuring produced a clinically meaningful and computationally effective four-class KOA severity framework:\u003c/p\u003e\n\u003cp\u003eNone/Doubt (KL 0\u0026ndash;1)\u003c/p\u003e\n\u003cp\u003eMild (KL 2)\u003c/p\u003e\n\u003cp\u003eModerate (KL 3)\u003c/p\u003e\n\u003cp\u003eSevere (KL 4)\u003c/p\u003e\n\u003cp\u003eTo maintain compatibility with clinical standards, multimodal models were initially trained on the original five-grade KL classification and subsequently converted to the four-grade schema during downstream evaluation.\u003c/p\u003e\n\u003cp\u003eAll radiographs underwent standardized bone window processing (center: 300 HU, width: 1500 HU) to enhance bony structure contrast and minimize irrelevant soft-tissue variation. Images were normalized to the [0, 1] range, resized to 256 \u0026times; 256 pixels, and randomly cropped to 224 \u0026times; 224 pixels during training. Data augmentation included horizontal flipping, small-angle rotations, and brightness/contrast adjustments, applied synchronously to maintain label consistency. Model training was performed using a five-fold cross-validation approach. For each fold, we optimized the network using the Adam optimizer (initial learning rate: 1e-5) with a stepwise decay schedule (\u0026gamma; = 0.1 every seven epochs). Cross-entropy loss was used as the objective function. An early stopping mechanism (patience: 10 epochs; minimum delta: 1e-6) preserved the best-performing model state based on validation loss.\u003c/p\u003e\n\u003cp\u003eGround truth annotations from the OAI dataset were subjected to multi-level expert validation to ensure labeling accuracy and inter-rater consensus. 
To mitigate overfitting and maximize generalization, only center-cropped images were used during validation and testing.\u003c/p\u003e\n\u003cp\u003eModel performance was evaluated using classification accuracy, confusion matrices, and area under the receiver operating characteristic curve (AUC-ROC) across all classes. These metrics were calculated per fold and averaged to obtain robust task-specific performance estimates. To enhance interpretability, Gradient-weighted Class Activation Mapping (Grad-CAM) was applied to visualize spatial attention maps on representative validation samples. These visualizations confirmed the appropriate model focus and identified anatomical regions that informed specific predictions. Training histories, including per-epoch loss, accuracy curves, confusion matrices, and ROC plots, were documented to ensure transparency and reproducibility.\u003c/p\u003e\n\u003cp\u003eDevelopment of KOA Patient Information Collection module\u003c/p\u003e\n\u003cp\u003eTo facilitate standardized and clinically interpretable documentation of KOA patient profiles, we developed an Assessment Agent based on an LLM optimized through domain-specific prompt engineering. The agent was designed to function as a clinical intake assistant, conducting dynamic natural language dialogue to compile clinical profiles for each patient.\u003c/p\u003e\n\u003cp\u003eSystem Design and Implementation\u003c/p\u003e\n\u003cp\u003eThe generation process began with a structured system prompt that established the documentation template and directed the model to gather information across 11 essential clinical domains: demographics, chief complaint and history of present illness, radiographic findings, past and family history, current treatment status, psychological well-being, nutritional condition, treatment goals and preferences, and available rehabilitation resources. 
To enhance clinical utility, the prompt incorporated mechanisms for real-time clarification of medical terminology. When simulated patients expressed uncertainty regarding specialized concepts (e.g., KOOS scoring or subchondral sclerosis), the LLM automatically provided context-appropriate explanations before continuing the assessment dialogue.\u003c/p\u003e\n\u003cp\u003eEvaluation Methodology\u003c/p\u003e\n\u003cp\u003eWe evaluated the Assessment Agent using 100 simulated KOA patient scenarios from West China Hospital. For each simulation, a trained research assistant assumed a predefined patient persona and engaged in an interactive dialogue with the LLM. Each session proceeded until the model determined that sufficient clinical information had been collected, at which point it autonomously concluded the interview process. All generated outputs were anonymized and randomized for subsequent expert review.\u003c/p\u003e\n\u003cp\u003eTo assess the clinical quality of the generated structured cases, we conducted a blinded expert evaluation. Three senior orthopedic and sports medicine physicians independently evaluated each case across four predetermined dimensions:\u003c/p\u003e\n\u003cp\u003eField completeness: Assessment of whether all expected clinical fields contained adequate information\u003c/p\u003e\n\u003cp\u003eLogical consistency: Evaluation of internal coherence and clinical plausibility of the narrative\u003c/p\u003e\n\u003cp\u003eMedical accuracy: Assessment of appropriate and correct application of clinical terminology and judgment\u003c/p\u003e\n\u003cp\u003eReadability: Evaluation of clarity, fluency, and professional expression\u003c/p\u003e\n\u003cp\u003eBefore formal assessment, all expert evaluators participated in a calibration session. This session included review and discussion of representative examples illustrating high-, medium-, and low-quality cases, followed by collaborative refinement of the scoring rubric. 
During the formal evaluation phase, each expert was blinded to both the origin and sequence of the cases, and no communication was permitted during individual assessments. Each dimension was rated on a 1-5 scale, with the final score for each case calculated by averaging ratings across the three evaluators.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eAgent Benchmarking and Comparative Evaluation\u003c/p\u003e\n\u003cp\u003eTo assess the diagnostic performance of the KOM system relative to current vision-language models, we conducted a benchmarking study comparing five leading multimodal large language models (MLLMs): Google Gemini 2.0 Pro, GPT-4o, Claude 3.7 Sonnet, QwenMax VL, and LLaMA 3.2 90B Vision Instruct. All models were evaluated under identical input and task conditions to ensure methodological consistency.\u003c/p\u003e\n\u003cp\u003eEvaluation Dataset and Protocol\u003c/p\u003e\n\u003cp\u003eThe evaluation dataset consisted of 500 bilateral knee radiographs (1,000 knees in total) obtained from the publicly available OAI cohort. All samples were excluded from any previous training or fine-tuning of the KOM system or its constituent components. For each case, a standardized input prompt was constructed to elicit three clinically relevant outputs:\u003c/p\u003e\n\u003cp\u003eBilateral KOA severity grading (based on the Kellgren-Lawrence scale)\u003c/p\u003e\n\u003cp\u003eOA presence detection for each knee (binary classification)\u003c/p\u003e\n\u003cp\u003eLeft knee localization (spatial discrimination to assess model orientation awareness)\u003c/p\u003e\n\u003cp\u003eTo maintain consistency across model evaluations, all knee radiographs underwent preprocessing to ensure uniform viewing orientation (with the left knee positioned on the right side, facing the observer), and standardized diagnostic prompts were employed. 
Each model received identical image-text inputs and was evaluated based solely on its unmodified output without manual intervention.\u003c/p\u003e\n\u003cp\u003eEvaluation Metrics and Ground Truth\u003c/p\u003e\n\u003cp\u003eFor KOA severity grading, the reference standard was derived from the revised OAI dataset and mapped into a four-class severity framework: None/Doubt (KL 0-1), Mild (KL 2), Moderate (KL 3), and Severe (KL 4). The KOM system was explicitly designed to predict KOA severity grades that align directly with this classification schema. All other multimodal agents were first prompted to produce KL grades, which were then converted into corresponding severity categories using the same KL-to-severity mapping protocol.\u003c/p\u003e\n\u003cp\u003eFor the OA presence detection task, models employed different output approaches. The KOM system internally predicted a KL severity grade for each knee, from which OA presence was derived using a predefined threshold: cases classified as KL 0 or 1 were categorized as \u0026quot;No OA,\u0026quot; whereas KL \u0026ge;2 was designated as \u0026quot;OA present.\u0026quot; In contrast, multimodal LLMs were prompted to directly determine whether OA was present or absent for each knee, without generating an explicit KL grade.\u003c/p\u003e\n\u003cp\u003eTo enable valid comparison, ground truth OA labels were derived using consistent KL-based thresholding: KL 0-1 \u0026rarr; No OA, KL 2-4 \u0026rarr; OA present. 
Model predictions were binarized accordingly, and accuracy was calculated separately for left and right knees.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eRisk Agent\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eFunctionality and Workflow\u003c/p\u003e\n\u003cp\u003eUpon activation of the symptom and radiographic prediction agent, patients provide baseline data encompassing 31 clinical parameters, including body mass index (BMI), age, body weight, KOOS subscale scores, and bilateral knee muscle strength measurements. These parameters can be seamlessly populated from the previously completed structured clinical documentation Agent or directly entered by the patient.\u003c/p\u003e\n\u003cp\u003eSymptom Prediction Submodule\u003c/p\u003e\n\u003cp\u003eTo predict KOOS subscale outcomes at 1-year and 4-year follow-ups, regression models utilizing XGBoost\u003csup\u003e42\u003c/sup\u003e, LightGBM\u003csup\u003e43\u003c/sup\u003e, Random Forest\u003csup\u003e44\u003c/sup\u003e, Gradient Boosting\u003csup\u003e45\u003c/sup\u003e, Support Vector Regression (SVR\u003csup\u003e46\u003c/sup\u003e), and Elastic Net\u003csup\u003e47\u003c/sup\u003e algorithms were developed.\u003c/p\u003e\n\u003cp\u003eThe input dataset comprised 31 patient parameters that underwent preprocessing, including removal of incomplete cases, categorical encoding, and z-score standardization. Model performance was evaluated using five-fold cross-validation with multiple metrics: R\u0026sup2;, mean squared error (MSE), mean absolute error (MAE), and Pearson correlation coefficient (Pearson r). Feature importance analyses were conducted and visualized via bar plots, residual analyses, and scatter plots to provide insights into model predictions. 
Quantitative evaluation results are presented in the Supplementary Graphs S3.\u003c/p\u003e\n\u003cp\u003eRadiographic Prediction Submodule\u003c/p\u003e\n\u003cp\u003eFor radiographic outcome prediction, structured clinical data from the OAI dataset were utilized to forecast KL grades for both knees at 1-year and 4-year intervals, constituting four distinct prediction tasks. The dataset underwent stratified splitting (70% training, 30% validation) and class balancing to ensure uniform representation with 1,000 cases per KL grade. Eight machine learning algorithms were evaluated: XGBoost\u003csup\u003e42\u003c/sup\u003e, LightGBM\u003csup\u003e43\u003c/sup\u003e, Random Forest\u003csup\u003e44\u003c/sup\u003e, Gradient Boosting\u003csup\u003e45\u003c/sup\u003e, AdaBoost\u003csup\u003e48\u003c/sup\u003e, Support Vector Machine (SVM\u003csup\u003e49\u003c/sup\u003e), K-Nearest Neighbors (KNN\u003csup\u003e50\u003c/sup\u003e), and Multi-layer Perceptron (MLP). Model robustness was assessed using 100 iterations of Monte Carlo cross-validation, with performance quantified through accuracy (ACC), weighted precision, recall, F1-score, and macro-area under the receiver operating characteristic curve (macro-AUC). Confusion matrices and ROC curves illustrating misclassification patterns and model discriminative capabilities are provided in the Supplementary Graphs S3.\u003c/p\u003e\n\u003cp\u003eRisk Factor Analysis Submodule\u003c/p\u003e\n\u003cp\u003eTo produce individualized risk factor assessments, predictive models were enhanced with SHAP\u003csup\u003e51\u003c/sup\u003e analyses. These analyses produce interactive force plots that illustrate the relative contributions of key clinical parameters, such as BMI, body weight, age, and muscle strength, in predicting patient-specific osteoarthritis progression risk.\u003c/p\u003e\n\u003cp\u003eEach prediction task utilizes the single best-performing model, allowing comparative analyses of risk factor contributions. 
This flexible analytical framework supports both consensus-driven and divergent risk assessments, providing clinicians with insights to tailor interventions to individual patient profiles.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eTherapy Agents Group\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eFunctionality and Workflow\u003c/p\u003e\n\u003cp\u003eWhen entering the multidisciplinary Intervention Agent, patient profiles that integrate structured clinical data from the Assessment Agent and personalized predictive outcomes with risk factor analyses produced by the Risk Agent are used as foundational inputs. This Agent employs a collaborative multi-agent artificial intelligence architecture to generate tailored therapeutic recommendations across diverse clinical domains autonomously. The system incorporates specialized intelligent agents functioning as exercise prescriptionists, surgical and pharmacological interventionists, and nutritional and psychological specialists. Each agent independently develops structured, evidence-informed treatment recommendations within its domain of expertise. A clinical decision-making agent subsequently synthesizes these domain-specific interventions to produce a patient-specific management strategy. Throughout this process, the system prioritizes clinical applicability, scientific accuracy, patient safety, and individualized care. Detailed metrics for evaluating the clinical effectiveness of these interventions are described in subsequent sections.\u003c/p\u003e\n\u003cp\u003eKnowledge Base Construction\u003c/p\u003e\n\u003cp\u003eTo ensure evidence-based therapeutic recommendations, extensive domain-specific knowledge bases were developed, encompassing five key intervention categories: exercise rehabilitation, surgical techniques, rehabilitation interventions, nutrition, and psychological therapies. A literature search conducted in the PubMed database initially identified 33,641 articles and clinical guidelines. 
Through article evaluation, these were refined to a core repository of 4,017 high-quality, peer-reviewed publications and internationally recognized clinical guidelines. From each selected document, the results and recommendations sections were extracted and unrelated content was excluded to optimize relevance and retrieval accuracy. The annotated excerpts were organized into structured repositories optimized for RAG, thereby enhancing the accuracy and specificity of knowledge retrieval and agent-generated therapeutic recommendations. Complete details of literature selection criteria, evaluation methodologies, and knowledge base composition are provided in the Supplementary Materials.\u003c/p\u003e\n\u003cp\u003eDevelopment of Individual and Multi-Agent Architectures\u003c/p\u003e\n\u003cp\u003eThe multidisciplinary clinical recommendation system integrates four distinct intelligent agents, each using the Qwen-Max large language model as its foundational engine, further optimized through targeted RAG techniques and tailored prompt engineering.\u003c/p\u003e\n\u003cp\u003eExercise Prescriptionist Agent: Applies the FITT-VP framework (Frequency, Intensity, Time, Type, Volume, and Progression), dynamically customizing exercise regimens based on patient-specific clinical data and therapeutic objectives retrieved via RAG.\u003c/p\u003e\n\u003cp\u003eSurgical and Medication Specialist Agent: Matches patient profiles against surgical and pharmacological guideline databases, providing detailed and precise intervention proposals with appropriate dosing, timing, and procedural specifications.\u003c/p\u003e\n\u003cp\u003eNutritional and Psychological Specialist Agent: Integrates nutritional prescriptions strictly adhering to the ABCMV principles (Adequacy, Balance, Calorie control, Moderation, Variety), supplemented by tailored psychological management strategies responsive to individual patient needs.\u003c/p\u003e\n\u003cp\u003eClinical Decision-Making 
Agent: Serves as the integration hub where outputs from the specialized agents converge. This agent critically evaluates and synthesizes the recommendations, optimizing outcomes along dimensions of accuracy, comprehensiveness, personalization, patient safety, and domain-specific professional standards.\u003c/p\u003e\n\u003cp\u003eThrough this architecture, the final integrated prescription is validated and tailored to each patient\u0026apos;s unique clinical profile and therapeutic requirements.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eClinical Validation with Real-World Patient Data\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eWe retrospectively assembled a cohort of 250 KOA patients from West China Hospital, Sichuan University. This dataset captured detailed baseline demographics, clinical examinations, radiographic findings, etiological diagnoses, patient treatment preferences, and institutional resource availability. Clinical management strategies were classified into five categories:\u003c/p\u003e\n\u003cp\u003eConservative management (n = 73)\u003c/p\u003e\n\u003cp\u003eTotal knee arthroplasty (n = 62)\u003c/p\u003e\n\u003cp\u003eUnicompartmental knee arthroplasty (n = 43)\u003c/p\u003e\n\u003cp\u003eOsteotomy (n = 39)\u003c/p\u003e\n\u003cp\u003eArthroscopic surgery (n = 33)\u003c/p\u003e\n\u003cp\u003eEvaluation Protocol\u003c/p\u003e\n\u003cp\u003eFor each patient case, the KOM system, GPT-4o, Claude 3.7, and five additional vision-language models independently generated treatment recommendations. To eliminate bias, all model outputs were de-identified and randomized into a single pool for evaluation. 
Three senior orthopedic experts, each with over ten years of specialized clinical experience, conducted fully blinded evaluations of all recommendations using a unified seven-dimensional rubric.\u003c/p\u003e\n\u003cp\u003eEvidence-based practice\u003c/p\u003e\n\u003cp\u003eCompleteness\u003c/p\u003e\n\u003cp\u003eExercise prescription\u003c/p\u003e\n\u003cp\u003eNutrition prescription\u003c/p\u003e\n\u003cp\u003ePersonalization\u003c/p\u003e\n\u003cp\u003eAccessibility and feasibility\u003c/p\u003e\n\u003cp\u003eSafety\u003c/p\u003e\n\u003cp\u003eBefore formal scoring, the reviewers participated in a calibration workshop where they jointly reviewed five exemplar cases representing both exemplary and suboptimal recommendations. This process resolved scoring discrepancies and resulted in a detailed evaluation manual, ensuring harmonized threshold definitions and high inter-rater consistency. In parallel, standard clinical prescriptions were developed as benchmark references, with comparative linguistic similarity analyses (including BLEU, BERT, and ROUGE metrics) quantifying each model\u0026rsquo;s adherence to the gold-standard clinical treatment protocol.\u003c/p\u003e\n\u003cp\u003eProspective Evaluation Using a Simulated Patient Cohort\u003c/p\u003e\n\u003cp\u003eStudy Design and Participant Selection\u003c/p\u003e\n\u003cp\u003eWe extracted de-identified records for 50 KOA patients from the West China Hospital database. Baseline weight-bearing radiographs and accompanying clinical data underwent independent review and confirmation by two senior orthopedic surgeons; only cases approved by both reviewers advanced to the simulated evaluation. 
Twenty doctoral candidates without prior KOA-specific imaging or therapeutic training (age 25\u0026ndash;35 years, \u0026le;1 year clinical rotation) were randomized via a computerized draw application into two arms (n = 10 each):\u003c/p\u003e\n\u003cp\u003e\u0026quot;Physicians-only\u0026quot; group\u003c/p\u003e\n\u003cp\u003e\u0026quot;Physicians + KOM\u0026quot; group\u003c/p\u003e\n\u003cp\u003eRandomization was performed by an independent data manager using sealed electronic envelopes to ensure allocation concealment. Before case evaluation, all participants attended a single standardized training session that covered KOA radiographic interpretation and prescription formulation protocols.\u003c/p\u003e\n\u003cp\u003eClinical Materials and Assessment Methodology\u003c/p\u003e\n\u003cp\u003eCorresponding radiographic data for each simulated case were drawn from the OAI, ensuring robust clinical validity and standardized imaging support. Radiographic grading accuracy was calculated as the proportion of cases for which the predicted KOA severity exactly matched the adjudicated reference grade in our corrected OAI database. We recorded the total time required for radiographic interpretation and prescription development tasks across all groups, quantitatively assessing efficiency gains attributable to KOM system integration.\u003c/p\u003e\n\u003cp\u003eTreatment Plan Evaluation\u003c/p\u003e\n\u003cp\u003eTo evaluate clinical decision-making quality, we generated treatment plans for 50 simulated KOA patient cases across three cohorts:\u003c/p\u003e\n\u003cp\u003eKOM group (KOM runs three times per case)\u003c/p\u003e\n\u003cp\u003ephysicians\u0026rsquo; group (three different physicians per case)\u003c/p\u003e\n\u003cp\u003ecollaboration group (three physicians using KOM)\u003c/p\u003e\n\u003cp\u003eThis process resulted in a total of 450 de-identified plans. 
All plans were pooled, randomized, and stripped of origin labels to prevent evaluation bias.\u003c/p\u003e\n\u003cp\u003eTwo senior orthopedic specialists, each with over a decade of specialized experience, independently scored every plan using a harmonized seven-dimensional rubric:\u003c/p\u003e\n\u003cp\u003eEvidence-based practice\u003c/p\u003e\n\u003cp\u003eCompleteness\u003c/p\u003e\n\u003cp\u003eExercise prescription\u003c/p\u003e\n\u003cp\u003eNutrition prescription\u003c/p\u003e\n\u003cp\u003ePersonalization\u003c/p\u003e\n\u003cp\u003eAccessibility and feasibility\u003c/p\u003e\n\u003cp\u003eSafety\u003c/p\u003e\n\u003cp\u003eBefore formal review, the experts jointly examined five exemplar plans (representing both high and low quality), reconciled scoring discrepancies, and finalized a detailed evaluation manual to ensure consistent application of rating thresholds.\u003c/p\u003e\n\u003cp\u003eTo anchor assessments in best-practice care, a third senior specialist created gold-standard prescriptions for each case based on current clinical guidelines. Finally, we quantitatively compared each free-text plan against its benchmark using established textual similarity metrics (BLEU, BERT, and ROUGE), enabling an appraisal of each plan\u0026apos;s consistency with expert protocols.\u003c/p\u003e\n\u003cp\u003eProcedural steps, the full scoring rubric, and calibration details are provided in the Supplementary Methods.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eStatistical Analysis\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eTo standardize scores within each model across metrics, we applied row-wise z-score normalization to the model-by-metric mean matrix used for visualization (Fig. 4f). 
For model \u003cem\u003ei\u003c/em\u003e and metric \u003cem\u003ej\u003c/em\u003e, with mean score \u003cem\u003ex\u003c/em\u003e\u003csub\u003e\u003cem\u003eij\u003c/em\u003e\u003c/sub\u003e, we computed\u003c/p\u003e\n\u003cp\u003e\u003cem\u003ez\u003c/em\u003e\u003csub\u003e\u003cem\u003eij\u003c/em\u003e\u003c/sub\u003e = (\u003cem\u003ex\u003c/em\u003e\u003csub\u003e\u003cem\u003eij\u003c/em\u003e\u003c/sub\u003e \u0026minus; \u003cem\u003e\u0026mu;\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e) / \u003cem\u003e\u0026sigma;\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e,\u003c/p\u003e\n\u003cp\u003ewhere \u003cem\u003e\u0026mu;\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e and \u003cem\u003e\u0026sigma;\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e denote the mean and standard deviation of all metric scores for model \u003cem\u003ei\u003c/em\u003e, respectively. This normalization emphasizes the relative distribution of metrics within a model rather than between-model differences. The resulting z-score matrix was used for heatmap visualization and pattern analysis.\u003c/p\u003e\n\u003cp\u003eFor statistical analysis of between-group differences, diagnostic accuracy and task completion time were treated as continuous variables and assessed for normality using the Shapiro-Wilk test. 
For normally distributed data, independent-samples t-tests were used; for non-normally distributed data, the Mann-Whitney U test was used. Treatment quality scores, being ordinal variables, were consistently analyzed using Mann-Whitney U tests. Statistical significance for all between-group comparisons was set at p \u0026lt; 0.05.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eAcknowledgements\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFunding\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis work was funded by the Youth Research Fund of Sichuan Science and Technology Planning Department (grant number 23NSFSC4894, received by Xi Chen).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthor Contributions\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eW.L., X.C., and Z.J. are the main designers of the study. W.L. and X.C. are the main executors of the study. K.L. and J.L. contributed to the study by managing and supervising the revision work and providing critical feedback during the major revision process. H.Z., H.C., K.L., and W.L. served as consultants for computer science-related knowledge. H.C. and W.L. developed the code for this study and performed the model training. W.L., X.C., Z.J., L.Z., K.Z., R.T., L.W., and M.Y. evaluated the models\u0026apos; responses and prepared the test dataset. W.L., X.C., and Z.J. participated in drafting the manuscript. K.L. and J.L. provided overall guidance and supervision for the project. 
All authors have read and approved the final version of the manuscript.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCompeting interests\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors declare no competing interests.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eEthics declarations\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis study was approved by the Ethics Committee of West China Hospital, Sichuan University (Approval No. 23-2277). All procedures complied with the Declaration of Helsinki and relevant national regulations, including China\u0026rsquo;s Personal Information Protection Law.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCode availability\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe complete source code for this project is publicly accessible at https://github.com/jacobliuweizhi/KOM. A demonstration of the implementation is available through our interactive web interface at https://huggingface.co/spaces/Miemie123/Streamlit?page=Tailored+Therapy+Recommendation\u0026amp;start=1.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eData availability\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe code developed for this study is available at https://github.com/jacobliuweizhi/KOM under the GNU Affero General Public License v3.0. W.L., H.Z., H.C., and K.L. contributed to the code development and are responsible for maintaining the repository. Reference documents used in the RAG module are listed in the repository. Osteoarthritis-related imaging and clinical data used in this study are accessible through the OAI database (https://nda.nih.gov/oai), subject to data use agreements.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eSteinmetz JD et al (2023) Global, regional, and national burden of osteoarthritis, 1990\u0026ndash;2020 and projections to 2050: a systematic analysis for the Global Burden of Disease Study 2021. 
Lancet Rheumatol 5:e508\u0026ndash;e522\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eMartel-Pelletier J, Boileau C, Pelletier J-P, Roughley PJ (2008) Cartilage in normal and osteoarthritis conditions. Best Pract Res Clin Rheumatol 22:351\u0026ndash;384\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eKloppenburg M, Namane M, Cicuttini F (2025) Osteoarthritis. Lancet 405:71\u0026ndash;85. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.1016/S0140-6736(24)02322-5\u003c/span\u003e\u003cspan address=\"https://doi.org:10.1016/S0140-6736(24)02322-5\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eBliddal H et al (2024) Once-Weekly Semaglutide in Persons with Obesity and Knee Osteoarthritis. N Engl J Med 391:1573\u0026ndash;1583. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.1056/NEJMoa2403664\u003c/span\u003e\u003cspan address=\"https://doi.org:10.1056/NEJMoa2403664\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eThomas KA et al (2020) Automated Classification of Radiographic Knee Osteoarthritis Severity Using Deep Neural Networks. Radiology: Artif Intell 2:e190065. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.1148/ryai.2020190065\u003c/span\u003e\u003cspan address=\"https://doi.org:10.1148/ryai.2020190065\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLeung K et al (2020) Prediction of Total Knee Replacement and Diagnosis of Osteoarthritis by Using Deep Learning on Knee Radiographs: Data from the Osteoarthritis Initiative. Radiology 296:584\u0026ndash;593. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.1148/radiol.2020192091\u003c/span\u003e\u003cspan address=\"https://doi.org:10.1148/radiol.2020192091\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eNorman B, Pedoia V, Majumdar S (2018) Use of 2D U-Net Convolutional Neural Networks for Automated Cartilage and Meniscus Segmentation of Knee MR Imaging Data to Determine Relaxometry and Morphometry. Radiology 288:177\u0026ndash;185. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.1148/radiol.2018172322\u003c/span\u003e\u003cspan address=\"https://doi.org:10.1148/radiol.2018172322\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eTiulpin A et al (2019) Multimodal Machine Learning-based Knee Osteoarthritis Progression Prediction from Plain Radiographs and Clinical Data. Sci Rep 9:20038. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.1038/s41598-019-56527-3\u003c/span\u003e\u003cspan address=\"https://doi.org:10.1038/s41598-019-56527-3\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eCastagno S, Birch M, van der Schaar M, McCaskie A (2024) Predicting rapid progression in knee osteoarthritis: a novel and interpretable automated machine learning approach, with specific focus on young patients and early disease. 
\u003cem\u003eAnnals of the Rheumatic Diseases\u003c/em\u003e, ard-2024-225872 \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.1136/ard-2024-225872\u003c/span\u003e\u003cspan address=\"https://doi.org:10.1136/ard-2024-225872\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eNielsen RL et al (2024) Data-driven identification of predictive risk biomarkers for subgroups of osteoarthritis using interpretable machine learning. Nat Commun 15:2817. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.1038/s41467-024-46663-4\u003c/span\u003e\u003cspan address=\"https://doi.org:10.1038/s41467-024-46663-4\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eDu K et al (2025) Comparing Artificial Intelligence\u0026ndash;Generated and Clinician-Created Personalized Self-Management Guidance for Patients With Knee Osteoarthritis: Blinded Observational Study. J Med Internet Res 27:e67830. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.2196/67830\u003c/span\u003e\u003cspan address=\"https://doi.org:10.2196/67830\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eWang L et al (2024) Prompt engineering in consistency and reliability with the evidence-based guideline for LLMs. npj Digit Med 7:41. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.1038/s41746-024-01029-4\u003c/span\u003e\u003cspan address=\"https://doi.org:10.1038/s41746-024-01029-4\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eChen X et al (2024) Evaluating and Enhancing Large Language Models' Performance in Domain-Specific Medicine: Development and Usability Study With DocOA. J Med Internet Res 26:e58158. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.2196/58158\u003c/span\u003e\u003cspan address=\"https://doi.org:10.2196/58158\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eChen X et al (2025) Enhancing diagnostic capability with multi-agents conversational large language models. NPJ Digit Med 8:159. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.1038/s41746-025-01550-0\u003c/span\u003e\u003cspan address=\"https://doi.org:10.1038/s41746-025-01550-0\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eRadford A et al in \u003cem\u003eInternational conference on machine learning.\u003c/em\u003e 8748\u0026ndash;8763 (PmLR)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLi B et al (2025) LLaVA-RadZ: Can Multimodal Large Language Models Effectively Tackle Zero-shot Radiology Recognition? \u003cem\u003earXiv preprint arXiv:2503.07487\u003c/em\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eJang J et al (2024) Significantly improving zero-shot X-ray pathology classification via fine-tuning pre-trained image-text encoders. 
Sci Rep 14:23199\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eJiang Y, Chen C, Nguyen D, Mervak BM, Tan C (2024) Gpt-4v cannot generate radiology reports yet. \u003cem\u003earXiv preprint arXiv:2407.12176\u003c/em\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003ePi S-W, Lee B-D, Lee MS, Lee HJ (2023) Ensemble deep-learning networks for automated osteoarthritis grading in knee X-ray images. Sci Rep 13:22887\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eWang P et al (2023) Large language models are not fair evaluators. Preprint Arxiv\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eSwiecicki A et al (2021) Deep learning-based algorithm for assessment of knee osteoarthritis severity in radiographs matches performance of radiologists. Comput Biol Med 133:104334\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eElnashar A, White J, Schmidt DC (2025) Enhancing structured data generation with GPT-4o evaluating prompt efficiency across prompt styles. Front Artif Intell 8:1558938\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLi J et al (2024) Integrated image-based deep learning and language models for primary diabetes care. Nat Med 30:2886\u0026ndash;2896. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.1038/s41591-024-03139-8\u003c/span\u003e\u003cspan address=\"https://doi.org:10.1038/s41591-024-03139-8\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eYu KKH et al (2025) Investigative needle core biopsies support multimodal deep-data generation in glioblastoma. Nat Commun 16:3957. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.1038/s41467-025-58452-8\u003c/span\u003e\u003cspan address=\"https://doi.org:10.1038/s41467-025-58452-8\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eSosinsky A et al (2024) Insights for precision oncology from the integration of genomic and clinical data of 13,880 tumors from the 100,000 Genomes Cancer Programme. Nat Med 30:279\u0026ndash;289. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.1038/s41591-023-02682-0\u003c/span\u003e\u003cspan address=\"https://doi.org:10.1038/s41591-023-02682-0\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eZakka C et al (2024) Almanac \u0026mdash; Retrieval-Augmented Language Models for Clinical Medicine. NEJM AI 1:AIoa2300068. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:doi:10.1056/AIoa2300068\u003c/span\u003e\u003cspan address=\"https://doi.org:doi:10.1056/AIoa2300068\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eCeresa M et al (2025) Retrieval Augmented Generation Evaluation for Health Documents. \u003cem\u003earXiv preprint arXiv:2505.04680\u003c/em\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eYang R et al (2025) Retrieval-augmented generation for generative artificial intelligence in health care. npj Health Syst 2:2. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.1038/s44401-024-00004-1\u003c/span\u003e\u003cspan address=\"https://doi.org:10.1038/s44401-024-00004-1\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eMeta Fundamental AI Research Diplomacy Team (FAIR) et al (2022) Human-level play in the game of \u003cem\u003eDiplomacy\u003c/em\u003e by combining language models with strategic reasoning. Science 378:1067\u0026ndash;1074. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:doi:10.1126/science.ade9097\u003c/span\u003e\u003cspan address=\"https://doi.org:doi:10.1126/science.ade9097\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eMa C, Li A, Du Y, Dong H, Yang Y (2024) Efficient and scalable reinforcement learning for large-scale network control. Nat Mach Intell 6:1006\u0026ndash;1020. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.1038/s42256-024-00879-7\u003c/span\u003e\u003cspan address=\"https://doi.org:10.1038/s42256-024-00879-7\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eGoh E et al (2024) Large Language Model Influence on Diagnostic Reasoning: A Randomized Clinical Trial. JAMA Netw Open 7:e2440969\u0026ndash;e2440969. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.1001/jamanetworkopen.2024.40969\u003c/span\u003e\u003cspan address=\"https://doi.org:10.1001/jamanetworkopen.2024.40969\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eAyers JW et al (2023) Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum. JAMA Intern Med 183:589\u0026ndash;596. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.1001/jamainternmed.2023.1838\u003c/span\u003e\u003cspan address=\"https://doi.org:10.1001/jamainternmed.2023.1838\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eGaber F et al (2025) Evaluating large language model workflows in clinical decision support for triage and referral and diagnosis. npj Digit Med 8:263. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.1038/s41746-025-01684-1\u003c/span\u003e\u003cspan address=\"https://doi.org:10.1038/s41746-025-01684-1\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eCar J et al (2025) The Digital Health Competencies in Medical Education Framework: An International Consensus Statement Based on a Delphi Study. JAMA Netw Open 8:e2453131\u0026ndash;e2453131. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.1001/jamanetworkopen.2024.53131\u003c/span\u003e\u003cspan address=\"https://doi.org:10.1001/jamanetworkopen.2024.53131\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eTordjman M et al (2025) Comparative benchmarking of the DeepSeek large language model on medical tasks and clinical reasoning. Nat Med. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org:10.1038/s41591-025-03726-3\u003c/span\u003e\u003cspan address=\"https://doi.org:10.1038/s41591-025-03726-3\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eInvestigators OI (2008) The Osteoarthritis Initiative: A Knee Health Study. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://nda.nih.gov/oai/\u003c/span\u003e\u003cspan address=\"https://nda.nih.gov/oai/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eRonneberger O, Fischer P, Brox T (2015) in \u003cem\u003eMedical image computing and computer-assisted intervention\u0026ndash;MICCAI 2015: 18th international conference, Munich, Germany, October 5\u0026ndash;9\u003c/em\u003e, proceedings, part III 18. 234\u0026ndash;241 (Springer)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eHe K, Zhang X, Ren S, Sun J in \u003cem\u003eProceedings of the IEEE conference on computer vision and pattern recognition.\u003c/em\u003e 770\u0026ndash;778\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eKellgren JH, Lawrence JS (1957) Radiological assessment of osteo-arthrosis. Ann Rheum Dis 16:494\u0026ndash;502. 