Mobility Functional Status Ascertainment in Electronic Health Records using Large Language Models | Research Square

Xingyi Liu, Muskan Garg, Heling Jia, Jennifer St. Sauver, Sandeep R. Pagali, and 1 more

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-7104310/v1
This work is licensed under a CC BY 4.0 License.
Status: Published. Journal publication published 23 Jan, 2026; the published version is in Scientific Reports.
Version 1: You are reading this latest preprint version.

Abstract

With global aging, assessing functional status is vital for precision medicine. Electronic Health Records (EHRs), particularly unstructured data, hold abundant information on patient mobility. This study explores using Large Language Models (LLMs) to extract and standardize mobility status from unstructured EHR data (i.e., clinical notes).
We annotated 600 clinical notes from three health care institutions located in southeastern Minnesota and west-central Wisconsin, focusing on expressions of mobility and associated impairment. Leveraging the open-source Llama 3 model, we tested various prompting strategies (including zero-shot, few-shot, and task decomposition) and evaluated their performance. Error analysis showed that while the model sometimes inferred impairments without explicit evidence, most errors were clinically reasonable, often reflecting borderline or ambiguous cases. When reasonable inferences were counted as correct, patient-level Mobility Extraction achieved a micro-average accuracy of 0.952 and an F1-score of 0.962, and Impairment Classification achieved a micro-average accuracy of 0.912 and an F1-score of 0.948. A local, deterministic setup improved trustworthiness by ensuring consistent outputs, safeguarding privacy, and demonstrating cross-institution generalizability. These findings highlight the feasibility of LLM-based solutions for extracting mobility functional status from unstructured EHR data, supporting both clinical applications and research.

Subject areas: Biological sciences/Computational biology and bioinformatics; Health sciences/Health care; Physical sciences/Mathematics and computing; Health sciences/Medical research

1 Introduction

The global demographic landscape is undergoing a shift, with a rapidly increasing proportion of older adults [1]. This demographic transition necessitates a fundamental change in healthcare, moving beyond a traditional focus on mortality to prioritize the maintenance of functional status and quality of life [2, 3]. Functional status, broadly defined as an individual’s ability to perform activities necessary for daily living, is a powerful predictor of health outcomes, healthcare utilization, and overall well-being, particularly in older populations and those with chronic conditions [4–6].
Among the various domains of functional status, mobility, the ability to move oneself within an environment, is essential. Mobility limitations are strongly associated with increased risk of falls, hospitalization, institutionalization, reduced social participation, and diminished overall quality of life [7–9]. Accurate and timely assessment of mobility functional status is therefore essential for effective geriatric care, rehabilitation, and the development of personalized interventions.

Traditionally, mobility functional status has been assessed through standardized questionnaires (e.g., self-reported activity limitations) and performance-based measures (e.g., Timed Up & Go test, gait speed assessment) [10]. However, a rich source of information on patient mobility also resides within electronic health records (EHRs) [11, 12]. Clinicians routinely document observations, patient reports, and assessments related to mobility in clinical notes. Unfortunately, this information is typically recorded in unstructured or semi-structured text, making it difficult to extract and analyze systematically. The complexity, variability, and nuanced language used in clinical documentation pose a significant challenge to traditional data extraction techniques. This represents a major barrier to leveraging the wealth of information contained within unstructured EHRs to improve the understanding and management of mobility limitations. Furthermore, manual chart review, the traditional method for extracting information from unstructured text, is prohibitively time-consuming and expensive for large-scale research or routine clinical use [13].

This study proposes a novel approach to address this challenge by employing Large Language Models (LLMs) to automatically ascertain and standardize mobility functional status from unstructured EHR data (i.e., clinical notes).
LLMs, with their advanced capabilities in natural language understanding and generation, offer a powerful means of processing and interpreting complex textual data [14, 15].

Prior research has demonstrated the potential of natural language processing (NLP) techniques, including earlier machine learning approaches and rule-based systems, for extracting information related to functional status and mobility from clinical text [16, 17]. For example, Bales et al. modified an existing NLP system to automate the assignment of selected International Classification of Functioning, Disability, and Health (ICF) codes [18] by updating the lexicon and coding tables [19]. Agaronnik et al. employed the ClinicalRegex NLP software to search EHRs for functional status documentation, utilizing an ontology with five keyword categories to identify pertinent information [20]. Newman-Griffis et al. applied NLP techniques to analyze patient functioning information in clinical documents related to federal disability benefit claims from the U.S. Social Security Administration, achieving robust performance with an F1-score exceeding 80% on two datasets [21]. More recently, Fu et al. introduced FedFSA, a hybrid and federated NLP framework designed to extract functional status information from EHRs across multiple healthcare institutions [22]. These earlier efforts, while showing promise, often faced limitations in terms of generalizability, scalability, and the ability to handle the full complexity and ambiguity of clinical language.

Recent advances in LLM architecture and training have demonstrated remarkable performance on a wide range of NLP tasks, including information extraction, question answering, and text summarization, overcoming many of the limitations of prior NLP approaches [23–25].
These capabilities make LLMs particularly well-suited to the task of extracting structured information from unstructured clinical text, where contextual understanding and the ability to handle ambiguity are crucial. By leveraging the power of an open-source LLM, Llama 3 [26], we aim to develop a robust and scalable pipeline that can automatically extract, classify, and standardize mobility-related information from unstructured clinical text.

The contribution of this study is fourfold. First, it develops a comprehensive annotation scheme grounded in the ICF framework to identify and classify expressions related to mobility functional status in clinical notes. Second, it evaluates the performance of Llama 3 in extracting and classifying mobility impairment status by employing various prompt engineering techniques, such as zero-shot learning, few-shot learning, error-informed prompt refinement, and task decomposition. Third, it assesses the generalizability of the LLM-based approach across different healthcare institutions. Finally, it identifies common error patterns and evaluates the overall trustworthiness of the model.

We hypothesize that LLMs with well-constructed prompt engineering strategies can effectively extract mobility functional status information from unstructured EHR data, providing a scalable solution for both clinical practice and research. The successful implementation of this approach has the potential to significantly enhance precision medicine by enabling more timely and personalized interventions, improving the monitoring of patient progress, and facilitating large-scale research on mobility and aging. This, in turn, could lead to improved clinical decision-making, better resource allocation, and ultimately, improved quality of life for older adults and individuals with mobility limitations.

The remainder of this paper is organized as follows. Section 2 presents the results of our experiments, including performance metrics and error analysis.
Section 3 discusses the implications of our findings, limitations of the study, and directions for future research. Section 4 details the materials and methods used in this study, including the study population, annotation process, LLM selection, and prompt engineering strategies.

2 Results

2.1 Data analysis

Figure 1 provides a comprehensive analysis of mobility classes at both the section and note levels. Fig. 1(a) shows the number of sections corresponding to each combination of the five mobility classes: 10 sections mention all five classes, whereas 2,170 sections mention none. Fig. 1(b) extends this analysis to the note level, illustrating the number of notes corresponding to each combination of the five mobility classes: four notes mention all five classes, while 288 notes do not mention any. Additionally, Fig. 1(c) presents a pie chart showing how many mobility classes each section contains. As shown, 57% of sections mention no mobility classes, 23.1% mention one, 12.8% mention two, 5.8% mention three, 1.2% mention four, and 0.3% mention all five classes. Similarly, Fig. 1(d) illustrates how many mobility classes appear in entire notes: 48% of notes do not mention any mobility classes, 29.3% mention one, 9.0% mention two, 8.5% mention three, 4.5% mention four, and 0.7% mention all five.

A combined overview was created to show how often each mobility class was mentioned, along with the distribution of impairment statuses (Impaired vs. Unimpaired). At the section level, the most frequently mentioned mobility class is “Changing and maintaining body position”, appearing in 853 sections; of these, 608 indicate Impaired status, while the remainder suggest Unimpaired. In contrast, the least frequently mentioned mobility class at the section level is “Carrying, moving and handling objects”, noted in only 203 sections, 156 of which suggest Impaired status.
At the note level, the most frequently mentioned mobility class is “Walking and moving”, found in 222 notes; 197 of these suggest Impaired status, with the remainder indicating Unimpaired. Meanwhile, the least frequently mentioned mobility class at the note level is “Carrying, moving and handling objects”, referenced by only 11 notes, 8 of which suggest Impaired status.

Figure 1(e)–(i) illustrate the distribution of sections for each of the five mobility classes. For each class, we identified the three most frequently mentioned sections and grouped all remaining sections under “other”. The results showed that “history of present illness” ranks among the top three for all five mobility classes, while “assessment” appears in the top three for four of the five classes, with the exception of “Moving around using transportation”.

2.2 Baseline Performance with Zero-Shot Learning

Table 1 presents the baseline section-level performance on Task 1 (Mobility Extraction) and Task 2 (Impairment Classification). This evaluation was conducted across the three participating institutions and five mobility functional status classes. Task 1 metrics include class- and institution-specific precision, recall, and F1-scores. The Task 1 micro-average F1-scores across all mobility classes are 0.663, 0.638, and 0.516 for Sites 1, 2, and 3, respectively. Overall, these baseline metrics varied by class and location, with some classes performing consistently well while others left room for improvement. They serve as a reference point against which more complex prompting strategies and configurations can be compared.

Table 1 Performance of Task 1 (Mobility Extraction) and Task 2 (Impairment Classification) Across Mobility Classes and Institutions with Zero-Shot Learning at Section Level.
| Class | Site 1 T1 P | Site 1 T1 R | Site 1 T1 F1 | Site 1 T2 F1 | Site 2 T1 P | Site 2 T1 R | Site 2 T1 F1 | Site 2 T2 F1 | Site 3 T1 P | Site 3 T1 R | Site 3 T1 F1 | Site 3 T2 F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Changing and maintaining body position | 0.745 | 0.802 | 0.772 | 0.837 | 0.627 | 0.789 | 0.699 | 0.805 | 0.605 | 0.665 | 0.634 | 0.757 |
| Carrying, moving, and handling objects | 0.589 | 0.768 | 0.667 | 0.894 | 0.577 | 0.602 | 0.589 | 0.814 | 0.273 | 0.545 | 0.364 | 0.857 |
| Walking and moving | 0.584 | 0.989 | 0.734 | 0.964 | 0.542 | 0.997 | 0.703 | 0.930 | 0.214 | 1.000 | 0.352 | 0.962 |
| Moving around using transportation | 0.688 | 0.611 | 0.647 | 0.649 | 0.771 | 0.771 | 0.771 | 0.718 | 0.931 | 0.849 | 0.888 | 0.755 |
| Mobility, unspecified | 0.341 | 0.905 | 0.495 | 0.872 | 0.357 | 0.960 | 0.521 | 0.836 | 0.252 | 0.730 | 0.375 | 0.914 |
| Average | 0.534 | 0.874 | 0.663 | 0.888 | 0.500 | 0.880 | 0.638 | 0.857 | 0.381 | 0.798 | 0.516 | 0.833 |

*T1: Task 1, T2: Task 2, P: Precision, R: Recall, F1: F1-score.

Regarding Task 2 (Impairment Classification), the class “Walking and moving” performed best, with F1-scores ranging from 0.930 to 0.964 across the three sites. The class “Moving around using transportation” performed worst, with F1-scores ranging from 0.649 to 0.755 across sites. The micro-average F1-scores across all mobility classes are 0.888, 0.857, and 0.833 for Sites 1, 2, and 3, respectively.

2.3 Comparison of Few-Shot Learning and Error-Informed Prompt Refinement Approaches

Figure 2(a) presents a comparison of section-level F1-scores for Mobility Extraction across five mobility functional status classes and three institutions under five configurations: the baseline zero-shot learning method, three few-shot learning methods (Random Selection, K-Means Error Clustering Selection, and Similarity-Based Selection), and the Error-Informed Prompt Refinement strategy. In each bar cluster, the bars represent the section-level F1-scores for a specific class and institution combination. Overall, while most few-shot configurations underperform relative to the zero-shot baseline, the extent of this performance gap varies by both class and institution.
In some cases, few-shot learning slightly outperforms the baseline, whereas in others the decline is more pronounced. One possible explanation is that longer prompts can lead the model to “forget” crucial information provided earlier in the context. In few-shot settings, the additional examples may dilute or overshadow key information, such as the definitions of mobility classes, from the beginning of the prompt. Notably, the Error-Informed Prompt Refinement approach yields significant improvements in most scenarios, with the exceptions of the “Mobility, unspecified” class at all three sites and the “Moving around using transportation” class at Site 3. These per-class, per-institution comparisons provide valuable insights into the conditions under which each method is most effective.

2.4 Task Decomposition Method Performance and Integration

Figure 2(b) compares the zero-shot baseline F1-scores for Mobility Extraction against those obtained using the two task decomposition methods at section level: (1) Single LLM with Two-Task Prompt (Chain-of-Thought Prompting) and (2) Two LLMs with Task Specialization. As shown in Fig. 2(b), neither task decomposition approach consistently outperforms the zero-shot baseline across all classes and institutions; each sometimes exceeds the baseline and sometimes falls short. We then integrated the two task decomposition methods with the top-performing Error-Informed Prompt Refinement strategy. Upon re-evaluation, this combined approach did not consistently surpass the performance of the Error-Informed Prompt Refinement strategy applied alone across all classes and institutions. These findings suggest that more complex configurations do not necessarily lead to better outcomes. Overall, our best configuration remains the standalone Error-Informed Prompt Refinement strategy.
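To make the "Two LLMs with Task Specialization" idea concrete, the sketch below chains one call for Task 1 (Mobility Extraction) into a second call for Task 2 (Impairment Classification). It is one plausible reading of the configuration, not the implementation used in this study: the prompt wording, the `LLM` callable interface, and the function names `extract_classes`, `classify_impairment`, and `two_stage_annotate` are all assumptions introduced for illustration.

```python
from typing import Callable, Dict, List, Optional

# The five mobility classes named in this paper.
MOBILITY_CLASSES = [
    "Changing and maintaining body position",
    "Carrying, moving and handling objects",
    "Walking and moving",
    "Moving around using transportation",
    "Mobility, unspecified",
]

# Hypothetical interface: any function that maps prompt text to reply text.
LLM = Callable[[str], str]


def extract_classes(section: str, extractor: LLM) -> List[str]:
    """Task 1: ask the first model which mobility classes the section mentions."""
    options = "; ".join(MOBILITY_CLASSES)
    reply = extractor(
        "List every mobility class mentioned in the section.\n"
        f"Classes: {options}\nSection: {section}"
    ).lower()
    # Keep only class names that literally appear in the reply.
    return [name for name in MOBILITY_CLASSES if name.lower() in reply]


def classify_impairment(section: str, mobility_class: str, classifier: LLM) -> Optional[str]:
    """Task 2: ask the second model for the impairment status of one class."""
    reply = classifier(
        f"Is '{mobility_class}' Impaired or Unimpaired in this section?\n"
        f"Section: {section}"
    ).strip().lower()
    if reply.startswith("unimpaired"):
        return "Unimpaired"
    if reply.startswith("impaired"):
        return "Impaired"
    return None  # unparseable reply: leave the status unset


def two_stage_annotate(section: str, extractor: LLM, classifier: LLM) -> Dict[str, Optional[str]]:
    """Run Task 1, then Task 2 only for the classes Task 1 returned."""
    return {
        name: classify_impairment(section, name, classifier)
        for name in extract_classes(section, extractor)
    }
```

A design consequence worth noting is that any class missed in the first stage can never be recovered in the second, which is one way a decomposed pipeline can fall short of a single-prompt baseline.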
2.5 Best Performance with Error-Informed Prompt Refinement

The section-level performance of the Error-Informed Prompt Refinement is summarized in Supplementary Table S2. For Task 1 (Mobility Extraction), the micro-average F1-scores across all mobility classes are 0.729, 0.723, and 0.608 for Sites 1, 2, and 3, respectively. For Task 2 (Impairment Classification), the “Walking and moving” class performed best, with F1-scores between 0.913 and 0.957 across the three sites, and the micro-average F1-scores across all mobility classes are 0.832, 0.814, and 0.785 for Sites 1, 2, and 3, respectively.

Supplementary Table S3 presents a comprehensive summary of the section-level, note-level, and patient-level performance with the Error-Informed Prompt Refinement configuration for both tasks, aggregated across all institutions. The micro-average F1-scores for Task 1 and Task 2 were 0.695 and 0.815 at the section level, 0.819 and 0.849 at the note level, and 0.876 and 0.897 at the patient level, respectively.

2.6 Error Analyses

This section presents the major error themes identified during our analyses, highlighting key areas for refinement to inform future enhancements to model performance and reliability. Major error themes include missing functional status context, lack of medical knowledge, ambiguous definition, issues of certainty, exclusion, and inference.

Inference: Inference errors are a common source of false positives. For example, when a section mentions “right lower leg pain that began approximately 3 years ago with activities”, the LLM inferred an impairment in “Changing and maintaining body position”. Although clinically reasonable, such inferences were excluded during annotation because there was no direct mention of a limitation in that mobility class.
Similarly, phrases like “right hand pain, hand swelling” led the model to infer an impairment in “Carrying, moving and handling objects”, and “has more knee pain with the exercises” prompted an inference of impairment in “Walking and moving”. In these cases, the absence of explicit evidence for a functional limitation necessitated their exclusion from impaired status assignments.

Another error pattern arises from the handling of non-specific symptoms. When sections mention terms such as “dizziness”, “headaches”, “vertigo”, “weakness”, or “debility”, the LLM often assigned an impairment in the “Mobility, unspecified” class. However, in the absence of a direct reference to a functional limitation (e.g., difficulty ambulating or transferring), our annotation guidelines mandate that such general symptoms be excluded from mobility impairment classification. This overgeneralization by the LLM contributes to false positives and decreases the precision for the “Mobility, unspecified” class.

Ambiguous Definitions: Some mobility concepts share similar semantics, leading to overlapping or inconsistent classifications. For instance, repeated falls were frequently classified by the LLM as impairments in “Walking and moving”. However, our clinical rationale dictated that such cases should be classified under “Changing and maintaining body position” to better capture issues with postural control. Similarly, mentions of wheelchair use were often split between “Walking and moving” and “Moving around using transportation”. To enhance clarity and consistency, we consolidated wheelchair-related instances exclusively under “Moving around using transportation”.

Missing Definition: The class “Mobility, unspecified” remained problematic across tasks. This challenge arises primarily from the lack of a precise ICF definition for this class, rendering its scope inherently ambiguous.
In practice, this class is meant to capture general aspects of mobility (such as everyday activities and exercise) but without clear boundaries. For example, when a clinical note states, “patient returned fully to yoga,” it implies an impairment within this class.

Exclusion: Documentation pertaining to treatment plans, discussions, suggestions, instructions, recommendations, goals, or advice is generally not considered indicative of a patient’s mobility functional status. This rule applies unless such documentation explicitly mentions that the patient is expected to improve in a specific mobility class, which in turn implies that an impairment exists. For example, if a note states that “the patient will improve in walking ability with physical therapy”, this suggests an underlying impairment in that area and should be annotated accordingly. Otherwise, routine mentions of treatment-related content are excluded from contributing to the functional status classification. However, we observed that the LLM sometimes struggles to exclude such content even when the prompt explicitly directs it to do so.

Limitations in Medical Knowledge: LLMs trained on general corpora may lack the nuanced medical expertise required for certain assessments. For instance, the Focus on Therapeutic Outcomes (FOTO) system is a specialized web-based assessment where a low final score (ranging from 0 to 100) indicates impairment in the “Mobility, unspecified” class. Similarly, the Patient Specific Functional Scale (PSFS) is widely used to capture patient-specific reports of functional ability. In clinical practice, a low current PSFS score, or an expectation of future improvement, serves as a direct indicator of impaired functional status in the “Mobility, unspecified” class. The Timed Up and Go (TUG) Test is a standardized assessment for evaluating mobility and the risk of falls.
However, our analysis revealed that the LLM does not consistently integrate these quantitative assessments into its interpretation of clinical notes. As a result, numerical scores or standardized test results might be inferred indirectly.

As noted in the inference error analysis, some false-positive inferences are clinically reasonable enough to be considered correct (i.e., true positives). Therefore, we also evaluated the performance of the LLM after treating such reasonable inferences as correct. Table 2 presents a comprehensive summary for both Task 1 and Task 2, aggregated across all institutions under this assumption. Performance improved compared to the original results (Supplementary Table S3): the micro-average F1-scores for Task 1 and Task 2 increased to 0.890 and 0.878 at the section level (originally 0.695 and 0.815), 0.942 and 0.925 at the note level, and 0.962 and 0.948 at the patient level.

Table 2 F1-score of Two Tasks Across Mobility Classes Aggregated Across All Institutions with Error-Informed Prompt Refinement When Considering Reasonable Inference as Correct.

| Class | Section-Level Task 1 | Section-Level Task 2 | Note-Level Task 1 | Note-Level Task 2 | Patient-Level Task 1 | Patient-Level Task 2 |
|---|---|---|---|---|---|---|
| Changing and maintaining body position | 0.867 | 0.866 | 0.930 | 0.929 | 0.956 | 0.951 |
| Carrying, moving, and handling objects | 0.877 | 0.878 | 0.909 | 0.902 | 0.907 | 0.917 |
| Walking and moving | 0.948 | 0.951 | 0.982 | 0.984 | 0.989 | 0.982 |
| Moving around using transportation | 0.877 | 0.730 | 0.939 | 0.750 | 0.966 | 0.824 |
| Mobility, unspecified | 0.833 | 0.810 | 0.916 | 0.903 | 0.963 | 0.966 |
| Average | 0.890 | 0.878 | 0.942 | 0.925 | 0.962 | 0.948 |

*F1: F1-score.

2.7 Trustworthiness Analyses

LLMs used in clinical contexts must demonstrate trustworthiness across several dimensions, including reliability, safety, and robustness. In this section, we evaluate our approach and identify how our methods and configurations contribute to these attributes.

Reliability: We configured the LLM with a temperature setting of 0, ensuring deterministic responses.
By eliminating randomness in the generation process, the model’s outputs are reproducible: repeated queries yield consistent results. Additionally, our error analysis revealed that although the LLM does make mistakes, these errors are generally “reasonable” rather than erratic or nonsensical. For instance, the model might misclassify a borderline case of mobility impairment, but it does not invent implausible conditions. Such understandable, bounded errors indicate that the model’s performance is stable and predictable, which is critical for building clinical trust.

Safety: Maintaining patient privacy and data confidentiality is paramount in healthcare applications. To this end, we deployed the LLM on secure local servers, preventing any transfer of protected health information (PHI) to external environments. By avoiding reliance on remote or third-party cloud services, we mitigated the risk of unintended data exposure and ensured compliance with privacy regulations. This closed-loop infrastructure provides a safer environment for processing sensitive clinical notes.

Generalizability: A trustworthy LLM must demonstrate robust performance across various clinical contexts. Although the training dataset was sourced from a single site, our analysis shows that refinements (such as introducing examples or error-informed instructions) improved the model’s performance not only at the original site but also at two additional independent sites. This form of transfer learning highlights the model’s adaptability and generalizability, suggesting that improvements identified in one environment can successfully translate to others. Consequently, the LLM remains effective under slightly shifted data distributions, thereby increasing its overall trustworthiness as a tool for broad clinical application.
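The link between a temperature of 0 and reproducible output, noted under Reliability above, can be illustrated with a toy calculation. The sketch below is not the study's inference code: the logits are made-up numbers and the function names are hypothetical. It shows that dividing token scores by a smaller and smaller temperature concentrates the sampling distribution on the highest-scoring token; because dividing by exactly 0 is undefined, a temperature of 0 is commonly handled as a special case that simply picks that top token (greedy decoding), which is an assumption about the serving stack rather than something this paper specifies.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert token scores to sampling probabilities at a given temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; treat 0 as greedy decoding")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(value - peak) for value in scaled]
    total = sum(weights)
    return [weight / total for weight in weights]


def greedy_token(logits: List[float]) -> int:
    """Index of the highest-scoring token, i.e. the temperature-0 choice."""
    return max(range(len(logits)), key=lambda index: logits[index])


toy_logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for temperature in (1.0, 0.5, 0.05):
    top = softmax_with_temperature(toy_logits, temperature)[greedy_token(toy_logits)]
    print(f"T={temperature}: probability of the top token = {top:.4f}")
```

As the temperature shrinks, the printed probability of the top token approaches 1, so repeated runs select the same token at each step.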
By addressing each of these dimensions (ensuring reproducibility and logical consistency, protecting patient data privacy, and demonstrating cross-institutional adaptability), we underscore the trustworthiness of our LLM-based annotation pipeline.

3 Discussion

This study demonstrates the feasibility of employing LLMs with well-constructed prompt engineering to automatically extract and standardize mobility functional status information from unstructured EHR data. By leveraging an open-source LLM (Llama 3) alongside tailored prompt engineering strategies (including zero-shot, few-shot, and error-informed prompt refinement), we developed a pipeline that reliably identifies and classifies mobility-related expressions across diverse clinical note sections. Our evaluations across three healthcare institutions reveal that, while performance varies among different mobility classes, the refined prompt strategy significantly enhances overall accuracy and consistency.

The Error-Informed Prompt Refinement configuration was designed to address specific shortcomings observed in the baseline model by incorporating feedback from previous errors. By analyzing misclassifications and instances of low precision, the model was able to adapt its prompts to better clarify ambiguous cases. This adjustment was particularly beneficial for classes that initially suffered from high variability, such as “Carrying, moving, and handling objects” and “Walking and moving” in the Mobility Extraction task. For Impairment Classification, while the top-performing class (“Walking and moving”) maintained high performance, the refinement process contributed to a more consistent performance for classes like “Changing and maintaining body position”. The narrower range of F1-scores (0.793–0.851) in this setting suggests that error-informed adjustments helped reduce performance fluctuations across institutions.
The Error-Informed method sometimes exhibited a trade-off where enhancements in precision for some classes came at the expense of recall. For example, in Impairment Classification, although the precision improved for class “Mobility, unspecified”, the corresponding recall dropped, which in turn affected the F1-score. This indicates that while the refined prompts can be more targeted, they may sometimes become overly conservative, missing some correct instances. Portability of the model across different healthcare institutions remains an important area for further exploration. Our development took place at an academic research center (Site 1) and was tested at two other institutions (Sites 2 and 3) that include data from community-based real-world practices to validate generalizability. In future work, we aim to test different combinations of development and validation, and also expand data sources with more diverse patient populations, clinical note structures, and documentation styles. This would allow us to systematically assess model robustness and understand the impact of data heterogeneity on performance metrics. Another limitation of our approach is that we did not fine-tune the LLM on the annotated dataset. Although fine-tuning has the potential to improve performance, our annotated dataset was relatively small, and fine-tuning large models typically requires substantial amounts of labeled data to avoid overfitting. In addition, fine-tuning can be computationally expensive and may increase the risk of overfitting to specific institution documentation styles. Instead, we relied on prompt-engineering strategies that can be more data-efficient and flexible. However, future research could explore lightweight fine-tuning methods (e.g., LoRA or adapter-based approaches) if larger annotated corpora become available. 
Moreover, the deterministic configuration and secure local deployment of the model address key concerns regarding reproducibility and patient data privacy, further reinforcing the pipeline’s trustworthiness. Despite challenges such as handling ambiguous language and balancing precision with recall, the findings indicate that LLMs can serve as scalable and generalizable tools for clinical applications. Future work may further explore additional domains of trustworthiness, such as fairness and interpretability, to provide an even more comprehensive assessment of the model’s reliability and suitability for clinical environments. Longitudinal validations in real-world settings will be essential for fully integrating these methods into clinical decision-making processes, ultimately contributing to improved patient care and personalized intervention strategies. Overall, our work lays a promising foundation for advancing the automated extraction of functional status data, highlighting the transformative role of LLMs in modern healthcare analytics.

4 Methods

4.1 Data Preparation

This study included three institutions: an academic referral medical center and two community-based practices in the Upper Midwest. Agreements between these institutions preclude direct comparisons of data; therefore, the sites are designated by number (Sites 1–3).

The annotation guidelines were developed using definitions from the ICF to annotate mobility functional status data in clinical notes. We considered five mobility functional status classes, each comprising one or more ICF subclasses. Supplementary Table S1 lists these classes alongside their corresponding ICF codes and subclasses. We restricted our analysis to five primary mobility classes, rather than using their more granular ICF subclasses. This decision was driven by two practical considerations.
First, our annotation process revealed a high degree of semantic overlap among the ICF subclasses, making it difficult to obtain reliable annotations for those narrower categories. As a result, we consolidated the ICF-based taxonomy into the five major classes defined in the ICF to enhance labeling consistency and reduce confusion during both manual annotation and automated extraction. Second, LLMs may struggle with overly fine-grained distinctions in clinical text, especially if subclass definitions overlap or are ambiguously documented.

To facilitate focused and coherent annotation, each clinical note was divided into sections. Clinical notes at Site 1 include a variety of sections, such as History of Present Illness, Past Medical/Surgical History, Physical Examination, Diagnosis, and more. We chose section-level analysis for several reasons. First, clinical notes can be lengthy, and feeding entire notes into an LLM risks exceeding context length limits, potentially truncating important information. Second, sections typically focus on specific aspects of a patient’s health (e.g., History of Present Illness, Assessment), which makes them natural, semantically coherent units for targeted extraction tasks. Third, section-level segmentation facilitates more precise few-shot prompts and helps the model focus on local contextual cues relevant to mobility. Finally, working at the section level can improve interpretability, since clinicians often review notes by scanning sections, allowing direct alignment of the LLM’s output with established clinical documentation practices.

Within each section, a trained annotator manually identified and labeled mobility classes with related expressions and assigned the appropriate impairment statuses. The annotator assigned “Impaired” if the section described difficulties, challenges, limitations, potential issues, or impairments, and “Unimpaired” if the section explicitly described normal function or abilities.
If a section received neither “Impaired” nor “Unimpaired,” this indicated that the section contained no mobility-related information. Additional details on the annotation process and guidelines can be found in the Supplementary Section.

In total, 600 notes were annotated. Of these, 200 were physical therapy (PT) or occupational therapy (OT) notes from Site 1, 200 were PT/OT notes from Site 2, and 200 were unrestricted clinic notes from Site 3. PT/OT notes typically provide abundant information regarding mobility, while including general clinic notes from Site 3 helps demonstrate the method’s broader applicability. After splitting the notes into sections, the dataset comprised 3,810 sections (1,153 in Site 1, 1,075 in Site 2, and 1,582 in Site 3). On average, each section contained 826 characters, with section lengths averaging 932 characters in Site 1, 847 in Site 2, and 734 in Site 3. A single section may contain information relevant to multiple mobility classes or none.

4.2 Experiment

For prompt development, 200 sections were randomly selected from Site 1 to form a training dataset; the remaining sections served as the test set. Performance was evaluated using precision, recall, and F1-score, both class-specific and institution-specific, across various experimental configurations.

An LLM-based annotation pipeline was developed to extract mobility functional status information. The pipeline consists of three core components: (1) the LLM, (2) a task-specific prompt combined with a clinical note section, and (3) a post-processing step that converts the LLM’s text output into structured data. As shown in Fig. 3, the process begins with dividing the clinical notes into distinct sections. Each section is then embedded within a task-specific prompt explicitly designed for mobility functional status extraction.
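Components (2) and (3) of the pipeline can be sketched as follows. This is only an illustration under stated assumptions: the prompt wording, the line-per-class output format (`<class>: <label>`), and the function names `build_prompt` and `parse_response` are hypothetical, since the actual prompt text and output format are not given here, and the call to the LLM itself is omitted. The class names and the three labels follow the annotation scheme used in this study.

```python
from typing import Dict

# The five mobility classes used in this study.
MOBILITY_CLASSES = [
    "Changing and maintaining body position",
    "Carrying, moving, and handling objects",
    "Walking and moving",
    "Moving around using transportation",
    "Mobility, unspecified",
]
LABELS = {"impaired": "Impaired", "unimpaired": "Unimpaired", "none": "None"}


def build_prompt(section_text: str) -> str:
    """Embed one note section in a task-specific prompt (hypothetical wording)."""
    class_list = "\n".join(f"- {name}" for name in MOBILITY_CLASSES)
    return (
        "Task: identify mobility functional status in the clinical note section.\n"
        f"Mobility classes:\n{class_list}\n"
        "Answer with one line per class in the form '<class>: <label>', where "
        "<label> is Impaired, Unimpaired, or None.\n\n"
        f"Section:\n{section_text}\n"
    )


def parse_response(response: str) -> Dict[str, str]:
    """Post-process free-text LLM output into one label per mobility class.

    Classes that are missing or carry an unrecognized label default to "None".
    """
    labels = {name: "None" for name in MOBILITY_CLASSES}
    by_lower = {name.lower(): name for name in MOBILITY_CLASSES}
    for line in response.splitlines():
        # Split on the last colon so class names containing commas stay intact.
        name, sep, value = line.strip().lstrip("-* ").rpartition(":")
        if not sep:
            continue  # not a '<class>: <label>' line
        key = name.strip().lower()
        label = LABELS.get(value.strip().rstrip(".").lower())
        if key in by_lower and label is not None:
            labels[by_lower[key]] = label
    return labels
```

Defaulting to “None” when a class is absent from the response is a design choice of this sketch; a stricter pipeline might instead flag such responses for review.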
The LLM processes the prompt and generates a textual response. Finally, a post-processing step transforms the LLM’s generated text into structured objects. Specifically, for each section and each of the five mobility classes, the LLM produces a single label: “Impaired” if there is evidence of limitation, “Unimpaired” if normal function is described, or “None” if the section contains no information about that class.

Llama 3, an advanced open-source LLM developed by Meta AI, was employed for this task. Llama 3 provides enhanced efficiency and performance. Deploying Llama 3 on a local server ensured patient data privacy while providing access to powerful computational resources for the annotation tasks. This setup allowed sensitive medical information to be handled within a secure infrastructure. The open-source nature of Llama 3 further facilitated customization and adaptability to specific project requirements.

4.2.1 Performance Evaluation

We calculate performance at the class level. For each mobility class, we first evaluate Mobility Extraction by considering any section labeled “Impaired” or “Unimpaired” for that class as “Mentioned” and all other sections as “Not Mentioned.” Treating “Mentioned” as the positive class and “Not Mentioned” as the negative class, we compute precision, recall, and F1-score by comparing the LLM’s predicted Mentioned/Not Mentioned labels against the human annotations for that class. Next, for Impairment Classification, we restrict our analysis to those sections confirmed as “Mentioned.” Within this subset, the LLM’s output, which distinguishes between “Impaired” and “Unimpaired”, is compared to the human reference label.
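The two class-level evaluations described above can be sketched for a single mobility class as follows. The function names are hypothetical, and two points are assumptions of this sketch rather than details stated in the text: “confirmed as Mentioned” is taken to mean that both the human annotation and the LLM prediction carry an “Impaired” or “Unimpaired” label, and any metric with a zero denominator is reported as 0.0.

```python
from typing import List, Tuple


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1); a metric with a zero denominator is 0.0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def mobility_extraction_scores(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Task 1: 'Impaired'/'Unimpaired' count as Mentioned (positive), 'None' as Not Mentioned."""
    g = [label != "None" for label in gold]
    p = [label != "None" for label in pred]
    tp = sum(a and b for a, b in zip(g, p))
    fp = sum((not a) and b for a, b in zip(g, p))
    fn = sum(a and (not b) for a, b in zip(g, p))
    return prf(tp, fp, fn)


def impairment_classification_scores(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Task 2: restricted to Mentioned sections, with 'Impaired' as the positive label.

    Assumption: a section is kept only if both gold and prediction are Mentioned.
    """
    pairs = [(g, p) for g, p in zip(gold, pred) if g != "None" and p != "None"]
    tp = sum(g == "Impaired" and p == "Impaired" for g, p in pairs)
    fp = sum(g == "Unimpaired" and p == "Impaired" for g, p in pairs)
    fn = sum(g == "Impaired" and p == "Unimpaired" for g, p in pairs)
    return prf(tp, fp, fn)
```

Each list holds one label per section for the class being evaluated, so the functions would be called once per mobility class and per institution.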
In other words, once a section is known to address the mobility class, we assess whether the LLM’s choice of “Impaired” versus “Unimpaired” matches the gold-standard annotation, and we calculate an F1-score for each class with “Impaired” as the positive label.

4.3 Prompt Engineering

4.3.1 Zero-Shot Learning

In the zero-shot configuration, the prompt includes only the core task description, the definitions of all relevant mobility classes, a general instruction, and a final question. It does not provide any examples. Under these conditions, the model relies solely on the given prompt, without guidance from sample inputs or outputs. Evaluating zero-shot performance establishes a baseline against which subsequent methods can be compared.

4.3.2 Few-Shot Learning

Few-shot learning leverages a small number of example cases embedded in the prompt to improve the model’s reasoning and output quality. Unlike zero-shot prompting, few-shot prompts present the model with illustrative examples, each accompanied by a detailed explanation, to better convey the task’s underlying reasoning steps. In this study, a five-shot configuration was used, with the selected examples spanning a variety of scenarios to promote more nuanced understanding. Three strategies were tested for choosing these examples from the training dataset:

Random Selection: Five samples are chosen at random.
K-Means Error Clustering Selection: We first identified the errors that occurred under zero-shot conditions on the training dataset. We then applied the k-means clustering algorithm to group these error cases into five clusters. From the center of each cluster, we selected one representative sample, thereby focusing on common mistakes.
Similarity-Based Selection: For each test case, we identified the top five most similar training samples using a k-nearest neighbors approach. These examples serve as reference points that closely match the characteristics of the test input.
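As an illustration of Similarity-Based Selection, the sketch below ranks training sections by cosine similarity to a test section and returns the indices of the top five. The text does not specify how sections were represented or which distance was used, so the bag-of-words vectors, the cosine measure, and the function names here are assumptions chosen to keep the example self-contained.

```python
import math
import re
from collections import Counter
from typing import List


def vectorize(text: str) -> Counter:
    """Lowercased bag-of-words counts (a stand-in for a real text embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors; 0.0 if either is empty."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_similar_examples(test_section: str, train_sections: List[str], k: int = 5) -> List[int]:
    """Return indices of the k training sections most similar to the test section."""
    query = vectorize(test_section)
    scored = [(cosine(query, vectorize(s)), i) for i, s in enumerate(train_sections)]
    # Sort by similarity (descending); ties are broken by original order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:k]]
```

Because the examples depend on the test input, this strategy builds a different few-shot prompt for every test section, whereas random and k-means selection fix the five examples once.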
Through these strategies, we aimed to determine which method of choosing examples leads to the greatest performance improvement and how few-shot learning can be optimized to achieve more accurate and reliable results.

4.3.3 Error-Informed Prompt Refinement

A systematic error analysis of the zero-shot results was conducted to identify patterns in the model’s mistakes, such as common misclassifications or omissions. We manually examined all false positive and false negative outputs across sections in the training set, catalogued the most common sources of error, and distilled these into a set of explicit “exclusion” and “inclusion” rules. These rules were then injected as additional instructions into the prompt so that the LLM would be guided toward avoiding the same mistakes. For example, whenever a section contained a statement such as “patient will begin home exercise program”, the model tended to treat it as evidence of impairment, even though such statements are plans or instructions. To address this, we added explicit instructions telling the model to ignore any text that describes a treatment plan, recommendation, or exercise regimen.

4.3.4 Task Decomposition

Our overarching goal was to extract expressions related to a patient’s mobility functional status and classify their impairment status (impaired vs. unimpaired). To potentially improve performance, we divided the task into two subtasks: Mobility Extraction and Impairment Classification. This decomposition enables a clearer, more focused approach and allows for specialized handling of each step. To implement this, we explored two different setups:

Single LLM with Two-Task Prompt (Chain-of-Thought Prompting): A single LLM is guided sequentially through the two subtasks with a multi-step prompt. The prompt first instructs the model to assess whether the section contains mobility-related descriptions for each class, then directs it to classify the impairment status.
By leveraging a chain-of-thought process, this setup encourages methodical, stepwise reasoning.
Two LLMs with Task Specialization: Alternatively, this approach employs two dedicated LLMs, each focusing on a single subtask. The first LLM determines whether relevant mobility descriptions are present. The second LLM then uses those determinations to classify impairment status. By dividing responsibilities in this manner, each model can specialize in its role.

Declarations

Ethics approval and consent to participate
The study was approved by the Mayo Clinic and Olmsted Medical Center Institutional Review Boards. Patients included in this study provided research authorization.

Competing Interests Statement
The author(s) declare no competing interests.

Funding
This study was supported by the Eric and Wendy Schmidt Fund for AI Research and Innovation and by NIH (National Institutes of Health) grant R01 AG068007.

Author Contribution
X.L. developed the model, analyzed and interpreted the data, drafted the initial manuscript, and revised the manuscript. H.J. performed data curation and corpus annotation. S.P. and J.S. helped with cohort identification and corpus annotation. S.P., J.S., H.J., and M.G. participated in the interpretation of the data and contributed to manuscript editing and revisions. S.S. conceptualized and designed the study, supervised data collection, model development, and analysis, interpreted the data, and reviewed and finalized the manuscript.

Acknowledgement
We acknowledge the support of our funding agencies, which are listed in the Funding section.

Data Availability
The data used for this work came from electronic health records, which include identifiable data and thus cannot be shared for privacy and legal reasons.

References
1. United Nations. World population ageing report. World Popul. Ageing (2024).
2. World Health Organization et al. World report on ageing and health (2024).
3. Mather, M. & Scommegna, P. Fact sheet: aging in the United States. Popul. Ref. Bureau (2024).
4. Mayer-Oakes, S. A., Oye, R. K. & Leake, B. Predictors of mortality in older patients following medical intensive care: the importance of functional status. J. Am. Geriatr. Soc. 39, 862–868 (1991).
5. Ponzetto, M. et al. Risk factors for early and late mortality in hospitalized older patients: the continuing importance of functional status. Journals Gerontol. Ser. A: Biol. Sci. Med. Sci. 58, M1049–M1054 (2003).
6. Narain, P. et al. Predictors of immediate and 6-month outcomes in hospitalized elderly patients: the importance of functional status. J. Am. Geriatr. Soc. 36, 775–783 (1988).
7. Metz, D. H. Mobility of older people and their quality of life. Transp. Policy 7, 149–152 (2000).
8. Musselwhite, C. & Haddad, H. Mobility, accessibility and quality of later life. Qual. Ageing Older Adults 11, 25–37 (2010).
9. Forhan, M. & Gill, S. V. Obesity, functional mobility and quality of life. Best Pract. Res. Clin. Endocrinol. Metab. 27, 129–137 (2013).
10. Rosen, S. L. & Reuben, D. B. Geriatric assessment tools. Mt. Sinai J. Med.: J. Transl. Pers. Med. 78, 489–497 (2011).
11. Kimia, A. A., Savova, G., Landschaft, A. & Harper, M. B. An introduction to natural language processing: how you can get more from those electronic notes you are generating. Pediatr. Emerg. Care 31, 536–541 (2015).
12. Fu, S. et al. Quality assessment of functional status documentation in EHRs across different healthcare institutions. Front. Digit. Health 4, 958539 (2022).
13. Kreimeyer, K. et al. Natural language processing systems for capturing and standardizing unstructured clinical information: a systematic review. J. Biomed. Inf. 73, 14–29 (2017).
14. Thirunavukarasu, A. J. et al. Large language models in medicine. Nat. Med. 29, 1930–1940 (2023).
15. Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023).
16. Newman-Griffis, D. & Zirikly, A. Embedding transfer for low-resource medical named entity recognition: a case study on patient mobility. arXiv preprint arXiv:1806.02814 (2018).
17. Kukafka, R., Bales, M. E., Burkhardt, A. & Friedman, C. Human and automated coding of rehabilitation discharge summaries according to the international classification of functioning, disability, and health. J. Am. Med. Inf. Assoc. 13, 508–515 (2006).
18. World Health Organization. International Classification of Functioning, Disability, and Health: Children & Youth Version: ICF-CY (World Health Organization, 2007).
19. Bales, M., Kukafka, R., Burkhardt, A. & Friedman, C. Extending a medical language processing system to the functional status domain. In AMIA Annual Symposium Proceedings, vol. 888 (American Medical Informatics Association, 2005).
20. Agaronnik, N., Lindvall, C., El-Jawahri, A., He, W. & Iezzoni, L. Use of natural language processing to assess frequency of functional status documentation for patients newly diagnosed with colorectal cancer. JAMA Oncol. 6, 1628–1630 (2020).
21. Newman-Griffis, D. et al. Linking free text documentation of functioning and disability to the ICF with natural language processing. Front. Rehabil. Sci. 2, 742702 (2021).
22. Fu, S. et al. FedFSA: Hybrid and federated framework for functional status ascertainment across institutions. J. Biomed. Inf. 152, 104623 (2024).
23. Goel, A. et al. LLMs accelerate annotation for medical information extraction. In Machine Learning for Health (ML4H), 82–100 (PMLR, 2023).
24. Zaretsky, J. et al. Generative artificial intelligence to transform inpatient discharge summaries to patient-friendly language and format. JAMA Netw. Open 7, e240357 (2024).
25. Singhal, K. et al. Toward expert-level medical question answering with large language models. Nat. Med. 1–8 (2025).
26. Dubey, A. et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024).

Additional Declarations
No competing interests reported.
Supplementary Files
Supplementary.docx

Editorial decision: Revision requested 26 Jul, 2025
Submission checks completed at journal 22 Jul, 2025
First submitted to journal 22 Jul, 2025
Figures

Figure 1. Analysis of Mobility Classes at Section and Note Levels and Distributions of Sections for Mobility Classes. (a) Section-level Venn diagram showing the number of clinical note sections (n = 3,810) containing each combination of the five mobility functional status classes (Negative = no classes mentioned). (b) Note-level Venn diagram showing the number of clinical notes (n = 600) containing each combination of the five classes. (c) Pie chart of the proportion of sections mentioning 0–5 classes: 57.0% none, 23.1% one, 12.8% two, 5.8% three, 1.2% four, 0.3% five. (d) Pie chart of the proportion of notes mentioning 0–5 classes: 48.0% none, 29.3% one, 9.0% two, 8.5% three, 4.5% four, 0.7% five. (e-i) Distribution of sections for “Changing and maintaining body position”, “Carrying, moving, and handling objects”, “Walking and moving”, “Moving around using transportation” and “Mobility, unspecified”, respectively, showing the three most frequent sections.

Figure 2. Section-level F1-scores for mobility extraction across five functional status classes and three institutions under different prompt and task configurations. (a) Comparison of zero-shot baseline, three few-shot learning strategies and error-informed prompt refinement approach. (b) Comparison of zero-shot baseline and two task decomposition setups. (CB: Changing and maintaining body position, CO: Carrying, moving, and handling objects, WM: Walking and moving, MT: Moving around using transportation, MU: Mobility, unspecified).

Figure 3. LLM-based Annotation Pipeline. Clinical notes are segmented into discrete sections. Each section is embedded in a task-specific prompt defining the five mobility classes. The LLM processes the prompt and outputs.
They serve as a reference point against which more complex prompting strategies and configurations can be compared.\u003c/p\u003e\n\u003cdiv class=\"gridtable\"\u003e\n\u003ctable id=\"Tab1\" border=\"1\"\u003e\u003ccaption\u003e\n\u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e\n\u003cdiv class=\"CaptionContent\"\u003e\n\u003cp\u003ePerformance of Task 1 (Mobility Extraction) and Task 2 (Impairment Classification) Across Mobility Classes and Institutions with Zero-Shot Learning at Section Level.\u003c/p\u003e\n\u003c/div\u003e\n\u003c/caption\u003e\n\u003cthead\u003e\n\u003ctr\u003e\n\u003cth rowspan=\"3\" align=\"left\"\u003e\n\u003cp\u003eClass\u003c/p\u003e\n\u003c/th\u003e\n\u003cth colspan=\"4\" align=\"left\"\u003e\n\u003cp\u003eSite 1\u003c/p\u003e\n\u003c/th\u003e\n\u003cth colspan=\"4\" align=\"left\"\u003e\n\u003cp\u003eSite 2\u003c/p\u003e\n\u003c/th\u003e\n\u003cth colspan=\"4\" align=\"left\"\u003e\n\u003cp\u003eSite 3\u003c/p\u003e\n\u003c/th\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003cth colspan=\"3\" align=\"left\"\u003e\n\u003cp\u003eTask 1\u003c/p\u003e\n\u003c/th\u003e\n\u003cth align=\"left\"\u003e\n\u003cp\u003eTask 2\u003c/p\u003e\n\u003c/th\u003e\n\u003cth colspan=\"3\" align=\"left\"\u003e\n\u003cp\u003eTask 1\u003c/p\u003e\n\u003c/th\u003e\n\u003cth align=\"left\"\u003e\n\u003cp\u003eTask 2\u003c/p\u003e\n\u003c/th\u003e\n\u003cth colspan=\"3\" align=\"left\"\u003e\n\u003cp\u003eTask 1\u003c/p\u003e\n\u003c/th\u003e\n\u003cth align=\"left\"\u003e\n\u003cp\u003eTask 2\u003c/p\u003e\n\u003c/th\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003cth align=\"left\"\u003e\n\u003cp\u003eP\u003c/p\u003e\n\u003c/th\u003e\n\u003cth align=\"left\"\u003e\n\u003cp\u003eR\u003c/p\u003e\n\u003c/th\u003e\n\u003cth align=\"left\"\u003e\n\u003cp\u003eF1\u003c/p\u003e\n\u003c/th\u003e\n\u003cth align=\"left\"\u003e\n\u003cp\u003eF1\u003c/p\u003e\n\u003c/th\u003e\n\u003cth align=\"left\"\u003e\n\u003cp\u003eP\u003c/p\u003e\n\u003c/th\u003e\n\u003cth 
align=\"left\"\u003e\n\u003cp\u003eR\u003c/p\u003e\n\u003c/th\u003e\n\u003cth align=\"left\"\u003e\n\u003cp\u003eF1\u003c/p\u003e\n\u003c/th\u003e\n\u003cth align=\"left\"\u003e\n\u003cp\u003eF1\u003c/p\u003e\n\u003c/th\u003e\n\u003cth align=\"left\"\u003e\n\u003cp\u003eP\u003c/p\u003e\n\u003c/th\u003e\n\u003cth align=\"left\"\u003e\n\u003cp\u003eR\u003c/p\u003e\n\u003c/th\u003e\n\u003cth align=\"left\"\u003e\n\u003cp\u003eF1\u003c/p\u003e\n\u003c/th\u003e\n\u003cth align=\"left\"\u003e\n\u003cp\u003eF1\u003c/p\u003e\n\u003c/th\u003e\n\u003c/tr\u003e\n\u003c/thead\u003e\n\u003ctbody\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eChanging and maintaining body position\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.745\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.802\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.772\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.837\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.627\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.789\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.699\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.805\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.605\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.665\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.634\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.757\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eCarrying, moving, and handling 
objects\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.589\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.768\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.667\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.894\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.577\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.602\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.589\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.814\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.273\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.545\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.364\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.857\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eWalking and moving\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.584\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.989\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.734\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.964\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.542\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.997\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.703\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" 
char=\".\"\u003e\n\u003cp\u003e0.930\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.214\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e1.000\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.352\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.962\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eMoving around using transportation\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.688\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.611\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.647\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.649\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.771\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.771\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.771\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.718\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.931\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.849\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.888\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.755\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eMobility, unspecified\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.341\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" 
char=\".\"\u003e\n\u003cp\u003e0.905\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.495\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.872\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.357\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.960\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.521\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.836\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.252\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.730\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.375\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.914\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eAverage\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.534\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.874\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.663\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.888\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.500\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.880\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.638\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.857\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.381\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" 
char=\".\"\u003e\n\u003cp\u003e0.798\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.516\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.833\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003c/tbody\u003e\n\u003ctfoot\u003e\n\u003ctr\u003e\n\u003ctd colspan=\"13\"\u003e*P: Precision, R: Recall, F1: F1-score.\u003c/td\u003e\n\u003c/tr\u003e\n\u003c/tfoot\u003e\n\u003c/table\u003e\n\u003c/div\u003e\n\u003cp\u003eRegarding Task 2 (Impairment Classification), the class \u0026ldquo;Walking and moving\u0026rdquo; demonstrated the best performance, with F1-scores ranging from 0.930 to 0.964 across the three sites. The class \u0026ldquo;Moving around using transportation\u0026rdquo; achieved the lowest performance, with F1-scores ranging from 0.649 to 0.755 across the sites. The micro-average F1-scores across all mobility classes are 0.888, 0.857, and 0.833 for sites 1, 2, and 3, respectively.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec5\" class=\"Section2\"\u003e\n\u003ch2\u003e2.3 Comparison of Few-Shot Learning and Error-Informed Prompt Refinement Approaches\u003c/h2\u003e\n\u003cp\u003eFigure \u003cspan class=\"InternalRef\"\u003e2\u003c/span\u003e(a) presents a comparison of section-level F1-scores for Mobility Extraction across five mobility functional status classes and three institutions under five configurations: the baseline zero-shot learning method, three few-shot learning methods (Random Selection, K-Means Error Clustering Selection, and Similarity-Based Selection), and the Error-Informed Prompt Refinement strategy. In each bar cluster, the bars represent the section-level F1-scores for a specific class and institution combination.\u003c/p\u003e\n\u003cp\u003eOverall, while most few-shot configurations underperform relative to the zero-shot baseline, the extent of this performance gap varies by both class and institution. 
In some cases, few-shot learning slightly outperforms the baseline, whereas in others the decline is more pronounced. One possible explanation is that longer prompts can lead the model to \"forget\" crucial information provided earlier in the context. In few-shot settings, the additional examples may dilute or overshadow key information, such as the definitions of mobility classes given at the beginning of the prompt.\u003c/p\u003e\n\u003cp\u003eNotably, the Error-Informed Prompt Refinement approach yields significant improvements in most scenarios, with the exception of the \u0026ldquo;Mobility, unspecified\u0026rdquo; class at all three sites and the \u0026ldquo;Moving around using transportation\u0026rdquo; class at Site 3. These per-class, per-institution comparisons provide valuable insights into the conditions under which each method is most effective.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec6\" class=\"Section2\"\u003e\n\u003ch2\u003e2.4 Task Decomposition Method Performance and Integration\u003c/h2\u003e\n\u003cp\u003eFigure \u003cspan class=\"InternalRef\"\u003e2\u003c/span\u003e(b) illustrates the comparison of the zero-shot baseline F1-scores for Mobility Extraction against those obtained using the two task decomposition methods at section level: (1) Single LLM with Two-Task Prompt (Chain-of-Thought Prompting) and (2) Two LLMs with Task Specialization.\u003c/p\u003e\n\u003cp\u003eAs shown in Fig.\u0026nbsp;\u003cspan class=\"InternalRef\"\u003e2\u003c/span\u003e(b), neither task decomposition approach consistently outperforms the zero-shot baseline across all classes and institutions\u0026mdash;sometimes exceeding the baseline, and other times falling short. We then integrated the two task decomposition methods with the top-performing Error-Informed Prompt Refinement strategy. 
Upon re-evaluation, this combined approach did not consistently surpass the performance of the Error-Informed Prompt Refinement strategy applied alone across all classes and institutions. These findings suggest that more complex configurations do not necessarily lead to better outcomes. Overall, our best configuration remains the standalone Error-Informed Prompt Refinement strategy.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec7\" class=\"Section2\"\u003e\n\u003ch2\u003e2.5 Best Performance with Error-Informed Prompt Refinement\u003c/h2\u003e\n\u003cp\u003eThe section-level performance of the Error-Informed Prompt Refinement is summarized in Supplementary Table S2. The micro-average F1-scores across all mobility classes are 0.729, 0.723, and 0.608 for sites 1, 2, and 3, respectively.\u003c/p\u003e\n\u003cp\u003eRegarding Task 2 (Impairment Classification) with the Error-Informed Prompt Refinement, notably, the \u0026ldquo;Walking and moving\u0026rdquo; class performed best, with F1-scores between 0.913 and 0.957 across the three sites. The micro-average F1-scores across all mobility classes are 0.832, 0.814, and 0.785 for sites 1, 2, and 3, respectively.\u003c/p\u003e\n\u003cp\u003eSupplementary Table S3 presents a comprehensive summary of the section-level, note-level and patient-level performance with the Error-Informed Prompt Refinement configuration for both tasks, aggregated across all institutions. The micro-average F1-scores for Task 1 and Task 2 were 0.695 and 0.815 at the section-level, 0.819 and 0.849 at the note-level and 0.876 and 0.897 at the patient-level, respectively.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec8\" class=\"Section2\"\u003e\n\u003ch2\u003e2.6 Error Analyses\u003c/h2\u003e\n\u003cp\u003eThis section presents major error themes identified during our analyses. We highlight key areas for refinement and inform future enhancements to model performance and reliability. 
Major error themes include missing functional status context, lack of medical knowledge, ambiguous definitions, issues of certainty, exclusion, and inference.\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\n\u003cp\u003eInference: Inference errors are a common source of false positives. For example, when a section mentioned \u0026ldquo;right lower leg pain that began approximately 3 years ago with activities\u0026rdquo;, the LLM inferred an impairment in \u0026ldquo;Changing and maintaining body position\u0026rdquo;. Although clinically reasonable, such inferences were excluded during annotation because there was no direct mention of a limitation in that mobility class. Similarly, phrases like \u0026ldquo;right hand pain, hand swelling\u0026rdquo; led the model to infer an impairment in \u0026ldquo;Carrying, moving and handling objects\u0026rdquo;, and \u0026ldquo;has more knee pain with the exercises\u0026rdquo; prompted an inference of impairment in \u0026ldquo;Walking and moving\u0026rdquo;. In these cases, the absence of explicit evidence for a functional limitation necessitated their exclusion from impaired status assignments.\u003c/p\u003e\n\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eAnother error pattern arises from the handling of non-specific symptoms. When sections mention terms such as \u0026ldquo;dizziness\u0026rdquo;, \u0026ldquo;headaches\u0026rdquo;, \u0026ldquo;vertigo\u0026rdquo;, \u0026ldquo;weakness\u0026rdquo;, or \u0026ldquo;debility\u0026rdquo;, the LLM often assigned an impairment in the \u0026ldquo;Mobility, unspecified\u0026rdquo; class. However, in the absence of a direct reference to a functional limitation (e.g., difficulty ambulating or transferring), our annotation guidelines mandate that such general symptoms be excluded from mobility impairment classification. 
This overgeneralization by the LLM contributes to false positives and decreases the precision for the \u0026ldquo;Mobility, unspecified\u0026rdquo; class.\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\n\u003cp\u003eAmbiguous Definitions: Some mobility concepts share similar semantics, leading to overlapping or inconsistent classifications. For instance, repeated falls were frequently classified by the LLM as impairments in \u0026ldquo;Walking and moving\u0026rdquo;. However, our clinical rationale dictated that such cases should be classified under \u0026ldquo;Changing and maintaining body position\u0026rdquo; to better capture issues with postural control. Similarly, mentions of wheelchair use were often split between \u0026ldquo;Walking and moving\u0026rdquo; and \u0026ldquo;Moving around using transportation\u0026rdquo;. To enhance clarity and consistency, we consolidated wheelchair-related instances exclusively under \u0026ldquo;Moving around using transportation\u0026rdquo;.\u003c/p\u003e\n\u003c/li\u003e\n\u003cli\u003e\n\u003cp\u003eMissing Definition: The class \u0026ldquo;Mobility, unspecified\u0026rdquo; remained problematic across tasks. This challenge arises primarily from the lack of a precise ICF definition for this class, rendering its scope inherently ambiguous. In practice, this class is meant to capture general aspects of mobility\u0026mdash;such as everyday activities and exercise\u0026mdash;but without clear boundaries. For example, when a clinical note states, \u0026ldquo;patient returned fully to yoga,\u0026rdquo; it implies an impairment within this class.\u003c/p\u003e\n\u003c/li\u003e\n\u003cli\u003e\n\u003cp\u003eExclusion: Documentation pertaining to treatment plans, discussions, suggestions, instructions, recommendations, goals or advice is generally not considered indicative of a patient\u0026rsquo;s mobility functional status. 
This rule applies unless such documentation explicitly mentions that the patient is expected to improve in a specific mobility class, which in turn implies that an impairment exists. For example, if a note states that \u0026ldquo;the patient will improve in walking ability with physical therapy\u0026rdquo;, this suggests an underlying impairment in that area and should be annotated accordingly. Otherwise, routine mentions of treatment-related content are excluded from contributing to the functional status classification. However, we observed that the LLM sometimes struggles to exclude such content even when the prompt explicitly directs it to do so.\u003c/p\u003e\n\u003c/li\u003e\n\u003cli\u003e\n\u003cp\u003eLimitations in Medical Knowledge: LLMs trained on general corpora may lack the nuanced medical expertise required for certain assessments. For instance, the Focus on Therapeutic Outcomes (FOTO) system is a specialized web-based assessment where a low final score (on a scale from 0 to 100) indicates impairment in the \u0026ldquo;Mobility, unspecified\u0026rdquo; class. Similarly, the Patient-Specific Functional Scale (PSFS) is widely used to capture patient-specific reports of functional ability. In clinical practice, a low current PSFS score\u0026mdash;or an expectation of future improvement\u0026mdash;serves as a direct indicator of impaired functional status in the \u0026ldquo;Mobility, unspecified\u0026rdquo; class. The Timed Up and Go (TUG) Test is a standardized assessment for evaluating mobility and the risk of falls. However, our analysis revealed that the LLM does not consistently integrate these quantitative assessments into its interpretation of clinical notes. 
As a result, numerical scores or standardized test results might be inferred indirectly.\u003c/p\u003e\n\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eAs noted in the inference error analysis, some false-positive inferences are clinically reasonable and could be considered correct (i.e., true positives). Therefore, we also evaluated the performance of the LLM after treating such reasonable inferences as correct. Table\u0026nbsp;\u003cspan class=\"InternalRef\"\u003e2\u003c/span\u003e presents a comprehensive summary for both Task 1 and Task 2, aggregated across all institutions under this assumption. Performance improved relative to the original results (Supplementary Table S3): the micro-average F1-scores for Task 1 and Task 2 increased to 0.890 and 0.878 at the section level (originally 0.695 and 0.815), 0.942 and 0.925 at the note level (originally 0.819 and 0.849), and 0.962 and 0.948 at the patient level (originally 0.876 and 0.897).\u003c/p\u003e\n\u003cdiv class=\"gridtable\"\u003e\n\u003ctable id=\"Tab2\" border=\"1\"\u003e\u003ccaption\u003e\n\u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e\n\u003cdiv class=\"CaptionContent\"\u003e\n\u003cp\u003eF1-score of Two Tasks Across Mobility Classes Aggregated Across All Institutions with Error-Informed Prompt Refinement When Considering Reasonable Inference as Correct.\u003c/p\u003e\n\u003c/div\u003e\n\u003c/caption\u003e\n\u003cthead\u003e\n\u003ctr\u003e\n\u003cth rowspan=\"2\" align=\"left\"\u003e\n\u003cp\u003eClass\u003c/p\u003e\n\u003c/th\u003e\n\u003cth colspan=\"2\" align=\"left\"\u003e\n\u003cp\u003eSection-Level\u003c/p\u003e\n\u003c/th\u003e\n\u003cth colspan=\"2\" align=\"left\"\u003e\n\u003cp\u003eNote-Level\u003c/p\u003e\n\u003c/th\u003e\n\u003cth colspan=\"2\" align=\"left\"\u003e\n\u003cp\u003ePatient-Level\u003c/p\u003e\n\u003c/th\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003cth align=\"left\"\u003e\n\u003cp\u003eTask 1\u003c/p\u003e\n\u003c/th\u003e\n\u003cth 
align=\"left\"\u003e\n\u003cp\u003eTask 2\u003c/p\u003e\n\u003c/th\u003e\n\u003cth align=\"left\"\u003e\n\u003cp\u003eTask 1\u003c/p\u003e\n\u003c/th\u003e\n\u003cth align=\"left\"\u003e\n\u003cp\u003eTask 2\u003c/p\u003e\n\u003c/th\u003e\n\u003cth align=\"left\"\u003e\n\u003cp\u003eTask 1\u003c/p\u003e\n\u003c/th\u003e\n\u003cth align=\"left\"\u003e\n\u003cp\u003eTask 2\u003c/p\u003e\n\u003c/th\u003e\n\u003c/tr\u003e\n\u003c/thead\u003e\n\u003ctbody\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eChanging and maintaining body position\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.867\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.866\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.930\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.929\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.956\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.951\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eCarrying, moving, and handling objects\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.877\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.878\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.909\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.902\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.907\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.917\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eWalking and moving\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd 
align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.948\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.951\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.982\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.984\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.989\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.982\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eMoving around using transportation\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.877\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.730\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.939\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.750\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.966\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.824\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eMobility, unspecified\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.833\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.810\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.916\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.903\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.963\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.966\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd 
align=\"left\"\u003e\n\u003cp\u003eAverage\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.890\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.878\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.942\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.925\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.962\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e0.948\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003c/tbody\u003e\n\u003ctfoot\u003e\n\u003ctr\u003e\n\u003ctd colspan=\"7\"\u003e*F1: F1-score.\u003c/td\u003e\n\u003c/tr\u003e\n\u003c/tfoot\u003e\n\u003c/table\u003e\n\u003c/div\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec9\" class=\"Section2\"\u003e\n\u003ch2\u003e2.7 Trustworthiness Analyses\u003c/h2\u003e\n\u003cp\u003eLLMs used in clinical contexts must demonstrate trustworthiness across several dimensions, including reliability, safety, and robustness. In this section, we evaluate our approach and identify how our methods and configurations contribute to these attributes.\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\n\u003cp\u003eReliability: We configured the LLM with a temperature setting of 0, ensuring deterministic responses. By eliminating randomness in the generation process, the model\u0026rsquo;s outputs are reproducible: repeated queries yield consistent results. Additionally, our error analysis revealed that although the LLM does make mistakes, these errors are generally \u0026ldquo;reasonable\u0026rdquo; rather than erratic or nonsensical. For instance, the model might misclassify a borderline case of mobility impairment, but it does not invent implausible conditions. 
Such understandable, bounded errors indicate that the model\u0026rsquo;s performance is stable and predictable, which is critical for building clinical trust.\u003c/p\u003e\n\u003c/li\u003e\n\u003cli\u003e\n\u003cp\u003eSafety: Maintaining patient privacy and data confidentiality is paramount in healthcare applications. To this end, we deployed the LLM on secure local servers, preventing any transfer of protected health information (PHI) to external environments. By avoiding reliance on remote or third-party cloud services, we mitigated the risk of unintended data exposure and helped ensure compliance with privacy regulations. This closed-loop infrastructure provides a safer environment for processing sensitive clinical notes.\u003c/p\u003e\n\u003c/li\u003e\n\u003cli\u003e\n\u003cp\u003eGeneralizability: A trustworthy LLM must demonstrate robust performance across various clinical contexts. Although the development dataset was sourced from a single site, our analysis shows that refinements\u0026mdash;such as introducing examples or error-informed instructions\u0026mdash;improved the model\u0026rsquo;s performance not only at the original site but also at two additional independent sites. This cross-site transfer of prompt refinements highlights the model\u0026rsquo;s adaptability and generalizability, suggesting that improvements identified in one environment can successfully translate to others. 
Consequently, the LLM remains effective under slightly shifted data distributions, thereby increasing its overall trustworthiness as a tool for broad clinical application.\u003c/p\u003e\n\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eBy addressing each of these dimensions\u0026mdash;ensuring reproducibility and logical consistency, protecting patient data privacy, and demonstrating cross-institutional adaptability\u0026mdash;we underscore the trustworthiness of our LLM-based annotation pipeline.\u003c/p\u003e\n\u003c/div\u003e"},{"header":"3 Discussion","content":"\u003cp\u003eThis study demonstrates the feasibility of employing LLMs with well-constructed prompt engineering to automatically extract and standardize mobility functional status information from unstructured EHR data. By leveraging an open-source LLM (Llama 3) alongside tailored prompt engineering strategies\u0026mdash;including zero-shot, few-shot, and error-informed prompt refinement\u0026mdash;we developed a pipeline that reliably identifies and classifies mobility-related expressions across diverse clinical note sections. Our evaluations across three healthcare institutions reveal that, while performance varies among different mobility classes, the refined prompt strategy significantly enhances overall accuracy and consistency. The Error-Informed Prompt Refinement configuration was designed to address specific shortcomings observed in the baseline model by incorporating feedback from previous errors. By analyzing misclassifications and instances of low precision, the model was able to adapt its prompts to better clarify ambiguous cases. This adjustment was particularly beneficial for classes that initially suffered from high variability, such as \u0026ldquo;Carrying, moving, and handling objects\u0026rdquo; and \u0026ldquo;Walking and moving\u0026rdquo; in the Mobility Extraction task. 
For Impairment Classification, while the top-performing class (\u0026ldquo;Walking and moving\u0026rdquo;) maintained high performance, the refinement process contributed to a more consistent performance for classes like \u0026ldquo;Changing and maintaining body position\u0026rdquo;. The narrower range of F1-scores (0.793\u0026ndash;0.851) in this setting suggests that error-informed adjustments helped reduce performance fluctuations across institutions.\u003c/p\u003e\u003cp\u003eThe Error-Informed method sometimes exhibited a trade-off where enhancements in precision for some classes came at the expense of recall. For example, in Impairment Classification, although the precision improved for class \u0026ldquo;Mobility, unspecified\u0026rdquo;, the corresponding recall dropped, which in turn affected the F1-score. This indicates that while the refined prompts can be more targeted, they may sometimes become overly conservative, missing some correct instances.\u003c/p\u003e\u003cp\u003ePortability of the model across different healthcare institutions remains an important area for further exploration. Our development took place at an academic research center (Site 1) and was tested at two other institutions (Sites 2 and 3) that include data from community-based real-world practices to validate generalizability. In future work, we aim to test different combinations of development and validation, and also expand data sources with more diverse patient populations, clinical note structures, and documentation styles. This would allow us to systematically assess model robustness and understand the impact of data heterogeneity on performance metrics.\u003c/p\u003e\u003cp\u003eAnother limitation of our approach is that we did not fine-tune the LLM on the annotated dataset. 
Although fine-tuning has the potential to improve performance, our annotated dataset was relatively small, and fine-tuning large models typically requires substantial amounts of labeled data to avoid overfitting. In addition, fine-tuning can be computationally expensive and may increase the risk of overfitting to specific institution documentation styles. Instead, we relied on prompt-engineering strategies that can be more data-efficient and flexible. However, future research could explore lightweight fine-tuning methods (e.g., LoRA or adapter-based approaches) if larger annotated corpora become available.\u003c/p\u003e\u003cp\u003eMoreover, the deterministic configuration and secure local deployment of the model address key concerns regarding reproducibility and patient data privacy, further reinforcing the pipeline\u0026rsquo;s trustworthiness. Despite challenges\u0026mdash;such as handling ambiguous language and balancing precision with recall\u0026mdash;the findings indicate that LLMs can serve as scalable and generalizable tools for clinical applications. Future work may further explore additional domains of trustworthiness, such as fairness and interpretability, to provide an even more comprehensive assessment of the model\u0026rsquo;s reliability and suitability for clinical environments.\u003c/p\u003e\u003cp\u003eLongitudinal validations in real-world settings will be essential for fully integrating these methods into clinical decision-making processes, ultimately contributing to improved patient care and personalized intervention strategies. 
Overall, our work lays a promising foundation for advancing the automated extraction of functional status data, highlighting the transformative role of LLMs in modern healthcare analytics.\u003c/p\u003e"},{"header":"4 Methods","content":"\u003cdiv id=\"Sec12\" class=\"Section2\"\u003e\u003ch2\u003e4.1 Data Preparation\u003c/h2\u003e\u003cp\u003eThis study included three institutions: an academic referral medical center and two community-based practices in the Upper Midwest. Agreements between these institutions preclude direct comparisons of data; therefore, the sites are designated by number (Sites 1\u0026ndash;3). The annotation guidelines were developed using definitions from the ICF to annotate mobility functional status data in clinical notes. We considered five mobility functional status classes, each comprising one or more ICF subclasses. Supplementary Table \u003cspan refid=\"MOESM1\" class=\"InternalRef\"\u003eS1\u003c/span\u003e lists these classes alongside their corresponding ICF codes and subclasses.\u003c/p\u003e\u003cp\u003eWe restricted our analysis to five primary mobility classes, rather than using their more granular ICF subclasses. This decision was driven by two practical considerations. First, our annotation process revealed a high degree of semantic overlap among the ICF subclasses, making it difficult to obtain reliable annotation for those narrower categories. As a result, we consolidated the ICF-based taxonomy into five major classes as defined in the ICF to enhance labeling consistency and reduce confusion during both manual annotation and automated extraction. Second, LLMs may struggle with overly fine-grained distinctions in clinical text, especially if subclass definitions overlap or are ambiguously documented.\u003c/p\u003e\u003cp\u003eTo facilitate focused and coherent annotation, each clinical note was divided into sections. 
Clinical notes at Site 1 include a variety of sections, such as History of Present Illness, Past Medical/Surgical History, Physical Examination, Diagnosis, and more. We chose section-level analysis for several reasons. First, clinical notes can be lengthy, and feeding entire notes into an LLM risks exceeding context length limits\u0026mdash;potentially truncating important information. Second, sections typically focus on specific aspects of a patient\u0026rsquo;s health (e.g., History of Present Illness, Assessment), which makes them natural, semantically coherent units for targeted extraction tasks. Third, section-level segmentation facilitates more precise few-shot prompts and helps the model focus on local contextual cues relevant to mobility. Finally, working at the section level can improve interpretability, since clinicians often review notes by scanning sections, allowing direct alignment of the LLM\u0026rsquo;s output with established clinical documentation practices.\u003c/p\u003e\u003cp\u003eWithin each section, a trained annotator manually identified and labeled mobility classes with related expressions and assigned the appropriate impairment statuses. The annotator assigned \u0026ldquo;Impaired\u0026rdquo; if the section described difficulties, challenges, limitations, potential issues, or impairments, and \u0026ldquo;Unimpaired\u0026rdquo; if the section explicitly described normal function or abilities. If a section did not receive either \u0026ldquo;Impaired\u0026rdquo; or \u0026ldquo;Unimpaired,\u0026rdquo; this indicated that there was no mobility-related information present in that section. Additional details on the annotation process and guidelines can be found in the Supplementary Section.\u003c/p\u003e\u003cp\u003eIn total, 600 notes were annotated. Of these, 200 were physical therapy (PT) or occupational therapy (OT) notes from Site 1, 200 were PT/OT notes from Site 2, and 200 were unrestricted clinic notes from Site 3. 
PT/OT notes typically provide abundant information regarding mobility, while including general clinic notes from Site 3 helps demonstrate the method\u0026rsquo;s broader applicability. After splitting the notes into sections, the dataset comprised 3,810 sections (1,153 in Site 1, 1,075 in Site 2, and 1,582 in Site 3). On average, each section contained 826 characters, with section lengths averaging 932 characters in Site 1, 847 in Site 2, and 7,334 in Site 3. A single section may contain information relevant to multiple mobility classes or none.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec13\" class=\"Section2\"\u003e\u003ch2\u003e4.2 Experiment\u003c/h2\u003e\u003cp\u003eFor prompt development, 200 sections were randomly selected from Site 1 to form a training dataset; the remaining sections served as the test set. Performance was evaluated using precision, recall, and F1-score metrics, both class-specific and institution-specific, across various experimental configurations.\u003c/p\u003e\u003cp\u003eAn LLM-based annotation pipeline was developed to extract mobility functional status information. The pipeline consists of three core components: (1) the LLM, (2) a task-specific prompt combined with a clinical note section, and (3) a post-processing step that converts the LLM\u0026rsquo;s text output into structured data. As shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e3\u003c/span\u003e, the process begins with dividing the clinical notes into distinct sections. Each section is then embedded within a task-specific prompt explicitly designed for mobility functional status extraction. The LLM processes the prompt and generates a textual response based on its analysis and interpretation, and finally, a post-processing step transforms the LLM\u0026rsquo;s generative text output into structured objects. 
Specifically, for each section, the LLM produces, for each of the five mobility classes, a single label\u0026mdash;either \u0026ldquo;Impaired\u0026rdquo; if there is evidence of limitation, \u0026ldquo;Unimpaired\u0026rdquo; if normal function is described, or \u0026ldquo;None\u0026rdquo; if the section contains no information about that class.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eLlama 3, an advanced open-source LLM developed by Meta AI, was employed for this task. Llama 3 provides enhanced efficiency and performance. Integrating Llama 3 into a local server environment ensured both patient data privacy and the use of powerful computational resources for annotation tasks. This setup allowed for the handling of sensitive medical information within a secure infrastructure. The open-source nature of Llama 3 further facilitated customization and adaptability to specific project requirements.\u003c/p\u003e\u003cdiv id=\"Sec14\" class=\"Section3\"\u003e\u003ch2\u003e4.2.1 Performance Evaluation\u003c/h2\u003e\u003cp\u003eWe calculate performance at the class level. For each mobility class, we first evaluate Mobility Extraction by considering any section labeled \u0026ldquo;Impaired\u0026rdquo; or \u0026ldquo;Unimpaired\u0026rdquo; for that class as \u0026ldquo;Mentioned\u0026rdquo; and all other sections as \u0026ldquo;Not Mentioned.\u0026rdquo; Treating \u0026ldquo;Mentioned\u0026rdquo; as the positive class and \u0026ldquo;Not Mentioned\u0026rdquo; as the negative class, we compute precision, recall, and F1-score by comparing the LLM\u0026rsquo;s predicted Mentioned/Not Mentioned labels against the human annotations for that class. 
Next, for Impairment Classification, we restrict our analysis to those sections confirmed as \u0026ldquo;Mentioned.\u0026rdquo; Within this subset, the LLM\u0026rsquo;s output\u0026mdash;distinguishing between \u0026ldquo;Impaired\u0026rdquo; and \u0026ldquo;Unimpaired\u0026rdquo;\u0026mdash;is compared to the human reference label. In other words, once a section is known to address the mobility class, we assess whether the LLM\u0026rsquo;s choice of \u0026ldquo;Impaired\u0026rdquo; versus \u0026ldquo;Unimpaired\u0026rdquo; matches the gold-standard annotation, and we calculate an F1-score for each class with \u0026ldquo;Impaired\u0026rdquo; as the positive label.\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e\u003cdiv id=\"Sec15\" class=\"Section2\"\u003e\u003ch2\u003e4.3 Prompt Engineering\u003c/h2\u003e\u003cdiv id=\"Sec16\" class=\"Section3\"\u003e\u003ch2\u003e4.3.1 Zero-Shot Learning\u003c/h2\u003e\u003cp\u003eIn the zero-shot configuration, the prompt includes only the core task description, the definitions of all relevant mobility classes, a general instruction, and a final question. It does not provide any examples. Under these conditions, the model relies solely on the given prompt, without guidance from sample inputs or outputs. Evaluating zero-shot performance establishes a baseline against which subsequent methods can be compared.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec17\" class=\"Section3\"\u003e\u003ch2\u003e4.3.2 Few-Shot Learning\u003c/h2\u003e\u003cp\u003eFew-shot learning leverages a small number of example cases embedded in the prompt to improve the model\u0026rsquo;s reasoning and output quality. 
Unlike zero-shot prompting, few-shot prompts present the model with illustrative examples\u0026mdash;each accompanied by detailed explanations\u0026mdash;to better convey the task\u0026rsquo;s underlying reasoning steps.\u003c/p\u003e\u003cp\u003eIn this study, a five-shot configuration was used, with the selected examples spanning a variety of scenarios to promote more nuanced understanding. Three strategies were tested for choosing these examples from the training dataset:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eRandom Selection: Five samples are chosen at random.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eK-Means Error Clustering Selection: We first identified the errors that occurred under zero-shot conditions on the training dataset. We then applied the k-means clustering algorithm to group these error cases into five clusters. From the center of each cluster, we selected one representative sample, thereby focusing on common mistakes.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eSimilarity-Based Selection: For each test case, we identified the top five most similar training samples using a K-nearest neighbors approach. These closely related examples serve as relevant reference points that closely match the characteristics of the test input.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eThrough these strategies, we aimed to discover which method of choosing examples leads to the greatest performance improvement, demonstrating how few-shot learning can be optimized to achieve more accurate and reliable results.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec18\" class=\"Section3\"\u003e\u003ch2\u003e4.3.3 Error-Informed Prompt Refinement\u003c/h2\u003e\u003cp\u003eA systematic error analysis of the zero-shot results was conducted to identify patterns in the model\u0026rsquo;s mistakes, such as common misclassifications or omissions. 
We manually examined all false positive and false negative outputs across sections in the training set, catalogued the most common sources of error, and distilled these into a set of explicit \u0026ldquo;exclusion\u0026rdquo; and \u0026ldquo;inclusion\u0026rdquo; rules. These rules were then injected as additional instructions into the prompt so that the LLM would be guided toward avoiding the same mistakes. For example, whenever a section contained a statement such as \u0026ldquo;patient will begin home exercise program\u0026rdquo;, the model tended to treat it as evidence of impairment, even though such statements describe plans or instructions rather than the patient\u0026rsquo;s current function. To address this, we added explicit instructions telling the model to ignore any text that describes a treatment plan, recommendation, or exercise regimen.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec19\" class=\"Section3\"\u003e\u003ch2\u003e4.3.4 Task Decomposition\u003c/h2\u003e\u003cp\u003eOur overarching goal was to extract expressions related to a patient\u0026rsquo;s mobility functional status and classify their impairment status (impaired vs. unimpaired). To potentially improve performance, we divided the task into two subtasks: Mobility Extraction and Impairment Classification. This decomposition enables a clearer, more focused approach and allows for specialized handling of each step. To implement this, we explored two different setups:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eSingle LLM with Two-Task Prompt (Chain-of-Thought Prompting): A single LLM is guided sequentially through the two subtasks with a multi-step prompt. The prompt first instructs the model to assess whether the section contains mobility-related descriptions for each class, then directs it to classify the impairment status. 
By leveraging a chain-of-thought process, this setup encourages methodical, stepwise reasoning.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eTwo LLMs with Task Specialization: Alternatively, this approach employs two dedicated LLMs, each focusing on a single subtask. The first LLM determines whether relevant mobility descriptions are present. The second LLM then uses those determinations to classify impairment status. By dividing responsibilities in this manner, each model can specialize its role.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003ch2\u003eEthics approval and consent to participate\u003c/h2\u003e\u003cp\u003eThe study was approved by the Mayo Clinic Institutional Review Board and the Olmsted Medical Center Institutional Review Boards. Patients included in this study provided research authorization.\u003c/p\u003e\u003c/p\u003e\u003cp\u003e\u003ch2\u003eCompeting Interests Statement\u003c/h2\u003e\u003cp\u003eThe author(s) declare no competing interests.\u003c/p\u003e\u003c/p\u003e\u003ch2\u003eFunding\u003c/h2\u003e\u003cp\u003eThis study was supported by Eric and Wendy Schmidt Fund for AI Research and Innovation and NIH (National Institutes of Health) R01 AG068007.\u003c/p\u003e\u003ch2\u003eAuthor Contribution\u003c/h2\u003e\u003cp\u003eX.L. developed the model, analyzed, and interpreted the data, drafted the initial manuscript, and revised the manuscript. H.J. performed data curation and corpus annotation. S.P. and J.S. helped with cohort identification and corpus annotation. S.P., J.S., H.J. and M.G. participated in the interpretation of the data and contributed to manuscript editing and revisions. 
S.S. conceptualized and designed the study, supervised data collection, model development, and analysis, interpreted the data, and reviewed and finalized the manuscript.\u003c/p\u003e\u003ch2\u003eAcknowledgement\u003c/h2\u003e\u003cp\u003eWe acknowledge the support of our funding agencies, which are listed in the Funding section.\u003c/p\u003e\u003ch2\u003eData Availability\u003c/h2\u003e\u003cp\u003eThe data used for this work were drawn from electronic health records, which include identifiable data and thus cannot be shared for privacy and legal reasons.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eUnited Nations. World population ageing report. \u003cem\u003eWorld Popul. Ageing\u003c/em\u003e (2024).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eWorld Health Organization et al. World report on ageing and health (2024).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eMather, M. \u0026amp; Scommegna, P. Fact sheet: aging in the United States. \u003cem\u003ePopul Ref. Bureau\u003c/em\u003e (2024).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eMayer-Oakes, S. A., Oye, R. K. \u0026amp; Leake, B. Predictors of mortality in older patients following medical intensive care: the importance of functional status. \u003cem\u003eJ. Am. Geriatr. Soc.\u003c/em\u003e \u003cb\u003e39\u003c/b\u003e, 862\u0026ndash;868 (1991).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003ePonzetto, M. et al. Risk factors for early and late mortality in hospitalized older patients: the continuing importance of functional status. \u003cem\u003eJournals Gerontol. Ser. A: Biol. Sci. Med. Sci.\u003c/em\u003e \u003cb\u003e58\u003c/b\u003e, M1049\u0026ndash;M1054 (2003).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eNarain, P. et al. Predictors of immediate and 6-month outcomes in hospitalized elderly patients: The importance of functional status. \u003cem\u003eJ. Am. Geriatr. 
Soc.\u003c/em\u003e \u003cb\u003e36\u003c/b\u003e, 775\u0026ndash;783 (1988).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eMetz, D. H. Mobility of older people and their quality of life. \u003cem\u003eTransp. policy\u003c/em\u003e. \u003cb\u003e7\u003c/b\u003e, 149\u0026ndash;152 (2000).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eMusselwhite, C. \u0026amp; Haddad, H. Mobility, accessibility and quality of later life. \u003cem\u003eQual. Ageing Older Adults\u003c/em\u003e. \u003cb\u003e11\u003c/b\u003e, 25\u0026ndash;37 (2010).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eForhan, M. \u0026amp; Gill, S. V. Obesity, functional mobility and quality of life. \u003cem\u003eBest Pract. Res. Clin. Endocrinol. metabolism\u003c/em\u003e. \u003cb\u003e27\u003c/b\u003e, 129\u0026ndash;137 (2013).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eRosen, S. L. \u0026amp; Reuben, D. B. Geriatric assessment tools. \u003cem\u003eMt. Sinai J. Medicine: J. Transl Pers. Med.\u003c/em\u003e \u003cb\u003e78\u003c/b\u003e, 489\u0026ndash;497 (2011).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eKimia, A. A., Savova, G., Landschaft, A. \u0026amp; Harper, M. B. An introduction to natural language processing: how you can get more from those electronic notes you are generating. \u003cem\u003ePediatr. Emerg. care\u003c/em\u003e. \u003cb\u003e31\u003c/b\u003e, 536\u0026ndash;541 (2015).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eFu, S. et al. Quality assessment of functional status documentation in EHRs across different healthcare institutions. \u003cem\u003eFront. Digit. Heal\u003c/em\u003e. \u003cb\u003e4\u003c/b\u003e, 958539 (2022).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eKreimeyer, K. et al. Natural language processing systems for capturing and standardizing unstructured clinical information: a systematic review. \u003cem\u003eJ. 
biomedical Inf.\u003c/em\u003e \u003cb\u003e73\u003c/b\u003e, 14\u0026ndash;29 (2017).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eThirunavukarasu, A. J. et al. Large language models in medicine. \u003cem\u003eNat. Med.\u003c/em\u003e \u003cb\u003e29\u003c/b\u003e, 1930\u0026ndash;1940 (2023).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eSinghal, K. et al. Large language models encode clinical knowledge. \u003cem\u003eNature\u003c/em\u003e \u003cb\u003e620\u003c/b\u003e, 172\u0026ndash;180 (2023).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eNewman-Griffis, D. \u0026amp; Zirikly, A. Embedding transfer for low-resource medical named entity recognition: a case study on patient mobility. \u003cem\u003earXiv preprint arXiv:1806.02814\u003c/em\u003e (2018).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eKukafka, R., Bales, M. E., Burkhardt, A. \u0026amp; Friedman, C. Human and automated coding of rehabilitation discharge summaries according to the international classification of functioning, disability, and health. \u003cem\u003eJ. Am. Med. Inf. Assoc.\u003c/em\u003e \u003cb\u003e13\u003c/b\u003e, 508\u0026ndash;515 (2006).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eWorld Health Organization. \u003cem\u003eInternational Classification of Functioning, Disability, and Health: Children \u0026amp; Youth Version: ICF-CY\u003c/em\u003e (World Health Organization, 2007).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eBales, M., Kukafka, R., Burkhardt, A. \u0026amp; Friedman, C. Extending a medical language processing system to the functional status domain. In \u003cem\u003eAMIA Annual Symposium Proceedings\u003c/em\u003e, vol. 888 (American Medical Informatics Association, 2005).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eAgaronnik, N., Lindvall, C., El-Jawahri, A., He, W. \u0026amp; Iezzoni, L. 
Use of natural language processing to assess frequency of functional status documentation for patients newly diagnosed with colorectal cancer. \u003cem\u003eJAMA Oncol.\u003c/em\u003e \u003cb\u003e6\u003c/b\u003e, 1628\u0026ndash;1630 (2020).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eNewman-Griffis, D. et al. Linking free text documentation of functioning and disability to the ICF with natural language processing. \u003cem\u003eFront. rehabilitation Sci.\u003c/em\u003e \u003cb\u003e2\u003c/b\u003e, 742702 (2021).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eFu, S. et al. FedFSA: Hybrid and federated framework for functional status ascertainment across institutions. \u003cem\u003eJ. Biomed. Inf.\u003c/em\u003e \u003cb\u003e152\u003c/b\u003e, 104623 (2024).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eGoel, A. et al. LLMs accelerate annotation for medical information extraction. In \u003cem\u003eMachine Learning for Health (ML4H)\u003c/em\u003e, 82\u0026ndash;100 (PMLR, 2023).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eZaretsky, J. et al. Generative artificial intelligence to transform inpatient discharge summaries to patient-friendly language and format. \u003cem\u003eJAMA Netw. open.\u003c/em\u003e \u003cb\u003e7\u003c/b\u003e, e240357\u0026ndash;e240357 (2024).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eSinghal, K. et al. Toward expert-level medical question answering with large language models. \u003cem\u003eNat Medicine\u003c/em\u003e \u003cb\u003e1\u0026ndash;8\u003c/b\u003e (2025).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eDubey, A. et al. The Llama 3 herd of models. 
\u003cem\u003earXiv preprint arXiv:2407.21783\u003c/em\u003e (2024).\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":true,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"scientific-reports","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"scirep","sideBox":"Learn more about [Scientific Reports](http://www.nature.com/srep/)","snPcode":"","submissionUrl":"","title":"Scientific Reports","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"Scientific Reports","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"","lastPublishedDoi":"10.21203/rs.3.rs-7104310/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-7104310/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eWith global aging, assessing functional status is vital for precision medicine. Electronic Health Records (EHRs), particularly unstructured data, hold abundant information on patient mobility. This study explores using Large Language Models (LLMs) to extract and standardize mobility status from unstructured EHR data (i.e., clinical notes). We annotated 600 clinical notes from three health care institutions located in southeastern Minnesota and west-central Wisconsin, focusing on expressions of mobility and associated impairment. Leveraging the open-source Llama 3 model, we tested various prompting strategies\u0026mdash;including zero-shot, few-shot, and task decomposition\u0026mdash;and evaluated their performance. Error analysis showed that while the model sometimes inferred impairments without explicit evidence, most errors were clinically reasonable, often reflecting borderline or ambiguous cases. While considering reasonable inference as correct, at the patient-level, Mobility Extraction achieves a micro-average accuracy of 0.952 with an F1-score of 0.962, and Impairment Classification produces a micro-average accuracy of 0.912 and an F1-score of 0.948. 
A local, deterministic setup improved trustworthiness by ensuring consistent outputs, safeguarding privacy, and demonstrating cross-institution generalizability. These findings highlight the feasibility of LLM-based solutions for extracting mobility functional status from unstructured EHR data, supporting both clinical applications and research.\u003c/p\u003e","manuscriptTitle":"Mobility Functional Status Ascertainment in Electronic Health Records using Large Language Models","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-07-29 10:13:44","doi":"10.21203/rs.3.rs-7104310/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Revision requested","date":"2025-07-26T04:30:25+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2025-07-22T22:02:30+00:00","index":"","fulltext":""},{"type":"submitted","content":"Scientific Reports","date":"2025-07-22T21:59:10+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"scientific-reports","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"scirep","sideBox":"Learn more about [Scientific Reports](http://www.nature.com/srep/)","snPcode":"","submissionUrl":"","title":"Scientific Reports","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"Scientific Reports","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"addba80a-d6b3-4590-a712-b5b078463c30","owner":[],"postedDate":"July 29th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"published-in-journal","subjectAreas":[{"id":52151947,"name":"Biological sciences/Computational biology and bioinformatics"},{"id":52151948,"name":"Health sciences/Health care"},{"id":52151949,"name":"Physical sciences/Mathematics and computing"},{"id":52151950,"name":"Health sciences/Medical research"}],"tags":[],"updatedAt":"2026-01-26T16:05:39+00:00","versionOfRecord":{"articleIdentity":"rs-7104310","link":"https://doi.org/10.1038/s41598-026-37025-9","journal":{"identity":"scientific-reports","isVorOnly":false,"title":"Scientific Reports"},"publishedOn":"2026-01-23 15:58:43","publishedOnDateReadable":"January 23rd, 2026"},"versionCreatedAt":"2025-07-29 10:13:44","video":"","vorDoi":"10.1038/s41598-026-37025-9","vorDoiUrl":"https://doi.org/10.1038/s41598-026-37025-9","workflowStages":[]},"version":"v1","identity":"rs-7104310","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-7104310","identity":"rs-7104310","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}