Fine-Tuning LLaMA2 for Summarizing Discharge Notes: Evaluating the Role of Highlighted Information

doi:10.21203/rs.3.rs-7181141/v1

2025 · doi:10.21203/rs.3.rs-7181141/v1

preprint OA: closed

Research Square · Research Article

Mahshad Koohi Habibi Dehkordi, Yehoshua Perl, Fadi P Deek, Hao Liu

This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-7181141/v1. This work is licensed under a CC BY 4.0 License. Status: Posted. Version 1 posted.

Abstract

Purpose: This study investigates whether incorporating highlighted information in discharge notes improves the quality of the summaries generated by Large Language Models (LLMs). Specifically, it evaluates the effect of using highlighted versus unhighlighted inputs for fine-tuning the LLaMA2-13B model for the summarization task.
Methods: We fine-tuned the LLaMA2-13B model in two variants using the MIMIC-IV-Ext-BHC dataset: one variant fine-tuned on the highlighted discharge notes (H-LLaMA), and the other on the same set of notes without highlighting (U-LLaMA). Highlighting was performed automatically using a Cardiology Interface Terminology (CIT) presented in our previous work. H-LLaMA and U-LLaMA were evaluated on a randomly selected test set of 100 discharge notes using multiple metrics (including BERTScore, ROUGE-L, BLEU, and SummaC_CONV). Additionally, LLM-based judgment via ChatGPT-4o was used to rate coherence, fluency, conciseness, and correctness, alongside a manual completeness evaluation on a random sample of 20 notes.

Results: H-LLaMA consistently outperformed U-LLaMA across all metrics. H-summaries (generated using H-LLaMA), in comparison to U-summaries (generated using U-LLaMA), achieved higher BERTScore (63.75 vs. 59.61), ROUGE-L (23.43 vs. 21.82), BLEU (10.4 vs. 8.41), and SummaC_CONV (67.7 vs. 40.2). Manual review also showed improved completeness for H-summaries (54.2% vs. 48.1%). All improvements were statistically significant (p < 0.05). Moreover, LLM-based evaluation indicated higher average ratings across coherence, correctness, and conciseness.

Conclusion: Incorporating highlighted information into discharge notes for fine-tuning LLMs enhances summarization quality. This approach provides a scalable method for improving discharge note summarization and has the potential to support better clinical decision-making through more informative and reliable summaries.

Keywords: Bioinformatics, Large Language Models, LLaMA, Fine-tuning, Discharge notes, Summarization, Electronic Health Records

Introduction

Electronic Health Records (EHRs) contain a wide range of clinical information, including details of a patient’s hospital stay, progress notes, medical history, medications, vital signs, and diagnostic reports [1].
The exponential growth of EHRs has revolutionized medical data management, enhancing access to patient records and interoperability, and aiding healthcare professionals in making informed decisions [2]. However, EHRs are often perceived as cluttered and difficult to navigate, which can significantly hinder healthcare providers' ability to efficiently extract relevant insights, potentially impacting clinical decision-making and patient safety [3, 4]. Several factors contribute to this complexity, including the large number of abbreviations and medical jargon, variability in how different healthcare providers document data in EHRs, copy-pasting practices that clutter records with redundant information, and the design of EHR systems themselves. Summarization of EHRs can mitigate this problem to a great extent by enabling healthcare providers to quickly access the most relevant information, reducing errors, and supporting informed clinical decisions [2, 5].

Given the complexity and volume of EHR data, automated summarization methods have become increasingly important. There has been a notable shift from traditional text summarization methods to techniques that use Large Language Models (LLMs) [6-8]. These models have evolved from pre-training and fine-tuning approaches to prompt-based methods [9-11]. While LLMs demonstrate strong potential in summarization tasks, several challenges persist. One critical issue is "hallucination," where LLMs generate plausible but factually incorrect information. In the healthcare domain, such inaccuracies can lead to misdiagnoses, inappropriate treatment plans, or worse [12, 13]. Additionally, LLMs may overlook crucial clinical details, potentially compromising patient care [14]. Fine-tuning LLMs for EHR summarization offers a promising way to mitigate these challenges by enhancing the models' understanding of medical terminology and context [15-17].
Fine-tuning is a process that allows LLMs to adapt their general knowledge to specific domains by training them on smaller, domain-relevant datasets, enabling them to learn task-specific patterns and terminology. During fine-tuning, the model's pre-trained weights are updated to optimize its performance for the target task, allowing it to learn task-specific features and improve accuracy [18, 19]. By adapting LLMs to healthcare-specific datasets, fine-tuned models can better grasp domain-specific terminologies and contextual nuances, leading to improved performance in summarization tasks [20]. Moreover, fine-tuning can reduce hallucinations by reinforcing factual content through curated training data [16, 17, 21, 22]. This approach also promotes the inclusion of essential clinical details that might otherwise be overlooked in generic LLM outputs.

The process of fine-tuning LLMs for EHR summarization relies on high-quality datasets [23]. Studies have shown that training models on domain-specific data significantly enhances their ability to interpret and summarize clinical narratives effectively [24, 25]. The quality of the dataset used for fine-tuning directly impacts the model's summarization performance [26]. The MIMIC-IV-Ext-BHC dataset, a curated collection of discharge notes (a key component of EHRs) and their corresponding Brief Hospital Course (BHC) summaries, was introduced in [27, 28]. That study benchmarked five models (GPT-3.5, GPT-4, Clinical-T5-Large, Llama2-13B, and FLAN-UL2) using both prompting-based and fine-tuning adaptation strategies. The reported results demonstrate that fine-tuned Llama2-13B achieved the highest quantitative scores (BLEU, BERTScore, ROUGE-L). However, although fine-tuning can help reduce the frequency of hallucinations by training the model on domain-specific data, it does not entirely eliminate the risk of generating inaccurate information, as the underlying architecture of LLMs can still lead to hallucinations [29].
These studies [29, 30] emphasize that while fine-tuning LLMs can improve their performance, challenges remain, including the potential loss of previously acquired knowledge during the fine-tuning process, which can lead to critical information missing from the generated summaries.

In our previous study [31], we demonstrated that summarizing discharge notes with highlighted information, using a prompt engineering strategy, improves the accuracy of summaries compared to summarizing unhighlighted discharge notes. The highlighting is performed with the Cardiology Interface Terminology (CIT), designed with the benefit of Machine Learning (ML) techniques in our previous work [32]. In this study, we fine-tune LLaMA2-13B on the MIMIC-IV-Ext-BHC dataset using the LoRA technique. We then investigate the impact of incorporating highlighted discharge notes, in which the detailed information is highlighted, into the summarization process. Specifically, we compare the summaries generated by a model fine-tuned on highlighted discharge notes with those generated by a model fine-tuned on the original, unhighlighted notes. To evaluate our approach, we analyze the resulting summaries from both approaches using the BLEU, BERTScore, and ROUGE-L metrics. Additionally, we assess the completeness of the summaries through manual review and employ LLMs as judges to evaluate summary quality based on coherence, fluency, conciseness, and correctness. We aim to enhance the clarity and usability of clinical records by systematically examining how highlighting input notes affects the performance of a fine-tuned LLaMA2-13B model in discharge note summarization. This, in turn, can support better healthcare decision-making and improve patient outcomes.

Background

2.1 Text summarization and related work

Text summarization techniques can be broadly classified into three categories: extractive summarization, abstractive summarization, and LLM-based summarization [33].
Extractive summarization [34] selects key sentences or phrases directly from the original text without modifying their wording. It relies on ranking mechanisms to identify the most informative parts of the content. However, such techniques have disadvantages, including a potential lack of coherence, as extracted sentences may not flow naturally, and the inability to paraphrase or generalize information [33]. Unlike extractive methods, abstractive summarization [35] generates new text that conveys the key ideas of the original content in a more concise and coherent manner. It offers enhanced fluency and improved contextual understanding, but it also carries the risk of introducing factual inconsistencies, increases computational costs, and requires large-scale training datasets to ensure accuracy [33]. With the advent of LLMs, and due to their strong performance in summarization tasks [6, 36-38], there has been a significant shift toward using LLMs for text summarization. LLM-based summarization utilizes pre-trained transformer models to generate summaries based on prompts or fine-tuning. These models can perform both extractive and abstractive summarization, adapting their output based on the task requirements. As a result, numerous research efforts have focused on developing automated summarization methods for medical texts.

Five LLMs were tested in generating summaries for discharge notes using the MIMIC-IV-Ext-BHC dataset [27]. The study covered both open-source models (Clinical-T5-Large, FLAN-UL2, and Llama2-13B) and proprietary models (GPT-3.5 and GPT-4). These models were adapted using different strategies, including fine-tuning and in-context learning. The results showed that fine-tuned Llama2-13B achieved the best performance among open-source models based on quantitative metrics such as BLEU and BERTScore.
However, GPT-4 with in-context learning demonstrated the most robustness across varying input lengths and was preferred over human summaries in a clinical reader study, where five clinicians compared its summaries to those written by human experts. The results also emphasized that while open-source models like Llama2-13B performed well and could match human-written summaries, proprietary models like GPT-4 had a clear edge in producing summaries that clinicians preferred.

Furthermore, a framework for radiology report summarization using ChatGPT has been proposed [37]. In-context learning and iterative optimization were used to improve the Automatic Impression Generation (AIG) task, which involves summarizing the "Findings" section of a radiology report into the "Impression" section. Instead of fine-tuning the model, prompts were dynamically constructed from similar reports retrieved via a similarity search technique. These retrieved examples provide contextual information, allowing ChatGPT to generate more relevant summaries. The method was evaluated on the MIMIC-CXR and OpenI datasets, demonstrating state-of-the-art performance in radiology report summarization without requiring additional training data.

Finally, a system for EHR summarization using the Google Flan-T5 model has also been proposed [2] to generate clinician-focused summaries based on clinician-specified topics. Flan-T5 was fine-tuned on an EHR question-answering dataset formatted in the Stanford Question Answering Dataset (SQuAD) style. The fine-tuning process utilized the Seq2SeqTrainer from the Hugging Face Transformers library [39], with optimized hyperparameters to enhance performance. The reported results are an Exact Match (EM) score of 81.81%, ROUGE scores (ROUGE-1: 96.03%, ROUGE-2: 86.67%, ROUGE-L: 96.10%), and a BLEU score of 63%.
1.1 Dataset

The MIMIC-IV-Ext-BHC dataset [27, 28] is a collection of Brief Hospital Course (BHC) summaries paired with their corresponding discharge notes extracted from the MIMIC-IV-Note database [40, 41]. MIMIC-IV-Ext-BHC was created by preprocessing discharge summaries from MIMIC-IV-Note, which contains 331,794 de-identified clinical notes from 145,915 patients admitted to the Beth Israel Deaconess Medical Center between 2008 and 2019. Both datasets are hosted on PhysioNet [28, 40] and can be accessed after signing a Data Use Agreement and completing the required training. The collection of patient information and the creation of these research resources were reviewed by the Institutional Review Board (IRB) [42] at the Beth Israel Deaconess Medical Center, which granted a waiver of informed consent and approved the data-sharing initiative.

The MIMIC-IV-Ext-BHC dataset is designed to facilitate research on hospital course summarization, addressing the challenge of extracting concise and relevant information from lengthy clinical narratives. This dataset covers a diverse patient population reflective of the broader MIMIC-IV cohort, which includes various age groups, genders, and medical conditions. To create the MIMIC-IV-Ext-BHC dataset, the MIMIC-IV notes were preprocessed through tokenization, section identification, normalization, and cleaning. This process separated the BHC from the rest of the clinical note. To validate data quality, a manual review of 100 randomly sampled clinical notes was conducted, and a clinical team reviewed 30 note-summary pairs, reporting no significant issues with the extracted content.

The MIMIC-IV-Ext-BHC dataset, consisting of 270,033 discharge note–BHC pairs, contains the columns note_id, input, target, input_tokens, and target_tokens. The note_id column is the unique identifier for each row in the dataset and matches the note_id column of the original MIMIC-IV-Note dataset.
The input field includes the preprocessed discharge note text, excluding the BHC section. The target field includes the standardized and cleaned BHC text, providing a concise summary of the patient's hospital course. The input_tokens and target_tokens columns store the tokenized lengths of the clinical notes and BHC summaries.

1.2 Automatic highlighting

In our previous work [32, 43, 44], we presented a multi-stage method for curating a Cardiology Interface Terminology (CIT) tailored for automatic highlighting of information in the discharge notes of cardiology patients. This process consists of two main phases. In the first stage, we constructed an initial version of the CIT, referred to as ICIT, by incorporating concepts from 11 cardiology-related subhierarchies of SNOMED CT [45]. However, ICIT alone did not sufficiently capture all detailed information present in the discharge notes. To address this, in the second stage, we employed a semi-automatic, iterative approach to enrich ICIT. Specifically, we mined fine-granularity phrases from discharge notes that contained ICIT concepts. All the mined phrases were reviewed both automatically and manually, and the legitimate ones were added to CIT, forming a new version of CIT. All the phrases that were illegitimate, structurally or semantically, were added to a reject list (R). In each iteration, we used the latest version of CIT to highlight the build dataset and evaluated its performance by measuring two metrics: coverage and breadth. Coverage is defined as the percentage of the words in a note that are highlighted, and breadth is defined as the average number of words per highlighted concept. This iterative process continued until further improvements in coverage became negligible. The output of the second stage serves as the training data for the third stage.
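The coverage and breadth definitions above can be illustrated with a minimal sketch. It assumes whitespace tokenization and non-overlapping highlights given as word-index spans; the function name, the span representation, and the sample sentence are hypothetical and are not taken from the CIT work.

```python
def coverage_and_breadth(note_words, highlighted_spans):
    """Compute (coverage, breadth) for one note.

    note_words: list of word tokens in the note (whitespace split is assumed).
    highlighted_spans: list of (start, end) word-index pairs, end exclusive,
        one pair per highlighted concept; spans are assumed not to overlap.
    """
    if not note_words:
        raise ValueError("note has no words")
    span_lengths = [end - start for start, end in highlighted_spans]
    highlighted_words = sum(span_lengths)
    # Coverage: percentage of the note's words that are highlighted.
    coverage = 100.0 * highlighted_words / len(note_words)
    # Breadth: average number of words per highlighted concept.
    breadth = highlighted_words / len(span_lengths) if span_lengths else 0.0
    return coverage, breadth


# Hypothetical 8-word note with two highlighted concepts.
note = "patient with severe aortic stenosis underwent valve replacement".split()
spans = [(2, 5), (6, 8)]  # "severe aortic stenosis", "valve replacement"
cov, br = coverage_and_breadth(note, spans)
print(cov, br)  # 62.5 2.5
```

In this toy case, 5 of 8 words are highlighted (coverage 62.5%) across 2 concepts (breadth 2.5 words per concept).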
All concepts included in the resulting CIT at the end of stage two were considered positive samples (labeled as 1), while all phrases in the rejection list R were considered negative samples (labeled as 0). The third stage proceeded as follows. First, we embedded the labeled phrases of the training data using Clinical BioBERT [46], and then we trained a neural network (NN) classifier on the embedded dataset. After conducting a grid search [47] to tune the hyperparameters, we arrived at an NN model consisting of a single hidden layer with 100 neurons, the ReLU activation function [48], and the Adam optimizer [49]. Once trained, the model was used to classify newly extracted phrases from discharge notes as either legitimate concepts or illegitimate phrases. Phrases classified as legitimate were added to CIT, resulting in the final terminology, referred to as CIT_ML. CIT_ML demonstrated a coverage of 74.21% and a breadth of 1.68 on the test dataset. Figure 1 shows an excerpt of a discharge note highlighted by CIT_ML, with information highlighted in a blue background color.

Method

For this study, we fine-tuned two variants of the LLaMA 2 (13B) model separately: one using highlighted discharge notes, which we refer to as H-LLaMA, and the other using the same notes without highlighted content, which we refer to as U-LLaMA. The fine-tuning procedure was identical in both settings, with the only difference being whether the input included highlighted information. The following sections detail the data preparation and the fine-tuning procedure.

2.1 Data preparation

As described in the Background section, the MIMIC-IV-Ext-BHC dataset consists of clinical note-summary pairs, where each pair represents a complete discharge note text and its corresponding condensed BHC summary. Each row in this dataset includes a note_id, which is a unique identifier matching the note_id column from the original MIMIC-IV-Note database.
In our previous work, we developed CIT to highlight the detailed information in discharge notes of cardiology patients. Therefore, to identify cardiology-related records in MIMIC-IV-Note, we focused on two Intensive Care Units (ICUs) associated with cardiology: the Coronary Care Unit (CCU) and the Cardiac Vascular Intensive Care Unit (CVICU). We queried the "discharge" table in MIMIC-IV-Note for patients admitted to the CCU or CVICU. This resulted in approximately 18,600 records, from which we extracted the note_id values. Next, we filtered MIMIC-IV-Ext-BHC to retain only the rows whose note_id matched those of patients admitted to the CCU or CVICU in MIMIC-IV. This resulted in approximately 14,000 records. From the 14,000 cardiology-related records, we randomly selected 1,000 records for fine-tuning the model, which we refer to as training data, and another 100 records for evaluating summaries generated by the fine-tuned model, which we refer to as test data.

We highlighted the discharge notes of the training data and test data using the automatic highlighting method we proposed and reported in [32, 43, 44], which writes the highlights for each note into an HTML file. The highlighted information is enclosed in tags with the background color #ADD8E6. To prevent the model from being confused by HTML tags, each HTML file was converted into plain text, and the highlighted information was enclosed within square brackets (‘[’ and ‘]’).

We prepared four datasets for model training and evaluation:

Unhighlighted Training Set (UTrain): This set consists of 1,000 original discharge notes without any highlighted information, each paired with its corresponding BHC summary.

Unhighlighted Test Set (UTest): This set includes 100 original, unhighlighted discharge notes along with their corresponding BHC summaries and was used for evaluation purposes.
Highlighted Training Set (HTrain): This set contains the same 1,000 discharge notes used in UTrain, but with detailed information highlighted. The highlighted content was enclosed within square brackets (‘[’ and ‘]’). Each note is paired with its corresponding BHC summary.

Highlighted Test Set (HTest): This set includes the same 100 discharge notes used in UTest, with highlighted information enclosed within square brackets. Each note is paired with its associated BHC summary.

We converted both UTrain and HTrain into JSON format. Each training example was represented as a JSON object with three fields: "instruction", "input", and "output", resulting in two separate JSON datasets. The UTrain dataset used the instruction: "Summarize the clinical note into a brief hospital course." The HTrain dataset used the instruction: "Summarize the clinical note into a brief hospital course, focusing on the information enclosed within '[' and ']'." For both datasets, the prompt (i.e., instruction + input) and the output (i.e., BHC) were tokenized, concatenated, and padded or truncated to a fixed length of 4,096 tokens. Because the input token length in the MIMIC-IV-Ext-BHC dataset averages 2,267 ± 914 and the output token length averages 564 ± 410 [28], the combined prompt and output length typically remained within this limit, and truncation was rarely needed.

2.2 Fine-Tuning procedure

For fine-tuning, we used the HuggingFace Transformers library [39] in combination with PEFT (Parameter-Efficient Fine-Tuning) [50] and LoRA (Low-Rank Adaptation) [51]. The model was initialized from a pre-trained LLaMA-2 13B checkpoint in HuggingFace format. The tokenizer was loaded from the same checkpoint and extended to include the custom discharge note section tags. These special section tags of the input discharge notes were explicitly added to the tokenizer as special tokens.
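The data preparation steps described above (converting highlighted HTML to bracketed plain text, then building instruction-style JSON records) could be sketched roughly as follows. This is only an illustration under stated assumptions: the source gives the highlight color (#ADD8E6) but not the tag name, so a `<span style="background-color:#ADD8E6">` wrapper is assumed here, highlights are assumed not to be nested, and the helper names and sample note are hypothetical.

```python
import html
import json
import re

# Assumed markup (hypothetical): the text names only the background color,
# not the tag, so a <span style="background-color:#ADD8E6"> wrapper is assumed.
HIGHLIGHT = re.compile(
    r'<span[^>]*background-color:\s*#ADD8E6[^>]*>(.*?)</span>',
    flags=re.IGNORECASE | re.DOTALL,
)
ANY_TAG = re.compile(r'<[^>]+>')

# Instruction wording as quoted in the text for the highlighted training set.
H_INSTRUCTION = (
    "Summarize the clinical note into a brief hospital course, "
    "focusing on the information enclosed within '[' and ']'."
)


def html_to_bracketed_text(page):
    """Wrap highlighted spans in square brackets and drop remaining HTML tags."""
    text = HIGHLIGHT.sub(lambda m: "[" + m.group(1) + "]", page)
    text = ANY_TAG.sub("", text)
    # Entities are unescaped last, so escaped literal angle brackets in the
    # note text survive the tag-stripping step above.
    return html.unescape(text).strip()


def make_record(instruction, note_text, bhc):
    """Build one training example with the three fields named in the text."""
    return {"instruction": instruction, "input": note_text, "output": bhc}


# Hypothetical one-sentence note and placeholder summary.
page = ('<p>Patient with <span style="background-color:#ADD8E6">'
        'severe aortic stenosis</span> admitted.</p>')
text = html_to_bracketed_text(page)
record = make_record(H_INSTRUCTION, text, "placeholder BHC summary")
print(text)  # Patient with [severe aortic stenosis] admitted.
print(json.dumps(record, indent=2))
```

An unhighlighted record would use the same structure with the shorter instruction and the original note text.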
These tags preserved semantic structure and helped guide the model’s understanding of discharge note organization. The model was fine-tuned locally using 2 NVIDIA GPUs, 66 CPU cores, and 256 GB of RAM, ensuring that no third party had access to the notes. Fine-tuning was performed using LoRA with a commonly adopted configuration: rank = 8, α = 32, and dropout = 0.10. LoRA adaptation was applied specifically to the query and value projection layers of the attention mechanism, referred to as q_proj and v_proj in the model implementation. Training was conducted for 9 epochs using the AdamW optimizer [52] with a learning rate of 2e-4 and a weight decay of 0.005. We used a per-device batch size of 5 and gradient accumulation over 8 steps. The dataset was split into 90% for training and 10% for evaluation.

2.3 Generating summary from fine-tuned models

After fine-tuning, each of the two fine-tuned LoRA adapters was separately merged with the base LLaMA-2 13B model to produce a standalone fine-tuned model for inference. Inference used generation prompt structures consistent with the training format. Following best practices in recent clinical text generation literature [53-55], inference was performed using nucleus sampling [56] with temperature = 0.7 and top-p = 0.9. We refer to the summaries generated by U-LLaMA as U-summaries, and those from H-LLaMA as H-summaries.

2.4 Evaluation metrics

Evaluation involves assessing the quality of the generated summaries in relation to the gold-standard summary, in this case the BHC. Summaries should be evaluated from different perspectives. In most related studies, extensive manual work has been conducted to assess summary quality alongside automatic metrics. We performed evaluation in three categories: manual evaluation, automatic evaluation, and LLM-based evaluation.
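As a rough sense of scale for the LoRA configuration reported in the fine-tuning procedure (rank = 8, α = 32, applied to q_proj and v_proj), the number of trainable adapter parameters can be estimated with a short calculation. The hidden size (5120) and layer count (40) below are assumptions taken from the public LLaMA-2 13B configuration rather than from the text, and the α/r scaling is the standard LoRA formulation, not something the text states.

```python
# Values reported in the text.
RANK = 8
ALPHA = 32
TARGET_MODULES = ("q_proj", "v_proj")
PER_DEVICE_BATCH = 5
GRAD_ACCUM_STEPS = 8

# Assumptions from the public LLaMA-2 13B configuration (not stated in the text):
HIDDEN_SIZE = 5120  # q_proj and v_proj are HIDDEN_SIZE x HIDDEN_SIZE matrices
NUM_LAYERS = 40


def lora_params_per_matrix(d_in, d_out, rank):
    """LoRA adds A (rank x d_in) and B (d_out x rank) beside a frozen weight."""
    return rank * (d_in + d_out)


per_matrix = lora_params_per_matrix(HIDDEN_SIZE, HIDDEN_SIZE, RANK)
trainable = per_matrix * len(TARGET_MODULES) * NUM_LAYERS
scaling = ALPHA / RANK  # standard LoRA factor applied to the low-rank update
sequences_per_step = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS  # per device

print(per_matrix)          # 81920
print(trainable)           # 6553600
print(scaling)             # 4.0
print(sequences_per_step)  # 40
```

Under these assumptions the adapters add about 6.6 million trainable parameters, a small fraction of the roughly 13 billion frozen base weights, which is consistent with fine-tuning on two local GPUs.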
Automatic and LLM-based evaluations were conducted on all 100 notes in the test dataset, and a manual evaluation was performed on a random subset of 20 notes.

2.4.1 Manual Evaluation

Completeness: Measures how well the summary captures the information from the reference BHC summary. A high completeness score means the summary includes more of the reference information and omits fewer critical details. Completeness is calculated using (1). We evaluate the completeness of the generated summaries against the reference summary, rather than the entire discharge note, because much of the information in the discharge note is not expected to appear in the summary. When gold-standard reference summaries are available, we assume that only the information contained in the reference summary needs to be included in the generated summary.

2.4.2 Automatic Evaluation

BERTScore: BERTScore [57] uses contextual embeddings from a pre-trained BERT model to compute semantic similarity between the generated and reference summaries. Unlike traditional n-gram overlap methods, BERTScore compares words in the generated and reference texts based on their meaning rather than exact word matches. BERTScore provides three main scores. Precision: how much of the generated text is semantically supported by the reference. Recall: how much of the reference is captured by the generated text. F1 score: the harmonic mean of precision and recall, often used as the main BERTScore. In this study we also report the F1 score as the BERTScore.

ROUGE-L (Recall-Oriented Understudy for Gisting Evaluation – Longest Common Subsequence): ROUGE metrics [58] measure how much of the reference summary appears in the generated summary by measuring the overlap of n-grams, word sequences, and longest common subsequences between the generated and reference summaries. One common variant of ROUGE is ROUGE-L, which measures the longest common subsequence to capture fluency and coherence.
Its ability to reflect holistic similarity, rather than just local n-gram matches, makes it more suitable for evaluating summary-level similarity.

BLEU (Bilingual Evaluation Understudy): The BLEU metric [59] measures how much of the generated summary appears in the reference summary by measuring n-gram overlap between the generated summary and the reference summary. BLEU is widely used for assessing machine translation and text generation tasks, including summarization.

SummaC_CONV: SummaC_CONV [60], specifically designed for the summarization task, is a factual consistency evaluation metric that leverages Natural Language Inference (NLI) [61] to assess whether a generated summary is factually consistent with its reference text. Unlike naive sentence-level approaches, SummaC_CONV does not assume a one-to-one alignment between source and summary sentences. Instead, it compares each summary sentence with multiple relevant sentences from the reference text, aggregating entailment scores. By mapping each summary sentence to multiple sentences, it correctly evaluates cases where multiple facts are summarized into a single sentence or appear in a different order than in the original document.

2.4.3 LLM-based evaluation

LLMs have shown great potential as evaluators for LLM-generated summaries, evaluating various aspects of the generated texts [62, 63]. In this study, we use ChatGPT 4o (through Azure) to evaluate the generated summaries of each discharge note. We crafted and refined a prompt to instruct ChatGPT 4o to evaluate each summary based on four criteria and to assign a score from 1 to 5 for each, where 1 indicates the lowest and 5 the highest. Here is the final version of the prompt: “Act as a cardiologist and read a discharge note (A) alongside its reference summary (B) and two LLM-generated summaries (C and D).
Your task is to grade both summaries on a scale of 1 to 5 for the following four metrics: Coherence: Does the summary maintain a logical flow of ideas and present information clearly? Fluency: Is it grammatically well-formed and easy to read? Conciseness: Does it effectively condense information, keeping only the most relevant clinical details? Correctness: Does the summary accurately reflect the content of the discharge note (A) and the reference summary (B), without introducing false or misleading information? Assign a score (1-5) for each metric for both summaries (C and D). A score of 1 indicates the lowest quality, while 5 indicates the highest.”

Results

Figure 2 (a) shows an example of a BHC summary of one discharge note, containing 181 words, alongside the H-summary (Fig. 2 b) and U-summary (Fig. 2 c) of its corresponding discharge note, which contain 126 and 135 words, respectively. In this example, six key pieces of information are highlighted in green in Fig. 2 (a). The blue highlights in Fig. 2 (b) indicate information that is present in the reference summary (a) but missing from the U-summary (c). Conversely, yellow highlights denote information present in the U-summary (c) but absent from the H-summary (b). Pink highlights in Fig. 2 (a) indicate information that does not appear in either summary. All unhighlighted pieces of information in (a) correspond to information that appears in both summaries. The H-summary achieves a completeness score of 56.6%, while the U-summary achieves a completeness score of 47.8%.

Manual Evaluation

For the 20 randomly selected notes, the average completeness of the U-summaries is 48.1%, while the H-summaries demonstrated a higher average completeness of 54.2%, an increase of about 6 percentage points. Additionally, for 17 notes, the completeness of the H-summary is higher than that of the corresponding U-summary, and for 3 notes, the completeness is equal.
Using Fisher's exact test [64, 65], we compared the completeness of the H-summary group (17 notes) and the U-summary group (3 notes) with a significance level of 0.05. The resulting p-value was 0.0009, indicating a statistically significant result (p < 0.05).

Automatic evaluation

As shown in Table 1, the average BERTScore of the U-summaries for 100 notes is 59.61, while that of the H-summaries is 63.75, with 91 notes having a higher BERTScore for H-summaries than for U-summaries. Using Fisher's exact test, we compared the BERTScore of the H-summary group (91 notes) and the U-summary group (9 notes) with a significance level of 0.05. The resulting p-value was less than 0.00001, indicating a statistically significant result (p < 0.05). The average ROUGE-L score for the U-summaries is 21.82, whereas it is 23.43 for the H-summaries, with 90 notes having a higher ROUGE-L score for H-summaries than for U-summaries. Using Fisher's exact test, we compared the ROUGE-L score of the H-summary group (90 notes) and the U-summary group (10 notes) with a significance level of 0.05. The resulting p-value was less than 0.00001, indicating a statistically significant result (p < 0.05). Also, the average BLEU score of the U-summaries is 8.41, compared to 10.4 for the H-summaries, with 81 notes having a higher BLEU score for H-summaries than for U-summaries. Using Fisher's exact test, we compared the BLEU score of the H-summary group (81 notes) and the U-summary group (19 notes) with a significance level of 0.05. The resulting p-value was less than 0.00001, indicating a statistically significant result (p < 0.05). Finally, the average SummaC_CONV score is 40.2 for the U-summaries and 67.7 for the H-summaries, with 98 notes having a higher SummaC_CONV score for H-summaries than for U-summaries. Using Fisher's exact test, we compared the SummaC_CONV score of the H-summary group (98 notes) and the U-summary group (2 notes) with a significance level of 0.05.
The p-value for SummaC_CONV was 0.0002, indicating a statistically significant difference (p < 0.05).

Table 1 The average results of automatic evaluations for the BHC reference summaries, U_summaries (U_S), and H_summaries (H_S)

| Metric | BHC | U_S | H_S |
|---|---|---|---|
| Word count | 393.19 | 300.38 | 331.67 |
| BERTScore | - | 59.61 | 63.75 |
| ROUGE-L | - | 21.82 | 23.43 |
| BLEU | - | 8.41 | 10.4 |
| SummaC_CONV | - | 40.2 | 67.7 |

LLMs evaluation

ChatGPT 4o generally rated both U_summaries and H_summaries highly across all four criteria (Coherence, Fluency, Conciseness, and Correctness), indicating that both types of summaries are well written overall. However, H_summaries received slightly higher average scores than U_summaries on three criteria (Coherence, Conciseness, and Correctness), while for Fluency the averages were almost equal. The average scores are 4.84 (U_summaries) and 4.91 (H_summaries) for Coherence, 4.91 and 4.92 for Fluency, 4.65 and 4.81 for Conciseness, and 4.79 and 4.91 for Correctness. Averaged over all criteria, H_summaries scored 4.88, slightly outperforming U_summaries, which scored 4.79.

Also, as shown in Table 1, the average length of the BHC summaries for these 100 notes is 393.19 words, while the average length is 300.38 words for U-summaries and 331.67 words for H-summaries, indicating that H-summaries tend to include more information.

Table 2 The results of LLM evaluation for U_summaries (U_S) and H_summaries (H_S)

| Criterion | U_S | H_S |
|---|---|---|
| Coherence | 4.84 | 4.91 |
| Fluency | 4.91 | 4.92 |
| Conciseness | 4.65 | 4.81 |
| Correctness | 4.79 | 4.91 |
| Average | 4.79 | 4.88 |

Table 3 shows the minimum and maximum values, as well as the standard deviation, of all evaluation metrics for U_summaries and H_summaries.

Table 3 Minimum, maximum, and standard deviation of the evaluation metrics for U_summaries (U_S) and H_summaries (H_S)

| Metric | Summary | Min | Max | STD |
|---|---|---|---|---|
| Completeness | U_S | 23.2 | 73.3 | 14.01 |
| Completeness | H_S | 29.3 | 80.0 | 12.36 |
| BERTScore | U_S | 57.8 | 69.2 | 2.86 |
| BERTScore | H_S | 59.3 | 69.3 | 2.75 |
| ROUGE-L | U_S | 14.1 | 34.3 | 7.19 |
| ROUGE-L | H_S | 19.2 | 39.3 | 5.54 |
| BLEU | U_S | 1.2 | 13.7 | 5.35 |
| BLEU | H_S | 4.2 | 22.4 | 4.97 |
| SummaC_CONV | U_S | 7.5 | 63.4 | 15.13 |
| SummaC_CONV | H_S | 60.1 | 89.3 | 8.21 |
| Coherence | U_S | 4 | 5 | 0.36 |
| Coherence | H_S | 4 | 5 | 0.22 |
| Fluency | U_S | 4 | 5 | 0.36 |
| Fluency | H_S | 4 | 5 | 0.22 |
| Conciseness | U_S | 4 | 5 | 0.43 |
| Conciseness | H_S | 4 | 5 | 0.40 |
| Correctness | U_S | 4 | 5 | 0.40 |
| Correctness | H_S | 4 | 5 | 0.30 |

Discussion

In this study, we fine-tuned LLaMA2-13B twice, separately and under the same conditions: once using unhighlighted discharge notes and once using the same notes with highlighting. By keeping everything else constant, we aimed to isolate and measure the effect of highlighting on fine-tuning. The results show that fine-tuning with highlighted discharge notes improves the quality of the generated summaries, compared to fine-tuning with unhighlighted notes, across all evaluation metrics. For the completeness, BERTScore, ROUGE-L, BLEU, and SummaC_CONV metrics, the differences were statistically significant (p < 0.05) under the Fisher exact test.

The BHC section of a discharge note offers a concise narrative of the patient's clinical journey. Composing this section is widely recognized as a cognitively demanding and time-intensive task for clinicians, since it requires synthesizing a large volume of notes and reports generated throughout the patient's hospitalization into a coherent summary [66, 67]. This process is not only laborious but also susceptible to errors, given the high documentation burden [66, 68]. Furthermore, BHCs exhibit significant variability in both style and content. Authored by different clinicians, they reflect diverse writing habits, and they frequently alternate between extractive and abstractive summarization strategies [67]. Prior research has shown that discharge summaries and their BHC sections may omit critical information or introduce excessive, redundant, or even erroneous content [68, 69].
In addition, BHCs sometimes contain new, summary-only information that is not clearly documented elsewhere in the record. Although BHCs are commonly used as reference summaries for training summarization models, they are inherently noisy. Their variability in coherence and completeness, and their potential misrepresentation of clinical facts, introduce significant challenges for model training and evaluation [67, 69].

After reviewing several examples, we noted that some BHCs are very short, shallow, and lacking in important details, while others are overly long and include too much information. There is no consistent structure or style among them. Indeed, the BHC targets have a mean token length of 564 with a standard deviation of 410 [28], which is high relative to the mean and indicates substantial variation in target length. This inconsistency makes it harder for the language model to learn what kind of summary it is supposed to generate. If all BHCs followed a similar pattern, both the highlighted and the unhighlighted models would have a clearer target during training.

Another factor that affects the quality of the generated summaries is the presence of errors or mismatches between the original discharge notes and their corresponding BHCs. In some cases, the BHC includes information that does not appear in the original note. For example, in note “11874424-DS-9,” the BHC mentions “atorvastatin 80 mg,” but the original discharge note refers only to a prior dose of atorvastatin 40 mg, with no indication of a change to 80 mg. Similarly, the original note states that the patient was accompanied by his son, while the BHC refers to a daughter. Such inaccuracies or additions in the BHCs negatively affect the measured quality of the summaries generated by the LLM. Since the model generates summaries based only on the original note, it should not include information that was never there.
As a result, when these summaries are evaluated against BHCs, the evaluation scores decrease, not because the model failed, but because the reference summaries are flawed.

Future work

We plan to fine-tune the model using a larger dataset. Additionally, we aim to train the model on a subset of examples whose BHCs follow a more consistent style. For instance, we can select samples where the ratio of BHC length to original discharge note length falls within a similar range. This would help ensure that the summaries, whether brief or detailed, are more uniform in structure, making it easier to evaluate the model’s performance.

We also aim to conduct a preliminary evaluation of BHCs using LLMs, assessing them against their corresponding discharge notes to determine which are more comprehensive and less erroneous. This evaluation will help identify BHCs suitable to serve as gold-standard summaries. The BHCs that receive higher ratings will be selected, along with their corresponding discharge notes, to form the training dataset for fine-tuning the model.

We also plan to conduct hybrid training, fine-tuning LLaMA on a mixed dataset that includes both highlighted and non-highlighted inputs. This will allow us to observe how the model behaves when trained with both types of data. Finally, we plan to increase the test set for manual evaluation to 100 notes in order to achieve a more accurate assessment.

Conclusion

This study investigates the effect of highlighting information in discharge notes on the summaries generated by Large Language Models (LLMs). Highlighting is done automatically using a Cardiology Interface Terminology (CIT) proposed in our previous work. To carry out our experiment, we fine-tuned the LLaMA2-13B model twice using the MIMIC-IV-Ext-BHC dataset: once with the highlighted discharge notes and once with the same set of discharge notes without highlighting.
Our results demonstrate that incorporating highlighted information into the fine-tuning process improves the quality of discharge note summarization. That is, summaries generated from highlighted inputs consistently outperformed those from unhighlighted inputs across all evaluation metrics, including BERTScore, ROUGE-L, BLEU, SummaC_CONV, completeness, and LLM-based judgments. This work provides a scalable framework for improving discharge note summarization using fine-tuned LLMs. By enhancing the clarity and completeness of clinical summaries, this approach has the potential to support more effective healthcare delivery and better-informed decision-making.

Statements and Declarations

Data availability
The code used for fine-tuning the LLaMA2-13B model is publicly available at: https://github.com/mahshadkoohihd/Fine_tuning_LLaMA2

Competing Interests
The authors declare no competing interests.

Acknowledgments
The authors have no acknowledgments to declare.

References

1. Menachemi, N. and T.H. Collum, Benefits and drawbacks of electronic health record systems. Risk management and healthcare policy, 2011: p. 47-55.
2. Madzime, R. and C. Nyirenda, Enhanced Electronic Health Records Text Summarization Using Large Language Models. arXiv preprint arXiv:2410.09628, 2024.
3. Bowman, S., Impact of electronic health record systems on information integrity: quality and safety implications. Perspectives in health information management, 2013. 10(Fall).
4. O’Malley, A.S., et al., Are electronic medical records helpful for care coordination? Experiences of physician practices. Journal of general internal medicine, 2010. 25: p. 177-185.
5. Apathy, N.C., et al., Documentation dynamics: note composition, burden, and physician efficiency. Health Services Research, 2023. 58(3): p. 674-685.
6. Zhao, W.X., et al., A survey of large language models. arXiv preprint arXiv:2303.18223, 2023.
7. Thirunavukarasu, A.J., et al., Large language models in medicine. Nature medicine, 2023. 29(8): p. 1930-1940.
8. Hadi, M.U., et al., A survey on large language models: Applications, challenges, limitations, and practical usage. Authorea Preprints, 2023.
9. Kwag, K.H., et al., Providing doctors with high-quality information: an updated evaluation of web-based point-of-care information summaries. Journal of medical Internet research, 2016. 18(1): p. e15.
10. Shestov, A., et al., Finetuning large language models for vulnerability detection. IEEE Access, 2025.
11. Rallapalli, S., et al., Fine-Tuning LLMs for Report Summarization: Analysis on Supervised and Unsupervised Data. arXiv preprint arXiv:2503.10676, 2025.
12. Pivovarov, R. and N. Elhadad, Automated methods for the summarization of electronic health records. Journal of the American Medical Informatics Association, 2015. 22(5): p. 938-947.
13. Sarzynski, E., et al., Opportunities to improve clinical summaries for patients at hospital discharge. BMJ quality & safety, 2017. 26(5): p. 372-380.
14. Casey, J.A., et al., Using electronic health records for population health research: a review of methods and applications. Annual review of public health, 2016. 37(1): p. 61-81.
15. Wu, X.-K., et al., LLM Fine-Tuning: Concepts, Opportunities, and Challenges. Big Data and Cognitive Computing, 2025. 9(4): p. 87.
16. Hu, M., et al., Mitigating large language model hallucination with faithful finetuning. arXiv preprint arXiv:2406.11267, 2024.
17. Rumiantsau, M., et al., Beyond Fine-Tuning: Effective Strategies for Mitigating Hallucinations in Large Language Models for Data Analytics. arXiv preprint arXiv:2410.20024, 2024.
18. Liu, C., et al., CPMI-ChatGLM: Parameter-efficient fine-tuning ChatGLM with Chinese patent medicine instructions. Scientific Reports, 2024. 14(1): p. 6403.
19. Hamzah, F. and N. Sulaiman, Optimizing Llama 7B for Medical Question Answering: A Study on Fine-Tuning Strategies and Performance on the MultiMedQA Dataset.
20. Li, I., et al., Neural natural language processing for unstructured data in electronic health records: a review. Computer Science Review, 2022. 46: p. 100511.
21. Perković, G., A. Drobnjak, and I. Botički. Hallucinations in llms: Understanding and addressing challenges. In 2024 47th MIPRO ICT and Electronics Convention (MIPRO). 2024. IEEE.
22. Jha, S., et al. Dehallucinating large language models using formal methods guided iterative prompting. In 2023 IEEE International Conference on Assured Autonomy (ICAA). 2023. IEEE.
23. Parthasarathy, V.B., et al., The ultimate guide to fine-tuning llms from basics to breakthroughs: An exhaustive review of technologies, research, best practices, applied research challenges and opportunities. arXiv preprint arXiv:2408.13296, 2024.
24. Ahmad, P.N., et al., BIR: Biomedical Information Retrieval System for Cancer Treatment in Electronic Health Record Using Transformers. Sensors, 2023. 23(23): p. 9355.
25. He, Z., et al., Enriching real-world data with social determinants of health for health outcomes and health equity: successes, challenges, and opportunities. Yearbook of Medical Informatics, 2023. 32(01): p. 253-263.
26. Majdik, Z.P., et al., Sample Size Considerations for Fine-Tuning Large Language Models for Named Entity Recognition Tasks: Methodological Study. JMIR AI, 2024. 3: p. e52095.
27. Aali, A., et al., A dataset and benchmark for hospital course summarization with adapted large language models. Journal of the American Medical Informatics Association, 2024: p. ocae312.
28. Aali, A., et al. MIMIC-IV-Ext-BHC: Labeled Clinical Notes Dataset for Hospital Course Summarization. 2024; Available from: https://physionet.org/content/labelled-notes-hospital-course/1.1.0/.
29. Du, X., et al., Generative large language models in electronic health records for patient care since 2023: a systematic review. medRxiv, 2024.
30. Acharya, A., et al., Clinical risk prediction using language models: benefits and considerations. Journal of the American Medical Informatics Association, 2024: p. ocae030.
31. Koohi Habibi Dehkordi, M., et al., Improving Large Language Models Summarization by Highlighting Discharge Notes: A Comparative Evaluation. JMIR Med Inform (forthcoming). doi:10.2196/66476, 2025.
32. Dehkordi, M.K.H., et al. Skimming of Electronic Health Records Highlighted by an Interface Terminology Curated with Machine Learning Mining. In BIOSTEC (2). 2024.
33. Jin, H., et al., A comprehensive survey on process-oriented automatic text summarization with exploration of llm-based methods. arXiv preprint arXiv:2403.02901, 2024.
34. Zhong, M., et al., Extractive summarization as text matching. arXiv preprint arXiv:2004.08795, 2020.
35. Gupta, S. and S.K. Gupta, Abstractive summarization: An overview of the state of the art. Expert Systems with Applications, 2019. 121: p. 49-65.
36. Van Veen, D., et al., Adapted large language models can outperform medical experts in clinical text summarization. Nature medicine, 2024. 30(4): p. 1134-1142.
37. Ma, C., et al., An Iterative Optimizing Framework for Radiology Report Summarization with ChatGPT. IEEE Transactions on Artificial Intelligence, 2024.
38. Hake, J., et al., Quality, Accuracy, and Bias in ChatGPT-Based Summarization of Medical Abstracts. The Annals of Family Medicine, 2024. 22(2): p. 113-120.
39. Wolf, T., et al., Huggingface's transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019.
40. Johnson, A.E., et al., MIMIC-IV, a freely accessible electronic health record dataset. Scientific data, 2023. 10(1): p. 1.
41. Johnson, A., et al., Mimic-iv. PhysioNet. Available online at: https://physionet.org/content/mimiciv/1.0/ (accessed August 23, 2021), 2020: p. 49-55.
42. Grady, C., Institutional review boards: Purpose and challenges. Chest, 2015. 148(5): p. 1148-1155.
43. Dehkordi, M.K.H., et al., Using annotation for computerized support for fast skimming of cardiology electronic health record notes, in 2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). 2023, IEEE. p. 4043-4050.
44. Mahshad Koohi H. Dehkordi, S.Z., Yehoshua Perl, Fadi P. Deek, Gai Elhanan, Andrew J. Einstein, Curation of a Cardiology Interface Terminology for Highlighting Electronic Health Records using Machine Learning. Submitted to a Journal, 2024.
45. Donnelly, K., SNOMED-CT: The advanced terminology and coding system for eHealth. Stud Health Technol Inform, 2006. 121: p. 279-90.
46. Alsentzer, E., et al., Publicly available clinical BERT embeddings. arXiv preprint arXiv:1904.03323, 2019.
47. Liashchynskyi, P. and P. Liashchynskyi, Grid search, random search, genetic algorithm: a big comparison for NAS. arXiv preprint arXiv:1912.06059, 2019.
48. Agarap, A.F., Deep learning using rectified linear units (relu). arXiv preprint arXiv:1803.08375, 2018.
49. Jais, I.K.M., A.R. Ismail, and S.Q. Nisa, Adam optimization algorithm for wide and deep neural network. Knowledge Engineering and Data Science, 2019. 2(1): p. 41-46.
50. Han, Z., et al., Parameter-efficient fine-tuning for large models: A comprehensive survey. arXiv preprint arXiv:2403.14608, 2024.
51. Hu, E.J., et al., Lora: Low-rank adaptation of large language models. ICLR, 2022. 1(2): p. 3.
52. Llugsi, R., et al. Comparison between Adam, AdaMax and Adam W optimizers to implement a Weather Forecast based on Neural Networks for the Andean city of Quito. In 2021 IEEE Fifth Ecuador Technical Chapters Meeting (ETCM). 2021. IEEE.
53. Koraş, O.A., et al., Towards Conditioning Clinical Text Generation for User Control. arXiv preprint arXiv:2502.17571, 2025.
54. Su, Y., et al., A contrastive framework for neural text generation. Advances in Neural Information Processing Systems, 2022. 35: p. 21548-21561.
55. Peng, C., et al., A study of generative large language model for medical research and healthcare. NPJ digital medicine, 2023. 6(1): p. 210.
56. Holtzman, A., et al., The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751, 2019.
57. Zhang, T., et al., Bertscore: Evaluating text generation with bert. arXiv preprint arXiv:1904.09675, 2019.
58. Lin, C.-Y. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out. 2004.
59. Papineni, K., et al. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics. 2002.
60. Laban, P., et al., SummaC: Re-visiting NLI-based models for inconsistency detection in summarization. Transactions of the Association for Computational Linguistics, 2022. 10: p. 163-177.
61. MacCartney, B., Natural language inference. 2009: Stanford University.
62. Sun, Z., et al., Principle-driven self-alignment of language models from scratch with minimal human supervision. Advances in Neural Information Processing Systems, 2024. 36.
63. He, Z., et al., Quality of Answers of Generative Large Language Models vs Peer Patients for Interpreting Lab Test Results for Lay Patients: Evaluation Study. ArXiv, 2024.
64. Upton, G.J., Fisher's exact test. Journal of the Royal Statistical Society: Series A (Statistics in Society), 1992. 155(3): p. 395-402.
65. Fisher Exact Test. Available from: https://www.socscistatistics.com/tests/fisher/default2.aspx.
66. Adams, G., J. Zucker, and N. Elhadad. A meta-evaluation of faithfulness metrics for long-form hospital-course summarization. In Machine Learning for Healthcare Conference. 2023. PMLR.
67. Adams, G., et al. What’s in a summary? laying the groundwork for advances in hospital-course summarization. In Proceedings of the conference. Association for Computational Linguistics. North American Chapter. Meeting. 2021.
68. Searle, T., et al., Discharge summary hospital course summarisation of in patient electronic health record text with clinical concept guided deep pre-trained transformer models. Journal of Biomedical Informatics, 2023. 141: p. 104358.
69. Adams, G., Generating Faithful and Complete Hospital-Course Summaries from the Electronic Health Record. 2024: Columbia University.
2","display":"","copyAsset":false,"role":"figure","size":820470,"visible":true,"origin":"","legend":"\u003cp\u003ea) A BHC summary with 181 words, b) The H-summary of the corresponding discharge note with 135 words and 56.5% completeness, BERTScore 67.7, ROUGE-L 38.2, BELU 13.9 and Sumac_CONV 61.4, c) The U-summary of the same note with 126 words and 47.8% completeness, BERTScore 66.1, ROUGE-L 33.4, BELU 12.2 and Sumac_CONV 34.7. Green highlights in (a) indicates the information appeared in one of (b) or (c). The pink highlight in (a) indicates missed information in both summaries. The blue highlight in (b) indicates information form (a) that are missed in (c). The yellow highlights in (c) indicate information items from (a) that are missed in (b)\u003c/p\u003e","description":"","filename":"2.png","url":"https://assets-eu.researchsquare.com/files/rs-7181141/v1/446e3742c5c146f29280d233.png"},{"id":87369305,"identity":"cf9be709-15ff-4b46-8b00-b37f56d65849","added_by":"auto","created_at":"2025-07-23 06:59:21","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":1849835,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-7181141/v1/d8f86995-5963-47c4-8fb2-c40b4d0627b2.pdf"}],"financialInterests":"The authors declare no competing interests.","formattedTitle":"\u003cp\u003e\u003cstrong\u003eFine-Tuning LLaMA2 for Summarizing Discharge Notes:\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eEvaluating the Role of Highlighted Information\u003c/strong\u003e\u003c/p\u003e","fulltext":[{"header":"Introduction","content":"\u003cp\u003eElectronic Health Records (EHRs) contain a wide range of clinical information, including details of a patient\u0026rsquo;s hospital stay, progress notes, medical history, medications, vital signs, and diagnostic reports [1]. 
The exponential growth of EHRs has revolutionized medical data management, enhancing access to patient records, interoperability and aiding healthcare professionals in making informed decisions [2].\u003c/p\u003e\n\u003cp\u003eHowever, EHRs are often perceived as cluttered and difficult to navigate, which can significantly hinder healthcare providers\u0026apos; ability to efficiently extract relevant insights, potentially impacting clinical decision-making and patient safety [3, 4]. Several factors contribute to this complexity, including the vast amount of abbreviations and medical jargon, variability in how different healthcare providers document data in EHRs, copy-pasting practices that clutter records with redundant information, and the design of EHR systems themselves.\u003c/p\u003e\n\u003cp\u003eSummarization of EHRs can solve this problem to a great extent by enabling healthcare providers to quickly access the most relevant information, reducing errors, and supporting informed clinical decisions [2, 5].\u003c/p\u003e\n\u003cp\u003eGiven the complexity and volume of EHR data, automated summarization methods have become increasingly important. There has been a notable shift from traditional text summarization methods to techniques that use Large Language Models (LLMs) [6-8]. These models have evolved from pre-training and fine-tuning approaches to prompt-based methods [9-11]. While LLMs demonstrate strong potential in summarization tasks, several challenges persist. One critical issue is \u0026quot;hallucination,\u0026quot; where LLMs generate plausible but factually incorrect information. In the healthcare domain, such inaccuracies can lead to misdiagnoses, inappropriate treatment plans, or worse [12, 13]. 
Additionally, LLMs may overlook crucial clinical details, potentially compromising patient care [14].\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eFine-tuning LLMs for EHR summarization offers a promising solution to mitigating these challenges by enhancing its understanding of medical terminology and context [15-17]. \u003cstrong\u003eFine-tuning\u003c/strong\u003e is a process that allows LLMs to adapt their general knowledge to specific domains by training them on smaller, domain-relevant datasets, enabling \u003cstrong\u003ethem\u003c/strong\u003e to learn task-specific patterns and terminology. During fine-tuning, the model\u0026apos;s pre-trained weights are updated to optimize its performance for the target task, allowing it to learn task-specific features and improve accuracy\u0026nbsp;[18, 19].\u003c/p\u003e\n\u003cp\u003eBy adapting LLMs to healthcare-specific datasets, fine-tuned models can better grasp domain-specific terminologies and contextual nuances, leading to improved performance in summarization tasks [20]. Moreover, fine-tuning can reduce hallucinations by reinforcing factual content through curated training data [16, 17, 21, 22]. This approach also ensures the inclusion of essential clinical details that might otherwise be overlooked in generic LLM outputs.\u003c/p\u003e\n\u003cp\u003eThe process of fine-tuning LLMs for EHR summarization relies on high-quality datasets [23]. Studies have shown that training models on domain-specific data significantly enhances their ability to interpret and summarize clinical narratives effectively [24, 25]. The quality of the dataset used for fine-tuning directly impacts the model\u0026apos;s summarization performance [26].\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eThe MIMIC-IV-BHC dataset, a curated collection of discharge notes that are a key component of EHRs, and corresponding Brief Hospital Course (BHC) summaries, has been introduced [27, 28] introduced. 
This study benchmarked 5 models (GPT-3.5, GPT-4, Clinical-T5-Large, Llama2-13B, and FLAN-UL2 using both prompting-based and fine-tuning adaptation strategies. The reported results demonstrate that fine-tuned Llama2-13B achieved the highest quantitative scores (BLEU, BERT-Score, ROUGE-L).\u003c/p\u003e\n\u003cp\u003eHowever, although fine-tuning can help reduce the frequency of hallucinations by training the model on domain-specific data, it does not entirely eliminate the risk of generating inaccurate information, as the underlying architecture of LLMs can still lead to hallucinations [29]. These studies [29, 30] emphasize that while fine-tuning LLMs can improve their performance, there are still challenges, including the potential loss of previously acquired knowledge during the fine-tuning process, which can lead to missing critical information in the generated summaries.\u003c/p\u003e\n\u003cp\u003eIn our previous study [31], we demonstrated that summarizing discharge notes with highlighted information, using a prompt engineering strategy, improves the accuracy of summaries compared to summarizing unhighlighted discharge notes. The highlighting is performed with the Cardiology Interface Terminology (CIT) designed with the benefit of Machine Learning (ML) techniques in our previous work [32].\u003c/p\u003e\n\u003cp\u003eIn this study, we fine-tune LLaMA2-13B on the MIMIC-IV-BHC dataset using the LoRA technique. We then investigate the impact of incorporating highlighted discharge notes, in which the detailed information is highlighted, into the summarization process. Specifically, we compare the summaries generated when the fine-tuned model is provided with highlighted discharge notes versus when it is fine-tuned with the original, unhighlighted ones.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eTo evaluate our approach, we analyze the resulting summaries, from both approaches, using BLEU, BERT-Score, and ROUGE-L metrics. 
Additionally, we assess the completeness of the summaries through manual review and employ LLMs as judges to evaluate summaries quality based on coherence, fluency, conciseness, and correctness. We aim to enhance the clarity and usability of clinical records by systematically examining how highlighting input notes affects the performance of a fine-tuned LLaMA2-13B model in discharge note summarization. This, in turn, can support better healthcare decision-making and improve patient outcomes.\u003c/p\u003e"},{"header":"Background","content":"\u003ch2\u003e2.1 Text summarization and related work\u003c/h2\u003e\n\u003cp\u003eText summarization techniques can be broadly classified into three categories: \u003cstrong\u003eextractive summarization\u003c/strong\u003e\u003cstrong\u003e, \u003c/strong\u003e\u003cstrong\u003eabstractive summarization\u003c/strong\u003e\u003cstrong\u003e, \u003c/strong\u003eand\u003cstrong\u003eLLM-based summarization \u003c/strong\u003e\u003cstrong\u003e[33]\u003c/strong\u003e.\u003c/p\u003e\n\u003cp\u003eExtractive summarization [34] selects key sentences or phrases directly from the original text without modifying their wording. It relies on ranking mechanisms to identify the most informative parts of the content. However, such techniques contend with some disadvantages including a potential lack of coherence, as extracted sentences may not flow naturally, and the inability to paraphrase or generalize information [33].\u003c/p\u003e\n\u003cp\u003eUnlike extractive methods, abstractive summarization [35] generates new text that conveys the key ideas of the original content in a more concise and coherent manner. 
It offers enhanced fluency and improved contextual understanding, but it risks introducing factual inconsistencies, incurs higher computational costs, and requires large-scale training datasets to ensure accuracy [33].\u003c/p\u003e\n\u003cp\u003eWith the advent of LLMs, and due to their strong performance in summarization tasks [6, 36-38], there has been a significant shift toward using LLMs for text summarization. LLM-based summarization utilizes pre-trained transformer models to generate summaries based on prompts or fine-tuning. These models can perform both extractive and abstractive summarization, adapting their output based on the task requirements. As a result, numerous research efforts have focused on developing automated summarization methods for medical texts.\u003c/p\u003e\n\u003cp\u003eFive LLMs were tested in generating summaries for discharge notes using the MIMIC-IV-Ext-BHC dataset [27]. The study covered both open-source models (Clinical-T5-Large, FLAN-UL2, and Llama2-13B) and proprietary models (GPT-3.5 and GPT-4). These models were adapted using different strategies, including fine-tuning and in-context learning. The results showed that fine-tuned Llama2-13B achieved the best performance among open-source models based on quantitative metrics such as BLEU and BERT-Score. However, GPT-4 with in-context learning demonstrated the most robustness across varying input lengths and was preferred over human summaries in a clinical reader study, where five clinicians compared its summaries to those written by human experts. The results also emphasized that while open-source models like Llama2-13B performed well and could match human-written summaries, proprietary models like GPT-4 had a clear edge in producing summaries that clinicians preferred.\u003c/p\u003e\n\u003cp\u003eFurthermore, a framework for radiology report summarization, using ChatGPT, has been proposed [37].
In-context learning and iterative optimization were used to improve the Automatic Impression Generation (AIG) task, which involves summarizing the \u0026quot;Findings\u0026quot; section of a radiology report into the \u0026quot;Impression\u0026quot; section. Instead of fine-tuning the model, prompts were dynamically constructed from similar reports retrieved via a similarity search technique. These retrieved examples provide contextual information, allowing ChatGPT to generate more relevant summaries. The method was evaluated on the MIMIC-CXR and OpenI datasets, demonstrating state-of-the-art performance in radiology report summarization without requiring additional training data.\u003c/p\u003e\n\u003cp\u003eFinally, a system for EHR summarization, using the Google Flan-T5 model, has also been proposed [2] to generate clinician-focused summaries based on clinician-specified topics. Flan-T5 was fine-tuned on an EHR question-answering dataset formatted in the Stanford Question Answering Dataset (SQuAD) style. The fine-tuning process utilized the Seq2SeqTrainer from the Hugging Face Transformers library [39], with optimized hyperparameters to enhance performance. The reported results are an Exact Match (EM) score of 81.81%, ROUGE scores (ROUGE-1: 96.03%, ROUGE-2: 86.67%, ROUGE-L: 96.10%), and a BLEU score of 63%.\u003c/p\u003e\n\u003ch2\u003e1.1 Dataset\u003c/h2\u003e\n\u003cp\u003eThe MIMIC-IV-Ext-BHC dataset [27, 28] is a collection of Brief Hospital Course (BHC) summaries paired with their corresponding discharge notes extracted from the MIMIC-IV-Note database [40, 41]. MIMIC-IV-Ext-BHC was created by preprocessing discharge summaries from MIMIC-IV-Note, which contains 331,794 de-identified clinical notes from 145,915 patients admitted to the Beth Israel Deaconess Medical Center between 2008 and 2019. Both datasets are hosted on PhysioNet [28, 40] and can be accessed after signing a Data Use Agreement and completing the required training.
The collection of patient information and the creation of these research resources were reviewed by the Institutional Review Board (IRB) [42] at the Beth Israel Deaconess Medical Center, which granted a waiver of informed consent and approved the data-sharing initiative.\u003c/p\u003e\n\u003cp\u003eThe MIMIC-IV-Ext-BHC dataset is designed to facilitate research on hospital course summarization, addressing the challenge of extracting concise and relevant information from lengthy clinical narratives. This dataset covers a diverse patient population reflective of the broader MIMIC-IV cohort, which includes various age groups, genders, and medical conditions.\u003c/p\u003e\n\u003cp\u003eTo create the MIMIC-IV-Ext-BHC dataset, the MIMIC-IV notes were preprocessed through tokenization, section identification, normalization, and cleaning. This process separated the BHC from the rest of the clinical note.\u003c/p\u003e\n\u003cp\u003eTo validate data quality, a manual review of 100 randomly sampled clinical notes and a clinical-team review of 30 note-summary pairs were conducted; they reported no significant issues with the extracted content.\u003c/p\u003e\n\u003cp\u003eThe MIMIC-IV-Ext-BHC dataset, consisting of 270,033 discharge note\u0026ndash;BHC pairs, contains the columns note_id, input, target, input_tokens, and target_tokens. The note_id column is the unique identifier for each row in the dataset and matches the note_id column from the original MIMIC-IV-Note dataset. The input field includes the preprocessed discharge note text, excluding the BHC section. The target field includes the standardized and cleaned BHC text, providing a concise summary of the patient\u0026apos;s hospital course.
The input_tokens and target_tokens columns store the token counts of the clinical notes and BHC summaries.\u003c/p\u003e\n\u003ch2\u003e1.2 Automatic highlighting\u003c/h2\u003e\n\u003cp\u003eIn our previous work [32, 43, 44], we presented a multi-stage method for curating a Cardiology Interface Terminology (CIT) tailored for automatically highlighting information in discharge notes of cardiology patients. This process consists of three stages. In the first stage, we constructed an initial version of the CIT, referred to as ICIT, by incorporating concepts from 11 cardiology-related subhierarchies of SNOMED CT [45]. However, ICIT alone did not sufficiently capture all detailed information present in the discharge notes. To address this, in the second stage, we employed a semi-automatic, iterative approach to enrich ICIT. Specifically, we mined fine-granularity phrases from discharge notes that contained ICIT concepts. All mined phrases were reviewed both automatically and manually; the legitimate ones were added to CIT, forming a new version of CIT, and all structurally or semantically illegitimate phrases were added to a rejection list (R). In each iteration, we used the latest version of CIT to highlight the build dataset and evaluated its performance by measuring two metrics: coverage and breadth. Coverage is defined as the percentage of the words in a note that are highlighted, and breadth is defined as the average number of words per highlighted concept.\u003c/p\u003e\n\u003cp\u003eThis iterative process continued until further improvements in coverage became negligible. The output of the second stage serves as the training data for the third stage. All concepts included in the resulting CIT at the end of stage two were considered positive samples (labeled as 1), while all phrases in the rejection list R were considered negative samples (labeled as 0).
The third stage proceeded as follows: First, we embedded the labeled phrases of the training data using Clinical BioBERT [46], and then we trained a neural network (NN) classifier on the embedded dataset. After conducting a grid search [47] to tune the hyperparameters, we arrived at an NN model consisting of a single hidden layer with 100 neurons, the ReLU activation function [48], and the Adam optimizer [49]. Once trained, the model was used to classify newly extracted phrases from discharge notes as either legitimate concepts or illegitimate phrases. Phrases classified as legitimate were added to CIT, resulting in the final terminology, referred to as CIT\u003csub\u003eML\u003c/sub\u003e. CIT\u003csub\u003eML\u003c/sub\u003e demonstrated a coverage of 74.21% and a breadth of 1.68 for the test dataset. Figure 1 shows an excerpt of a discharge note highlighted by CIT\u003csub\u003eML\u003c/sub\u003e, with the highlighted information shown on a blue background.\u003c/p\u003e"},{"header":"Method","content":"\u003cp\u003eFor this study, we fine-tuned two variants of the LLaMA 2 (13B) model separately: one using highlighted discharge notes, which we refer to as H-LLaMA, and the other using the same notes without highlighted content, which we refer to as U-LLaMA. The fine-tuning procedure was identical in both settings, with the only difference being whether the input included highlighted information. The following sections describe data preparation and the fine-tuning procedure in detail.\u003c/p\u003e\n\u003cp\u003e2.1 Data preparation\u003c/p\u003e\n\u003cp\u003eAs described in the Background section, the MIMIC-IV-Ext-BHC dataset consists of clinical note-summary pairs, where each pair represents a complete discharge note text and its corresponding condensed BHC summary.
Each row in this dataset includes a note_id, a unique identifier matching the note_id column from the original MIMIC-IV-Note database.\u003c/p\u003e\n\u003cp\u003eIn our previous work, we developed CIT to highlight the detailed information in discharge notes of cardiology patients. Therefore, to identify cardiology-related records in MIMIC-IV-Note, we focused on two Intensive Care Units (ICUs) associated with cardiology: the Coronary Care Unit (CCU) and the Cardiac Vascular Intensive Care Unit (CVICU). We queried the \u0026quot;discharge\u0026quot; table in MIMIC-IV-Note for patients admitted to the CCU or CVICU. This resulted in approximately 18,600 records, from which we extracted the note_id values. Next, we filtered MIMIC-IV-Ext-BHC to retain only the rows whose note_id matched those of patients admitted to the CCU or CVICU in MIMIC-IV. This resulted in approximately 14,000 records.\u003c/p\u003e\n\u003cp\u003eFrom the 14,000 cardiology-related records, we randomly selected 1,000 records for fine-tuning the model, which we refer to as training data, and another 100 records for evaluating summaries generated by the fine-tuned model, which we refer to as test data.\u003c/p\u003e\n\u003cp\u003eWe highlighted the discharge notes of the training data and test data using the automatic highlighting method we proposed and reported on in [32, 43, 44], which writes the highlighted version of each note to an HTML file. The highlighted information is enclosed in \u0026lt;span\u0026gt; tags with the background color #ADD8E6.
To prevent the model from being confused by HTML tags, each HTML file was converted into plain text, and the highlighted information was enclosed within square brackets (\u0026lsquo;[\u0026rsquo; and \u0026lsquo;]\u0026rsquo;).\u003c/p\u003e\n\u003cp\u003eWe prepared four datasets for model training and evaluation:\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eUnhighlighted Training Set (UTrain):\u003c/strong\u003e This set consists of 1,000 original discharge notes without any highlighted information, each paired with its corresponding BHC summary.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eUnhighlighted Test Set (UTest):\u003c/strong\u003e This set includes 100 original, unhighlighted discharge notes along with their corresponding BHC summaries and was used for evaluation purposes.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eHighlighted Training Set (HTrain):\u003c/strong\u003e This set contains the same 1,000 discharge notes used in UTrain, but with detailed information highlighted. The highlighted content was enclosed within square brackets (\u0026lsquo;[\u0026rsquo; and \u0026lsquo;]\u0026rsquo;). Each note is paired with its corresponding BHC summary.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eHighlighted Test Set (HTest):\u003c/strong\u003e This set includes the same 100 discharge notes used in UTest, with highlighted information enclosed within square brackets. Each note is paired with its corresponding BHC summary.\u003c/p\u003e\n\u003cp\u003eWe converted both UTrain and HTrain into JSON format.
Each training example was represented as a JSON object with three fields: \u0026quot;instruction\u0026quot;, \u0026quot;input\u0026quot;, and \u0026quot;output\u0026quot;, resulting in two separate JSON datasets.\u003c/p\u003e\n\u003cp\u003eThe UTrain dataset used the instruction: \u0026quot;\u003cem\u003eSummarize the clinical note into a brief hospital course.\u0026quot;\u003c/em\u003e\u003c/p\u003e\n\u003cp\u003eThe HTrain dataset used the instruction: \u0026quot;\u003cem\u003eSummarize the clinical note into a brief hospital course, focusing on the information enclosed within \u0026apos;[\u0026apos; and \u0026apos;]\u0026apos;.\u0026quot;\u003c/em\u003e\u003c/p\u003e\n\u003cp\u003eFor both datasets, the prompt (i.e., instruction + input) and the output (i.e., BHC) were tokenized, concatenated, and padded or truncated to a fixed length of 4,096 tokens. Because the input token length in the MIMIC-IV-Ext-BHC dataset averages 2,267 \u0026plusmn; 914 and the output token length averages 564 \u0026plusmn; 410 [28], the combined prompt and output length typically remained within this limit, and truncation was rarely needed.\u003c/p\u003e\n\u003cp\u003e2.2 Fine-Tuning procedure\u003c/p\u003e\n\u003cp\u003eFor fine-tuning, we used the HuggingFace Transformers library [39] in combination with PEFT (Parameter-Efficient Fine-Tuning) [50] and LoRA (Low-Rank Adaptation) [51]. The model was initialized from a pre-trained LLaMA-2 13B checkpoint in HuggingFace format. The tokenizer was loaded from the same checkpoint and extended to include the custom discharge note section tags. Special tags of the input discharge notes, such as \u0026lt;SEX\u0026gt;, \u0026lt;SERVICE\u0026gt;, \u0026lt;ALLERGIES\u0026gt;, and \u0026lt;CHIEF COMPLAINT\u0026gt;, were explicitly added to the tokenizer as special tokens.
These tags preserved semantic structure and helped guide the model\u0026rsquo;s understanding of discharge note organization.\u003c/p\u003e\n\u003cp\u003eThe model was fine-tuned locally using 2 NVIDIA GPUs, 66 CPU cores, and 256 GB of RAM, ensuring that no third party had access to the notes. Fine-tuning was performed using LoRA with a commonly adopted configuration: rank = 8, \u0026alpha; = 32, and dropout = 0.10. LoRA adaptation was applied specifically to the query and value projection layers of the attention mechanism, referred to as \u003cem\u003eq_proj\u003c/em\u003e and \u003cem\u003ev_proj\u003c/em\u003e in the model implementation. Training was conducted for 9 epochs using the AdamW optimizer [52] with a learning rate of 2e-4 and weight decay of 0.005. We used a per-device batch size of 5 and gradient accumulation over 8 steps. The dataset was split into 90% for training and 10% for evaluation.\u003c/p\u003e\n\u003cp\u003e2.3 Generating summary from fine-tuned models\u003c/p\u003e\n\u003cp\u003eAfter fine-tuning, each of the two fine-tuned LoRA adaptors was separately merged with the base LLaMA-2 13B model to produce a standalone fine-tuned model for inference. Inference used generation prompt structures consistent with the training format. Following best practices in recent clinical text generation literature [53-55], inference was performed using nucleus sampling [56] with temperature = 0.7 and top-p = 0.9. We refer to the summaries generated by U-LLaMA as U-summaries, and those from H-LLaMA as H-summaries.\u003c/p\u003e\n\u003cp\u003e2.4 Evaluation metrics\u003c/p\u003e\n\u003cp\u003eEvaluation involves assessing the quality of the generated summaries in relation to the gold-standard summary, in this case the BHC. Summaries should be evaluated from different perspectives.
In most related studies, extensive manual work has been conducted to assess summary quality alongside automatic metrics. We evaluated the summaries in three categories: manual evaluation, automatic evaluation, and LLM-based evaluation. Automatic and LLM-based evaluations were conducted on all 100 notes in the test dataset, and a manual evaluation was performed on a random subset of 20 notes.\u003c/p\u003e\n\u003cp\u003e2.4.1 Manual Evaluation\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCompleteness\u003c/strong\u003e: Measures how well the summary captures the information from the reference BHC summary. A high completeness score means the summary retains more of the reference information and omits fewer critical details. Completeness is calculated using (1).\u003c/p\u003e\n\u003cp\u003e[Equation (1): completeness formula, provided as an image in the original.]\u003c/p\u003e\n\u003cp\u003eWe evaluate the completeness of the generated summaries based on the reference summary, rather than the entire discharge note. This is because much of the information in the discharge note is not expected to appear in the summary.
When gold-standard reference summaries are available, we assume that only the information contained in the reference summary needs to be included in the generated summary.\u003c/p\u003e\n\u003ch3\u003e2.4.2 Automatic Evaluation\u003c/h3\u003e\n\u003cp\u003e\u003cstrong\u003eBERTScore\u003c/strong\u003e: BERTScore [57] uses contextual embeddings from a pre-trained BERT model to compute semantic similarity between the generated and reference summaries. Unlike traditional n-gram overlap methods, BERTScore compares the similarity of words in the generated and reference texts based on their meaning rather than exact word matches.\u003c/p\u003e\n\u003cp\u003eBERTScore provides three main scores:\u003c/p\u003e\n\u003cp\u003ePrecision: How much of the generated text is semantically supported by the reference.\u003c/p\u003e\n\u003cp\u003eRecall: How much of the reference is captured by the generated text.\u003c/p\u003e\n\u003cp\u003eF1 score: The harmonic mean of precision and recall, often used as the main BERTScore. In this study, we also report the F1 score as the BERTScore.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eROUGE-L (Recall-Oriented Understudy for Gisting Evaluation \u0026ndash; Longest Common Subsequence):\u003c/strong\u003e ROUGE metrics [58] measure \u003cstrong\u003ehow much of the reference summary appears in the generated summary\u003c/strong\u003e by computing the overlap of n-grams, word sequences, and longest common subsequences between the generated and reference summaries. One common variant of ROUGE is ROUGE-L, which measures the longest common subsequence to capture fluency and coherence.
Its ability to reflect holistic similarity rather than just local n-gram matches makes it more suitable for evaluating summary-level similarity.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eBLEU (Bilingual Evaluation Understudy):\u003c/strong\u003e The BLEU metric [59] measures how much of the generated summary appears in the reference summary by computing the n-gram overlap between the generated summary and the reference summary. BLEU is widely used for assessing machine translation and text generation tasks, including summarization.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eSummaC-CONV:\u003c/strong\u003e SummaC-CONV [60], specifically designed for the summarization task, is a factual consistency evaluation metric that leverages Natural Language Inference (NLI) [61] to assess whether a generated summary is factually consistent with its reference text. Unlike naive sentence-level approaches, SummaC-CONV does not assume a one-to-one alignment between source and summary sentences. Instead, it compares each summary sentence with multiple relevant sentences from the reference text, aggregating entailment scores. By mapping \u003cstrong\u003eeach summary sentence to multiple sentences\u003c/strong\u003e, it correctly evaluates cases where multiple facts are summarized into a single sentence or appear in a different order than in the original document.\u003c/p\u003e\n\u003ch3\u003e2.4.3 LLM-based evaluation\u003c/h3\u003e\n\u003cp\u003eLLMs have shown great potential as evaluators for LLM-generated summaries, evaluating various aspects of the generated texts [62, 63]. In this study, we use ChatGPT 4o (through Azure) to evaluate the generated summaries of each discharge note. We crafted and refined a prompt instructing ChatGPT 4o to evaluate each summary on four criteria and to assign a score from 1 to 5 for each, where 1 indicates the lowest and 5 the highest.
Here is the final version of the prompt:\u003c/p\u003e\n\u003cp\u003e\u0026ldquo;\u003cem\u003eAct as a cardiologist and read a discharge note (A) alongside its reference summary (B) and two LLM-generated summaries (C and D). Your task is to grade both summaries on a scale of 1 to 5 for the following four metrics:\u003c/em\u003e\u003c/p\u003e\n\u003cp\u003e\u003cem\u003eCoherence: Does the summary maintain a logical flow of ideas and present information clearly?\u003c/em\u003e\u003c/p\u003e\n\u003cp\u003e\u003cem\u003eFluency: Is it grammatically well-formed and easy to read?\u003c/em\u003e\u003c/p\u003e\n\u003cp\u003e\u003cem\u003eConciseness: Does it effectively condense information, keeping only the most relevant clinical details?\u003c/em\u003e\u003c/p\u003e\n\u003cp\u003e\u003cem\u003eCorrectness: Does the summary accurately reflect the content of the discharge note (A) and the reference summary (B), without introducing false or misleading information?\u003c/em\u003e\u003c/p\u003e\n\u003cp\u003e\u003cem\u003eAssign a score (1-5) for each metric for both summaries (C and D). A score of 1 indicates the lowest quality, while 5 indicates the highest.\u0026rdquo;\u003c/em\u003e\u003c/p\u003e"},{"header":"Results","content":"\u003cp\u003eFigure \u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e(a) shows an example of a BHC summary of one discharge note, containing 181 words, alongside the H-summary (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eb) and the U-summary (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ec) of its corresponding discharge note, which contain 126 and 135 words, respectively.\u003c/p\u003e\u003cp\u003eIn this example, six key pieces of information are highlighted in green in Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e(a).
Among these green highlights, the blue highlights in Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e(b) indicate information that is present in the original note (a) but missing from the U-summary (c). Conversely, yellow highlights denote information present in the U-summary (c) but absent from the H-summary (b). Pink highlights in Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e(a) indicate information that does not appear in either summary. All unhighlighted pieces of information in (a) correspond to information that appears in both summaries. The H-summary achieves a completeness score of 56.6%, while the U-summary achieves 47.8%.\u003c/p\u003e\u003cp\u003e\u003cstrong\u003eManual evaluation\u003c/strong\u003e\u003cp\u003eFor 20 randomly selected notes, the average completeness of the U-summaries is 48.1%, while the H-summaries achieve a higher average completeness of 54.2%, an increase of about 6 percentage points. Additionally, for 17 notes, the completeness of the H-summary is higher than that of the corresponding U-summary, and for 3 notes, the completeness is equal. Using Fisher\u0026rsquo;s exact test [\u003cspan citationid=\"CR64\" class=\"CitationRef\"\u003e64\u003c/span\u003e, \u003cspan citationid=\"CR65\" class=\"CitationRef\"\u003e65\u003c/span\u003e], we compared the completeness of the H-summary group (17 notes) and the U-summary group (3 notes) at a significance level of 0.05. The resulting p-value was 0.0009, indicating a statistically significant difference (p\u0026thinsp;\u0026lt;\u0026thinsp;0.05).\u003c/p\u003e\u003c/p\u003e\u003cp\u003e\u003cstrong\u003eAutomatic evaluation\u003c/strong\u003e\u003cp\u003eAs shown in Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e, the average BERTScore of the U-summaries for 100 notes is 59.61, while that of the H-summaries is 63.75, with 91 notes having a higher BERTScore for the H-summary than for the U-summary. 
Using Fisher\u0026rsquo;s exact test, we compared the BERTScore of the H-summary group (91 notes) and the U-summary group (9 notes) at a significance level of 0.05. The resulting p-value was less than 0.00001, indicating a statistically significant difference (p\u0026thinsp;\u0026lt;\u0026thinsp;0.05).\u003c/p\u003e\u003c/p\u003e\u003cp\u003eThe average ROUGE-L score is 21.82 for the U-summaries and 23.43 for the H-summaries, with 90 notes having a higher ROUGE-L score for the H-summary than for the U-summary. Using Fisher\u0026rsquo;s exact test, we compared the ROUGE-L score of the H-summary group (90 notes) and the U-summary group (10 notes) at a significance level of 0.05. The resulting p-value was less than 0.00001, indicating a statistically significant difference (p\u0026thinsp;\u0026lt;\u0026thinsp;0.05).\u003c/p\u003e\u003cp\u003eSimilarly, the average BLEU score is 8.41 for the U-summaries and 10.4 for the H-summaries, with 81 notes having a higher BLEU score for the H-summary than for the U-summary. Using Fisher\u0026rsquo;s exact test, we compared the BLEU score of the H-summary group (81 notes) and the U-summary group (19 notes) at a significance level of 0.05. The resulting p-value was less than 0.00001, indicating a statistically significant difference (p\u0026thinsp;\u0026lt;\u0026thinsp;0.05).\u003c/p\u003e\u003cp\u003eFinally, the average SummaC-CONV score is 40.2 for the U-summaries and 67.7 for the H-summaries, with 98 notes having a higher SummaC-CONV score for the H-summary than for the U-summary. Using Fisher\u0026rsquo;s exact test, we compared the SummaC-CONV score of the H-summary group (98 notes) and the U-summary group (2 notes) at a significance level of 0.05. 
The resulting p-value was 0.0002, indicating a statistically significant difference (p\u0026thinsp;\u0026lt;\u0026thinsp;0.05).\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eThe average results of automatic evaluations for U_summaries (U_S) and H_summaries (H_S)\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"11\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c8\" colnum=\"8\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c9\" colnum=\"9\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c10\" colnum=\"10\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c11\" colnum=\"11\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colspan=\"3\" nameend=\"c3\" namest=\"c1\"\u003e\u003cp\u003eWord count\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colspan=\"2\" nameend=\"c5\" namest=\"c4\"\u003e\u003cp\u003eBERTScore\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colspan=\"2\" nameend=\"c7\" namest=\"c6\"\u003e\u003cp\u003eROUGE-L\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" 
colspan=\"2\" nameend=\"c9\" namest=\"c8\"\u003e\u003cp\u003eBLEU\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colspan=\"2\" nameend=\"c11\" namest=\"c10\"\u003e\u003cp\u003eSummaC-CONV\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eBHC\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eU_S\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eH_S\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eU_S\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eH_S\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003eU_S\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003eH_S\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c8\"\u003e\u003cp\u003eU_S\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c9\"\u003e\u003cp\u003eH_S\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c10\"\u003e\u003cp\u003eU_S\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c11\"\u003e\u003cp\u003eH_S\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003e393.19\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e\u003cb\u003e300.38\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e\u003cb\u003e331.67\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e\u003cb\u003e59.61\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e\u003cb\u003e63.75\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" 
colname=\"c6\"\u003e\u003cp\u003e\u003cb\u003e21.82\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e\u003cb\u003e23.43\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c8\"\u003e\u003cp\u003e\u003cb\u003e8.41\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c9\"\u003e\u003cp\u003e\u003cb\u003e10.4\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c10\"\u003e\u003cp\u003e\u003cb\u003e40.2\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c11\"\u003e\u003cp\u003e\u003cb\u003e67.7\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003e\u003cstrong\u003eLLM-based evaluation\u003c/strong\u003e\u003cp\u003eChatGPT 4o rated both U_summaries and H_summaries highly across all four criteria (Coherence, Fluency, Conciseness, and Correctness), indicating that both types of summaries are well written overall. However, H_summaries received slightly higher average scores than U_summaries on three criteria (Coherence, Conciseness, and Correctness), while the Fluency averages were almost equal.\u003c/p\u003e\u003c/p\u003e\u003cp\u003eThe average scores are 4.84 (U_summaries) and 4.91 (H_summaries) for Coherence, 4.91 (U_summaries) and 4.92 (H_summaries) for Fluency, 4.65 (U_summaries) and 4.81 (H_summaries) for Conciseness, and 4.79 (U_summaries) and 4.91 (H_summaries) for Correctness. 
When considering the overall average across all criteria, H_summaries scored 4.88, slightly outperforming U_summaries, which scored 4.79.\u003c/p\u003e\u003cp\u003eAlso, as shown in Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e, the average length of the BHC summaries for these 100 notes is 393.19 words, while the average length is 300.38 words for the U-summaries and 331.67 words for the H-summaries, suggesting that H-summaries tend to include more information.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eThe results of LLM evaluation for U_summaries (U_S) and H_summaries (H_S)\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"10\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c8\" colnum=\"8\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c9\" colnum=\"9\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c10\" colnum=\"10\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colspan=\"2\" nameend=\"c2\" namest=\"c1\"\u003e\u003cp\u003eCoherence\u003c/p\u003e\u003c/th\u003e\u003cth 
align=\"left\" colspan=\"2\" nameend=\"c4\" namest=\"c3\"\u003e\u003cp\u003eFluency\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colspan=\"2\" nameend=\"c6\" namest=\"c5\"\u003e\u003cp\u003eConciseness\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colspan=\"2\" nameend=\"c8\" namest=\"c7\"\u003e\u003cp\u003eCorrectness\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colspan=\"2\" nameend=\"c10\" namest=\"c9\"\u003e\u003cp\u003eAverage\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eU_S\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eH_S\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eU_S\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eH_S\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eU_S\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003eH_S\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003eU_S\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c8\"\u003e\u003cp\u003eH_S\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c9\"\u003e\u003cp\u003eU_S\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c10\"\u003e\u003cp\u003eH_S\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e4.84\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e4.91\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e4.91\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e4.92\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e4.65\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" 
colname=\"c6\"\u003e\u003cp\u003e4.81\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e4.79\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c8\"\u003e\u003cp\u003e4.91\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c9\"\u003e\u003cp\u003e\u003cb\u003e4.79\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c10\"\u003e\u003cp\u003e\u003cb\u003e4.88\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eTable\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e shows the minimum and maximum values, as well as the standard deviation, of every evaluation metric for U_summaries and H_summaries.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab3\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eMinimum, maximum, and standard deviation of the evaluation metrics for U_summaries (U_S) and H_summaries (H_S).\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"5\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eMetric\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" 
colname=\"c2\"\u003e\u003cp\u003eSummary\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eMin\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003eMax\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c5\"\u003e\u003cp\u003eSTD\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e\u003cp\u003eCompleteness\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eU_S\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e23.2\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e73.3\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e14.01\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eH_S\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e29.3\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e80.0\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e12.36\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e\u003cp\u003eBERTScore\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eU_S\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e57.8\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e69.2\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e2.86\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eH_S\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" 
colname=\"c3\"\u003e\u003cp\u003e59.3\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e69.3\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e2.75\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e\u003cp\u003eROUGE-L\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eU_S\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e14.1\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e34.3\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e7.19\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eH_S\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e19.2\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e39.3\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e5.54\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e\u003cp\u003eBLEU\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eU_S\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e1.2\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e13.7\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e5.35\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eH_S\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e4.2\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e22.4\u003c/p\u003e\u003c/td\u003e\u003ctd 
align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e4.97\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e\u003cp\u003eSummaC-CONV\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eU_S\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e7.5\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e63.4\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e15.13\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eH_S\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e60.1\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e89.3\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e8.21\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e\u003cp\u003eCoherence\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eU_S\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e4\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e5\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e0.36\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eH_S\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e4\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e5\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e0.22\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\" 
morerows=\"1\" rowspan=\"2\"\u003e\u003cp\u003eFluency\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eU_S\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e4\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e5\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e0.36\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eH_S\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e4\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e5\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e0.22\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e\u003cp\u003eConciseness\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eU_S\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e4\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e5\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e0.43\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eH_S\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e4\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e5\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e0.40\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e\u003cp\u003eCorrectness\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" 
colname=\"c2\"\u003e\u003cp\u003eU_S\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e4\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e5\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e0.40\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eH_S\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e4\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e5\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e0.30\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e"},{"header":"Discussion","content":"\u003cp\u003eIn this study, we fine-tuned LLaMA 2 (13B) twice under identical conditions: once using unhighlighted discharge notes and once using the same notes with highlighting. By keeping everything else constant, we aimed to isolate and measure the effect of highlighting on fine-tuning. The results show that fine-tuning on highlighted discharge notes improves the quality of the generated summaries, compared with fine-tuning on unhighlighted notes, across all evaluation metrics. For the completeness, BERTScore, ROUGE-L, BLEU, and SummaC-CONV metrics, the differences were statistically significant (p \u0026lt; 0.05) according to Fisher\u0026rsquo;s exact test at a significance level of 0.05.\u003c/p\u003e\n\u003cp\u003eThe BHC section in discharge notes offers a concise narrative of the patient\u0026apos;s clinical journey. 
Composing this section is widely recognized as a cognitively demanding and time-intensive task for clinicians, since it requires synthesizing a large volume of notes and reports generated throughout the patient\u0026apos;s hospitalization into a coherent summary [66, 67]. This process is not only laborious but also susceptible to errors, given the high documentation burden [66, 68]. Furthermore, BHCs exhibit significant variability in both style and content. Authored by different clinicians, they reflect diverse writing habits and frequently alternate between extractive and abstractive summarization strategies [67]. Prior research has shown that discharge summaries and their BHC sections may omit critical information or introduce excessive, redundant, or even erroneous content [68, 69]. In addition, BHCs sometimes contain new or summary-only information that is not clearly documented elsewhere.\u003c/p\u003e\n\u003cp\u003eAlthough BHCs are commonly used as reference summaries for training summarization models, they are inherently noisy. Their variability in coherence and completeness, together with their potential misrepresentation of clinical facts, introduces significant challenges for model training and evaluation [67, 69]. After reviewing several examples, we noted that some BHCs are very short, shallow, and lack important details, while others are overly long and include too much information. There is no consistent structure or style among them. Indeed, the BHC targets have a mean token length of 564 with a standard deviation of 410 [28], indicating substantial variation in target length. This inconsistency makes it harder for the language model to learn what kind of summary it is supposed to generate. 
If all BHCs followed a similar pattern, both the highlighted and unhighlighted models would have a clearer target during training.\u003c/p\u003e\n\u003cp\u003eAnother factor that impacts the quality of the generated summaries is the presence of errors or mismatches between the original discharge notes and their corresponding BHCs. In some cases, the BHC includes information that does not appear in the original note. For example, in note \u0026ldquo;11874424-DS-9,\u0026rdquo; \u0026ldquo;atorvastatin 80 mg\u0026rdquo; is mentioned in the BHC, but the original discharge note only refers to a prior dose of atorvastatin 40 mg, with no indication of a change to 80 mg. Similarly, the original note states that the patient was accompanied by his son, while the BHC refers to a daughter. These kinds of inaccuracies or additions in the BHCs negatively affect the quality of the summaries generated by the LLM. Since the model generates summaries based only on the original note, it should not include information that was never there. As a result, when these summaries are evaluated against BHCs, the evaluation scores decrease, not because the model failed, but because the reference summaries are flawed.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFuture work:\u0026nbsp;\u003c/strong\u003eFor future work, we plan to fine-tune the model using a larger dataset. Additionally, we aim to train the model on a subset of examples whose BHCs follow a more consistent style. For instance, we can select samples where the ratio of BHC length to original discharge note length falls within a similar range. This would help ensure that the summaries, whether brief or detailed, are more uniform in structure, making it easier to evaluate the model\u0026rsquo;s performance. We also aim to conduct a preliminary evaluation of BHCs using LLMs, assessing them against their corresponding discharge notes to determine which are more comprehensive and less erroneous. 
This evaluation will help identify BHCs suitable to serve as gold-standard summaries. The BHCs that receive higher ratings will be selected, along with their corresponding discharge notes, to form the training dataset for fine-tuning the model. We also plan to conduct hybrid training, in which we fine-tune LLaMA on a mixed dataset that includes both highlighted and unhighlighted inputs. This will allow us to observe how the model behaves when trained with both types of data. Finally, we plan to increase the test set for manual evaluation to 100 notes in order to achieve a more accurate assessment.\u003c/p\u003e"},{"header":"Conclusion","content":"\u003cp\u003eThis study investigates the effect of highlighting information in discharge notes on the summaries generated by Large Language Models (LLMs). Highlighting is done automatically using a Cardiology Interface Terminology (CIT) proposed in our previous work. To carry out our experiment, we fine-tuned the LLaMA2-13B model twice using the MIMIC-IV-Ext-BHC dataset: once with the highlighted discharge notes and once with the same set of discharge notes without highlighting.\u003c/p\u003e\u003cp\u003eOur results demonstrate that incorporating highlighted information into the fine-tuning process improves the quality of discharge note summarization. That is, summaries generated from highlighted inputs consistently outperformed those from unhighlighted inputs across all evaluation metrics, including BERTScore, ROUGE-L, BLEU, SummaC-CONV, completeness, and LLM-based judgments.\u003c/p\u003e\u003cp\u003eThis work provides a scalable framework for improving discharge note summarization using fine-tuned LLMs. 
By enhancing the clarity and completeness of clinical summaries, this approach has the potential to support more effective healthcare delivery and better-informed decision-making.\u003c/p\u003e"},{"header":"Statements and Declarations","content":"\u003cp\u003e\u003cstrong\u003eData availability\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe code used for fine-tuning the LLaMA 2\u0026ndash;13B model is publicly available at: https://github.com/mahshadkoohihd/Fine_tuning_LLaMA2\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCompeting Interests\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors declare no competing interests.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAcknowledgments\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors have no acknowledgments to declare.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n \u003cli\u003eMenachemi, N. and T.H. Collum, \u003cem\u003eBenefits and drawbacks of electronic health record systems.\u003c/em\u003e Risk management and healthcare policy, 2011: p. 47-55.\u003c/li\u003e\n \u003cli\u003eMadzime, R. and C. Nyirenda, \u003cem\u003eEnhanced Electronic Health Records Text Summarization Using Large Language Models.\u003c/em\u003e arXiv preprint arXiv:2410.09628, 2024.\u003c/li\u003e\n \u003cli\u003eBowman, S., \u003cem\u003eImpact of electronic health record systems on information integrity: quality and safety implications.\u003c/em\u003e Perspectives in health information management, 2013. \u003cstrong\u003e10\u003c/strong\u003e(Fall).\u003c/li\u003e\n \u003cli\u003eO\u0026rsquo;Malley, A.S., et al., \u003cem\u003eAre electronic medical records helpful for care coordination? Experiences of physician practices.\u003c/em\u003e Journal of general internal medicine, 2010. \u003cstrong\u003e25\u003c/strong\u003e: p. 
177-185.\u003c/li\u003e\n \u003cli\u003eApathy, N.C., et al., \u003cem\u003eDocumentation dynamics: note composition, burden, and physician efficiency.\u003c/em\u003e Health Services Research, 2023. \u003cstrong\u003e58\u003c/strong\u003e(3): p. 674-685.\u003c/li\u003e\n \u003cli\u003eZhao, W.X., et al., \u003cem\u003eA survey of large language models.\u003c/em\u003e arXiv preprint arXiv:2303.18223, 2023.\u003c/li\u003e\n \u003cli\u003eThirunavukarasu, A.J., et al., \u003cem\u003eLarge language models in medicine.\u003c/em\u003e Nature medicine, 2023. \u003cstrong\u003e29\u003c/strong\u003e(8): p. 1930-1940.\u003c/li\u003e\n \u003cli\u003eHadi, M.U., et al., \u003cem\u003eA survey on large language models: Applications, challenges, limitations, and practical usage.\u003c/em\u003e Authorea Preprints, 2023.\u003c/li\u003e\n \u003cli\u003eKwag, K.H., et al., \u003cem\u003eProviding doctors with high-quality information: an updated evaluation of web-based point-of-care information summaries.\u003c/em\u003e Journal of medical Internet research, 2016. \u003cstrong\u003e18\u003c/strong\u003e(1): p. e15.\u003c/li\u003e\n \u003cli\u003eShestov, A., et al., \u003cem\u003eFinetuning large language models for vulnerability detection.\u003c/em\u003e IEEE Access, 2025.\u003c/li\u003e\n \u003cli\u003eRallapalli, S., et al., \u003cem\u003eFine-Tuning LLMs for Report Summarization: Analysis on Supervised and Unsupervised Data.\u003c/em\u003e arXiv preprint arXiv:2503.10676, 2025.\u003c/li\u003e\n \u003cli\u003ePivovarov, R. and N. Elhadad, \u003cem\u003eAutomated methods for the summarization of electronic health records.\u003c/em\u003e Journal of the American Medical Informatics Association, 2015. \u003cstrong\u003e22\u003c/strong\u003e(5): p. 938-947.\u003c/li\u003e\n \u003cli\u003eSarzynski, E., et al., \u003cem\u003eOpportunities to improve clinical summaries for patients at hospital discharge.\u003c/em\u003e BMJ quality \u0026amp; safety, 2017. 
\u003cstrong\u003e26\u003c/strong\u003e(5): p. 372-380.\u003c/li\u003e\n \u003cli\u003eCasey, J.A., et al., \u003cem\u003eUsing electronic health records for population health research: a review of methods and applications.\u003c/em\u003e Annual review of public health, 2016. \u003cstrong\u003e37\u003c/strong\u003e(1): p. 61-81.\u003c/li\u003e\n \u003cli\u003eWu, X.-K., et al., \u003cem\u003eLLM Fine-Tuning: Concepts, Opportunities, and Challenges.\u003c/em\u003e Big Data and Cognitive Computing, 2025. \u003cstrong\u003e9\u003c/strong\u003e(4): p. 87.\u003c/li\u003e\n \u003cli\u003eHu, M., et al., \u003cem\u003eMitigating large language model hallucination with faithful finetuning.\u003c/em\u003e arXiv preprint arXiv:2406.11267, 2024.\u003c/li\u003e\n \u003cli\u003eRumiantsau, M., et al., \u003cem\u003eBeyond Fine-Tuning: Effective Strategies for Mitigating Hallucinations in Large Language Models for Data Analytics.\u003c/em\u003e arXiv preprint arXiv:2410.20024, 2024.\u003c/li\u003e\n \u003cli\u003eLiu, C., et al., \u003cem\u003eCPMI-ChatGLM: Parameter-efficient fine-tuning ChatGLM with Chinese patent medicine instructions.\u003c/em\u003e Scientific Reports, 2024. \u003cstrong\u003e14\u003c/strong\u003e(1): p. 6403.\u003c/li\u003e\n \u003cli\u003eHamzah, F. and N. Sulaiman, \u003cem\u003eOptimizing Llama 7B for Medical Question Answering: A Study on Fine-Tuning Strategies and Performance on the MultiMedQA Dataset.\u003c/em\u003e\u003c/li\u003e\n \u003cli\u003eLi, I., et al., \u003cem\u003eNeural natural language processing for unstructured data in electronic health records: a review.\u003c/em\u003e Computer Science Review, 2022. \u003cstrong\u003e46\u003c/strong\u003e: p. 100511.\u003c/li\u003e\n \u003cli\u003ePerković, G., A. Drobnjak, and I. Botički. \u003cem\u003eHallucinations in llms: Understanding and addressing challenges\u003c/em\u003e. in \u003cem\u003e2024 47th MIPRO ICT and Electronics Convention (MIPRO)\u003c/em\u003e. 2024. 
IEEE.\u003c/li\u003e\n \u003cli\u003eJha, S., et al. \u003cem\u003eDehallucinating large language models using formal methods guided iterative prompting\u003c/em\u003e. in \u003cem\u003e2023 IEEE International Conference on Assured Autonomy (ICAA)\u003c/em\u003e. 2023. IEEE.\u003c/li\u003e\n \u003cli\u003eParthasarathy, V.B., et al., \u003cem\u003eThe ultimate guide to fine-tuning llms from basics to breakthroughs: An exhaustive review of technologies, research, best practices, applied research challenges and opportunities.\u003c/em\u003e arXiv preprint arXiv:2408.13296, 2024.\u003c/li\u003e\n \u003cli\u003eAhmad, P.N., et al., \u003cem\u003eBIR: Biomedical Information Retrieval System for Cancer Treatment in Electronic Health Record Using Transformers.\u003c/em\u003e Sensors, 2023. \u003cstrong\u003e23\u003c/strong\u003e(23): p. 9355.\u003c/li\u003e\n \u003cli\u003eHe, Z., et al., \u003cem\u003eEnriching real-world data with social determinants of health for health outcomes and health equity: successes, challenges, and opportunities.\u003c/em\u003e Yearbook of Medical Informatics, 2023. \u003cstrong\u003e32\u003c/strong\u003e(01): p. 253-263.\u003c/li\u003e\n \u003cli\u003eMajdik, Z.P., et al., \u003cem\u003eSample Size Considerations for Fine-Tuning Large Language Models for Named Entity Recognition Tasks: Methodological Study.\u003c/em\u003e JMIR AI, 2024. \u003cstrong\u003e3\u003c/strong\u003e: p. e52095.\u003c/li\u003e\n \u003cli\u003eAali, A., et al., \u003cem\u003eA dataset and benchmark for hospital course summarization with adapted large language models.\u003c/em\u003e Journal of the American Medical Informatics Association, 2024: p. ocae312.\u003c/li\u003e\n \u003cli\u003eAali, A., et al. \u003cem\u003eMIMIC-IV-Ext-BHC: Labeled Clinical Notes Dataset for Hospital Course Summarization\u003c/em\u003e. 
2024; Available from: https://physionet.org/content/labelled-notes-hospital-course/1.1.0/.\u003c/li\u003e\n \u003cli\u003eDu, X., et al., \u003cem\u003eGenerative large language models in electronic health records for patient care since 2023: a systematic review.\u003c/em\u003e medRxiv, 2024.\u003c/li\u003e\n \u003cli\u003eAcharya, A., et al., \u003cem\u003eClinical risk prediction using language models: benefits and considerations.\u003c/em\u003e Journal of the American Medical Informatics Association, 2024: p. ocae030.\u003c/li\u003e\n \u003cli\u003eKoohi Habibi Dehkordi, M., et al., \u003cem\u003eImproving Large Language Models Summarization by Highlighting Discharge Notes: A Comparative Evaluation.\u003c/em\u003e JMIR Med Inform (forthcoming). doi:10.2196/66476, 2025.\u003c/li\u003e\n \u003cli\u003eDehkordi, M.K.H., et al. \u003cem\u003eSkimming of Electronic Health Records Highlighted by an Interface Terminology Curated with Machine Learning Mining\u003c/em\u003e. in \u003cem\u003eBIOSTEC (2)\u003c/em\u003e. 2024.\u003c/li\u003e\n \u003cli\u003eJin, H., et al., \u003cem\u003eA comprehensive survey on process-oriented automatic text summarization with exploration of llm-based methods.\u003c/em\u003e arXiv preprint arXiv:2403.02901, 2024.\u003c/li\u003e\n \u003cli\u003eZhong, M., et al., \u003cem\u003eExtractive summarization as text matching.\u003c/em\u003e arXiv preprint arXiv:2004.08795, 2020.\u003c/li\u003e\n \u003cli\u003eGupta, S. and S.K. Gupta, \u003cem\u003eAbstractive summarization: An overview of the state of the art.\u003c/em\u003e Expert Systems with Applications, 2019. \u003cstrong\u003e121\u003c/strong\u003e: p. 49-65.\u003c/li\u003e\n \u003cli\u003eVan Veen, D., et al., \u003cem\u003eAdapted large language models can outperform medical experts in clinical text summarization.\u003c/em\u003e Nature medicine, 2024. \u003cstrong\u003e30\u003c/strong\u003e(4): p. 
1134-1142.\u003c/li\u003e\n \u003cli\u003eMa, C., et al., \u003cem\u003eAn Iterative Optimizing Framework for Radiology Report Summarization with ChatGPT.\u003c/em\u003e IEEE Transactions on Artificial Intelligence, 2024.\u003c/li\u003e\n \u003cli\u003eHake, J., et al., \u003cem\u003eQuality, Accuracy, and Bias in ChatGPT-Based Summarization of Medical Abstracts.\u003c/em\u003e The Annals of Family Medicine, 2024. \u003cstrong\u003e22\u003c/strong\u003e(2): p. 113-120.\u003c/li\u003e\n \u003cli\u003eWolf, T., et al., \u003cem\u003eHuggingface\u0026apos;s transformers: State-of-the-art natural language processing.\u003c/em\u003e arXiv preprint arXiv:1910.03771, 2019.\u003c/li\u003e\n \u003cli\u003eJohnson, A.E., et al., \u003cem\u003eMIMIC-IV, a freely accessible electronic health record dataset.\u003c/em\u003e Scientific data, 2023. \u003cstrong\u003e10\u003c/strong\u003e(1): p. 1.\u003c/li\u003e\n \u003cli\u003eJohnson, A., et al., \u003cem\u003eMIMIC-IV.\u003c/em\u003e PhysioNet. Available online at: https://physionet.org/content/mimiciv/1.0/ (accessed August 23, 2021), 2020: p. 49-55.\u003c/li\u003e\n \u003cli\u003eGrady, C., \u003cem\u003eInstitutional review boards: Purpose and challenges.\u003c/em\u003e Chest, 2015. \u003cstrong\u003e148\u003c/strong\u003e(5): p. 1148-1155.\u003c/li\u003e\n \u003cli\u003eDehkordi, M.K.H., et al., \u003cem\u003eUsing annotation for computerized support for fast skimming of cardiology electronic health record notes\u003c/em\u003e, in \u003cem\u003e2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)\u003c/em\u003e. 2023, IEEE. p. 4043-4050.\u003c/li\u003e\n \u003cli\u003eMahshad Koohi H. Dehkordi, S.Z., Yehoshua Perl, Fadi P. Deek, Gai Elhanan, Andrew J. 
Einstein, \u003cem\u003eCuration of a Cardiology Interface Terminology for Highlighting Electronic Health Records using Machine Learning.\u003c/em\u003e Submitted to a Journal, 2024.\u003c/li\u003e\n \u003cli\u003eDonnelly, K., \u003cem\u003eSNOMED-CT: The advanced terminology and coding system for eHealth.\u003c/em\u003e Stud Health Technol Inform, 2006. \u003cstrong\u003e121\u003c/strong\u003e: p. 279-90.\u003c/li\u003e\n \u003cli\u003eAlsentzer, E., et al., \u003cem\u003ePublicly available clinical BERT embeddings.\u003c/em\u003e arXiv preprint arXiv:1904.03323, 2019.\u003c/li\u003e\n \u003cli\u003eLiashchynskyi, P. and P. Liashchynskyi, \u003cem\u003eGrid search, random search, genetic algorithm: a big comparison for NAS.\u003c/em\u003e arXiv preprint arXiv:1912.06059, 2019.\u003c/li\u003e\n \u003cli\u003eAgarap, A.F., \u003cem\u003eDeep learning using rectified linear units (relu).\u003c/em\u003e arXiv preprint arXiv:1803.08375, 2018.\u003c/li\u003e\n \u003cli\u003eJais, I.K.M., A.R. Ismail, and S.Q. Nisa, \u003cem\u003eAdam optimization algorithm for wide and deep neural network.\u003c/em\u003e Knowledge Engineering and Data Science, 2019. \u003cstrong\u003e2\u003c/strong\u003e(1): p. 41-46.\u003c/li\u003e\n \u003cli\u003eHan, Z., et al., \u003cem\u003eParameter-efficient fine-tuning for large models: A comprehensive survey.\u003c/em\u003e arXiv preprint arXiv:2403.14608, 2024.\u003c/li\u003e\n \u003cli\u003eHu, E.J., et al., \u003cem\u003eLora: Low-rank adaptation of large language models.\u003c/em\u003e ICLR, 2022. \u003cstrong\u003e1\u003c/strong\u003e(2): p. 3.\u003c/li\u003e\n \u003cli\u003eLlugsi, R., et al. \u003cem\u003eComparison between Adam, AdaMax and Adam W optimizers to implement a Weather Forecast based on Neural Networks for the Andean city of Quito\u003c/em\u003e. in \u003cem\u003e2021 IEEE Fifth Ecuador Technical Chapters Meeting (ETCM)\u003c/em\u003e. 2021. 
IEEE.\u003c/li\u003e\n \u003cli\u003eKoraş, O.A., et al., \u003cem\u003eTowards Conditioning Clinical Text Generation for User Control.\u003c/em\u003e arXiv preprint arXiv:2502.17571, 2025.\u003c/li\u003e\n \u003cli\u003eSu, Y., et al., \u003cem\u003eA contrastive framework for neural text generation.\u003c/em\u003e Advances in Neural Information Processing Systems, 2022. \u003cstrong\u003e35\u003c/strong\u003e: p. 21548-21561.\u003c/li\u003e\n \u003cli\u003ePeng, C., et al., \u003cem\u003eA study of generative large language model for medical research and healthcare.\u003c/em\u003e NPJ digital medicine, 2023. \u003cstrong\u003e6\u003c/strong\u003e(1): p. 210.\u003c/li\u003e\n \u003cli\u003eHoltzman, A., et al., \u003cem\u003eThe curious case of neural text degeneration.\u003c/em\u003e arXiv preprint arXiv:1904.09751, 2019.\u003c/li\u003e\n \u003cli\u003eZhang, T., et al., \u003cem\u003eBertscore: Evaluating text generation with bert.\u003c/em\u003e arXiv preprint arXiv:1904.09675, 2019.\u003c/li\u003e\n \u003cli\u003eLin, C.-Y. \u003cem\u003eRouge: A package for automatic evaluation of summaries\u003c/em\u003e. in \u003cem\u003eText summarization branches out\u003c/em\u003e. 2004.\u003c/li\u003e\n \u003cli\u003ePapineni, K., et al. \u003cem\u003eBleu: a method for automatic evaluation of machine translation\u003c/em\u003e. in \u003cem\u003eProceedings of the 40th annual meeting of the Association for Computational Linguistics\u003c/em\u003e. 2002.\u003c/li\u003e\n \u003cli\u003eLaban, P., et al., \u003cem\u003eSummaC: Re-visiting NLI-based models for inconsistency detection in summarization.\u003c/em\u003e Transactions of the Association for Computational Linguistics, 2022. \u003cstrong\u003e10\u003c/strong\u003e: p. 163-177.\u003c/li\u003e\n \u003cli\u003eMacCartney, B., \u003cem\u003eNatural language inference\u003c/em\u003e. 
2009: Stanford University.\u003c/li\u003e\n \u003cli\u003eSun, Z., et al., \u003cem\u003ePrinciple-driven self-alignment of language models from scratch with minimal human supervision.\u003c/em\u003e Advances in Neural Information Processing Systems, 2024. \u003cstrong\u003e36\u003c/strong\u003e.\u003c/li\u003e\n \u003cli\u003eHe, Z., et al., \u003cem\u003eQuality of Answers of Generative Large Language Models vs Peer Patients for Interpreting Lab Test Results for Lay Patients: Evaluation Study.\u003c/em\u003e ArXiv, 2024.\u003c/li\u003e\n \u003cli\u003eUpton, G.J., \u003cem\u003eFisher\u0026apos;s exact test.\u003c/em\u003e Journal of the Royal Statistical Society: Series A (Statistics in Society), 1992. \u003cstrong\u003e155\u003c/strong\u003e(3): p. 395-402.\u003c/li\u003e\n \u003cli\u003eSocial Science Statistics. \u003cem\u003eFisher Exact Test\u003c/em\u003e. Available from: https://www.socscistatistics.com/tests/fisher/default2.aspx.\u003c/li\u003e\n \u003cli\u003eAdams, G., J. Zucker, and N. Elhadad. \u003cem\u003eA meta-evaluation of faithfulness metrics for long-form hospital-course summarization\u003c/em\u003e. in \u003cem\u003eMachine Learning for Healthcare Conference\u003c/em\u003e. 2023. PMLR.\u003c/li\u003e\n \u003cli\u003eAdams, G., et al. \u003cem\u003eWhat\u0026rsquo;s in a summary? laying the groundwork for advances in hospital-course summarization\u003c/em\u003e. in \u003cem\u003eProceedings of the conference. Association for Computational Linguistics. North American Chapter. Meeting\u003c/em\u003e. 2021.\u003c/li\u003e\n \u003cli\u003eSearle, T., et al., \u003cem\u003eDischarge summary hospital course summarisation of in patient electronic health record text with clinical concept guided deep pre-trained transformer models.\u003c/em\u003e Journal of Biomedical Informatics, 2023. \u003cstrong\u003e141\u003c/strong\u003e: p. 
104358.\u003c/li\u003e\n \u003cli\u003eAdams, G., \u003cem\u003eGenerating Faithful and Complete Hospital-Course Summaries from the Electronic Health Record\u003c/em\u003e. 2024: Columbia University.\u003c/li\u003e\n\u003c/ol\u003e"}],"keywords":"Large Language Models, LLaMA, Fine-tuning, Discharge notes, Summarization, Electronic Health Records","lastPublishedDoi":"10.21203/rs.3.rs-7181141/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-7181141/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"postedDate":"July 23rd, 2025","status":"posted","subjectAreas":[{"id":51890829,"name":"Bioinformatics"}]},"version":"v1","identity":"rs-7181141"}}}

