Natural Language Processing method to Unravel Long COVID's clinical condition in hospitalized patients

Soraya Smaili, Pilar Veras, Vinícius Araújo, Henrique Zatti, Caio Vinícius Luis, and 13 more

This is a preprint (Research Square); it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-4262099/v1
This work is licensed under a CC BY 4.0 License.
Status: Published. Journal Publication published 13 Sep, 2024. Read the published version in Cell Death & Disease. Version 1 posted 10

Abstract
Long COVID is characterized by persistent symptoms beyond established timeframes, presenting a significant challenge in understanding its clinical manifestations and implications. In this study, we present a novel application of natural language processing (NLP) techniques to automatically extract unstructured data from a Long COVID survey conducted at a prominent university hospital in São Paulo, Brazil.
Our phonetic text clustering (PTC) method enables the exploration of unstructured electronic health record (EHR) data by unifying different written forms of similar terms into a single phonemic representation. We use n-gram text analysis to detect compound words and negated terms in Brazilian Portuguese (Portuguese-BR), focusing on medical conditions and symptoms related to Long COVID. By leveraging NLP, we aim to contribute to a deeper understanding of this chronic condition and its implications for healthcare systems worldwide. The model developed in this study has the potential for scalability and applicability in other healthcare settings, facilitating broader research efforts and informing clinical decision-making for Long COVID patients.

Health sciences/Medical research/Epidemiology
Health sciences/Diseases/Infectious diseases/Viral infection

Introduction
Advances in emerging technologies such as artificial intelligence (AI) and machine learning (ML) hold promise for transforming healthcare in prediction, contact tracing, screening, diagnosis and treatment, significantly improving medical practice 1–6. One potential use of AI is to assist in the extraction of information from electronic medical records. Natural language processing (NLP) is a subfield of AI that enables computers to learn from unstructured medical records and adapt to new language patterns over time, which can be useful for administrative and research purposes. Although efforts have been made to use NLP to extract information from medical records in English, studies in languages other than English are still emerging 7 and are urgently needed for specific systems. Historically, clinically relevant information from electronic health records (EHRs) has been extracted via manual review by clinical experts, resulting in scalability and cost limitations 8.
This is particularly evident for chronic diseases, where clinical notes are more common than structured medical record data 9. These unstructured data provide a great opportunity to test the performance of NLP in automatically extracting clinically meaningful information, which may be useful for research and administrative purposes 8. Long COVID is a chronic disease characterized by the persistence of symptoms for more than one month (according to the Centers for Disease Control and Prevention, CDC) or more than three months (according to the World Health Organization, WHO), and it still lacks a definitive clinical characterization. New tools that support a meticulous analysis of vast amounts of unstructured EHR data can uncover patterns, symptoms, and outcomes that might otherwise elude traditional research methods, enabling a deeper comprehension of long COVID. Recent descriptions of long COVID are based on studies conducted and compiled in 2021 and on more detailed studies conducted in 2022 with the aim of reaching a consensus 10–12. Based on these studies, many works have sought to highlight what actually occurs after SARS-CoV-2 infection. It remains unclear why the virus causes so many different symptoms affecting a variety of systems and what defines their frequency and prevalence. Thus, understanding what happens after COVID-19 and identifying which sequelae correspond to the post-acute disease are increasingly important. Several symptoms can persist for many months or years, in addition to elevated risks of complications and death 13,14. More recently, studies have provided more evidence about the existence of a large set of symptoms related to post-acute COVID-19, termed long COVID 15. However, there are still many doubts and inconsistencies related to long COVID, especially regarding patients who were hospitalized, as well as the correlation between the length of hospitalization and the severity of the disease.
Therefore, it is necessary to investigate and follow patients who were hospitalized with COVID-19 in different hospitals and to clearly identify the risk factors that require attention after COVID-19 and how these factors affect patients' lives after the disease. In this study, we sought to classify and automatically extract data from a long COVID survey from a large hospital in the city of São Paulo, one of the Brazilian cities most affected by the pandemic. We analyzed the EHRs and created a model that can be applied in other hospitals.

Materials and Methods

Study design and Data source
This is a cross-sectional study that uses a national database to train an NLP model and EHRs from a referral hospital to test it. The training dataset provided the tokens for text mining. We then performed text mining on the testing dataset and compared the results with those of manual human classification. The training dataset was built from the information system for severe acute respiratory illness (SIVEP-Gripe), in which all COVID-19 hospital admissions and deaths in Brazil were registered, as required by federal law. The SIVEP-Gripe is the national registration database for severe acute respiratory syndrome (SARS) in Brazil and includes COVID-19 data, covering all COVID-19 hospitalizations and deaths. A SARS case is defined as an individual who presents with dyspnea/respiratory discomfort, persistent pressure or pain in the chest, oxygen saturation less than 95% without oxygen, or cyanosis of the lips or face (https://www.gov.br/saude/pt-br/assuntos/coronavirus/artigos/definicao-e-casos-suspeitos). This database has been widely used as a source for other epidemiological studies 16–19. In the present work, the SIVEP-Gripe was used to create a token dictionary for unstructured text from clinical questionnaires. The dataset is publicly available at https://opendatasus.saude.gov.br/group/dados-sobre-srag.
The testing dataset was built from the EHRs of patients who were hospitalized for COVID-19 at the Hospital São Paulo, the University Hospital (UH) of the Federal University of São Paulo (Unifesp), from March 2020 to June 2022 and were followed after discharge at a Post-COVID-19 Disease Unit (PCDU). The PCDU is a multidisciplinary unit where health professionals assist patients and administer a questionnaire to gather information on any prolonged signs or symptoms after the acute phase of COVID-19. This questionnaire includes information on acute COVID-19, evolution of the infection, medical history, and post-acute phase signs and symptoms. These data were also linked to patients' demographic information and SARS-CoV-2 PCR results. Hospital São Paulo uses multiple information systems, so relevant information is dispersed across different databases. As an initial search strategy to extract data, terminologies related to COVID-19 were applied to the Clinical Notes Database (MongoDB) for the period from March 1, 2020 to September 30, 2022. A total of eight clinical encounter forms were preselected from this data structure, and the clinical records of 16,017 patients were collected. The form data were extracted and grouped into eight JSON (JavaScript Object Notation) files that were converted to the Tidy Data format and saved as CSV (comma-separated values) text files. Following the analysis of the files by the technical-scientific team, the clinical encounter form "Post-COVID Care (Pneumo)", containing records of 440 patients, was chosen for the analysis of demographic and hospital historical data. The second database of interest was the general patient record, a relational database (Oracle) that was queried using the Structured Query Language (SQL). We extracted data on emergency room visits, outpatient consultations, appointments, exam results, hospitalizations, and surgeries.
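The JSON-to-CSV step described above can be illustrated with a minimal sketch. It assumes a hypothetical export in which each record carries a patient identifier, a form name, and a dictionary of answers; these field names and the sample values are illustrative assumptions, not the hospital's actual schema. The sketch produces one row per encounter and one column per variable, which is one common reading of the tidy layout.

```python
import csv
import io
import json

# Hypothetical export of clinical encounter forms. The field names
# ("patient_id", "form", "answers") are assumptions for illustration only.
raw_records = json.loads("""
[
  {"patient_id": "p1", "form": "Post-COVID Care (Pneumo)",
   "answers": {"tosse": "sim", "observacoes": "nega HAS"}},
  {"patient_id": "p2", "form": "Post-COVID Care (Pneumo)",
   "answers": {"tosse": "nao"}}
]
""")


def to_tidy_table(records):
    """Return (header, rows): one row per encounter, one column per variable."""
    variables = sorted({name for rec in records for name in rec["answers"]})
    header = ["patient_id", "form"] + variables
    rows = []
    for rec in records:
        row = {"patient_id": rec["patient_id"], "form": rec["form"]}
        for name in variables:
            # Variables missing from a form are left empty rather than guessed.
            row[name] = rec["answers"].get(name, "")
        rows.append(row)
    return header, rows


def write_csv(header, rows, handle):
    """Write the tidy table to any text file handle as CSV."""
    writer = csv.DictWriter(handle, fieldnames=header)
    writer.writeheader()
    writer.writerows(rows)


header, rows = to_tidy_table(raw_records)
buffer = io.StringIO()
write_csv(header, rows, buffer)
csv_text = buffer.getvalue()
```

In practice the handle would be a file opened with `newline=""`, as the `csv` module documentation recommends, rather than an in-memory buffer.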
The data extracted from this database were stored in the XLSX (Microsoft Excel 2007) file format. The extracted medical conditions and patient symptoms were validated with the assistance of the pneumology and infectious diseases outpatient clinics at the UH.

Clinical data collection
The SIVEP-Gripe dataset, which contains ≥2.6 million entries and information on medical conditions and symptoms in the form of unstructured text, was used as the training dataset. Comorbidity and symptom information was extracted from this system via a phoneme approach using the metaphone-ptbr library (https://github.com/carlosjordao/metaphone-ptbr). All data processing was run in Python using Jupyter Notebooks. Before beginning the text analysis, it was necessary to normalize and clean the text strings. To this end, we used regular expressions to remove special characters and to convert the diverse set of separators between words into a uniform whitespace. The phonetic text clustering (PTC) method groups terms together according to their phonetics, effectively consolidating variations of similar terms into a single phonemic representation, and uses n-gram text analysis to detect compound words 20. This method not only captured and grouped terms but also accommodated synonyms, abbreviations, typographical errors, and the different conjunctions found in Brazilian Portuguese. To ensure the accuracy of these results, a dictionary of similar terms was carefully curated by five specialists in internal medicine, infectious diseases, pharmacology, pathology, otorhinolaryngology and public health and by three medical students. This step prevented the grouping of different terms into the same phoneme. This curation process was crucial in enabling the use of the dictionary to identify medical conditions in unstructured text from different and more complex contexts.
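The normalization, phonetic grouping, and n-gram steps described in this subsection can be sketched as follows. This is a minimal illustration under stated assumptions: `toy_phonetic_key` is a deliberately crude stand-in written for this example and is not the Metaphone algorithm used by the metaphone-ptbr library, and the sample notes are invented.

```python
import re
import unicodedata
from collections import Counter, defaultdict


def normalize(text):
    """Lowercase, strip accents, and turn any run of separators into one space."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", no_accents).strip()


def toy_phonetic_key(word):
    """Crude phonetic key (an assumption for illustration, NOT Metaphone)."""
    key = word.replace("ph", "f").replace("h", "").replace("z", "s")
    key = re.sub(r"(.)\1+", r"\1", key)  # collapse doubled letters
    return key[:1] + re.sub(r"[aeiou]", "", key[1:])  # drop non-initial vowels


def cluster_by_key(words):
    """Group written variants that share the same phonetic key."""
    clusters = defaultdict(set)
    for word in words:
        clusters[toy_phonetic_key(word)].add(word)
    return dict(clusters)


def ngrams(tokens, n):
    """Return all consecutive n-token tuples, used to spot compound terms."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Invented free-text entries with spelling variants and mixed separators.
notes = ["Hipertensão; DIABETES mellitus", "hipertenssao/diabetis mellitus", "ipertensao"]
tokens_per_note = [normalize(note).split() for note in notes]
clusters = cluster_by_key(tok for toks in tokens_per_note for tok in toks)

# Counting bigrams over phonetic keys lets spelling variants of a compound
# term ("diabetes mellitus", "diabetis mellitus") add up to the same bigram.
keyed_notes = [[toy_phonetic_key(tok) for tok in toks] for toks in tokens_per_note]
bigram_counts = Counter(bg for keys in keyed_notes for bg in ngrams(keys, 2))
```

A key this coarse will also merge unrelated words, which is the kind of collision the expert-curated dictionary described above is meant to prevent.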
PTC validation with long COVID questionnaires
To validate the PTC method, the clinical information of patients collected from the long COVID questionnaire was organized into structured and unstructured data (testing dataset). The structured data consisted of yes–no and multiple-choice answers, as well as numerical variables. The unstructured data comprised textual patient reports, including records of symptoms, clinical signs, laboratory tests, previous medical conditions, and lifestyle habits, such as smoking. Subsequently, an automated approach to process the unstructured variables from the long COVID questionnaires was applied. We searched for all previously defined terms in the curated dictionary and focused on the most frequent comorbidities associated with COVID-19, which included obesity, hypertension, diabetes mellitus, chronic obstructive pulmonary disease (COPD), asthma, hypothyroidism, and hyperthyroidism. Information on patients' smoking history was also collected to classify patients as smokers or former smokers. Medical records that did not include information on comorbidities or smoking history were classified as having no comorbidities or having a non-smoking history. In addition, information on long COVID symptoms (cough, fatigue, headache and myalgia) (https://www.cdc.gov/coronavirus/2019-ncov/long-term-effects/index.html) was collected. For symptoms, the focus was on terms in the questionnaires that described patient information in the present, excluding the text referring to symptoms reported in the past (acute phase). Importantly, the medical condition terms detected in the unstructured text of the long COVID dataset could appear in a negated context, i.e., the patient could be either confirming or denying the medical condition.
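The negation handling used in this subsection (a negative operator such as nega preceding the term of interest within the same sentence) can be sketched with Python's `re` module. This is a minimal illustration, not the authors' implementation: the function name, the sentence-splitting rule, and the sample note are assumptions, and only the operator "nega" comes from the text.

```python
import re


def term_status(text, term, operators=("nega",)):
    """Classify a term as 'present', 'negated', or 'absent' in a free-text note.

    A mention counts as negated when one of the operators appears earlier in
    the same sentence. Sentences are split naively on punctuation, so
    abbreviations containing periods would need extra care.
    """
    status = "absent"
    for sentence in re.split(r"[.;!?\n]+", text.lower()):
        match = re.search(rf"\b{re.escape(term.lower())}\b", sentence)
        if match is None:
            continue
        preceding = sentence[:match.start()]
        negated = any(
            re.search(rf"\b{re.escape(op)}\b", preceding) for op in operators
        )
        if not negated:
            return "present"  # one affirmative mention is enough
        status = "negated"
    return status


# Invented example note in Brazilian Portuguese.
note = "Paciente nega diabetes, asma. Refere hipertensão."
statuses = {t: term_status(note, t) for t in ("diabetes", "asma", "hipertensão", "obesidade")}
```

Here the negation scope runs from the operator to the end of its sentence, so both "diabetes" and "asma" are treated as denied, while "hipertensão" in the next sentence is not.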
To address this issue, we used regular expressions (https://docs.python.org/3/library/re.html) to check for negative operators such as “deny” (nega, in Brazilian Portuguese) that appeared before the comorbidity of interest within the same sentence, i.e., up to the start of the following sentence. This allowed us to capture instances where patients denied having the specified condition. Figure 1 shows a diagram detailing the method developed in this study using the extracted dataset.

Long COVID study population
The study population included adults (≥18 years) who were hospitalized due to acute COVID-19 at the Hospital São Paulo and discharged. The data were collected between March 2020 and June 2022. We excluded individuals who 1) had no SARS-CoV-2 PCR result or had a negative result; 2) did not have a recorded date of first symptoms; 3) completed questionnaires less than 30 days after the first occurrence of symptoms; 4) were hospitalized for more than 120 days; 5) did not have records of COVID-19 evolution during the acute phase; 6) had date inconsistencies; 7) had encounters after the first questionnaire; or 8) had no severity classification in the acute phase. Demographic variables from the long COVID dataset, such as sex, age (stratified into 18-39, 40-59, 60-79, and ≥80 years), and race (divided into white, black, mixed-brown, Asian, and Indigenous), were evaluated. Due to the small sample size, Asian and Indigenous individuals were combined for the analysis. Additionally, other variables, such as medical history, length of hospital stay (stratified into 0-14, 15-30, 31-60, and ≥60 days), and severity of the acute phase of COVID-19, were included. The severity categories were defined as moderate (non-ICU ward), severe (intensive care unit; ICU), or critical (ICU with mechanical ventilation).

Statistical analysis
To validate the accuracy of the automated approach, we compared the automated results with manually searched and labeled clinical data.
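A comparison between automated and manual labels for a single term could be set up as in the sketch below. It is one plausible arrangement, given as an assumption: the text does not specify how the contingency table was built, and the label vectors here are invented. The sketch computes the proportion of matching labels and Pearson's chi-square statistic for a 2x2 table (without continuity correction), with the p value for one degree of freedom obtained from the complementary error function.

```python
import math


def agreement(auto_labels, manual_labels):
    """Proportion of records on which the two labelings coincide."""
    matches = sum(a == m for a, m in zip(auto_labels, manual_labels))
    return matches / len(auto_labels)


def contingency_2x2(auto_labels, manual_labels):
    """Rows: automated present/absent. Columns: manual present/absent."""
    table = [[0, 0], [0, 0]]
    for a, m in zip(auto_labels, manual_labels):
        table[0 if a else 1][0 if m else 1] += 1
    return table


def pearson_chi_square(table):
    """Pearson chi-square statistic and p value (1 degree of freedom)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # For 1 degree of freedom, P(X > stat) equals erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Invented labels for one term across ten records (1 = term present).
auto = [1, 1, 0, 0, 1, 0, 0, 1, 0, 0]
manual = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

share_matching = agreement(auto, manual)
table = contingency_2x2(auto, manual)
chi2, p = pearson_chi_square(table)
```

With counts this small the chi-square approximation is unreliable; the example only shows the mechanics of the calculation.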
The manual labeling was performed by six of the authors with clinical training, and each record was individually labeled three times. We conducted Pearson's chi-square test to compare the automated and manual term counts to assess the accuracy of the text mining. Then, we performed a descriptive analysis of the long COVID findings to validate our findings against those of previous studies on the topic.

Ethical approval
The Brazilian National Commission in Research Ethics approved the research protocol (CONEP approval number 4.921.308 and CAAE registration no. 58619822.6.1001.5505).

Results

Automated labeling of the training dataset
First, we investigated the records from the SIVEP-Gripe. A total of 2,490,196 SARS records of patients admitted to hospitals between December 31, 2019, and March 27, 2023 were collected. All records were then analyzed as input for the PTC tokenization of medical conditions and symptoms. The resulting tokens were used to create the dictionary, which in turn was used to structure the data and to create the database for long COVID. Overall, 635,921 (25.5%) records reported one or more medical conditions, and 849,976 (34.1%) reported one or more SARS-related symptoms in the unstructured text field (Figure S1 and Table 1). From the unstructured clinical data, a dictionary collecting synonyms, misspelled and derivative words into a unique term (Table S1) was produced. Based on this dictionary, 20 of the most frequent medical conditions and 10 of the most frequent symptoms (Table 1) were captured for further analyses. SARS patient records were stratified by medical conditions and symptoms in a "yes/no" format, such as diabetes mellitus, obesity, cardiopathy, loss of smell, loss of taste, fatigue and cough. The results showed that 22,458 of the terms containing medical conditions captured from the unstructured text overlapped with at least one of the binary comorbidities with a "yes" response in the questionnaire.
In addition, 1,418 terms overlapped with at least one of the binary symptom variables with a "yes" response. Thus, to evaluate the gain of information, records with overlapping medical conditions or symptoms were excluded. The terms that were not included in the binary variables from the questionnaire appeared more frequently in the unstructured text annotation (Figure 2). Among medical conditions, the most frequent term captured by the automated reading was "hypertension", present in 303,109 entries, representing 11.3% of the total database (Figure 2A), followed by "smoker" in 61,110 entries, representing 2.3%; "hypothyroidism" in 39,550 entries, representing 1.5%; and "COPD" in 30,387 entries, representing 1.1%. Additionally, "smoking" was found in 61,110 entries (2.3%). Among symptoms, the most frequent terms were "headache", present in 215,225 entries, representing 8.0% of the total database; "myalgia", present in 213,035 entries, representing 7.9%; "asthenia", present in 124,086 entries, representing 4.6%; and "runny nose", present in 122,766 entries, representing 4.5% (Figure 2B).

Validating text mining on EHRs
Data from patients who were admitted with COVID-19 at the Hospital São Paulo, stayed in the hospital for more than 30 days, and were followed at the PCDU after discharge were evaluated. To validate the PTC method on these data obtained from Hospital São Paulo, 398 post-COVID patient questionnaires collected from the PCDU (Figure S2) were cross-checked. The dictionary derived from records of SARS-hospitalized patients was applied. Medical conditions and symptoms from these post-COVID-19 patients were extracted and studied by using an automated method. The results obtained were compared with those obtained through manual searches conducted by specialists, which showed a high degree of similarity in present, absent and negated terms.
According to this method, the similarity ranged from 93% to 99% for medical condition terms and from 87% to 95% for symptom terms (Table S2). The statistical significance of these findings is reflected in the p values for all terms, which were less than 0.01 (Figure 3, Table S2). The study population was divided into individuals who reported no symptoms (29.1%) and those with at least one symptom 30 days after the onset of COVID-19 (70.9%) (Table 2). Demographic characteristics were similar between these two groups; however, patients with three or more medical conditions showed more post-COVID-19 symptoms 30 days after discharge from the hospital than individuals without comorbidities (24% with symptoms versus 17% without symptoms). For patients who presented with at least one symptom, the most prevalent symptom was dyspnea (77.7%), followed by cough (21.3%) and fatigue (13.5%). Low oxygen saturation (below 27.3%) was the most common continuous variable reported. In terms of lifestyle, 25.9% were former smokers. A total of 48.6% of the population with symptoms after 30 days had hypertension, 26.9% had diabetes, and 15.2% had obesity (Figure 4).

Discussion
Three developments resulted from the present study. First, we built a text mining workflow that was able to extract structured medical information from clinical notes in Brazilian Portuguese. Second, this method, in conjunction with the validated text tokens, could be used as a platform for future analyses of long COVID in hospitals that use different systems. Finally, the method was applied back to the training dataset (SIVEP-Gripe), enriching the national database and resulting in more detailed clinical characterizations of SARS in Brazil in the last decade. The method developed for text mining of clinical data was based on grouping synonyms by phoneme.
Our method was able to extract clinical information that was not previously available as variables, with a total informational gain of 32.30% for the 30 categories of comorbidities and symptoms from the records of hospitalized SARS patients. Furthermore, we validated our method against human labeling using electronic records from patients who returned to the post-COVID-19 unit 30 days after being discharged, which allowed us to describe the clinical findings related to long COVID in those patients. The initial difficulty was structuring a database from a set of unstructured data that would allow subsequent analysis of a disease such as COVID-19 and its post-acute symptoms, characterized as long COVID. The benchmarks were previous studies on COVID-19 and vaccine effectiveness that used national health system datasets and national databases to form cohorts for studying the effectiveness of the different vaccines administered in Brazil 18,21. Thus, it was possible to enrich the same dataset and cross-check the informational gain using data from patients who were admitted to the UH and who, after discharge, were followed up at the PCDU due to various symptoms. After defining the sample, advanced methodologies were used for data extraction from the database, and sensitivity analyses were used to define the modeling. The inclusion and exclusion criteria were based on the creation of a dictionary containing the most frequent long COVID symptoms, organized from medical records using International Classification of Diseases (ICD) codes and the PTC method. This structure allowed the extraction of data from unstructured text to enrich the study population information from EHRs. The method developed in this study exhibited good performance and was subsequently used to investigate the effects of long COVID in patients who were admitted to the UH and were followed for several months after being discharged.
Phonemic representation has been used previously to cluster variations in writing and represent these clusters as an n-gram, but this is the first time that it has been used for clinical notes in Brazilian Portuguese 20 . This plot captured groups of variations in terms, such as close synonyms, abbreviations, and typographical errors typical of the language, which confirmed the validation and interpretability of the PTC method. Importantly, the construction of this method allowed for a more accurate analysis of symptoms in patients followed by the PCDU of Hospital São Paulo, which showed that the majority of individuals presented dyspnea as a prevalent symptom, often accompanied by low oxygen saturation. These data are in accordance with other studies that used different methods, including the studies that reported low oxygen saturation during physical exercise 22,23 . Since dyspnea is one of the most frequent and well-documented symptoms of long COVID, it is notable that it was detected by our study and method and provided further information concerning low oxygen saturation. In addition, other symptoms, such as fatigue and muscle pain, were detected that had been described by other authors 24 , corroborating the quality of the new method to extract symptoms from non-structured data. Importantly, the curation and constant maintenance of the dictionary will be continued, and we will update the dictionary with new information and terms used by services. Thus, new qualifiers of clinical conditions, such as different degrees of dyspnea and the evolution of these clinical conditions over time, which may encompass periods of improvement and worsening, will be included in the dictionary. In addition, creating specific platforms to characterize and identify a little-known and difficult-to-diagnose condition, such as long COVID, represents an important advance for data modeling and decision-making after the occurrence of COVID-19. 
The tool created from the methods used in this study makes it possible to analyze data in the language in which the medical records are written and combines machine and human checking, which can overcome the lack of homogeneity across different records and allow more accurate results. These results are important. It should be emphasized, however, that the risks of death and hospitalization remained statistically high in different phases of the pandemic, particularly among those who were hospitalized during the acute phase of SARS-CoV-2 infection and in countries such as Brazil 25, where a high number of cases were reported; the substantial number of individuals with COVID-19 sequelae must therefore also be considered. Since there is also evidence of COVID-19 sequelae in individuals who were not hospitalized, it is crucial to emphasize the importance of treating those who are infected and preventing reinfections. Therefore, reducing the risk of long-term sequelae remains a need in terms of public health and health policies. Finally, there are still many gaps and regional disparities in long COVID research. In particular, there are significant geographic gaps in the available research data, with an abundance of studies originating from Northern Hemisphere populations and a paucity of information regarding long COVID in low- and middle-income countries. There is a critical need for more focused research in these regions. Therefore, the use of NLP to evaluate unstructured EHRs provides a great opportunity to improve the knowledge of long COVID in resource-limited settings. The method and modeling presented in this work and the use of cohorts of data to predict and treat long COVID will be crucial, and more studies should be performed not only to increase knowledge but also to develop the necessary care and rehabilitation methods, in addition to the planning and capacity of the primary health care system.
In this context, studies such as the present one should be expanded to help understand long COVID and predict its effects. The results of these studies will allow the development of prevention or treatment systems that will achieve higher quality standards in population health even in the face of the pandemic. Declarations Conflict of Interest: The authors declare no conflicts of interest related to the present work. Availability of Data and Materials: Due to the nature of the research, due to [ethical/legal/commercial] supporting data is not available. Acknowledgments: The authors acknowledge Dr. Lucia Pellanda; Dr. Ethel Maciel; Dr. Adhemar Arthur Chioro; and Dr. Nisia Trindade for their support and discussion on this work. We also acknowledge the support of Fiotec- Fiocruz, FAP-Unifesp, CNPq 400504/2023-5 and FAPESP 2019/02821-8. References Elpeltagy M, Sallam H. Automatic prediction of COVID – 19 from chest images using modified ResNet50. Multimed Tools Appl 2021; 80: 26451–26463. Abbar S, Mokbel M. The role of AI in digital contact tracing. In: Leveraging Artificial Intelligence in Global Epidemics . Elsevier, 2021, pp 203–221. Chowdhury MEH, Rahman T, Khandakar A, Mazhar R, Kadir MA, Mahbub ZB et al. Can AI Help in Screening Viral and COVID-19 Pneumonia? IEEE Access 2020; 8: 132665–132676. Cau R, Faa G, Nardi V, Balestrieri A, Puig J, Suri JS et al. Long-COVID diagnosis: From diagnostic to advanced AI-driven models. European Journal of Radiology 2022; 148: 110164. Ke Y-Y, Peng T-T, Yeh T-K, Huang W-Z, Chang S-E, Wu S-H et al. Artificial intelligence approach fighting COVID-19 with repurposing drugs. Biomedical Journal 2020; 43: 355–362. Chang Z, Zhan Z, Zhao Z, You Z, Liu Y, Yan Z et al. Application of artificial intelligence in COVID-19 medical area: a systematic review. J Thorac Dis 2021; 13: 7034–7053. Névéol A, Dalianis H, Velupillai S, Savova G, Zweigenbaum P. 
Clinical Natural Language Processing in languages other than English: opportunities and challenges. J Biomed Semant 2018; 9: 12. Sheikhalishahi S, Miotto R, Dudley JT, Lavelli A, Rinaldi F, Osmani V. Natural Language Processing of Clinical Notes on Chronic Diseases: Systematic Review. JMIR Med Inform 2019; 7: e12239. Wei W-Q, Teixeira PL, Mo H, Cronin RM, Warner JL, Denny JC. Combining billing codes, clinical notes, and medications from electronic health records provides superior phenotyping performance. Journal of the American Medical Informatics Association 2016; 23: e20–e27. Nurek M, Rayner C, Freyer A, Taylor S, Järte L, MacDermott N et al. Recommendations for the recognition, diagnosis, and management of long COVID: a Delphi study. Br J Gen Pract 2021; 71: e815–e825. Soriano JB, Murthy S, Marshall JC, Relan P, Diaz JV. A clinical case definition of post-COVID-19 condition by a Delphi consensus. The Lancet Infectious Diseases 2022; 22: e102–e107. McGrath LJ, Scott AM, Surinach A, Chambers R, Benigno M, Malhotra D. Use of the Postacute Sequelae of COVID-19 Diagnosis Code in Routine Clinical Practice in the US. JAMA Netw Open 2022; 5: e2235089. Kingery JR, Safford MM, Martin P, Lau JD, Rajan M, Wehmeyer GT et al. Health Status, Persistent Symptoms, and Effort Intolerance One Year After Acute COVID-19 Infection. J GEN INTERN MED 2022; 37: 1218–1225. Bowe B, Xie Y, Al-Aly Z. Acute and postacute sequelae associated with SARS-CoV-2 reinfection. Nat Med 2022; 28: 2398–2405. Bowe B, Xie Y, Al-Aly Z. Postacute sequelae of COVID-19 at 2 years. Nat Med 2023; 29: 2347–2357. Ranzani OT, Bastos LSL, Gelli JGM, Marchesi JF, Baião F, Hamacher S et al. Characterisation of the first 250 000 hospital admissions for COVID-19 in Brazil: a retrospective analysis of nationwide data. The Lancet Respiratory Medicine 2021; 9: 407–418. Oliveira EA, Colosimo EA, E Silva ACS, Mak RH, Martelli DB, Silva LR et al. 
Risk factors for COVID-19 mortality in hospitalised children and adolescents in Brazil – Authors’ reply. The Lancet Child & Adolescent Health 2021; 5: e40–e42. Cerqueira-Silva T, Andrews JR, Boaventura VS, Ranzani OT, De Araújo Oliveira V, Paixão ES et al. Effectiveness of CoronaVac, ChAdOx1 nCoV-19, BNT162b2, and Ad26.COV2.S among individuals with previous SARS-CoV-2 infection in Brazil: a test-negative, case-control study. The Lancet Infectious Diseases 2022; 22: 791–801. Florentino PTV, Alves FJO, Cerqueira-Silva T, de Araújo Oliveira V, Júnior JBS, Penna GO et al. Effectiveness of BNT162b2 booster after CoronaVac primary regimen in pregnant people during omicron period in Brazil. The Lancet Infectious Diseases 2022; 22: 1669–1670. Bird S, Klein E, Loper E. Natural language processing with Python . 1st ed. O’Reilly: Beijing; Cambridge [Mass.], 2009. Cerqueira-Silva T, Katikireddi SV, De Araujo Oliveira V, Flores-Ortiz R, Júnior JB, Paixão ES et al. Vaccine effectiveness of heterologous CoronaVac plus BNT162b2 in Brazil. Nat Med 2022; 28: 838–843. Schäfer H, Teschler M, Mooren FC, Schmitz B. Altered tissue oxygenation in patients with post COVID-19 syndrome. Microvascular Research 2023; 148: 104551. Guarnieri G, Lococo S, Bertagna De Marchi L, Cecchetto A, Molena B, Arcaro G et al. Persistent oxygen desaturation during exercise in patients with long COVID. In: 01.05 - Clinical respiratory physiology, exercise and functional imaging . European Respiratory Society, 2022, p 3725. Global Burden of Disease Long COVID Collaborators, Wulf Hanson S, Abbafati C, Aerts JG, Al-Aly Z, Ashbaugh C et al. Estimated Global Proportions of Individuals With Persistent Fatigue, Cognitive, and Respiratory Symptom Clusters Following Symptomatic COVID-19 in 2020 and 2021. JAMA 2022; 328: 1604. Katikireddi SV, Cerqueira-Silva T, Vasileiou E, Robertson C, Amele S, Pan J et al. 
Two-dose ChAdOx1 nCoV-19 vaccine protection against COVID-19 hospital admissions and deaths over time: a retrospective, population-based cohort study in Scotland and Brazil. The Lancet 2022; 399: 25–35. Tables Tables 1-2 are available in the Supplementary Files section. Additional Declarations (Not answered) Supplementary Files Table1.xlsx Table 1 Table2.xlsx Table 2 SupplFlorentinoetal20240314.docx Suppl Editorial decision: revise 14 Jun, 2024 Review # 2 received at journal 07 Jun, 2024 Review # 3 received at journal 03 Jun, 2024 Reviewer # 3 agreed at journal 17 May, 2024 Reviewer # 2 agreed at journal 06 May, 2024 Reviewer # 1 agreed at journal 13 Apr, 2024 Reviewers invited by journal 13 Apr, 2024 Submission checks completed at journal 13 Apr, 2024 Editor assigned by journal 13 Apr, 2024 First submitted to journal 13 Apr, 2024
COVID's clinical condition in hospitalized patients","fulltext":[{"header":"Introduction","content":"\u003cp\u003eAdvances in emerging technologies such as artificial intelligence (AI) and machine learning (ML) hold promise for the development of healthcare transformation in prediction, contact tracing, screening, diagnosis and treatment, significantly improving medical practice \u003csup\u003e1\u0026ndash;6\u003c/sup\u003e. One potential utility of AI is to assist in the extraction of information from electronic medical records. Natural language processing (NLP) is a subfield of AI that enables computers to learn from unstructured medical records and adapt to new language patterns over time, which can be useful for administrative and research purposes. Although efforts have been made to use NLP to extract information from medical records in the English language, studies in languages other than English are still emerging\u003csup\u003e7\u003c/sup\u003e and are urgently needed for specific systems.\u003c/p\u003e \u003cp\u003eHistorically, clinically relevant information from electronic health records (EHRs) has been extracted via manual review by clinical experts, resulting in scalability and cost\u003csup\u003e8\u003c/sup\u003e. This is particularly evident for chronic diseases, where clinical notes are more common than structured medical records data\u003csup\u003e9\u003c/sup\u003e. These unstructured data provide a great opportunity to test the performance of NLP in automatically extracting clinically meaningful information, which may be usefull for research and administrative purpose\u003csup\u003e8\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eLong COVID is a chronic disease characterized by persistence of symptoms for more than one month (according to Central of Disease Control- CDC) or more than three months (according to World Health Organization - WHO) that still lacks a definitive clinical characterization. 
New tools that help perform a meticulous analysis of vast amounts of unstructured data from EHR can uncover patterns, symptoms, and outcomes that might otherwise elude traditional research methods enabling a deeper comprehension of Long COVID. Recent descriptions of long COVID are based on studies conducted and compiled in 2021 and more detailed studies conducted in 2022 with the aim of reaching a consensus\u003csup\u003e10\u0026ndash;12\u003c/sup\u003e. Based on these studies, many works have sought to highlight what actually occurs after SARS-CoV-2 infection. It remains unclear why the virus causes so many different symptoms affecting a variety of systems and what defines their frequency and prevalence. Thus, understanding what happens after COVID-19 and identifying which sequelae correspond to the postacute disease are increasingly important. Several symptoms can persist for many months or years, in addition to elevated risks of complications and death\u003csup\u003e13,14\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eMore recently, studies have provided more evidence about the existence of a large set of symptoms related to post acute COVID-19, termed long COVID\u003csup\u003e15\u003c/sup\u003e. However, there are still may doubts and inconsistencies related to long COVID, especially regarding patients who were hospitalized, as well as the correlation between the length of hospitalization and the severity of the disease. Therefore, it is necessary to investigate and follow patients who were hospitalized with COVID-19 in different hospitals and to clearly identify the risk factors that require attention after COVID-19 and how these factors affect patients\u0026rsquo; lives after the disease.\u003c/p\u003e \u003cp\u003eIn this study, we sought to classify and automatically extract data from a Long covid survey from a large hospital in the city of S\u0026atilde;o Paulo, one of the Brazilian cities most affected by the pandemic. 
We analyzed the EHR and created a model that can be applied in other hospitals.\u003c/p\u003e"},{"header":"Materials and Methods","content":"\u003cp\u003e\u003cstrong\u003eStudy design and Data source\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis is a cross sectional study that uses a national database for training a model of NLP and EHR from a referral hospital for testing the model. The training dataset provided the tokens for text mining. We then performed text mining on the testing dataset and compared the results with those of manual human classification.\u003c/p\u003e\n\u003cp\u003eThe training dataset was built from the information system for severe acute respiratory illness (SIVEP-Gripe), in which all COVID-19 hospital admissions and deaths in Brazil were registered by federal law. The SIVEP-Gripe is the national registration database? for severe acute respiratory syndrome (SARS) in Brazil including COVID-19 data, and all COVID-19 hospitalizations and deaths. SARS is defined as an individual who presents with dyspnea/respiratory discomfort, persistent pressure or pain in the chest, oxygen saturation less than 95% without oxygen, or cyanosis of the lips or face (https://www.gov.br/saude/pt-br/assuntos/coronavirus/artigos/definicao-e-casos-suspeitos). This database has been widely used as a source for other epidemiological studies \u003csup\u003e16\u0026ndash;19\u003c/sup\u003e. In the present work, the\u0026nbsp;SIVEP-Gripe was used to create a token dictionary for unstructured text from clinical questionnaires. 
The dataset is publicly available at https://opendatasus.saude.gov.br/group/dados-sobre-srag.\u003c/p\u003e\n\u003cp\u003eThe testing dataset was built from\u0026nbsp;the EHR of patients who were hospitalized for COVID-19 at the Hospital S\u0026atilde;o Paulo, the University Hospital (UH) of the Federal University of S\u0026atilde;o Paulo (Unifesp), from March 2020 to June 2022 and were followed after discharge at a Post-COVID-19 Disease Unit (PCDU). The PCDU is a multidisciplinary unit where health professionals assist patients and administer a questionnaire to gather information on any prolonged signs or symptoms after the acute phase of COVID-19. This questionnaire includes information on acute COVID-19, evolution of the infection, medical history, and post-acute phase signs and symptoms. These data were also linked to demographic information and SARS-CoV-2 PCR results from patients.\u003c/p\u003e\n\u003cp\u003eHospital S\u0026atilde;o Paulo uses multiple information systems,\u0026nbsp;resulting in the dispersion of relevant information across different databases. As an initial search strategy to extract data, terminologies related to the infectious disease COVID-19 were applied to the Clinical Notes Database (MongoDB) for the period from March 1, 2020 to September 30, 2022. A total of eight clinical encounter forms were preselected from this data structure, and the clinical records of 16,017 patients were collected. The form data were extracted and grouped into eight JSON (JavaScript object notation) files that were converted to the Tidy Data format and saved as CSV (comma separated values) text files. Following the analysis of the files by the technical-scientific team, the clinical encounter form \u0026quot;Post-COVID Care (Pneumo)\u0026quot;, containing records of 440 patients, was chosen for the analysis of demographic and hospital historical data. 
The second database of interest was the general patient record, a relational database (Oracle) that was queried using the Standard Query Language (SQL). We extracted data on emergency room visits, outpatient consultations, appointments, exam results, hospitalizations, and surgeries. The extracted data from this database were stored in *.XLSX (Microsoft Excel 2007) file format. The extracted medical conditions and patient symptoms were validated with the assistance of ambulatory pneumology and infectiology at the UH.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eClinical data collection\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe SIVEP-Gripe dataset (training dataset), which contains \u0026ge;2.6 million entries and information on medical conditions and symptoms in the form of unstructured text, was used for the training dataset. Comorbidity and symptom information was extracted from this system via a phoneme approach using the metaphonept-br library (https://github.com/carlosjordao/metaphone-ptbr). All data processing was run in Python using Jupyter Notebooks.\u003c/p\u003e\n\u003cp\u003eBefore beginning the text analysis, it was necessary to normalize and clean the text strings. To this end, we utilized regular expressions to clean special characters and to make a diverse set of separators between words uniform to a whitespace. The phonetic text clustering (PTC) method groups terms together according to their phonetics, effectively consolidating variations of similar terms into a single phonemic representation and using n-gram text analysis to detect compound words\u0026nbsp;\u003csup\u003e20\u003c/sup\u003e. 
This method not only captured and grouped terms but also allowed the accommodation of synonyms, abbreviations, typographical errors, and the different conjunctions found in Brazilian Portuguese.\u003c/p\u003e\n\u003cp\u003eTo ensure the accuracy of these results, a dictionary of similar terms was carefully curated by five specialists in internal medicine, infectology, pharmacology,\u0026nbsp;pathology, otorhinolaryngology and public health and three medical students. This step prevented the grouping of different terms into the same phoneme. This curation process was crucial in enabling the use of the dictionary to identify medical conditions from unstructured text from different and more complex contexts.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003ePTC validation with long COVID questionnaires\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eTo validate the PTC method, the clinical information of patients collected from the long COVID questionnaire was organized into structured and unstructured data (Testing dataset). The structured data consisted of yes\u0026ndash;no and multiple-choice answers, as well as numerical variables. The unstructured data comprised textual patient reports, including records of symptoms, clinical signs, laboratory tests, previous medical conditions, and lifestyle habits, such as smoking.\u003c/p\u003e\n\u003cp\u003eSubsequently, an automated approach to process the unstructured variables from the long COVID questionnaires was applied. We searched for all previously defined terms in the curated dictionary and focused on the most frequent comorbidities associated with COVID-19, which included obesity, hypertension, diabetes mellitus, chronic obstructive pulmonary disease (COPD), asthma, hypothyroidism, and hyperthyroidism. Information on patients\u0026apos; smoking history was also collected to classify patients as smokers or former smokers. 
Medical records that did not include information on comorbidities or smoking history were classified as having no comorbidities or having a non-smoking history.\u003c/p\u003e\n\u003cp\u003eIn addition, information on long COVID symptoms (cough, fatigue, headache and myalgia) (https://www.cdc.gov/coronavirus/2019-ncov/long-term-effects/index.html) was collected. For symptoms, the focus was on terms in the questionnaires that described patient information in the present, excluding the text referring to symptoms reported in the past (acute phase).\u003c/p\u003e\n\u003cp\u003eImportantly, the medical condition terms detected in the unstructured text of the Long COVID dataset could be in the context of a negative report, i.e., the patient confirming or denying the medical condition. To address this issue, we assessed negative operators such as \u0026ldquo;deny\u0026rdquo; (\u003cem\u003enega\u003c/em\u003e, in Brazilian Portuguese) that appeared before the comorbidity of interest and until the following sentence with regular expressions (\u003ca href=\"https://docs.python.org/3/library/re.html\"\u003ehttps://docs.python.org/3/library/re.html\u003c/a\u003e). This allowed us to capture instances where patients denied having the specified condition. Figure 1 shows a diagram detailing the method developed in this study using the extracted dataset.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eLong COVID study population\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe study population included adults (\u0026ge;18 years) who were hospitalized due to acute COVID-19 at the Hospital S\u0026atilde;o Paulo\u0026nbsp;and discharged. The data were collected between March 2020 and June 2022. 
We excluded individuals who 1) had no SARS-CoV-2 PCR result or had a negative result; 2) did not have a recorded date of first symptoms; 3) completed questionnaires less than 30 days after the first occurrence of symptoms; 4) were hospitalized for more than 120 days; 5) did not have records of COVID-19 evolution during the acute phase; 6) had date inconsistencies; (7) had encounters after the first questionnaire; or (8) had no severity classification in the acute phase.\u003c/p\u003e\n\u003cp\u003eDemographic variables from the long COVID dataset, such as sex, age (stratified into 18-39, 40-59, 60-79, and \u0026ge;80 years), and race (divided into white, black, mixed-brown, Asian, and Indigenous), were evaluated. Due to the small sample size, Asian and Indigenous individuals were combined for the analysis. Additionally, other variables, such as medical history, length of hospital stay (stratified into 0-14, 15-30, 31-60, and \u0026ge;60 days), and severity of the acute phase of COVID-19, were included. The severity categories were defined as moderate (non-ICU ward), severe (intensive care unit; ICU), or critical (ICU with mechanical ventilation).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eStatistical analysis\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eTo validate the accuracy of the automated approach, we compared the automated results with manually searched and labeled clinical data. The manual labeling was performed by six of the authors with clinical training, and each record was individually labeled three times. We conducted Pearson\u0026apos;s chi-square test to compare the automated and manual term counts to assess the accuracy of the text mining. 
Then, we performed a descriptive analysis of the long COVID findings to validate our findings against those of previous studies on the topic.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eEthical approval\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe Brazilian National Commission in Research Ethics approved the research protocol (CONEP approval number 4.921.308 and CAAE registration no. 58619822.6.1001.5505).\u003c/p\u003e"},{"header":"Results","content":"\u003cp\u003e\u003cstrong\u003eAutomated labeling of the training dataset\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eFirst, we investigated the records from the SIVEP-Gripe.\u0026nbsp;A total of\u0026nbsp;2,490,196 SARS records of patients admitted to hospitals between December 31, 2019, and March 27, 2023 were collected. All records were then analyzed as input for the PTC tokenization of medical conditions and symptoms, which were used to create the dictionary that was used to structure the data and to create the database for long COVID. Overall, 635,921 (25.5%) records reported one or more medical conditions, and 849,976 (34.1%) reported one or more SARS-related symptoms in the unstructured text field (Figure S1 and Table 1).\u0026nbsp;From the unstructured clinical data, a dictionary collecting\u0026nbsp;synonyms, misspelled and derivative words into a unique term (Table S1)\u0026nbsp;was produced.\u0026nbsp;Based on this dictionary, 20 of the most frequent medical conditions and\u0026nbsp;10 of the most frequent symptoms (Table 1) were captured for further analyses.\u003c/p\u003e\n\u003cp\u003eSARS patient records were stratified by medical conditions and symptoms in a \u0026quot;yes/no\u0026apos;\u0026apos; format, such as diabetes mellitus, obesity, cardiopathy, loss of smell, loss of taste, fatigue and cough. 
The results showed that\u0026nbsp;22,458\u0026nbsp;of the\u0026nbsp;terms containing medical conditions captured from the unstructured text overlapped with at least one of the binary comorbidities with a \u0026quot;yes\u0026quot; response in the questionnaire. In addition, 1,418 terms overlapped with at least one of the binary symptom variables with a \u0026quot;yes\u0026quot; response. Thus, to evaluate the gain of information, records with overlapping medical conditions or symptoms were excluded. The terms that were not included in the binary variables from the questionnaire appeared more frequently in the unstructured text annotation (Figure 2).\u003c/p\u003e\n\u003cp\u003eAmong medical conditions, the most frequent term captured by the automated reading was \u0026quot;hypertension\u0026quot;, present in 303,109 entries, representing\u0026nbsp;11.3%\u0026nbsp;of the total database (Figure 2A), followed by \u0026ldquo;smoker\u0026rdquo; in 61,110 entries, representing 2.3%; \u0026quot;hypothyroidism\u0026quot; in 39,550 entries, representing\u0026nbsp;1.5%; and \u0026quot;COPD\u0026quot; in 30,387 entries, representing\u0026nbsp;1.1%. Additionally, \u0026quot;smoking\u0026quot; was found in 61,110 entries (2.3%). 
Among symptoms, the most frequent terms were \u0026quot;headache\u0026quot;, present in 215,225 entries, representing\u0026nbsp;8.0% of the total database; \u0026quot;myalgia\u0026quot;, present in 213,035 entries, representing\u0026nbsp;7.9%; \u0026quot;asthenia\u0026quot;, present in 124,086 entries, representing\u0026nbsp;4.6%; and \u0026quot;runny nose\u0026quot;, present in 122,766 entries, representing 4.5% (Figure 2B).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eValidating text mining on EHRs\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eData from patients who were admitted with COVID-19 at the Hospital S\u0026atilde;o Paulo, stayed in the hospital for more than 30 days, and were followed at the PCDU after discharge were evaluated.\u003c/p\u003e\n\u003cp\u003eTo validate the PTC method on these data obtained from Hospital S\u0026atilde;o Paulo,\u0026nbsp;398 post-COVID patient questionnaires collected from the PCDU (Figure S2) were cross-checked. The dictionary derived from records of SARS-hospitalized patients was applied. Medical conditions and symptoms from these post-COVID-19 patients were extracted and studied by using an automated method. The results obtained were compared with those obtained through manual searches conducted by specialists, which showed a high degree of similarity in present, absent and negated terms. According to this method, the similarity ranged from 93% to 99% for medical condition terms and from 87% to 95% for symptom terms (Table S2). The statistical significance of these findings is reflected in the p values for all terms, which were less than 0.01 (Figure 3, Table S2).\u003c/p\u003e\n\u003cp\u003eThe study population was divided into individuals who reported no symptoms (29.1%) and those with at least one symptom 30 days after the onset of COVID-19 (70.9%) (Table 2). 
Demographic characteristics were similar between these two groups; however, patients with three or more medical conditions showed more post-COVID-19 symptoms after 30 days of discharge from the hospital than individuals without comorbidities (24% with symptoms against 17% without symptoms).\u003c/p\u003e\n\u003cp\u003eFor patients who presented with at least one symptom, the most prevalent symptom was dyspnea (77.7%), followed by cough (21.3%) and fatigue (13.5%). Low oxygen saturation (below 27.3%) was the most common continuous variable reported. In terms of lifestyle, 25.9% were former smokers. A total of 48.6% of the population with symptoms after 30 days had hypertension, 26.9% had diabetes, and 15.2% had obesity (Figure 4).\u003c/p\u003e"},{"header":"Discussion","content":"\u003cp\u003eIn the present study, three different developments resulted from this study. First, we built a text mining workflow that was able to extract structured medical information from clinical notes in Brazilian Portuguese. Second, this method, in conjunction with the validated text tokens, could be used as a platform for future analyses of long COVID in hospitals that use different systems. Finally, the method was applied back to the training dataset (SIVEP-Gripe), enriching the national database and resulting in more detailed clinical characterizations of SARS in Brazil in the last decade.\u003c/p\u003e\n\u003cp\u003eThe method developed for text mining of clinical data was based on grouping synonyms by phoneme. Our method was able to extract clinical information that was not available previously as variables, with a total informational gain of 32.30% for the 30 categories of comorbidities and symptoms from the records of hospitalized SARS patients. 
Furthermore, we validated our method against human labeling using electronic records from patients who returned to the post-COVID-19 unit 30 days after discharge, which allowed us to describe the clinical findings related to long COVID in those patients.\u003c/p\u003e\n\u003cp\u003eThe initial difficulty was structuring a database from a set of unstructured data that would allow subsequent analysis of a disease such as COVID-19 and its post-acute symptoms, characterized as long COVID. The benchmarks were national databases and previous studies on COVID-19 and vaccine effectiveness that used national health system datasets to form cohorts for evaluating the different vaccines administered in Brazil\u0026nbsp;\u003csup\u003e18,21\u003c/sup\u003e. Thus, it was possible to enrich the same dataset and cross-check the informational gain using data from patients who were admitted to the UH and who, after discharge, were followed up at the PCDU due to various symptoms.\u003c/p\u003e\n\u003cp\u003eAfter defining the sample, advanced methodologies were used for data extraction from the database, and sensitivity analyses were used to define the modeling. The inclusion and exclusion criteria were based on the creation of a dictionary containing the most frequent long COVID symptoms, organized from medical records using International Classification of Diseases (ICD) codes and the PTC method. This structure allowed the extraction of data from unstructured text to enrich the study population information from EHRs.\u003c/p\u003e\n\u003cp\u003eThe method developed in this study exhibited good performance and was subsequently used to investigate the effects of long COVID in patients who were admitted to the UH and were followed for several months after being discharged. 
Phonemic representation has been used previously to cluster variations in writing and represent these clusters as an n-gram, but this is the first time that it has been used for clinical notes in Brazilian Portuguese\u0026nbsp;\u003csup\u003e20\u003c/sup\u003e. This plot captured groups of variations in terms, such as close synonyms, abbreviations, and typographical errors typical of the language, which confirmed the validation and interpretability of the PTC method.\u003c/p\u003e\n\u003cp\u003eImportantly, the construction of this method allowed for a more accurate analysis of symptoms in patients followed by the PCDU of Hospital S\u0026atilde;o Paulo, which showed that the majority of individuals presented dyspnea as a prevalent symptom, often accompanied by low oxygen saturation. These data are in accordance with other studies that used different methods, including the studies that reported low oxygen saturation during physical exercise\u003csup\u003e22,23\u003c/sup\u003e. Since dyspnea is one of the most frequent and well-documented symptoms of long COVID, it is notable that it was detected by our study and method and provided further information concerning low oxygen saturation. In addition, other symptoms, such as fatigue and muscle pain, were detected that had been described by other authors\u003csup\u003e24\u003c/sup\u003e, corroborating the quality of the new method to extract symptoms from non-structured data.\u003c/p\u003e\n\u003cp\u003eImportantly, the curation and constant maintenance of the dictionary will be continued, and we will update the dictionary with new information and terms used by services. Thus, new qualifiers of clinical conditions, such as different degrees of dyspnea and the evolution of these clinical conditions over time, which may encompass periods of improvement and worsening, will be included in the dictionary. 
In addition, creating specific platforms to characterize and identify a little-known and difficult-to-diagnose condition, such as long COVID, represents an important advance for data modeling and decision-making after the occurrence of COVID-19. The tool built from the methods used in this study can analyze data in the language in which medical records are written and combines machine and human checking, which can overcome the lack of homogeneity across records and yield more accurate results. These results are important. However, the risks of death and hospitalization remained high in different phases of the pandemic, particularly among those who were hospitalized during the acute phase of SARS-CoV-2 infection and in countries such as Brazil\u0026nbsp;\u003csup\u003e25\u003c/sup\u003e, where a high number of cases was reported; the substantial number of individuals with COVID-19 sequelae must therefore also be considered. Since there is also evidence of COVID-19 sequelae in individuals who were not hospitalized, it is crucial to emphasize the importance of treating those who are infected and preventing reinfections. Therefore, reducing the risk of long-term sequelae remains a need in terms of public health and health policies.\u003c/p\u003e\n\u003cp\u003eFinally, there are still many gaps and regional disparities in long COVID research. In particular, there are significant geographic gaps in the available research data, with an abundance of studies originating from Northern Hemisphere populations and a paucity of information regarding long COVID in low- and middle-income countries. There is a critical need for more focused research in these regions. 
Therefore, the use of NLP to evaluate nonstructured EHRs provides a great opportunity to improve the knowledge of long COVID in resource-limited settings.\u003c/p\u003e\n\u003cp\u003eThe method and modeling presented in this work, together with the use of data cohorts to predict and treat long COVID, will be crucial. More studies should be performed not only to increase knowledge but also to develop the necessary care and rehabilitation methods, as well as the planning and capacity of the primary health care system. In this context, studies such as the present one should be expanded to help understand long COVID and predict its effects. The results of these studies will allow the development of prevention or treatment systems that will achieve higher quality standards in population health even in the face of the pandemic.\u003c/p\u003e"},{"header":"Declarations","content":"Conflict of Interest:\nThe authors declare no conflicts of interest related to the present work.\nAvailability of Data and Materials:\nDue to the nature of the research, due to [ethical/legal/commercial] supporting data is not available.\nAcknowledgments:\nThe authors acknowledge Dr. Lucia Pellanda; Dr. Ethel Maciel; Dr. Adhemar Arthur Chioro; and Dr. Nisia Trindade for their support and discussion on this work. We also acknowledge the support of Fiotec-Fiocruz, FAP-Unifesp, CNPq 400504/2023-5 and FAPESP 2019/02821-8.\n"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eElpeltagy M, Sallam H. Automatic prediction of COVID\u0026thinsp;\u0026ndash;\u0026thinsp;19 from chest images using modified ResNet50. Multimed Tools Appl 2021; 80: 26451\u0026ndash;26463.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAbbar S, Mokbel M. The role of AI in digital contact tracing. In: \u003cem\u003eLeveraging Artificial Intelligence in Global Epidemics\u003c/em\u003e. 
Elsevier, 2021, pp 203\u0026ndash;221.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChowdhury MEH, Rahman T, Khandakar A, Mazhar R, Kadir MA, Mahbub ZB \u003cem\u003eet al.\u003c/em\u003e Can AI Help in Screening Viral and COVID-19 Pneumonia? IEEE Access 2020; 8: 132665\u0026ndash;132676.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCau R, Faa G, Nardi V, Balestrieri A, Puig J, Suri JS \u003cem\u003eet al.\u003c/em\u003e Long-COVID diagnosis: From diagnostic to advanced AI-driven models. European Journal of Radiology 2022; 148: 110164.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKe Y-Y, Peng T-T, Yeh T-K, Huang W-Z, Chang S-E, Wu S-H \u003cem\u003eet al.\u003c/em\u003e Artificial intelligence approach fighting COVID-19 with repurposing drugs. Biomedical Journal 2020; 43: 355\u0026ndash;362.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChang Z, Zhan Z, Zhao Z, You Z, Liu Y, Yan Z \u003cem\u003eet al.\u003c/em\u003e Application of artificial intelligence in COVID-19 medical area: a systematic review. J Thorac Dis 2021; 13: 7034\u0026ndash;7053.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eN\u0026eacute;v\u0026eacute;ol A, Dalianis H, Velupillai S, Savova G, Zweigenbaum P. Clinical Natural Language Processing in languages other than English: opportunities and challenges. J Biomed Semant 2018; 9: 12.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSheikhalishahi S, Miotto R, Dudley JT, Lavelli A, Rinaldi F, Osmani V. Natural Language Processing of Clinical Notes on Chronic Diseases: Systematic Review. JMIR Med Inform 2019; 7: e12239.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWei W-Q, Teixeira PL, Mo H, Cronin RM, Warner JL, Denny JC. Combining billing codes, clinical notes, and medications from electronic health records provides superior phenotyping performance. 
Journal of the American Medical Informatics Association 2016; 23: e20\u0026ndash;e27.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNurek M, Rayner C, Freyer A, Taylor S, J\u0026auml;rte L, MacDermott N \u003cem\u003eet al.\u003c/em\u003e Recommendations for the recognition, diagnosis, and management of long COVID: a Delphi study. Br J Gen Pract 2021; 71: e815\u0026ndash;e825.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSoriano JB, Murthy S, Marshall JC, Relan P, Diaz JV. A clinical case definition of post-COVID-19 condition by a Delphi consensus. The Lancet Infectious Diseases 2022; 22: e102\u0026ndash;e107.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMcGrath LJ, Scott AM, Surinach A, Chambers R, Benigno M, Malhotra D. Use of the Postacute Sequelae of COVID-19 Diagnosis Code in Routine Clinical Practice in the US. JAMA Netw Open 2022; 5: e2235089.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKingery JR, Safford MM, Martin P, Lau JD, Rajan M, Wehmeyer GT \u003cem\u003eet al.\u003c/em\u003e Health Status, Persistent Symptoms, and Effort Intolerance One Year After Acute COVID-19 Infection. J GEN INTERN MED 2022; 37: 1218\u0026ndash;1225.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBowe B, Xie Y, Al-Aly Z. Acute and postacute sequelae associated with SARS-CoV-2 reinfection. Nat Med 2022; 28: 2398\u0026ndash;2405.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBowe B, Xie Y, Al-Aly Z. Postacute sequelae of COVID-19 at 2 years. Nat Med 2023; 29: 2347\u0026ndash;2357.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRanzani OT, Bastos LSL, Gelli JGM, Marchesi JF, Bai\u0026atilde;o F, Hamacher S \u003cem\u003eet al.\u003c/em\u003e Characterisation of the first 250 000 hospital admissions for COVID-19 in Brazil: a retrospective analysis of nationwide data. 
The Lancet Respiratory Medicine 2021; 9: 407\u0026ndash;418.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eOliveira EA, Colosimo EA, E Silva ACS, Mak RH, Martelli DB, Silva LR \u003cem\u003eet al.\u003c/em\u003e Risk factors for COVID-19 mortality in hospitalised children and adolescents in Brazil \u0026ndash; Authors\u0026rsquo; reply. The Lancet Child \u0026amp; Adolescent Health 2021; 5: e40\u0026ndash;e42.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCerqueira-Silva T, Andrews JR, Boaventura VS, Ranzani OT, De Ara\u0026uacute;jo Oliveira V, Paix\u0026atilde;o ES \u003cem\u003eet al.\u003c/em\u003e Effectiveness of CoronaVac, ChAdOx1 nCoV-19, BNT162b2, and Ad26.COV2.S among individuals with previous SARS-CoV-2 infection in Brazil: a test-negative, case-control study. The Lancet Infectious Diseases 2022; 22: 791\u0026ndash;801.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFlorentino PTV, Alves FJO, Cerqueira-Silva T, de Ara\u0026uacute;jo Oliveira V, J\u0026uacute;nior JBS, Penna GO \u003cem\u003eet al.\u003c/em\u003e Effectiveness of BNT162b2 booster after CoronaVac primary regimen in pregnant people during omicron period in Brazil. The Lancet Infectious Diseases 2022; 22: 1669\u0026ndash;1670.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBird S, Klein E, Loper E. \u003cem\u003eNatural language processing with Python\u003c/em\u003e. 1st ed. O\u0026rsquo;Reilly: Beijing; Cambridge [Mass.], 2009.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCerqueira-Silva T, Katikireddi SV, De Araujo Oliveira V, Flores-Ortiz R, J\u0026uacute;nior JB, Paix\u0026atilde;o ES \u003cem\u003eet al.\u003c/em\u003e Vaccine effectiveness of heterologous CoronaVac plus BNT162b2 in Brazil. Nat Med 2022; 28: 838\u0026ndash;843.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSch\u0026auml;fer H, Teschler M, Mooren FC, Schmitz B. Altered tissue oxygenation in patients with post COVID-19 syndrome. 
Microvascular Research 2023; 148: 104551.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGuarnieri G, Lococo S, Bertagna De Marchi L, Cecchetto A, Molena B, Arcaro G \u003cem\u003eet al.\u003c/em\u003e Persistent oxygen desaturation during exercise in patients with long COVID. In: \u003cem\u003e01.05 - Clinical respiratory physiology, exercise and functional imaging\u003c/em\u003e. European Respiratory Society, 2022, p 3725.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGlobal Burden of Disease Long COVID Collaborators, Wulf Hanson S, Abbafati C, Aerts JG, Al-Aly Z, Ashbaugh C \u003cem\u003eet al.\u003c/em\u003e Estimated Global Proportions of Individuals With Persistent Fatigue, Cognitive, and Respiratory Symptom Clusters Following Symptomatic COVID-19 in 2020 and 2021. JAMA 2022; 328: 1604.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKatikireddi SV, Cerqueira-Silva T, Vasileiou E, Robertson C, Amele S, Pan J \u003cem\u003eet al.\u003c/em\u003e Two-dose ChAdOx1 nCoV-19 vaccine protection against COVID-19 hospital admissions and deaths over time: a retrospective, population-based cohort study in Scotland and Brazil. The Lancet 2022; 399: 25\u0026ndash;35.\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"},{"header":"Tables","content":"\u003cp\u003eTables 1-2 are available in the Supplementary Files section.\u003c/p\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":true,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"cell-death-and-disease","isNatureJournal":false,"hasQc":false,"allowDirectSubmit":false,"externalIdentity":"cddis","sideBox":"Learn more about [Cell Death \u0026 Disease](http://www.nature.com/cddis/)","snPcode":"41419","submissionUrl":"https://mts-cddis.nature.com/cgi-bin/main.plex","title":"Cell Death \u0026 Disease","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"ejp","reportingPortfolio":"Nature AJ","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"","lastPublishedDoi":"10.21203/rs.3.rs-4262099/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-4262099/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eLong COVID is characterized by persistent symptoms beyond established timeframes, presenting a significant challenge in understanding its clinical manifestations and implications. In this study, we present a novel application of natural language processing (NLP) techniques to automatically extract unstructured data from a Long COVID survey conducted at a prominent university hospital in S\u0026atilde;o Paulo, Brazil. Our phonetic text clustering (PTC) method enables the exploration of unstructured EHR data to unify different written forms of similar terms into a single phonemic representation. We use n-gram text analysis to detect compound words and negated terms in Portuguese-BR, focusing on medical conditions and symptoms related to Long COVID. By leveraging NLP, we aim to contribute to a deeper understanding of this chronic condition and its implications for healthcare systems worldwide. 
The model developed in this study has the potential for scalability and applicability in other healthcare settings, facilitating broader research efforts and informing clinical decision-making for Long COVID patients.\u003c/p\u003e","manuscriptTitle":"Natural Language Processing method to Unravel Long COVID's clinical condition in hospitalized patients","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2024-04-22 02:50:18","doi":"10.21203/rs.3.rs-4262099/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"revise","date":"2024-06-14T08:20:34+00:00","index":"","fulltext":""},{"type":"editorInvitedReview","content":"This content is not available.","date":"2024-06-07T14:50:42+00:00","index":2,"fulltext":"This content is not available."},{"type":"editorInvitedReview","content":"This content is not available.","date":"2024-06-03T15:16:54+00:00","index":3,"fulltext":"This content is not available."},{"type":"reviewerAgreed","content":"This content is not available.","date":"2024-05-17T08:58:41+00:00","index":3,"fulltext":"This content is not available."},{"type":"reviewerAgreed","content":"This content is not available.","date":"2024-05-06T12:20:15+00:00","index":2,"fulltext":"This content is not available."},{"type":"reviewerAgreed","content":"This content is not available.","date":"2024-04-13T15:19:16+00:00","index":1,"fulltext":"This content is not available."},{"type":"reviewersInvited","content":"","date":"2024-04-13T15:15:59+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2024-04-13T15:10:17+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2024-04-13T14:22:12+00:00","index":"","fulltext":""},{"type":"submitted","content":"Cell Death \u0026 Disease","date":"2024-04-13T14:22:11+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"cell-death-and-disease","isNatureJournal":false,"hasQc":false,"allowDirectSubmit":false,"externalIdentity":"cddis","sideBox":"Learn more about [Cell Death \u0026 Disease](http://www.nature.com/cddis/)","snPcode":"41419","submissionUrl":"https://mts-cddis.nature.com/cgi-bin/main.plex","title":"Cell Death \u0026 Disease","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"ejp","reportingPortfolio":"Nature AJ","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"474a5412-ab2e-45c5-aaec-d5369a77294a","owner":[],"postedDate":"April 22nd, 2024","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"published-in-journal","subjectAreas":[{"id":30638793,"name":"Health sciences/Medical research/Epidemiology"},{"id":30638794,"name":"Health sciences/Diseases/Infectious diseases/Viral infection"}],"tags":[],"updatedAt":"2024-09-14T07:09:59+00:00","versionOfRecord":{"articleIdentity":"rs-4262099","link":"https://doi.org/10.1038/s41419-024-07043-4","journal":{"identity":"cell-death-and-disease","isVorOnly":false,"title":"Cell Death \u0026 Disease"},"publishedOn":"2024-09-13 04:00:00","publishedOnDateReadable":"September 13th, 2024"},"versionCreatedAt":"2024-04-22 02:50:18","video":"","vorDoi":"10.1038/s41419-024-07043-4","vorDoiUrl":"https://doi.org/10.1038/s41419-024-07043-4","workflowStages":[]},"version":"v1","identity":"rs-4262099","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-4262099","identity":"rs-4262099","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}