Evaluating the performance of artificial intelligence in supporting evidence synthesis: a blinded comparison between chatbots and humans

Research Article

Kim Nordmann, Stefanie Sauter, Mirjam Stein, Johanna Aigner, Marie-Christin Redlich, and 2 more

This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-5710671/v1 This work is licensed under a CC BY 4.0 License. Status: Published. Journal Publication published 30 May, 2025 in BMC Medical Research Methodology. Version 1 posted 11

Abstract

Background With the rise of large language models, the application of artificial intelligence in research is expanding, possibly accelerating specific stages of the research processes. 
This study aims to compare the accuracy, completeness and relevance of chatbot-generated responses against human responses in evidence synthesis as part of a scoping review. Methods We employed a structured, randomised and blinded survey-based research methodology to analyse and compare responses between two human researchers and four chatbots (ZenoChat, ChatGPT 3.5, ChatGPT 4.0, and ChatFlash) to questions based on a pre-coded sample of 407 articles. These questions were part of an evidence synthesis of a scoping review dealing with digitally supported interaction between healthcare workers. Results The analysis revealed no significant differences in judgments of correctness between answers by chatbots and those given by humans. However, chatbots’ answers were found to recognise the context of the original text better, and they provided more complete, albeit longer, responses. Human responses were less likely to add new content to the original text or include interpretation. Amongst the chatbots, ZenoChat provided the best-rated answers, followed by ChatFlash, with ChatGPT 3.5 and ChatGPT 4.0 tying for third. Correct contextualisation of the answer was positively correlated with completeness and correctness of the answer. Conclusions Chatbots powered by large language models may be a useful tool to accelerate qualitative evidence synthesis. Given the current speed of chatbot development and fine-tuning, the successful applications of chatbots to facilitate research will very likely continue to expand over the coming years. 
Keywords: artificial intelligence, chatbot, large language model, ChatGPT, ChatFlash, ZenoChat

Background

Since the public launch of ChatGPT (OpenAI) in November 2022, chatbots powered by large language models have received increasing attention from the public, politicians and science.[1] Their usability and challenges have been debated across various sectors, particularly in human-related areas such as education or healthcare.[2, 3] These advanced language models are trained on vast repositories of data and tuned to mimic human conversation.[4] Their scale, pre-training mechanisms, contextual understanding and flexibility set them apart from previous machine learning tools.[5, 6] With the rise of large language models, the application of artificial intelligence in analysing complex datasets, making predictions, and supporting literature reviews has opened up various new opportunities, particularly for healthcare.[5, 6] In addition, its application in research is expanding due to the anticipated potential to accelerate the research process and possibly improve transparency.[7–9] Especially in evidence synthesis such as systematic reviews, where the mean duration from literature search to publication extends beyond a year, large language models could expedite the preparation of evidence-based guidelines and thereby positively impact medical practice.[10–12] In qualitative research, ChatGPT performed well in reproducing specific themes, but less well in establishing interpretative themes and creating depth when coding inductively.[7, 13, 14] Qualitative research tools, such as MAXQDA and ATLAS.ti, have integrated artificial intelligence tools in partnership with OpenAI to support users at various stages of their research process.[13] Chatbots might further enhance and accelerate research by supporting tasks such as writing search strings and summaries.[8, 9, 15] While ChatGPT is a prominent example, it is only one of 
numerous chatbots to employ large language models.[4] In recent years, the market has seen the introduction of multiple alternatives. Despite sharing a similar core technology, their different training and fine-tuning processes may result in variations in their response generation and capabilities.[16] The aim of this study is to compare the accuracy, completeness and relevance of different chatbot-extracted responses to questions about pre-coded article excerpts with human responses, contributing to an advanced understanding of the use of chatbots in evidence synthesis.

Methods

We deployed a randomised and blinded, survey-based process to assess the performance of chatbots in supporting evidence synthesis.

2.1 Context and sample

As part of a scoping review on digitally supported interaction between healthcare workers, 407 articles were included for data extraction and manually analysed.[17] The analysis scheme was informed by Greenhalgh et al.’s NASSS framework (Nonadoption, Abandonment, and Challenges to the Scale-Up, Spread, and Sustainability of Health and Care Technologies).[18] This framework contains seven categories with specific questions on each. It was translated into a coding framework of seven codes, which, when applied to the 407 articles, produced a total of 504,736 words (1,070 pages) of coded text. From a subsample of articles, i.e., those pertaining to the ambulant setting, five articles were chosen at random for each of the seven codes to form the basis of the comparison exercise. As one of the seven codes was only used twice in the subset, both articles were chosen. Different parts of one article that had been coded with the same code were combined into text passages that varied in length and complexity. The text passages were then presented to two researchers and four chatbots with the task of answering the questions for each code, including all relevant information. 
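The per-code random selection described above can be illustrated with a minimal Python sketch. It is not the authors' procedure: the article identifiers, the pool size of 12 articles per code, the choice of `code_7` as the code that was used only twice, and the function name `sample_articles_per_code` are all hypothetical and serve only to show the logic of drawing up to five articles per code.

```python
import random


def sample_articles_per_code(articles_by_code, k=5, seed=None):
    """Draw up to k articles per code without replacement.

    If a code has fewer than k articles, all of them are selected.
    """
    rng = random.Random(seed)
    selection = {}
    for code, article_ids in articles_by_code.items():
        n = min(k, len(article_ids))
        selection[code] = rng.sample(article_ids, n)
    return selection


# Hypothetical pools: six codes with 12 candidate articles each ...
articles_by_code = {
    f"code_{i}": [f"article_{i}_{j}" for j in range(12)] for i in range(1, 7)
}
# ... and one code that was applied in only two articles (assumed to be code 7).
articles_by_code["code_7"] = ["article_7_0", "article_7_1"]

selection = sample_articles_per_code(articles_by_code, k=5, seed=42)
total_passages = sum(len(ids) for ids in selection.values())
print(total_passages)  # 6 * 5 + 2 = 32
```

Under these assumptions the draw yields 32 articles: five for each of six codes plus both articles of the sparsely used code.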
A coded text passage under each code is given as an example jointly with an answer provided by one of the four chatbots or a human researcher at the end of this section (Table 1).

Table 1 Overview of codes. (Code; Question; Example text passage; Example answer)

Code 1. Condition
Example text passage (reference 1): […] health information technology (HIT) systems for shared electronic medication plans. These systems enable patients and health care professionals to use and manage a common list of current medications across sectors and settings. Shared electronic medication plans have great potential to improve medication management and patient safety, […]
Question 1A. Why was the technology introduced?
Example answer (ChatGPT3.5): - The technology (HIT systems for shared electronic medication plans) was introduced to improve medication management and patient safety.
Question 1B. Is there a specific disease area mentioned?
Example answer (ChatGPT3.5): - The paragraph does not mention a specific disease area. The focus is on medication management and patient safety in a broader context.

Code 2. Technology
Example text passage (reference 2): 12 A multidisciplinary digital communication tool, Congredi (Fast Guide, Oud Gastel, the Netherlands), was implemented to improve communication about care among providers. It includes an interactive multidisciplinary care plan and a secure e-mailing channel. It was selected on the basis of jointly developed functional specifications; The Tool Congredi is a communication tool for providers, and it was originally developed for primary care by a general practitioner. It consists of an option to share data in a care plan and asynchronous communication options using secure email. The included multidisciplinary care plan was based on the social, functional, mental, physical, and communication problem inventory domain model.15 Providers received a practical instruction training session lasting 4 hours and demonstrating use of the Congredi tool. The patient had to give permission to begin a record and to invite providers to link. After patient permission was obtained, a provider opened a Congredi record for the patient and filled in a care plan. Other providers who were involved with the patient and had access to Congredi could be invited by link so that they could view the care plan and use it interactively. Provider actions were recorded in the system. The following are three main categories of actions: care actions (problem assessment, defining care goals, observing patients' health status, and adapting care goals), e-mailing, and inviting other providers to link. All providers had the ability to update the care plan and initiate actions; they received alerts when there were e-mails in their inboxes.
Question 2A. What are the key features of the technology?
Example answer (ChatGPT4.0):
- An online platform for creating, using, and managing a list of all the medications a patient was taking and had taken in the past.
- Pilot project: addition of a shared electronic medication plan system onto the existing eHealth platform; new collaborative model based on the patient-GP-pharmacy triad to make primary care medication management safer in cases involving polypharmacy.
- information exchange between the patient and an intervening health care professional.
Question 2B. What knowledge and/or support is required to use the technology?
Example answer (ChatGPT4.0): No information is provided in the given paragraph.
Question 2C. What is the technology supply model?
Example answer (ChatGPT4.0):
- The tool is a communication tool for providers, particularly developed for primary care.
- Originally developed by a general practitioner.
- Providers receive training on its use, and patient permission is obtained to initiate a record.
Question 2D. What is said current technology use and uptake?
Example answer (ChatGPT4.0): - No information is provided in the given paragraph.

Code 3. Value proposition
Example text passage (reference 3): Among those who did have concerns, concerns for both personal and patient privacy were top of the list (Fig. 3).; In this study, we found that after 1 year of use, the implementation of a private Facebook group for a large, multi-site FHT was ultimately not successful, despite cross-site communication having previously been identified to be in need of improvement by FHT members.12; Use of the group depended heavily on the study team and was not sustained. Survey respondents expressed high levels of concern about both patient and provider privacy and viewed the Facebook group as unnecessary, adding extra time and work to their already busy days; Our survey results suggest that FHT members were inherently suspicious of Facebook, even with guides on privacy settings in place and the recommendation to create a separate profile.
Question 3B.1. What is the technology's positive value proposition in terms of e.g., desirability, efficacy, safety, and cost effectiveness amongst others, excluding beneficial effects on staff?
Example answer (Researcher B): - No information is provided in the given paragraph
Question 3B.2. What is the technology's negative value proposition in terms of e.g., desirability, efficacy, safety, and cost effectiveness amongst others, excluding beneficial effects on staff?
Example answer (Researcher B):
- High levels of concern about both patient and provider privacy (using a Facebook group). Facebook group was viewed as unnecessary, adding extra time and work to the already busy day of the participants
- FHT members were inherently suspicious of Facebook, even with guides on privacy settings in place and the recommendation to create a separate profile

Code 4. Adopters
Example text passage (reference 4): 4.2. Professional tasks and responsibilities reconfigured The informants reported changes in how much information they sent and received due to the simpler means of communication. We might say that more exchange of information is the structural effect of the introduction of e-messaging that, in turn, leads to unforeseen changes in tasks and responsibilities. Two nurses report how they use e-messages below: Nurse A: We answer questions from the GP. We write general information if they [patients] have an appointment with the GP, and if we think they won’t remember to bring up everything that they should; then we send some kind of update in advance, so that all issues are addressed. Interviewer: Before the implementation of the e-messages system, did you provide this information via telephone? Nurse B: Yes, but not always, because it was so cumbersome via the telephone, so you would just skip it. I think e-messaging has made it easier. You can inform via a few keystrokes instead of waiting on the telephone. Nurse B points to the fact that she informs the GP more often after the introduction of e-messaging, for example, before a patient arrives for his/her appointment with the GP. It means that the GP may receive information that is more comprehensive and up-to-date, ensuring a better quality of service. Similarly, some of the nurses reported that they use e-messages to inform the GP that patients have been admitted to hospital, signalling that the GP should be prepared for changes in the patients’ medications. GPs are also of the opinion that they send more messages. However, one highlighted a problematic aspect of communication becoming too easy: We send more messages. Earlier, we used to let them accumulate until Tuesdays [when they had meetings with homecare], and then some problems were solved at that time. Now we send them at once, right. The volume of the messages they [homecare] receive has increased as a result. And that might be a problem. Cause we often ‘shoot from the hip’. We see a problem, and then we send it [a message], so we won’t forget. Earlier, we would have thought, “can we try to solve it?”, “can we try to find out?”, “can we ask the patient to bring the pill dispenser?”, “can we ask for a discharge note from the hospital?” It [the increased volume of messages] becomes a burden for homecare services. (GP municipality B) The homecare nurses did not cite message overload or messages not carefully thought out as frequent problems. Thus, there may be a discrepancy between what is comprehended as ‘too much’ from a GP’s perspective and from a homecare nurse’s perspective. It is also necessary to keep in mind that we are examining individual experiences. However, some nurses did talk about receiving too much/irrelevant information via e-message, but this was explained as a consequence of (poor) functionality in the e-message system, where some message types automatically imported extensive amounts of information from the EPR and attached it to the message. The interviews show agreement across the informants that more information is sent and received, indicating that e-message users have more knowledge of patients than do nonusers. Two unintended consequences are worth noting from the example quotes presented above: (1) e-messaging may become a structural means of change, allowing nurses to become more active organisers and facilitators of GP’s work, which in turn can mean better follow-up of patients. (2) It may also be argued that e-messaging becomes a tool for getting things done and eliminating cognitive overload but in doing so unnecessarily transfers one’s own work to another actor.; 4.3. Empowering nurses Another issue associated with HIT that was not addressed prior to the introduction of e-messaging is the power balance between homecare nurses and GPs. Shifts in the balance of power may occur as a result of the new form of communication. Through the automatic documentation of communication provided by e-messaging, the nurses felt their requests had more weight. Two quotes illustrate this: First, it is documented every time we have made an inquiry. And they [GPs] can’t ignore an e-message. When there is a message, they have to answer. So, that is very much an advantage. (nurse municipality A) We haven’t been spoilt when it comes to having contact with the GP. It’s like... we try to get hold of them, and they are supposed to call back, but it doesn’t always happen. Sometimes it feels like we are not taken seriously. But when you have sent an e-message, then it is there [in their EPR system]. And they have to open it, and they have to send an answer. (nurse municipality A) When the nurses talked about what e-messages meant to them, they often described the positive effect of the automatic documentation on their dialogues with the GPs. Previously, messages for GPs were often mediated through their medical secretaries and could get ‘lost’, according to the nurses – meaning that the GP never called back. According to the nurses, requests documented in the EPR have more weight than a post-it note from a secretary, and they found that it resulted in a more conscientious follow-up from the GP. Furthermore, nurses used phrases such as “having their back covered” and “having documented that you have done your job”. Such expressions illustrate how nurses use e-messages for attaining accountability and to fulfil the demands of the public, their colleagues, patients and next of kin. E-messaging might be viewed as a tool that empowers nurses if used consciously.
Question 4A. What changes in staff roles, practices and routines, and identities are implied?
Example answer (ZenoChat):
- Changes in staff roles, practices, and routines are implied by the introduction of e-messaging.
- More exchange of information is the structural effect of the introduction of e-messaging, which leads to unforeseen changes in tasks and responsibilities.
- Nurses use e-messages to inform the GP more often and with more comprehensive and up-to-date information.
- E-messaging becomes a structural means of change, allowing nurses to become more active organizers and facilitators of GP's work, which in turn can mean better follow-up of patients.
- E-messaging becomes a tool for getting things done and eliminating cognitive overload but in doing so unnecessarily transfers one's own work to another actor.
- The power balance between homecare nurses and GPs may shift as a result of the new form of communication.
- Through the automatic documentation of communication provided by e-messaging, nurses feel their requests have more weight.
- Requests documented in the EPR (Electronic Patient Record) have more weight than a post-it note from a secretary, resulting in a more conscientious follow-up from the GP.
- E-messaging might be viewed as a tool that empowers nurses if used consciously.
Question 4B. What is the technology's desirability for staff? (Including beneficial effects on staff?)
Example answer (ZenoChat):
- The technology's desirability for staff is not explicitly mentioned in the given paragraph.
- However, e-messaging may become a tool that empowers nurses and allows them to become more active organizers and facilitators of GP's work, which could lead to improved job satisfaction.

Code 5. Organisation
Example text passage (reference 5): In Australia, video consultations have been funded by Medicare (the universal health insurance system) since 2011; however, general practitioners are not rebated to deliver telehealth services, only to attend patients in person to support video consultations from medical specialists […] Overall, it was very challenging to make video consultations a routine method of delivering general medical services to long term care facilities. Existing communications infrastructure was usually not suitable, with bandwidth being the greatest barrier.
Question 5A. What is the organization’s readiness and/or capacity for technology-supported change?
Example answer (Researcher B): - Video consultations were very challenging as a routine method of delivering general medical services to long term care facilities, as existing communications infrastructure was usually not suitable, amongst others due to bandwidth
Question 5B. What is said about adoption and funding?
Example answer (Researcher B):
- Video consultations have been funded by Medicare (the universal health insurance system) since 2011
- A problem with funding is that general practitioners are not rebated to deliver telehealth services, only to attend patients in person to support video consultations from medical specialists;
Question 5C. What work is involved in implementation and who will do it?
Example answer (Researcher B): - The paragraph does not provide specific information about the work involved in implementation or who will do it.

Code 6. Wider system
Example text passage (reference 6): The Norwegian primary health care is managed by the municipalities and includes home health care, with approximately 31,000 nurses and 4,000 GPs being contracted. These two services are organized separately, governed and financed differently, and also have incompatible EHR systems; (b) Legislation does not permit them to share EHR systems, but they are obliged to exchange patient information when required.28 The nurses in home health care and GPs have many common patients, thus indicating that ensuring a good clinical communication is essential.
Question 6A. What is the context for program rollout (e.g., political, economic, regulatory, professional (e.g., medicolegal), and sociocultural)?
Example answer (Researcher A):
- Home health nurses and GPs work in different systems and are financed and governed differently.
- Home health nurses and GPs have incompatible EHR systems.
- Legislation does not allow sharing of EHR systems

Code 7. Embedding and adaptation over time
Example text passage (reference 7): 5.1. Incorporating other diseases and social determinants A comprehensive integrated care delivery system needs to respond to a broad range of diseases and risks relevant to population health. We have laid some groundwork for taking our present system beyond pregnancy and early childhood. In parallel, the team has initiated a non-communicable disease CHW follow-up program that supports patients detected at the hospital level, and we plan to initiate a surgical follow-up program in the coming year. As an entry point into the social determinants of health, we have also developed an approach to measure household expenditures on health, medical debt, and medical impoverishment, and are currently analyzing these data. Over time, we aim to incorporate these various streams of household-level health, socio-economics, and disease risk data into the comprehensive integrated care digital platform.
Question 7A. How much scope is there for adapting and coevolving the technology and the service over time?
Example answer (ChatFlash):
- The team has initiated a non-communicable disease CHW follow-up program
- The team plans to initiate a surgical follow-up program in the coming year
- They have developed an approach to measure household expenditures on health, medical debt, and medical impoverishment
- The team aims to incorporate household-level health, socio-economics, and disease risk data into the comprehensive integrated care digital platform over time.

The codes are displayed as proposed by Greenhalgh et al. in their NASSS framework (Nonadoption, Abandonment, and Challenges to the Scale-Up, Spread, and Sustainability of Health and Care Technologies).[18] The questions under each code were taken from the NASSS framework and partially adapted. A coded text passage under each code is given as an example together with an answer provided by one of the four chatbots or a human.

References for the example text passage:
1. Bugnon B, Geissbuhler A, Bischoff T, et al. 
Improving Primary Care Medication Processes by Using Shared Electronic Medication Plans in Switzerland: Lessons Learned From a Participatory Action Research Study. JMIR Form Res 2021; 5: e22319.
2. Jong CC de, Ros WJG, van Leeuwen M, et al. Professionals' Use of a Multidisciplinary Communication Tool for Patients With Dementia in Primary Care. Comput Inform Nurs 2018; 36: 193–198.
3. Lofters AK, Slater MB, Nicholas Angl E, et al. Facebook as a tool for communication, collaboration, and informal knowledge exchange among members of a multisite family health team. J Multidiscip Healthc 2016; 9: 29–34.
4. Melby L and Hellesø R. Introducing electronic messaging in Norwegian healthcare: unintended consequences for interprofessional collaboration. Int J Med Inform 2014; 83: 343–353.
5. Wade V, Whittaker F and Hamlyn J. An evaluation of the benefits and challenges of video consulting between general practitioners and residential aged care facilities. J Telemed Telecare 2015; 21: 490–493.
6. Lyngstad M, Melby L, Grimsmo A, et al. Toward Increased Patient Safety? Electronic Communication of Medication Information Between Nurses in Home Health Care and General Practitioners. Home Health Care Management & Practice 2013; 25: 203–211.
7. Citrin D, Thapa P, Nirola I, et al. Developing and deploying a community healthcare worker-driven, digitally-enabled integrated care system for municipalities in rural Nepal. Healthc (Amst) 2018; 6: 197–204.

2.2 Selection of chatbots and answer generation

We opted for comparing four chatbots: ZenoChat (TextCortex AI), ChatGPT 3.5, ChatGPT 4.0 (OpenAI) and the free version of ChatFlash (neuroflash GmbH). ChatFlash, ChatGPT 3.5 and ChatGPT 4.0 are based on GPT (Generative Pre-trained Transformer). Depending on the settings, ZenoChat uses either GPT 4.0, Sophos-2 or Mixtral as its underlying model. 
Because even small changes can affect a language model's output, we included both ChatGPT 3.5 and ChatGPT 4.0, based on GPT 3.5 and GPT 4, respectively, as well as the ZenoChat version based on Sophos-2.[16] To ensure a standardised process, all chatbots were presented with the same prompt “Use the following paragraph to extract in academic style in bullet points the answers – if any answers are provided – to the following questions:”, followed by one to four questions pertaining to the specific code. No limit was set for the word count, as texts were heterogeneous in complexity and information density. The answer generation for all chatbots was conducted in November 2023.

2.3 Randomization and survey

We used digitally supported randomisation to select three text passages of each code, resulting in a total of 20 text passages, to serve as the basis of our survey. Each text passage had four AI-written responses and two written by human beings. A survey was designed to evaluate the responses, applying a consistent evaluation framework to each question. The length of the response was measured on a three-point Likert scale: 1 = too short, 2 = appropriate length, 3 = too long. Completeness and correctness were likewise measured on a three-point Likert scale: 1 = complete / correct, 2 = partially complete / partially correct, 3 = significant part(s) are missing / content is displayed incorrectly. Three further questions evaluated the correct identification of the context (correct/incorrect) and whether the answer included any addition of new content (yes/no) and/or an interpretation beyond the original text (yes/no). Respondents could provide open-ended feedback on both the response and the original text passage (Appendix 1). We used word counts to measure the length of the responses and calculated the word ratio by dividing the answer’s word count by that of the original text. 
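The word ratio defined above can be sketched in a few lines of Python. This is an illustration under assumptions, not the authors' tooling: the study does not state how words were tokenised, so the sketch simply splits on whitespace, and the function names `word_count` and `word_ratio` are hypothetical. The passage is a sentence from the example text for code 5 in Table 1; the short answer is invented for illustration.

```python
def word_count(text: str) -> int:
    """Count words as whitespace-separated tokens (an assumed tokenisation)."""
    return len(text.split())


def word_ratio(answer: str, passage: str) -> float:
    """Answer length relative to the original passage: words(answer) / words(passage)."""
    passage_words = word_count(passage)
    if passage_words == 0:
        raise ValueError("passage must contain at least one word")
    return word_count(answer) / passage_words


# Sentence taken from the example passage for code 5 (Table 1).
passage = (
    "Existing communications infrastructure was usually not suitable, "
    "with bandwidth being the greatest barrier."
)
# Invented answer, for illustration only.
answer = "Bandwidth was the greatest barrier."

ratio = word_ratio(answer, passage)
print(f"{ratio:.2f}")  # 5 words / 13 words, i.e. about 0.38
```

A ratio near 0 indicates a heavily condensed answer, whereas a ratio above 1 would mean the answer is longer than the passage it summarises.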
Six independent survey participants (henceforth: raters) with a background in social sciences and differing professional expertise ranging from nursing to physiotherapy were recruited from a department in the Bavarian Center for Digital Health and Social Care. These individuals were not involved in deriving answers, but were familiar with the scoping review and the aim of the research. Each rater was randomly assigned to a text passage with the corresponding answers to the question and was blinded to the identity of the author in each case. In total, 120 text passages were reviewed; each text passage was presented to three raters, and the raters in each group constellation (e.g., raters A, B and C) assessed the same number of text passages. To minimise recognition bias, we standardized the formatting of answers and omitted introductory phrases such as “The provided paragraph contains information to answer the questions related to the introduction of the technology and the specific disease area:”. Figure 1 summarises the research process and Table 1 provides an overview of the codes and examples of responses provided by the chatbots and human researchers.

2.4 Data analysis

Quantitative data obtained from the surveys were exported to Microsoft Excel (Microsoft Office, Professional Plus 2021) and analysed using descriptive and comparative statistical methods in Stata (Version 16). To test for statistical differences, we used the non-parametric Kruskal-Wallis test at a significance level of p < 0.05 (1) over the entire dataset, (2) to compare responses from chatbots and humans, and (3) between pairs of answer-generating methods (e.g., Researcher A and ChatFlash, or two chatbots). The interrater reliability was calculated using Cohen’s Kappa (κ) in cases with two response possibilities and a weighted Cohen’s Kappa for three response possibilities with equally spaced weights, i.e., 0.5 if the ratings were one step apart (e.g., 
Rater A = 1 and Rater B = 2) and 1 if the ratings were maximally different (e.g., Rater A = 1 and Rater B = 3). We followed Cohen’s interpretation of Kappa, with values ≤ 0 being interpreted as no agreement, 0.01–0.20 as none to slight, 0.21–0.40 as fair, 0.41–0.60 as moderate, 0.61–0.80 as substantial and 0.81–1.00 as almost perfect agreement.[19] Qualitative data were transferred to a Microsoft Excel document and analysed inductively.

Results

The following results highlight statistically significant differences unless indicated otherwise. Analysis on the question level revealed no particular pattern within certain codes or across questions within codes. Across the entire dataset, the correctness of the responses given was similar, while all other categories revealed statistically different results (Table 2). In general, the answers provided by chatbots were perceived to demonstrate a better recognition of the context (chatbot: 92.42% vs. human: 84.85%) and were also longer than those of humans, with a mean word ratio (length of the answer vs. length of the original text) of 0.45 ± 0.50 as compared to 0.21 ± 0.26 for humans. Furthermore, the answers by chatbots were perceived to be more complete than those by humans (chatbot: 79.73% vs. human: 52.65%). Conversely, human answers were more often free of interpretation (human: 97.35% vs. chatbot: 81.44%) and of added material not included in the original text (human: 97.73% vs. chatbot: 81.82%).

Table 2 Overview of survey results.

Category | Value | Total | Chatbot (overall) | Human (overall) | p-value Chatbot vs. Human | ChatFlash | ChatGPT 3.5 | ChatGPT 4.0 | ZenoChat | Researcher A | Researcher B | p-value overall
Addition | yes | 102 | 96 (18.18%) | 6 (2.27%) | < 0.001* | 16 | 39 | 30 | 11 | 6 | 0 | < 0.001*
Addition | no | 690 | 432 (81.82%) | 258 (97.73%) | | 116 | 93 | 102 | 121 | 126 | 132 |
Completeness | complete | 560 | 421 (79.73%) | 139 (52.65%) | < 0.001* | 108 | 108 | 104 | 101 | 65 | 74 | < 0.001*
Completeness | partial | 131 | 67 (12.69%) | 64 (24.24%) | | 17 | 18 | 14 | 18 | 34 | 30 |
Completeness | incomplete | 101 | 40 (7.58%) | 61 (23.11%) | | 7 | 6 | 14 | 13 | 33 | 28 |
Context | correct | 712 | 488 (92.42%) | 224 (84.85%) | 0.001* | 125 | 116 | 123 | 124 | 110 | 114 | 0.007*
Context | incorrect | 80 | 40 (7.58%) | 40 (15.15%) | | 7 | 16 | 9 | 8 | 22 | 18 |
Correctness | correct | 574 | 390 (73.86%) | 184 (69.70%) | 0.116 | 102 | 90 | 90 | 108 | 91 | 93 | 0.051
Correctness | partial | 119 | 83 (15.72%) | 36 (13.64%) | | 20 | 25 | 26 | 12 | 16 | 20 |
Correctness | incorrect | 99 | 55 (10.42%) | 44 (16.67%) | | 10 | 17 | 16 | 12 | 25 | 19 |
Interpretation | yes | 105 | 98 (18.56%) | 7 (2.65%) | < 0.001* | 18 | 30 | 43 | 7 | 5 | 2 | < 0.001*
Interpretation | no | 687 | 430 (81.44%) | 257 (97.35%) | | 114 | 102 | 89 | 125 | 127 | 130 |
Length | too short | 93 | 19 (3.60%) | 74 (28.03%) | < 0.001* | 1 | 3 | 4 | 11 | 38 | 36 | < 0.001*
Length | perfect | 537 | 363 (68.75%) | 174 (65.91%) | | 104 | 84 | 86 | 89 | 89 | 85 |
Length | too long | 162 | 146 (27.65%) | 16 (6.06%) | | 27 | 45 | 42 | 32 | 5 | 11 |

Significant differences between groups are indicated by *. Results are presented as absolute values unless indicated otherwise; the percentages refer to the columns.

Individual comparisons revealed that addition was the only statistical difference between Researcher A and Researcher B, with Researcher A showing a slightly higher percentage (4.55%) of addition than Researcher B (0.00%). There were no statistical differences between the evaluations of the responses of ChatGPT 3.5 and ChatGPT 4.0. However, ChatGPT 3.5’s responses were longer (mean word ratio: 0.58 ± 0.70) than ChatFlash’s (mean word ratio: 0.34 ± 0.36) and ZenoChat’s (mean word ratio: 0.38 ± 0.37). Both human researchers’ responses were shorter than those of each individual chatbot, with a word ratio of 0.16 ± 0.14 for Researcher A and 0.25 ± 0.34 for Researcher B. Responses that were evaluated as being too long had a word ratio (of the length of the answer vs. 
the length of the original text) of 0.84 ± 0.65 compared to 0.27 ± 0.28 for responses of adequate length. Some 93 answers (11.74%) were evaluated as too short, with a word ratio of 0.10 ± 0.11. The completeness ratings did not differ significantly between the chatbots, which showed comparable levels of completeness ranging from 76.52% (ZenoChat) to 81.82% (ChatFlash and ChatGPT 3.5). However, the answers of every chatbot were rated as more complete than those of the human researchers, whose responses were deemed complete in 49.24% (Researcher A) and 56.06% (Researcher B) of all cases. In terms of correctness, ZenoChat’s answers were assessed as being more correct (81.82%) than those by ChatGPT 3.5 (68.18%) and ChatGPT 4.0 (68.18%) or those by the human researchers (A: 68.94% and B: 70.45%). ChatFlash’s answers (77.27%) were not evaluated significantly differently from those of the other chatbots or the humans. ChatFlash’s responses were perceived to show a better understanding of the context than ChatGPT 3.5’s (94.70% vs. 87.88% correct). Researcher B’s responses (86.36% correct) were perceived to show a poorer understanding of the context than those of ZenoChat (93.94% correct) and ChatFlash (94.70% correct), with Researcher A’s responses being additionally inferior to those of ChatGPT 4.0 (83.33% vs. 93.18% correct). The responses by ChatGPT 3.5 and ChatGPT 4.0 were evaluated as containing more addition than those by ZenoChat and ChatFlash, with 29.55% and 22.73% of answers containing an addition vs. 8.33% and 12.12%, respectively. Researcher A’s responses were perceived to contain less addition (4.55%) than those of ChatGPT 3.5 (29.55%), ChatGPT 4.0 (22.73%) and ChatFlash (12.12%); Researcher B’s responses (0.00%) also contained less addition than ZenoChat’s (8.33%). Furthermore, the responses by ChatGPT 4.0 were evaluated as containing more interpretation than those by ZenoChat and ChatFlash, with 32.58% of answers containing an interpretation vs. 5.30% and 13.64%, respectively.
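To make the two descriptive measures used in this part of the results concrete, the following minimal Python sketch computes a word ratio and per-respondent percentages. It is an illustration under stated assumptions, not the authors' analysis code: the function names, the whitespace-based word count and the example passage are hypothetical, while the counts of answers rated complete are taken from Table 2.

```python
def word_ratio(answer: str, original: str) -> float:
    """Word count of the answer divided by the word count of the original text.

    Assumption: words are separated by whitespace; the study may have counted
    words differently.
    """
    original_words = len(original.split())
    if original_words == 0:
        raise ValueError("original text must contain at least one word")
    return len(answer.split()) / original_words


def share(count: int, total: int) -> float:
    """Percentage of `count` within `total`, rounded to two decimals."""
    return round(100 * count / total, 2)


# Hypothetical passage and answer, invented purely for illustration.
original = ("The tool includes an interactive care plan and a secure "
            "messaging channel for providers.")
answer = "Interactive care plan and secure messaging."
ratio = word_ratio(answer, original)  # 6 words / 14 words

# Answers rated "complete" per respondent (Table 2); each respondent's column
# sums to 132 ratings.
complete_counts = {
    "ChatFlash": 108, "ChatGPT 3.5": 108, "ChatGPT 4.0": 104,
    "ZenoChat": 101, "Researcher A": 65, "Researcher B": 74,
}
complete_share = {name: share(n, 132) for name, n in complete_counts.items()}
```

With these inputs, `complete_share` reproduces the completeness percentages quoted in the text (for example 76.52% for ZenoChat and 49.24% for Researcher A).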
ZenoChat’s answers also contained fewer interpretations than ChatFlash’s, and ZenoChat was the only chatbot that did not provide more interpretation in its answers than the human researchers (A: 3.79%, B: 1.52%). Across all chatbots, ZenoChat provided the best-rated answers, followed by ChatFlash. An overview of the Kruskal-Wallis tests between individual chatbots and humans is available in the supplementary material (Appendix 2). Inter-rater reliability was fair, with a Cohen’s kappa of 0.30 (0.12–0.51) and a standard error of 0.11 (0.07–0.12). Cohen’s kappa for context (κ = 0.18 ± 0.09) was lower than for the other variables, which ranged from 0.27 ± 0.11 for correctness to 0.39 ± 0.10 for length. Comparative analysis between the variables revealed a moderate positive correlation between correctness and completeness (ρ = 0.63), correctness and context (ρ = 0.56), as well as length and word ratio (ρ = 0.56) (Table 3). The correlation between the variables in the human responses was higher than that for chatbots with regard to correctness and completeness (ρ = 0.71 vs. ρ = 0.60) and correctness and context (ρ = 0.72 vs. ρ = 0.44). There was a low positive correlation between context and completeness (ρ = 0.46), as well as between interpretation and addition (ρ = 0.35). A low negative correlation was found between completeness and length (ρ = -0.33), i.e., the shorter the answer, the more incomplete, as well as between correctness and addition (ρ = -0.35). In the human responses, there was a low negative correlation between length and correctness (ρ = -0.34), indicating a loss of correctness with shorter answers, and a moderate negative correlation between length and completeness (ρ = -0.66), indicating a loss of completeness with shorter answers. The answers by chatbots demonstrated a low negative correlation between addition and correctness, indicating a loss of correctness through addition of new content.
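The rank correlations reported here can be illustrated with a short, self-contained Python sketch of Spearman's ρ, computed as the Pearson correlation of tie-averaged ranks. This is not the authors' code: the ordinal coding (0 = incorrect/incomplete, 1 = partial, 2 = correct/complete), the sample ratings and the label "high" for |ρ| ≥ 0.7 are assumptions made for illustration; only the cut-offs at 0.3, 0.5 and 0.7 follow the paper's note on correlation strength.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the tie-averaged ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y need the same length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one variable is constant")
    return cov / sqrt(var_x * var_y)


def strength(rho):
    """Verbal label for |rho|; the 'high' label is an assumption of this sketch."""
    size = abs(rho)
    if size < 0.3:
        return "no/little"
    if size < 0.5:
        return "low"
    if size < 0.7:
        return "moderate"
    return "high"


# Hypothetical ordinal ratings for eight answers (not study data).
correctness = [2, 2, 1, 0, 2, 1, 0, 2]
completeness = [2, 1, 1, 0, 2, 2, 0, 2]
rho = spearman_rho(correctness, completeness)
label = strength(rho)
```

Because survey ratings take only a few distinct values, ties are frequent; averaging tied ranks before correlating them is the usual way to handle this, which is why the sketch avoids the shortcut formula that assumes no ties.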
Table 3 Correlation of questionnaire categories

| | Addition | Completeness | Context | Correctness | Interpretation | Length | Ratio |
|---|---|---|---|---|---|---|---|
| Addition | 1.00 | | | | | | |
| Completeness | -0.04 | 1.00 | | | | | |
| Context | -0.12* | 0.46* | 1.00 | | | | |
| Correctness | -0.35* | 0.63* | 0.56* | 1.00 | | | |
| Interpretation | 0.35* | 0.05 | 0.07 | -0.20* | 1.00 | | |
| Length | -0.25* | -0.33* | -0.12* | -0.02 | -0.22* | 1.00 | |
| Ratio | -0.27* | -0.22* | -0.11* | -0.02 | -0.17* | 0.56* | 1.00 |

Spearman’s rho (ρ): correlation of questionnaire categories. \(\text{Ratio} = \frac{\text{Word count}_{\text{answer}}}{\text{Word count}_{\text{original text}}}\). ρ denotes no/little correlation (|ρ| < 0.3), low correlation (0.3 ≤ |ρ| < 0.5) or moderate correlation (0.5 ≤ |ρ| < 0.7).[20] Entries marked with an * refer to p-values < 0.05.

Comments in the free text fields explained the reasons for certain ratings. For example, some raters commented on the length of the answer: an answer would be considered too long if it contained repeated, irrelevant or overly detailed information. In cases where an answer was rated as incomplete, free text fields were sometimes used to mention the expected content. Some comments criticised answers that were generated even though the original text did not mention details related to a question. Most importantly, however, remarks were made about which bullet point(s) in the response were perceived as interpretation. Interestingly, many of these bullet points contained words such as ‘potentially’, ‘suggesting’, ‘pointing to’, ‘could’, ‘appears to be’, ‘indicating’, ‘may reflect’ and ‘may lead’, which signal an interpretation. A few comments also pointed towards the responding agent needing to do a certain amount of interpretation to answer the question correctly. This ranged from being able to interpret certain abbreviations, such as ‘MOH’ as standing for ‘Ministry of Health’, through to a broad understanding of the matter at hand, as illustrated by the quote ‘The summary is not explicitly stated in the text. However, it is correct, but is only achieved through interpretation’ (Rater A).
One rater explicitly mentioned the difficulty in ‘answer[ing] the question, because it almost requires an interpretation to grasp the right aspects’ (Rater A), indicating that ‘the system therefore already needs to perform semantic or interpretative tasks in order to answer … [the] question’ (Rater B).

Discussion

The chatbots’ responses were regarded as better at recognising context and as providing more complete, albeit longer, summaries, while humans were viewed as less prone to add to or interpret the material. Among the chatbots, ZenoChat provided the best-rated answers, followed by ChatFlash in second place, with ChatGPT 3.5 and ChatGPT 4.0 tied for third place. Statistical analysis indicated a positive correlation between correct contextualisation and completeness, and between correct contextualisation and correctness of the response. Qualitative feedback highlighted that longer answers often contained redundant information and raised the question of what role interpretation plays in effectively answering the question. Indeed, correct contextualisation and the absence of addition and interpretation in the answer were important underlying factors in our research set-up. While Hamilton et al. found ChatGPT to have limited contextual understanding, our study found a clearer understanding of the context by the chatbots than was displayed in the answers of the human researchers.[14] The distinction between recognising context (for example, that ‘EHR’ stands for ‘electronic health record’) and interpreting content to answer a question is subtle and complex. This is closely linked to the debate over what constitutes the addition of new content versus the interpretation of the original text’s content. To enhance the accuracy and relevance of chatbots’ responses, precise and specific instructions in the prompts are essential.[7, 12, 21] In our study, chatbots tended to provide more extensive answers than humans, likely due to the humans’ implicit understanding of the exercise.
While humans aimed for brevity to expedite subsequent analysis, chatbots did not have this implicit knowledge, as the prompt did not contain any guidance on the response length. Since chatbots are designed to infer user expectations from the given prompt, they tend to comply with instructions rather than seeking clarification or responding according to their own skills and limitations.[4, 22] Recent research has demonstrated the efficacy of chatbots such as ChatGPT in generating discharge summaries and summaries of patients’ medical histories, and in summarising the scientific literature, achieving levels of quality and accuracy comparable to or exceeding traditional methods.[15, 23, 24] This highlights their ability to distil main ideas from a given text, mirroring the findings from our study. However, these studies did not compare the capacities of different chatbots. Some studies compared chatbots’ ability to answer complex medical examination questions. In these, ChatGPT 4.0 responded more accurately and concisely than Bard (Google), albeit scoring worse than the medical reference group.[1, 25–27] This suggests that chatbots’ ability to answer questions, as observed in our study, is contingent on the availability of an accurate reference text. Given this limitation, chatbots can be implemented in scenarios requiring idea extraction from a given resource, such as content analysis for qualitative research. The study demonstrated an example of how AI-powered chatbots can enhance and accelerate the research process and successfully support humans in conducting research endeavours such as reviews.
However, the authors, in line with other studies, highlight that human oversight and correction are essential.[8, 14, 21] While recognising chatbots’ potential, it is imperative to be aware of their shortcomings, the most important being the potential inclusion of biases, the non-disclosure of training data, incorrect information and the possibility of nonsensical responses.[1, 8, 9, 13, 23, 28–31] A major strength of this study is the comparison between different chatbots combined with an evaluation of their performance against human researchers, as most other studies only compare chatbots against one another. Despite this strength, it is essential to acknowledge the study’s constraints. Firstly, we selected four chatbots from a rapidly growing array of chatbots using different underlying large language models. Furthermore, caution is advised in generalising the findings of this study, as the text parts used for eliciting the answers were preselected by two human researchers for their topical relevance. Eliciting responses from the full text might yield different outcomes. Future research should aim at refining the prompt to better match the implicit human understanding of the context and the specific objective anticipated from the chatbot’s use.[7, 8] This entails researchers assessing their underlying assumptions and intentions, clearly defining the criteria for evaluating responses, and employing prompt engineering methods to refine the prompt.[7] As the underlying large language models of chatbots are quickly evolving, longitudinal studies are crucial to offer insights into how chatbots’ capabilities and performance are changing.
In line with Hamilton et al.’s recommendations, we also recommend assessing the chatbots’ capabilities to answer the questions when receiving the full text, instead of only the parts that humans had curated as important.[14] This might not only assess the chatbots’ abilities, but also help to identify additional information that might have escaped human judgment.[9]

Conclusion

Our study demonstrates chatbots’ ability to provide complete and correct answers to questions on a given text, which may make them a useful tool to accelerate research processes, specifically qualitative evidence synthesis. Given the speed of chatbot development and fine-tuning, the successful applications of chatbots to facilitate research will very likely continue to expand.

Abbreviations

AI: Artificial intelligence
EHR: Electronic health record
GPT: Generative Pre-trained Transformer
NASSS: Nonadoption, Abandonment, and Challenges to the Scale-Up, Spread, and Sustainability of Health and Care Technologies

Declarations

Ethics approval: As the study did not involve sensitive data, no ethical clearance was necessary. Study participants were researchers acting as experts. They either provided answers to the questions or judged the answers. The respondents of the survey were able to drop out at any time without negative consequences.

Consent for publication: Not applicable.

Availability of data and materials: The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.

Funding: This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Competing interests: The authors declare that they have no competing interests.

Authors’ contributions: KN: Conceptualization, Methodology, Formal analysis, Investigation, Resources, Data Curation, Visualization, Project administration, Writing - Original Draft.
SS: Methodology, Formal analysis, Resources, Visualization, Writing - Review & Editing. MSt: Resources, Writing - Review & Editing. JA: Resources, Writing - Review & Editing. MCR: Resources, Writing - Review & Editing. MSc: Methodology, Resources, Writing - Review & Editing. FF: Conceptualization, Methodology, Resources, Writing - Review & Editing, Supervision.

Acknowledgements: The authors would like to extend their gratitude to Setareh Rabbani and Julia Schulze Pröbsting for their participation in the study.

Declaration of generative AI in scientific writing: During the preparation of this work, the authors used GPT 4.0 in order to improve readability and language. After using this tool, the author(s) reviewed and edited the content as needed and take full responsibility for the content of the publication.

References

1. Patil NS, Huang RS, van der Pol CB, Larocque N. Comparative Performance of ChatGPT and Bard in a Text-Based Radiology Knowledge Assessment. Can Assoc Radiol J. 2024;75:344–50. doi:10.1177/08465371231193716.
2. Alshami A, Elsayed M, Ali E, Eltoukhy AEE, Zayed T. Harnessing the Power of ChatGPT for Automating Systematic Review Process: Methodology, Case Study, Limitations, and Future Directions. Systems. 2023;11:351. doi:10.3390/systems11070351.
3. Khlaif ZN, Mousa A, Hattab MK, Itmazi J, Hassan AA, Sanmugam M, Ayyoub A. The Potential and Concerns of Using AI in Scientific Research: ChatGPT Performance Evaluation. JMIR Med Educ. 2023;9:e47049. doi:10.2196/47049.
4. Meyer JG, Urbanowicz RJ, Martin PCN, O'Connor K, Li R, Peng P-C, et al. ChatGPT and large language models in academia: opportunities and challenges. BioData Min. 2023;16:20. doi:10.1186/s13040-023-00339-9.
5. Chappell M, Edwards M, Watkins D, Marshall C, Graziadio S. Machine learning for accelerating screening in evidence reviews. Cochrane Evidence Synthesis and Methods. 2023;1:e12021. doi:10.1002/cesm.12021.
6. van de Schoot R, Bruin J de, Schram R, Zahedi P, Boer J de, Weijdema F, et al. An open source machine learning framework for efficient and transparent systematic reviews. Nat Mach Intell. 2021;3:125–33. doi:10.1038/s42256-020-00287-7.
7. Zhang H, Wu C, Xie J, Lyu Y, Cai J, Carroll JM. Redefining Qualitative Analysis in the AI Era: Utilizing ChatGPT for Efficient Thematic Analysis. 2023; arXiv:2309.10771.
8. Huang Y-M, Rocha T, editors. Innovative Technologies and Learning: 6th International Conference, ICITL 2023 Porto, Portugal, August 28–30, 2023 Proceedings. Cham: Springer; 2023.
9. Christou P. A Critical Perspective Over Whether and How to Acknowledge the Use of Artificial Intelligence (AI) in Qualitative Studies. TQR. 2023;28:1981–91. doi:10.46743/2160-3715/2023.6407.
10. La Torre-López J de, Ramírez A, Romero JR. Artificial intelligence to automate the systematic review of scientific literature. Computing. 2023;105:2171–94. doi:10.1007/s00607-023-01181-x.
11. Sampson M, Shojania KG, Garritty C, Horsley T, Ocampo M, Moher D. Systematic reviews can be produced and published faster. J Clin Epidemiol. 2008;61:531–6. doi:10.1016/j.jclinepi.2008.02.004.
12. Mahuli SA, Rai A, Mahuli AV, Kumar A. Application ChatGPT in conducting systematic reviews and meta-analyses. Br Dent J. 2023;235:90–2. doi:10.1038/s41415-023-6132-y.
13. Morgan DL. Exploring the Use of Artificial Intelligence for Qualitative Data Analysis: The Case of ChatGPT. International Journal of Qualitative Methods. 2023;22:1–10. doi:10.1177/16094069231211248.
14. Hamilton L, Elliott D, Quick A, Smith S, Choplin V. Exploring the Use of AI in Qualitative Analysis: A Comparative Study of Guaranteed Income Data. International Journal of Qualitative Methods. 2023;22:1–13. doi:10.1177/16094069231201504.
15. Michalowski M, Abidi SSR, Abidi S, editors. Artificial Intelligence in Medicine: 20th International Conference on Artificial Intelligence in Medicine, AIME 2022 Halifax, NS, Canada, June 14–17, 2022 Proceedings. Cham: Springer; 2022.
16. Chen L, Zaharia M, Zou J. How is ChatGPT's behavior changing over time? 2023; arXiv:2307.09009.
17. Nordmann K, Sauter S, Möbius-Lerch P, Redlich M-C, Schaller M, Fischer F. Conceptualizing Interprofessional Digital Communication and Collaboration in Health Care: Protocol for a Scoping Review. JMIR Res Protoc. 2023;12:e45179. doi:10.2196/45179.
18. Greenhalgh T, Wherton J, Papoutsi C, Lynch J, Hughes G, A'Court C, et al. Beyond Adoption: A New Framework for Theorizing and Evaluating Nonadoption, Abandonment, and Challenges to the Scale-Up, Spread, and Sustainability of Health and Care Technologies. J Med Internet Res. 2017;19:e367. doi:10.2196/jmir.8775.
19. McHugh ML. Interrater reliability: the kappa statistic. Biochem Med. 2012;22:276–82.
20. Rovetta A. Raiders of the Lost Correlation: A Guide on Using Pearson and Spearman Coefficients to Detect Hidden Correlations in Medical Sciences. Cureus. 2020;12:e11794. doi:10.7759/cureus.11794.
21. Qureshi R, Shaughnessy D, Gill KAR, Robinson KA, Li T, Agai E. Are ChatGPT and large language models "the answer" to bringing us closer to systematic review automation? Syst Rev. 2023;12:72. doi:10.1186/s13643-023-02243-z.
22. Shen Y, Heacock L, Elias J, Hentel KD, Reig B, Shih G, Moy L. ChatGPT and Other Large Language Models Are Double-edged Swords. Radiology. 2023;307:e230163. doi:10.1148/radiol.230163.
23. Clough RAJ, Sparkes WA, Clough OT, Sykes JT, Steventon AT, King K. Transforming healthcare documentation: harnessing the potential of AI to generate discharge summaries. BJGP Open. 2024;8:BJGPO.2023.0116. doi:10.3399/BJGPO.2023.0116.
24. Scott D, Hallett C, Fettiplace R. Data-to-text summarisation of patient records: using computer-generated summaries to access patient histories. Patient Educ Couns. 2013;92:153–9. doi:10.1016/j.pec.2013.04.019.
25. Lim ZW, Pushpanathan K, Yew SME, Lai Y, Sun C-H, Lam JSH, et al. Benchmarking large language models' performances for myopia care: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Google Bard. EBioMedicine. 2023;95:104770. doi:10.1016/j.ebiom.2023.104770.
26. Buhr CR, Smith H, Huppertz T, Bahr-Hamm K, Matthias C, Blaikie A, et al. ChatGPT Versus Consultants: Blinded Evaluation on Answering Otorhinolaryngology Case-Based Questions. JMIR Med Educ. 2023;9:e49183. doi:10.2196/49183.
27. Massey PA, Montgomery C, Zhang AS. Comparison of ChatGPT-3.5, ChatGPT-4, and Orthopaedic Resident Performance on Orthopaedic Assessment Examinations. J Am Acad Orthop Surg. 2023;31:1173–9. doi:10.5435/JAAOS-D-23-00396.
28. Christou P. How to Use Artificial Intelligence (AI) as a Resource, Methodological and Analysis Tool in Qualitative Research? TQR. 2023;28:1968–80. doi:10.46743/2160-3715/2023.6406.
29. Patel SB, Lam K. ChatGPT: the future of discharge summaries? Lancet Digit Health. 2023;5:e107–e108. doi:10.1016/S2589-7500(23)00021-3.
30. Naumova EN. A mistake-find exercise: a teacher's tool to engage with information innovations, ChatGPT, and their analogs. J Public Health Policy. 2023;44:173–8. doi:10.1057/s41271-023-00400-1.
31. Stephens LD, Jacobs JW, Adkins BD, Booth GS. Battle of the (Chat)Bots: Comparing Large Language Models to Practice Guidelines for Transfusion-Associated Graft-Versus-Host Disease Prevention. Transfus Med Rev. 2023;37:150753. doi:10.1016/j.tmrv.2023.150753.

Additional Declarations

No competing interests reported.
Supplementary Files

Appendix1.pdf: Appendix 1: Survey
Appendix2.pdf: Appendix 2: Results of Kruskal-Wallis tests

Editorial history at journal (Version 1): first submitted 25 Dec, 2024; submission checks completed 31 Dec, 2024; editor assigned and reviewers invited 02 Jan, 2025; reviewers agreed 03 Jan, 2025; reviews received 15 Jan and 21 Jan, 2025; editorial decision (revision requested) 22 Jan, 2025.
It was selected on the basis of jointly developed functional specifications; The Tool Congredi is a communication tool for providers, and it was originally developed for primary care by a general practitioner. It consists of an option to share data in a care plan and asynchronous communication options using secure email. The included multidisciplinary care plan was based on the social, functional, mental, physical, and communication problem inventory domain model.15 Providers received a practical instruction training session lasting 4 hours and demonstrating use of the Congredi tool. The patient had to give permission to begin a record and to invite providers to link. After patient permission was obtained, a provider opened a Congredi record for the patient and filled in a care plan. Other providers who were involved with the patient and had access to Congredi could be invited by link so that they could view the care plan and use it interactively. Provider actions were recorded in the system. The following are three main categories of actions: care actions (problem assessment, defining care goals, observing patients' health status, and adapting care goals), e-mailing, and inviting other providers to link. 
All providers had the ability to update the care plan and initiate actions; they received alerts when there were e-mails in their inboxes.\u003csup\u003e2\u003c/sup\u003e\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u003cem\u003eChatGPT4.0\u003c/em\u003e:\u003c/p\u003e \u003cp\u003e-An online platform for creating, using, and managing a list of all the medications a patient was taking and had taken in the past.\u003c/p\u003e \u003cp\u003e- Pilot project: addition of a shared electronic medication plan system onto the existing eHealth platform; new collaborative model based on the patient-GP-pharmacy triad to make primary care medication management safer in cases involving polypharmacy.\u003c/p\u003e \u003cp\u003e- information exchange between the patient and an intervening health care professional.\u003c/p\u003e \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e2B. What knowledge and/or support is required to use the technology?\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u003cem\u003eChatGPT4.0\u003c/em\u003e: No information is provided in the given paragraph.\u003c/p\u003e \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e2C. 
What is the technology supply model?\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u003cem\u003eChatGPT4.0\u003c/em\u003e:\u003c/p\u003e \u003cp\u003e- The tool is a communication tool for providers, particularly developed for primary care.\u003c/p\u003e \u003cp\u003e- Originally developed by a general practitioner.\u003c/p\u003e \u003cp\u003e- Providers receive training on its use, and patient permission is obtained to initiate a record.\u003c/p\u003e \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e2D. What is said about current technology use and uptake?\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u003cem\u003eChatGPT4.0\u003c/em\u003e:\u003c/p\u003e \u003cp\u003e- No information is provided in the given paragraph.\u003c/p\u003e \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e3. Value proposition\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e3B.1. What is the technology's positive value proposition in terms of e.g., desirability, efficacy, safety, and cost effectiveness amongst others, excluding beneficial effects on staff?\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eAmong those who did have concerns, concerns for both personal and patient privacy were top of the list (Fig.\u0026nbsp;3).; In this study, we found that after 1 year of use, the implementation of a private Facebook group for a large, multi-site FHT was ultimately not successful, despite cross-site communication having previously been identified to be in need of improvement by FHT members.12; Use of the group depended heavily on the study team and was not sustained. 
Survey respondents expressed high levels of concern about both patient and provider privacy and viewed the Facebook group as unnecessary, adding extra time and work to their already busy days; Our survey results suggest that FHT members were inherently suspicious of Facebook, even with guides on privacy settings in place and the recommendation to create a separate profile.\u003csup\u003e3\u003c/sup\u003e\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u003cem\u003eResearcher B\u003c/em\u003e:\u003c/p\u003e \u003cp\u003e- No information is provided in the given paragraph\u003c/p\u003e \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e3B.2. What is the technology's negative value proposition in terms of e.g., desirability, efficacy, safety, and cost effectiveness amongst others, excluding beneficial effects on staff?\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u003cem\u003eResearcher B\u003c/em\u003e:\u003c/p\u003e \u003cp\u003e-\u0026nbsp;High levels of concern about\u0026nbsp;both patient and provider privacy (using a Facebook group)\u003c/p\u003e \u003cp\u003e-\u0026nbsp;Facebook group was viewed as unnecessary, adding extra time and work to the already busy day of the participants\u003c/p\u003e \u003cp\u003e-\u0026nbsp;FHT members were inherently suspicious of Facebook, even with guides on privacy\u0026nbsp;settings in place and the recommendation to create a separate profile\u003c/p\u003e \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e4. Adopters\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e4A. 
What changes in staff roles, practices and routines, and identities are implied?\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003e4.2. Professional tasks and responsibilities reconfigured\u003c/p\u003e \u003cp\u003eThe informants reported changes in how much information they sent and received due to the simpler means of communication. We might say that more exchange of information is the structural effect of the introduction of e-messaging that, in turn, leads to unforeseen changes in tasks and responsibilities. Two nurses report how they use e-messages below:\u003c/p\u003e \u003cp\u003eNurse A: We answer questions from the GP. We write general information if they [patients] have an appointment with the GP, and if we think they won’t remember to bring up everything that they should; then we send some kind of update in advance, so that all issues are addressed.\u003c/p\u003e \u003cp\u003eInterviewer: Before the implementation of the e-messages system, did you provide this information via telephone?\u003c/p\u003e \u003cp\u003eNurse B: Yes, but not always, because it was so cumbersome via the telephone, so you would just skip it. I think e-messaging has made it easier. You can inform via a few keystrokes instead of waiting on the telephone.\u003c/p\u003e \u003cp\u003eNurse B points to the fact that she informs the GP more often after the introduction of e-messaging, for example, before a patient arrives for his/her appointment with the GP. It means that the GP may receive information that is more comprehensive and up-to-date, ensuring a better quality of service. Similarly, some of the nurses reported that they use e-messages to inform the GP that patients have been admitted to hospital, signalling that the GP should be prepared for changes in the patients’ medications.\u003c/p\u003e \u003cp\u003eGPs are also of the opinion that they send more messages. 
However, one highlighted a problematic aspect of communication becoming too easy:\u003c/p\u003e \u003cp\u003eWe send more messages. Earlier, we used to let them accumulate until Tuesdays [when they had meetings with homecare], and then some problems were solved at that time. Now we send them at once, right. The volume of the messages they [homecare] receive has increased as a result. And that might be a problem. Cause we often ‘shoot from the hip’. We see a problem, and then we send it [a message], so we won’t forget. Earlier, we would have thought, “can we try to solve it?”, “can we try to find out?”, “can we ask the patient to bring the pill dispenser?,” can we ask for a “discharge note from the hospital?” It [the increased volume of messages] becomes a burden for homecare services. (GP municipality B)\u003c/p\u003e \u003cp\u003eThe homecare nurses did not cite message overload or messages not carefully thought out as frequent problems. Thus, there may be a discrepancy between what is comprehended as ‘too much’ from a GP’s perspective and from a homecare nurse’s perspective. It is also necessary to keep in mind that we are examining individual experiences. However, some nurses did talk about receiving too much/irrelevant information via e-message, but this was explained as a consequence of (poor) functionality in the e-message system, where some message types automatically imported extensive amounts of information from the EPR and attached it to the message.\u003c/p\u003e \u003cp\u003eThe interviews show agreement across the informants that more information is sent and received, indicating that e-message users have more knowledge of patients than do nonusers. Two unintended consequences are worth noting from the example quotes presented above: (1) e-messaging may become a structural means of change, allowing nurses to become more active organisers and facilitators of GP’s work, which in turn can mean better follow-up of patients. 
(2) It may also be argued that e-messaging becomes a tool for getting things done and eliminating cognitive overload but in doing so unnecessarily transfers one’s own work to another actor.; 4.3. Empowering nurses\u003c/p\u003e \u003cp\u003eAnother issue associated with HIT that was not addressed prior to the introduction of e-messaging is the power balance between homecare nurses and GPs. Shifts in the balance of power may occur as a result of the new form of communication. Through the automatic documentation of communication provided by e-messaging, the nurses felt their requests had more weight. Two quotes illustrate this:\u003c/p\u003e \u003cp\u003eFirst, it is documented every time we have made an inquiry. And they [GPs] can’t ignore an e-message. When there is a message, they have to answer. So, that is very much an advantage. (nurse municipality A)\u003c/p\u003e \u003cp\u003eWe haven’t been spoilt when it comes to having contact with the GP. It’s like.. .we try to get hold of them, and they are supposed to call back, but it doesn’t always happen. Sometimes it feels like we are not taken seriously. But when you have sent an e-message, then it is there [in their EPR system]. And they have to open it, and they have to send an answer. (nurse municipality A)\u003c/p\u003e \u003cp\u003eWhen the nurses talked about what e-messages meant to them, they often described the positive effect of the automatic documentation on their dialogues with the GPs. Previously, messages for GPs were often mediated through their medical secretaries and could get ‘lost’, according to the nurses – meaning that the GP never called back. According to the nurses, requests documented in the EPR have more weight than a post-it note from a secretary, and they found that it resulted in a more conscientious follow-up from the GP.\u003c/p\u003e \u003cp\u003eFurthermore, nurses used phrases such as “having their back covered” and “having documented that you have done your job”. 
Such expressions illustrate how nurses use e-messages for attaining accountability and to fulfil the demands of the public, their colleagues, patients and next of kin. E-messaging might be viewed as a tool that empowers nurses if used consciously.\u003csup\u003e4\u003c/sup\u003e\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u003cem\u003eZenoChat\u003c/em\u003e:\u003c/p\u003e \u003cp\u003e- Changes in staff roles, practices, and routines are implied by the introduction of e-messaging.\u003c/p\u003e \u003cp\u003e- More exchange of information is the structural effect of the introduction of e-messaging, which leads to unforeseen changes in tasks and responsibilities.\u003c/p\u003e \u003cp\u003e- Nurses use e-messages to inform the GP more often and with more comprehensive and up-to-date information.\u003c/p\u003e \u003cp\u003e- E-messaging becomes a structural means of change, allowing nurses to become more active organizers and facilitators of GP's work, which in turn can mean better follow-up of patients.\u003c/p\u003e \u003cp\u003e- E-messaging becomes a tool for getting things done and eliminating cognitive overload but in doing so unnecessarily transfers one's own work to another actor.\u003c/p\u003e \u003cp\u003e- The power balance between homecare nurses and GPs may shift as a result of the new form of communication.\u003c/p\u003e \u003cp\u003e- Through the automatic documentation of communication provided by e-messaging, nurses feel their requests have more weight.\u003c/p\u003e \u003cp\u003e- Requests documented in the EPR (Electronic Patient Record) have more weight than a post-it note from a secretary, resulting in a more conscientious follow-up from the GP.\u003c/p\u003e \u003cp\u003e- E-messaging might be viewed as a tool that empowers nurses if used consciously.\u003c/p\u003e \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd align=\"left\" 
colname=\"c2\"\u003e \u003cp\u003e4B. What is the technology's desirability for staff? (Including beneficial effects on staff?)\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u003cem\u003eZenoChat\u003c/em\u003e:\u003c/p\u003e \u003cp\u003e- The technology's desirability for staff is not explicitly mentioned in the given paragraph.\u003c/p\u003e \u003cp\u003e- However, e-messaging may become a tool that empowers nurses and allows them to become more active organizers and facilitators of GP's work, which could lead to improved job satisfaction.\u003c/p\u003e \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e5. Organisation\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e5A. What is the organization’s readiness and/or capacity for technology-supported change?\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\" morerows=\"2\" rowspan=\"3\"\u003e \u003cp\u003eIn Australia, video consultations have been funded by Medicare (the universal health insurance system) since 2011; however, general practitioners are not rebated to deliver telehealth services, only to attend patients in person to support video consultations from medical specialists […] Overall, it was very challenging to make video consultations a routine method of delivering general medical services to long term care facilities. 
Existing communications infrastructure was usually not suitable, with bandwidth being the greatest barrier.\u003csup\u003e5\u003c/sup\u003e\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u003cem\u003eResearcher B\u003c/em\u003e:\u003c/p\u003e \u003cp\u003e- Video consultations were very challenging as a routine method of delivering general medical services to long term care facilities, as existing communications infrastructure was usually not suitable, amongst others due to bandwidth\u003c/p\u003e \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e5B. What is said about adoption and funding?\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u003cem\u003eResearcher B\u003c/em\u003e:\u003c/p\u003e \u003cp\u003e- Video consultations have been funded by Medicare (the universal health insurance system) since 2011\u003c/p\u003e \u003cp\u003e- A problem with funding is that general practitioners are not rebated to deliver telehealth services, only to attend patients in person to support video consultations from medical specialists\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e5C. What work is involved in implementation and who will do it?\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u003cem\u003eResearcher B\u003c/em\u003e:\u003c/p\u003e \u003cp\u003e- The paragraph does not provide specific information about the work involved in implementation or who will do it.\u003c/p\u003e \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e6. 
Wider system\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e6A. What is the context for program rollout (e.g., political, economic, regulatory, professional (e.g., medicolegal), and sociocultural)?\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eThe Norwegian primary health care is managed by the municipalities and includes home health care, with approximately 31,000 nurses and 4,000 GPs being contracted. These two services are organized separately, governed and financed differently, and also have incompatible EHR systems; (b)Legislation does not permit them to share EHR systems, but they are obliged to exchange patient information when required.28 The nurses in home health care and GPs have many common patients, thus indicating that ensuring a good clinical communication is essential.\u003csup\u003e6\u003c/sup\u003e\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u003cem\u003eResearcher A\u003c/em\u003e:\u003c/p\u003e \u003cp\u003e- Home health nurses and GPs work in different systems and are financed and governed differently.\u003c/p\u003e \u003cp\u003e- Home health nurses and GPs have incompatible EHR systems.\u003c/p\u003e \u003cp\u003e- Legislation does not allow sharing of EHR systems\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e7. Embedding and adaptation over time\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e7A. How much scope is there for adapting and coevolving the technology and the service over time?\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e5.1. Incorporating other diseases and social determinants\u003c/p\u003e \u003cp\u003eA comprehensive integrated care delivery system needs to respond to a broad range of diseases and risks relevant to population health. 
We have laid some groundwork for taking our present system beyond pregnancy and early childhood. In parallel, the team has initiated a non-communicable disease CHW follow-up program that supports patients detected at the hospital level, and we plan to initiate a surgical follow-up program in the coming year. As an entry point into the social determinants of health, we have also developed an approach to measure household expenditures on health, medical debt, and medical impoverishment, and are currently analyzing these data. Over time, we aim to incorporate these various streams of household-level health, socio-economics, and disease risk data into the comprehensive integrated care digital platform.\u003csup\u003e7\u003c/sup\u003e\u003c/p\u003e \u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u003cem\u003eChatFlash\u003c/em\u003e:\u003c/p\u003e \u003cp\u003e- The team has initiated a non-communicable disease CHW follow-up program\u003c/p\u003e \u003cp\u003e- The team plans to initiate a surgical follow-up program in the coming year\u003c/p\u003e \u003cp\u003e- They have developed an approach to measure household expenditures on health, medical debt, and medical impoverishment\u003c/p\u003e \u003cp\u003e- The team aims to incorporate household-level health, socio-economics, and disease risk data into the comprehensive integrated care digital platform over time.\u003c/p\u003e \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colspan=\"4\" nameend=\"c4\" namest=\"c1\"\u003e \u003cp\u003eThe codes are displayed as proposed by Greenhalgh et al. in their NASSS framework (Nonadoption, Abandonment, and Challenges to the Scale-Up, Spread, and Sustainability of Health and Care Technologies).[\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e] The questions under each code were taken from the NASSS framework and partially adapted. 
A coded text passage under each code is given as an example together with an answer provided by one of the four chatbots or a human.\u003c/p\u003e \u003cp\u003eReferences for the example text passage:\u003c/p\u003e \u003cp\u003e1. Bugnon B, Geissbuhler A, Bischoff T, et al. Improving Primary Care Medication Processes by Using Shared Electronic Medication Plans in Switzerland: Lessons Learned From a Participatory Action Research Study. \u003cem\u003eJMIR Form Res\u003c/em\u003e 2021; 5: e22319.\u003c/p\u003e \u003cp\u003e2. Jong CC de, Ros WJG, van Leeuwen M, et al. Professionals' Use of a Multidisciplinary Communication Tool for Patients With Dementia in Primary Care. \u003cem\u003eComput Inform Nurs\u003c/em\u003e 2018; 36: 193–198.\u003c/p\u003e \u003cp\u003e3. Lofters AK, Slater MB, Nicholas Angl E, et al. Facebook as a tool for communication, collaboration, and informal knowledge exchange among members of a multisite family health team. \u003cem\u003eJ Multidiscip Healthc\u003c/em\u003e 2016; 9: 29–34.\u003c/p\u003e \u003cp\u003e4. Melby L and Hellesø R. Introducing electronic messaging in Norwegian healthcare: unintended consequences for interprofessional collaboration. \u003cem\u003eInt J Med Inform\u003c/em\u003e 2014; 83: 343–353.\u003c/p\u003e \u003cp\u003e5. Wade V, Whittaker F and Hamlyn J. An evaluation of the benefits and challenges of video consulting between general practitioners and residential aged care facilities. \u003cem\u003eJ Telemed Telecare\u003c/em\u003e 2015; 21: 490–493.\u003c/p\u003e \u003cp\u003e6. Lyngstad M, Melby L, Grimsmo A, et al. Toward Increased Patient Safety? Electronic Communication of Medication Information Between Nurses in Home Health Care and General Practitioners. \u003cem\u003eHome Health Care Management \u0026amp; Practice\u003c/em\u003e 2013; 25: 203–211.\u003c/p\u003e \u003cp\u003e7. Citrin D, Thapa P, Nirola I, et al. 
Developing and deploying a community healthcare worker-driven, digitally-enabled integrated care system for municipalities in rural Nepal. \u003cem\u003eHealthc (Amst)\u003c/em\u003e 2018; 6: 197–204.\u003c/p\u003e \u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/table\u003e\u003c/div\u003e\u003ch3\u003e2.2 Selection of chatbots and answer generation\u003c/h3\u003e\u003cp\u003eWe opted to compare four chatbots: ZenoChat (TextCortex AI), ChatGPT 3.5, ChatGPT 4.0 (OpenAI) and the free version of ChatFlash (neuroflash GmbH). ChatFlash, ChatGPT 3.5 and ChatGPT 4.0 are based on GPT (Generative Pre-trained Transformer). Depending on the settings, ZenoChat uses either GPT 4.0, Sophos-2 or Mixtral as its underlying model. As even small changes to a language model can affect its responses, we included both ChatGPT 3.5 and 4, based on GPT 3.5 and 4, respectively, as well as the ZenoChat version based on Sophos-2.[\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e] To ensure a standardised process, all chatbots were presented with the same prompt “Use the following paragraph to extract in academic style in bullet points the answers – if any answers are provided – to the following questions:”, followed by one to four questions pertaining to the specific code. No limit was set for the word count, as texts were heterogeneous in complexity and information density. The answer generation for all chatbots was conducted in November 2023.\u003c/p\u003e\u003ch3\u003e2.3 Randomization and survey\u003c/h3\u003e\u003cp\u003eWe used digitally supported randomisation to select three text passages of each code, resulting in a total of 20 text passages, to serve as the basis of our survey. Each text passage had four AI-written responses and two written by human beings. 
A survey was designed to evaluate the responses, applying a consistent evaluation framework to each question: the \u003cb\u003elength of the response\u003c/b\u003e was measured on a Likert 1–3 scale: 1 – too short, 2 – appropriate length, 3 – too long. \u003cb\u003eCompleteness\u003c/b\u003e and \u003cb\u003ecorrectness\u003c/b\u003e were likewise measured on a Likert 1–3 scale: 1 – complete / correct, 2 – partially complete / partially correct, 3 – significant part(s) are missing / content is displayed incorrectly. Three further questions evaluated the correct identification of the \u003cb\u003econtext\u003c/b\u003e (correct/incorrect) and whether the answer included any \u003cb\u003eaddition of new content\u003c/b\u003e (yes/no) and/or an \u003cb\u003einterpretation\u003c/b\u003e beyond the original text (yes/no). Respondents could provide open-ended feedback on both the response and the original text passage (Appendix 1). We used word counts to measure the length of the responses and calculated the word ratio by dividing the answer’s word count by that of the original text.\u003c/p\u003e\u003cp\u003eSix independent survey participants with a background in social sciences and differing professional expertise ranging from nursing to physiotherapy (henceforth: raters) were recruited from a department in the Bavarian Center for Digital Health and Social Care. These individuals were not involved in deriving answers, but were familiar with the scoping review and the aim of the research. Each rater was randomly assigned to a text passage with the corresponding answers to the question, blinded to the identity of the author in each case. In total, 120 text passages were reviewed, with each text passage being presented to three raters and each rater assessing the same number of text passages in the same group constellation (e.g., rater A, B, C). 
To minimise recognition bias, we standardised the formatting of answers and omitted introductory phrases such as “The provided paragraph contains information to answer the questions related to the introduction of the technology and the specific disease area:”. Figure\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e summarises the research process and Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e provides an overview of the codes and examples of responses provided by the chatbots and human researchers.\u003c/p\u003e\u003ch3\u003e2.4 Data analysis\u003c/h3\u003e\u003cp\u003eQuantitative data obtained from the surveys were exported to Microsoft Excel (Microsoft Office, Professional Plus 2021) and analysed using descriptive and comparative statistical methods in Stata (Version 16). To test for statistical differences, we used the non-parametric Kruskal-Wallis test at a significance level of p \u0026lt; 0.05 1) over the entire dataset, 2) to compare responses from chatbots and humans, and 3) between each answer-generating method (e.g., between Researcher A and ChatFlash or between two chatbots). The interrater reliability was calculated using Cohen’s Kappa (κ) in cases with two response possibilities and a weighted Cohen’s Kappa for three response possibilities with weights equally distributed, i.e., 0.5 if the ratings were one step apart (e.g., Rater A = 1 and Rater B = 2) and 1 if the ratings were maximally different (e.g., Rater A = 1 and Rater B = 3). 
We followed Cohen’s interpretation of Kappa, with values ≤ 0 being interpreted as no agreement, 0.01–0.20 as none to slight, 0.21–0.40 as fair, 0.41–0.60 as moderate, 0.61–0.80 as substantial and 0.81–1.00 as almost perfect agreement.[\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e] Qualitative data were transferred to a Microsoft Excel document and analysed inductively.\u003c/p\u003e"},{"header":"Results","content":"\u003cp\u003eUnless indicated otherwise, the following results report statistically significant differences. Analysis at the question level revealed no particular pattern within certain codes or across questions within codes. Across the entire dataset, the correctness of the responses given was similar, while all other categories differed significantly (Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e). In general, the answers provided by chatbots were perceived to demonstrate a better recognition of the context (chatbot: 92.42% vs. human: 84.85%) and were also longer than those of humans, with a mean word ratio (length of the answer relative to the length of the original text) of 0.45\u0026thinsp;\u0026plusmn;\u0026thinsp;0.50 compared to 0.21\u0026thinsp;\u0026plusmn;\u0026thinsp;0.26 for humans. Furthermore, the answers by chatbots were perceived to be more complete than those by humans (chatbot: 79.73% vs. human: 52.65%). Conversely, human answers were perceived to be superior in avoiding interpretation (human: 97.35% vs. chatbot: 81.44%) and the addition of material not included in the original text (human: 97.73% vs. 
chatbot: 81.82%).\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eOverview of survey results.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"13\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c8\" colnum=\"8\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c9\" colnum=\"9\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c10\" colnum=\"10\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c11\" colnum=\"11\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c12\" colnum=\"12\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c13\" colnum=\"13\"\u003e\u003c/div\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eCategory\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eValue\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colspan=\"4\" nameend=\"c6\" namest=\"c3\"\u003e 
\u003cp\u003eTotal\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colspan=\"4\" nameend=\"c10\" namest=\"c7\"\u003e \u003cp\u003eChatbot\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c12\" namest=\"c11\"\u003e \u003cp\u003eHuman\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c13\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003ep-value overall\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eOverall\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eChatbot\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eHuman\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003ep-value Chatbot vs. Human\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003eChatFlash\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003eChatGPT 3.5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003eChatGPT 4.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c10\"\u003e \u003cp\u003eZenoChat\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c11\"\u003e \u003cp\u003eResearcher A\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c12\"\u003e \u003cp\u003eResearcher B\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eAddition\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eyes\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e102\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e96 (18.18%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e6 (2.27%)\u003c/p\u003e 
\u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003e\u0026lt;\u0026thinsp;0.001*\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e16\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e39\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e30\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c10\"\u003e \u003cp\u003e11\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c11\"\u003e \u003cp\u003e6\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c12\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c13\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003e\u0026lt;\u0026thinsp;0.001*\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eno\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e690\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e432 (81.82%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e258 (97.73%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e116\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e93\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e102\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c10\"\u003e \u003cp\u003e121\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c11\"\u003e \u003cp\u003e126\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c12\"\u003e \u003cp\u003e132\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"2\" rowspan=\"3\"\u003e \u003cp\u003eCompleteness\u003c/p\u003e \u003c/td\u003e \u003ctd 
align=\"left\" colname=\"c2\"\u003e \u003cp\u003ecomplete\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e560\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e421 (79.73%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e139 (52.65%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\" morerows=\"2\" rowspan=\"3\"\u003e \u003cp\u003e\u0026lt;\u0026thinsp;0.001*\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e108\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e108\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e104\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c10\"\u003e \u003cp\u003e101\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c11\"\u003e \u003cp\u003e65\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c12\"\u003e \u003cp\u003e74\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c13\" morerows=\"2\" rowspan=\"3\"\u003e \u003cp\u003e\u0026lt;\u0026thinsp;0.001*\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003epartial\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e131\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e67 (12.69%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e64 (24.24%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e17\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e18\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e14\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c10\"\u003e \u003cp\u003e18\u003c/p\u003e 
\u003c/td\u003e \u003ctd align=\"left\" colname=\"c11\"\u003e \u003cp\u003e34\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c12\"\u003e \u003cp\u003e30\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eincomplete\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e101\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e40 (7.58%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e61 (23.11%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e7\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e6\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e14\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c10\"\u003e \u003cp\u003e13\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c11\"\u003e \u003cp\u003e33\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c12\"\u003e \u003cp\u003e28\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eContext\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003ecorrect\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e712\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e488 (92.42%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e224 (84.85%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003e0.001*\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e125\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e 
\u003cp\u003e116\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e123\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c10\"\u003e \u003cp\u003e124\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c11\"\u003e \u003cp\u003e110\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c12\"\u003e \u003cp\u003e114\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c13\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003e0.007*\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eincorrect\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e80\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e40 (7.58%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e40 (15.15%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e7\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e16\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e9\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c10\"\u003e \u003cp\u003e8\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c11\"\u003e \u003cp\u003e22\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c12\"\u003e \u003cp\u003e18\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"2\" rowspan=\"3\"\u003e \u003cp\u003eCorrectness\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003ecorrect\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e574\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e390 (73.86%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" 
colname=\"c5\"\u003e \u003cp\u003e184 (69.70%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\" morerows=\"2\" rowspan=\"3\"\u003e \u003cp\u003e0.116\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e102\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e90\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e90\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c10\"\u003e \u003cp\u003e108\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c11\"\u003e \u003cp\u003e91\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c12\"\u003e \u003cp\u003e93\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c13\" morerows=\"2\" rowspan=\"3\"\u003e \u003cp\u003e0.051\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003epartial\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e119\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e83 (15.72%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e36 (13.64%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e20\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e25\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e26\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c10\"\u003e \u003cp\u003e12\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c11\"\u003e \u003cp\u003e16\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c12\"\u003e \u003cp\u003e20\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eincorrect\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" 
colname=\"c3\"\u003e \u003cp\u003e99\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e55 (10.42%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e44 (16.67%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e10\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e17\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e16\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c10\"\u003e \u003cp\u003e12\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c11\"\u003e \u003cp\u003e25\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c12\"\u003e \u003cp\u003e19\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eInterpretation\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eyes\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e105\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e98 (18.56%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e7 (2.65%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003e\u0026lt;\u0026thinsp;0.001*\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e18\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e30\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e43\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c10\"\u003e \u003cp\u003e7\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c11\"\u003e \u003cp\u003e5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" 
colname=\"c12\"\u003e \u003cp\u003e2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c13\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003e\u0026lt;\u0026thinsp;0.001*\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eno\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e687\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e430 (81.44%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e257 (97.35%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e114\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e102\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e89\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c10\"\u003e \u003cp\u003e125\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c11\"\u003e \u003cp\u003e127\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c12\"\u003e \u003cp\u003e130\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"2\" rowspan=\"3\"\u003e \u003cp\u003eLength\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003etoo short\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e93\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e19 (3.60%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e74 (28.03%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\" morerows=\"2\" rowspan=\"3\"\u003e \u003cp\u003e\u0026lt;\u0026thinsp;0.001*\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" 
colname=\"c8\"\u003e \u003cp\u003e3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c10\"\u003e \u003cp\u003e11\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c11\"\u003e \u003cp\u003e38\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c12\"\u003e \u003cp\u003e36\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c13\"\u003e\u0026nbsp;\u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eperfect\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e537\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e363 (68.75%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e174 (65.91%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e104\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e84\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e86\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c10\"\u003e \u003cp\u003e89\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c11\"\u003e \u003cp\u003e89\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c12\"\u003e \u003cp\u003e85\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c13\"\u003e \u003cp\u003e\u0026lt;\u0026thinsp;0.001*\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003etoo long\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e162\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e146 (27.65%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e16 (6.06%)\u003c/p\u003e 
\u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e27\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e45\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e42\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c10\"\u003e \u003cp\u003e32\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c11\"\u003e \u003cp\u003e5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c12\"\u003e \u003cp\u003e11\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c13\"\u003e\u0026nbsp;\u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003ctfoot\u003e \u003ctr\u003e\u003ctd colspan=\"13\"\u003eSignificant differences between groups are indicated by *. Results are presented as absolute values unless indicated otherwise; the percentages refer to the columns.\u003c/td\u003e\u003c/tr\u003e \u003c/tfoot\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eIndividual comparisons revealed that addition was the only statistically significant difference between Researcher A and Researcher B, with Researcher A showing a slightly higher percentage (4.55%) of addition than Researcher B (0.00%). There were no statistical differences between the evaluations of the responses of ChatGPT 3.5 and ChatGPT 4.0. However, ChatGPT 3.5\u0026rsquo;s responses were \u003cb\u003elonger\u003c/b\u003e (mean word ratio: 0.58\u0026thinsp;\u0026plusmn;\u0026thinsp;0.70) than ChatFlash\u0026rsquo;s (mean word ratio: 0.34\u0026thinsp;\u0026plusmn;\u0026thinsp;0.36) and ZenoChat\u0026rsquo;s (mean word ratio: 0.38\u0026thinsp;\u0026plusmn;\u0026thinsp;0.37). Both human researchers\u0026rsquo; responses were shorter than those of each individual chatbot, with a word ratio of 0.16\u0026thinsp;\u0026plusmn;\u0026thinsp;0.14 for Researcher A and 0.25\u0026thinsp;\u0026plusmn;\u0026thinsp;0.34 for Researcher B. 
Responses that were evaluated as being too long had a word ratio (length of the answer relative to the length of the original text) of 0.84\u0026thinsp;\u0026plusmn;\u0026thinsp;0.65 compared to 0.27\u0026thinsp;\u0026plusmn;\u0026thinsp;0.28 for responses of adequate length. Some 93 answers (11.74%) were evaluated to be too short with a word ratio of 0.10\u0026thinsp;\u0026plusmn;\u0026thinsp;0.11. The rated \u003cb\u003ecompleteness\u003c/b\u003e did not differ significantly between the chatbots, which showed comparable levels of completeness ranging from 76.52% (ZenoChat) to 81.82% (ChatFlash and ChatGPT 3.5). However, the answers of every chatbot were rated as more complete than those of the human researchers, whose responses were deemed complete in 49.24% (Researcher A) and 56.06% (Researcher B) of all cases. In terms of \u003cb\u003ecorrectness\u003c/b\u003e, ZenoChat\u0026rsquo;s answers were assessed as being more correct (81.82%) than those by ChatGPT 3.5 (68.18%) and ChatGPT 4.0 (68.18%) or those by the human researchers (A: 68.94% and B: 70.45%). ChatFlash\u0026rsquo;s answers (77.27%) were not rated significantly differently from those of the other chatbots or the humans. ChatFlash\u0026rsquo;s responses were perceived to have shown a better understanding of the \u003cb\u003econtext\u003c/b\u003e than ChatGPT 3.5 (94.70% vs. 87.88% correct). Researcher B\u0026rsquo;s responses (86.36% correct) were perceived to show a lower understanding of the context than ZenoChat (93.94% correct) and ChatFlash (94.70% correct), with Researcher A\u0026rsquo;s context in responses being additionally inferior to ChatGPT 4.0 (83.33% vs. 93.18% correct). The responses by ChatGPT 3.5 and ChatGPT 4.0 were evaluated as containing more \u003cb\u003eaddition\u003c/b\u003e than ZenoChat and ChatFlash, with 29.55% and 22.73% of answers containing an addition vs. 8.33% and 12.12%, respectively. 
Researcher A\u0026rsquo;s responses were perceived to contain less addition (4.55%) than ChatGPT 3.5 (29.55%), ChatGPT 4.0 (22.73%) and ChatFlash (12.12%); Researcher B\u0026rsquo;s responses (0.00%) also contained less addition than ZenoChat (8.33%). Furthermore, the responses by ChatGPT 4.0 were evaluated as containing more \u003cb\u003einterpretation\u003c/b\u003e than ZenoChat and ChatFlash, with 32.58% of answers containing an interpretation vs. 5.30% and 13.64%, respectively. ZenoChat\u0026rsquo;s responses also contained fewer interpretations than ChatFlash\u0026rsquo;s, and ZenoChat was the only chatbot that did not provide more interpretation in its answers than the human researchers (A: 5.30%, B: 3.79%). Across all chatbots, ZenoChat provided the best-rated answers, followed by ChatFlash. An overview of the Kruskal-Wallis tests between individual chatbots and humans is available in the supplementary material (Appendix 2).\u003c/p\u003e \u003cp\u003eRaters had a fair Cohen\u0026rsquo;s kappa inter-rater reliability of 0.30 (0.12\u0026ndash;0.51), with a standard error of 0.11 (0.07\u0026ndash;0.12). Cohen\u0026rsquo;s kappa for context was lower (κ\u0026thinsp;=\u0026thinsp;0.18\u0026thinsp;\u0026plusmn;\u0026thinsp;0.09) than for the other variables, whose kappa values ranged from 0.27\u0026thinsp;\u0026plusmn;\u0026thinsp;0.11 for correctness to 0.39\u0026thinsp;\u0026plusmn;\u0026thinsp;0.10 for length.\u003c/p\u003e \u003cp\u003eComparative analysis between the variables revealed a moderate positive correlation between correctness and completeness (ρ\u0026thinsp;=\u0026thinsp;0.63), correctness and context (ρ\u0026thinsp;=\u0026thinsp;0.56), as well as length and word ratio (ρ\u0026thinsp;=\u0026thinsp;0.56) (Table\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e). The correlation between the variables in the human responses was higher than that for chatbots with regard to correctness and completeness (ρ\u0026thinsp;=\u0026thinsp;0.71 vs. 
ρ\u0026thinsp;=\u0026thinsp;0.60) and correctness and context (ρ\u0026thinsp;=\u0026thinsp;0.72 vs. ρ\u0026thinsp;=\u0026thinsp;0.44). There was a low positive correlation between context and completeness (ρ\u0026thinsp;=\u0026thinsp;0.46), as well as between interpretation and addition (ρ\u0026thinsp;=\u0026thinsp;0.35). A low negative correlation was found between completeness and length (ρ = -0.33), i.e., the shorter the answer, the more incomplete it was, as well as between correctness and addition (ρ = -0.35). In the human responses, there was a low negative correlation between length and correctness (ρ = -0.34), indicating a loss of correctness with shorter answers, and a moderate negative correlation between length and completeness (ρ = -0.66), indicating a loss of completeness with shorter answers. The answers by chatbots demonstrated a low negative correlation between addition and correctness, indicating a loss of correctness through addition of new content.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab3\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eCorrelation of questionnaire categories\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"8\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c7\" 
colnum=\"7\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c8\" colnum=\"8\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eAddition\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eCompleteness\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eContext\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eCorrectness\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003eInterpretation\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c7\"\u003e \u003cp\u003eLength\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c8\"\u003e \u003cp\u003eRatio\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eAddition\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e1.00\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e\u0026nbsp;\u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eCompleteness\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e-0.04\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e1.00\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" 
colname=\"c4\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e\u0026nbsp;\u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eContext\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e-0.12*\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.46*\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e1.00\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e\u0026nbsp;\u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eCorrectness\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e-0.35*\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.63*\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.56*\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e1.00\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e\u0026nbsp;\u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eInterpretation\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" 
colname=\"c2\"\u003e \u003cp\u003e0.35*\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.05\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.07\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e-0.20*\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e1.00\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e\u0026nbsp;\u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eLength\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e-0.25*\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e-0.33*\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e-0.12*\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e-0.02\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e-0.22*\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e1.00\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e\u0026nbsp;\u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eRatio\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e-0.27*\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e-0.22*\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e-0.11*\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e-0.02\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e-0.17*\u003c/p\u003e 
\u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e0.56*\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e1.00\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003ctfoot\u003e \u003ctr\u003e\u003ctd colspan=\"8\"\u003eSpearman\u0026rsquo;s rho (ρ): correlation of questionnaire categories. \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\text{Ratio}=\\frac{\\text{Word count}_{\\text{answer}}}{\\text{Word count}_{\\text{original text}}}\\)\u003c/span\u003e\u003c/span\u003e. ρ denotes no/little correlation (|ρ| \u0026lt; 0.3), low correlation (0.3 \u0026le; |ρ| \u0026le; 0.5), moderate correlation (0.5 \u0026lt; |ρ| \u0026le; 0.7) and strong correlation (|ρ| \u0026gt; 0.7).[\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e] Entries marked with * have p-values \u0026lt;\u0026thinsp;0.05.\u003c/td\u003e\u003c/tr\u003e \u003c/tfoot\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eComments in the free text fields explained the reasons for certain ratings. For example, some raters commented on the length of the answer: an answer was considered too long if it contained repeated, irrelevant or overly detailed information. In cases where an answer was rated as incomplete, free text fields were sometimes used to mention the expected content. Some comments criticised answers that were generated even though the original text did not mention details related to the question. Most importantly, however, raters noted which bullet point(s) in the response they perceived as interpretation. 
Interestingly, many of these bullet points contained hedging words such as \u0026lsquo;potentially\u0026rsquo;, \u0026lsquo;suggesting\u0026rsquo;, \u0026lsquo;pointing to\u0026rsquo;, \u0026lsquo;could\u0026rsquo;, \u0026lsquo;appears to be\u0026rsquo;, \u0026lsquo;indicating\u0026rsquo;, \u0026lsquo;may reflect\u0026rsquo; and \u0026lsquo;may lead\u0026rsquo;, which signal interpretation. A few comments also pointed out that the responding agent needed to do a certain amount of interpretation to answer the question correctly. This ranged from interpreting certain abbreviations, such as \u0026lsquo;MOH\u0026rsquo; as standing for \u0026lsquo;Ministry of Health\u0026rsquo;, through to a broad understanding of the matter at hand, as illustrated by the quote \u0026lsquo;The summary is not explicitly stated in the text. However, it is correct, but is only achieved through interpretation\u0026rsquo; (Rater A). One rater explicitly mentioned the difficulty in \u0026lsquo;answer[ing] the question, because it almost requires an interpretation to grasp the right aspects\u0026rsquo; (Rater A), indicating that \u0026lsquo;the system therefore already needs to perform semantic or interpretative tasks in order to answer \u0026hellip; [the] question\u0026rsquo; (Rater B).\u003c/p\u003e"},{"header":"Discussion","content":"\u003cp\u003eThe chatbots\u0026rsquo; responses were regarded as better at recognising context and as providing more complete, albeit longer, summaries, while humans were viewed as less prone to add to or interpret the material. Among the chatbots, ZenoChat provided the best-rated answers, followed by ChatFlash in second place, with ChatGPT 3.5 and ChatGPT 4.0 equally in third place. Statistical analysis indicated a positive correlation between correct contextualisation and completeness, and between correct contextualisation and correctness of the response. 
Qualitative feedback highlighted that longer answers often contained redundant information and raised the question of what role interpretation plays in answering a question effectively.\u003c/p\u003e \u003cp\u003eIndeed, correct contextualisation and the absence of addition and interpretation in the answer were important underlying factors in our research set-up. While Hamilton et al. found ChatGPT to have limited contextual understanding, our study found that the chatbots displayed a clearer understanding of the context than the answers of the human researchers did.[\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e] The distinction between recognising context \u0026ndash; for example, that \u0026lsquo;EHR\u0026rsquo; stands for \u0026lsquo;electronic health record\u0026rsquo; \u0026ndash; and interpreting content to answer a question is subtle and complex. This is closely linked to the debate over what constitutes the addition of new content versus the interpretation of the original text\u0026rsquo;s content.\u003c/p\u003e \u003cp\u003eTo enhance the accuracy and relevance of chatbots\u0026rsquo; responses, precise and specific prompt instructions are essential.[\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e, \u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e, \u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e] In our study, chatbots tended to provide more extensive answers than humans, likely because the human researchers brought an implicit understanding of the exercise to the task. While humans aimed for brevity to expedite subsequent analysis, chatbots did not have this implicit knowledge, as the prompt did not contain any guidance on the response length. Since chatbots are designed to assume user expectations from the given prompt, they tend to comply with instructions instead of seeking clarification and responding according to their own skills and limitations. 
[\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e, \u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e]\u003c/p\u003e \u003cp\u003eRecent research has demonstrated the efficacy of chatbots such as ChatGPT in generating discharge summaries and summaries of patients\u0026rsquo; medical histories, and in summarising the scientific literature, achieving levels of quality and accuracy comparable to or exceeding traditional methods.[\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e, \u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e, \u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e] This highlights their ability to distil main ideas from a given text, mirroring the findings from our study. However, these studies did not compare the capacities of different chatbots. Some studies compared chatbots\u0026rsquo; ability to answer complex medical examination questions. In these, ChatGPT 4.0 responded more accurately and concisely than Bard (Google), albeit scoring worse than the medical reference group.[\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e, \u003cspan additionalcitationids=\"CR26\" citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e] This suggests that chatbots\u0026rsquo; ability to answer questions, as observed in our study, is contingent on the availability of an accurate reference text. Given this limitation, chatbots can be implemented in scenarios requiring idea extraction from a given resource, such as content analysis for qualitative research.\u003c/p\u003e \u003cp\u003eThis study provides an example of how AI-powered chatbots can enhance and accelerate the research process and successfully support humans in conducting research endeavours such as reviews. 
However, the authors \u0026ndash; in concordance with other studies \u0026ndash; highlight that human oversight and correction are essential.[\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e, \u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e, \u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e] While recognising chatbots\u0026rsquo; potential, it is imperative to be aware of their shortcomings, some of the most important being the potential inclusion of biases, the non-disclosure of training data, incorrect information and the possibility of nonsensical responses.[\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e, \u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e, \u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e, \u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e, \u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e, \u003cspan additionalcitationids=\"CR29 CR30\" citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e]\u003c/p\u003e \u003cp\u003eA major strength of this study is the comparison between different chatbots combined with an evaluation of their performance against human researchers, as most other studies only compare chatbots against one another. Despite this strength, it is essential to acknowledge the study\u0026rsquo;s constraints. Firstly, we selected four chatbots from a rapidly growing array of chatbots using different underlying large language models. Furthermore, caution is advised in generalising the findings of this study, as the text parts used for eliciting the answers were preselected by two human researchers for their topical relevance. 
Eliciting responses from the full text might yield different outcomes.\u003c/p\u003e \u003cp\u003eFuture research should aim at refining the prompt to better match the implicit human understanding of the context and the specific objective anticipated from the chatbot\u0026rsquo;s use.[\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e, \u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e] This entails researchers assessing their underlying assumptions and intentions, clearly defining the criteria for evaluating responses, and employing prompt engineering methods to refine the prompt.[\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e] As the underlying large language models of chatbots are quickly evolving, longitudinal studies are crucial to offer insights into how chatbots\u0026rsquo; capabilities and performance are changing. In line with Hamilton et al.\u0026rsquo;s recommendations, we also recommend assessing the chatbots\u0026rsquo; capabilities to answer the questions when receiving the full text, instead of the curated parts that humans had judged important.[\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e] This might not only assess the chatbots\u0026rsquo; abilities, but also further assist in identifying additional information that might have escaped human judgment.[\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e]\u003c/p\u003e"},{"header":"Conclusion","content":"\u003cp\u003eOur study demonstrates chatbots\u0026rsquo; ability to provide complete and correct answers to questions on a given text, which may make them a useful tool to accelerate research processes, specifically qualitative evidence synthesis. 
Given the speed of chatbot development and fine-tuning, the successful applications of chatbots to facilitate research will very likely continue to expand.\u003c/p\u003e"},{"header":"Abbreviations","content":"\u003cp\u003eAI \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp;\u0026nbsp;Artificial intelligence\u003c/p\u003e\n\u003cp\u003eEHR \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp;Electronic health record\u003c/p\u003e\n\u003cp\u003eGPT \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp;\u0026nbsp;Generative Pre-trained Transformer\u003c/p\u003e\n\u003cp\u003eNASSS\u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp;Nonadoption, Abandonment, and Challenges to the Scale-Up, Spread, and Sustainability of Health and Care Technologies\u003c/p\u003e\n"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eEthics approval:\u0026nbsp;\u003c/strong\u003eAs the study did not involve sensitive data, no ethical clearance was necessary. Study participants were researchers acting as experts. They either provided answers to the questions or judged the answers. 
The respondents of the survey were able to drop out at any time without negative consequences.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConsent for publication:\u0026nbsp;\u003c/strong\u003eNot applicable.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAvailability of data and materials:\u0026nbsp;\u003c/strong\u003eThe datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFunding:\u003c/strong\u003e This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCompeting interests:\u0026nbsp;\u003c/strong\u003eThe authors declare that they have no competing interests.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthors\u0026rsquo; contributions:\u0026nbsp;\u003c/strong\u003eKN: Conceptualization, Methodology, Formal analysis, Investigation, Resources, Data Curation, Visualization, Project administration, Writing - Original Draft. SS: Methodology, Formal analysis, Resources, Visualization, Writing - Review \u0026amp; Editing. MSt: Resources, Writing - Review \u0026amp; Editing. JA: Resources, Writing - Review \u0026amp; Editing. MCR: Resources, Writing - Review \u0026amp; Editing. MSc: Methodology, Resources, Writing - Review \u0026amp; Editing. FF: Conceptualization, Methodology, Resources, Writing - Review \u0026amp; Editing, Supervision.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAcknowledgements:\u0026nbsp;\u003c/strong\u003eThe authors would like to extend their gratitude to Setareh Rabbani and Julia Schulze Pr\u0026ouml;bsting for their participation in the study.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eDeclaration of generative AI in scientific writing:\u0026nbsp;\u003c/strong\u003eDuring the preparation of this work, the authors used GPT 4.0 in order to improve readability and language. 
After using this tool, the author(s) reviewed and edited the content as needed and take full responsibility for the content of the publication.\u003c/p\u003e\n"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003ePatil NS, Huang RS, van der Pol CB, Larocque N. Comparative Performance of ChatGPT and Bard in a Text-Based Radiology Knowledge Assessment. Can Assoc Radiol J. 2024;75:344\u0026ndash;50. doi:10.1177/08465371231193716.\u003c/li\u003e\n\u003cli\u003eAlshami A, Elsayed M, Ali E, Eltoukhy AEE, Zayed T. Harnessing the Power of ChatGPT for Automating Systematic Review Process: Methodology, Case Study, Limitations, and Future Directions. Systems. 2023;11:351. doi:10.3390/systems11070351.\u003c/li\u003e\n\u003cli\u003eKhlaif ZN, Mousa A, Hattab MK, Itmazi J, Hassan AA, Sanmugam M, Ayyoub A. The Potential and Concerns of Using AI in Scientific Research: ChatGPT Performance Evaluation. JMIR Med Educ. 2023;9:e47049. doi:10.2196/47049.\u003c/li\u003e\n\u003cli\u003eMeyer JG, Urbanowicz RJ, Martin PCN, O\u0026apos;Connor K, Li R, Peng P-C, et al. ChatGPT and large language models in academia: opportunities and challenges. BioData Min. 2023;16:20. doi:10.1186/s13040-023-00339-9.\u003c/li\u003e\n\u003cli\u003eChappell M, Edwards M, Watkins D, Marshall C, Graziadio S. Machine learning for accelerating screening in evidence reviews. Cochrane Evidence Synthesis and Methods 2023;1:e12021. doi:10.1002/cesm.12021.\u003c/li\u003e\n\u003cli\u003evan de Schoot R, Bruin J de, Schram R, Zahedi P, Boer J de, Weijdema F, et al. An open source machine learning framework for efficient and transparent systematic reviews. Nat Mach Intell. 2021;3:125\u0026ndash;33. doi:10.1038/s42256-020-00287-7.\u003c/li\u003e\n\u003cli\u003eZhang H, Wu C, Xie J, Lyu Y, Cai J, Carroll JM. Redefining Qualitative Analysis in the AI Era: Utilizing ChatGPT for Efficient Thematic Analysis. 2023; arXiv:2309.10771.\u003c/li\u003e\n\u003cli\u003eHuang Y-M, Rocha T, editors. 
Innovative Technologies and Learning: 6th International Conference, ICITL 2023 Porto, Portugal, August 28\u0026ndash;30, 2023 Proceedings. Cham: Springer; 2023.\u003c/li\u003e\n\u003cli\u003eChristou P. A Critical Perspective Over Whether and How to Acknowledge the Use of Artificial Intelligence (AI) in Qualitative Studies. TQR. 2023;28:1981\u0026ndash;91. doi:10.46743/2160-3715/2023.6407.\u003c/li\u003e\n\u003cli\u003eLa Torre-L\u0026oacute;pez J de, Ram\u0026iacute;rez A, Romero JR. Artificial intelligence to automate the systematic review of scientific literature. Computing. 2023;105:2171\u0026ndash;94. doi:10.1007/s00607-023-01181-x.\u003c/li\u003e\n\u003cli\u003eSampson M, Shojania KG, Garritty C, Horsley T, Ocampo M, Moher D. Systematic reviews can be produced and published faster. J Clin Epidemiol. 2008;61:531\u0026ndash;6. doi:10.1016/j.jclinepi.2008.02.004.\u003c/li\u003e\n\u003cli\u003eMahuli SA, Rai A, Mahuli AV, Kumar A. Application ChatGPT in conducting systematic reviews and meta-analyses. Br Dent J. 2023;235:90\u0026ndash;2. doi:10.1038/s41415-023-6132-y.\u003c/li\u003e\n\u003cli\u003eMorgan DL. Exploring the Use of Artificial Intelligence for Qualitative Data Analysis: The Case of ChatGPT. International Journal of Qualitative Methods. 2023;22:1\u0026ndash;10. doi:10.1177/16094069231211248.\u003c/li\u003e\n\u003cli\u003eHamilton L, Elliott D, Quick A, Smith S, Choplin V. Exploring the Use of AI in Qualitative Analysis: A Comparative Study of Guaranteed Income Data. International Journal of Qualitative Methods. 2023;22:1\u0026ndash;13. doi:10.1177/16094069231201504.\u003c/li\u003e\n\u003cli\u003eMichalowski M, Abidi SSR, Abidi S, editors. Artificial Intelligence in Medicine: 20th International Conference on Artificial Intelligence in Medicine, AIME 2022 Halifax, NS, Canada, June 14\u0026ndash;17, 2022 Proceedings. Cham: Springer; 2022.\u003c/li\u003e\n\u003cli\u003eChen L, Zaharia M, Zou J. 
How is ChatGPT\u0026apos;s behavior changing over time? 2023; arXiv:2307.09009.\u003c/li\u003e\n\u003cli\u003eNordmann K, Sauter S, M\u0026ouml;bius-Lerch P, Redlich M-C, Schaller M, Fischer F. Conceptualizing Interprofessional Digital Communication and Collaboration in Health Care: Protocol for a Scoping Review. JMIR Res Protoc. 2023;12:e45179. doi:10.2196/45179.\u003c/li\u003e\n\u003cli\u003eGreenhalgh T, Wherton J, Papoutsi C, Lynch J, Hughes G, A\u0026apos;Court C, et al. Beyond Adoption: A New Framework for Theorizing and Evaluating Nonadoption, Abandonment, and Challenges to the Scale-Up, Spread, and Sustainability of Health and Care Technologies. J Med Internet Res. 2017;19:e367. doi:10.2196/jmir.8775.\u003c/li\u003e\n\u003cli\u003eMcHugh ML. Interrater reliability: the kappa statistic. Biochem Med. 2012;22:276\u0026ndash;82.\u003c/li\u003e\n\u003cli\u003eRovetta A. Raiders of the Lost Correlation: A Guide on Using Pearson and Spearman Coefficients to Detect Hidden Correlations in Medical Sciences. Cureus. 2020;12:e11794. doi:10.7759/cureus.11794.\u003c/li\u003e\n\u003cli\u003eQureshi R, Shaughnessy D, Gill KAR, Robinson KA, Li T, Agai E. Are ChatGPT and large language models \u0026quot;the answer\u0026quot; to bringing us closer to systematic review automation? Syst Rev. 2023;12:72. doi:10.1186/s13643-023-02243-z.\u003c/li\u003e\n\u003cli\u003eShen Y, Heacock L, Elias J, Hentel KD, Reig B, Shih G, Moy L. ChatGPT and Other Large Language Models Are Double-edged Swords. Radiology. 2023;307:e230163. doi:10.1148/radiol.230163.\u003c/li\u003e\n\u003cli\u003eClough RAJ, Sparkes WA, Clough OT, Sykes JT, Steventon AT, King K. Transforming healthcare documentation: harnessing the potential of AI to generate discharge summaries. BJGP Open. 2024;8:BJGPO.2023.0116. doi:10.3399/BJGPO.2023.0116.\u003c/li\u003e\n\u003cli\u003eScott D, Hallett C, Fettiplace R. Data-to-text summarisation of patient records: using computer-generated summaries to access patient histories. 
Patient Educ Couns. 2013;92:153\u0026ndash;9. doi:10.1016/j.pec.2013.04.019.\u003c/li\u003e\n\u003cli\u003eLim ZW, Pushpanathan K, Yew SME, Lai Y, Sun C-H, Lam JSH, et al. Benchmarking large language models\u0026apos; performances for myopia care: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Google Bard. EBioMedicine. 2023;95:104770. doi:10.1016/j.ebiom.2023.104770.\u003c/li\u003e\n\u003cli\u003eBuhr CR, Smith H, Huppertz T, Bahr-Hamm K, Matthias C, Blaikie A, et al. ChatGPT Versus Consultants: Blinded Evaluation on Answering Otorhinolaryngology Case-Based Questions. JMIR Med Educ. 2023;9:e49183. doi:10.2196/49183.\u003c/li\u003e\n\u003cli\u003eMassey PA, Montgomery C, Zhang AS. Comparison of ChatGPT-3.5, ChatGPT-4, and Orthopaedic Resident Performance on Orthopaedic Assessment Examinations. J Am Acad Orthop Surg. 2023;31:1173\u0026ndash;9. doi:10.5435/JAAOS-D-23-00396.\u003c/li\u003e\n\u003cli\u003eChristou P. How to Use Artificial Intelligence (AI) as a Resource, Methodological and Analysis Tool in Qualitative Research? TQR. 2023;28:1968\u0026ndash;80. doi:10.46743/2160-3715/2023.6406.\u003c/li\u003e\n\u003cli\u003ePatel SB, Lam K. ChatGPT: the future of discharge summaries? Lancet Digit Health. 2023;5:e107-e108. doi:10.1016/S2589-7500(23)00021-3.\u003c/li\u003e\n\u003cli\u003eNaumova EN. A mistake-find exercise: a teacher\u0026apos;s tool to engage with information innovations, ChatGPT, and their analogs. J Public Health Policy. 2023;44:173\u0026ndash;8. doi:10.1057/s41271-023-00400-1.\u003c/li\u003e\n\u003cli\u003eStephens LD, Jacobs JW, Adkins BD, Booth GS. Battle of the (Chat)Bots: Comparing Large Language Models to Practice Guidelines for Transfusion-Associated Graft-Versus-Host Disease Prevention. Transfus Med Rev. 2023;37:150753. 
doi:10.1016/j.tmrv.2023.150753.\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":true,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"bmc-medical-research-methodology","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"bmrm","sideBox":"Learn more about [BMC Medical Research Methodology](http://bmcmedresmethodol.biomedcentral.com/)","snPcode":"","submissionUrl":"https://www.editorialmanager.com/bmrm/default.aspx","title":"BMC Medical Research Methodology","twitterHandle":"BMC_series","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"em","reportingPortfolio":"BMC Series","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"artificial intelligence, chatbot, large language model, ChatGPT, ChatFlash, ZenoChat","lastPublishedDoi":"10.21203/rs.3.rs-5710671/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-5710671/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003ch2\u003eBackground\u003c/h2\u003e \u003cp\u003eWith the rise of large language models, the application of artificial intelligence in research is expanding, possibly accelerating specific stages of the research processes. This study aims to compare the accuracy, completeness and relevance of chatbot-generated responses against human responses in evidence synthesis as part of a scoping review.\u003c/p\u003e\u003ch2\u003eMethods\u003c/h2\u003e \u003cp\u003eWe employed a structured, randomised and blinded survey-based research methodology to analyse and compare responses between two human researchers and four chatbots (ZenoChat, ChatGPT 3.5, ChatGPT 4.0, and ChatFlash) to questions based on a pre-coded sample of 407 articles. These questions were part of an evidence synthesis of a scoping review dealing with digitally supported interaction between healthcare workers.\u003c/p\u003e\u003ch2\u003eResults\u003c/h2\u003e \u003cp\u003eThe analysis revealed no significant differences in judgments of correctness between answers by chatbots and those given by humans. 
However, chatbots\u0026rsquo; answers were found to recognise the context of the original text better, and they provided more complete, albeit longer, responses. Human responses were less likely to add new content to the original text or include interpretation. Amongst the chatbots, ZenoChat provided the best-rated answers, followed by ChatFlash, with ChatGPT 3.5 and ChatGPT 4.0 tying for third. Correct contextualisation of the answer was positively correlated with completeness and correctness of the answer.\u003c/p\u003e\u003ch2\u003eConclusions\u003c/h2\u003e \u003cp\u003eChatbots powered by large language models may be a useful tool to accelerate qualitative evidence synthesis. Given the current speed of chatbot development and fine-tuning, the successful applications of chatbots to facilitate research will very likely continue to expand over the coming years.\u003c/p\u003e","manuscriptTitle":"Evaluating the performance of artificial intelligence in supporting evidence synthesis: a blinded comparison between chatbots and humans","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-01-02 08:56:21","doi":"10.21203/rs.3.rs-5710671/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Revision 
requested","date":"2025-01-22T07:03:15+00:00","index":"","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-01-22T04:06:53+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-01-22T01:41:11+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-01-15T06:50:56+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"179757642949712722950567670575690896930","date":"2025-01-03T12:50:30+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"81511464116901154095822228026381749789","date":"2025-01-03T06:26:24+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"118385745729995921605388470492806890099","date":"2025-01-03T06:13:23+00:00","index":"hide","fulltext":""},{"type":"reviewersInvited","content":"","date":"2025-01-03T01:16:45+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2025-01-02T07:10:08+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2024-12-31T13:28:11+00:00","index":"","fulltext":""},{"type":"submitted","content":"BMC Medical Research Methodology","date":"2024-12-25T10:25:44+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"bmc-medical-research-methodology","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"bmrm","sideBox":"Learn more about [BMC Medical Research Methodology](http://bmcmedresmethodol.biomedcentral.com/)","snPcode":"","submissionUrl":"https://www.editorialmanager.com/bmrm/default.aspx","title":"BMC Medical Research Methodology","twitterHandle":"BMC_series","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"em","reportingPortfolio":"BMC Series","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"d92938ff-9395-4897-9efd-9b6d8ccca94d","owner":[],"postedDate":"January 2nd, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"published-in-journal","subjectAreas":[],"tags":[],"updatedAt":"2025-06-02T15:59:43+00:00","versionOfRecord":{"articleIdentity":"rs-5710671","link":"https://doi.org/10.1186/s12874-025-02532-2","journal":{"identity":"bmc-medical-research-methodology","isVorOnly":false,"title":"BMC Medical Research Methodology"},"publishedOn":"2025-05-30 15:56:57","publishedOnDateReadable":"May 30th, 2025"},"versionCreatedAt":"2025-01-02 08:56:21","video":"","vorDoi":"10.1186/s12874-025-02532-2","vorDoiUrl":"https://doi.org/10.1186/s12874-025-02532-2","workflowStages":[]},"version":"v1","identity":"rs-5710671","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-5710671","identity":"rs-5710671","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
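The correlation table's footnote defines Ratio as the answer's word count divided by the original text's word count, and sorts |ρ| into four strength bands. The sketch below illustrates that arithmetic only; it is not the authors' analysis code. It assumes whitespace-delimited word counting and reads the bands as |ρ| < 0.3, 0.3 to 0.5, 0.5 to 0.7 and > 0.7. All function names are hypothetical.

```python
from statistics import mean


def word_count_ratio(answer: str, original_text: str) -> float:
    """Ratio = word count of the answer / word count of the original text.

    Assumption: words are whitespace-delimited tokens.
    """
    original_words = len(original_text.split())
    if original_words == 0:
        raise ValueError("original text contains no words")
    return len(answer.split()) / original_words


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = tied_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the rank-transformed data.

    Undefined (division by zero) if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def correlation_band(rho: float) -> str:
    """Map rho to a strength label (assumed reading of the footnote's bands)."""
    size = abs(rho)
    if size < 0.3:
        return "no/little"
    if size <= 0.5:
        return "low"
    if size <= 0.7:
        return "moderate"
    return "strong"
```

Under this reading, the reported ρ of 0.63 between completeness and correctness falls in the moderate band, and the ρ of -0.35 between addition and correctness in the low band.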