Enhancing Patient-Physician Communication: Simulating African American Vernacular English in Medical Diagnostics with Large Language Models | Research Square window.SnipcartSettings = { analytics: { enabled: false } }; (function() { var accessVector = localStorage.getItem('access_vector') || ''; window.dataLayer = window.dataLayer || []; if (accessVector) { window.dataLayer.push({ user: { profile: { profileInfo: { snid: accessVector } } } }); } })(); (function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src='https://www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);})(window,document,'script','dataLayer','GTM-K279D39R'); Browse Preprints In Review Journals COVID-19 Preprints AJE Video Bytes Research Tools Research Promotion AJE Professional Editing AJE Rubriq About Preprint Platform In Review Editorial Policies Our Team Advisory Board Help Center Sign In Submit a Preprint Cite Share Download PDF Research Article Enhancing Patient-Physician Communication: Simulating African American Vernacular English in Medical Diagnostics with Large Language Models Yeawon Lee, Chia-Hsuan Chang, Christopher C. Yang This is a preprint; it has not been peer reviewed by a journal. https://doi.org/ 10.21203/rs.3.rs-5279660/v1 This work is licensed under a CC BY 4.0 License Status: Published Journal Publication published 11 Mar, 2025 Read the published version in Journal of Healthcare Informatics Research → Version 1 posted 9 You are reading this latest preprint version Abstract Effective communication plays a pivotal role in mitigating health disparities. However, linguistic differences, such as African American Vernacular English (AAVE), can lead to communication gaps between patients and physicians, consequently impacting healthcare effectiveness and patient outcomes. This research delves into the potential of GPT-4, a large language model, to replicate AAVE in medical dialogues, with the aim of exploring its potential to address these communication barriers. We devised four prompt types: medical case-only prompts (BaseP), prompts containing demographic details (DemoP), prompts with AAVE-specific linguistic features (LingP), and prompts integrating DemoP and LingP (ComP). Through statistical analyses, including ANOVA and t-tests, applied to case simulations from the United States Medical Licensing Examination (USMLE), we evaluated GPT-4's capacity to mirror AAVE linguistic attributes. The findings indicate that GPT-4 effectively emulated AAVE traits, with ComP producing the most AAVE linguistic features. Notably, DemoP elicited more phonological features than LingP, implying an intrinsic correlation between the African American demographic and specific linguistic markers in GPT-4. However, the model encountered challenges with certain AAVE constructs, such as question inversion and unique vocabulary. This study underscores GPT-4's potential to enhance culturally sensitive healthcare communication while emphasizing the necessity for further research to refine its precision in simulating diverse linguistic styles for practical medical training applications. large language model health disparities patient simulation patient-physician communication gap communication training Figures Figure 1 Figure 2 1 Introduction 1.1 Background In the United States, health disparities disproportionately affect people of color and minority races, as evidenced by higher mortality rates, earlier onset of diseases, and more severe symptoms compared to the White population [25]. African Americans, for instance, face a 30% higher risk of death compared to Whites, a disparity that persists even after adjusting for age factors [7]. Conditions like hypertension, HIV, heart disease, stroke, and cancer are notably more prevalent in these minority groups. Asthma, the most common chronic childhood disease in the U.S., is 1.6 times more prevalent in Black children and 2.4 times more in Puerto Rican children compared to white children [4]. Such inequalities in health are not unique to the U.S. and can be attributed to a multitude of factors, including lifestyle differences, unequal access to medical infrastructure, structural discrimination, and stress and trauma from these conditions [2]. These factors often interact, exacerbating health issues in these communities. One of the key contributors to these disparities, and the focal point of our study, is the communication barrier between healthcare providers and patients. Patients who feel heard and respected are more likely to express their concerns clearly, aiding in effective diagnosis and treatment adherence. However, in many contexts, and particularly among culturally and ethnically diverse groups, communication is often hindered by language barriers, sociocultural differences, and ingrained stereotypes. These obstacles can lead to less effective healthcare interventions and poorer patient outcomes [25]. The Institute of Medicine [18] reported that disparities in treatment exist across various minority groups. For instance, Mexican Americans with myocardial infarction are 40% less likely to receive thrombolytic therapy than Whites. African Americans and Latino Americans are less likely to receive appropriate emergency care and pain management, respectively. Ortega et al. [19] found that Latino and African American children with asthma receive fewer standard management medications compared to White children. In an increasingly global healthcare landscape, the diverse mix of cultural and linguistic backgrounds among patients presents a significant challenge in communication, potentially widening health disparities. Effective communication in healthcare goes beyond words; it encompasses an understanding of the different contexts in which patients experience pain and the ways in which they express it. This is where technology like Large Language Models (LLMs), with their capacity to understand and generate natural language, presents a novel and promising avenue to mitigate these barriers, potentially transforming healthcare interactions and outcomes. However, there is a caveat: if these LLMs predominantly generate responses in standard, formal English, they may not fully capture the wide range of linguistic nuances found in real-world healthcare environments. The predominance of standardized language overlooks the rich variety of linguistic styles present among patients. To begin addressing this gap, our study explores the potential of LLMs, particularly GPT-4, to replicate diverse linguistic styles. Specifically, we investigate whether GPT-4 can emulate the linguistic characteristics of African American Vernacular English (AAVE). By evaluating their effectiveness in simulating patients with distinct linguistic traits, this research aims to bridge linguistic gaps in healthcare communication. We aspire to contribute to a more inclusive and culturally sensitive healthcare environment. This work marks the beginning of a necessary journey in healthcare technology, one with the potential to significantly improve communication and patient outcomes. 1.2 Prior Work Communication barriers in healthcare are not only prevalent when providers and patients speak different primary languages [28][1]; they also arise within the same language. Even among speakers who are native to the same language, subtle variations in intonation, speech patterns, and interaction styles [10] can lead to significant misinterpretations. For instance, Park’s ethnographic study [20], conducted in the context of free clinics serving immigrant workers, highlights communication challenges between Korean Chinese (“Chaoxianzu” in Chinese, “Joseonjok” in Korean) migrant workers and native Korean medical staff in South Korea, despite both groups speaking Korean. Native Korean doctors often struggle to understand Korean Chinese patients. Notably, the Korean Chinese language, based on North Korea's Munhwaŏ standard, significantly differs from South Korean in vocabulary, phonetics, phonology, stress, and intonation [32]. Additionally, cultural differences manifest in language, particularly in how symptoms are expressed. Korean Chinese migrants not only use atypical vocabulary, unrecognized by native Korean medical professionals, but also often describe their symptoms in relation to specific organs. This unconventional expression of symptoms can lead to unnecessary tests and a confusing diagnostic process for medical personnel, while Korean Chinese patients may feel their complaints are misunderstood. The risk of misinterpretation may be even more pronounced in the U.S., where a multitude of immigrant groups and diverse demographics coexist. Racial dynamics can further complicate communication in healthcare settings [6, 22, 27]. Ray [23] highlights the significance of understanding historical contexts in interracial communication. In the U.S., racial minorities often aim to project competence and earn respect, while White individuals may prioritize being liked and seen as moral[3]. These differing conversational approaches can negatively impact interactions, leading to misunderstandings and adverse perceptions [5]. Linguistic variations specific to racial, demographic, or ethnic communities add another layer of complexity to communication in healthcare settings. African American Vernacular English (AAVE) is a prime example of these linguistic variations. Characterized by unique phonological, grammatical, and stylistic features [22], AAVE is not only a means of communication but also a powerful symbol of cultural identity for many African Americans. However, this diversity poses challenges in medical and technological contexts, as illustrated by a study [11] examining older Black adults' interactions with voice assistants like Google Home, particularly for health information-seeking purposes. Participants, aged 50–89 from lower-income neighborhoods in Chicago and Detroit, struggled to communicate effectively with these devices. Programmed primarily in Standard English, the voice assistants failed to grasp the nuances of their dialect and accent. This language barrier necessitated a cumbersome process of cultural code-switching, rendering the interactions with the technology both cognitively demanding and time-consuming. Such scenarios underscore a critical oversight in technology design – the failure to accommodate the rich linguistic diversity within communities. As a result, such technologies risk alienating historically marginalized groups who speak dialects like AAVE, leading to reduced long-term engagement and perpetuating digital divides in accessing health information and technology. This issue aligns with the broader call for culturally competent healthcare, as defined by Meldrum [16]. This approach emphasizes tailoring care to each patient's unique background, including their social, cultural, and linguistic needs. However, practical challenges in implementing this competency within healthcare systems persist, as noted in various studies [2, 9]. These include staff shortages and bureaucratic constraints, which can impede clinicians' ability to fully engage with patients' unique backgrounds [14]. In this context, the emerging domain of Artificial Intelligence (AI), especially Large Language Models (LLMs), presents a promising avenue for augmenting healthcare communication. These models, with their advanced capabilities in comprehending and generating nuanced human-like language, hold potential to revolutionize patient interactions. By grasping the subtleties of various languages and dialects, they could enable more culturally attuned and linguistically sensitive communication, thereby not only improving the delivery of healthcare but also potentially leading to more accurate diagnoses and a deeper exchange of health-related information [14]. This technological advancement aligns well with the goals of culturally competent healthcare, offering a bridge over existing gaps in clinician-patient interactions. 1.3 Goal of This Study The primary aim of this study is to evaluate the capability of Large Language Models (LLMs), with a specific focus on GPT-4, to accurately simulate patients who communicate using African American Vernacular English (AAVE). AAVE, a significant variant of American English that often diverges from Standard English, offers an interesting linguistic case study. Given its distinct linguistic features and status as a well-researched dialect, AAVE provides an ideal benchmark for assessing the linguistic proficiency of generative models like GPT-4. Our research question is: Can LLMs, particularly GPT-4, effectively simulate the speech of patients using AAVE? To investigate this, we have designed experiments where GPT-4 simulates interactions with patients speaking AAVE, responding to typical diagnostic questions similar to those posed by healthcare professionals. The GPT-4's responses are generated based on the medical cases embedded within our prompts. To identify effective prompt-based learning methods for the simulation of AAVE-speaking patients, our research involves experimenting with various prompt structures, ranging from straightforward medical cases to more intricate scenarios incorporating demographic and linguistic nuances. The medical cases in our prompts are derived from the United States Medical Licensing Examination (USMLE) Computer-Based Case Simulations (CCS), offering a diverse and realistic range of patient situations, designed specifically to test the clinical application skills of medical students. The feasibility of using GPT-4 to simulate AAVE-speaking patients opens significant implications for healthcare communication. For example, introducing clinicians to AI-generated examples of linguistic diversity, like those examined in our study, could expand their understanding of different cultural contexts. This, in turn, is likely to foster greater empathy in patient care. In the latter part of this paper, we explore broader applications and the potential directions for future research in more detail. Thus, the current work represents an initial step towards bridging linguistic gaps and addressing communication challenges in healthcare. 2 Methods In this study, we propose a framework for simulating patient-physician communication. As shown in Fig. 1 , the framework simulates an arbitrary patient using the GPT-4 model from OpenAI through a carefully crafted prompt. This prompt instructs GPT-4 to act as a patient within a specified medical case, incorporating demographic details and linguistic features. We incrementally increase the complexity by integrating these elements into each prompt. The simulated patient then engages with a set of predefined diagnostic questions generated by ChatGPT. Lastly, we analyze the simulated patient's responses to evaluate the effectiveness of the patient-physician interaction. The following subsections describe the preparation of materials and the detailed mechanism of this framework. 2.1 Medical Case To incorporate realistic patient scenarios into our prompts, our research leveraged the USMLE, a three-step exam required for medical licensure in the U.S. Specifically, we sourced diverse medical cases from the CCS component of the Step 3. This final step of the USMLE evaluates a candidate's ability to apply clinical knowledge and manage patient care in ambulatory settings. The CCS scenarios are designed to mimic real-world patient encounters, challenging candidates to demonstrate clinical proficiency and effective time management under time constraints. The USMLE website [31] provides six CCS sample cases as practice material. We utilized all six cases due to their availability and open access. Each case depicts a patient with specific health conditions and contexts. Examples of these cases include a 65-year-old man with symptoms of acute chest pain and difficulty in breathing, a 32-year-old woman experiencing knee pain and swelling, and so forth. The contexts include vital signs (e.g., pulse, blood pressure, and body mass index), reason for visit, history of present illness, past medical history, family history, and societal variables. For the purposes of our research, we excluded the patient's ethnicity information, originally listed under 'Identifies as,' from the medical cases. The complete text of the six medical cases can be found in Multimedia Appendix 1: [Medical cases]. 2.2 African American as a Demographic Focus In medical cases where patients have identical health conditions, the quality of patient-physician communication can vary significantly based on demographic factors such as age, gender, and race/ethnicity [5]. Our research specifically targets the African American demographic, aligning with our exploration of LLMs’ ability to simulate AAVE. AAVE, a distinct linguistic variety in the U.S., has its roots in the history of African American experiences, notably during the period of slavery [26]. Given its prevalent use and historical significance within the African American community, AAVE naturally becomes our focal demographic group. Furthermore, the existing racial disparities in U.S. healthcare between Black and White populations [13] underscore the relevance and urgency of this approach in advancing a more inclusive healthcare system. To maintain a clear focus, we deliberately exclude other demographic variables such as gender, age, geographical location, or socio-economic status. This strategy allows us to closely examine GPT-4’s ability to produce linguistically relevant responses without the added complexity of multiple demographic dimensions. While our study strategically omits these factors, it is important to acknowledge their significance; they represent promising directions for future research. 2.3 Linguistic Feature AAVE is a dialect rich in history and unique characteristics, predominantly used by African Americans. While it is not monolithic, and indeed, just like any dialect, has regional differences in grammar, vocabulary, and pronunciation, our research necessitates a focus on certain consistent features. For the purposes of evaluating how effectively LLMs can incorporate distinct linguistic characteristics, it was essential to extract typical, common factors from AAVE to set up a baseline for evaluation. This approach allows us to assess the capability of LLMs to emulate these features systematically, even as we recognize the inherent variability among speakers and regions. Its origins can be traced back to Southern American English, a connection highlighted in Mufwene's research [17]. This dialect evolved from the English variants spoken by African slaves and their descendants, predominantly located in the Southern United States. Mufwene’s analysis of AAVE in relation to other dialects reveals notable similarities with Southern White dialects, particularly in syntax elements like negative concord – the use of multiple negatives to express a single negation - and double modals, which involves the use of two modal verbs together. Rickford's foundational research [24] offers a comprehensive compilation of AAVE features, establishing it as a crucial reference point for subsequent studies. He detailed unique traits in AAVE's phonology and grammar. Grammatically, AAVE is distinguished from Standard English by its treatment of verb tense, aspect, and mood, as well as its use of pronouns and negation. Notably, the omission of the copula/auxiliary 'is' and 'are' in present tense leads to constructions like 'He tall' instead of 'He is tall.' AAVE also employs the invariant 'be' to indicate habitual aspects, as in 'He be walkin',' and uses unstressed 'been' or 'bin' for what in Standard English would be 'has/have been.' A stressed 'bin’ denotes actions or states that commenced long ago and may still continue. The use of ‘done’ in AAVE emphasizes completed actions and can co-occur with ‘been,’ as in ‘He done did it’ or ‘They done been sitting there an hour.’ In terms of nouns and pronouns, AAVE often omits the possessive -s, as in ‘John house,’ and can use associative plurals marked by ‘and (th)em’ or ‘nem.’ Negation in AAVE, characterized by the use of ‘ain’t’ as a general preverbal negator and multiple negation or negative concord, is a well-studied aspect as well. Pullum [21] complements the understanding of AAVE by providing more nuanced rules in negative concord and copula omission. Pullum notes the repositioning of negative auxiliary verbs at sentence beginnings, especially when the subject is indefinite, as in 'Ain't nobody gonna find out.' He also enumerated contexts where copula omission does not occur, such as when the copula bears accent, is infinitival, expresses habitual aspect, is in the past tense, is first-person singular, begins a clause, or occurs in a confirmatory tag at the end of a sentence. Building upon these foundational studies, Wolfram[33] and Thomas[29] add social, cultural, and economic dimensions to our understanding of AAVE. Thomas focuses on the evolution and variation of AAVE, distinguishing it from broader African American English (AAE) used by different social classes. He underscores the unique migration history of AAVE, originating in the South and transitioning to urban centers during the Great Migration. This shift to urban life significantly influenced the dialect, leading to some dialect leveling as African Americans from various regions mixed in new urban communities. Wolfram highlights AAVE's strong association with urban black youth culture, noting the age-graded usage of 'habitual be,' predominantly found among younger speakers. This suggests a continuous evolution of AAVE within urban settings. Wolfram identifies a 'supra-regional core' of AAVE, acknowledging some regional variation but emphasizing shared features across different urban areas. These key grammatical features mostly align with Rickford's work and include copula absence, invariant BE, completive 'be done,' remote 'been,' and unique traits in negation and nominals. Wolfram also categorizes the features of urban AAVE into stable, intensifying, and receding traits, highlighting the dynamic nature of the dialect. This emphasizes the continuity with historical rural AAVE roots and the ongoing changes reflecting urban influences. In this research, while recognizing the dynamic nature and diversity within AAVE, as explored by these works, we primarily draw upon Rickford’s presentation of AAVE. His works offers clarity, comprehensive coverage, and foundational status in AAVE studies. However, we have tailored his framework to align with our research objectives. For instance, we excluded most phonological aspects due to the limitations of LLMs, which produce written responses and cannot capture phonetic aspects. Additionally, to prevent the LLMs from merely replicating specific vocabulary, we have minimized the inclusion of direct vocabulary presented as lexical features in Rickford's work. Nonetheless, we selectively incorporate certain phonological and lexical features that are central to AAVE and widely discussed in the literature. In our simulations of phonological features, GPT-4 employs orthographic representations, such as “havin’”,”’round”, “’cept”. Our methodology also omits features less frequently mentioned in recent studies, thereby focusing on the most prominent and impactful aspects of AAVE as per current academic consensus. This approach acknowledges the limitations of our study, particularly in the context of phonological representation, while striving for a comprehensive and relevant analysis of AAVE within the capabilities of current LLMs. Ultimately, our research focuses on 37 carefully selected features. Table 1 presents the linguistic features identified for our research, with a descriptive profile for each feature on the right and their corresponding categories on the left. This categorization was crucial for effectively identifying and annotating AAVE features in the patients' responses. Utilizing all 37 features as separate labels would have resulted in an impractically extensive list for annotation purposes. Considering balance, we decided to utilize the bolded titles in Table 1 as labels during our annotation process, aiding in the identification of AAVE features in the patients' responses. While we have endeavored to compile a thorough list of these linguistic features, the list isn't exhaustive. Therefore, we introduced an "out of list" label (refer to the ‘Quantitative Analysis and Annotation Strategy’ section). Full explanations and examples of each feature are provided in Multimedia Appendix 2: [Linguistic features of AAVE]. Table 1 AAVE Linguistic Features Grammatical features Pre-verbal markers Omission of "is" and "are" Invariant “be” habitual actions contractions of "will/would be" “been”/ “bin” unstressed "been" for present perfect stressed "been" for the action that happened a long time ago Use of “done” for a distant past tense Use of “be done” for a future perfect tense Use of “had” for a past tense Use of double modals Verbal tense-number marking Absence of third person singular present -s, doesn't, or has Use of "is" and "was" for plural and second person subjects Use of past tense for past participle Use of past participle for past tense Use of verb stem (root forms) for past tense Reduplicated Tense Marking Nouns and pronouns Unmarked possessives Unmarked plural forms Regularization of irregular plural nouns Use of "an 'em", "and 'em", "nem" to mark associative plurals Appositive or pleonastic pronouns Use of “y'all” for the 2nd person plural Use of demonstrative “them” Omission of relative pronoun Negation Negative concord Negative inversion Use of "ain't" as a general preverbal negator Use of "ain't" + "but", and "don't" + "but" to indicate "only" Questions Formation of direct questions without inversion Inversion in embedded questions Existential and locative construction Use of existential "it", "they" (or "dey") Use of existential "they got" Use of "here go" as a static locative or presentational form Lexical features Use of "steady" for consistent, persistent or repeated action Use of "come" to imply the speaker's indignation Use of "finna" to indicate immediate future actions Phonological features Replacement of final "ing" with "in' " Omission of unstressed syllables at the beginning and middle 2.4 Prompt Design As indicated in previous study [15], LLMs generate the response \(\:Y=[{y}_{1},\dots\:,{y}_{N}]\) based on the given contexts \(\:X\) , which is commonly referred to a prompt and a natural language description of a task of interest. Let \(\:p\) denote the GPT-4, it is conditioned on \(\:X\) and generates every token \(\:{y}_{t}\) in an autoregressive manner \(\:p\left({y}_{t+1}\right|X,{y}_{t})\) . In our study, we crafted the prompt integrating medical cases, a demographic variable, and linguistic features, incrementally increasing the complexity, to ask GPT-4 to simulate AAVE-speaking patients. We designed four types of prompts for each medical case to assess GPT-4's ability of patient simulation. Baseline Prompt (BaseP): Baseline prompt delineates only the medical case, without additional demographic or linguistic variables, serving as our comparison benchmark. Your task is to role-play as a patient in the given medical case, which is enclosed by """. 1. Respond to questions posed by a user who is acting as a doctor. 2. If the patient in the case cannot communicate, you should respond as their caregiver. This could be a family member or friend who has accompanied the patient. 3. Always stick to the details provided in the case. Medical case: """ {medical case} """ 2. Demographic Prompt (DemoP): This version integrates a demographic variable with the medical case. We utilized the phrase "Ensure your responses incorporate the linguistic features common to the way many African Americans speak English" as the demographic variable. It aims to assess how GPT-4 handles demographic information in the context of medical cases. Your task is to role-play as a patient in the given medical case, which is enclosed by """. 1. Respond to questions posed by a user who is acting as a doctor. 2. If the patient in the case cannot communicate, you should respond as their caregiver. This could be a family member or friend who has accompanied the patient. 3. Always stick to the details provided in the case. 4. Ensure your responses incorporate the linguistic features common to the way many African Americans speak English. Medical case: """ {medical case} """ 3. Linguistic Prompt (LingP): This prompt combines the medical case with specific linguistic features of AAVE. The intention is to evaluate GPT-4's capability to understand and respond using the linguistic traits of AAVE. Your task is to role-play as a patient in the given medical case, which is enclosed by """. 1. Respond to questions posed by a user who is acting as a doctor. 2. If the patient in the case cannot communicate, you should respond as their caregiver. This could be a family member or friend who has accompanied the patient. 3. Always stick to the details provided in the case. 4. Ensure your responses incorporate the linguistic features outlined between ***. Medical case: """ {medical case} """ Linguistic features: *** {features} *** 4. Comprehensive Prompt (CompP): The most detailed version, this prompt intertwines the medical case with both demographic variable and pertinent linguistic traits. This comprehensive prompt aims to simulate a more complex and realistic patient interaction scenario. Your task is to role-play as a patient in the given medical case, which is enclosed by """. 1. Respond to questions posed by a user who is acting as a doctor. 2. If the patient in the case cannot communicate, you should respond as their caregiver. This could be a family member or friend who has accompanied the patient. 3. Always stick to the details provided in the case. 4. Ensure your responses incorporate the linguistic features common to the way many African Americans speak English, as described between ***. Medical case: """ {medical case} """ Linguistic features: *** {features} *** 2.5 Question & Chat Simulation The simulation of patient-physician interactions is conducted by sequentially posing a set of prepared questions to GPT-4. These diagnostic questions, crafted by ChatGPT 4 in response to the query, 'common questions physicians ask patients', are designed to maintain neutrality concerning the patient’s gender, age, and symptoms. The selection of these questions is strategically based on their prevalence and universal relevance across all six medical cases, ensuring that they are broadly applicable and fundamental to the diagnostic process. As a result, we have compiled a list of six diagnostic questions routinely used by physicians: What brings you in today? Have you had any procedures or major illnesses in the past 12 months? Are you currently taking any medications, including over-the-counter and herbal supplements? What allergies do you have? Have you traveled anywhere recently? Have you been exposed to anyone who's been sick recently? Additionally, to accommodate scenarios where direct responses from patients may not be possible, such as in cases involving young children or patients in coma, we adapted the phrasing of each question, creating variations of the questions with different subjects ("you", "he", "she"). This allows the simulation to accommodate a wide range of clinical scenarios, promoting an uninterrupted and natural flow of conversation throughout the diagnostic interaction. 2.6 Quantitative Analysis and Annotation Strategy To assess the effectiveness of GPT-4 in simulating AAVE-speaking patients, we conducted a quantitative analysis of the linguistic features present in the patient's responses. This analysis involved annotating these features and then applying statistical tests, such as Analysis of Variance (ANOVA) and t-tests, to understand their prevalence and significance. By quantifying these features, we aimed to understand not only the presence but also the extent of GPT-4’s ability to replicate specific linguistic elements associated with AAVE. Our annotation strategy, therefore, played a critical role in how we processed and interpreted the responses. The annotation process was guided by an expert in AAVE, an African American healthcare professional with over 10 years of nursing experience in the U.S. She compiled the annotation guidelines and ensured the cultural and linguistic authenticity of the process. The final label set for annotation includes the linguistic categories, those with bolded titles in Table 1 , along with an 'out of list' category. This category accounts for deviations from 'standard English' not covered by the linguistic features in the prompt, which may not be specific to AAVE or any racial/ethnic group. After completing the annotations, we will determine whether these 'out of list' features can be classified as AAVE. In defining the boundaries to annotate, the primary rule is to tag each word or phrase exhibiting linguistic features, detailed in Multimedia Appendix 2: [Linguistic features of AAVE], in the responses. This process often results in overlaps. More detailed guidelines for ambiguous cases can be accessed on Multimedia Appendix 3: [Annotation guidelines]. The annotation team included two experienced NLP researchers, with extensive backgrounds in healthcare and biomedical domains, who have conducted multiple projects involving complex annotation tasks. We utilized Label Studio[30], an open-source data labeling tool, employing the Named Entity Recognition template suitable for marking relevant spans of text and categorizing them into pre-defined labels. This was specifically pertinent to identifying and labeling pre-defined linguistic features from the patients’ responses. We calculated the Inter-annotator Agreement (IAA) score to assess the reliability of our annotations. It is important to note that the purpose of this annotation process is to examine the text generated by the model, rather than to create a benchmark dataset. Therefore, discrepancies between annotators were not resolved procedurally; instead, we used annotations on which two annotators had reached agreement for further statistical analysis. We referred to previous study [12] for the formula to compute the degree of agreement. The following formula for positive specific agreement ( \(\:P\) ) is to calculate the agreement between two annotators for text markup tasks. $$\:P=\frac{2a}{2a+b+c}$$ , where \(\:a\) is the number of identified features that both annotators agree, and \(\:b\) as well \(\:c\) is the number of identified features that only one annotator agrees. 3 Results 3.1 Annotation Table 2 Inter-Annotator Agreement for Each Linguistic Feature Agreement Support Grammatical Features Pre-verbal markers 0.8 155 Verbal tense-number marking 0.8 15 Nouns and pronouns 0.6 14 Negation 0.98 298 Questions - 0 Existential and locative construction - 0 Lexical Features 0.5 1 Phonological Features 0.97 186 Out of List 0.8 234 Average 0.9 903 Weighted Average 0.9 Table 2 quantitatively demonstrates the consistency of agreement among annotators in annotating linguistic features. Generally, a high level of agreement is observed. However, specific features such as 'Nouns and Pronouns' and 'Lexical Features' show lower levels of agreement, with scores of 0.6 and 0.5, respectively. This low agreement may stem from the limited number of annotations in these categories. Interestingly, categories like 'Questions' and 'Existential and Locative Constructions' received no annotation, suggesting that the GPT-4 did not simulate patients with these features. In contrast, 'Negation' achieved the highest level of agreement at 0.98, along with the most substantial annotation support (298 annotations). 3.2 Comparison of Prompt Effectiveness We assessed the effectiveness of various prompts by counting the number of AAVE linguistic features. BaseP, which doesn’t include any demographic or linguistic information, was excluded from this analysis because we did not observe any AAVE linguistic features from the responses answered by its simulated patients. Our focus was on comparing the effectiveness of DemoP, LingP, and CompP, aiming to determine which prompt most frequently elicited AAVE linguistic features in the simulated patients. Figure 2 presents a heatmap illustrating the distribution of AAVE features across different prompts. This analysis encompassed a total of 6 medical cases (rows), where each case includes 6 responses. We compiled all AAVE annotations agreed upon by both annotators and represented them in the heatmap. Generally, in terms of frequency, CompP elicited more AAVE features compared to the other two. Additionally, more features were identified in DemoP, which included the demographic variable, than in LingP, which comprised a list of AAVE linguistic features. We observed an unexpected concentration of AAVE linguistic features within the responses from the sixth medical case, to LingP. While LingP generally elicited fewer AAVE features compared to the CompP, these responses deviated from the pattern. This deviation may suggest the dynamic nature of language models, underscoring the importance of evaluating them across a diverse array of responses to fully grasp their linguistic capabilities and patterns. Nevertheless, CompP exhibited more consistency, reliably producing responses with a higher frequency of linguistic features compared to those elicited by DemoP and LingP. We conducted a series of statistical analyses to determine if the differences in the number of linguistic features identified in responses generated by each prompt were statistically significant. Initially, we applied an ANOVA to examine any notable disparities in the counts of linguistic features across the three types of prompts (DemoP, LingP, and CompP). If the ANOVA indicated significant differences (with a significance level set at α = .05), we performed t-tests for post hoc analysis. This would involve comparing each pair of prompts (e.g., DemoP vs. LingP, DemoP vs. CompP, and LingP vs. CompP). It is important to note that the simulated responses in a row (as shown in Fig. 2 ) originate from the same question in identical medical cases and were subjected to all three prompt types. This constitutes repeated measures on the same subjects, thereby necessitating the use of a one-way repeated measures ANOVA and paired samples t-tests. Our analysis began with a comprehensive evaluation of all feature types, followed by the same statistical tests—ANOVA and t-tests—within each individual feature type for a finer analysis. Table 3 One-Way Repeated Measures ANOVA for All Types Count Sum Mean (SD) DemoP 36 300 8.3 (4) LingP 36 249 6.9 (4.2) CompP 36 354 9.8 (4.4) Total 108 903 8.4 (4.3) Sum of Squares F test ( df ) P -value Between Groups 153.2 6.2 (2, 105) .003 Within Groups 1851.8 Total 2004.9 Table 3 shows the descriptive statistics for all feature types across the three prompts. The statistical results align with the heatmap observations (Fig. 2 ), CompP generated the most features, with an average of 9.83. This outcome was anticipated because CompP includes comprehensive cues about AAVE, including the demographic indication and specific AAVE linguistic features. Moreover, the ANOVA results reveal a statistically significant difference in the effectiveness of the three prompts, with P = .003, well below the threshold of P = .05. Table 4 Paired Two-Sample t -Test for All Types DemoP, LingP DemoP, CompP LingP, CompP Mean (SE) -1.4 (0.8) 1.5 (0.8) 2.9 (0.9) t test (df) -1.7 (35) 1.9 (35) 3.4 (35) P -value (1-tail) .048 .03 < .001 Table 4 presents the results of t-tests conducted on pairs of prompts, highlighting significant differences between two groups. In each pair, the first group name in the header represents Group 1, and the second name represents Group 2, with the differences calculated as Group 2 minus Group 1. Notably, the most comprehensive prompt, CompP, significantly outperformed both DemoP and LingP. This suggests that the inclusion of both demographic indicators and linguistic features enhances the effective simulation of AAVE features in the GPT-4’s responses. Interestingly, DemoP led to significantly more features in the GPT-4’s responses compared to LingP, implying that the demographic variable may be a more crucial factor. Therefore, we conducted fine-grained analyses for each specific linguistic feature type to further reveal these details. Table 5 illustrates the distribution of nine linguistic feature types across three types of prompts: DemoP, LingP, and CompP. It reveals that 'Questions' and 'Existential and locative constructions' had no annotations, and 'Lexical features' had only one. Consequently, these three types of linguistic features were excluded from our detailed feature analysis due to the limited number of annotations. Among the remaining six feature types, significant ANOVA results were observed for: Pre-verbal markers, Nouns and pronouns, Negation, and Phonological features. Table 5 Distribution of Linguistic Feature Types Across Prompts DemoP LingP CompP Pre-verbal markers 35 52 68 Verbal tense-number marking 7 3 5 Nouns and pronouns 1 3 10 Negation 101 84 113 Questions 0 0 0 Existential and locative construction 0 0 0 Lexical features 1 0 0 Phonological features 68 32 86 Out of list 87 75 72 Total 300 249 354 Table 6 presents the per-type statistical analysis (ANOVA and t-test) for the four features that showed significant ANOVA results. It was observed that CompP significantly generated a greater number of features than DemoP and LingP across all types. The comparative effectiveness of DemoP and LingP remained statistically inconclusive in most cases. However, for the 'Pre-verbal markers' type, LingP elicited more features than DemoP, with P = .01. Conversely, in the 'Phonological features' type, DemoP was significantly more effective than LingP, as indicated by P = .007. For the other types, we found no statistically significant differences between DemoP and LingP. Table 6 Per-Type Analysis ANOVA t test F test ( df ) P -value DemoP, LingP DemoP, CompP LingP, CompP Pre-verbal markers 7.3 (2, 105) .001 Mean (SE) 0.5 (0.2) 0.9 (0.2) 0.4 (0.3) t test ( df ) 2.3 (35) 3.8 (35) 1.7 (35) P -value (1-tail) .01 < .001 .05 Nouns and pronouns 5 (2, 105) .0097 Mean (SE) 0.1 (0.1) 0.3 (0.1) 0.2 (0.1) t test ( df ) 0.8 (35) 3(35) 2 (35) P -value (1-tail) .21 .002 .03 Negation 3.6 (2, 105) .03 Mean (SE) -0.5 (0.3) 0.3 (0.3) 0.8 (0.3) t test ( df ) -1.6 (35) 1 (35) 2.7 (35) P -value (1-tail) .06 .15 .005 Phonological features 8.7 (2, 105) < .001 Mean (SE) -1 (0.4) 0.5 (0.3) 1.5 (0.4) t test ( df ) -2.6 (35) 1.5 (35) 4.1(35) P -value (1-tail) .007 .08 < .001 Notably, the 'Phonological features' type exhibits the most pronounced difference between DemoP and LingP. This distinct pattern, coupled with the considerable presence of phonological features in the overall dataset, suggests that this feature type is a contributing factor that explains why DemoP elicited a significantly higher number of features compared to LingP in the evaluation of all feature types, as observed in Table 4 . 3.3 Out-of-List Features We analyzed 'out-of-list' features that were annotated by both annotators to assess their alignment with AAVE. The following features were identified: Omission of the subject in sentences Use of “Doc” as a colloquial abbreviation for “doctor” Use of “Dang” as a euphemism for “damn” Use of “Naw” or “nah” in place of “no” Use of “Ma” as a colloquial form of “my” Use of “Real” for “really,” as exemplified in “My girl here been feelin’ real bad.” Use of “Outta” as a contraction of “out of” Use of “Ya” as a colloquial form of “you” Use of "Swole" to describe being muscular Use of an idiomatic expression "a good long while" to indicate a substantial period of time Use of “lawd” as an expressive form of “Lord,” denoting frustration, exasperation, or admiration Use of “all” for emphasis, as in various expressions like “She been workin' a lot and all,” “Then she started feelin' all tired,” “I'm all confused 'n stuff,” “my feet been all swole up.” Note that these features are not exclusive to AAVE and can be found in other varieties of English, including Southern American English. When and how these features are used is significant in determining whether they're being used as part of AAVE or as part of informal, colloquial speech more broadly. However, the significant aspect of this study is the observation that these features prominently appear in GPT-4's responses only when prompted with demographic descriptors (DemoP) or specific AAVE linguistic features (LingP). This pattern suggests that, despite the inherent ambiguity of these features, the model demonstrates a capability to selectively engage with them in a contextually appropriate manner when provided with relevant cues. Thus, while these features alone may not uniquely define AAVE, their conditional appearance in the model’s output can be interpreted as evidence of the model’s ability to represent the nuanced use of language associated with African American speech. This capability demonstrates its potential in reflecting the linguistic diversity and nuances present in human language use. 4 Discussion Our study demonstrates that the GPT-4 consistently exhibits AAVE features across various prompts, with the exception of the BaseP, which served as a benchmark. CompP, which combined the demographic variable and linguistic features, emerged as the most effective format, as shown by both counts and statistical analysis. However, the relative effectiveness of DemoP (demographic indicator) versus LingP (linguistic features) remains statistically ambiguous. However, for the 'phonological features' category, DemoP, which includes only a demographic variable, was significantly more effective at simulating AAVE than LingP, which provided explicit phonological details, as indicated by a very low p-value. This implies that for phonological features, a general demographic cue was more effective than detailed linguistic specification, a finding that challenges intuition given the complexity of LingP. In essence, the mere inclusion of a demographic indicator seems sufficient to elicit an abundant production of phonological features. This suggests that there might be an inherent mechanism within the GPT-4 that strongly associates African American community with specific phonological characteristics. Moreover, our findings shed light on the GPT-4's capacity to autonomously generate unique linguistic behaviors, which we have termed 'out-of-list features.' These features, absent in responses to the BaseP — a prompt lacking demographic indicators and specific linguistic characteristics — emerged when either demographic or linguistic cues were present. Notably, when presented with demographic information alone, GPT-4 independently incorporated certain linguistic characteristics, some surpassing our study's initial scope. This demonstrates GPT-4's comprehensive grasp of AAVE. Contrastingly, when prompts explicitly specified linguistic features, GPT-4 did not confine its responses to these inputs. Instead, it integrated additional linguistic elements, possibly associated with the African American demographic. This emergent behavior opens avenues for further research. Subsequent studies might explore GPT-4's intrinsic understanding of AAVE through ablation experiments, removing individual features to observe if the model continues to represent them. However, our research identified limitations in GPT-4's simulation of certain AAVE features, such as 'questions,' 'existential and locative constructions,' and 'lexical features.' Multiple factors could contribute to these limitations, including potentially unclear prompts or the rarity of these features in real-world usage. GPT-4's effectiveness hinges on its training data; a lack of diverse AAVE examples can hinder its ability to accurately replicate these dialect features. Even when employing reinforcement learning techniques like RLHF, the biases inherent in the original dataset or the trainers’ preferences for more standard language outputs may still restrict the model's capability to simulate specific dialects or styles. To improve its simulation of dialects like AAVE, GPT-4 would benefit from more diverse and representative training data, coupled with training methods that emphasize linguistic diversity. This highlights the importance of continued research and innovation to enhance LLMs’ abilities to recognize and reproduce the vast array of human languages. Additionally, the primary goal of our experimental design was to assess whether GPT-4 could adopt AAVE features to align with the tone and context of an ongoing medical conversation, rather than simulating every defined feature without omission. If our objective had been to test specific features, we would have designed customized diagnostic questions to elicit these responses. For example, despite our main experiment showing no instances of 'questions' features—due to the nature of doctor-patient interactions, where typically the doctor poses questions, limiting the model's opportunities to initiate questions—an additional prompt asked during the study, 'Do you have any questions?' elicited the response: 'Well, doc, what you think this chest pain be about?' This demonstrates GPT-4’s ability to form direct questions without inversion, highlighting the influence of experimental design on observed outcomes. One promising direction opened by this study is the potential development of AI systems, equipped with LLMs, tailored to specific dialects, such as AAVE. These systems could enhance interpretation within clinical settings and medical training by providing simulations focused on linguistic adaptability. For example, building on the concept of virtual patient simulators, like the one developed using Siamese LSTM architecture by Furlan et al.[8], which improved medical students' diagnostic reasoning and learning outcomes through interactive feedback and targeted review suggestions, AI could replicate patients with specific linguistic features. This would enable medical students to practice and refine their communication strategies by interacting with virtual patients who exhibit these linguistic characteristics. Furthermore, such simulations could also feature patients with varied demographic characteristics, including those with limited health literacy or non-native speakers of the dominant language, thus preparing clinicians for a broader range of communication challenges. The essence of this approach lies in fostering linguistic adaptability in healthcare communication, which aligns with patient-centered care principles. However, it's important to note that the integration of socio-cultural elements into these simulations is complex, was outside the scope of this study and requires further research. It is crucial to recognize that while language models like GPT-4 offer significant potential, they also carry a risk of perpetuating stereotypes. Developers and researchers must remain vigilant about these potential biases to actively work towards minimizing them. Implementing measures such as involving linguists and cultural studies experts in system design is essential to ensure that AI technologies support equitable and respectful interactions in healthcare settings. 5 Conclusions This study delves into the capabilities of generative AI, particularly GPT-4, in replicating the linguistic nuances of AAVE. Our findings reveal that LLMs like GPT-4 are adept at simulating the linguistic behaviors of specific demographic groups. This ability is pivotal in bridging dialectal and linguistic divides in healthcare, enhancing communication, and serving as an invaluable training resource. By interacting with AI systems that mirror a range of patient demographics, medical trainees can significantly enhance their communication skills, better preparing them for real-world patient interactions and enhancing patient care. Despite the promising findings of this study, it is essential to acknowledge its limitations. The primary focus was on the simulation of AAVE, without considering other demographic variables such as gender, age, geographical location, or socio-economic status. This narrow focus might limit the generalizability of our findings across broader demographic characteristics. Moreover, our research only addressed the written aspects of AAVE due to the text-based nature of LLM outputs, which excludes phonological features that are crucial to authentic spoken communication. Furthermore, while our findings indicate potential applications in medical training simulations, we did not empirically test these applications within actual clinical settings, which is necessary to truly assess the efficacy of such training tools. Recognizing these limitations, future research should expand the scope to include a more comprehensive array of demographic and sociolect factors and explore the practical application of LLM simulations in clinical training scenarios to fully realize the potential benefits outlined in this study. By developing customized LLMs that consider these aspects, healthcare providers can significantly reduce communication barriers, which can lead to improved patient outcomes. This approach will ensure the creation of AI models that are not only technologically advanced but also culturally sensitive and inclusive, contributing to a more equitable healthcare system. Abbreviations AAE African American English AAVE African American Vernacular English AI Artificial Intelligence ANOVA Analysis of Variance BaseP Baseline Prompt CCS Computer-Based Case Simulations ComP Comprehensive Prompt DemoP Demographic Prompt IAA Inter-annotator Agreement LLMs Large Language Models LingP Linguistic Prompt USMLE United States Medical Licensing Examination Declarations Competing Interests Christopher C. Yang, the corresponding author of this manuscript, is also the editor of this journal. To ensure a transparent review process, he will not be involved in the editorial handling or decision-making of this submission. An alternative editor will oversee the peer review process. The authors have no other competing interests to declare. Funding This work was funded by the National Science Foundation under the Grants IIS-1741306 and IIS-2235548, and by the Department of Defense under the Grant DoD W91XWH-05-1-023. Author Contribution Y.L. and C.-H.C. wrote the manuscript. C.C.Y. supervised the entire process and confirmed the final version of the manuscript. All authors reviewed the manuscript. Acknowledgement This work was supported in part by the National Science Foundation under the Grants IIS-1741306 and IIS-2235548, and by the Department of Defense under the Grant DoD W91XWH-05-1-023. This material is based upon work supported by (while serving at) the National Science Foundation. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. References Rosa M, Avila, Bramlett MD (2013) Language and immigrant status effects on disparities in Hispanic children’s health status and access to health care. Matern. Child Health J. 17, (2013), 415–423 Jean E, Beatson (2016) Addressing health disparities through cultural and linguistic competency trainings. ABNF J. 27, 4 (2016), 83 Hilary B, Bergsieker JN, Shelton, Richeson JA (2010) To be liked versus respected: Divergent goals in interracial interactions. J. Pers. Soc. Psychol. 99, 2 (2010), 248 Kecia Carroll (2013) Socioeconomic status, race/ethnicity, and asthma in youth. American Thoracic Society Jennifer E, DeVoe LS, Wallace, Fryer GE (2009) Measuring patients’ perceptions of communication with healthcare providers: do differences in demographic and socioeconomic characteristics matter? Health Expect. Int. J. Public Particip. Health Care Health Policy 12, 1 (March 2009), 70–80. https://doi.org/10.1111/j.1369-7625.2008.00516.x Regine A, Douthard IK, Martin T, Chapple-McGruder A, Langer, Soju, Chang (2021) US maternal mortality within a global context: historical trends, current state, and future directions. J. Womens Health 30, 2 (2021), 168–177 Peter Franks P, Muennig E, Lubetkin, Jia H (2006) The burden of disease associated with being African-American in the United States and the contribution of socio-economic status. Soc. Sci. Med. 62, 10 (May 2006), 2469–2478. https://doi.org/10.1016/j.socscimed.2005.10.035 Raffaello Furlan M, Gatti R, Menè D, Shiffer C, Marchiori AG, Levra V, Saturnino E, Brunetta (2021) and Franca Dipaola. A Natural Language Processing-Based Virtual Patient Simulator and Intelligent Tutoring System for the Clinical Diagnostic Process: Simulator Development and Case Study. JMIR Med. Inform. 9, 4 (April 2021), e24073. https://doi.org/10.2196/24073 DeWan Gibson and Mei Zhong (2005) Intercultural communication competence in the healthcare context. Int. J. Intercult. Relat. 29, 5 (2005), 621–634 John J Gumperz. 1982. Discourse strategies. Cambridge University Press Christina N, Harrington R, Garg A, Woodward, Williams D (2022) It’s kind of like code-switching: Black older adults’ experiences with a voice assistant for health information seeking. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, 2022. 1–15 Hripcsak G, Rothschild AS (2005) Agreement, the f-measure, and reliability in information retrieval. J. Am. Med. Inform. Assoc. 12, 3 (2005), 296–298 Rachel L, Johnson D, Roter NR, Powe, Cooper LA (2004) Patient race/ethnicity and quality of patient–physician communication during medical visits. Am. J. Public Health 94, 12 (2004), 2084–2090 Lee P, Goldberg C, Kohane I (2023) The AI revolution in medicine: GPT-4 and beyond. Pearson Pengfei Liu W, Yuan J, Fu Z, Jiang H, Hayashi, Neubig G (2023) Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Comput. Surv. 55, 9 (2023), 1–35 Helen Meldrum (2009) Characteristics of Compassion: Portraits of Exemplary Physicians: Portraits of Exemplary Physicians. Jones & Bartlett In S, Mufwene J, Rickford J, Baugh, Bailey G (1998) Coexistent systems in African-American English. Struct. Afr.-Amreican Engl. (1998), 110–153 Alan Nelson (2002) Unequal treatment: confronting racial and ethnic disparities in health care. J. Natl. Med. Assoc. 94, 8 (August 2002), 666–668 Alexander N, Ortega, Peter J, Gergen AD, Paltiel H, Bauchner KD, Belanger, Leaderer BP (2002) Impact of site of care, race, and Hispanic ethnicity on medication use for childhood asthma. Pediatrics 109, 1 (2002), e1–e1 Young su Park (2012) Cultural conflicts over the illness experiences of Korean Chinese migrant workers. Seoul National University. Retrieved from https://hdl.handle.net/10371/134205 Geoffrey K, Pullum (1999) African American Vernacular English is not standard English with mistakes. Work. Lang. Prescr. Perspect. (1999), 59–66 Tamara Rakić and Anne Maass (2018) Communicating between groups, communicating about groups. Language, Communication, and Intergroup Relations. Routledge, pp 66–97 George B, Ray (2009) Language and interracial communication in the United States: Speaking in Black and White. Peter Lang Rickford JR (1999) African American vernacular English: Features, evolution, educational implications. No Title (1999) Merrill Singer H, Baer DL, Pavlotski A (2019) Introducing medical anthropology: a discipline in action. Rowman & Littlefield Geneva Smitherman, Samy Alim H (2021) Word from the mother: Language and African Americans. Routledge Jamila K, Taylor (2020) Structural racism and maternal health among Black women. J. Law. Med. Ethics 48, 3 (2020), 506–517 Sachiko Terui (2017) Conceptualizing the pathways and processes between language barriers and health disparities: review, synthesis, and extension. J. Immigr. Minor. Health 19, (2017), 215–224 Erik R, Thomas (2007) Phonological and phonetic characteristics of African American vernacular English. Lang. Linguist. Compass 1, 5 (2007), 450–475 Maxim Tkachenko M, Malyuk A, Holmanyuk (2020) and Nikolai Liubimov. Label Studio: Data labeling software. Retrieved from https://github.com/heartexlabs/label-studio USMLE. (n.d.). Computer-based Case Simulations. Retrieved January 11 (2024) from https://www.usmle.org/step-3-test-question-formats/computer-based-case-simulations Wen Yingxi and Cho Il Young (2017) A Comparative Study between the changes of China’s Joseonmal Ttuieosseugi revised in 2016 and current Spacing Word Rules of South and North Korean. J. Soc. Korean Lang. Lit. 81 (2017), 187–222 Walt Wolfram (2004) The grammar of urban African American vernacular English. Handb. Var. Engl. 2, (2004), 111–32 Additional Declarations Competing interest reported. Christopher C. Yang, the corresponding author of this manuscript, is also the editor of this journal. To ensure a transparent review process, he will not be involved in the editorial handling or decision-making of this submission. An alternative editor will oversee the peer review process. The authors have no other competing interests to declare. Supplementary Files appendix1.docx appendix2.docx appendix3.docx Cite Share Download PDF Status: Published Journal Publication published 11 Mar, 2025 Read the published version in Journal of Healthcare Informatics Research → Version 1 posted Editorial decision: Revision requested 03 Dec, 2024 Reviews received at journal 29 Nov, 2024 Reviews received at journal 16 Nov, 2024 Reviewers agreed at journal 07 Nov, 2024 Reviewers agreed at journal 07 Nov, 2024 Reviewers invited by journal 07 Nov, 2024 Editor assigned by journal 28 Oct, 2024 Submission checks completed at journal 25 Oct, 2024 First submitted to journal 17 Oct, 2024 You are reading this latest preprint version Research Square lets you share your work early, gain feedback from the community, and start making changes to your manuscript prior to peer review in a journal. As a division of Research Square Company, we’re committed to making research communication faster, fairer, and more useful. We do this by developing innovative software and high quality services for the global research community. Our growing team is made up of researchers and industry professionals working together to solve the most critical problems facing scientific publishing. Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-5279660","acceptedTermsAndConditions":true,"allowDirectSubmit":false,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":374510896,"identity":"86d42c97-fd7d-4325-9d01-2e27a90a5584","order_by":0,"name":"Yeawon Lee","email":"","orcid":"","institution":"Drexel University","correspondingAuthor":false,"prefix":"","firstName":"Yeawon","middleName":"","lastName":"Lee","suffix":""},{"id":374510898,"identity":"0f954412-d94c-4f96-8868-aeafe2459a72","order_by":1,"name":"Chia-Hsuan Chang","email":"","orcid":"","institution":"Drexel University","correspondingAuthor":false,"prefix":"","firstName":"Chia-Hsuan","middleName":"","lastName":"Chang","suffix":""},{"id":374510902,"identity":"aa678680-3886-4bcb-8fea-f62761cecd1b","order_by":2,"name":"Christopher C. Yang","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAAyUlEQVRIiWNgGAWjYHACxgcSFcxgxgEGBmaitDAbWJyBqCRaC5tEZRspWvjbzxhI3Jxnnc8vkXzgAEOFdWIDIS0SZ3IMDGduS7ec2XMs4QDDmXTCWgwkeAySJbcdNjA43mNwgLHtMHFaDv+dc9jA/jD/hwOM/4jTYtgg2QC0hb2H4QBjAxFaJM6kFTNIHEs3kDhzzOBAwrF0Y4Ja+NsPb/8hUWNtwD8j+eGDDzXWsgS1MDBwGCDYCYSVgwD7A+LUjYJRMApGwcgFAEUoP5l7xiNEAAAAAElFTkSuQmCC","orcid":"","institution":"Drexel University","correspondingAuthor":true,"prefix":"","firstName":"Christopher","middleName":"C.","lastName":"Yang","suffix":""}],"badges":[],"createdAt":"2024-10-17 04:53:10","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-5279660/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-5279660/v1","draftVersion":[],"editorialEvents":[{"content":"https://doi.org/10.1007/s41666-025-00194-9","type":"published","date":"2025-03-11T15:57:11+00:00"}],"editorialNote":"","failedWorkflow":false,"files":[{"id":68697497,"identity":"893e2601-6ebd-4caf-a0a3-941e0a5df6ff","added_by":"auto","created_at":"2024-11-11 06:58:03","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":45083,"visible":true,"origin":"","legend":"\u003cp\u003eFramework for simulating patient-physician communication using GPT-4\u003c/p\u003e","description":"","filename":"fig1.png","url":"https://assets-eu.researchsquare.com/files/rs-5279660/v1/f591b4a1bc3c337daccb1e11.png"},{"id":68697496,"identity":"093e68c3-48dc-46ed-8f8b-6b3966f6557f","added_by":"auto","created_at":"2024-11-11 06:58:03","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":24837,"visible":true,"origin":"","legend":"\u003cp\u003eDistribution of AAVE Feature Annotations by Diagnostic Case. Each row represents the aggregated responses for a specific medical case from the simulated patient. The values indicate the total count of AAVE linguistic features identified within the sets of six responses for each diagnostic case.\u003c/p\u003e","description":"","filename":"fig2.png","url":"https://assets-eu.researchsquare.com/files/rs-5279660/v1/2a1919a43e7aa4f9bdb28524.png"},{"id":78688912,"identity":"75ff5392-2a91-425b-a9ba-a50077c75f21","added_by":"auto","created_at":"2025-03-17 16:06:58","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":1117537,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-5279660/v1/62879d20-60fa-4403-a1b4-178c930be37b.pdf"},{"id":68698011,"identity":"5868eaee-7336-424d-b740-22216e98b184","added_by":"auto","created_at":"2024-11-11 07:06:03","extension":"docx","order_by":1,"title":"","display":"","copyAsset":false,"role":"supplement","size":27654,"visible":true,"origin":"","legend":"","description":"","filename":"appendix1.docx","url":"https://assets-eu.researchsquare.com/files/rs-5279660/v1/c0ef79fd453da65f62087b76.docx"},{"id":68697499,"identity":"6c384892-b082-4e6d-8cdf-1b368bd2fe8c","added_by":"auto","created_at":"2024-11-11 06:58:03","extension":"docx","order_by":2,"title":"","display":"","copyAsset":false,"role":"supplement","size":41940,"visible":true,"origin":"","legend":"","description":"","filename":"appendix2.docx","url":"https://assets-eu.researchsquare.com/files/rs-5279660/v1/fb38b7fc646a4dcdf75537e7.docx"},{"id":68698012,"identity":"2a71afbb-098d-4a5f-8aad-4e9a8af0c026","added_by":"auto","created_at":"2024-11-11 07:06:03","extension":"docx","order_by":3,"title":"","display":"","copyAsset":false,"role":"supplement","size":496166,"visible":true,"origin":"","legend":"","description":"","filename":"appendix3.docx","url":"https://assets-eu.researchsquare.com/files/rs-5279660/v1/41ecab4332309579c2e7b581.docx"}],"financialInterests":"Competing interest reported. Christopher C. Yang, the corresponding author of this manuscript, is also the editor of this journal. To ensure a transparent review process, he will not be involved in the editorial handling or decision-making of this submission. An alternative editor will oversee the peer review process. The authors have no other competing interests to declare.","formattedTitle":"Enhancing Patient-Physician Communication: Simulating African American Vernacular English in Medical Diagnostics with Large Language Models","fulltext":[{"header":"1\tIntroduction","content":"\u003ch2\u003e1.1\tBackground\u003c/h2\u003e\u003cp\u003eIn the United States, health disparities disproportionately affect people of color and minority races, as evidenced by higher mortality rates, earlier onset of diseases, and more severe symptoms compared to the White population [25]. African Americans, for instance, face a 30% higher risk of death compared to Whites, a disparity that persists even after adjusting for age factors [7]. Conditions like hypertension, HIV, heart disease, stroke, and cancer are notably more prevalent in these minority groups. Asthma, the most common chronic childhood disease in the U.S., is 1.6 times more prevalent in Black children and 2.4 times more in Puerto Rican children compared to white children [4]. Such inequalities in health are not unique to the U.S. and can be attributed to a multitude of factors, including lifestyle differences, unequal access to medical infrastructure, structural discrimination, and stress and trauma from these conditions [2]. These factors often interact, exacerbating health issues in these communities.\u003c/p\u003e \u003cp\u003eOne of the key contributors to these disparities, and the focal point of our study, is the communication barrier between healthcare providers and patients. Patients who feel heard and respected are more likely to express their concerns clearly, aiding in effective diagnosis and treatment adherence. However, in many contexts, and particularly among culturally and ethnically diverse groups, communication is often hindered by language barriers, sociocultural differences, and ingrained stereotypes. These obstacles can lead to less effective healthcare interventions and poorer patient outcomes [25]. The Institute of Medicine [18] reported that disparities in treatment exist across various minority groups. For instance, Mexican Americans with myocardial infarction are 40% less likely to receive thrombolytic therapy than Whites. African Americans and Latino Americans are less likely to receive appropriate emergency care and pain management, respectively. Ortega et al. [19] found that Latino and African American children with asthma receive fewer standard management medications compared to White children.\u003c/p\u003e \u003cp\u003eIn an increasingly global healthcare landscape, the diverse mix of cultural and linguistic backgrounds among patients presents a significant challenge in communication, potentially widening health disparities. Effective communication in healthcare goes beyond words; it encompasses an understanding of the different contexts in which patients experience pain and the ways in which they express it. This is where technology like Large Language Models (LLMs), with their capacity to understand and generate natural language, presents a novel and promising avenue to mitigate these barriers, potentially transforming healthcare interactions and outcomes.\u003c/p\u003e \u003cp\u003eHowever, there is a caveat: if these LLMs predominantly generate responses in standard, formal English, they may not fully capture the wide range of linguistic nuances found in real-world healthcare environments. The predominance of standardized language overlooks the rich variety of linguistic styles present among patients. To begin addressing this gap, our study explores the potential of LLMs, particularly GPT-4, to replicate diverse linguistic styles. Specifically, we investigate whether GPT-4 can emulate the linguistic characteristics of African American Vernacular English (AAVE). By evaluating their effectiveness in simulating patients with distinct linguistic traits, this research aims to bridge linguistic gaps in healthcare communication. We aspire to contribute to a more inclusive and culturally sensitive healthcare environment. This work marks the beginning of a necessary journey in healthcare technology, one with the potential to significantly improve communication and patient outcomes.\u003c/p\u003e \u003cdiv id=\"Sec2\" class=\"Section2\"\u003e \u003ch2\u003e1.2 Prior Work\u003c/h2\u003e \u003cp\u003eCommunication barriers in healthcare are not only prevalent when providers and patients speak different primary languages [28][1]; they also arise within the same language. Even among speakers who are native to the same language, subtle variations in intonation, speech patterns, and interaction styles [10] can lead to significant misinterpretations. For instance, Park\u0026rsquo;s ethnographic study [20], conducted in the context of free clinics serving immigrant workers, highlights communication challenges between Korean Chinese (\u0026ldquo;Chaoxianzu\u0026rdquo; in Chinese, \u0026ldquo;Joseonjok\u0026rdquo; in Korean) migrant workers and native Korean medical staff in South Korea, despite both groups speaking Korean. Native Korean doctors often struggle to understand Korean Chinese patients. Notably, the Korean Chinese language, based on North Korea's Munhwaŏ standard, significantly differs from South Korean in vocabulary, phonetics, phonology, stress, and intonation [32]. Additionally, cultural differences manifest in language, particularly in how symptoms are expressed. Korean Chinese migrants not only use atypical vocabulary, unrecognized by native Korean medical professionals, but also often describe their symptoms in relation to specific organs. This unconventional expression of symptoms can lead to unnecessary tests and a confusing diagnostic process for medical personnel, while Korean Chinese patients may feel their complaints are misunderstood.\u003c/p\u003e \u003cp\u003eThe risk of misinterpretation may be even more pronounced in the U.S., where a multitude of immigrant groups and diverse demographics coexist. Racial dynamics can further complicate communication in healthcare settings [6, 22, 27]. Ray [23] highlights the significance of understanding historical contexts in interracial communication. In the U.S., racial minorities often aim to project competence and earn respect, while White individuals may prioritize being liked and seen as moral[3]. These differing conversational approaches can negatively impact interactions, leading to misunderstandings and adverse perceptions [5]. Linguistic variations specific to racial, demographic, or ethnic communities add another layer of complexity to communication in healthcare settings. African American Vernacular English (AAVE) is a prime example of these linguistic variations. Characterized by unique phonological, grammatical, and stylistic features [22], AAVE is not only a means of communication but also a powerful symbol of cultural identity for many African Americans. However, this diversity poses challenges in medical and technological contexts, as illustrated by a study [11] examining older Black adults' interactions with voice assistants like Google Home, particularly for health information-seeking purposes. Participants, aged 50\u0026ndash;89 from lower-income neighborhoods in Chicago and Detroit, struggled to communicate effectively with these devices. Programmed primarily in Standard English, the voice assistants failed to grasp the nuances of their dialect and accent. This language barrier necessitated a cumbersome process of cultural code-switching, rendering the interactions with the technology both cognitively demanding and time-consuming. Such scenarios underscore a critical oversight in technology design \u0026ndash; the failure to accommodate the rich linguistic diversity within communities. As a result, such technologies risk alienating historically marginalized groups who speak dialects like AAVE, leading to reduced long-term engagement and perpetuating digital divides in accessing health information and technology.\u003c/p\u003e \u003cp\u003eThis issue aligns with the broader call for culturally competent healthcare, as defined by Meldrum [16]. This approach emphasizes tailoring care to each patient's unique background, including their social, cultural, and linguistic needs. However, practical challenges in implementing this competency within healthcare systems persist, as noted in various studies [2, 9]. These include staff shortages and bureaucratic constraints, which can impede clinicians' ability to fully engage with patients' unique backgrounds [14]. In this context, the emerging domain of Artificial Intelligence (AI), especially Large Language Models (LLMs), presents a promising avenue for augmenting healthcare communication. These models, with their advanced capabilities in comprehending and generating nuanced human-like language, hold potential to revolutionize patient interactions. By grasping the subtleties of various languages and dialects, they could enable more culturally attuned and linguistically sensitive communication, thereby not only improving the delivery of healthcare but also potentially leading to more accurate diagnoses and a deeper exchange of health-related information [14]. This technological advancement aligns well with the goals of culturally competent healthcare, offering a bridge over existing gaps in clinician-patient interactions.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003e1.3 Goal of This Study\u003c/h2\u003e \u003cp\u003eThe primary aim of this study is to evaluate the capability of Large Language Models (LLMs), with a specific focus on GPT-4, to accurately simulate patients who communicate using African American Vernacular English (AAVE). AAVE, a significant variant of American English that often diverges from Standard English, offers an interesting linguistic case study. Given its distinct linguistic features and status as a well-researched dialect, AAVE provides an ideal benchmark for assessing the linguistic proficiency of generative models like GPT-4.\u003c/p\u003e \u003cp\u003eOur research question is: Can LLMs, particularly GPT-4, effectively simulate the speech of patients using AAVE? To investigate this, we have designed experiments where GPT-4 simulates interactions with patients speaking AAVE, responding to typical diagnostic questions similar to those posed by healthcare professionals. The GPT-4's responses are generated based on the medical cases embedded within our prompts. To identify effective prompt-based learning methods for the simulation of AAVE-speaking patients, our research involves experimenting with various prompt structures, ranging from straightforward medical cases to more intricate scenarios incorporating demographic and linguistic nuances. The medical cases in our prompts are derived from the United States Medical Licensing Examination (USMLE) Computer-Based Case Simulations (CCS), offering a diverse and realistic range of patient situations, designed specifically to test the clinical application skills of medical students.\u003c/p\u003e \u003cp\u003eThe feasibility of using GPT-4 to simulate AAVE-speaking patients opens significant implications for healthcare communication. For example, introducing clinicians to AI-generated examples of linguistic diversity, like those examined in our study, could expand their understanding of different cultural contexts. This, in turn, is likely to foster greater empathy in patient care. In the latter part of this paper, we explore broader applications and the potential directions for future research in more detail. Thus, the current work represents an initial step towards bridging linguistic gaps and addressing communication challenges in healthcare.\u003c/p\u003e \u003c/div\u003e"},{"header":"2 Methods","content":"\u003cp\u003eIn this study, we propose a framework for simulating patient-physician communication. As shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e, the framework simulates an arbitrary patient using the GPT-4 model from OpenAI through a carefully crafted prompt. This prompt instructs GPT-4 to act as a patient within a specified medical case, incorporating demographic details and linguistic features. We incrementally increase the complexity by integrating these elements into each prompt. The simulated patient then engages with a set of predefined diagnostic questions generated by ChatGPT. Lastly, we analyze the simulated patient's responses to evaluate the effectiveness of the patient-physician interaction. The following subsections describe the preparation of materials and the detailed mechanism of this framework.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cdiv id=\"Sec5\" class=\"Section2\"\u003e \u003ch2\u003e2.1 Medical Case\u003c/h2\u003e \u003cp\u003eTo incorporate realistic patient scenarios into our prompts, our research leveraged the USMLE, a three-step exam required for medical licensure in the U.S. Specifically, we sourced diverse medical cases from the CCS component of the Step 3. This final step of the USMLE evaluates a candidate's ability to apply clinical knowledge and manage patient care in ambulatory settings. The CCS scenarios are designed to mimic real-world patient encounters, challenging candidates to demonstrate clinical proficiency and effective time management under time constraints.\u003c/p\u003e \u003cp\u003eThe USMLE website [31] provides six CCS sample cases as practice material. We utilized all six cases due to their availability and open access. Each case depicts a patient with specific health conditions and contexts. Examples of these cases include a 65-year-old man with symptoms of acute chest pain and difficulty in breathing, a 32-year-old woman experiencing knee pain and swelling, and so forth. The contexts include vital signs (e.g., pulse, blood pressure, and body mass index), reason for visit, history of present illness, past medical history, family history, and societal variables. For the purposes of our research, we excluded the patient's ethnicity information, originally listed under 'Identifies as,' from the medical cases. The complete text of the six medical cases can be found in Multimedia Appendix 1: [Medical cases].\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec6\" class=\"Section2\"\u003e \u003ch2\u003e2.2 African American as a Demographic Focus\u003c/h2\u003e \u003cp\u003eIn medical cases where patients have identical health conditions, the quality of patient-physician communication can vary significantly based on demographic factors such as age, gender, and race/ethnicity [5]. Our research specifically targets the African American demographic, aligning with our exploration of LLMs\u0026rsquo; ability to simulate AAVE. AAVE, a distinct linguistic variety in the U.S., has its roots in the history of African American experiences, notably during the period of slavery [26]. Given its prevalent use and historical significance within the African American community, AAVE naturally becomes our focal demographic group. Furthermore, the existing racial disparities in U.S. healthcare between Black and White populations [13] underscore the relevance and urgency of this approach in advancing a more inclusive healthcare system. To maintain a clear focus, we deliberately exclude other demographic variables such as gender, age, geographical location, or socio-economic status. This strategy allows us to closely examine GPT-4\u0026rsquo;s ability to produce linguistically relevant responses without the added complexity of multiple demographic dimensions. While our study strategically omits these factors, it is important to acknowledge their significance; they represent promising directions for future research.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec7\" class=\"Section2\"\u003e \u003ch2\u003e2.3 Linguistic Feature\u003c/h2\u003e \u003cp\u003eAAVE is a dialect rich in history and unique characteristics, predominantly used by African Americans. While it is not monolithic, and indeed, just like any dialect, has regional differences in grammar, vocabulary, and pronunciation, our research necessitates a focus on certain consistent features. For the purposes of evaluating how effectively LLMs can incorporate distinct linguistic characteristics, it was essential to extract typical, common factors from AAVE to set up a baseline for evaluation. This approach allows us to assess the capability of LLMs to emulate these features systematically, even as we recognize the inherent variability among speakers and regions.\u003c/p\u003e \u003cp\u003eIts origins can be traced back to Southern American English, a connection highlighted in Mufwene's research [17]. This dialect evolved from the English variants spoken by African slaves and their descendants, predominantly located in the Southern United States. Mufwene\u0026rsquo;s analysis of AAVE in relation to other dialects reveals notable similarities with Southern White dialects, particularly in syntax elements like negative concord \u0026ndash; the use of multiple negatives to express a single negation - and double modals, which involves the use of two modal verbs together.\u003c/p\u003e \u003cp\u003eRickford's foundational research [24] offers a comprehensive compilation of AAVE features, establishing it as a crucial reference point for subsequent studies. He detailed unique traits in AAVE's phonology and grammar. Grammatically, AAVE is distinguished from Standard English by its treatment of verb tense, aspect, and mood, as well as its use of pronouns and negation. Notably, the omission of the copula/auxiliary 'is' and 'are' in present tense leads to constructions like 'He tall' instead of 'He is tall.' AAVE also employs the invariant 'be' to indicate habitual aspects, as in 'He be walkin',' and uses unstressed 'been' or 'bin' for what in Standard English would be 'has/have been.' A stressed 'bin\u0026rsquo; denotes actions or states that commenced long ago and may still continue. The use of \u0026lsquo;done\u0026rsquo; in AAVE emphasizes completed actions and can co-occur with \u0026lsquo;been,\u0026rsquo; as in \u0026lsquo;He done did it\u0026rsquo; or \u0026lsquo;They done been sitting there an hour.\u0026rsquo; In terms of nouns and pronouns, AAVE often omits the possessive -s, as in \u0026lsquo;John house,\u0026rsquo; and can use associative plurals marked by \u0026lsquo;and (th)em\u0026rsquo; or \u0026lsquo;nem.\u0026rsquo; Negation in AAVE, characterized by the use of \u0026lsquo;ain\u0026rsquo;t\u0026rsquo; as a general preverbal negator and multiple negation or negative concord, is a well-studied aspect as well. Pullum [21] complements the understanding of AAVE by providing more nuanced rules in negative concord and copula omission. Pullum notes the repositioning of negative auxiliary verbs at sentence beginnings, especially when the subject is indefinite, as in 'Ain't nobody gonna find out.' He also enumerated contexts where copula omission does not occur, such as when the copula bears accent, is infinitival, expresses habitual aspect, is in the past tense, is first-person singular, begins a clause, or occurs in a confirmatory tag at the end of a sentence.\u003c/p\u003e \u003cp\u003eBuilding upon these foundational studies, Wolfram[33] and Thomas[29] add social, cultural, and economic dimensions to our understanding of AAVE. Thomas focuses on the evolution and variation of AAVE, distinguishing it from broader African American English (AAE) used by different social classes. He underscores the unique migration history of AAVE, originating in the South and transitioning to urban centers during the Great Migration. This shift to urban life significantly influenced the dialect, leading to some dialect leveling as African Americans from various regions mixed in new urban communities. Wolfram highlights AAVE's strong association with urban black youth culture, noting the age-graded usage of 'habitual be,' predominantly found among younger speakers. This suggests a continuous evolution of AAVE within urban settings. Wolfram identifies a 'supra-regional core' of AAVE, acknowledging some regional variation but emphasizing shared features across different urban areas. These key grammatical features mostly align with Rickford's work and include copula absence, invariant BE, completive 'be done,' remote 'been,' and unique traits in negation and nominals. Wolfram also categorizes the features of urban AAVE into stable, intensifying, and receding traits, highlighting the dynamic nature of the dialect. This emphasizes the continuity with historical rural AAVE roots and the ongoing changes reflecting urban influences.\u003c/p\u003e \u003cp\u003eIn this research, while recognizing the dynamic nature and diversity within AAVE, as explored by these works, we primarily draw upon Rickford\u0026rsquo;s presentation of AAVE. His works offers clarity, comprehensive coverage, and foundational status in AAVE studies. However, we have tailored his framework to align with our research objectives. For instance, we excluded most phonological aspects due to the limitations of LLMs, which produce written responses and cannot capture phonetic aspects. Additionally, to prevent the LLMs from merely replicating specific vocabulary, we have minimized the inclusion of direct vocabulary presented as lexical features in Rickford's work. Nonetheless, we selectively incorporate certain phonological and lexical features that are central to AAVE and widely discussed in the literature. In our simulations of phonological features, GPT-4 employs orthographic representations, such as \u0026ldquo;havin\u0026rsquo;\u0026rdquo;,\u0026rdquo;\u0026rsquo;round\u0026rdquo;, \u0026ldquo;\u0026rsquo;cept\u0026rdquo;. Our methodology also omits features less frequently mentioned in recent studies, thereby focusing on the most prominent and impactful aspects of AAVE as per current academic consensus. This approach acknowledges the limitations of our study, particularly in the context of phonological representation, while striving for a comprehensive and relevant analysis of AAVE within the capabilities of current LLMs.\u003c/p\u003e \u003cp\u003eUltimately, our research focuses on 37 carefully selected features. Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e presents the linguistic features identified for our research, with a descriptive profile for each feature on the right and their corresponding categories on the left. This categorization was crucial for effectively identifying and annotating AAVE features in the patients' responses. Utilizing all 37 features as separate labels would have resulted in an impractically extensive list for annotation purposes. Considering balance, we decided to utilize the bolded titles in Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e as labels during our annotation process, aiding in the identification of AAVE features in the patients' responses. While we have endeavored to compile a thorough list of these linguistic features, the list isn't exhaustive. Therefore, we introduced an \"out of list\" label (refer to the \u0026lsquo;Quantitative Analysis and Annotation Strategy\u0026rsquo; section). Full explanations and examples of each feature are provided in Multimedia Appendix 2: [Linguistic features of AAVE].\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eAAVE Linguistic Features\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"4\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"31\" rowspan=\"32\"\u003e \u003cp\u003eGrammatical features\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\" morerows=\"8\" rowspan=\"9\"\u003e \u003cp\u003ePre-verbal markers\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c4\" namest=\"c3\"\u003e \u003cp\u003eOmission of \"is\" and \"are\"\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eInvariant \u0026ldquo;be\u0026rdquo;\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003ehabitual actions\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003econtractions of \"will/would be\"\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003e\u0026ldquo;been\u0026rdquo;/ \u0026ldquo;bin\u0026rdquo;\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eunstressed \"been\" for present perfect\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003estressed \"been\" for the action that happened a long time ago\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c4\" namest=\"c3\"\u003e \u003cp\u003eUse of \u0026ldquo;done\u0026rdquo; for a distant past tense\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c4\" namest=\"c3\"\u003e \u003cp\u003eUse of \u0026ldquo;be done\u0026rdquo; for a future perfect tense\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c4\" namest=\"c3\"\u003e \u003cp\u003eUse of \u0026ldquo;had\u0026rdquo; for a past tense\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c4\" namest=\"c3\"\u003e \u003cp\u003eUse of double modals\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\" morerows=\"5\" rowspan=\"6\"\u003e \u003cp\u003e\u003cb\u003eVerbal tense-number marking\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c4\" namest=\"c3\"\u003e \u003cp\u003eAbsence of third person singular present -s, doesn't, or has\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c4\" namest=\"c3\"\u003e \u003cp\u003eUse of \"is\" and \"was\" for plural and second person subjects\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c4\" namest=\"c3\"\u003e \u003cp\u003eUse of past tense for past participle\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c4\" namest=\"c3\"\u003e \u003cp\u003eUse of past participle for past tense\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c4\" namest=\"c3\"\u003e \u003cp\u003eUse of verb stem (root forms) for past tense\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c4\" namest=\"c3\"\u003e \u003cp\u003eReduplicated Tense Marking\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\" morerows=\"7\" rowspan=\"8\"\u003e \u003cp\u003e\u003cb\u003eNouns and pronouns\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c4\" namest=\"c3\"\u003e \u003cp\u003eUnmarked possessives\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c4\" namest=\"c3\"\u003e \u003cp\u003eUnmarked plural forms\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c4\" namest=\"c3\"\u003e \u003cp\u003eRegularization of irregular plural nouns\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c4\" namest=\"c3\"\u003e \u003cp\u003eUse of \"an 'em\", \"and 'em\", \"nem\" to mark associative plurals\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c4\" namest=\"c3\"\u003e \u003cp\u003eAppositive or pleonastic pronouns\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c4\" namest=\"c3\"\u003e \u003cp\u003eUse of \u0026ldquo;y'all\u0026rdquo; for the 2nd person plural\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c4\" namest=\"c3\"\u003e \u003cp\u003eUse of demonstrative \u0026ldquo;them\u0026rdquo;\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c4\" namest=\"c3\"\u003e \u003cp\u003eOmission of relative pronoun\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\" morerows=\"3\" rowspan=\"4\"\u003e \u003cp\u003e\u003cb\u003eNegation\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c4\" namest=\"c3\"\u003e \u003cp\u003eNegative concord\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c4\" namest=\"c3\"\u003e \u003cp\u003eNegative inversion\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c4\" namest=\"c3\"\u003e \u003cp\u003eUse of \"ain't\" as a general preverbal negator\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c4\" namest=\"c3\"\u003e \u003cp\u003eUse of \"ain't\" + \"but\", and \"don't\" + \"but\" to indicate \"only\"\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003e\u003cb\u003eQuestions\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c4\" namest=\"c3\"\u003e \u003cp\u003eFormation of direct questions without inversion\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c4\" namest=\"c3\"\u003e \u003cp\u003eInversion in embedded questions\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\" morerows=\"2\" rowspan=\"3\"\u003e \u003cp\u003e\u003cb\u003eExistential and locative construction\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c4\" namest=\"c3\"\u003e \u003cp\u003eUse of existential \"it\", \"they\" (or \"dey\")\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c4\" namest=\"c3\"\u003e \u003cp\u003eUse of existential \"they got\"\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c4\" namest=\"c3\"\u003e \u003cp\u003eUse of \"here go\" as a static locative or presentational form\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"2\" morerows=\"2\" nameend=\"c2\" namest=\"c1\" rowspan=\"3\"\u003e \u003cp\u003e\u003cb\u003eLexical features\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c4\" namest=\"c3\"\u003e \u003cp\u003eUse of \"steady\" for consistent, persistent or repeated action\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c4\" namest=\"c3\"\u003e \u003cp\u003eUse of \"come\" to imply the speaker's indignation\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c4\" namest=\"c3\"\u003e \u003cp\u003eUse of \"finna\" to indicate immediate future actions\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"2\" morerows=\"1\" nameend=\"c2\" namest=\"c1\" rowspan=\"2\"\u003e \u003cp\u003e\u003cb\u003ePhonological features\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c4\" namest=\"c3\"\u003e \u003cp\u003eReplacement of final \"ing\" with \"in' \"\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c4\" namest=\"c3\"\u003e \u003cp\u003eOmission of unstressed syllables at the beginning and middle\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec8\" class=\"Section2\"\u003e \u003ch2\u003e2.4 Prompt Design\u003c/h2\u003e \u003cp\u003eAs indicated in previous study [15], LLMs generate the response \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:Y=[{y}_{1},\\dots\\:,{y}_{N}]\\)\u003c/span\u003e\u003c/span\u003e based on the given contexts \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:X\\)\u003c/span\u003e\u003c/span\u003e, which is commonly referred to a prompt and a natural language description of a task of interest. Let \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:p\\)\u003c/span\u003e\u003c/span\u003e denote the GPT-4, it is conditioned on \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:X\\)\u003c/span\u003e\u003c/span\u003e and generates every token \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{y}_{t}\\)\u003c/span\u003e\u003c/span\u003e in an autoregressive manner \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:p\\left({y}_{t+1}\\right|X,{y}_{t})\\)\u003c/span\u003e\u003c/span\u003e. In our study, we crafted the prompt integrating medical cases, a demographic variable, and linguistic features, incrementally increasing the complexity, to ask GPT-4 to simulate AAVE-speaking patients. We designed four types of prompts for each medical case to assess GPT-4's ability of patient simulation.\u003c/p\u003e \u003cp\u003e \u003col\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003eBaseline Prompt (BaseP): Baseline prompt delineates only the medical case, without additional demographic or linguistic variables, serving as our comparison benchmark.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003c/ol\u003e \u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"No\" id=\"Taba\" border=\"1\"\u003e \u003ccolgroup cols=\"1\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eYour task is to role-play as a patient in the given medical case,\u003c/p\u003e \u003cp\u003ewhich is enclosed by \"\"\".\u003c/p\u003e \u003cp\u003e1. Respond to questions posed by a user who is acting as a doctor.\u003c/p\u003e \u003cp\u003e2. If the patient in the case cannot communicate, you should respond as their caregiver. This could be a family member or friend who has accompanied the patient.\u003c/p\u003e \u003cp\u003e3. Always stick to the details provided in the case.\u003c/p\u003e \u003cp\u003eMedical case:\u003c/p\u003e \u003cp\u003e\"\"\" {medical case} \"\"\"\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e2. Demographic Prompt (DemoP): This version integrates a demographic variable with the medical case. We utilized the phrase \"Ensure your responses incorporate the linguistic features common to the way many African Americans speak English\" as the demographic variable. It aims to assess how GPT-4 handles demographic information in the context of medical cases.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"No\" id=\"Tabb\" border=\"1\"\u003e \u003ccolgroup cols=\"1\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eYour task is to role-play as a patient in the given medical case,\u003c/p\u003e \u003cp\u003ewhich is enclosed by \"\"\".\u003c/p\u003e \u003cp\u003e1. Respond to questions posed by a user who is acting as a doctor.\u003c/p\u003e \u003cp\u003e2. If the patient in the case cannot communicate, you should respond as their caregiver. This could be a family member or friend who has accompanied the patient.\u003c/p\u003e \u003cp\u003e3. Always stick to the details provided in the case.\u003c/p\u003e \u003cp\u003e4. Ensure your responses incorporate the linguistic features common to the way many African Americans speak English.\u003c/p\u003e \u003cp\u003eMedical case:\u003c/p\u003e \u003cp\u003e\"\"\" {medical case} \"\"\"\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e3. Linguistic Prompt (LingP): This prompt combines the medical case with specific linguistic features of AAVE. The intention is to evaluate GPT-4's capability to understand and respond using the linguistic traits of AAVE.\u003c/p\u003e\u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"No\" id=\"Tabc\" border=\"1\"\u003e \u003ccolgroup cols=\"1\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eYour task is to role-play as a patient in the given medical case,\u003c/p\u003e \u003cp\u003ewhich is enclosed by \"\"\".\u003c/p\u003e \u003cp\u003e1. Respond to questions posed by a user who is acting as a doctor.\u003c/p\u003e \u003cp\u003e2. If the patient in the case cannot communicate, you should respond as their caregiver. This could be a family member or friend who has accompanied the patient.\u003c/p\u003e \u003cp\u003e3. Always stick to the details provided in the case.\u003c/p\u003e \u003cp\u003e4. Ensure your responses incorporate the linguistic features outlined between ***.\u003c/p\u003e \u003cp\u003eMedical case:\u003c/p\u003e \u003cp\u003e\"\"\" {medical case} \"\"\"\u003c/p\u003e \u003cp\u003eLinguistic features:\u003c/p\u003e \u003cp\u003e*** {features} ***\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e\u003cp\u003e4. Comprehensive Prompt (CompP): The most detailed version, this prompt intertwines the medical case with both demographic variable and pertinent linguistic traits. This comprehensive prompt aims to simulate a more complex and realistic patient interaction scenario.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"No\" id=\"Tabd\" border=\"1\"\u003e \u003ccolgroup cols=\"1\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eYour task is to role-play as a patient in the given medical case,\u003c/p\u003e \u003cp\u003ewhich is enclosed by \"\"\".\u003c/p\u003e \u003cp\u003e1. Respond to questions posed by a user who is acting as a doctor.\u003c/p\u003e \u003cp\u003e2. If the patient in the case cannot communicate, you should respond as their caregiver. This could be a family member or friend who has accompanied the patient.\u003c/p\u003e \u003cp\u003e3. Always stick to the details provided in the case.\u003c/p\u003e \u003cp\u003e4. Ensure your responses incorporate the linguistic features common to the way many African Americans speak English, as described between ***.\u003c/p\u003e \u003cp\u003eMedical case:\u003c/p\u003e \u003cp\u003e\"\"\" {medical case} \"\"\"\u003c/p\u003e \u003cp\u003eLinguistic features:\u003c/p\u003e \u003cp\u003e*** {features} ***\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec9\" class=\"Section2\"\u003e \u003ch2\u003e2.5 Question \u0026amp; Chat Simulation\u003c/h2\u003e \u003cp\u003eThe simulation of patient-physician interactions is conducted by sequentially posing a set of prepared questions to GPT-4. These diagnostic questions, crafted by ChatGPT 4 in response to the query, 'common questions physicians ask patients', are designed to maintain neutrality concerning the patient\u0026rsquo;s gender, age, and symptoms. The selection of these questions is strategically based on their prevalence and universal relevance across all six medical cases, ensuring that they are broadly applicable and fundamental to the diagnostic process.\u003c/p\u003e \u003cp\u003eAs a result, we have compiled a list of six diagnostic questions routinely used by physicians:\u003c/p\u003e \u003cp\u003e \u003col\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003eWhat brings you in today?\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003eHave you had any procedures or major illnesses in the past 12 months?\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003eAre you currently taking any medications, including over-the-counter and herbal supplements?\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003eWhat allergies do you have?\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003eHave you traveled anywhere recently?\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003eHave you been exposed to anyone who's been sick recently?\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003c/ol\u003e \u003c/p\u003e \u003cp\u003eAdditionally, to accommodate scenarios where direct responses from patients may not be possible, such as in cases involving young children or patients in coma, we adapted the phrasing of each question, creating variations of the questions with different subjects (\"you\", \"he\", \"she\"). This allows the simulation to accommodate a wide range of clinical scenarios, promoting an uninterrupted and natural flow of conversation throughout the diagnostic interaction.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec10\" class=\"Section2\"\u003e \u003ch2\u003e2.6 Quantitative Analysis and Annotation Strategy\u003c/h2\u003e \u003cp\u003eTo assess the effectiveness of GPT-4 in simulating AAVE-speaking patients, we conducted a quantitative analysis of the linguistic features present in the patient's responses. This analysis involved annotating these features and then applying statistical tests, such as Analysis of Variance (ANOVA) and t-tests, to understand their prevalence and significance. By quantifying these features, we aimed to understand not only the presence but also the extent of GPT-4\u0026rsquo;s ability to replicate specific linguistic elements associated with AAVE. Our annotation strategy, therefore, played a critical role in how we processed and interpreted the responses.\u003c/p\u003e \u003cp\u003eThe annotation process was guided by an expert in AAVE, an African American healthcare professional with over 10 years of nursing experience in the U.S. She compiled the annotation guidelines and ensured the cultural and linguistic authenticity of the process. The final label set for annotation includes the linguistic categories, those with bolded titles in Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e, along with an 'out of list' category. This category accounts for deviations from 'standard English' not covered by the linguistic features in the prompt, which may not be specific to AAVE or any racial/ethnic group. After completing the annotations, we will determine whether these 'out of list' features can be classified as AAVE. In defining the boundaries to annotate, the primary rule is to tag each word or phrase exhibiting linguistic features, detailed in Multimedia Appendix 2: [Linguistic features of AAVE], in the responses. This process often results in overlaps. More detailed guidelines for ambiguous cases can be accessed on Multimedia Appendix 3: [Annotation guidelines].\u003c/p\u003e \u003cp\u003eThe annotation team included two experienced NLP researchers, with extensive backgrounds in healthcare and biomedical domains, who have conducted multiple projects involving complex annotation tasks. We utilized Label Studio[30], an open-source data labeling tool, employing the Named Entity Recognition template suitable for marking relevant spans of text and categorizing them into pre-defined labels. This was specifically pertinent to identifying and labeling pre-defined linguistic features from the patients\u0026rsquo; responses.\u003c/p\u003e \u003cp\u003eWe calculated the Inter-annotator Agreement (IAA) score to assess the reliability of our annotations. It is important to note that the purpose of this annotation process is to examine the text generated by the model, rather than to create a benchmark dataset. Therefore, discrepancies between annotators were not resolved procedurally; instead, we used annotations on which two annotators had reached agreement for further statistical analysis.\u003c/p\u003e \u003cp\u003eWe referred to previous study [12] for the formula to compute the degree of agreement. The following formula for positive specific agreement (\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:P\\)\u003c/span\u003e\u003c/span\u003e) is to calculate the agreement between two annotators for text markup tasks.\u003cdiv id=\"Equa\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equa\" name=\"EquationSource\"\u003e\n$$\\:P=\\frac{2a}{2a+b+c}$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003e, where \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:a\\)\u003c/span\u003e\u003c/span\u003e is the number of identified features that both annotators agree, and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:b\\)\u003c/span\u003e\u003c/span\u003e as well \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:c\\)\u003c/span\u003e\u003c/span\u003e is the number of identified features that only one annotator agrees.\u003c/p\u003e \u003c/div\u003e"},{"header":"3 Results","content":"\u003cdiv id=\"Sec12\" class=\"Section2\"\u003e \u003ch2\u003e3.1 Annotation\u003c/h2\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eInter-Annotator Agreement for Each Linguistic Feature\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"4\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e\u0026nbsp;\u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eAgreement\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eSupport\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"5\" rowspan=\"6\"\u003e \u003cp\u003eGrammatical Features\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003ePre-verbal markers\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.8\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e155\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003eVerbal tense-number marking\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.8\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e15\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003eNouns and pronouns\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.6\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e14\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003eNegation\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.98\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e298\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003eQuestions\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e-\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003eExistential and locative construction\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e-\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c2\" namest=\"c1\"\u003e \u003cp\u003e\u003cb\u003eLexical Features\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e1\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c2\" namest=\"c1\"\u003e \u003cp\u003e\u003cb\u003ePhonological Features\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.97\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e186\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c2\" namest=\"c1\"\u003e \u003cp\u003e\u003cb\u003eOut of List\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.8\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e234\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c2\" namest=\"c1\"\u003e \u003cp\u003eAverage\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.9\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e903\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c2\" namest=\"c1\"\u003e \u003cp\u003eWeighted Average\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.9\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eTable\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e quantitatively demonstrates the consistency of agreement among annotators in annotating linguistic features. Generally, a high level of agreement is observed. However, specific features such as 'Nouns and Pronouns' and 'Lexical Features' show lower levels of agreement, with scores of 0.6 and 0.5, respectively. This low agreement may stem from the limited number of annotations in these categories. Interestingly, categories like 'Questions' and 'Existential and Locative Constructions' received no annotation, suggesting that the GPT-4 did not simulate patients with these features. In contrast, 'Negation' achieved the highest level of agreement at 0.98, along with the most substantial annotation support (298 annotations).\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec13\" class=\"Section2\"\u003e \u003ch2\u003e3.2 Comparison of Prompt Effectiveness\u003c/h2\u003e \u003cp\u003eWe assessed the effectiveness of various prompts by counting the number of AAVE linguistic features. BaseP, which doesn\u0026rsquo;t include any demographic or linguistic information, was excluded from this analysis because we did not observe any AAVE linguistic features from the responses answered by its simulated patients. Our focus was on comparing the effectiveness of DemoP, LingP, and CompP, aiming to determine which prompt most frequently elicited AAVE linguistic features in the simulated patients.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eFigure\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e presents a heatmap illustrating the distribution of AAVE features across different prompts. This analysis encompassed a total of 6 medical cases (rows), where each case includes 6 responses. We compiled all AAVE annotations agreed upon by both annotators and represented them in the heatmap. Generally, in terms of frequency, CompP elicited more AAVE features compared to the other two. Additionally, more features were identified in DemoP, which included the demographic variable, than in LingP, which comprised a list of AAVE linguistic features.\u003c/p\u003e \u003cp\u003eWe observed an unexpected concentration of AAVE linguistic features within the responses from the sixth medical case, to LingP. While LingP generally elicited fewer AAVE features compared to the CompP, these responses deviated from the pattern. This deviation may suggest the dynamic nature of language models, underscoring the importance of evaluating them across a diverse array of responses to fully grasp their linguistic capabilities and patterns. Nevertheless, CompP exhibited more consistency, reliably producing responses with a higher frequency of linguistic features compared to those elicited by DemoP and LingP.\u003c/p\u003e \u003cp\u003eWe conducted a series of statistical analyses to determine if the differences in the number of linguistic features identified in responses generated by each prompt were statistically significant. Initially, we applied an ANOVA to examine any notable disparities in the counts of linguistic features across the three types of prompts (DemoP, LingP, and CompP). If the ANOVA indicated significant differences (with a significance level set at α\u0026thinsp;=\u0026thinsp;.05), we performed t-tests for post hoc analysis. This would involve comparing each pair of prompts (e.g., DemoP vs. LingP, DemoP vs. CompP, and LingP vs. CompP). It is important to note that the simulated responses in a row (as shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e) originate from the same question in identical medical cases and were subjected to all three prompt types. This constitutes repeated measures on the same subjects, thereby necessitating the use of a one-way repeated measures ANOVA and paired samples t-tests. Our analysis began with a comprehensive evaluation of all feature types, followed by the same statistical tests\u0026mdash;ANOVA and t-tests\u0026mdash;within each individual feature type for a finer analysis.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab3\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eOne-Way Repeated Measures ANOVA for All Types\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"4\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eCount\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eSum\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eMean (SD)\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eDemoP\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e36\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e300\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e8.3 (4)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eLingP\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e36\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e249\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e6.9 (4.2)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eCompP\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e36\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e354\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e9.8 (4.4)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eTotal\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e108\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e903\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e8.4 (4.3)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003eSum of Squares\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e\u003cb\u003eF\u003c/b\u003e \u003cb\u003etest (\u003c/b\u003e\u003cb\u003edf\u003c/b\u003e\u003cb\u003e)\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u003cb\u003eP\u003c/b\u003e\u003cb\u003e-value\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eBetween Groups\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e153.2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e6.2 (2, 105)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e.003\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eWithin Groups\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e1851.8\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eTotal\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e2004.9\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eTable\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e shows the descriptive statistics for all feature types across the three prompts. The statistical results align with the heatmap observations (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e), CompP generated the most features, with an average of 9.83. This outcome was anticipated because CompP includes comprehensive cues about AAVE, including the demographic indication and specific AAVE linguistic features. Moreover, the ANOVA results reveal a statistically significant difference in the effectiveness of the three prompts, with \u003cem\u003eP\u003c/em\u003e\u0026thinsp;=\u0026thinsp;.003, well below the threshold of \u003cem\u003eP\u003c/em\u003e\u0026thinsp;=\u0026thinsp;.05.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab4\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 4\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003ePaired Two-Sample \u003cem\u003et\u003c/em\u003e-Test for All Types\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"4\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eDemoP, LingP\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eDemoP, CompP\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eLingP, CompP\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMean (SE)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e-1.4 (0.8)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e1.5 (0.8)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e2.9 (0.9)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cem\u003et\u003c/em\u003e test \u003cem\u003e(df)\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e-1.7 (35)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e1.9 (35)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e3.4 (35)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cem\u003eP\u003c/em\u003e-value (1-tail)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e.048\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e.03\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u0026lt;\u0026thinsp;.001\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eTable\u0026nbsp;\u003cspan refid=\"Tab4\" class=\"InternalRef\"\u003e4\u003c/span\u003e presents the results of t-tests conducted on pairs of prompts, highlighting significant differences between two groups. In each pair, the first group name in the header represents Group 1, and the second name represents Group 2, with the differences calculated as Group 2 minus Group 1. Notably, the most comprehensive prompt, CompP, significantly outperformed both DemoP and LingP. This suggests that the inclusion of both demographic indicators and linguistic features enhances the effective simulation of AAVE features in the GPT-4\u0026rsquo;s responses.\u003c/p\u003e \u003cp\u003eInterestingly, DemoP led to significantly more features in the GPT-4\u0026rsquo;s responses compared to LingP, implying that the demographic variable may be a more crucial factor. Therefore, we conducted fine-grained analyses for each specific linguistic feature type to further reveal these details.\u003c/p\u003e \u003cp\u003eTable\u0026nbsp;\u003cspan refid=\"Tab5\" class=\"InternalRef\"\u003e5\u003c/span\u003e illustrates the distribution of nine linguistic feature types across three types of prompts: DemoP, LingP, and CompP. It reveals that 'Questions' and 'Existential and locative constructions' had no annotations, and 'Lexical features' had only one. Consequently, these three types of linguistic features were excluded from our detailed feature analysis due to the limited number of annotations. Among the remaining six feature types, significant ANOVA results were observed for: Pre-verbal markers, Nouns and pronouns, Negation, and Phonological features.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab5\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 5\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eDistribution of Linguistic Feature Types Across Prompts\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"4\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eDemoP\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eLingP\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eCompP\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003ePre-verbal markers\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e35\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e52\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e68\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eVerbal tense-number marking\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e7\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e5\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eNouns and pronouns\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e10\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eNegation\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e101\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e84\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e113\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eQuestions\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eExistential and locative construction\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLexical features\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003ePhonological features\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e68\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e32\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e86\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eOut of list\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e87\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e75\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e72\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eTotal\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e300\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e249\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e354\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eTable\u0026nbsp;\u003cspan refid=\"Tab6\" class=\"InternalRef\"\u003e6\u003c/span\u003e presents the per-type statistical analysis (ANOVA and t-test) for the four features that showed significant ANOVA results. It was observed that CompP significantly generated a greater number of features than DemoP and LingP across all types. The comparative effectiveness of DemoP and LingP remained statistically inconclusive in most cases. However, for the 'Pre-verbal markers' type, LingP elicited more features than DemoP, with \u003cem\u003eP\u003c/em\u003e\u0026thinsp;=\u0026thinsp;.01. Conversely, in the 'Phonological features' type, DemoP was significantly more effective than LingP, as indicated by \u003cem\u003eP\u003c/em\u003e\u0026thinsp;=\u0026thinsp;.007. For the other types, we found no statistically significant differences between DemoP and LingP.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab6\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 6\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003ePer-Type Analysis\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"7\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eANOVA\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e\u0026nbsp;\u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003e\u003cem\u003et\u003c/em\u003e test\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e\u0026nbsp;\u003c/th\u003e \u003cth align=\"left\" colname=\"c7\"\u003e\u0026nbsp;\u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cem\u003eF\u003c/em\u003e test (\u003cem\u003edf\u003c/em\u003e)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e\u003cem\u003eP\u003c/em\u003e-value\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eDemoP, LingP\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eDemoP, CompP\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003eLingP, CompP\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"2\" rowspan=\"3\"\u003e \u003cp\u003e\u003cb\u003ePre-verbal markers\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\" morerows=\"2\" rowspan=\"3\"\u003e \u003cp\u003e7.3\u003c/p\u003e \u003cp\u003e(2, 105)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\" morerows=\"2\" rowspan=\"3\"\u003e \u003cp\u003e.001\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eMean (SE)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.5 (0.2)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.9 (0.2)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e0.4 (0.3)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u003cem\u003et\u003c/em\u003e test (\u003cem\u003edf\u003c/em\u003e)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e2.3 (35)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e3.8 (35)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e1.7 (35)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u003cem\u003eP\u003c/em\u003e-value\u003c/p\u003e \u003cp\u003e(1-tail)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e.01\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e\u0026lt;\u0026thinsp;.001\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e.05\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"2\" rowspan=\"3\"\u003e \u003cp\u003e\u003cb\u003eNouns and pronouns\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\" morerows=\"2\" rowspan=\"3\"\u003e \u003cp\u003e5\u003c/p\u003e \u003cp\u003e(2, 105)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\" morerows=\"2\" rowspan=\"3\"\u003e \u003cp\u003e.0097\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eMean (SE)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.1 (0.1)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.3 (0.1)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e0.2 (0.1)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u003cem\u003et\u003c/em\u003e test (\u003cem\u003edf\u003c/em\u003e)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.8 (35)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e3(35)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e2 (35)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u003cem\u003eP\u003c/em\u003e-value\u003c/p\u003e \u003cp\u003e(1-tail)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e.21\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e.002\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e.03\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"2\" rowspan=\"3\"\u003e \u003cp\u003e\u003cb\u003eNegation\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\" morerows=\"2\" rowspan=\"3\"\u003e \u003cp\u003e3.6\u003c/p\u003e \u003cp\u003e(2, 105)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\" morerows=\"2\" rowspan=\"3\"\u003e \u003cp\u003e.03\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eMean (SE)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e-0.5 (0.3)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.3 (0.3)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e0.8 (0.3)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u003cem\u003et\u003c/em\u003e test (\u003cem\u003edf\u003c/em\u003e)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e-1.6 (35)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e1 (35)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e2.7 (35)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u003cem\u003eP\u003c/em\u003e-value\u003c/p\u003e \u003cp\u003e(1-tail)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e.06\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e.15\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e.005\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"2\" rowspan=\"3\"\u003e \u003cp\u003e\u003cb\u003ePhonological features\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\" morerows=\"2\" rowspan=\"3\"\u003e \u003cp\u003e8.7\u003c/p\u003e \u003cp\u003e(2, 105)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\" morerows=\"2\" rowspan=\"3\"\u003e \u003cp\u003e\u0026lt;\u0026thinsp;.001\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eMean (SE)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e-1 (0.4)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.5 (0.3)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e1.5 (0.4)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u003cem\u003et\u003c/em\u003e test (\u003cem\u003edf\u003c/em\u003e)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e-2.6 (35)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e1.5 (35)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e4.1(35)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u003cem\u003eP\u003c/em\u003e-value\u003c/p\u003e \u003cp\u003e(1-tail)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e.007\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e.08\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e\u0026lt;\u0026thinsp;.001\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eNotably, the 'Phonological features' type exhibits the most pronounced difference between DemoP and LingP. This distinct pattern, coupled with the considerable presence of phonological features in the overall dataset, suggests that this feature type is a contributing factor that explains why DemoP elicited a significantly higher number of features compared to LingP in the evaluation of all feature types, as observed in Table\u0026nbsp;\u003cspan refid=\"Tab4\" class=\"InternalRef\"\u003e4\u003c/span\u003e.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec14\" class=\"Section2\"\u003e \u003ch2\u003e3.3 Out-of-List Features\u003c/h2\u003e \u003cp\u003eWe analyzed 'out-of-list' features that were annotated by both annotators to assess their alignment with AAVE. The following features were identified:\u003c/p\u003e \u003cp\u003e \u003cul\u003e \u003cli\u003e \u003cp\u003eOmission of the subject in sentences\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003eUse of \u0026ldquo;Doc\u0026rdquo; as a colloquial abbreviation for \u0026ldquo;doctor\u0026rdquo;\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003eUse of \u0026ldquo;Dang\u0026rdquo; as a euphemism for \u0026ldquo;damn\u0026rdquo;\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003eUse of \u0026ldquo;Naw\u0026rdquo; or \u0026ldquo;nah\u0026rdquo; in place of \u0026ldquo;no\u0026rdquo;\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003eUse of \u0026ldquo;Ma\u0026rdquo; as a colloquial form of \u0026ldquo;my\u0026rdquo;\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003eUse of \u0026ldquo;Real\u0026rdquo; for \u0026ldquo;really,\u0026rdquo; as exemplified in \u0026ldquo;My girl here been feelin\u0026rsquo; real bad.\u0026rdquo;\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003eUse of \u0026ldquo;Outta\u0026rdquo; as a contraction of \u0026ldquo;out of\u0026rdquo;\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003eUse of \u0026ldquo;Ya\u0026rdquo; as a colloquial form of \u0026ldquo;you\u0026rdquo;\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003eUse of \"Swole\" to describe being muscular\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003eUse of an idiomatic expression \"a good long while\" to indicate a substantial period of time\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003eUse of \u0026ldquo;lawd\u0026rdquo; as an expressive form of \u0026ldquo;Lord,\u0026rdquo; denoting frustration, exasperation, or admiration\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003eUse of \u0026ldquo;all\u0026rdquo; for emphasis, as in various expressions like \u0026ldquo;She been workin' a lot and all,\u0026rdquo; \u0026ldquo;Then she started feelin' all tired,\u0026rdquo; \u0026ldquo;I'm all confused 'n stuff,\u0026rdquo; \u0026ldquo;my feet been all swole up.\u0026rdquo;\u003c/p\u003e \u003c/li\u003e \u003c/ul\u003e \u003c/p\u003e \u003cp\u003eNote that these features are not exclusive to AAVE and can be found in other varieties of English, including Southern American English. When and how these features are used is significant in determining whether they're being used as part of AAVE or as part of informal, colloquial speech more broadly. However, the significant aspect of this study is the observation that these features prominently appear in GPT-4's responses only when prompted with demographic descriptors (DemoP) or specific AAVE linguistic features (LingP). This pattern suggests that, despite the inherent ambiguity of these features, the model demonstrates a capability to selectively engage with them in a contextually appropriate manner when provided with relevant cues. Thus, while these features alone may not uniquely define AAVE, their conditional appearance in the model\u0026rsquo;s output can be interpreted as evidence of the model\u0026rsquo;s ability to represent the nuanced use of language associated with African American speech. This capability demonstrates its potential in reflecting the linguistic diversity and nuances present in human language use.\u003c/p\u003e \u003c/div\u003e"},{"header":"4 Discussion","content":"\u003cp\u003eOur study demonstrates that the GPT-4 consistently exhibits AAVE features across various prompts, with the exception of the BaseP, which served as a benchmark. CompP, which combined the demographic variable and linguistic features, emerged as the most effective format, as shown by both counts and statistical analysis. However, the relative effectiveness of DemoP (demographic indicator) versus LingP (linguistic features) remains statistically ambiguous.\u003c/p\u003e \u003cp\u003eHowever, for the 'phonological features' category, DemoP, which includes only a demographic variable, was significantly more effective at simulating AAVE than LingP, which provided explicit phonological details, as indicated by a very low p-value. This implies that for phonological features, a general demographic cue was more effective than detailed linguistic specification, a finding that challenges intuition given the complexity of LingP. In essence, the mere inclusion of a demographic indicator seems sufficient to elicit an abundant production of phonological features. This suggests that there might be an inherent mechanism within the GPT-4 that strongly associates African American community with specific phonological characteristics.\u003c/p\u003e \u003cp\u003eMoreover, our findings shed light on the GPT-4's capacity to autonomously generate unique linguistic behaviors, which we have termed 'out-of-list features.' These features, absent in responses to the BaseP \u0026mdash; a prompt lacking demographic indicators and specific linguistic characteristics \u0026mdash; emerged when either demographic or linguistic cues were present. Notably, when presented with demographic information alone, GPT-4 independently incorporated certain linguistic characteristics, some surpassing our study's initial scope. This demonstrates GPT-4's comprehensive grasp of AAVE. Contrastingly, when prompts explicitly specified linguistic features, GPT-4 did not confine its responses to these inputs. Instead, it integrated additional linguistic elements, possibly associated with the African American demographic. This emergent behavior opens avenues for further research. Subsequent studies might explore GPT-4's intrinsic understanding of AAVE through ablation experiments, removing individual features to observe if the model continues to represent them.\u003c/p\u003e \u003cp\u003eHowever, our research identified limitations in GPT-4's simulation of certain AAVE features, such as 'questions,' 'existential and locative constructions,' and 'lexical features.' Multiple factors could contribute to these limitations, including potentially unclear prompts or the rarity of these features in real-world usage. GPT-4's effectiveness hinges on its training data; a lack of diverse AAVE examples can hinder its ability to accurately replicate these dialect features. Even when employing reinforcement learning techniques like RLHF, the biases inherent in the original dataset or the trainers\u0026rsquo; preferences for more standard language outputs may still restrict the model's capability to simulate specific dialects or styles. To improve its simulation of dialects like AAVE, GPT-4 would benefit from more diverse and representative training data, coupled with training methods that emphasize linguistic diversity. This highlights the importance of continued research and innovation to enhance LLMs\u0026rsquo; abilities to recognize and reproduce the vast array of human languages.\u003c/p\u003e \u003cp\u003eAdditionally, the primary goal of our experimental design was to assess whether GPT-4 could adopt AAVE features to align with the tone and context of an ongoing medical conversation, rather than simulating every defined feature without omission. If our objective had been to test specific features, we would have designed customized diagnostic questions to elicit these responses. For example, despite our main experiment showing no instances of 'questions' features\u0026mdash;due to the nature of doctor-patient interactions, where typically the doctor poses questions, limiting the model's opportunities to initiate questions\u0026mdash;an additional prompt asked during the study, 'Do you have any questions?' elicited the response: 'Well, doc, what you think this chest pain be about?' This demonstrates GPT-4\u0026rsquo;s ability to form direct questions without inversion, highlighting the influence of experimental design on observed outcomes.\u003c/p\u003e \u003cp\u003eOne promising direction opened by this study is the potential development of AI systems, equipped with LLMs, tailored to specific dialects, such as AAVE. These systems could enhance interpretation within clinical settings and medical training by providing simulations focused on linguistic adaptability. For example, building on the concept of virtual patient simulators, like the one developed using Siamese LSTM architecture by Furlan et al.[8], which improved medical students' diagnostic reasoning and learning outcomes through interactive feedback and targeted review suggestions, AI could replicate patients with specific linguistic features. This would enable medical students to practice and refine their communication strategies by interacting with virtual patients who exhibit these linguistic characteristics.\u003c/p\u003e \u003cp\u003eFurthermore, such simulations could also feature patients with varied demographic characteristics, including those with limited health literacy or non-native speakers of the dominant language, thus preparing clinicians for a broader range of communication challenges. The essence of this approach lies in fostering linguistic adaptability in healthcare communication, which aligns with patient-centered care principles. However, it's important to note that the integration of socio-cultural elements into these simulations is complex, was outside the scope of this study and requires further research.\u003c/p\u003e \u003cp\u003eIt is crucial to recognize that while language models like GPT-4 offer significant potential, they also carry a risk of perpetuating stereotypes. Developers and researchers must remain vigilant about these potential biases to actively work towards minimizing them. Implementing measures such as involving linguists and cultural studies experts in system design is essential to ensure that AI technologies support equitable and respectful interactions in healthcare settings.\u003c/p\u003e"},{"header":"5 Conclusions","content":"\u003cp\u003eThis study delves into the capabilities of generative AI, particularly GPT-4, in replicating the linguistic nuances of AAVE. Our findings reveal that LLMs like GPT-4 are adept at simulating the linguistic behaviors of specific demographic groups. This ability is pivotal in bridging dialectal and linguistic divides in healthcare, enhancing communication, and serving as an invaluable training resource. By interacting with AI systems that mirror a range of patient demographics, medical trainees can significantly enhance their communication skills, better preparing them for real-world patient interactions and enhancing patient care.\u003c/p\u003e \u003cp\u003eDespite the promising findings of this study, it is essential to acknowledge its limitations. The primary focus was on the simulation of AAVE, without considering other demographic variables such as gender, age, geographical location, or socio-economic status. This narrow focus might limit the generalizability of our findings across broader demographic characteristics. Moreover, our research only addressed the written aspects of AAVE due to the text-based nature of LLM outputs, which excludes phonological features that are crucial to authentic spoken communication. Furthermore, while our findings indicate potential applications in medical training simulations, we did not empirically test these applications within actual clinical settings, which is necessary to truly assess the efficacy of such training tools.\u003c/p\u003e \u003cp\u003eRecognizing these limitations, future research should expand the scope to include a more comprehensive array of demographic and sociolect factors and explore the practical application of LLM simulations in clinical training scenarios to fully realize the potential benefits outlined in this study. By developing customized LLMs that consider these aspects, healthcare providers can significantly reduce communication barriers, which can lead to improved patient outcomes. This approach will ensure the creation of AI models that are not only technologically advanced but also culturally sensitive and inclusive, contributing to a more equitable healthcare system.\u003c/p\u003e"},{"header":"Abbreviations","content":"\u003cdiv class=\"DefinitionList\"\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003e\u003cb\u003eAAE\u003c/b\u003e\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eAfrican American English\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003e\u003cb\u003eAAVE\u003c/b\u003e\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eAfrican American Vernacular English\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003e\u003cb\u003eAI\u003c/b\u003e\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eArtificial Intelligence\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003e\u003cb\u003eANOVA\u003c/b\u003e\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eAnalysis of Variance\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003e\u003cb\u003eBaseP\u003c/b\u003e\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eBaseline Prompt\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003e\u003cb\u003eCCS\u003c/b\u003e\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eComputer-Based Case Simulations\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003e\u003cb\u003eComP\u003c/b\u003e\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eComprehensive Prompt\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003e\u003cb\u003eDemoP\u003c/b\u003e\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eDemographic Prompt\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003e\u003cb\u003eIAA\u003c/b\u003e\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eInter-annotator Agreement\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003e\u003cb\u003eLLMs\u003c/b\u003e\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eLarge Language Models\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003e\u003cb\u003eLingP\u003c/b\u003e\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eLinguistic Prompt\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003e\u003cb\u003eUSMLE\u003c/b\u003e\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eUnited States Medical Licensing Examination\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003c/div\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eCompeting Interests\u003c/strong\u003e\u003cp\u003eChristopher C. Yang, the corresponding author of this manuscript, is also the editor of this journal. To ensure a transparent review process, he will not be involved in the editorial handling or decision-making of this submission. An alternative editor will oversee the peer review process. The authors have no other competing interests to declare.\u003c/p\u003e\u003c/p\u003e\u003ch2\u003eFunding\u003c/h2\u003e \u003cp\u003eThis work was funded by the National Science Foundation under the Grants IIS-1741306 and IIS-2235548, and by the Department of Defense under the Grant DoD W91XWH-05-1-023.\u003c/p\u003e\u003ch2\u003eAuthor Contribution\u003c/h2\u003e\u003cp\u003eY.L. and C.-H.C. wrote the manuscript. C.C.Y. supervised the entire process and confirmed the final version of the manuscript. All authors reviewed the manuscript.\u003c/p\u003e\u003ch2\u003eAcknowledgement\u003c/h2\u003e\u003cp\u003eThis work was supported in part by the National Science Foundation under the Grants IIS-1741306 and IIS-2235548, and by the Department of Defense under the Grant DoD W91XWH-05-1-023. This material is based upon work supported by (while serving at) the National Science Foundation. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eRosa M, Avila, Bramlett MD (2013) Language and immigrant status effects on disparities in Hispanic children\u0026rsquo;s health status and access to health care. Matern. Child Health J. 17, (2013), 415\u0026ndash;423\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eJean E, Beatson (2016) Addressing health disparities through cultural and linguistic competency trainings. ABNF J. 27, 4 (2016), 83\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHilary B, Bergsieker JN, Shelton, Richeson JA (2010) To be liked versus respected: Divergent goals in interracial interactions. J. Pers. Soc. Psychol. 99, 2 (2010), 248\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKecia Carroll (2013) Socioeconomic status, race/ethnicity, and asthma in youth. American Thoracic Society\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eJennifer E, DeVoe LS, Wallace, Fryer GE (2009) Measuring patients\u0026rsquo; perceptions of communication with healthcare providers: do differences in demographic and socioeconomic characteristics matter? Health Expect. Int. J. Public Particip. Health Care Health Policy 12, 1 (March 2009), 70\u0026ndash;80. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1111/j.1369-7625.2008.00516.x\u003c/span\u003e\u003cspan address=\"10.1111/j.1369-7625.2008.00516.x\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRegine A, Douthard IK, Martin T, Chapple-McGruder A, Langer, Soju, Chang (2021) US maternal mortality within a global context: historical trends, current state, and future directions. J. Womens Health 30, 2 (2021), 168\u0026ndash;177\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePeter Franks P, Muennig E, Lubetkin, Jia H (2006) The burden of disease associated with being African-American in the United States and the contribution of socio-economic status. Soc. Sci. Med. 62, 10 (May 2006), 2469\u0026ndash;2478. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.socscimed.2005.10.035\u003c/span\u003e\u003cspan address=\"10.1016/j.socscimed.2005.10.035\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRaffaello Furlan M, Gatti R, Men\u0026egrave; D, Shiffer C, Marchiori AG, Levra V, Saturnino E, Brunetta (2021) and Franca Dipaola. A Natural Language Processing-Based Virtual Patient Simulator and Intelligent Tutoring System for the Clinical Diagnostic Process: Simulator Development and Case Study. JMIR Med. Inform. 9, 4 (April 2021), e24073. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.2196/24073\u003c/span\u003e\u003cspan address=\"10.2196/24073\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDeWan Gibson and Mei Zhong (2005) Intercultural communication competence in the healthcare context. Int. J. Intercult. Relat. 29, 5 (2005), 621\u0026ndash;634\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eJohn J Gumperz. 1982. Discourse strategies. Cambridge University Press\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChristina N, Harrington R, Garg A, Woodward, Williams D (2022) It\u0026rsquo;s kind of like code-switching: Black older adults\u0026rsquo; experiences with a voice assistant for health information seeking. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, 2022. 1\u0026ndash;15\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHripcsak G, Rothschild AS (2005) Agreement, the f-measure, and reliability in information retrieval. J. Am. Med. Inform. Assoc. 12, 3 (2005), 296\u0026ndash;298\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRachel L, Johnson D, Roter NR, Powe, Cooper LA (2004) Patient race/ethnicity and quality of patient\u0026ndash;physician communication during medical visits. Am. J. Public Health 94, 12 (2004), 2084\u0026ndash;2090\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLee P, Goldberg C, Kohane I (2023) The AI revolution in medicine: GPT-4 and beyond. Pearson\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePengfei Liu W, Yuan J, Fu Z, Jiang H, Hayashi, Neubig G (2023) Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Comput. Surv. 55, 9 (2023), 1\u0026ndash;35\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHelen Meldrum (2009) Characteristics of Compassion: Portraits of Exemplary Physicians: Portraits of Exemplary Physicians. Jones \u0026amp; Bartlett\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eIn S, Mufwene J, Rickford J, Baugh, Bailey G (1998) Coexistent systems in African-American English. Struct. Afr.-Amreican Engl. (1998), 110\u0026ndash;153\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAlan Nelson (2002) Unequal treatment: confronting racial and ethnic disparities in health care. J. Natl. Med. Assoc. 94, 8 (August 2002), 666\u0026ndash;668\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAlexander N, Ortega, Peter J, Gergen AD, Paltiel H, Bauchner KD, Belanger, Leaderer BP (2002) Impact of site of care, race, and Hispanic ethnicity on medication use for childhood asthma. Pediatrics 109, 1 (2002), e1\u0026ndash;e1\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eYoung su Park (2012) Cultural conflicts over the illness experiences of Korean Chinese migrant workers. Seoul National University. Retrieved from \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://hdl.handle.net/10371/134205\u003c/span\u003e\u003cspan address=\"https://hdl.handle.net/10371/134205\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGeoffrey K, Pullum (1999) African American Vernacular English is not standard English with mistakes. Work. Lang. Prescr. Perspect. (1999), 59\u0026ndash;66\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eTamara Rakić and Anne Maass (2018) Communicating between groups, communicating about groups. Language, Communication, and Intergroup Relations. Routledge, pp 66\u0026ndash;97\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGeorge B, Ray (2009) Language and interracial communication in the United States: Speaking in Black and White. Peter Lang\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRickford JR (1999) African American vernacular English: Features, evolution, educational implications. No Title (1999)\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMerrill Singer H, Baer DL, Pavlotski A (2019) Introducing medical anthropology: a discipline in action. Rowman \u0026amp; Littlefield\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGeneva Smitherman, Samy Alim H (2021) Word from the mother: Language and African Americans. Routledge\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eJamila K, Taylor (2020) Structural racism and maternal health among Black women. J. Law. Med. Ethics 48, 3 (2020), 506\u0026ndash;517\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSachiko Terui (2017) Conceptualizing the pathways and processes between language barriers and health disparities: review, synthesis, and extension. J. Immigr. Minor. Health 19, (2017), 215\u0026ndash;224\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eErik R, Thomas (2007) Phonological and phonetic characteristics of African American vernacular English. Lang. Linguist. Compass 1, 5 (2007), 450\u0026ndash;475\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMaxim Tkachenko M, Malyuk A, Holmanyuk (2020) and Nikolai Liubimov. Label Studio: Data labeling software. Retrieved from \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://github.com/heartexlabs/label-studio\u003c/span\u003e\u003cspan address=\"https://github.com/heartexlabs/label-studio\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eUSMLE. (n.d.). Computer-based Case Simulations. Retrieved January 11 (2024) from \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.usmle.org/step-3-test-question-formats/computer-based-case-simulations\u003c/span\u003e\u003cspan address=\"https://www.usmle.org/step-3-test-question-formats/computer-based-case-simulations\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWen Yingxi and Cho Il Young (2017) A Comparative Study between the changes of China\u0026rsquo;s Joseonmal Ttuieosseugi revised in 2016 and current Spacing Word Rules of South and North Korean. J. Soc. Korean Lang. Lit. 81 (2017), 187\u0026ndash;222\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWalt Wolfram (2004) The grammar of urban African American vernacular English. Handb. Var. Engl. 2, (2004), 111\u0026ndash;32\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":true,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"journal-of-healthcare-informatics-research","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"jhir","sideBox":"Learn more about [Journal of Healthcare Informatics Research](http://link.springer.com/journal/41666)","snPcode":"41666","submissionUrl":"https://submission.nature.com/new-submission/41666/3","title":"Journal of Healthcare Informatics Research","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"em","reportingPortfolio":"Springer Hybrid","inReviewEnabled":true,"inReviewRevisionsEnabled":false},"keywords":"large language model, health disparities, patient simulation, patient-physician communication gap, communication training","lastPublishedDoi":"10.21203/rs.3.rs-5279660/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-5279660/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eEffective communication plays a pivotal role in mitigating health disparities. However, linguistic differences, such as African American Vernacular English (AAVE), can lead to communication gaps between patients and physicians, consequently impacting healthcare effectiveness and patient outcomes. This research delves into the potential of GPT-4, a large language model, to replicate AAVE in medical dialogues, with the aim of exploring its potential to address these communication barriers. We devised four prompt types: medical case-only prompts (BaseP), prompts containing demographic details (DemoP), prompts with AAVE-specific linguistic features (LingP), and prompts integrating DemoP and LingP (ComP). Through statistical analyses, including ANOVA and t-tests, applied to case simulations from the United States Medical Licensing Examination (USMLE), we evaluated GPT-4's capacity to mirror AAVE linguistic attributes. The findings indicate that GPT-4 effectively emulated AAVE traits, with ComP producing the most AAVE linguistic features. Notably, DemoP elicited more phonological features than LingP, implying an intrinsic correlation between the African American demographic and specific linguistic markers in GPT-4. However, the model encountered challenges with certain AAVE constructs, such as question inversion and unique vocabulary. This study underscores GPT-4's potential to enhance culturally sensitive healthcare communication while emphasizing the necessity for further research to refine its precision in simulating diverse linguistic styles for practical medical training applications.\u003c/p\u003e","manuscriptTitle":"Enhancing Patient-Physician Communication: Simulating African American Vernacular English in Medical Diagnostics with Large Language Models","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2024-11-11 06:57:59","doi":"10.21203/rs.3.rs-5279660/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Revision requested","date":"2024-12-03T21:09:47+00:00","index":"","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2024-11-30T02:32:40+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2024-11-16T15:34:50+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"180804973192445356028670163624995505435","date":"2024-11-08T04:57:44+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"82604081942742968234117222110605892685","date":"2024-11-08T04:56:17+00:00","index":"hide","fulltext":""},{"type":"reviewersInvited","content":"","date":"2024-11-08T04:40:12+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2024-10-28T18:24:40+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2024-10-25T07:48:52+00:00","index":"","fulltext":""},{"type":"submitted","content":"Journal of Healthcare Informatics Research","date":"2024-10-17T04:41:54+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"journal-of-healthcare-informatics-research","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"jhir","sideBox":"Learn more about [Journal of Healthcare Informatics Research](http://link.springer.com/journal/41666)","snPcode":"41666","submissionUrl":"https://submission.nature.com/new-submission/41666/3","title":"Journal of Healthcare Informatics Research","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"em","reportingPortfolio":"Springer Hybrid","inReviewEnabled":true,"inReviewRevisionsEnabled":false}}],"origin":"","ownerIdentity":"5e8073a4-aa9d-4ff6-b3c0-31fbc6cd46dc","owner":[],"postedDate":"November 11th, 2024","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"published-in-journal","subjectAreas":[],"tags":[],"updatedAt":"2025-03-17T16:00:04+00:00","versionOfRecord":{"articleIdentity":"rs-5279660","link":"https://doi.org/10.1007/s41666-025-00194-9","journal":{"identity":"journal-of-healthcare-informatics-research","isVorOnly":false,"title":"Journal of Healthcare Informatics Research"},"publishedOn":"2025-03-11 15:57:11","publishedOnDateReadable":"March 11th, 2025"},"versionCreatedAt":"2024-11-11 06:57:59","video":"","vorDoi":"10.1007/s41666-025-00194-9","vorDoiUrl":"https://doi.org/10.1007/s41666-025-00194-9","workflowStages":[]},"version":"v1","identity":"rs-5279660","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-5279660","identity":"rs-5279660","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
Text is read by the "Ask this paper" AI Q&A widget below.
Extraction quality varies by source — PMC NXML preserves structure
cleanly, OA-HTML may include some navigation residue, and OA-PDF can
have broken hyphenation. The publisher copy
(via DOI)
is the canonical version.