Exploring Inflammatory Bowel Disease Discourse on Reddit Throughout the COVID-19 Pandemic Using OpenAI's GPT-3.5 Turbo Model: Classification Model Validation and Case Study.

doi:10.2196/53332

2025 · doi:10.2196/53332 · PMID:40607732 · PMC12271966

OA: gold CC-BY-4.0

Methods

We collected data from Reddit, a popular social media platform that allows users to create and join communities, or subreddits, based on their interests. Reddit has over 57 million daily users and over 13 billion posts as of 2023 [ 20 ]. For this study, data were extracted from the 3 largest subreddits dedicated to IBD: r/CrohnsDisease, r/UlcerativeColitis, and r/IBD. These subreddits serve as internet-based support groups where users can post text, images, videos, or links to other websites and comment on other users’ posts. Each subreddit has its own rules and moderators, who are volunteers overseeing the content and quality of the posts and comments. We chose to analyze data from March 1, 2020, to December 31, 2022, aligning with the official declaration of the COVID-19 pandemic and its subsequent transition to an endemic phase [ 21 ].
We obtained posts from the Pushshift database, an archive of Reddit submissions and comments for researchers [ 22 ]. To ensure data integrity, we computed the SHA-256 hash (a cryptographic hash function used to confirm data integrity) of each downloaded file and cross-verified it against the value provided by Pushshift. We used a Python script developed by an open-source contributor to aggregate all submissions from the subreddits of interest into a single newline-delimited JSON file for each month [ 23 ]. These files were subsequently merged into a single CSV file, resulting in an initial dataset of 67,860 posts.
We preprocessed the raw data by applying the following exclusion criteria: posts with a combined length ≤50 characters, posts tagged as polls, posts missing a body, posts removed by moderators, and posts duplicated across subreddits. The remaining posts were sorted in ascending order, and each was assigned a unique record ID. The final dataset comprised 53,333 posts. All data cleaning was completed in Alteryx (Alteryx, Inc) [ 24 ] ( Figure 1 ).
Figure 1. Excluded posts.
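The integrity check and exclusion criteria described above can be sketched in Python. This is a minimal illustration, not the authors' workflow (the study performed its cleaning in Alteryx). The field names `title`, `selftext`, and `created_utc` follow common Reddit submission dumps, while the `is_poll` flag, the `[removed]` and `[deleted]` body markers, and the choice to sort by creation time are assumptions made for the example.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def apply_exclusions(posts, min_length=50):
    """Filter posts with the exclusion criteria and assign record IDs.

    `posts` is a list of dicts. The keys used here are assumptions about
    the dump layout, not something the study specifies.
    """
    kept, seen = [], set()
    for post in posts:
        title = (post.get("title") or "").strip()
        body = (post.get("selftext") or "").strip()
        if not body:                            # missing a body
            continue
        if body in ("[removed]", "[deleted]"):  # assumed removal markers
            continue
        if post.get("is_poll"):                 # hypothetical poll flag
            continue
        if len(title) + len(body) <= min_length:  # combined length <= 50
            continue
        key = (title, body)
        if key in seen:                         # duplicate across subreddits
            continue
        seen.add(key)
        kept.append(dict(post))                 # copy so the input is not mutated
    # Assumption: "ascending order" is taken to mean by creation time.
    kept.sort(key=lambda p: p["created_utc"])
    for record_id, post in enumerate(kept, start=1):
        post["record_id"] = record_id
    return kept
```

A downloaded file would pass the integrity check when `sha256_of_file(path)` equals the published digest for that file. Keeping the first copy of a duplicate is a design choice of this sketch; the study does not say which copy was retained.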
We developed a prompt to evaluate each post’s sentiment on a ternary scale (positive, negative, or neutral) and categorize it into one of 6 areas: medication, treatment, symptoms, diagnosis, diet, or other. Additionally, the prompt identifies any demographic information or references to the COVID-19 pandemic. Since prompt engineering is a relatively new field, we refined the prompt through an iterative process, testing it on random samples from the dataset and adjusting it to validate the stability and accuracy of the sentiment label distributions. The final prompt, shown in Textbox 1 , consisted of an initial message that instructed the model about its purpose, followed by instructions for each post-title combination and a final system message that defined the response format. After designing the prompt, we submitted it with each post via a Python script to the GPT-3.5 model application programming interface endpoint in separate batches of 10,000 records to account for website outages and connection losses. We then saved the responses and remerged them based on the record ID. The outputs provided by the model were standardized using conditional statements. The recorded ages were grouped into 10-year intervals for demographic analysis.
Textbox 1. The final prompt.
You are a large language model that has been trained to analyze titles and/or bodies of submissions submitted to a Reddit community dedicated to inflammatory bowel disease. The user will submit a list of objectives, and you will respond using only the categories they provide.
“Title and/or Body of post was inserted here”
Determine the sentiment expressed by the user using only the words: Positive, Negative, or Neutral.
Classify the post using one of the following categories: Medication, Treatment, Symptoms, Diagnosis, Diet, or Other.
Extract the gender and age of the poster if they included it in the post. If no demographic information is found, respond with the word 'Null'.
Identify whether the post directly references the COVID-19 pandemic. Report your answer using only the words 'Yes', 'No', or 'Unsure'.
I will only respond in a comma-separated format, as follows: Sentiment_Goes_Here,Category_Goes_Here,Gender_Goes_Here,Age_Goes_Here,COVID-19_Goes_Here
To measure the overall accuracy of our model’s classifications, we used both the Fleiss κ and Gwet AC1 statistics to evaluate interrater reliability. Fleiss κ is a widely used statistic for assessing the extent of agreement among multiple raters while accounting for the possibility of chance agreement [ 25 ]. Lower Fleiss κ scores (ie, closer to 0) indicate greater disagreement, with scores approaching 1 suggesting higher interrater reliability [ 26 ]. We also calculated Gwet AC1 because it is suggested to be less affected by prevalence and marginal probability than Fleiss κ, making it a more accurate measure [ 27 ]. For Gwet AC1, scores above 0.75 are deemed acceptable, with higher scores indicating greater agreement.
We calculated the required sample size for this subset analysis using the Taro Yamane equation with a 0.05 margin of error, which resulted in the selection of 397 posts for evaluation [ 28 - 30 ]. As the sample size for κ coefficients is considered challenging to calculate, this sample size was further cross-referenced against Bujang and Baharum’s [ 31 ] prescribed criteria for Cohen κ sample size calculations, confirming an expected sample size of 389 posts. We aimed for an effect size of 0.75. The subsample included 117 (30%) posts for sentiment evaluation, 49 (12.5%) posts for classification, 71 (18.25%) posts for gender categorization, 35 (9%) posts for age range classification, and an additional 117 (30%) posts for COVID-19 pandemic references. We generated a randomized set of 397 Reddit posts from the final dataset using Alteryx to ensure impartiality.
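As a worked check of the sample size calculation, the Taro Yamane formula is n = N / (1 + N·e²), where N is the population size and e is the margin of error. The short sketch below applies it to the final dataset of 53,333 posts. The function name is illustrative, and the value e = 0.05 is the margin of error that reproduces the reported 397 posts.

```python
def yamane_sample_size(population, margin_of_error):
    """Taro Yamane formula: n = N / (1 + N * e^2)."""
    return population / (1 + population * margin_of_error ** 2)


# Final dataset size reported in the Methods, with a 0.05 margin of error.
n = yamane_sample_size(53_333, 0.05)
print(round(n))  # 397
```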
Two human raters from the study team and GPT-3.5 evaluated each post across multiple predefined categories. To ensure standardization of responses, both human raters followed a predetermined codebook for each category: sentiment (positive, negative, and neutral), category (medication, treatment, symptoms, diagnosis, diet, and other), gender (male and female), age (0-9, 10-19, 20-29, 30-39, 40-49, 50-59, and 60+ years), and reference to COVID-19 (yes, no, and unsure). A small number of posts not included in the subsample were initially reviewed to gather insight. Both human raters reviewed these posts and individually developed definitions for each category. The definitions were then combined to create an established codebook with definitive definitions for each category. Interrater reliability was assessed by comparing the GPT-3.5 model’s output with the evaluations of the 2 human raters. Any discrepancies identified were returned to the human raters, who each independently scored the posts again using the codebook as a reference. The final Fleiss κ and Gwet AC1 analyses were performed using RStudio (R Studio, Inc) and the irrCAC package [ 32 , 33 ].
The research activities described in this study were reviewed by the Human Research Protection Office at the University of Pittsburgh (STUDY23010103), and the study activities were determined not to involve human subjects as defined by the Department of Health and Human Services (DHHS) and the Food and Drug Administration (FDA) regulations.
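To make the two agreement statistics concrete, the sketch below computes Fleiss κ and Gwet AC1 from their standard multi-rater formulas. It is a Python illustration only. The study ran these analyses in R with the irrCAC package, and the function name, input layout, and toy ratings here are assumptions. Both coefficients share the same observed agreement and differ only in the chance-agreement term, which is why AC1 is less sensitive to skewed category prevalence.

```python
from collections import Counter


def agreement_coefficients(ratings, categories):
    """Return (Fleiss kappa, Gwet AC1) for complete multi-rater data.

    `ratings` holds one list of labels per post, one label per rater.
    Assumes every post has the same number of raters (at least 2) and
    every label appears in `categories`. Fleiss kappa is undefined when
    all ratings fall in a single category (division by zero).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    k = len(categories)
    counts = [Counter(row) for row in ratings]

    # Observed agreement: mean proportion of agreeing rater pairs per post.
    p_obs = sum(
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ) / n_items

    # Overall share of ratings that fall in each category.
    props = [sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories]

    pe_fleiss = sum(p ** 2 for p in props)              # chance term, Fleiss
    pe_gwet = sum(p * (1 - p) for p in props) / (k - 1)  # chance term, AC1

    fleiss = (p_obs - pe_fleiss) / (1 - pe_fleiss)
    ac1 = (p_obs - pe_gwet) / (1 - pe_gwet)
    return fleiss, ac1


# Toy data: 3 raters (for example, 2 humans and the model) on 4 posts.
toy = [
    ["Neutral", "Neutral", "Neutral"],
    ["Neutral", "Neutral", "Neutral"],
    ["Negative", "Negative", "Negative"],
    ["Neutral", "Neutral", "Negative"],
]
kappa, ac1 = agreement_coefficients(toy, ["Positive", "Negative", "Neutral"])
print(round(kappa, 3), round(ac1, 3))  # 0.625 0.786
```

In this toy example most ratings fall in one category, so Fleiss κ is pulled down more than AC1 even though the raters disagree on only one post.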

Results

The comparison between GPT-3.5 and human raters revealed moderate agreement for sentiment analysis and substantial concordance for categorization. For variables pertaining to COVID-19 pandemic references, gender, and age, GPT-3.5 demonstrated almost perfect alignment with human assessments ( Table 1 ).
Table 1. Fleiss and Gwet AC1 coefficients for GPT and human raters. All coefficients had a P value <.001.
Based on self-reported gender, we identified 1509 men and 1502 women among the IBD Reddit users ( Figure 2 ). When comparing the users on the IBD subreddits to the general IBD population, there was a significant difference in gender distribution (N=3,090,011; χ²₂=69.53; P<.001; φ<0.001). Specifically, we saw a higher proportion of men and fewer women than anticipated considering the overall demographics of those affected by IBD [ 1 ]. However, examining the relative effect sizes suggested these differences were negligible. Similarly, while we saw a larger proportion of women than expected (1144.20; 38%) given the general demographic breakdown of Reddit users (N=50,003,011; χ²₂=180.47; P<.001; φ<0.001), our effect size again suggested differences were negligible [ 34 ].
Figure 2. Heatmap of distinct age and gender data.
Most users posting on the IBD subreddits self-reported their age as between 20 and 29 years (n=2392, 49%). This was consistent with the results of our chi-square test (N=5,000,044; χ²₄=1945.51; P<.001; Cramer V<0.001), which suggested that users aged between 10-19 and 20-29 years were overrepresented in our IBD Reddit sample, whereas those aged 30-39, 40-49, and 50+ years were underrepresented compared with the general Reddit user data [ 34 ]. Again, the investigation of effect sizes suggested these differences were negligible.
Sentiment analysis of the posts showed that 43,916 (83%) posts were neutral, 2010 (4%) were positive, and 7016 (13%) were negative; the remaining posts did not have a standardized sentiment value.
Comparing these results across topic groups ( Figure 3 ) with a previous topic analysis of Reddit posts discussing IBD showed a markedly lower frequency of prepandemic references to diet and nutrition (6204.95). Conversely, there was a notably higher volume of conversations surrounding medications before the pandemic (11,231.93) [ 8 ].
Figure 3. Percentage of posts by category and sentiment.
During the study period, the model found that only a small portion of posts mentioned COVID-19 (n=3229, 6%) compared with those that did not (n=47,495, 89%). A small number of posts were classified as unsure (n=2276, 4%). Although visual inspection of Figure 4 suggested a steep drop in COVID-19 mentions throughout the study period, the chi-square test (N=50,724; χ²₂=460.21; P<.001; φ<0.001) and its effect size suggested that the difference in the number of references to COVID-19 was negligible. Figures 2 - 4 were generated using Tableau Desktop [ 35 ]. An overview of the data is provided in Table 2 .
Figure 4. Percentage distribution of COVID-19 mentions throughout the study period.
Table 2. Data overview.
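The pattern reported throughout the Results, a very small P value alongside a negligible effect size, is typical when a chi-square test is run on a very large N. The sketch below uses the textbook definitions of the Pearson chi-square statistic and Cramér V (which equals φ when the table has 2 rows or 2 columns). The study does not describe how its tables were constructed or how its effect sizes were computed, so the function names and any counts passed to them are assumptions, and the authors' procedure may differ.

```python
import math


def chi_square_independence(table):
    """Pearson chi-square statistic for a contingency table.

    `table` is a list of rows of observed counts. Returns the statistic,
    its degrees of freedom, and the total count N.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df, n


def cramers_v(stat, n, n_rows, n_cols):
    """Cramer V = sqrt(chi2 / (N * min(rows - 1, cols - 1)))."""
    return math.sqrt(stat / (n * min(n_rows - 1, n_cols - 1)))
```

Because the statistic grows roughly in proportion to N for a fixed difference in proportions, while V divides by N, comparing a few thousand self-reported users against a reference population of millions can produce a large chi-square value together with a V close to 0.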

Discussion

The main contributions of this study are threefold. First, using GPT-3.5, we implemented a novel approach to processing and categorizing social media discussions. Second, we assessed the model’s performance against human raters on a range of subjective and objective criteria. Third, we delved into the themes and emotions expressed by patients with IBD during the COVID-19 pandemic. Our analysis of interrater reliability shows that GPT-3.5, with prompt engineering, can achieve moderate interrater reliability on subjective aspects such as topic and emotions, and near-perfect reliability on objective elements such as age, gender, and COVID-19 mentions. Our successful use of this approach supports the preliminary feasibility of using GPT-3.5 and future iterations in analyzing big data.
Most posts did not disclose demographic information. However, among those that did, the overall demographics aligned with general Reddit usage. A notable observation was the presence of a small cohort of self-reported adolescents, highlighting a potential area for further investigation into pediatric patient discourse. Exploring the specific issues and experiences shared by this demographic can inform the development of tailored support mechanisms and educational materials that better address the needs of young patients with IBD and their families.
Most posts analyzed were straightforward questions or statements with neutral sentiment (n=43,896, 82%). For posts that had a sentiment value assigned, no single category had more positive sentiment than negative sentiment. The tendency toward negative sentiment in health-related Reddit posts is consistent with findings in Goel et al [ 10 ] and Maleki et al [ 36 ]. The category with the highest ratio of positive to negative posts was diet, with an almost one-to-one ratio. Analysis of diet posts showed that while many people have issues with diet, many others report success in tolerating certain foods and identifying “trigger foods.” The category with the lowest positive-to-negative post ratio was symptoms, with the overall lowest number of positive posts and highest number of negative posts. These posts often expressed issues surrounding pain and frequent bathroom use, as well as a lack of response to treatment. This finding reflects previous work highlighting that many posters appear to use health care–related social media to seek educational resources about their experiences and find validation for their symptoms from an empathetic internet-based community [ 10 ].
Consistent with previous studies, most discussions centered around medications (n=14,909, 28%) and symptoms (n=14,939, 28%). However, our analysis uncovered two distinct areas diverging from past research: dietary discussions were infrequent (n=3947, 7%), potentially due to the strong link between symptoms and dietary choices, and diagnosis-related posts constituted a small but significant portion of the dataset. A manual review revealed that these diagnosis-related posts predominantly originated from individuals lacking a confirmed IBD diagnosis who were seeking diagnostic advice based on their symptoms. This emerging trend, previously undocumented, is concerning as it suggests a reliance on nonprofessional advice for health guidance. These data may support the need for greater community education regarding IBD, alongside outreach from the health care community to support individuals seeking a diagnosis.
Finally, we also observed a gradual decline in pandemic-related mentions over the study period. This aligns with trends observed in other patient groups and suggests factors such as information fatigue or adaptation to the pandemic [ 37 ]. The reduced focus on COVID-19 among the IBD community, despite their heightened risk, underscores the need for ongoing research into the challenges faced by this population during the pandemic era.
Our analysis was subject to several limitations. During our data analysis, we used the GPT-3.5 Turbo endpoint, the leading model publicly available at that time. However, since then, OpenAI has released the GPT-4 model, which has shown improvement in capturing nuanced semantic information, an area where the GPT-3.5 model showed difficulties [ 38 ]. Furthermore, OpenAI plans to allow the GPT-4 model to be fine-tuned using manually annotated data, enhancing its accuracy. Future studies could use these more advanced models to score data more accurately.
Another limitation of our analysis lies in the nature of transformer models, such as GPT-3.5, used in this study. While these models are powerful, they lack transparency in their internal decision-making processes, making it difficult to fully understand how outputs are generated from inputs. This opacity can obscure potential biases, errors, or unintended correlations within the data, which may influence results in ways that are not readily apparent.
Further limitations also apply. First, Reddit’s user base differs in demographics such as age, gender, location, education, income, and interests from other internet-based communities, which may limit the generalizability of our findings to other platforms. Second, we assigned each post to a single topic and sentiment category, potentially simplifying posts with multiple topics or mixed sentiments. Finally, we relied on self-reported data for the poster’s gender and age, which cannot be verified.
In this study, we used GPT-3.5, a powerful pretrained NLP model, to analyze the posts from 3 IBD subreddits during the COVID-19 pandemic. We demonstrated the preliminary feasibility of GPT-3.5 as a valuable sentiment and topic analysis tool capable of producing results with moderate to near-perfect reliability with human raters. Our study helps to fill the knowledge gap surrounding the discourse of individuals diagnosed with IBD, especially in the context of the pandemic. We discovered that people with IBD expressed more negative than positive emotions and that their primary areas of discussion were medication and symptoms. These findings highlight the challenges and concerns that people with IBD faced throughout the pandemic and suggest the need for more targeted support and education for this population. Our study also provides a validated dataset of IBD posts that can be used to further train future NLP models and would also be valuable for subgroup analyses conducted by gastroenterology-focused research teams.

Introduction

Inflammatory bowel disease (IBD) is an autoimmune disorder of the gastrointestinal tract that affects around 3.1 million adults in the United States [ 1 ]. While immunosuppressive medications have shown efficacy in treating IBD, they also increase the risk of infections such as COVID-19 [ 2 , 3 ]. This increased susceptibility to COVID-19 has led individuals with IBD to isolate, potentially exacerbating the adverse health effects associated with pandemic restrictions [ 4 - 6 ]. Despite a substantial body of literature on the use of social media by individuals with IBD, the impact of the COVID-19 pandemic on internet-based discussions within this community remains unclear. Understanding and categorizing behaviors of individuals with IBD can provide insights into how their interactions with social media platforms affect their mental health and inform the development of tailored internet-based resources and support.
Previous studies examining social media use among individuals with IBD have aimed to analyze patient conversations on platforms such as Twitter (subsequently rebranded X) and Reddit (Advance Publications). A 2023 study by Rubin et al [ 7 ] examined patient perspectives on factors contributing to ulcerative colitis flares from public forums across 6 countries, identifying >27,000 patient posts, of which 12,900 (47.8%) were related to flares. The most frequently reported triggers included stress and anxiety (n=440, 37.9%) and diet (n=330, 28.4%). Another study by Rohde et al [ 8 ] characterized topics associated with IBD and distress on Reddit and Twitter, finding that symptoms (n=23,294, 47.8%) and medication (n=12,218, 30.1%) were the most prevalent topics. Additionally, a 2023 study by Stemmer et al [ 9 ] analyzed the content and sentiments expressed in posts by patients with IBD, revealing that they expressed more sadness and fear compared with a control group of healthy users.
Although this previous research has provided a strong foundation for working with IBD social media data, researchers have encountered difficulties in analyzing the large volumes of posts and validating the findings. The rapid advancement of machine learning offers a powerful solution to the challenges of analyzing big data. For instance, Goel et al [ 10 ] used machine-learning techniques to conduct a sentiment and topic analysis of social media data about endometriosis, another private and stigmatized condition. This study used a Bidirectional Encoder Representations from Transformers (BERT) model, a state-of-the-art natural language processing (NLP) model that can extract insights from the vast amount of unstructured data present in social media discussions. However, training a machine learning model requires substantial funding, computational power, and expertise, limiting the accessibility of this method of data analysis.
GPT-3.5 is a powerful large language model that can generate coherent and diverse texts based on a given input [ 11 ]. GPT-3.5 is trained on a large corpus of text from various sources, such as books, websites, news articles, and social media posts. Approximately 22% of its training data came from the OpenWebText corpus, which consists of Reddit posts from 2005 to 2020 [ 12 ]. Early data support the use of GPT-3.5 in sentiment and topic analysis, especially within mental health classification tasks [ 13 - 16 ]. For example, Nadi et al [ 17 ] demonstrated support for GPT-3.5 in determining sentiment based on movie reviews, with more than 90% reliability with human coders across multiple datasets. Similarly, He et al [ 18 ] compared the performance of GPT-3.5 with the Valence Aware Dictionary for Sentiment Reasoning (VADER) model, an open-source Python package designed to calculate sentiment from free text, finding that GPT-3.5 exhibited greater agreement with human coders in determining sentiment from health-related social media.
Despite this, a recent preprint by Lockwood et al [ 19 ] highlighted potential flaws in the use of GPT-4 to conduct qualitative coding to identify themes from data by school psychology graduate educators on the impact of COVID-19 on their training, with findings suggesting support for its use in identifying broad themes, but difficulties in elucidating the depth and nuanced interpretation of human coders. However, this study relied on a small sample (N=60), highlighting the need to evaluate the use of NLP in classifying health-related social media data and benchmarking its reliability against human raters. This study aims to introduce a novel analytical method using GPT-3.5 to analyze large amounts of social media data. Our primary objective is to establish the feasibility of using GPT-3.5 to identify and characterize themes and sentiments in Reddit posts among individuals with IBD during the COVID-19 pandemic. Additionally, we aim to compare the interrater reliability of GPT-3.5 output against human raters to establish the model’s credibility. Finally, this study seeks to contribute to the understanding of discourse among individuals with IBD, particularly during the COVID-19 pandemic.

Source provenance

europepmc: last seen: 2026-06-26T06:14:25.090378+00:00
unpaywall: last seen: 2026-05-21T05:10:58.409756+00:00

License: CC-BY-4.0