Topic and sentiment in comments on diabetes-related Douyin short videos: a cross-sectional text-mining study

Research Square | Research Article

Shan Chen, Emma Mohamad, Arina Azlan, Xixi Zhao

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-8464538/v1
This work is licensed under a CC BY 4.0 License.
Status: Under Review. Version 1 posted 5

Abstract

Background
Short-form video platforms are increasingly used for diabetes-related health information, and comment sections may capture users’ information needs and affective responses.

Methods
We analysed publicly visible top-level comments on diabetes-related Douyin (TikTok China) videos using a cross-sectional text-mining design.
Videos were drawn from a previously evaluated dataset (n = 276) and stratified by information quality (final consensus modified DISCERN score) and diffusion (Douyin Communication Index) into four quadrants; six videos were selected from each quadrant (24 total). All retrieved comments (raw, n = 3,933) were used for descriptive temporal summaries, while text-based analyses were conducted on valid comments after rule-based cleaning (n = 2,007). We performed Chinese word segmentation (jieba), stop-word removal, term-frequency analysis, keyword co-occurrence network analysis (co-occurrence threshold ≥ 6), LDA topic modelling (K = 5), and SnowNLP sentiment classification (negative 0.65).

Results
High-frequency terms were concentrated on diabetes, blood glucose, fasting, doctors, and insulin. The most frequent co-occurring pairs included fasting–blood glucose (25) and diabetes–blood glucose (16). Topic modelling identified five topics; Topic 2 accounted for 89.0% of valid comments (1,786/2,007). Sentiment was predominantly neutral (92.18%, 1,850/2,007), with 6.83% positive (137/2,007) and 1.00% negative comments (20/2,007). In the raw corpus, commenting activity peaked on Fridays (16.5%) and during 18:00–22:00 (29.4%), with a single hourly peak at 20:00 (254 comments).

Conclusions
Comment discourse centred on practical diabetes self-management, particularly the reporting and interpretation of glycaemic readings and related action-oriented questions. Although negative sentiment was uncommon, such comments often described concrete confusion, worries, or difficulties in disease management. These findings may inform platform-level governance of health-related content and more targeted communication strategies for populations affected by diabetes.
Keywords: diabetes; Douyin (TikTok China); health information quality; user comments; text mining; topic modelling; sentiment analysis

Background

The management of diabetes, a typical chronic non-communicable disease, depends heavily on continuous self-management behaviors, including dietary control, regular physical activity, medication adherence, and blood glucose monitoring. It also requires continuous and reliable health information support. With the development of the mobile Internet, the channels through which the public obtains health information have expanded from traditional offline health education and medical institutions to multi-platform digital media, and social media has become an important venue for disseminating chronic disease information and health education. Short video platforms (such as Douyin / TikTok) are reshaping the production, distribution, and consumption of health information through fast dissemination, strong visual expression, algorithm-driven recommendation, and a low interaction threshold. In the context of diabetes, short videos may help popularize health knowledge and raise self-management awareness, but heterogeneous content sources, a limited evidence basis, or oversimplified presentation may also lead to misinterpretation and potential health risks (1–4).

Existing research on health communication and digital health mainly focuses on the intrinsic quality of short video content and the credibility of information sources, such as whether the video clearly conveys the communication purpose, whether it cites reliable information sources, whether it presents relatively balanced viewpoints, and whether it explains uncertainties.
However, on short video platforms, the reach and visibility of content are largely influenced by user interaction metrics (such as likes, comments, favorites, and shares) as well as platform recommendation algorithms (5). Reach therefore does not necessarily align with quality: widely disseminated content is not necessarily of high information quality, and lower-quality information can also spread widely (6). The multimodal features of short videos (including visual, audio, and text information) further increase the complexity of identifying and governing misleading health information on platforms such as TikTok. In response, some recent studies have begun to detect health misinformation in short videos using multimodal methods (7). From a public health perspective, this misalignment warrants attention because exposure to inaccurate or one-sided information may shape risk perceptions and health-related decisions and, in turn, influence health literacy and chronic disease self-management outcomes.

Compared with research centered on the video content itself, the comment section, as an interactive space, carries users' understanding, reinterpretation, and emotional expression of video information, providing an important window into the audience's information needs and the risk communication process. In comment interactions, information exchange can occur through mechanisms such as "ask, answer, correct", and it often involves emotional expressions such as empathy, worry, fear, gratitude, and doubt. Such interactions may either promote peer support or trigger collective misunderstandings, thereby influencing how information is received and further disseminated (8, 9).
Experimental evidence in health communication suggests that the valence of social media comments can shape affective trust and, together with source cues, influence attitudes and behavioral intentions related to health information sharing and service use( 10 ). For chronic conditions such as diabetes that require long-term self-management support, recurrent high-frequency questions, emotional cues, and reported behavioral difficulties in comments can reveal more immediate and authentic information needs, offering audience-level evidence to optimize health education content, improve platform governance, and inform clinical communication ( 11 ). Nevertheless, current evidence has at least three gaps. First, in the Chinese-language context, systematic quantitative evidence on comments under diabetes-related short videos on Douyin remains limited. Second, existing studies often treat comments as a single corpus for topic or sentiment analysis, but rarely link comment characteristics to video-level information quality and dissemination strength; as a result, it remains unclear whether audience responses differ systematically across dissemination contexts. Third, Chinese short-text comments are highly colloquial and frequently include emojis and very short strings, which may reduce the stability of topic and sentiment identification; therefore, transparent and reproducible preprocessing and analytic workflows are particularly important for improving verifiability( 12 , 13 ). Against this background, we examined top-level comments on diabetes-related short videos on Douyin. Using text-mining approaches—including word frequency analysis, keyword co-occurrence networks, topic modeling, and sentiment analysis—we systematically characterized discussion foci, semantic structures, and emotional features in the comment sections, and described temporal activity patterns and participation characteristics of highly active users. 
To enhance coverage across dissemination contexts, we drew on a pre-established video evaluation database to construct a four-quadrant stratification framework (information quality × dissemination strength) and sampled videos from each quadrant for comment collection. We addressed the following questions: ( 1 ) What topics do audiences primarily focus on in comments under diabetes-related Douyin short videos? ( 2 ) What semantic association structures emerge among keywords? ( 3 ) What latent topics and sentiment distributions are present? ( 4 ) What temporal and user-level patterns characterize comment participation? Our findings may inform platform governance of health-related content and strategies to better address information needs among people affected by diabetes. Methods Study design and study materials This study was a cross-sectional, quantitative content analysis and text-mining investigation based on publicly available social media texts. The study materials comprised top-level (first-level) comments posted under diabetes-related short videos on Douyin, with each individual top-level comment treated as the minimum analytic unit. Only content visible to the public was included. We did not access private information, interact with users, or collect any personally identifiable information (e.g., phone numbers or national identification numbers). To further reduce the risk of re-identification, user identifiers were de-identified during analysis and reporting, and results are presented in aggregate form only. Video database and stratified sampling framework Comments were collected from videos drawn from a diabetes-related Douyin short-video evaluation database previously established by the research team (n = 276). For each video in this database, information quality had been rated and dissemination strength had been computed. 
Using the two dimensions of “information quality” and “dissemination strength,” we constructed a four-quadrant stratification framework to guide stratified sampling of videos for comment collection, thereby covering audience interactions across different dissemination contexts.

Information quality assessment
Video-level information quality was evaluated using the modified DISCERN (mDISCERN) scale. mDISCERN measures the reliability of health information along several dimensions, including whether the aims are clear, whether the information source is credible, whether the content is presented in a balanced way, whether sources for further information are provided, and whether uncertainties are explained (14–16). Each video was independently scored by two trained evaluators, and any scoring differences were resolved through discussion. The resulting final consensus total score (SUM) served as the indicator of information quality, with a higher score indicating higher information quality.

Dissemination strength: Douyin Communication Index (DCI)
Dissemination strength was operationalized using the Douyin Communication Index (DCI), a continuous composite indicator intended to reflect overall diffusion and audience engagement on the platform. The DCI was computed as a weighted sum of publicly visible engagement metrics on the video page: DCI = 0.46 × number of likes + 0.37 × number of favorites + 0.17 × number of shares. The weighting scheme was derived from the research team’s prior work on constructing dissemination metrics (17–20). To improve comparability across records, the engagement metrics used to compute the DCI were cumulative counts extracted at a single standardized time point, when the video database was created. When engagement fields were missing, data were handled according to pre-specified rules (e.g., exclusion or imputation as zero; see S1 Table).
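As a minimal sketch of the weighted sum above, the DCI could be computed as follows. The function name `dci` and the sample counts are illustrative assumptions rather than values from the study, and imputing missing metrics as zero is only one of the handling options mentioned in the text, chosen here for illustration.

```python
# Weights as stated in the text; everything else is a hypothetical illustration.
W_LIKES, W_FAVORITES, W_SHARES = 0.46, 0.37, 0.17


def dci(likes, favorites, shares):
    """Douyin Communication Index as a weighted sum of engagement counts."""
    # Assumption: a missing metric (None) is imputed as zero.
    likes, favorites, shares = (
        0 if value is None else value for value in (likes, favorites, shares)
    )
    return W_LIKES * likes + W_FAVORITES * favorites + W_SHARES * shares


# Made-up engagement counts for a single video.
example = dci(likes=1000, favorites=200, shares=50)
print(round(example, 2))
```

Because likes carry the largest weight, a video with many likes but few shares can still score higher than a heavily shared video with few likes.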
Higher DCI values indicate stronger dissemination.

Quadrant classification and cut-off values
We constructed a four-quadrant stratification framework using video information quality (final consensus mDISCERN sum score) and dissemination strength (DCI). Because platform engagement metrics are typically highly right-skewed—often driven by a small number of viral videos—mean-based thresholds can be overly influenced by extreme values and may yield unstable classifications. We therefore used the median as a robust, distribution-free cut-off that is less sensitive to outliers. A further practical advantage of median-based cut-offs is that they tend to produce more balanced group sizes, which improves comparability across quadrants and reduces the risk that descriptive patterns are dominated by a small subgroup. To make the classification rule fully deterministic and reproducible, values equal to the median were pre-specified to be assigned to the “low” category. Importantly, “high” versus “low” in this study was not intended to represent clinical or normative thresholds. Rather, the cut-offs were chosen to support stratified sampling and ensure coverage of different dissemination contexts (including potential “quality–reach mismatch” scenarios). Median split is therefore an appropriate choice for this exploratory, context-coverage purpose.

Video selection and sampling
To reduce potential bias arising from short-term trending events and fluctuations in platform ranking, we conducted standardized searches and sample verification at three adjacent time points: September 8, 15, and 22, 2025 (UTC + 8) at approximately 20:00 each day. Each search was performed while logged out, using the same device and network environment whenever possible. Browser/app histories and caches were cleared prior to searching to minimize personalization effects. These repeated searches were used to improve the robustness of the sample with respect to the contemporaneous “visible information environment” on the platform; comment data were collected as a one-time capture for the selected videos within a predefined collection window, and the overall design remained cross-sectional rather than longitudinal.

Search strategy and sampling frame
Within the Douyin app, we searched the Chinese keyword “糖尿病” (“diabetes”) and obtained the set of videos visible in the search results list using the platform’s default ranking. Because this list is generated by algorithmic ranking, it approximates the information environment that a typical user could encounter through keyword search (21).
As platform presentation may vary by device, location, and time, the present sample represents the set of search results visible under the standardized conditions at the specified time points, rather than an exhaustive sample of all diabetes-related videos on the platform. Stratified sampling across quadrants We used stratified sampling by the four-quadrant framework, selecting six videos from each quadrant (total n = 24) as targets for comment collection. This design aimed to ( 1 ) cover diverse dissemination contexts and reduce the risk that the sample would be dominated by a small number of highly popular videos; and ( 2 ) explicitly include “quality–dissemination mismatch” contexts (e.g., high dissemination but low quality; high quality but low dissemination), reflecting the content structure users may encounter in keyword-based searches. Within each quadrant, candidate videos were ranked by DCI in descending order and included sequentially until six videos were selected, subject to pre-specified rules: ( 1 ) avoiding repeated sampling from the same account where possible; ( 2 ) covering different content types (e.g., health education, experiential sharing, public narratives); and ( 3 ) ensuring that video links were accessible and top-level comments were visible at the time of collection. The titles, URLs, quadrant assignments, mDISCERN (SUM) scores, and DCI values of the 24 videos are provided in S1 Table to facilitate review and verification. Comment collection and analytic corpus Data extraction and variables For each of the 24 videos, we captured all publicly visible top-level comments at the time of collection, yielding a raw comment corpus of 3,933 comments. Extracted fields included comment text, posting timestamp, user identifier (used only for de-identified frequency statistics), a unique comment identifier (comment_id) for deduplication, and the video’s quadrant label for stratified description and comparison. 
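The median split and the within-quadrant selection described above can be sketched as follows. This is an illustration under stated assumptions: the record fields (`video_id`, `quality`, `dci`), the helper names, and the toy data are hypothetical, and the additional selection rules (account diversity, content-type coverage, link accessibility) are omitted.

```python
from statistics import median


def assign_quadrants(videos):
    """Label each video high/low on quality and DCI using median cut-offs.

    Values equal to the median are assigned to "low", as pre-specified in the text.
    """
    quality_median = median(v["quality"] for v in videos)
    dci_median = median(v["dci"] for v in videos)
    for v in videos:
        quality = "high" if v["quality"] > quality_median else "low"
        reach = "high" if v["dci"] > dci_median else "low"
        v["quadrant"] = f"{quality} quality/{reach} dissemination"
    return videos


def select_per_quadrant(videos, k=6):
    """Rank by DCI (descending) and keep the first k video IDs in each quadrant."""
    picked = {}
    for v in sorted(videos, key=lambda item: item["dci"], reverse=True):
        bucket = picked.setdefault(v["quadrant"], [])
        if len(bucket) < k:
            bucket.append(v["video_id"])
    return picked


# Toy data (hypothetical): four videos, with k=1 for brevity.
toy = [
    {"video_id": "a", "quality": 2, "dci": 10.0},
    {"video_id": "b", "quality": 4, "dci": 500.0},
    {"video_id": "c", "quality": 3, "dci": 40.0},
    {"video_id": "d", "quality": 5, "dci": 20.0},
]
selection = select_per_quadrant(assign_quadrants(toy), k=1)
print(selection)
```

Using a strict `>` comparison is what sends ties to the "low" category, which keeps the rule deterministic when several videos share the median score.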
Comments were collected using an automated script/tool that iterated through paginated results until no further comments were available.

Data cleaning and exclusions
To improve corpus quality and the interpretability and reproducibility of text-mining results, we applied a pre-specified, rule-based cleaning pipeline to the raw corpus and recorded the number of exclusions at each step (S1 Fig):
(1) Deduplication: duplicate records were identified and removed using comment_id (50 removed);
(2) Removal of advertisements/solicitations and irrelevant content: comments clearly containing marketing, solicitation, or unrelated content were excluded (360 removed);
(3) Removal of non-linguistic texts: comments consisting only of emojis, special symbols, whitespace, or otherwise uninterpretable characters were excluded (1,123 removed);
(4) Removal of non-informative short comments: very short comments with insufficient semantic information to support downstream analyses were excluded (393 removed; definition in Section “Operational definition of non-informative short comments”).
After cleaning, the analytic corpus comprised 2,007 valid comments and was used for all text-based analyses (word frequency, co-occurrence, topic modeling, and sentiment). When describing overall sample size, time coverage, and temporal distributions, we report the raw corpus (n = 3,933); for modeling and statistical summaries based on text content, the denominator was the analytic corpus (n = 2,007).

Text preprocessing
Preprocessing procedures
We applied a standardized preprocessing pipeline to the analytic corpus (n = 2,007):

Results

Corpus overview and data integrity
Sample size and time span
A total of 3,933 top-level comments were collected from diabetes-related short videos on Douyin. Comment timestamps ranged from 13:28 on March 15, 2019 to 22:07 on December 13, 2025.
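The four rule-based cleaning steps described in the Methods can be sketched as a single pass that records exclusions per step. This is a hedged illustration, not the study's script: the row schema, the `ad_terms` keyword list, the `min_chars` threshold, and the character-class test for "non-linguistic" text are all assumptions, since the operational definitions are not reproduced in this section.

```python
import re

# Assumption: a comment counts as linguistic if it contains at least one
# CJK character, Latin letter, or digit; the study's exact rule may differ.
HAS_TEXT = re.compile(r"[\u4e00-\u9fffA-Za-z0-9]")


def clean_comments(rows, ad_terms=(), min_chars=2):
    """Apply the four exclusion rules in order and count removals per step.

    rows: iterable of dicts with "comment_id" and "text" (hypothetical schema).
    ad_terms: placeholder keywords for advertisements/solicitations.
    min_chars: placeholder threshold for non-informative short comments.
    """
    removed = {"duplicate": 0, "ad_or_irrelevant": 0, "non_linguistic": 0, "too_short": 0}
    seen, kept = set(), []
    for row in rows:
        text = (row.get("text") or "").strip()
        if row["comment_id"] in seen:
            removed["duplicate"] += 1
        elif any(term in text for term in ad_terms):
            removed["ad_or_irrelevant"] += 1
        elif not HAS_TEXT.search(text):
            removed["non_linguistic"] += 1
        elif len(text) < min_chars:
            removed["too_short"] += 1
        else:
            kept.append(row)
        seen.add(row["comment_id"])
    return kept, removed


# Toy rows (made up) covering each exclusion rule once.
rows = [
    {"comment_id": 1, "text": "空腹血糖7.2高吗?"},
    {"comment_id": 1, "text": "空腹血糖7.2高吗?"},
    {"comment_id": 2, "text": "😂😂"},
    {"comment_id": 3, "text": "加微信领取降糖秘方"},
    {"comment_id": 4, "text": "好"},
]
kept, removed = clean_comments(rows, ad_terms=("加微信",), min_chars=2)
print(len(kept), removed)

# The exclusion counts reported in the text add up to the analytic corpus size.
assert 3933 - (50 + 360 + 1123 + 393) == 2007
```

Recording one counter per rule is what makes a flow diagram such as S1 Fig reproducible from the script output.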
By year, most comments were posted in 2025 (82.00%), followed by 2024 (10.88%); smaller proportions were observed for 2019 and 2023 (3.64% and 3.48%, respectively). Given the cross-sectional data collection design, this time span reflects the cumulative historical comments visible at the time of data capture rather than longitudinal tracking of comments for the same videos over time. Data completeness and analytic corpus Among the 3,933 exported comments, 98 had blank comment text, which may be attributable to platform display restrictions, comment deletion, or missing fields returned by the interface. To improve interpretability and reproducibility of text-based analyses, we applied a rule-based cleaning pipeline prior to analysis, including deduplication, removal of advertisements/irrelevant content, exclusion of non-linguistic texts, and exclusion of non-informative short comments (S1 Fig). After cleaning, 2,007 valid comments remained and constituted the analytic corpus. Unless otherwise specified, all content-based analyses (word frequency, keyword co-occurrence, topic modeling, and sentiment analysis) were conducted on the analytic corpus (n = 2,007). Descriptions of sample size, temporal distributions, and participation structure are reported using the raw corpus (n = 3,933). Comment length and text characteristics Within the analytic corpus (n = 2,007), the median comment length was 14 Chinese characters (IQR: 8–25), with a mean of 21.16 characters; the longest comment contained 459 characters. Overall, comments were predominantly short texts, although a small number were relatively long and typically contained more complete illness narratives, symptom descriptions, or treatment experiences. User participation and question-type comments Across the 3,933 raw comments, 3,786 unique users were identified. 
Most users posted only once (96.88%), whereas 118 users (3.12%) posted two or more comments, indicating a participation pattern characterized by broad engagement with low posting frequency. Using heuristic rules to identify interrogative expressions (presence of “?” or interrogative triggers such as “吗”, “怎么”, “多少”, “要不要”, and “呢”), 644 question-type comments were detected, accounting for 16.37% (644/3,933) of raw comments. In the analytic corpus, question-type comments were slightly shorter than non-question comments (mean length 19.69 vs. 22.11 characters; median 13 vs. 15 characters). Coverage across quadrants Videos used for comment collection were selected using the four-quadrant stratification framework defined by information quality (mDISCERN final consensus sum score, SUM) and dissemination strength (DCI). Six videos were sampled from each quadrant (24 videos in total), and all publicly visible top-level comments were captured for each selected video. For reproducibility, the titles, URLs, quadrant assignments, mDISCERN (SUM) scores, and DCI values for the 24 videos are provided in S1 Table. At the raw comment level, the number of comments per quadrant was: low quality/low dissemination (819), low quality/high dissemination (1,109), high quality/low dissemination (919), and high quality/high dissemination (1,086). Using the number of top-level comments per video as the unit of description, the median (IQR) comments per video were 155.5 (145.25–169.50) for low quality/low dissemination, 186.0 (178.00–193.25) for low quality/high dissemination, 166.0 (153.00–173.00) for high quality/low dissemination, and 178.0 (172.75–192.25) for high quality/high dissemination. One video in the low quality/low dissemination quadrant had no visible top-level comments at the time of collection, as noted in S1 Table. Retention of valid comments across quadrants After rule-based cleaning of the raw corpus (n = 3,933), 2,007 valid comments were retained. 
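The heuristic for question-type comments described above can be sketched as a simple substring check. The trigger list below contains only the examples named in the text (the study's full list may be longer), and also matching the full-width question mark is an assumption; the function names are hypothetical.

```python
# Interrogative triggers listed in the text; the study's list may include more.
QUESTION_TRIGGERS = ("吗", "怎么", "多少", "要不要", "呢")
# ASCII and full-width question marks; including the latter is an assumption.
QUESTION_MARKS = ("?", "?")


def is_question(text):
    """Flag a comment as question-type if it contains a mark or a trigger."""
    return any(mark in text for mark in QUESTION_MARKS) or any(
        trigger in text for trigger in QUESTION_TRIGGERS
    )


def question_share(comments):
    """Return the number of flagged comments and their share in percent."""
    flagged = sum(is_question(comment) for comment in comments)
    return flagged, round(100 * flagged / len(comments), 2)


# Toy comments (made up).
print(question_share(["多少算正常", "谢谢", "能吃米饭?"]))

# The share reported in the text follows from the reported counts.
print(round(100 * 644 / 3933, 2))
```

Substring rules of this kind are cheap and transparent, but they can over-flag particles such as “呢” in non-question sentences, which is one reason to treat the resulting share as an approximation.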
To assess whether the cleaning pipeline resulted in differential exclusions across dissemination contexts—which could potentially influence subsequent topic and sentiment analyses—we compared the correspondence between raw and valid comments in each quadrant and calculated retention rates (valid/raw). Retention rates varied across quadrants: 52.99% (434/819) for low quality/low dissemination, 58.22% (535/919) for high quality/low dissemination, 56.63% (615/1,086) for high quality/high dissemination, and the lowest retention for low quality/high dissemination at 38.14% (423/1,109). This pattern suggests that in the high-dissemination but low-quality context, raw comments were more likely to include advertisements/solicitations, non-linguistic texts, or semantically sparse short comments and were therefore more frequently removed during cleaning; retention in the other quadrants clustered at approximately 53%–58%. Word frequency patterns and semantic categories High-frequency terms After Chinese word segmentation, stop-word removal, and synonym normalization, we computed the top 50 high-frequency terms in the analytic corpus. The most frequent terms included “diabetes”, “blood glucose”, “fasting”, “doctor”, and “insulin”. Document frequencies (df) and relative frequencies for the top 20 terms are reported in Table 1. The full list of the top 50 terms is available in S2 Table. The top 20 high-frequency terms are visualized in Figure.1. A word cloud of the full corpus vocabulary is provided in S2 Fig. 
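The document-frequency counts reported in Table 1 can be illustrated with a short sketch that assumes the comments have already been segmented into tokens (the study used jieba for this step). The stop-word set, the synonym map, and the toy comments are hypothetical, and the denominator used for relative frequency here (the number of comments passed in) is an assumption, since the text does not state it.

```python
from collections import Counter


def document_frequency(tokenized_comments, stopwords=frozenset(), synonyms=None):
    """Count, for each term, the number of comments in which it appears."""
    synonyms = synonyms or {}
    df = Counter()
    for tokens in tokenized_comments:
        # Building a set means each term counts at most once per comment.
        normalized = {synonyms.get(t, t) for t in tokens if t not in stopwords}
        df.update(normalized)
    return df


def top_terms(df, n_docs, k=20):
    """Return (term, df, relative frequency in %) for the k most frequent terms."""
    return [(term, count, round(100 * count / n_docs, 2)) for term, count in df.most_common(k)]


# Toy pre-segmented comments (made up).
comments = [["空腹", "血糖", "的"], ["血糖", "高"], ["血糖值", "空腹"]]
df = document_frequency(comments, stopwords={"的"}, synonyms={"血糖值": "血糖"})
print(top_terms(df, n_docs=len(comments), k=2))
```

Counting per comment rather than per occurrence keeps a single long comment that repeats one word from inflating that term's rank.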
Table 1 Top 20 high-frequency terms

| Term (Chinese) | Document frequency (df) | Relative frequency (%) |
|---|---|---|
| Diabetes | 243 | 6.18 |
| Blood glucose | 166 | 4.22 |
| Fasting | 151 | 3.84 |
| Doctor | 140 | 3.56 |
| Insulin | 94 | 2.39 |
| Teacher* | 79 | 2.01 |
| Safe / stable | 67 | 1.70 |
| Postprandial | 52 | 1.32 |
| Every day | 46 | 1.17 |
| Beverages | 46 | 1.17 |
| Hospital | 45 | 1.14 |
| Tears / cry | 44 | 1.12 |
| Exercise | 44 | 1.12 |
| Control | 43 | 1.09 |
| High blood glucose | 41 | 1.04 |
| Eating | 39 | 0.99 |
| Vinegar | 38 | 0.97 |
| Medication | 37 | 0.94 |
| After meals | 35 | 0.89 |
| Rice | 35 | 0.89 |

Network structure and community detection
To depict the overall structure of term associations, we constructed an undirected weighted keyword co-occurrence network based on the top 50 high-frequency terms, with edge weights defined by comment-level co-occurrence counts. For visualization and to reduce noise from incidental co-occurrences, a filtered display network was constructed for the main text (Fig. 2) by retaining only term pairs with co-occurrence ≥ 5 and selecting the top K = 30 edges ranked by co-occurrence frequency. The resulting display network contained 20 nodes and 30 edges, with a network density of 0.158 and a mean degree of 3.00. In Fig. 2, node size represents degree, while edge width and labels indicate co-occurrence counts, highlighting hubs and clustering patterns among keywords. Community detection using the weighted greedy_modularity_communities algorithm identified three communities, with a weighted modularity of Q_weighted = 0.274, indicating a discernible modular structure in the co-occurrence network.

Hub terms and semantic clusters
In the display network, high-degree hub terms were primarily indicator-related concepts (e.g., “diabetes”, “blood glucose”, “fasting”), suggesting that the semantic core of comment discourse centered on interpretation of glycemic indicators and self-management.
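The construction of the filtered display network, and the density and mean-degree figures reported for it, can be illustrated with a short standard-library sketch. The function names and the toy token lists are hypothetical; only the filtering rule (minimum co-occurrence, then top-K edges) and the reported node and edge counts come from the text.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(docs, vocab, min_count=5, top_k=30):
    """Count comment-level co-occurrences among vocab terms, then keep pairs
    with count >= min_count and at most top_k edges ranked by count."""
    vocab = set(vocab)
    counts = Counter()
    for tokens in docs:
        present = sorted(set(tokens) & vocab)    # each term once per comment
        counts.update(combinations(present, 2))  # unordered pairs (undirected edges)
    strong = [(pair, c) for pair, c in counts.most_common() if c >= min_count]
    return strong[:top_k]


def density_and_mean_degree(n_nodes, n_edges):
    """Density and mean degree of an undirected simple graph."""
    density = 2 * n_edges / (n_nodes * (n_nodes - 1))
    mean_degree = 2 * n_edges / n_nodes
    return density, mean_degree


# Toy pre-segmented comments (made up), with a low threshold for illustration.
docs = [["fasting", "blood glucose"], ["fasting", "blood glucose", "insulin"], ["insulin"]]
print(cooccurrence_edges(docs, vocab={"fasting", "blood glucose", "insulin"}, min_count=2, top_k=5))

# Check against the reported display network: 20 nodes and 30 edges.
print(density_and_mean_degree(20, 30))  # density ≈ 0.158, mean degree = 3.0
```

With 20 nodes there are 190 possible undirected pairs, so 30 retained edges give 30/190 ≈ 0.158, matching the reported density.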
Examination of key connectors further indicated that some terms bridged different semantic substructures (e.g., terms related to diet or treatment), reflecting linkages across indicators, behavior, and healthcare/medication. Community detection on the filtered network identified three communities (Q_weighted = 0.274), indicating a discernible clustering structure. Substantively, these communities could be summarized into two dominant semantic domains: one emphasizing symptom experience and dietary triggers/risk perceptions (e.g., beverages/sugary drinks), and another emphasizing indicator interpretation, control strategies, and healthcare-seeking/medication (e.g., fasting, blood glucose, glycated hemoglobin, medication, doctor/hospital). Full community membership lists and parameter settings are best presented in the supplementary materials (S2 Table) to support verification.

Topic modeling
Identified topics and prevalence
In the analytic corpus (n = 2,007), the topic model identified five latent topics (K = 5). The representative keywords, number of comments, and proportion for each topic are presented in Table 2. Overall, the topic structure was highly concentrated, with a single topic dominating the discussion. Topic 2 (centered on diabetes self-management content such as interpretation of fasting/blood glucose values, doctor consultation, and insulin-related discussion) accounted for 89.0% (1,786/2,007) of all comments. The remaining topics were much smaller: Topic 5 (expressions of emotion accompanied by mentions of diabetes, doctors, and insulin; 5.5%, 111/2,007), Topic 3 (mainly expressions of gratitude and support; 2.7%, 55/2,007), Topic 1 (discussion of the timing and context of fasting or post-meal blood glucose measurement; 2.4%, 48/2,007), and Topic 4 (scattered mentions of health-related matters; 0.3%, 7/2,007).

Table 2 Latent topics identified by LDA topic modeling

| Topic | Topic label | Top keywords | n | % |
|---|---|---|---|---|
| 1 | Blood glucose monitoring and abnormal indicators | 血糖高 (high blood glucose), 医院 (hospital), 天天 (every day), 测 (measure), 空腹 (fasting), 血糖 (blood glucose), 检查 (examination) | 48 | 2.4 |
| 2 | Glycaemic interpretation and daily self-management | 空腹 (fasting), 血糖 (blood glucose), 胰岛素 (insulin), 老师 (teacher), 做 (do), 餐后 (postprandial) | 1,786 | 89.0 |
| 3 | Professional medical advice and medication consultation | 医生 (doctor), 黄医生 (Dr. Huang), 空腹 (fasting), 血糖 (blood glucose), 吃药 (take medication), 糖尿病 (diabetes) | 55 | 2.7 |
| 4 | Dietary preferences and glycaemic control conflicts | 吃 (eat), 辣 (spicy), 爱吃 (love to eat), 甜 (sweet), 血糖 (blood glucose) | 7 | 0.3 |
| 5 | Diabetes type awareness and basic disease knowledge | 一型 (type 1), 二型 (type 2), 糖尿病 (diabetes), 年龄 (age), 治愈 (cure) | 111 | 5.5 |

Although the other four topics accounted for small proportions, they remained semantically distinguishable. Topic 1 mainly involved expressions related to the timing and context of blood glucose measurement (e.g., fasting, post-meal). Topic 3 mainly comprised emotionally supportive content, with common keywords including "rose", "heart", and "thanks". Topic 4 was extremely rare and contained only a few scattered mentions of health-related matters, with diffuse semantics. Topic 5 was characterized by emoji-style or emotive expressions (e.g., "covering one's face"), accompanied by keywords such as diabetes, doctor, and insulin, suggesting that this topic reflects emotionally toned discussion.

Interpretation of the single-dominant topic structure
Topic 2 accounted for a far larger share of the corpus than any other topic. This may be related to the expression characteristics of short-text comments and to the focus of the corpus.
First, short-video comments often take the form of brief statements, numerical reports, or direct questions. Their limited length and low lexical differentiation cause many comments to cluster in the semantic space, producing topic concentration. Second, the corpus focuses on diabetes-related content, so discussion tends to concentrate on indicators such as blood glucose and fasting blood glucose and on practical information about medical treatment and medication (such as insulin), which may further amplify this concentration. We therefore regard Topic 2 as the core discussion thread in the comment sections; the other topics are better understood as relatively marginal types or supplementary concerns than as parallel topics of comparable size. Over time, the topic distribution may show some phased fluctuation, but Topic 2 remained dominant throughout the observation range. Because this study did not conduct formal time-series tests, these temporal characteristics are offered only as descriptive indications and not as a basis for dynamic-mechanism or causal inference.

Topic distribution across quadrants
To examine whether the topic structure of comments differed across dissemination contexts, we descriptively compared the topic distribution within each quadrant of the information quality × dissemination strength framework, using the full analytic corpus (n = 2,007) (see Supplementary Table S4). Topic 2 dominated in all four quadrants, with its proportion ranging from 65.01% to 84.33%.
The share of Topic 2 was highest in the "low quality × low dissemination" quadrant (84.33%) and lowest in the "low quality × high dissemination" quadrant (65.01%), where the relative shares of non-dominant topics increased, mainly Topic 4 (14.42%) and Topic 3 (12.06%). This suggests that when reach is high but information quality is low, comment sections may be more likely to contain relatively marginal or context-driven expressions. Because non-dominant topics still make up a limited share of the overall corpus, we report these differences descriptively and do not draw statistical or causal inferences.

Sentiment analysis

In the analytic corpus (n = 2,007), SnowNLP-based three-category sentiment classification labelled 1,850 comments as neutral (92.18%), 137 as positive (6.83%), and 20 as negative (1.00%). The overall sentiment distribution is shown in Fig. 3. Under the fixed thresholds used in this study (negative: score < 0.35; positive: score > 0.65; neutral otherwise), the predominance of neutral sentiment may reflect the fact that many comments were factual statements, numeric reports of glycemic values, or question-type expressions. Because the distribution is also sensitive to the underlying model and threshold selection, sentiment findings are interpreted descriptively.

Sentiment by topic

Cross-tabulation of sentiment labels with LDA topics showed that the distribution of sentiment across topics was not uniform. Overall, negative comments were concentrated primarily within the dominant topic (Topic 2: glycemic indicators and self-management). In addition, small numbers of positive expressions were observed in some non-dominant topics (e.g., Topics 3 and 5). Given the overwhelming prevalence of Topic 2, these patterns may partly reflect a “base-rate” effect driven by topic size.
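The thresholding and cross-tabulation described above can be illustrated with a minimal sketch. It assumes each comment already has a sentiment score in [0, 1] (for example, from SnowNLP's `sentiments` property) and a dominant-topic ID. The record layout, the function names, and the treatment of the exact boundary values 0.35 and 0.65 as neutral are assumptions made for illustration, not details confirmed by the study.

```python
from collections import Counter, defaultdict

# Cut-offs as reported in the paper (0.35 / 0.65). Treating the boundary
# values themselves as "neutral" is an assumption of this sketch.
NEG_CUTOFF = 0.35
POS_CUTOFF = 0.65


def label_sentiment(score: float) -> str:
    """Map a sentiment score in [0, 1] to one of three categories."""
    if score < NEG_CUTOFF:
        return "negative"
    if score > POS_CUTOFF:
        return "positive"
    return "neutral"


def crosstab(records):
    """Count sentiment labels within each topic.

    `records` is an iterable of (topic_id, score) pairs (hypothetical layout).
    Returns {topic_id: Counter({label: count})}.
    """
    table = defaultdict(Counter)
    for topic_id, score in records:
        table[topic_id][label_sentiment(score)] += 1
    return dict(table)


def within_topic_share(table, label):
    """Share of `label` inside each topic, which controls for topic size."""
    return {
        topic: counts[label] / sum(counts.values())
        for topic, counts in table.items()
    }


# Toy data only; these are not the study's comments or scores.
toy = [(2, 0.10), (2, 0.50), (2, 0.55), (2, 0.90), (3, 0.70), (3, 0.40)]
toy_table = crosstab(toy)
print(toy_table)
print(within_topic_share(toy_table, "negative"))
```

Reporting within-topic shares alongside raw counts is one simple way to see whether a concentration of negative comments in a large topic reflects more than that topic's size.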
Given this possible base-rate effect, we report the observed distributions (Table 3) without inferential interpretation regarding differences in sentiment across topics.

Table 3 Distribution of sentiment categories

Topic | Negative | Neutral | Positive
1 | 0 | 14 | 0
2 | 20 | 1,760 | 132
3 | 0 | 60 | 4
4 | 0 | 5 | 0
5 | 0 | 11 | 1

Sentiment category and comment length

Comment length differed by sentiment category. Within the analytic corpus (n = 2,007), negative comments were longer on average (mean 50.1 characters; median 30) than positive comments (mean 33.2; median 19) and neutral comments (mean 21.1; median 15). This suggests that although negative comments were rare, they more often contained detailed symptom experiences, illness narratives, or expressions of concern, and thus tended to carry higher information density (S6 Table).

Sentiment distribution across quadrants

To assess whether sentiment structure varied across dissemination contexts, we compared the three-category sentiment distribution across the four quadrants within the analytic corpus (n = 2,007) (S7 Table). Neutral sentiment predominated in all quadrants (approximately 90.73%–94.02%), with positive and negative comments accounting for relatively small proportions. Negative comments remained uncommon across quadrants but were most frequent in the high-quality/high-dissemination quadrant (1.95%, 12/615) and least frequent in the low-quality/low-dissemination quadrant (0.23%, 1/434). The low-quality/high-dissemination and high-quality/low-dissemination quadrants showed negative proportions of 0.95% (4/423) and 0.56% (3/535), respectively. The proportion of positive comments varied only modestly across quadrants (approximately 5.42%–7.83%). Given the small number of negative comments overall (n = 20), these differences are reported descriptively and are interpreted cautiously in light of methodological limitations.

Temporal patterns

Weekday distribution

Temporal activity was described using the raw comment corpus (n = 3,933).
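A descriptive tally of this kind can be sketched as follows. The timestamp format, the function names, and the half-open evening window (18:00 up to but not including 22:00) are assumptions for illustration; the study does not state how timestamps were stored or whether the 22:00 hour was included.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def tally_activity(timestamps, fmt="%Y-%m-%d %H:%M:%S"):
    """Count comments per weekday and per hour of day.

    `timestamps` is an iterable of strings in `fmt` (an assumed export
    format), taken to be in the platform's local time.
    """
    by_weekday, by_hour = Counter(), Counter()
    for raw in timestamps:
        ts = datetime.strptime(raw, fmt)
        by_weekday[WEEKDAYS[ts.weekday()]] += 1  # weekday(): Monday is 0
        by_hour[ts.hour] += 1
    return by_weekday, by_hour


def share_in_window(by_hour, start, end):
    """Proportion of comments whose hour satisfies start <= hour < end."""
    total = sum(by_hour.values())
    inside = sum(n for hour, n in by_hour.items() if start <= hour < end)
    return inside / total if total else 0.0


# Toy timestamps only; these are not the study's data.
toy = ["2025-01-03 20:15:00", "2025-01-03 21:30:00", "2025-01-04 09:00:00"]
toy_weekday, toy_hour = tally_activity(toy)
print(toy_weekday.most_common(1))
print(round(share_in_window(toy_hour, 18, 22), 3))
```

Because the tallies are plain counts, the percentages reported below follow directly by dividing each count by the corpus size.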
By weekday, comment volume was highest on Fridays (647 comments, 16.5%), followed by Tuesdays (594 comments, 15.1%) and Saturdays (588 comments, 15.0%). The lowest volume was observed on Mondays (506 comments, 12.9%).

Hourly patterns

By hour of day, commenting activity was concentrated in the evening. A total of 1,156 comments (29.4%) were posted between 18:00 and 22:00, with the single peak hour occurring at 20:00 (254 comments, 6.46%). The joint distribution by weekday and hour is shown in Fig. 4.

Highly active users

Among the 3,786 unique users identified, most posted only a single comment. A total of 118 users (3.12%) posted two or more comments, and the two most active users each posted six comments. Contribution concentration was low: the top 10 users together contributed 41 comments (1.04% of all comments), indicating a participation structure dominated by broad engagement with low individual posting frequency (S7). Comment counts among the most active users are shown in S4 Fig.

User–topic participation flows

To depict how highly active users participate across topics, we selected the users with the most comments and aggregated their comment counts by topic to draw a user–topic Sankey diagram (Figure 5). As the figure shows, comments from highly active users are clearly concentrated at the topic level, converging mainly on Topic 5 (discussions related to blood glucose monitoring, fasting blood glucose, and medical treatment and medication), while comparatively few are assigned to the other topics (Topics 1–4). This indicates that highly active users show a marked topical preference, with Topic 5 occupying a core position in their overall participation structure.
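The aggregation behind such a diagram can be sketched in a few lines. The input layout (one `(user_id, topic_id)` pair per comment), the function names, the cut-off `k`, and the `U1, U2, ...` aliasing scheme are all hypothetical; the study does not report how many users were selected or how identifiers were anonymized.

```python
from collections import Counter


def top_users(comments, k):
    """Return the k user IDs with the most comments.

    Ties keep first-seen order, which is how Counter.most_common orders
    elements with equal counts.
    """
    counts = Counter(user for user, _topic in comments)
    return [user for user, _n in counts.most_common(k)]


def user_topic_flows(comments, k):
    """Aggregate (alias, topic) -> comment count for the k most active users.

    Users are relabelled U1, U2, ... so raw identifiers never reach the
    output that feeds the diagram.
    """
    selected = top_users(comments, k)
    alias = {user: f"U{i}" for i, user in enumerate(selected, start=1)}
    flows = Counter()
    for user, topic in comments:
        if user in alias:
            flows[(alias[user], topic)] += 1
    return flows


# Toy (user_id, topic_id) pairs only; these are not the study's data.
toy = [("a", 2), ("b", 5), ("a", 2), ("c", 1), ("a", 5), ("b", 2)]
for (user, topic), n in sorted(user_topic_flows(toy, k=2).items()):
    print(user, "-> Topic", topic, ":", n)
```

Each `(alias, topic)` count corresponds to one link in the Sankey diagram, with link width proportional to the count.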
It should be noted that the Sankey diagram is built from the aggregated user–topic flows of a subset of highly active users and therefore should not be compared directly with topic proportions calculated on the full corpus. All user identifiers were anonymized during visualization and reporting; the aggregated user–topic flow data used to draw the Sankey diagram are provided in Supplementary Table S5.

Discussion

Based on the cleaned analytic corpus of valid comments (n = 2,007), this study applied quantitative text-mining methods to top-level comments under diabetes-related short videos on Douyin to characterize audience concerns, semantic association structures, latent topics, and sentiment distributions. Overall, comment discourse was highly concentrated on diabetes self-management, with a core focus on blood glucose monitoring and indicator interpretation (e.g., fasting glucose and glycated hemoglobin–related expressions), closely linked to practical information on dietary control, medication consultation, and healthcare-seeking behaviors. This pattern suggests that audience interaction in short-video comment spaces is strongly practice-oriented, organized around an information-need chain of “indicator interpretation–risk appraisal–action selection” (22–25).

Despite the brevity and fragmentation typical of short-text comments, keyword co-occurrence analysis revealed a structured semantic association network rather than purely incidental co-mentions. Similar semantic/co-occurrence network approaches have been used to characterize topical structure in short social-media texts (26–29). Community detection on co-occurrence networks, however, is known to be sensitive to term selection, thresholding decisions, and algorithmic parameters; modularity-based partitions may change under alternative settings and should be interpreted as descriptive structure rather than definitive boundaries.
Accordingly, we use the identified communities primarily to support a qualitative interpretation of the corpus's semantic organization (30, 31).

Topic modeling results indicated that, although comment topics were diverse, the overall structure was highly concentrated. Topic 2 accounted for 89.0% of all comments (1,786/2,007), with representative keywords centered on diabetes-related discussions involving fasting/blood glucose, doctor consultation, and insulin-related mentions. The remaining topics were comparatively minor, including Topic 5 (emoji-/expression-laden comments with diabetes/doctor/insulin mentions; 5.5%, 111/2,007), Topic 3 (gratitude and supportive expressions; 2.7%, 55/2,007), Topic 1 (blood glucose timing and measurement contexts such as fasting/postprandial; 2.4%, 48/2,007), and Topic 4 (miscellaneous health-related mentions; 0.3%, 7/2,007). Overall, these findings suggest that comment discourse is strongly practice-oriented and anchored in diabetes self-management and glycaemic monitoring, while also containing smaller proportions of affective or context-specific expressions.

Across the “information quality × dissemination strength” quadrants, Topic 2 remained predominant, while the relative shares of non-dominant topics increased in the low-quality/high-dissemination quadrant (Supplementary Table S4). Because non-dominant topics accounted for a small share of the overall corpus, these differences are reported descriptively only, without statistical inference or causal interpretation (32–34). Given the small number of comments in non-dominant topics, this observation needs to be verified in larger samples or in future quadrant-based comparative studies.

Sentiment analysis showed that comments were predominantly neutral, with positive and negative expressions accounting for relatively small shares.
This distribution may reflect the large number of factual statements, blood glucose value reports, and questions in the comments. Sentiment classification results also depend on model selection and threshold settings; the fixed SnowNLP thresholds (0.35/0.65) adopted in this study may have increased the share of comments classified as neutral (35, 36). Notably, although negative comments were few, they often described specific confusion, concerns, or unpleasant experiences and carried high information density, which gives them potential significance for public health risk communication. Because colloquial expression, sarcasm, emojis, and context dependence in Chinese may reduce the accuracy of automated sentiment recognition, these results should be interpreted cautiously and descriptively. Future research could validate them by manually annotating subsamples or by using more advanced Chinese pre-trained language models to improve robustness (37, 38). The quadrant comparison also showed that negative comments were rare in every context but relatively more frequent in the quadrant with high information quality and high dissemination intensity. Because the absolute number of negative comments was small, this difference is reported descriptively only, without inferential interpretation.

Methodologically, this study embedded comment analysis in a four-quadrant stratified sampling framework based on information quality and dissemination intensity, thereby covering multiple dissemination contexts and reducing the risk that the results are dominated by a few highly popular or "blockbuster" videos.
By explicitly including potential risk scenarios (such as high dissemination but low information quality), the framework offers an operational way to examine audience response patterns under conditions of "quality-dissemination mismatch" (39, 40). The analysis in this study is mainly descriptive, and no formal statistical tests were conducted on differences in topic structure, question ratio, sentiment distribution, or network indicators across quadrants. Future research could add inferential comparisons and robustness analyses to strengthen explanatory power.

This study has several limitations. First, it used a cross-sectional design: the data reflect the comments visible on the platform at the time of collection, accumulated up to that point, rather than longitudinal tracking of how comments evolved. Second, only first-level comments were analyzed and reply chains were excluded, which may underestimate interactive processes such as error correction, discussion, or peer support. Third, topic and sentiment identification depend on specific algorithms and parameter settings; although key procedures are reported to improve reproducibility, this method dependence should be kept in mind when interpreting the results. Finally, the sample comes from a single platform and specific keyword contexts, so caution is needed when generalizing the conclusions to other platforms or information settings.

In summary, comments under diabetes-related short videos on Douyin show a highly concentrated discussion pattern centered on blood glucose indicators and self-management, and the keyword co-occurrence network exhibits a recognizable semantic association structure.
Although sentiment was predominantly neutral, the few negative comments revealed representative confusion and uncertainty. Quantitative mining of comment text provides data-driven evidence for understanding health information interaction on short-video platforms, helps identify the information needs of people with chronic diseases, and can inform the optimization of health communication and response strategies at the platform level.

Conclusions

Top-level comments on diabetes-related Douyin short videos were distinctly practice-oriented. They focused mainly on diabetes self-management, especially the reporting and interpretation of glycaemic indicators such as blood glucose and fasting blood glucose, and on questions about how to respond to these readings and make decisions in daily life. The keyword co-occurrence structure and the topic modeling results further indicate that comment content is not scattered but forms a structured semantic network centered on blood glucose indicators and closely tied to practical management behaviors such as seeking medical consultation and using medication (such as insulin). For sentiment, classification based on SnowNLP with fixed thresholds showed that comments were predominantly neutral (92.18%, 1,850/2,007), with small shares of positive and negative expressions (6.83% and 1.00%, respectively). This pattern may reflect the large number of factual statements, blood glucose value reports, and question-type expressions in the comments. Although negative comments were few, they often involved specific confusion, concerns, or practical management difficulties and carried high information density.
In the context of public health risk communication, such comments therefore retain reference value. Colloquial expressions, emojis, and context dependence in Chinese may affect the accuracy of automated sentiment recognition, so these results should be interpreted descriptively. Embedding the comment analysis in the four-quadrant stratified framework of "information quality × dissemination intensity" also made it possible to compare audience responses across dissemination contexts. In every quadrant, the core topic of blood glucose indicators and self-management remained dominant; where reach was higher but information quality relatively lower, the share of non-dominant expression types increased. Because non-dominant topics make up a small share of the overall corpus, we present these differences descriptively and do not draw statistical or causal inferences. Overall, quantitative analysis of short-video comment text provides empirical support for understanding health information interaction on short-video platforms. It helps identify the core information needs of people with chronic diseases and those around them, and can inform the optimization of health communication and risk response strategies at the platform level.

Declarations

Ethics approval and consent to participate

This study was conducted in accordance with the Declaration of Helsinki and relevant institutional guidelines. An application for ethics review of this study has been submitted to the Research Ethics Committee of Universiti Kebangsaan Malaysia (UKM) and is currently under review. The study analyzed publicly available, top-level comments from a social media platform; no interaction with users occurred, and all data were anonymized prior to analysis.

Consent for publication

Not applicable.
Availability of data and materials

The anonymized comment-level dataset generated and analyzed during the current study is available in the Supplementary Materials. Additional materials are available from the corresponding author upon reasonable request.

Competing interests

The authors declare that they have no competing interests.

Funding

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

Authors’ contributions

Shan Chen conceptualized the study, designed the methodology, conducted data collection and analysis, and drafted the manuscript. Xixi Zhao contributed to data interpretation and manuscript revision. Professor Emma Mohamad provided theoretical guidance and critically reviewed the manuscript. Dr Arina Anis Azlan supervised the research process and contributed to manuscript editing. All authors read and approved the final manuscript.

References

1. American Diabetes Association. 1. Improving Care and Promoting Health in Populations: Standards of Medical Care in Diabetes—2021. Diabetes Care. 2021;44(Supplement 1):S7–14.
2. Whittemore R, Liberti LS, Jeon S, Chao A, Minges KE, Murphy K, et al. Efficacy and implementation of an Internet psychoeducational program for teens with type 1 diabetes. Pediatr Diabetes. 2016;17(8):567–75.
3. Ghio D, Lawes-Wickwar S, Tang MY, Epton T, Howlett N, Jenkinson E, et al. What influences people’s responses to public health messages for managing risks and preventing infectious diseases? A rapid systematic review of the evidence and recommendations. BMJ Open. 2021;11(11):e048750.
4. Zeng M, Grgurevic J, Diyab R, Roy R. #WhatIEatinaDay: The Quality, Accuracy, and Engagement of Nutrition Content on TikTok. Nutrients. 2025;17(5):781.
5. Ahmed F, Kabir MA, Ahmed M. The Impact of Short Video Content and Social Media Influencers on Digital Marketing Success: A Systematic Literature Review of Smartphone Usage. Innovatech Eng J. 2025;1(02):1070937.
6. Afful-Dadzie E, Afful-Dadzie A, Egala SB. Social media in health communication: A literature review of information quality. Health Inform Manage J. 2023;52(1):3–17.
7. Shang L, Zhang Y, Deng Y, Wang D. MultiTec: a data-driven multimodal short video detection framework for healthcare misinformation on TikTok. IEEE Trans Big Data. 2025.
8. Dutceac Segesten A, Bossetta M, Holmberg N, Niehorster D. The cueing power of comments on social media: How disagreement in Facebook comments affects user engagement with news. Inform Communication Soc. 2022;25(8):1115–34.
9. Obamiro K, West S, Lee S. Like, comment, tag, share: Facebook interactions in health research. Int J Med Informatics. 2020;137:104097.
10. Niu Z, Hu L, Jeong DC, Brickman J, Stapleton JL. An experimental investigation into promoting mental health service use on social media: effects of source and comments. Int J Environ Res Public Health. 2020;17(21):7898.
11. Oser TK, Oser SM, Parascando JA, Hessler-Jones D, Sciamanna CN, Sparling K, et al. Social media in the diabetes community: a novel way to assess psychosocial needs in people with diabetes and their caregivers. Curr Diab Rep. 2020;20(3):10.
12. Yaagoob E, Hunter S, Chan S. The effectiveness of social media intervention in people with diabetes: An integrative review. J Clin Nurs. 2023;32(11–12):2419–32.
13. Fergie G, Hunt K, Hilton S. Social media as a space for support: young adults' perspectives on producing and consuming user-generated content about diabetes and mental health. Soc Sci Med. 2016;170:46–54.
14. Etta RE, Babatunde AO, Okunlola PO, Akanbi OK, Adegoroye KJ, Adepoju RA, et al. The Assessment of TikTok as a Source of Quality Health Information on Human Papillomavirus: A Content Analysis. Cureus. 2024;16(12):e75419.
15. Zhang B, Kalampakorn S, Powwattana A, Sillabutra J, Liu G. Oral Diabetes Medication Videos on Douyin: Analysis of Information Quality and User Comment Attitudes. JMIR Form Res. 2024;8:e57720.
16. Wong HPN, So WZ, Senthamil Selvan V, Lee JY, Ho CERH, Tiong HY. A cross-sectional quality assessment of TikTok content on benign prostatic hyperplasia. World J Urol. 2023;41(11):3051–7.
17. Chen X, Liu Y. Chinese libraries’ communication influence based on the Douyin communication index. Library Hi Tech. 2024.
18. Chen S, Zhao XX, Mohamad E, Azlan AA. Quality vs. Reach in Health Short Videos: A Dual-Path Test of the Heuristic–Systematic Model. Malaysian J Communication. 2025;41.
19. Bi J-W, Qin F, Huang C. Social media communication index in tourism forecasting. Curr Issues Tourism. 2025:1–21.
20. Yuan Y, Li Y, Sun H. Utilizing Multidimensional Features to Predict the Dissemination-Force of Emergency Short Videos. Proceedings of the 24th ACM/IEEE Joint Conference on Digital Libraries; 2024.
21. Bilić P. Search algorithms, hidden labour and information control. Big Data Soc. 2016;3(1):2053951716652159.
22. Myneni S, Lewis B, Singh T, Paiva K, Kim SM, Cebula AV, et al. Diabetes self-management in the age of social media: large-scale analysis of peer interactions using semiautomated methods. JMIR Med Inf. 2020;8(6):e18441.
23. Elnaggar A, Ta Park V, Lee SJ, Bender M, Siegmund LA, Park LG. Patients’ use of social media for diabetes self-care: systematic review. J Med Internet Res. 2020;22(4):e14209.
24. Kjærulff EM, Andersen TH, Kingod N, Nexø MA. When people with chronic conditions turn to peers on social media to obtain and share information: systematic review of the implications for relationships with health care professionals. J Med Internet Res. 2023;25(1):e41156.
25. Roblin DW. The potential of cellular technology to mediate social networks for support of chronic disease self-management. J Health Communication. 2011;16(sup1):59–76.
26. Kang GJ, Ewing-Nelson SR, Mackey L, Schlitt JT, Marathe A, Abbas KM, et al. Semantic network analysis of vaccine sentiment in online social media. Vaccine. 2017;35(29):3621–38.
27. Chua CEH, Storey VC, Li X, Kaul M. Developing insights from social media using semantic lexical chains to mine short text structures. Decis Support Syst. 2019;127:113142.
28. Kou F, Du J, He Y, Ye L. Social network search based on semantic analysis and learning. CAAI Trans Intell Technol. 2016;1(4):293–302.
29. Liu W, Lai C-H, Xu WW. Tweeting about emergency: A semantic network analysis of government organizations’ social media messaging during Hurricane Harvey. Public Relations Rev. 2018;44(5):807–19.
30. Fortunato S, Barthelemy M. Resolution limit in community detection. Proceedings of the National Academy of Sciences. 2007;104(1):36–41.
31. Chen S, Wang Z-Z, Tang L, Tang Y-N, Gao Y-Y, Li H-J, et al. Global vs local modularity for network community detection. PLoS ONE. 2018;13(10):e0205284.
32. Albalawi R, Yeap TH, Benyoucef M. Using topic modeling methods for short-text data: A comparative analysis. Front Artif Intell. 2020;3:42.
33. Murshed BAH, Mallappa S, Abawajy J, Saif MAN, Al-Ariki HDE, Abdulwahab HM. Short text topic modelling approaches in the context of big data: taxonomy, survey, and analysis. Artif Intell Rev. 2023;56(6):5133–260.
34. Chen R, Chen G, Zhang L, Xie R, Chen R. An analysis of the factors influencing engagement metrics within the dissemination of health science misinformation. Front Public Health. 2025;13:1571210.
35. Zhou B, Zhu Y, Mao X. Sentiment analysis on power rationing Micro blog comments based on SnowNLP-SVM-LDA model. Highlights Sci Eng Technol. 2022;4:179–85.
36. Lu S, Liu Q, Zhang Z. Sentiment analysis of weibo platform based on lda-snownlp model. 2023 2nd International Conference on Machine Learning, Cloud Computing and Intelligent Mining (MLCCIM); 2023: IEEE.
37. Wahyuni ED, Suryanto TLM, Arviani H. Deep Learning Multimodal Sarcasm Detection in Social Media Comments: The Role of Memes and Emojis. J Artif Intell Technol. 2025;5:192–201.
38. Bhargava N, Radaideh MI, Kwon OH, Verma A, Radaideh MI. On the Impact of Language Nuances on Sentiment Analysis with Large Language Models: Paraphrasing, Sarcasm, and Emojis. arXiv preprint arXiv:2504.05603. 2025.
39. Ahmad H, Khan N, Shahid S. Mitigating Toxicity in Social Media: Redesign Guidelines for Cultivating Positive User Interactions in the Instagram Threads App. Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems; 2025.
40. Aragón P, Gómez V, Kaltenbrunner A. To thread or not to thread: The impact of conversation threading on online discussion. Proceedings of the International AAAI Conference on Web and Social Media; 2017.

Supplementary Files

S1FigPRISMA.png
S4Fig.Commentcountsamongthemostactiveusers.png
S3FigAheatmapofthekeywordcooccurrence.png
S2Figwordcloud.png
S3Table.Cooccurringkeywordpairs.xlsx
S2TableTop50highfrequencyterms.xlsx
S6TableCommentLengthbySentiment.xlsx
S5Table.AggregatedusertopicflowdatausedtoconstructtheSankeydiagram.xlsx
S1Table.CharacteristicsofsampledDouyinvideosincludedforcomment.xlsx
S7Overallsentimentdistribution.xlsx
S4.Topicdistributionacrossthefourdisseminationquadrants..docx

Reviewers invited by journal: 29 Jan, 2026
Editor invited by journal: 08 Jan, 2026
Editor assigned by journal: 06 Jan, 2026
Submission checks completed at journal: 06 Jan, 2026
First submitted to journal: 28 Dec, 2025
Author affiliations: Shan Chen, Emma Mohamad (corresponding author), Arina Azlan, and xixi Zhao, National University of Malaysia.

Figure 1 Top 20 high-frequency terms in diabetes-related Douyin comments

Figure 2 Keyword Co-occurrence Network
Note: Nodes represent keywords, with node size proportional to keyword frequency. Edges indicate co-occurrence relationships between keywords, with edge thickness reflecting co-occurrence frequency. Only keyword pairs with a co-occurrence frequency ≥ 6 are displayed for clarity.

Figure 3 Overall sentiment distribution of comments

Figure 4 Joint distribution of comment activity by weekday and hour
Note: The heatmap displays the number of comments by weekday and hour of day. Color intensity indicates comment volume, with darker shades representing higher numbers of comments. All counts are based on the raw comment corpus prior to text cleaning.

Figure 5 Sankey Diagram of User–Topic Participation
Note: The Sankey diagram visualizes aggregated comment flows from highly active users to identified topics. User nodes represent anonymized users with high comment activity, and topic nodes represent LDA-identified topics. Link width reflects the number of comments contributed by each user to each topic. All user identifiers were anonymized for visualization and reporting.
10:45:44","extension":"xlsx","order_by":9,"title":"","display":"","copyAsset":false,"role":"supplement","size":12490,"visible":true,"origin":"","legend":"","description":"","filename":"S1Table.CharacteristicsofsampledDouyinvideosincludedforcomment.xlsx","url":"https://assets-eu.researchsquare.com/files/rs-8464538/v1/4588df3d11ab9f18caef1d78.xlsx"},{"id":101753770,"identity":"ec1b20dc-8ea5-492c-b80b-1a35af95a62a","added_by":"auto","created_at":"2026-02-03 10:40:48","extension":"xlsx","order_by":10,"title":"","display":"","copyAsset":false,"role":"supplement","size":272666,"visible":true,"origin":"","legend":"","description":"","filename":"S7Overallsentimentdistribution.xlsx","url":"https://assets-eu.researchsquare.com/files/rs-8464538/v1/31bbc4b1364340628ab6ec36.xlsx"},{"id":101660777,"identity":"7622d173-2ce4-48c3-b2ac-b148da219368","added_by":"auto","created_at":"2026-02-02 10:45:44","extension":"docx","order_by":10,"title":"","display":"","copyAsset":false,"role":"supplement","size":45028,"visible":true,"origin":"","legend":"","description":"","filename":"S4.Topicdistributionacrossthefourdisseminationquadrants..docx","url":"https://assets-eu.researchsquare.com/files/rs-8464538/v1/2ba8a5ecfeb820f659093e7c.docx"}],"financialInterests":"No competing interests reported.","formattedTitle":"Topic and sentiment in comments on diabetes-related Douyin short videos: a cross-sectional text-mining study","fulltext":[{"header":"Background","content":"\u003cp\u003eDiabetes, as a typical chronic non-communicable disease, is highly dependent on continuous self-management behaviors for its management, including dietary control, regular physical activity, medication adherence, and blood glucose monitoring. It also requires continuous and reliable health information support. 
With the growth of the mobile Internet, the channels through which the public obtains health information have expanded from traditional offline health education and medical institutions to multi-platform digital media. Social media has become an important venue for disseminating chronic disease information and health education. Short video platforms (such as Douyin/TikTok) are reshaping the production, distribution and consumption of health information through rapid dissemination, strong visual expression, algorithm-driven recommendation and a low barrier to interaction. In the context of diabetes, short videos may promote health knowledge and self-management awareness, but because of heterogeneous content sources, a limited evidence base or oversimplified presentation, they may also lead to misinterpretation and pose potential health risks (\u003cspan additionalcitationids=\"CR2 CR3\" citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eExisting research on health communication and digital health mainly focuses on the intrinsic quality of short video content and the credibility of information sources, such as whether the video clearly conveys its communication purpose, whether it cites reliable information sources, whether it presents relatively balanced viewpoints, and whether it explains uncertainties. However, on short video platforms, the reach and visibility of content are largely influenced by user interaction metrics (such as likes, comments, favorites, and shares) as well as platform recommendation algorithms (\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e). 
Therefore, the reach of information does not necessarily align with its quality: widely disseminated content is not necessarily of high quality, and lower-quality information can also spread widely (\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e). The multimodal features of short videos (including visual, audio and text information) further increase the complexity of identifying and governing misleading health information on platforms such as TikTok. In response, recent studies have begun to detect health misinformation in short videos using multimodal methods (\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e). From a public health perspective, this misalignment warrants attention because exposure to inaccurate or one-sided information may shape risk perceptions and health-related decisions and, in turn, influence health literacy and chronic disease self-management outcomes.\u003c/p\u003e \u003cp\u003eCompared with research centered on the video content itself, the comment section, as an interactive space, carries users' understanding, reinterpretation and emotional expression of video information, providing an important window into the audience's information needs and the risk communication process. In comment interactions, information is exchanged through mechanisms such as \"ask - answer - correct\", often accompanied by emotional expressions such as empathy, worry, fear, gratitude and doubt. Such interactions may either promote peer support or trigger collective misunderstandings, thereby influencing the way information is received and its further dissemination (\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e, \u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e). 
Experimental evidence in health communication suggests that the valence of social media comments can shape affective trust and, together with source cues, influence attitudes and behavioral intentions related to health information sharing and service use (\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e). For chronic conditions such as diabetes that require long-term self-management support, recurrent high-frequency questions, emotional cues, and reported behavioral difficulties in comments can reveal more immediate and authentic information needs, offering audience-level evidence to optimize health education content, improve platform governance, and inform clinical communication (\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e). Nevertheless, current evidence has at least three gaps. First, in the Chinese-language context, systematic quantitative evidence on comments under diabetes-related short videos on Douyin remains limited. Second, existing studies often treat comments as a single corpus for topic or sentiment analysis, but rarely link comment characteristics to video-level information quality and dissemination strength; as a result, it remains unclear whether audience responses differ systematically across dissemination contexts. Third, Chinese short-text comments are highly colloquial and frequently include emojis and very short strings, which may reduce the stability of topic and sentiment identification; therefore, transparent and reproducible preprocessing and analytic workflows are particularly important for improving verifiability (\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e, \u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eAgainst this background, we examined top-level comments on diabetes-related short videos on Douyin. 
Using text-mining approaches (word frequency analysis, keyword co-occurrence networks, topic modeling, and sentiment analysis), we systematically characterized discussion foci, semantic structures, and emotional features in the comment sections, and described temporal activity patterns and participation characteristics of highly active users. To enhance coverage across dissemination contexts, we drew on a pre-established video evaluation database to construct a four-quadrant stratification framework (information quality \u0026times; dissemination strength) and sampled videos from each quadrant for comment collection. We addressed the following questions: (1) What topics do audiences primarily focus on in comments under diabetes-related Douyin short videos? (2) What semantic association structures emerge among keywords? (3) What latent topics and sentiment distributions are present? (4) What temporal and user-level patterns characterize comment participation? Our findings may inform platform governance of health-related content and strategies to better address information needs among people affected by diabetes.\u003c/p\u003e"},{"header":"Methods","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003eStudy design and study materials\u003c/h2\u003e \u003cp\u003eThis study was a cross-sectional, quantitative content analysis and text-mining investigation based on publicly available social media texts. The study materials comprised top-level (first-level) comments posted under diabetes-related short videos on Douyin, with each individual top-level comment treated as the minimum analytic unit. 
Only content visible to the public was included. We did not access private information, interact with users, or collect any personally identifiable information (e.g., phone numbers or national identification numbers). To further reduce the risk of re-identification, user identifiers were de-identified during analysis and reporting, and results are presented in aggregate form only.\u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003eVideo database and stratified sampling framework\u003c/h3\u003e\n\u003cp\u003eComments were collected from videos drawn from a diabetes-related Douyin short-video evaluation database previously established by the research team (n\u0026thinsp;=\u0026thinsp;276). For each video in this database, information quality had been rated and dissemination strength had been computed. Using the two dimensions of \u0026ldquo;information quality\u0026rdquo; and \u0026ldquo;dissemination strength,\u0026rdquo; we constructed a four-quadrant stratification framework to guide stratified sampling of videos for comment collection, thereby covering audience interactions across different dissemination contexts.\u003c/p\u003e\n\u003ch3\u003eInformation quality assessment\u003c/h3\u003e\n\u003cp\u003eVideo-level information quality was evaluated using the modified DISCERN (mDISCERN) scale. mDISCERN assesses the reliability of health information along several dimensions: whether the aims are clear, whether the information sources are credible, whether the content is presented in a balanced way, whether additional sources of information are provided, and whether areas of uncertainty are explained (\u003cspan additionalcitationids=\"CR15\" citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e). 
Each video was independently scored by two trained evaluators, and any scoring differences were resolved through discussion. The resulting final consensus total score (SUM) served as the indicator of the information quality dimension, with higher scores indicating higher information quality.\u003c/p\u003e\n\u003ch3\u003eDissemination strength: Douyin Communication Index (DCI)\u003c/h3\u003e\n\u003cp\u003eDissemination strength was operationalized using the Douyin Communication Index (DCI), a continuous composite indicator intended to reflect overall diffusion and audience engagement on the platform. The DCI was computed as a weighted sum of publicly visible engagement metrics on the video page:\u003c/p\u003e \u003cp\u003eDCI\u0026thinsp;=\u0026thinsp;0.46 \u0026times; number of likes\u0026thinsp;+\u0026thinsp;0.37 \u0026times; number of favorites\u0026thinsp;+\u0026thinsp;0.17 \u0026times; number of shares. The weighting scheme was derived from the research team\u0026rsquo;s prior work on constructing dissemination metrics (\u003cspan additionalcitationids=\"CR18 CR19\" citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e). To improve comparability across records, the engagement metrics used to compute the DCI were extracted at a single standardized time point when the video database was created (cumulative counts). When engagement fields were missing, data were handled according to pre-specified rules (e.g., exclusion or imputation as zero; see S1 Table). Higher DCI values indicate stronger dissemination.\u003c/p\u003e\n\u003ch3\u003eQuadrant classification and cut-off values\u003c/h3\u003e\n\u003cp\u003eWe constructed a four-quadrant stratification framework using video information quality (final consensus mDISCERN sum score) and dissemination strength (DCI). 
Because platform engagement metrics are typically highly right-skewed\u0026mdash;often driven by a small number of viral videos\u0026mdash;mean-based thresholds can be overly influenced by extreme values and may yield unstable classifications. We therefore used the median as a robust, distribution-free cut-off that is less sensitive to outliers.\u003c/p\u003e \u003cp\u003eA further practical advantage of median-based cut-offs is that they tend to produce more balanced group sizes, which improves comparability across quadrants and reduces the risk that descriptive patterns are dominated by a small subgroup. To make the classification rule fully deterministic and reproducible, values equal to the median were pre-specified to be assigned to the \u0026ldquo;low\u0026rdquo; category.\u003c/p\u003e \u003cp\u003eImportantly, \u0026ldquo;high\u0026rdquo; versus \u0026ldquo;low\u0026rdquo; in this study was not intended to represent clinical or normative thresholds. Rather, the cut-offs were chosen to support stratified sampling and ensure coverage of different dissemination contexts (including potential \u0026ldquo;quality\u0026ndash;reach mismatch\u0026rdquo; scenarios). Median split is therefore an appropriate choice for this exploratory, context-coverage purpose.\u003c/p\u003e \u003cdiv id=\"Sec8\" class=\"Section2\"\u003e \u003ch2\u003eVideo selection and sampling\u003c/h2\u003e 
\u003cp\u003eTo reduce potential bias arising from short-term trending events and fluctuations in platform ranking, we conducted standardized searches and sample verification at three adjacent time points: September 8, 15, and 22, 2025 (UTC\u0026thinsp;+\u0026thinsp;8) at approximately 20:00 each day. Each search was performed while logged out, using the same device and network environment whenever possible. Browser/app histories and caches were cleared prior to searching to minimize personalization effects. 
These repeated searches were used to improve the robustness of the sample with respect to the contemporaneous \u0026ldquo;visible information environment\u0026rdquo; on the platform; comment data were collected as a one-time capture for the selected videos within a predefined collection window, and the overall design remained cross-sectional rather than longitudinal.\u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003eSearch strategy and sampling frame\u003c/h3\u003e\n\u003cp\u003eWithin the Douyin app, we searched the Chinese keyword \u0026ldquo;糖尿病\u0026rdquo; (\u0026ldquo;diabetes\u0026rdquo;) and obtained the set of videos visible in the search results list using the platform\u0026rsquo;s default ranking. Because this list is generated by algorithmic ranking, it approximates the information environment that a typical user could encounter through keyword search (\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e). As platform presentation may vary by device, location, and time, the present sample represents the set of search results visible under the standardized conditions at the specified time points, rather than an exhaustive sample of all diabetes-related videos on the platform.\u003c/p\u003e\n\u003ch3\u003eStratified sampling across quadrants\u003c/h3\u003e\n\u003cp\u003eWe used stratified sampling based on the four-quadrant framework, selecting six videos from each quadrant (total n\u0026thinsp;=\u0026thinsp;24) as targets for comment collection. 
This design aimed to (1) cover diverse dissemination contexts and reduce the risk that the sample would be dominated by a small number of highly popular videos; and (2) explicitly include \u0026ldquo;quality\u0026ndash;dissemination mismatch\u0026rdquo; contexts (e.g., high dissemination but low quality; high quality but low dissemination), reflecting the content structure users may encounter in keyword-based searches.\u003c/p\u003e \u003cp\u003eWithin each quadrant, candidate videos were ranked by DCI in descending order and included sequentially until six videos were selected, subject to pre-specified rules: (1) avoiding repeated sampling from the same account where possible; (2) covering different content types (e.g., health education, experiential sharing, public narratives); and (3) ensuring that video links were accessible and top-level comments were visible at the time of collection. The titles, URLs, quadrant assignments, mDISCERN (SUM) scores, and DCI values of the 24 videos are provided in S1 Table to facilitate review and verification.\u003c/p\u003e \u003cdiv id=\"Sec11\" class=\"Section2\"\u003e \u003ch2\u003eComment collection and analytic corpus\u003c/h2\u003e \u003cdiv id=\"Sec12\" class=\"Section3\"\u003e \u003ch2\u003eData extraction and variables\u003c/h2\u003e \u003cp\u003eFor each of the 24 videos, we captured all publicly visible top-level comments at the time of collection, yielding a raw comment corpus of 3,933 comments. 
Extracted fields included comment text, posting timestamp, user identifier (used only for de-identified frequency statistics), a unique comment identifier (comment_id) for deduplication, and the video\u0026rsquo;s quadrant label for stratified description and comparison. Comments were collected using an automated script/tool that iterated through paginated results until no further comments were available.\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv id=\"Sec13\" class=\"Section2\"\u003e \u003ch2\u003eData cleaning and exclusions\u003c/h2\u003e \u003cp\u003eTo improve corpus quality and the interpretability and reproducibility of text-mining results, we applied a pre-specified, rule-based cleaning pipeline to the raw corpus and recorded the number of exclusions at each step (\u003cb\u003eS1 Fig\u003c/b\u003e):\u003c/p\u003e \u003cp\u003eRemoval of duplicates: duplicate records were identified and removed using comment_id (50 removed);\u003c/p\u003e \u003cp\u003eRemoval of advertisements/solicitations and irrelevant content: comments clearly containing marketing, solicitation, or unrelated content were excluded (360 removed);\u003c/p\u003e \u003cp\u003eRemoval of non-linguistic texts: comments consisting only of emojis, special symbols, whitespace, or otherwise uninterpretable characters were excluded (1,123 removed);\u003c/p\u003e \u003cp\u003eRemoval of non-informative short comments: very short comments with insufficient semantic information to support downstream analyses were excluded (393 removed; definition in Section \u0026ldquo;Operational definition of non-informative short comments\u0026rdquo;).\u003c/p\u003e \u003cp\u003eAfter cleaning, the analytic corpus comprised 2,007 valid comments and was used for all text-based analyses (word frequency, co-occurrence, topic modeling, and sentiment). 
When describing overall sample size, time coverage, and temporal distributions, we report the raw corpus (n\u0026thinsp;=\u0026thinsp;3,933); for modeling and statistical summaries based on text content, the denominator was the analytic corpus (n\u0026thinsp;=\u0026thinsp;2,007).\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec14\" class=\"Section2\"\u003e \u003ch2\u003eText preprocessing\u003c/h2\u003e \u003cdiv id=\"Sec15\" class=\"Section3\"\u003e \u003ch2\u003ePreprocessing procedures\u003c/h2\u003e \u003cp\u003eWe applied a standardized preprocessing pipeline to the analytic corpus (n\u0026thinsp;=\u0026thinsp;2,007):\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e"},{"header":"Results","content":"\u003cdiv id=\"Sec24\" class=\"Section2\"\u003e \u003ch2\u003eCorpus overview and data integrity\u003c/h2\u003e \u003cdiv id=\"Sec25\" class=\"Section3\"\u003e \u003ch2\u003eSample size and time span\u003c/h2\u003e \u003cp\u003eA total of 3,933 top-level comments were collected from diabetes-related short videos on Douyin. Comment timestamps ranged from 13:28 on March 15, 2019 to 22:07 on December 13, 2025. By year, most comments were posted in 2025 (82.00%), followed by 2024 (10.88%); smaller proportions were observed for 2019 and 2023 (3.64% and 3.48%, respectively). Given the cross-sectional data collection design, this time span reflects the cumulative historical comments visible at the time of data capture rather than longitudinal tracking of comments for the same videos over time.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec26\" class=\"Section3\"\u003e \u003ch2\u003eData completeness and analytic corpus\u003c/h2\u003e \u003cp\u003eAmong the 3,933 exported comments, 98 had blank comment text, which may be attributable to platform display restrictions, comment deletion, or missing fields returned by the interface. 
To improve interpretability and reproducibility of text-based analyses, we applied a rule-based cleaning pipeline prior to analysis, including deduplication, removal of advertisements/irrelevant content, exclusion of non-linguistic texts, and exclusion of non-informative short comments (S1 Fig). After cleaning, 2,007 valid comments remained and constituted the analytic corpus. Unless otherwise specified, all content-based analyses (word frequency, keyword co-occurrence, topic modeling, and sentiment analysis) were conducted on the analytic corpus (n\u0026thinsp;=\u0026thinsp;2,007). Descriptions of sample size, temporal distributions, and participation structure are reported using the raw corpus (n\u0026thinsp;=\u0026thinsp;3,933).\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec27\" class=\"Section3\"\u003e \u003ch2\u003eComment length and text characteristics\u003c/h2\u003e \u003cp\u003eWithin the analytic corpus (n\u0026thinsp;=\u0026thinsp;2,007), the median comment length was 14 Chinese characters (IQR: 8\u0026ndash;25), with a mean of 21.16 characters; the longest comment contained 459 characters. Overall, comments were predominantly short texts, although a small number were relatively long and typically contained more complete illness narratives, symptom descriptions, or treatment experiences.\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv id=\"Sec28\" class=\"Section2\"\u003e \u003ch2\u003eUser participation and question-type comments\u003c/h2\u003e \u003cp\u003eAcross the 3,933 raw comments, 3,786 unique users were identified. Most users posted only once (96.88%), whereas 118 users (3.12%) posted two or more comments, indicating a participation pattern characterized by broad engagement with low posting frequency. 
Using heuristic rules to identify interrogative expressions (presence of \u0026ldquo;?\u0026rdquo; or interrogative triggers such as \u0026ldquo;吗\u0026rdquo;, \u0026ldquo;怎么\u0026rdquo;, \u0026ldquo;多少\u0026rdquo;, \u0026ldquo;要不要\u0026rdquo;, and \u0026ldquo;呢\u0026rdquo;), 644 question-type comments were detected, accounting for 16.37% (644/3,933) of raw comments. In the analytic corpus, question-type comments were slightly shorter than non-question comments (mean length 19.69 vs. 22.11 characters; median 13 vs. 15 characters).\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec29\" class=\"Section2\"\u003e \u003ch2\u003eCoverage across quadrants\u003c/h2\u003e \u003cp\u003eVideos used for comment collection were selected using the four-quadrant stratification framework defined by information quality (mDISCERN final consensus sum score, SUM) and dissemination strength (DCI). Six videos were sampled from each quadrant (24 videos in total), and all publicly visible top-level comments were captured for each selected video. For reproducibility, the titles, URLs, quadrant assignments, mDISCERN (SUM) scores, and DCI values for the 24 videos are provided in S1 Table.\u003c/p\u003e \u003cp\u003eAt the raw comment level, the number of comments per quadrant was: low quality/low dissemination (819), low quality/high dissemination (1,109), high quality/low dissemination (919), and high quality/high dissemination (1,086). Using the number of top-level comments per video as the unit of description, the median (IQR) comments per video were 155.5 (145.25\u0026ndash;169.50) for low quality/low dissemination, 186.0 (178.00\u0026ndash;193.25) for low quality/high dissemination, 166.0 (153.00\u0026ndash;173.00) for high quality/low dissemination, and 178.0 (172.75\u0026ndash;192.25) for high quality/high dissemination. 
One video in the low quality/low dissemination quadrant had no visible top-level comments at the time of collection, as noted in S1 Table.\u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003eRetention of valid comments across quadrants\u003c/h3\u003e\n\u003cp\u003eAfter rule-based cleaning of the raw corpus (n\u0026thinsp;=\u0026thinsp;3,933), 2,007 valid comments were retained. To assess whether the cleaning pipeline resulted in differential exclusions across dissemination contexts\u0026mdash;which could potentially influence subsequent topic and sentiment analyses\u0026mdash;we compared the correspondence between raw and valid comments in each quadrant and calculated retention rates (valid/raw).\u003c/p\u003e \u003cp\u003eRetention rates varied across quadrants: 52.99% (434/819) for low quality/low dissemination, 58.22% (535/919) for high quality/low dissemination, 56.63% (615/1,086) for high quality/high dissemination, and the lowest retention for low quality/high dissemination at 38.14% (423/1,109). This pattern suggests that in the high-dissemination but low-quality context, raw comments were more likely to include advertisements/solicitations, non-linguistic texts, or semantically sparse short comments and were therefore more frequently removed during cleaning; retention in the other quadrants clustered at approximately 53%\u0026ndash;58%.\u003c/p\u003e \u003cdiv id=\"Sec31\" class=\"Section2\"\u003e \u003ch2\u003eWord frequency patterns and semantic categories\u003c/h2\u003e \u003cdiv id=\"Sec32\" class=\"Section3\"\u003e \u003ch2\u003eHigh-frequency terms\u003c/h2\u003e \u003cp\u003eAfter Chinese word segmentation, stop-word removal, and synonym normalization, we computed the top 50 high-frequency terms in the analytic corpus. The most frequent terms included \u0026ldquo;diabetes\u0026rdquo;, \u0026ldquo;blood glucose\u0026rdquo;, \u0026ldquo;fasting\u0026rdquo;, \u0026ldquo;doctor\u0026rdquo;, and \u0026ldquo;insulin\u0026rdquo;. 
Document frequencies (df) and relative frequencies for the top 20 terms are reported in Table\u0026nbsp;1. The full list of the top 50 terms is available in S2 Table. The top 20 high-frequency terms are visualized in Fig.\u0026nbsp;1. A word cloud of the full corpus vocabulary is provided in S2 Fig.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eTop 20 high-frequency terms\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"3\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eTerm (Chinese)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eDocument frequency (df)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eRelative frequency (%)\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eDiabetes\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e243\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e6.18\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eBlood glucose\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e166\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" 
char=\".\" colname=\"c3\"\u003e \u003cp\u003e4.22\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eFasting\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e151\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e3.84\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eDoctor\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e140\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e3.56\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eInsulin\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e94\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e2.39\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eTeacher*\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e79\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e2.01\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eSafe / stable\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e67\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e1.70\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003ePostprandial\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e52\u003c/p\u003e \u003c/td\u003e \u003ctd 
align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e1.32\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eEvery day\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e46\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e1.17\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eBeverages\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e46\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e1.17\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eHospital\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e45\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e1.14\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eTears / cry\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e44\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e1.12\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eExercise\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e44\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e1.12\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eControl\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e43\u003c/p\u003e \u003c/td\u003e 
\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e1.09\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eHigh blood glucose\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e41\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e1.04\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eEating\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e39\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.99\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eVinegar\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e38\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.97\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMedication\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e37\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.94\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eAfter meals\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e35\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.89\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eRice\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e35\u003c/p\u003e 
\u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.89\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e \u003c/div\u003e\n\u003ch3\u003eNetwork structure and community detection\u003c/h3\u003e\n\u003cp\u003eTo depict the overall structure of term associations, we constructed an undirected weighted keyword co-occurrence network based on the top 50 high-frequency terms, with edge weights defined by comment-level co-occurrence counts. For visualization and to reduce noise from incidental co-occurrences, a filtered display network was constructed for the main text (Fig.\u0026nbsp;2) by retaining only term pairs with co-occurrence\u0026thinsp;\u0026ge;\u0026thinsp;5 and selecting the top K\u0026thinsp;=\u0026thinsp;30 edges ranked by co-occurrence frequency.\u003c/p\u003e \u003cp\u003eThe resulting display network contained 20 nodes and 30 edges, with a network density of 0.158 (2\u0026thinsp;\u0026times;\u0026thinsp;30\u0026thinsp;/\u0026thinsp;(20\u0026thinsp;\u0026times;\u0026thinsp;19)) and a mean degree of 3.00 (2\u0026thinsp;\u0026times;\u0026thinsp;30\u0026thinsp;/\u0026thinsp;20). In Fig.\u0026nbsp;2, node size represents degree, while edge width and labels indicate co-occurrence counts, highlighting hubs and clustering patterns among keywords. Community detection using the weighted greedy_modularity_communities algorithm identified three communities, with a weighted modularity of Q_weighted\u0026thinsp;=\u0026thinsp;0.274, indicating a discernible modular structure in the co-occurrence network.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e\n\u003ch3\u003eHub terms and semantic clusters\u003c/h3\u003e\n\u003cp\u003eIn the display network, high-degree hub terms were primarily indicator-related concepts (e.g., \u0026ldquo;diabetes\u0026rdquo;, \u0026ldquo;blood glucose\u0026rdquo;, \u0026ldquo;fasting\u0026rdquo;), suggesting that the semantic core of comment discourse centered on interpretation of glycemic indicators and self-management. 
Examination of key connectors further indicated that some terms bridged different semantic substructures (e.g., terms related to diet or treatment), reflecting linkages across \u0026ldquo;indicators\u0026ndash;behavior\u0026ndash;healthcare/medication.\u0026rdquo; Substantively, the three communities detected on the filtered network (Q_weighted\u0026thinsp;=\u0026thinsp;0.274) could be summarized into two dominant semantic domains: one emphasizing symptom experience and dietary triggers/risk perceptions (e.g., beverages/sugary drinks), and another emphasizing indicator interpretation, control strategies, and healthcare-seeking/medication (e.g., fasting, blood glucose, glycated hemoglobin, medication, doctor/hospital). Full community membership lists and parameter settings are recommended for presentation in the supplementary materials (S2 Table) to support verification.\u003c/p\u003e \u003cdiv id=\"Sec37\" class=\"Section2\"\u003e \u003ch2\u003eTopic modeling\u003c/h2\u003e \u003cdiv id=\"Sec38\" class=\"Section3\"\u003e \u003ch2\u003eIdentified topics and prevalence\u003c/h2\u003e \u003cp\u003eLDA topic modeling of the analytic corpus (n\u0026thinsp;=\u0026thinsp;2,007) yielded five latent topics (K\u0026thinsp;=\u0026thinsp;5). The representative keywords, number of comments, and proportion for each topic are presented in Table\u0026nbsp;2. The topic structure was highly concentrated, with a single topic dominating the discussion. Topic 2 (centered on diabetes self-management content such as interpretation of fasting/blood glucose readings, doctor consultation, and insulin-related discussions) accounted for 89.0% (1,786/2,007) of all comments. 
The remaining topics were comparatively small: Topic 5 (emotional expressions accompanied by mentions of diabetes, doctors, and insulin; 5.5%, 111/2,007), Topic 3 (mainly expressions of gratitude and support; 2.7%, 55/2,007), Topic 1 (discussions of the timing and context of fasting or postprandial blood glucose measurement; 2.4%, 48/2,007), and Topic 4 (scattered mentions of health-related topics; 0.3%, 7/2,007).\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eLatent topics identified by LDA topic modeling\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"5\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eTopic\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eTopic label\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eTop keywords\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003en\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003e%\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e1\u003c/p\u003e 
\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eBlood glucose monitoring and abnormal indicators\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e血糖高 (high blood glucose), 医院 (hospital), 天天 (every day), 测 (measure), 空腹 (fasting), 血糖 (blood glucose), 检查 (check-up)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e48\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e2.4\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eGlycaemic interpretation and daily self-management\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e空腹 (fasting), 血糖 (blood glucose), 胰岛素 (insulin), 老师 (teacher), 做 (do), 餐后 (postprandial)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1786\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e89.0\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eProfessional medical advice and medication consultation\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e医生 (doctor), 黄医生 (Dr. Huang), 空腹 (fasting), 血糖 (blood glucose), 吃药 (taking medication), 糖尿病 (diabetes)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e55\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e2.7\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eDietary preferences and glycaemic control conflicts\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e吃 (eat), 辣 (spicy), 爱吃 (love to eat), 甜 (sweet), 血糖 (blood glucose)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" 
colname=\"c4\"\u003e \u003cp\u003e7\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.3\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eDiabetes type awareness and basic disease knowledge\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e一型 (type 1), 二型 (type 2), 糖尿病 (diabetes), 年龄 (age), 治愈 (cure)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e111\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e5.5\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eThe other four topics accounted for small proportions but remained semantically distinguishable. Topic 1 mainly involved the timing and context of blood glucose measurement (e.g., fasting and postprandial readings). Topic 3 consisted mainly of emotionally supportive content, with common keywords including \"rose\", \"heart\", and \"thanks\". Topic 4 was extremely rare and contained only a few scattered, semantically diffuse mentions of health-related matters. Topic 5 was characterized by emoji-based or emotive expressions (e.g., \"covering one's face\"), accompanied by keywords such as diabetes, doctor, and insulin, suggesting that it mainly reflected emotionally toned discussion.\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv id=\"Sec39\" class=\"Section2\"\u003e \u003ch2\u003eInterpretation of the single-dominant topic structure\u003c/h2\u003e \u003cp\u003eTopic 2 accounted for by far the largest share of the corpus. This concentration may be related to the expressive characteristics of short-text comments and to the focus of the corpus. 
First, short-video comments typically take the form of brief statements, numerical reports, or direct questions. Because the texts are short and lexically similar, many comments cluster together in semantic space, which produces topic concentration. Second, the corpus focuses on diabetes-related content, so discussion tends to center on indicators such as blood glucose and fasting blood glucose and on practical information about medical treatment and medication (e.g., insulin), which may further amplify this concentration.\u003c/p\u003e \u003cp\u003eWe therefore regard Topic 2 as the core discussion thread in the comment sections. The other topics are better understood as marginal types or supplementary concerns than as parallel topics of comparable size. Over time, the topic distribution may fluctuate across periods; overall, however, Topic 2 remained dominant throughout the observation window. 
Because no formal time-series tests were conducted, these temporal features are reported only as descriptive indications and do not support inference about dynamic mechanisms or causality.\u003c/p\u003e \u003cdiv id=\"Sec40\" class=\"Section3\"\u003e \u003ch2\u003eTopic distribution across quadrants\u003c/h2\u003e \u003cp\u003eTo examine whether the topic structure of comments differed across dissemination contexts, we descriptively compared the topic distribution within each quadrant of the \"information quality \u0026times; dissemination strength\" framework, using the full analytic corpus (n\u0026thinsp;=\u0026thinsp;2,007) (see Supplementary Table \u003cspan refid=\"MOESM4\" class=\"InternalRef\"\u003eS4\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eTopic 2 dominated in all four quadrants, with its share ranging from 65.01% to 84.33%. Its share was highest in the low-quality/low-dissemination quadrant (84.33%) and lowest in the low-quality/high-dissemination quadrant (65.01%), where the relative share of non-dominant topics increased, mainly Topic 4 (14.42%) and Topic 3 (12.06%). This result suggests that, where reach is higher but information quality is lower, comment sections may contain more marginal or context-driven expressions. 
Because non-dominant topics account for a small share of the overall corpus, these differences are reported descriptively only, without statistical inference or causal interpretation.\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e\n\u003ch3\u003eSentiment analysis\u003c/h3\u003e\n\u003cp\u003eIn the analytic corpus (n\u0026thinsp;=\u0026thinsp;2,007), SnowNLP-based three-category sentiment classification indicated that 1,850 comments were neutral (92.18%), 137 were positive (6.83%), and 20 were negative (1.00%). The overall sentiment distribution is shown in Fig.\u0026nbsp;3. Under the fixed thresholds used in this study (negative: score\u0026thinsp;\u0026lt;\u0026thinsp;0.35; neutral: 0.35\u0026ndash;0.65; positive: score\u0026thinsp;\u0026gt;\u0026thinsp;0.65), the predominance of neutral sentiment may reflect the fact that many comments were factual statements, numeric reporting of glycemic values, or question-type expressions. Because the distribution is also sensitive to the underlying model and threshold selection, sentiment findings are interpreted descriptively.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e\n\u003ch3\u003eSentiment by topic\u003c/h3\u003e\n\u003cp\u003eCross-tabulation of sentiment labels with LDA topics showed that the distribution of sentiment across topics was not uniform. Overall, negative comments were concentrated primarily within the dominant topic (Topic 2: glycemic indicators and self-management). In addition, small numbers of positive expressions were observed in some non-dominant topics (e.g., Topics 3 and 5). Given the overwhelming prevalence of Topic 2, these patterns may partly reflect a \u0026ldquo;base-rate\u0026rdquo; effect driven by topic size. 
Therefore, we report the observed distributions (Table\u0026nbsp;3) without inferential interpretation regarding differences in sentiment across topics.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab3\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eDistribution of sentiment categories by LDA topic (number of comments)\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"4\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eTopic\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eNegative\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eNeutral\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003ePositive\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e14\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" 
colname=\"c2\"\u003e \u003cp\u003e20\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e1760\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e132\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e60\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e4\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e11\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e\n\u003ch3\u003eSentiment category and comment length\u003c/h3\u003e\n\u003cp\u003eComment length differed by sentiment category. 
Within the analytic corpus (n\u0026thinsp;=\u0026thinsp;2,007), negative comments were longer on average (mean 50.1 characters; median 30) than positive comments (mean 33.2; median 19) and neutral comments (mean 21.1; median 15). This suggests that although negative comments were rare, they more often contained detailed symptom experiences, illness narratives, or expressions of concern, and thus tended to carry higher information density (S6 Table).\u003c/p\u003e\n\u003ch3\u003eSentiment distribution across quadrants\u003c/h3\u003e\n\u003cp\u003eTo assess whether sentiment structure varied across dissemination contexts, we compared the three-category sentiment distribution across the four quadrants within the analytic corpus (n\u0026thinsp;=\u0026thinsp;2,007) (S7 Table). Neutral sentiment predominated in all quadrants (approximately 90.73%\u0026ndash;94.02%), with positive and negative comments accounting for relatively small proportions. Negative comments remained uncommon across quadrants but were relatively higher in the high-quality/high-dissemination quadrant (1.95%, 12/615) and lowest in the low-quality/low-dissemination quadrant (0.23%, 1/434). The low-quality/high-dissemination and high-quality/low-dissemination quadrants showed negative proportions of 0.95% (4/423) and 0.56% (3/535), respectively. The proportion of positive comments varied only modestly across quadrants (approximately 5.42%\u0026ndash;7.83%). Given the small number of negative comments overall (n\u0026thinsp;=\u0026thinsp;20), these differences are reported descriptively and are interpreted cautiously in light of methodological limitations.\u003c/p\u003e\n\u003ch3\u003eTemporal patterns\u003c/h3\u003e\n\u003cp\u003e \u003cb\u003eWeekday distribution\u003c/b\u003e \u003c/p\u003e \u003cp\u003eTemporal activity was described using the raw comment corpus (n\u0026thinsp;=\u0026thinsp;3,933). 
By weekday, comment volume was highest on Fridays (647 comments, 16.5%), followed by Tuesdays (594 comments, 15.1%) and Saturdays (588 comments, 15.0%). The lowest volume was observed on Mondays (506 comments, 12.9%).\u003c/p\u003e \u003cp\u003e \u003cb\u003eHourly patterns\u003c/b\u003e \u003c/p\u003e \u003cp\u003eBy hour of day, commenting activity was more concentrated in the evening. A total of 1,156 comments (29.4%) were posted between 18:00 and 22:00, with the single peak hour occurring at 20:00 (254 comments, 6.46%). The joint distribution by weekday and hour is shown in Fig.\u0026nbsp;4.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e\u003cp\u003e\u003cstrong\u003eHighly active users\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eAmong the 3,786 unique users identified, most users posted only a single comment. A total of 118 users (3.12%) posted two or more comments, and the two most active users each posted six comments. Contribution concentration was low: the top 10 users together contributed 41 comments (1.04% of all comments), indicating a participation structure dominated by broad engagement with low individual posting frequency (S7). Comment counts among the most active users are shown in S4 Fig.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eUser–topic participation flows\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eTo depict how highly active users participated across topics, we selected the users with the most comments and aggregated their comment counts by topic to draw a user-topic participation-flow Sankey diagram (Fig.\u0026nbsp;5). 
As shown in the figure, comments by highly active users were clearly concentrated at the topic level, converging mainly on Topic 5 (discussions related to blood glucose monitoring, fasting blood glucose, and medical treatment and medication), whereas relatively few of their comments fell under the other topics (Topics 1\u0026ndash;4).\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eThis result indicates that highly active users showed a marked topic-level preference, with Topic 5 occupying a core position in their overall participation structure. Note that the Sankey diagram is built from the aggregated user-topic flows of a subset of highly active users and should therefore not be compared directly with topic proportions calculated on the full corpus. All user identifiers were anonymized during visualization and reporting; the aggregated user-topic flow data used to draw the Sankey diagram are provided in Supplementary Table S5.\u003c/p\u003e"},{"header":"Discussion","content":"\u003cp\u003eBased on the cleaned analytic corpus of valid comments (n\u0026thinsp;=\u0026thinsp;2,007), this study applied quantitative text-mining methods to top-level comments under diabetes-related short videos on Douyin to characterize audience concerns, semantic association structures, latent topics, and sentiment distributions. Overall, comment discourse was highly concentrated on diabetes self-management, with a core focus on blood glucose monitoring and indicator interpretation (e.g., fasting glucose and glycated hemoglobin\u0026ndash;related expressions), closely linked to practical information on dietary control, medication consultation, and healthcare-seeking behaviors. 
This pattern suggests that audience interaction in short-video comment spaces is strongly practice-oriented, organized around an information-need chain of \u0026ldquo;indicator interpretation\u0026ndash;risk appraisal\u0026ndash;action selection\u0026rdquo; (\u003cspan additionalcitationids=\"CR23 CR24\" citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eDespite the brevity and fragmentation typical of short-text comments, keyword co-occurrence analysis revealed a structured semantic association network rather than purely incidental co-mentions. Similar semantic/co-occurrence network approaches have been used to characterize topical structure in short social-media texts (\u003cspan additionalcitationids=\"CR27 CR28\" citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eCommunity detection on co-occurrence networks, however, is known to be sensitive to term selection, thresholding decisions, and algorithmic parameters; modularity-based partitions may change under alternative settings and should be interpreted as descriptive structure rather than definitive boundaries. Accordingly, we use the identified communities primarily to support a qualitative interpretation of the corpus\u0026rsquo; semantic organization (\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e, \u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eTopic modeling results indicated that, although comment topics were diverse, the overall structure was highly concentrated. 
Topic 2 accounted for 89.0% of valid comments (1,786/2,007), with representative keywords centered on diabetes-related discussions involving fasting/blood glucose, doctor consultation, and insulin-related mentions. The remaining topics were comparatively minor, including Topic 5 (emoji-/expression-laden comments with diabetes/doctor/insulin mentions; 5.5%, 111/2,007), Topic 3 (gratitude and supportive expressions; 2.7%, 55/2,007), Topic 1 (blood glucose timing and measurement contexts such as fasting/postprandial; 2.4%, 48/2,007), and Topic 4 (miscellaneous health-related mentions; 0.3%, 7/2,007). Overall, these findings suggest that comment discourse is strongly practice-oriented and anchored in diabetes self-management and glycaemic monitoring, while also containing smaller proportions of affective or context-specific expressions. Across the \u0026ldquo;information quality \u0026times; dissemination strength\u0026rdquo; quadrants, Topic 2 remained predominant, while the relative shares of non-dominant topics were higher in the low-quality/high-dissemination quadrant (Supplementary Table \u003cspan refid=\"MOESM4\" class=\"InternalRef\"\u003eS4\u003c/span\u003e). Because non-dominant topics accounted for a small share of the overall corpus, these differences are reported descriptively only, without statistical inference or causal interpretation (\u003cspan additionalcitationids=\"CR33\" citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e). 
Given the small number of comments in non-dominant topics, this observation needs to be verified in larger samples or in future quadrant-based comparative studies.\u003c/p\u003e \u003cp\u003eSentiment analysis showed that comments were predominantly neutral, with relatively small proportions of positive and negative expressions. 
This distribution may reflect the large number of factual statements, blood glucose readings, and questions in the comments. Sentiment classification results are also influenced by model selection and threshold setting; the fixed SnowNLP thresholds (0.35/0.65) adopted in this study may have increased the proportion of comments classified as neutral (\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e, \u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e36\u003c/span\u003e). Although negative comments were relatively few, their content often corresponded to specific confusion, concerns, or uncomfortable experiences; this high information density gives them potential significance for public health risk communication.\u003c/p\u003e \u003cp\u003eBecause colloquial expression, sarcasm, emojis, and context dependence in Chinese may affect the accuracy of automated sentiment recognition, the results should be interpreted cautiously and descriptively. Future research could validate them by manually annotating sub-samples or by using more advanced Chinese pre-trained language models to enhance the robustness of the analysis (\u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e37\u003c/span\u003e, \u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e38\u003c/span\u003e). The quadrant-based comparison also shows that negative comments were rare in all scenarios, but relatively more frequent in the quadrants with high information quality and high dissemination intensity. 
Due to the limited absolute number of negative comments, this difference is only reported descriptively and no inferential interpretation is offered.\u003c/p\u003e \u003cp\u003eAt the methodological level, this study embedded comment analysis into a four-quadrant stratified sampling framework based on information quality and dissemination intensity, thereby covering multiple dissemination scenarios and reducing the risk that the results are dominated by a few high-popularity or \"blockbuster\" videos. By explicitly incorporating potential risk scenarios (such as high dissemination but low information quality), this framework provides an operational path for examining audience response patterns under conditions of \"quality-dissemination mismatch\" (\u003cspan citationid=\"CR39\" class=\"CitationRef\"\u003e39\u003c/span\u003e, \u003cspan citationid=\"CR40\" class=\"CitationRef\"\u003e40\u003c/span\u003e). The analysis in this study is mainly descriptive, and no formal statistical tests were conducted on differences in topic structure, question ratio, sentiment distribution, or network indicators among quadrants. Future research could introduce inferential comparisons and robustness analyses on this basis to enhance explanatory power.\u003c/p\u003e \u003cp\u003eThis study also has several limitations. Firstly, the study adopted a cross-sectional design, and the data reflect the cumulative comments visible on the platform at the time of data collection rather than longitudinal tracking of how comments evolved. Secondly, this study analyzed only top-level comments and did not incorporate reply chains, which may have led to underestimation of interactive processes such as error correction, discussion, or peer support. Thirdly, topic and sentiment recognition rely on specific algorithms and parameter settings. 
Although key processes have been reported to improve reproducibility, method dependency still needs to be considered when interpreting the results. Finally, the sample is derived from a single platform and specific keyword scenarios, so caution is needed when extending the conclusions to other platforms or information scenarios.\u003c/p\u003e \u003cp\u003eIn summary, the comments under diabetes-related short videos on Douyin present a highly concentrated discussion pattern centered on blood glucose indicators and self-management, and the keyword co-occurrence network shows a recognizable semantic association structure. Although sentiment was predominantly neutral, the few negative comments revealed representative confusion and uncertainty. Quantitative mining of comment texts provides data-driven evidence for understanding health information interaction on short-video platforms, helps identify the information needs of people with chronic diseases, and offers references for optimizing health communication and response strategies at the platform level.\u003c/p\u003e"},{"header":"Conclusions","content":"\u003cp\u003eThe analysis of top-level comments on diabetes-related Douyin short videos shows that comment discussion is distinctly practice-oriented, focusing mainly on diabetes self-management, especially reports and interpretations of glycaemic indicators such as blood glucose/fasting blood glucose, as well as questions about how to respond to these indicators and make decisions based on them in daily life. 
The keyword co-occurrence structure and the topic modeling results further indicate that the semantic content of the comments is not scattered but forms a structured semantic network centered on blood glucose indicators and closely linked to management behaviors such as seeking medical consultation and using medication (e.g., insulin).\u003c/p\u003e \u003cp\u003eIn terms of sentiment, classification based on SnowNLP with fixed thresholds shows that comments were predominantly neutral (92.18%, 1,850/2,007), with relatively small proportions of positive and negative expressions (6.83% and 1.00% respectively). This distribution may reflect the large number of factual statements, blood glucose readings, and questions in the comments. Although negative comments were few, their content often involved specific confusion, concerns, or practical management difficulties, giving them high information density and reference value for public health risk communication. Colloquial expressions, emoticons, and context dependence in Chinese may affect the accuracy of automated sentiment recognition, so these results should be interpreted descriptively.\u003c/p\u003e \u003cp\u003eFurthermore, embedding the comment analysis into the four-quadrant stratified framework of \"information quality \u0026times; dissemination intensity\" helps to compare audience responses across dissemination scenarios. The results show that in every quadrant the core theme of blood glucose indicators and self-management remained dominant; in the scenario with higher dissemination but relatively lower information quality, the proportion of non-dominant topics was higher. 
Given that non-dominant topics account for a relatively small proportion of the overall corpus, these differences are presented descriptively only, without statistical inference or causal interpretation.\u003c/p\u003e \u003cp\u003eOverall, the quantitative analysis of short-video comment texts provides data support for understanding health information interaction on short-video platforms. It helps identify the core information needs of people with chronic diseases and related groups, and offers references for optimizing health communication and risk response strategies at the platform level.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eEthics approval and consent to participate\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis study was conducted in accordance with the Declaration of Helsinki and relevant institutional guidelines. Ethics review for this study has been submitted to the Research Ethics Committee of Universiti Kebangsaan Malaysia (UKM) and is currently under review. The study analyzed publicly available, top-level comments from a social media platform; no interaction with users occurred, and all data were anonymized prior to analysis.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConsent for publication\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAvailability of data and materials\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe anonymized comment-level dataset generated and analyzed during the current study is available in the Supplementary Materials. 
Additional materials are available from the corresponding author upon reasonable request.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCompeting interests\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors declare that they have no competing interests.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFunding\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthors’ contributions\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eShan Chen conceptualized the study, designed the methodology, conducted data collection and analysis, and drafted the manuscript.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eXI XI ZHAO contributed to data interpretation and manuscript revision.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eProfessor EMMA provided theoretical guidance and critically reviewed the manuscript.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eDr Anis supervised the research process and contributed to manuscript editing.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eAll authors read and approved the final manuscript.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eAssociation AD. 1. Improving Care and Promoting Health in Populations: Standards of Medical Care in Diabetes\u0026mdash;2021. Diabetes Care. 2021;44(Supplement1):S7\u0026ndash;14.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWhittemore R, Liberti LS, Jeon S, Chao A, Minges KE, Murphy K, et al. Efficacy and implementation of an Internet psychoeducational program for teens with type 1 diabetes. Pediatr Diabetes. 2016;17(8):567\u0026ndash;75.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGhio D, Lawes-Wickwar S, Tang MY, Epton T, Howlett N, Jenkinson E, et al. What influences people\u0026rsquo;s responses to public health messages for managing risks and preventing infectious diseases? 
A rapid systematic review of the evidence and recommendations. BMJ open. 2021;11(11):e048750.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZeng M, Grgurevic J, Diyab R, Roy R. # WhatIEatinaDay: The Quality, Accuracy, and Engagement of Nutrition Content on TikTok. Nutrients. 2025;17(5):781.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAhmed F, Kabir MA, Ahmed M. The Impact of Short Video Content and Social Media Influencers on Digital Marketing Success: A Systematic Literature Review of Smartphone Usage. Innovatech Eng J. 2025;1(02):1070937.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAfful-Dadzie E, Afful-Dadzie A, Egala SB. Social media in health communication: A literature review of information quality. Health Inform Manage J. 2023;52(1):3\u0026ndash;17.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eShang L, Zhang Y, Deng Y, Wang D. MultiTec: a data-driven multimodal short video detection framework for healthcare misinformation on TikTok. IEEE Trans Big Data. 2025.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDutceac Segesten A, Bossetta M, Holmberg N, Niehorster D. The cueing power of comments on social media: How disagreement in Facebook comments affects user engagement with news. Inform Communication Soc. 2022;25(8):1115\u0026ndash;34.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eObamiro K, West S, Lee S. Like, comment, tag, share: Facebook interactions in health research. Int J Med Informatics. 2020;137:104097.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNiu Z, Hu L, Jeong DC, Brickman J, Stapleton JL. An experimental investigation into promoting mental health service use on social media: effects of source and comments. Int J Environ Res Public Health. 2020;17(21):7898.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eOser TK, Oser SM, Parascando JA, Hessler-Jones D, Sciamanna CN, Sparling K, et al. 
Social media in the diabetes community: a novel way to assess psychosocial needs in people with diabetes and their caregivers. Curr Diab Rep. 2020;20(3):10.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eYaagoob E, Hunter S, Chan S. The effectiveness of social media intervention in people with diabetes: An integrative review. J Clin Nurs. 2023;32(11\u0026ndash;12):2419\u0026ndash;32.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFergie G, Hunt K, Hilton S. Social media as a space for support: young adults' perspectives on producing and consuming user-generated content about diabetes and mental health. Soc Sci Med. 2016;170:46\u0026ndash;54.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eEtta RE, Babatunde AO, Okunlola PO, Akanbi OK, Adegoroye KJ, Adepoju RA, et al. The Assessment of TikTok as a Source of Quality Health Information on Human Papillomavirus: A Content Analysis. Cureus. 2024;16(12):e75419\u0026ndash;e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhang B, Kalampakorn S, Powwattana A, Sillabutra J, Liu G. Oral Diabetes Medication Videos on Douyin: Analysis of Information Quality and User Comment Attitudes. JMIR Form Res. 2024;8:e57720.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWong HPN, So WZ, Senthamil Selvan V, Lee JY, Ho CERH, Tiong HY. A cross-sectional quality assessment of TikTok content on benign prostatic hyperplasia. World J Urol. 2023;41(11):3051\u0026ndash;7.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChen X, Liu Y. Chinese libraries\u0026rsquo; communication influence based on the Douyin communication index. Library Hi Tech; 2024.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eShan Chen XXZ, Emma, Mohamad, Arina Anis Azlan. Quality vs. Reach in Health Short Videos: A Dual-Path Test of the Heuristic\u0026ndash;Systematic Model. Malaysian J Communication. 
2025;41.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBi J-W, Qin F, Huang C. Social media communication index in tourism forecasting. Curr Issues Tourism. 2025:1\u0026ndash;21.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eYuan Y, Li Y, Sun H, editors. Utilizing Multidimensional Features to Predict the Dissemination-Force of Emergency Short Videos. Proceedings of the 24th ACM/IEEE Joint Conference on Digital Libraries; 2024.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBilić P. Search algorithms, hidden labour and information control. Big Data Soc. 2016;3(1):2053951716652159.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMyneni S, Lewis B, Singh T, Paiva K, Kim SM, Cebula AV, et al. Diabetes self-management in the age of social media: large-scale analysis of peer interactions using semiautomated methods. JMIR Med Inf. 2020;8(6):e18441.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eElnaggar A, Ta Park V, Lee SJ, Bender M, Siegmund LA, Park LG. Patients\u0026rsquo; use of social media for diabetes self-care: systematic review. J Med Internet Res. 2020;22(4):e14209.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKj\u0026aelig;rulff EM, Andersen TH, Kingod N, Nex\u0026oslash; MA. When people with chronic conditions turn to peers on social media to obtain and share information: systematic review of the implications for relationships with health care professionals. J Med Internet Res. 2023;25(1):e41156.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRoblin DW. The potential of cellular technology to mediate social networks for support of chronic disease self-management. J health communication. 2011;16(sup1):59\u0026ndash;76.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKang GJ, Ewing-Nelson SR, Mackey L, Schlitt JT, Marathe A, Abbas KM, et al. Semantic network analysis of vaccine sentiment in online social media. Vaccine. 
2017;35(29):3621\u0026ndash;38.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChua CEH, Storey VC, Li X, Kaul M. Developing insights from social media using semantic lexical chains to mine short text structures. Decis Support Syst. 2019;127:113142.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKou F, Du J, He Y, Ye L. Social network search based on semantic analysis and learning. CAAI Trans Intell Technol. 2016;1(4):293\u0026ndash;302.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLiu W, Lai C-H, Xu WW. Tweeting about emergency: A semantic network analysis of government organizations\u0026rsquo; social media messaging during Hurricane Harvey. Public relations Rev. 2018;44(5):807\u0026ndash;19.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFortunato S, Barthelemy M. Resolution limit in community detection. Proceedings of the national academy of sciences. 2007;104(1):36\u0026ndash;41.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChen S, Wang Z-Z, Tang L, Tang Y-N, Gao Y-Y, Li H-J, et al. Global vs local modularity for network community detection. PLoS ONE. 2018;13(10):e0205284.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAlbalawi R, Yeap TH, Benyoucef M. Using topic modeling methods for short-text data: A comparative analysis. Front Artif Intell. 2020;3:42.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMurshed BAH, Mallappa S, Abawajy J, Saif MAN, Al-Ariki HDE, Abdulwahab HM. Short text topic modelling approaches in the context of big data: taxonomy, survey, and analysis. Artif Intell Rev. 2023;56(6):5133\u0026ndash;260.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChen R, Chen G, Zhang L, Xie R, Chen R. An analysis of the factors influencing engagement metrics within the dissemination of health science misinformation. Front Public Health. 
2025;13:1571210.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhou B, Zhu Y, Mao X. Sentiment analysis on power rationing Micro blog comments based on SnowNLP-SVM-LDA model. Highlights Sci Eng Technol. 2022;4:179\u0026ndash;85.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLu S, Liu Q, Zhang Z, editors. Sentiment analysis of weibo platform based on lda-snownlp model. 2023 2nd International Conference on Machine Learning, Cloud Computing and Intelligent Mining (MLCCIM); 2023: IEEE.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWahyuni ED, Suryanto TLM, Arviani H. Deep Learning Multimodal Sarcasm Detection in Social Media Comments: The Role of Memes and Emojis. J Artif Intell Technol. 2025;5:192\u0026ndash;201.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBhargava N, Radaideh MI, Kwon OH, Verma A, Radaideh MI. On the Impact of Language Nuances on Sentiment Analysis with Large Language Models: Paraphrasing, Sarcasm, and Emojis. arXiv preprint arXiv:250405603. 2025.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAhmad H, Khan N, Shahid S, editors. Mitigating Toxicity in Social Media: Redesign Guidelines for Cultivating Positive User Interactions in the Instagram Threads App. Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems; 2025.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eArag\u0026oacute;n P, G\u0026oacute;mez V, Kaltenbrunner A, editors. To thread or not to thread: The impact of conversation threading on online discussion. 
Proceedings of the International AAAI Conference on Web and social media; 2017.\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"bmc-public-health","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"pubh","sideBox":"Learn more about [BMC Public Health](http://bmcpublichealth.biomedcentral.com/)","snPcode":"","submissionUrl":"https://www.editorialmanager.com/pubh/default.aspx","title":"BMC Public Health","twitterHandle":"@BMC_series","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"em","reportingPortfolio":"BMC Series","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"Diabetes, Douyin (TikTok China), health information quality, user comments, text mining, topic modelling, sentiment analysis","lastPublishedDoi":"10.21203/rs.3.rs-8464538/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-8464538/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003ch2\u003eBackground\u003c/h2\u003e \u003cp\u003eShort-form video platforms are increasingly used for diabetes-related health information, and comment sections may capture users\u0026rsquo; information needs and affective responses.\u003c/p\u003e\u003ch2\u003eMethods\u003c/h2\u003e \u003cp\u003eWe analysed publicly visible top-level comments on diabetes-related Douyin (TikTok China) videos using a cross-sectional text-mining design. Videos were drawn from a previously evaluated dataset (n\u0026thinsp;=\u0026thinsp;276) and stratified by information quality (final consensus modified DISCERN score) and diffusion (Douyin Communication Index) into four quadrants; six videos were selected from each quadrant (24 total). All retrieved comments (raw, n\u0026thinsp;=\u0026thinsp;3,933) were used for descriptive temporal summaries, while text-based analyses were conducted on valid comments after rule-based cleaning (n\u0026thinsp;=\u0026thinsp;2,007). 
We performed Chinese word segmentation (jieba), stop-word removal, term-frequency analysis, keyword co-occurrence network analysis (co-occurrence threshold\u0026thinsp;\u0026ge;\u0026thinsp;6), LDA topic modelling (K\u0026thinsp;=\u0026thinsp;5), and SnowNLP sentiment classification (negative\u0026thinsp;\u0026lt;\u0026thinsp;0.35; neutral 0.35\u0026ndash;0.65; positive\u0026thinsp;\u0026gt;\u0026thinsp;0.65).\u003c/p\u003e\u003ch2\u003eResults\u003c/h2\u003e \u003cp\u003eHigh-frequency terms were concentrated on diabetes, blood glucose, fasting, doctors, and insulin. The most frequent co-occurring pairs included fasting\u0026ndash;blood glucose (\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e) and diabetes\u0026ndash;blood glucose (\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e). Topic modelling identified five topics; Topic 2 accounted for 89.0% of valid comments (1,786/2,007). Sentiment was predominantly neutral (92.18%, 1,850/2,007), with 6.83% positive (137/2,007) and 1.00% negative comments (20/2,007). In the raw corpus, commenting activity peaked on Fridays (16.5%) and during 18:00\u0026ndash;22:00 (29.4%), with a single hourly peak at 20:00 (254 comments).\u003c/p\u003e\u003ch2\u003eConclusions\u003c/h2\u003e \u003cp\u003eComment discourse was primarily oriented toward practice-oriented diabetes self-management, particularly the reporting and interpretation of glycaemic readings and related action-oriented questions. Although negative sentiment was relatively uncommon, such comments often described concrete confusion, worries, or difficulties in disease management. 
These findings may inform platform-level governance of health-related content and more targeted communication strategies for populations affected by diabetes.\u003c/p\u003e","manuscriptTitle":"Topic and sentiment in comments on diabetes-related Douyin short videos: a cross-sectional text-mining study","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2026-02-02 10:45:39","doi":"10.21203/rs.3.rs-8464538/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"reviewersInvited","content":"","date":"2026-01-29T15:30:22+00:00","index":"","fulltext":""},{"type":"editorInvited","content":"","date":"2026-01-08T08:05:30+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2026-01-06T07:53:06+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2026-01-06T07:48:21+00:00","index":"","fulltext":""},{"type":"submitted","content":"BMC Public Health","date":"2025-12-28T07:39:21+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"bmc-public-health","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"pubh","sideBox":"Learn more about [BMC Public Health](http://bmcpublichealth.biomedcentral.com/)","snPcode":"","submissionUrl":"https://www.editorialmanager.com/pubh/default.aspx","title":"BMC Public Health","twitterHandle":"@BMC_series","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"em","reportingPortfolio":"BMC Series","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"39707107-d45b-4c32-89f2-a5315d1fff12","owner":[],"postedDate":"February 2nd, 2026","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"under-review","subjectAreas":[],"tags":[],"updatedAt":"2026-02-02T10:45:39+00:00","versionOfRecord":[],"versionCreatedAt":"2026-02-02 10:45:39","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-8464538","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-8464538","identity":"rs-8464538","version":["v1"]},"buildId":"XKTyCvWXoU3ODBz1xrDgd","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}