Presenting a new method for identifying and extracting keywords on Twitter related to Covid-19

preprint OA: closed
Research Square
Seyedeh Fatemeh Langari, Hassan Saneifar, Meraj Hejazi, Hassan Ahmadi Choukalaei
This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-7832005/v1
This work is licensed under a CC BY 4.0 License
Status: Under Revision
Version 1 posted 11
Abstract
The relationship between hashtags and keywords in content generated on social media platforms is considered essential and fundamental. Retrieving information related to a specific topic on Twitter and categorizing them encounters difficulties. Hashtags act as approximate indicators of tweet topics, but due to their ambiguity and flexibility in usage, challenges still exist in searching for content related to a specific topic.
Extracting keywords from Twitter is a vital step in displaying the main content of a post or a set of posts. These keywords usually have the best correlation with the textual content. Correctly extracting these keywords can provide the ability to analyze the text's topic and make critical decisions comprehensively. Therefore, research on extracting relationships between hashtags and keywords is of significant importance. It has been transformed into a necessity due to its fundamental role in improving the search and categorization of content on social media platforms. Consequently, this study evaluated the semantic relationship between keyword sets of a Twitter dataset regarding the coronavirus disease and the embedded hashtags in tweets. The dataset tweets amounted to 364,964 in English, each containing up to 280 characters without any images. Also, hashtag validation has been explicitly focused on the coronavirus vaccine; hence, we assume this topic is identified only with a specific hashtag. In this regard, a novel method is introduced, which utilizes semantic graph visualization and ranking techniques for keyword extraction. The proposed model involves the construction of a semantic graph, followed by the application of centrality measures to assign weights to its nodes. Subsequently, the similarity between keywords and hashtags was evaluated using three methods. Finally, two machine learning algorithms were implemented to distinguish between relevant and irrelevant tweets with the hashtag. The results of the two classification algorithms, with 73% and 96% accuracy, respectively, indicate that this approach can effectively validate the relationship between keywords and hashtags. 
Subject areas: Biological sciences/Computational biology and bioinformatics; Physical sciences/Mathematics and computing
Keywords: Semantic graph, Keyword extraction, Hashtag validation, Classification, Twitter, Covid vaccine
1. Introduction
Recently, a vast amount of textual data has been generated on social media platforms for social purposes. Twitter has been a rich environment for analysis, enabling researchers to study real-world phenomena on a large global scale. This platform has evolved into one of the premier microblogging platforms globally, with over 556 million monthly active users. Users can connect by sending brief messages known as tweets, with a maximum length of 280 characters. Despite the compact nature of these messages, the platform's extensive user base sends about 6,000 tweets per second, leading to a remarkable volume of information being shared on the internet daily [1]. As microblogging sites like Twitter continue to thrive, categorizing and searching posts effectively becomes harder. Moreover, finding pertinent information on a particular subject within this platform has become challenging. It is established that hashtags act as approximate indicators of tweet topics, as their purpose is to reference previously defined or emerging content. Due to character limitations on this platform, hashtags are heavily utilized and are often created to save space. Consequently, users began marking hashtags with the "#" sign in their tweets to make them accessible and categorized, so that other interested users can find them more easily. While hashtags offer numerous benefits to their users, the ease and flexibility of their usage still cause problems when searching for content on a specific topic. A single hashtag may refer to various topics.
Since individuals have the freedom to choose hashtags, discovering tweets relevant to another user's search can prove challenging. Hashtags frequently harbor ambiguity, and the concepts they convey may be subsets of broader topics or might be better expressed through alternative hashtags [1]. Conversely, hashtags exhibit a dynamic nature, with their popularity and interpretations evolving constantly. Given that only one hashtag can be searched at any given moment, identifying the most effective phrase to explore a particular topic may require some deliberation. On the other hand, keyword extraction is a more critical step in analyzing data obtained from Twitter. This is a fundamental task for displaying the main content of a post or a set of posts. Therefore, it has become an essential and urgent research topic. Keywords in a text are usually several words or phrases that correlate best with the textual content, which, if properly extracted, can comprehensively analyze the text's topic and make good decisions on it. Keywords play a pivotal role across diverse domains, including information retrieval, topic identification, text mining, SEO, text classification, recommendation systems, NLP, and more [2]. The primary importance of keywords is summarizing the textual information of a document. That is, with some keywords, the main idea of a document can be predicted. Keywords are also essential for readers to quickly understand the information in the text. Existing methods for keyword extraction based on graphs have limitations and require parameters to construct the graph. A method utilizing graphs for extracting keywords from Twitter data, which employs closeness and farness centrality to ascertain node importance, has been proposed. Closeness and eccentricity centralities tend to be less reliable when dealing with disconnected graphs. 
This issue commonly arises in graphs generated from tweets, as the diversity of tweet content often leads to disconnection within the graph. Therefore, an effective keyword extraction method based on a graph is needed to overcome many of the limitations of graph-based models, including those mentioned above [4]. Although it has been proposed to consider a simultaneous relationship between keywords, ultimately, this relationship uses language that is not supported by NLP tools. Graph-based keyword extraction can be conducted without the need for sophisticated linguistic knowledge, yet this simplicity often results in inferior analysis of Twitter data [2]. Therefore, meaningful relationships between the extracted nodes need to be considered. For a while, keywords were considered the only tool for content presentation. However, the use of hashtags, especially on Twitter, became prevalent. There is a symbiotic relationship between keywords and hashtags. Hashtags work like keywords but are more commonly used and require less time for followers. Keywords are successful tools for indicating the text's topic. Whether searching for a specific hashtag on Twitter or typing a phrase in the search box, if targeted words and phrases are used correctly, posts and content will reach users' attention. Additionally, researching and analyzing hashtags can help with keyword strategy. For example, by examining popular hashtags, some keywords and popular topics of tweets can be identified. Then, similar content can be shared with those hashtags. In this study, we address the following questions:
a) What techniques can be employed to enhance keyword extraction processes?
b) Does the combination of semantic graphs and numerical metrics increase the quality of extracted keywords?
c) How can the relationship between extracted keywords and hashtags be calculated?
In Section 2, we explore related works and research background.
Section 3 explains the necessary actions for selecting the optimal keywords and keyword similarity measurement with hashtags, followed by segregating relevant and irrelevant tweets. Section 4 introduces the dataset, and Section 5 elaborates on evaluating the implemented methods for tweet classification. Finally, the results of this research are presented in Section 6. 2. Background of the research In a study by Figueiredo and Jorge, one of the fundamental challenges related to identifying hashtags relevant to topics in Twitter streams was discussed. This challenge arises due to the complexity and high volume of Twitter data. The paper presents TORHID (Topic-Oriented et al.), a technique aimed at modeling topics to retrieve and identify relevant hashtags for specific topics on Twitter. It initiates with a seed hashtag and employs a classifier to filter out irrelevant hashtags [ 1 ]. In 2019, Devika R. and colleagues introduced an efficient approach known as Graph-based Keyword Extraction from Twitter (SKEM), which leverages ranking methods. In this suggested model, following thorough preprocessing, a semantic graph model was created. Then, numerical graph metrics were used to measure the importance of nodes in the semantic graph. The PageRank algorithm was also employed to rank the nodes, identifying the top ten nodes that effectively represent the most influential ones [ 2 ]. In another study published by Shin et al. in 2022, the challenge presented for keyword extraction was examined. The primary aim of this article is to introduce a model or approach for unsupervised keyword extraction in text analysis. This research proposed a general keyword extraction model using logistic regression and the Least Absolute Shrinkage and Selection Operator (LASSO). This model uses a classification-based structure to learn words that distinguish document groups and, unlike conventional models, is less sensitive to repetitive word occurrences. 
Experiments show that this model performs acceptably well in keyword extraction with high representation and distinction from documents, even when applied to unlabeled document classes [ 3 ]. Bordoloi and Biswas introduced an innovative unsupervised technique for graph-based keyword extraction, termed Keyword Extraction with Collective Node Weight (KECNW). This method assesses the importance of a keyword collectively by taking into account various influential parameters. In this article, the authors describe keyword extraction based on node weight allocation, which includes centrality, neighbor power, and node position [ 4 ]. Pan and Rui addressed challenges in automatic keyword extraction from Chinese text by proposing an algorithm based on word clustering. They employed statistical methods to calculate keyword features, measured interdependence with point mutual information and constructed a quantification matrix for keyword features. Semantic similarity was assessed, and low-importance single words were eliminated. They utilized a Bayesian framework to reduce dimensionality, and clustering and classification methods to refine keyword extraction [ 5 ]. AlShammari and colleagues developed a keyword extraction program using TF-IDF in Python. This program performed the main stages of keyword extraction: text preprocessing, creating a word list, etc. Ultimately, this program underwent experimentation using a sample text from Wikipedia, yielding the required outcomes [ 6 ]. Ning et al. proposed an enhanced TextRank keyword extraction algorithm, RDD-WRank, which leverages rough data reasoning in conjunction with word vector clustering. Initially, the method employs rough data reasoning to extract connections between candidate terms, broadening the exploration scope and producing more comprehensive outcomes. 
Subsequently, utilizing Wikipedia, word embedding techniques are integrated into the enhanced algorithm, and the word representations of TextRank nodes are grouped to adjust the voting significance of nodes within the cluster [ 7 ]. A new article by Ma et al. addressed a challenge in finding optimal optimization algorithms for keyword extraction from English texts using cluster analysis. The author proposed a new method for community detection in networks using complex network theory. This method utilizes communities present in complex networks for text clustering. They also proposed a new keyword extraction algorithm based on complex networks, which improves the connectivity of textual vocabulary using eigenvalue computations. This research demonstrated that this algorithm has high accuracy and speed in extracting text topics [ 8 ]. Lívia and Michal explored the utilization of Keyword Extraction to identify trending COVID-19-related search queries and evaluate their potential inclusion in models for predicting fake news. Using the KeyBERT machine learning technique, they extracted keywords from true and fake news articles, subsequently employing them to generate relevant search queries using the Google Trends API. [ 9 ]. The fundamental challenges associated with keyword extraction from scientific reports related to earth and space sciences were examined by Qiu et al. Their main objective was to present a practical model or method for keyword extraction from scientific reports in this field using a graph-based algorithm. This paper presents an unsupervised algorithm known as Graph-based Keyword Extraction with Error Propagation (GKEEP), which improves upon graph-based keyword extraction methods through the utilization of an error propagation mechanism similar to feedback propagation [ 10 ]. Huang and Xie proposed an improved model of TextRank for extracting keywords from patent texts. 
This method extracts essential and relevant keywords from patent texts using the TextRank algorithm and prior common knowledge. The methodology involves establishing a TextRank framework for each patent document and a foundational knowledge graph sourced from public lexical resources. Subsequently, a metric is devised to assess the significance ranking of nodes within the TextRank frameworks of patent documents, incorporating prior contextual insights from the foundational knowledge graph. Finally, the extraction of patent keywords involves identifying the top k nodes ranked by their value [ 11 ]. In 2020, the research primarily centered on text classification methodologies utilizing keywords for feature extraction and introduced a text classification technique utilizing automatic extraction of keywords with genetic algorithm optimization. The study encompasses five distinct methods for keyword extraction—synchronicity statistical information, maximum frequency, TF-IDF, reactivity-driven keyword extraction, and TextRank—and assesses their performance against four standard text classification techniques: Naive Bayes, Random Forest, Support Vector Machine ,and Linear Regression [ 12 ]. Kelsea et al., applied novel machine-learning techniques to social survey data, demonstrating their usefulness in identifying key variables in large, complex datasets. They used random forest classification and regression models to find significant predictors of household migration decisions. The results showed that random forests could capture nuances in migration predictors and outperform logistic regression and support vector machines in all cases analyzed. Thus, random forests and other machine-learning methods can enhance the predictive accuracy of migration models and uncover patterns in complex social datasets [ 13 ]. 
Shi and colleagues presented Hashtagger + in 2018, an improved version of their previous research, Hashtagger, which utilizes cold-start algorithms to expedite the recommendation process. Additionally, by proposing new data collection and feature computation techniques, they enhanced the performance and coverage of the advanced hashtag model [ 14 ]. Hashtagger employed a rank-based learning approach to model hashtag relationships to find relevant hashtags to stimulate news article dissemination. In today's context, with the vast amount of data available and the diverse range of research information, data analysis has emerged as a crucial subject. These analyses can lead to a vast amount of data and information. VOSviewer is a tool capable of generating networks representing scientific publications, academic journals, researchers, research institutions, countries, keywords, or phrases [ 15 ]. Figure 1 illustrates the results of evaluating a research network on "Hashtag validation." This figure represents the evaluation results of over 1000 studies conducted in this field. The size of each circle reflects the number of studies carried out in the respective area, with larger circles indicating a greater volume of research conducted within that domain. In Fig. 2 , the examination results are displayed in a clustered manner. Seven clusters are formed and displayed with different colors. The size of each label is directly related to the number of studies conducted in that area, meaning that if more studies are conducted in one area, its label size is more significant than in other areas. Also, the thicker the connecting lines between the labels, the stronger the connection between them. Figure 3 indicates which areas have seen more research in recent years. This assessment is based on studies conducted in the past ten years. The areas with yellow and orange labels indicate that research has been directed towards these areas in recent years. Figure 3. 
Results of network analysis of research conducted on hashtags from 2014 to 2024
In our study, to foster innovation, we have emphasized the adoption of an efficient algorithm for keyword extraction, contributing to enhancements in the quality and relevance of tweets. Upon analyzing the words within tweets represented in a graph-like structure, we ascertain that the prominence and significance of a keyword on Twitter are contingent on several factors, including eigenvector centrality, closeness centrality, betweenness centrality, and degree centrality of the said keyword. These factors are regarded as pivotal influencers in our proposed extraction methodology.
3. Implementation Method
The objective of this study is to develop a method for validating hashtags related to a specific topic with the utmost accuracy. Figure 4 provides an overview of this process. Our proposed approach is segmented into four stages:
a) Keyword extraction and graph construction
b) Calculation of the semantic relationship between keywords and the seed hashtag using three similarity metrics
c) Labeling and categorizing tweets based on their relevance to the seed hashtag
d) Training the classifier on the data to classify tweets
3.1 Keyword Extraction
The most straightforward approach might be using frequency metrics to select essential keywords in a document. Given the typically subpar results associated with this method, our study presents a model for keyword extraction based on a semantic graph structure. This model incorporates various parameters such as closeness centrality, betweenness centrality, eigenvector centrality, and degree centrality of the graph nodes to calculate node weights. The advantage of implementing this model is that the extracted keywords will be not only semantically relevant but also structurally important according to the graph analysis.
The implementation of this model includes the following steps:
a) Preprocessing
b) Construction of a semantic graph
c) Calculation of node scores
d) Node selection
3.1.1 Preprocessing Phase
Twitter is a microblogging platform whose texts are usually unstructured because individuals mostly write conversationally, resulting in tweets containing various symbols that carry little useful information. Therefore, their raw content is highly noisy for text mining tasks. Preprocessing is a process where surface-level text modifications are made. This stage is crucial as it reduces input noise. Moreover, efficient preprocessing significantly increases prediction accuracy and is one of the main reasons for the accuracy of the implemented model.
a) Username, Retweets, and Links Removal: Tweets frequently include usernames preceded by the "@" symbol, and sometimes, they are retweeted by others and feature the "RT" symbol. These usernames, retweet symbols, and URL links are irrelevant for extracting keywords and introduce noise. Consequently, usernames, retweet symbols, and URL links are eliminated.
b) Numbers and Emphasized Words Removal: Eliminating numerical data from Twitter content improves the efficiency of extracting meaningful keywords from documents. Additionally, informal online communications frequently express emotions and sentiments by elongating words. For instance, "Gooooddd" instead of "Good" is a common occurrence on social media platforms. This elongation not only wastes space but also affects the word count. Given that frequency is the parameter of interest, it is crucial to handle such elongations appropriately.
c) Converting All Letters to Lowercase: We convert text data to lowercase to achieve uniformity in the text. Otherwise, the same word written in lowercase and in uppercase can be treated as two separate words by the computer.
d) Tokenization: Tokenization entails parsing text into smaller components, such as sentences or words, considering each component as a separate token. Various tokenization techniques can be executed based on language and modeling goals.
e) Removing Punctuation Marks and Stop Words: Punctuation marks are often removed in keyword extraction models because they can substantially affect the models' performance. Similarly, stop words such as "the" are frequently excluded, although doing so can hinder the accurate identification of phrases such as "The Who."
f) Replacing Negators with Antonyms: In sentences, certain expressions are formed with a negator, and treating the negator and the following word as individual words may not be the most effective approach. For instance, words like "not" and "never" should be considered together with the subsequent word, rather than separately. To address this, a negated word can be replaced with its antonym to ensure the inclusion of appropriate and meaningful words in the dataset.
g) Stemming and Lemmatization: Stemming involves reducing an expression to its root form, while lemmatization aims to group various forms of a word into a common root, known as a lemma. For instance, words such as "gone," "going," and "went" are all transformed to "go."
3.1.2 Keyword Extraction
Term Frequency-Inverse Document Frequency (TF-IDF) is a widely recognized technique for assessing the significance of a word in a document. Term Frequency (TF) measures how often a word appears in a document relative to the total number of words in that document. Since documents differ in length, a word may appear many more times in a longer document than in a shorter one. Therefore, this metric is normalized over the length of the document. Inverse Document Frequency (IDF) is the logarithm of the total number of documents divided by the number of documents containing a particular word.
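The TF and IDF definitions above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function name `tf_idf` and the three toy documents are assumptions, and library implementations often use smoothed IDF variants, so the numbers will not match weights reported elsewhere in this paper.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            # TF (count normalized by document length) times IDF (log of N / df).
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical, already preprocessed tweets (stemmed tokens).
docs = [
    ["vaccin", "shot", "day"],
    ["vaccin", "peopl", "peopl"],
    ["covid", "vaccin", "jab"],
]
weights = tf_idf(docs)
top = sorted(weights[1].items(), key=lambda kv: kv[1], reverse=True)
print(top)
```

In this toy corpus "vaccin" occurs in every document, so its unsmoothed IDF is log(1) = 0 and it receives zero weight, while "peopl" ranks first in the second document.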
The importance of a word t in a document d can be expressed as follows: $w(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t)$, that is, the weight of the word is equal to the product of the term frequency and the inverse document frequency. After calculating the weights of all words in the input document, words with the highest weights are introduced as candidate keywords. Table 1 presents information related to the weights of keywords.
3.1.3 Semantic Graph Construction Phase
There are various advantages to extracting keywords based on free-text graph representation. The clearest ones are the simplicity of understanding the applied algorithms visually and their expected results. Another advantage is the adaptability of graph representation, which encompasses much more than simple word co-occurrences, such as semantic relationships, syntactic dependencies, contextual associations, and more. Incorporating these pieces of information enriches the representation explicitly and allows algorithms to use them independently or specifically. In this stage, constructing the semantic graph involves assigning vertices and connecting edges between vertices. Let G = (V, E) be a weighted undirected graph. This can be described as a framework that delineates relationships among elements of a set, where V represents a collection of elements known as vertices or nodes, and E denotes a set of connections between vertices referred to as edges. In our semantic graph, graph nodes represent unique keywords, and edges connect keywords with semantic relationships. The weight of an edge is equal to the level of similarity between two keywords. This similarity is calculated through WordNet. The desired graph is depicted in Figure 2. WordNet seamlessly integrates traditional lexical information with modern computational methods, establishing semantic relationships that connect sets of synonyms.
Table 1. Weight of pre-processed keywords calculated by TF-IDF method
No.  Words   Idf-weight  Words   Tf-idf
1    vaccin  13.114407   vaccin  0.469551
2    peopl   13.114407   peopl   0.36006
3    shot    13.114407   shot    0.323352
4    day     13.114407   day     0.312904
5    covid   13.114407   covid   0.304243
6    say     3.671171    say     0.0
7    u       3.309552    you     0.0
8    go      2.940911    go      0.0
9    amp     2.452431    amp     0.0
10   jab     2.113741    jab     0.0
3.1.4 Node Score Calculation Phase
In graph theory, centrality measures are applied to identify the most important nodes in a graph and are used for ranking nodes. As previously stated, the node weight is determined through four metrics renowned for their efficacy in gauging the significance of a keyword, as they effectively emphasize the prominence of each word. These metrics include (a) betweenness centrality, (b) eigenvector centrality, (c) closeness centrality, and (d) degree centrality. Generally, centrality refers to how important a node is in a network, or more simply, which nodes are important in the network. Degree centrality quantifies the number of links a node possesses. Betweenness centrality assesses a node's centrality based on how many shortest paths in the graph pass through it. Closeness centrality is based on the average shortest-path distance from a node to all other nodes in the graph. Eigenvector centrality measures the influence of a node in a network. Connections to nodes with higher scores carry more weight in determining the score of the target node, while connections to nodes with lower scores have a comparatively lesser impact. This principle guides the allocation of relative scores to all nodes in the graph (Table 2).
Table 2. Criteria calculated for each node along with their final weight
No.  Keywords  Eigenvector  Degree  Closeness  Betweenness  Node weight
1    vaccin    0.83         0.81    0.83       0.60         3.07
2    covid     0.83         0.81    0.83       0.60         3.07
3    peopl     0.78         0.68    0.58       0.27         2.63
4    need      0.55         0.78    0.55       0.66         2.41
5    dose      0.67         0.62    0.67       0.40         2.36
6    week      0.67         0.62    0.67       0.40         2.36
7    health    0.41         0.77    0.78       0.30         2.26
8    effect    0.41         0.77    0.78       0.30         2.26
9    receiv    0.67         0.62    0.67       0.00         2.26
10   shot      0.34         0.53    0.62       0.05         1.99
11   day       0.67         0.62    0.67       0.00         1.96
12   amp       0.67         0.62    0.67       0.00         1.96
13   take      0.62         0.59    0.69       0.00         1.96
14   get       0.62         0.67    0.67       0.00         1.96
15   plea      0.78         0.40    0.78       0.30         1.90
16   work      0.58         0.64    0.58       0.10         1.90
17   today     0.58         0.64    0.58       0.10         1.90
18   like      0.59         0.62    0.64       0.00         1.90
19   say       0.62         0.62    0.64       0.00         1.85
20   show      0.55         0.48    0.50       0.06         1.83
21   got       0.58         0.64    0.58       0.10         1.59
22   through   0.58         0.33    0.58       0.01         1.50
23   go        0.12         0.40    0.45       0.20         1.17
24   jab       0.64         0.62    0.64       0.00         0.91
25   appoint   0.33         0.27    0.25       0.00         0.85
26   u         0.35         0.07    0.27       0.00         0.69
27   one       0.07         0.35    0.23       0.00         0.65
28   thank     0.00         0.00    0.00       0.00         0.00
Table 3. Keywords selected by PageRank algorithm
No.  Top words  PR val
1    vaccin     0.032715
2    covid      0.032520
3    peopl      0.032026
4    need       0.032026
5    dose       0.030517
6    week       0.028796
7    health     0.026645
8    receiv     0.026645
9    get        0.026162
10   shot       0.025675
11   amp        0.023077
12   take       0.023077
13   plea       0.023036
14   day        0.021601
15   effect     0.020829
3.1.5 Node Selection Phase
In this stage of the algorithm, PageRank was used to find the top keywords based on their importance. PageRank is an algorithm primarily used to determine the importance of nodes in a graph. According to PageRank, the score of a node is determined by the links it receives, as well as by the scores of the nodes issuing those links. Moreover, given the multifaceted nature of keyword importance on Twitter, we utilized the four centrality metrics calculated in the previous stage to ascertain the significance of the nodes.
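The PageRank idea described above, where a node's score depends on its links and on the scores of the neighbours providing them, can be sketched with a plain power iteration. This is a minimal sketch under stated assumptions, not the authors' pipeline: the function `pagerank`, the damping factor 0.85, and the edge weights (stand-ins for WordNet similarities) are all hypothetical, and the sketch uses plain weighted PageRank without any centrality-based weighting.

```python
def pagerank(edges, damping=0.85, iterations=100, tol=1e-10):
    """Weighted PageRank on an undirected graph given as {(u, v): weight}.

    Assumes every edge weight is positive.
    """
    # Build a symmetric adjacency map: adj[u][v] == adj[v][u] == weight.
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    nodes = sorted(adj)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    # Total edge weight attached to each node, used to split its score.
    strength = {node: sum(adj[node].values()) for node in nodes}
    for _ in range(iterations):
        new_rank = {}
        for node in nodes:
            # Each neighbour passes on a share of its score proportional
            # to the weight of the connecting edge.
            incoming = sum(rank[nb] * w / strength[nb]
                           for nb, w in adj[node].items())
            new_rank[node] = (1 - damping) / n + damping * incoming
        delta = sum(abs(new_rank[x] - rank[x]) for x in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank

# Hypothetical keyword graph; weights are illustrative similarity values.
edges = {
    ("vaccin", "covid"): 0.9,
    ("vaccin", "dose"): 0.8,
    ("vaccin", "shot"): 0.7,
    ("covid", "peopl"): 0.5,
}
rank = pagerank(edges)
for word, score in sorted(rank.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word:8s} {score:.4f}")
```

Selecting the top keywords then amounts to sorting the nodes by score and keeping the first k.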
In our proposed method, these four metrics are sent to the PageRank algorithm in combination so that the algorithm can identify the top keywords by considering all aspects of the relationships between nodes (Table 3).
3.2 Calculating Semantic Relationship Between Keywords and Hashtag
After extracting the most commonly used hashtags from the dataset, which was done using the NLTK library, the semantic relationship between hashtags and keywords was calculated for validation using three methods:
3.2.1 WordNet
As mentioned earlier, WordNet is a lexical resource that groups words into synonym sets linked by semantic relations. In its sentiment-annotated extension, SentiWordNet, each synonym set is associated with three numerical scores, explaining how objective, positive, or negative the expressions in the set are. One of its simplest uses is to enrich text representation in text mining tasks by adding information about the sentiment-related features of expressions in the text.
3.2.2 Cosine Similarity
Cosine similarity, widely used in text analysis, is a popular method in natural language processing for approximating the similarity of document vectors, regardless of their size. Mathematically, it is the cosine of the angle between two vectors represented in a multi-dimensional space. Simply put, we determine the similarity between a hashtag and a keyword by examining the angle between their respective vectors. A cosine similarity score of +1 signifies perfect alignment between vectors, while a score of -1 indicates they are diametrically opposed.
3.2.3 Pearson Correlation
The Pearson correlation coefficient is utilized to evaluate the relationship between two variables by quantifying both the direction and strength of the correlation. Suppose the values of two variables change similarly, meaning that as one increases or decreases, the other also increases or decreases in a way that their relationship can be expressed by an equation.
In that case, we say there is a correlation between these two variables. The Pearson correlation score measures how well the relationship between two data sets is described by a straight line, on a scale from -1 to +1. The coefficient has two parts: a numerical value and a sign. The sign indicates whether the relationship between the two variables is positive or negative, while the numerical value indicates the strength of the linear relationship.

3.3 Separation of relevant and irrelevant tweets with hashtags

After determining the similarity of each keyword to the hashtags in the previous stage, the next step is to separate the data.

3.3.1 Labeling tweets

Since our dataset consists of many tweets with various hashtags, we needed human annotators to label the tweets manually. Because our focus is specifically on the coronavirus vaccine, tweets whose content aligned with the hashtag received a "yes" label, while those whose content did not align received a "no" label. This stage was implemented to see whether we could predict tweet content based on the extracted keywords.

3.3.2 Classifier training

In the previous stage, the dataset was divided into two classes labeled "yes" and "no." This stage therefore requires binary classification algorithms. Initially, we employed a Support Vector Classifier (SVC), a supervised machine learning algorithm from the support vector machine family, which is commonly used for classification and regression tasks. SVC categorizes input data by mapping it to a high-dimensional space and determining the optimal hyperplane that separates the data into two classes. Expecting better performance, we then used Random Forest in a second attempt. This algorithm comprises multiple decision trees trained on different subsets of the dataset and aggregates their predictions to improve accuracy. Unlike a single decision tree, Random Forest predicts the final outcome by majority voting among the trees.
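The majority-voting step can be illustrated with a small pure-Python sketch. The three "trees" below are hypothetical hand-written rules over sets of stemmed tokens, standing in for decision trees that would normally be trained on different subsets of the labeled tweets. The keywords come from the extracted list, but the rules, the `FOREST` list, and the function names are illustrative assumptions rather than the trained model used in this study.

```python
from collections import Counter


# Each "tree" is a stand-in rule mapping a set of stemmed tokens to "yes" or "no".
def tree_vaccine(tokens):
    return "yes" if {"vaccin", "dose"} & tokens else "no"


def tree_covid(tokens):
    return "yes" if "covid" in tokens else "no"


def tree_shot(tokens):
    return "yes" if {"shot", "jab"} & tokens else "no"


# An odd number of trees avoids ties between the two labels.
FOREST = [tree_vaccine, tree_covid, tree_shot]


def forest_predict(tokens, forest=FOREST):
    """Return the label chosen by most trees (majority voting)."""
    votes = Counter(tree(tokens) for tree in forest)
    return votes.most_common(1)[0][0]


relevant = forest_predict({"got", "second", "dose", "covid"})
irrelevant = forest_predict({"thank", "today", "work"})
print(relevant, irrelevant)
```

A single rule firing is not enough to label a tweet "yes" here; at least two of the three trees must agree, which is the sense in which the forest is less sensitive to the errors of any one tree.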
Increasing the number of trees in the forest generally improves accuracy and reduces the risk of overfitting.

4. Dataset and evaluation

The dataset used in this research consists of 364,964 tweets with various hashtags related to the coronavirus and was obtained for analysis and evaluation from reputable sources such as the Kaggle website. The oldest tweets in this dataset date back to October 1, 2019. The dataset underwent a redesign on March 20, 2020, and includes features such as usernames, follower count, friend count, like count, retweet count, active hashtags, and geographic location. Because our research focuses explicitly on the coronavirus vaccine, we conducted our study on the #CovidVaccine hashtag, which is the most frequent hashtag in this dataset. The purpose of the evaluation is to examine to what extent our proposed method has successfully validated the desired hashtag and its association with tweets. Python 3.9 was used to implement the algorithms, Jupyter Notebook served as the development environment for the experiments, and the operating system was Windows 11.

We start by evaluating keyword extraction with TF-IDF, then move on to the two classification methods described in the previous section. The values of the confusion matrix (Figure 9) indicate that the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm identified 3,250 keywords in the "Yes" category. According to the report in Figure 8, this algorithm made correct predictions in only 70% of cases. Likewise, precision and recall are 0.72 and 0.71, respectively. TF-IDF alone is therefore not sufficient for distinguishing between the two categories, which is why we used centrality metrics.

Next, we evaluate the performance of the Support Vector Classifier (SVC).
Overall, this algorithm made 23,437 correct predictions and 8,563 incorrect predictions (Figure 10). Our training set contained 12,544 tweets with the "No" label and 19,456 tweets with the "Yes" label, a relatively balanced data distribution. The ROC curve is a standard method for evaluating the quality of binary classification. The ROC is a probability curve, and the AUC (Area Under the Curve) measures separability. The closer the AUC is to 1, the better the model. According to Figure 11, our model's performance is 0.96, indicating good results. Unlike accuracy, ROC curves are not sensitive to class imbalance, although our training set was relatively balanced. The accuracy score indicates that the Support Vector Classifier classified 73% of our data correctly (Figures 11 and 13). The precision value means that 84% of the tweets predicted as "Yes" were indeed relevant to the desired hashtag. The recall value indicates that our model identified 69% of the tweets with the "Yes" label (Table 4).

Table 4. Results of classification report of TF-IDF, SVC, and Random Forest algorithms

Classifier      Accuracy  Precision  Recall  F1-score
TF-IDF          0.70      0.72       0.71    0.71
SVC             0.73      0.84       0.69    0.76
Random Forest   0.96      0.97       0.96    0.97

5. Conclusion

Currently, users need to quickly review many Twitter posts to find topics relevant to their interests. Analyzing such vast amounts of data becomes easier with a subset of words (keywords) that captures the main features, concepts, and subjects of the documents. Hashtags, which have been widely used since 2007, are commonly used to mark keywords in tweets for message categorization and for initiating conversations on Twitter, primarily helping to attract attention to user posts and to encourage interaction with them. For keyword ranking, a statistical approach based on graphs has been adopted.
Text graph construction using WordNet involves allocating nodes and creating edges between nodes that are semantically close to the keywords. Not all keywords have the same importance in determining sentiment opinions. To calculate their importance, the weight of each keyword is determined using influential parameters such as position, centrality, and neighbor importance, and the nodes are finally ranked using the PageRank algorithm. Semantic similarity between keywords and hashtags was computed using widely used measures such as WordNet and cosine similarity. Additionally, since the Pearson correlation score measures how closely two data objects follow a single linear relationship, we also used this metric to calculate the level of similarity.

One of the significant achievements of this research was the very high accuracy of the classification approaches. The 70% result in the evaluation of the Term Frequency-Inverse Document Frequency method indicated that this method alone is not a good criterion for classification. Therefore, two other machine learning-based methods, Support Vector Classifier and Random Forest, were used for classifier training. Although the accuracy obtained with SVC is lower than that obtained with Random Forest (73% and 96%, respectively), overall our proposed approach for extracting, correlating, and classifying tweets related to a specific hashtag in Twitter streams has provided acceptable results. The hashtag validation approach applied in this research can be useful for recommending impactful hashtags.

However, challenges and limitations were evident in this process that may significantly affect the final results and performance. First, only one hashtag was selected from the extensive dataset, which may reduce the diversity of topics and limit the results to a specific field.
Additionally, the tweet labeling process (based on the presence or absence of hashtags) requires high accuracy and thorough examination for the research results to be reliable. Another issue is the large volume of the dataset: it can support more accurate processing, but it increases computational time and therefore requires effective management. In general, the research results show that the classification algorithms performed well. However, careful selection and tuning of parameters is needed to improve algorithm performance, accuracy, and efficiency. Despite these challenges, this research demonstrates that semantic graphs and classification algorithms can improve the process of extracting keywords from tweets. Future research should pay particular attention to addressing these limitations and enhancing algorithm performance.

Declarations

Data availability
The majority of the data are presented in this manuscript. Additional raw data may be made available upon request.

Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.

Consent to participate
Not applicable.

Consent for publication
Not applicable.

Conflict of interest
The authors declare no competing interests.

Funding Statement
The authors declare that no funds, grants, or other financial support were received during the preparation of this manuscript.

Author Contributions
Fatemeh L. reviewed the main text and developed the model. Hassan S. performed critical review and revisions of the manuscript. Meraj H., as the corresponding author, carried out the editing and coordinated the submission process. Hassan A. was responsible for data acquisition. All authors critically reviewed the manuscript, approved the final version for submission, and agree to be accountable for all aspects of the work.
Status: Under Revision. Version 1 posted.
Editorial decision: Revision requested, 08 Jan, 2026
Reviews received at journal, 19 Dec, 2025
Reviewers agreed at journal, 19 Dec, 2025
Reviews received at journal, 12 Dec, 2025
Reviewers agreed at journal, 22 Nov, 2025
Reviewers agreed at journal, 06 Nov, 2025
Reviewers invited by journal, 27 Oct, 2025
Editor invited by journal, 24 Oct, 2025
Editor assigned by journal, 15 Oct, 2025
Submission checks completed at journal, 15 Oct, 2025
First submitted to journal, 11 Oct, 2025
Author affiliations: Seyedeh Fatemeh Langari and Hassan Saneifar, Islamic Azad University, Science and Research Branch; Meraj Hejazi (corresponding author), Islamic Azad University, Nour, Iran; Hassan Ahmadi Choukalaei, Urmia University.

Figure 1. Results of network analysis of research conducted on hashtags
Figure 2. Results of network analysis of research conducted on hashtags
Figure 3. Results of network analysis of research conducted on hashtags from 2014 to 2024
Figure 4. Overview of the research process
Figure 5. A view of the semantic graph made with WordNet to show the semantic relationship between keywords
Figure 6. Calculating the similarity of keywords with the hashtag “CovidVaccine” with the help of WordNet
Figure 7. The comparison rate obtained from the similarity of the hashtag “covidvaccine” with keywords using Cosine and Pearson methods
Figure 8. Calculated ROC curve for evaluating the TF-IDF algorithm; the area under the curve shows an accuracy of 70%
Figure 9. Confusion matrix of the TF-IDF algorithm; this matrix specifies the number of samples that are correctly or incorrectly detected, positive or negative
Figure 10. The confusion matrix of the SVC algorithm
Figure 11. The ROC curve calculated to evaluate the SVC algorithm shows the area under the accuracy curve of 73%.
Figure 12. ROC curve calculated for random forest algorithm evaluation; the area under the curve shows 96%
accuracy\u003c/p\u003e","description":"","filename":"12.png","url":"https://assets-eu.researchsquare.com/files/rs-7832005/v1/2e41b75dcf15fd03ea5fe9b0.png"},{"id":95524379,"identity":"0e2407f4-bd0c-475c-a748-34a0ebf97102","added_by":"auto","created_at":"2025-11-10 10:02:41","extension":"png","order_by":13,"title":"Figure 13","display":"","copyAsset":false,"role":"figure","size":112226,"visible":true,"origin":"","legend":"\u003cp\u003eThe output of the SVC algorithm in separating tweets labeled “yes” and “no”\u003c/p\u003e","description":"","filename":"13.png","url":"https://assets-eu.researchsquare.com/files/rs-7832005/v1/59a0bac84ff0539881a38620.png"},{"id":95654058,"identity":"d633f6ba-7a97-41ab-8bcf-e22e00a4e7a6","added_by":"auto","created_at":"2025-11-11 16:09:28","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":2992669,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-7832005/v1/98409461-8298-4fe4-a564-0b2c956e81c0.pdf"}],"financialInterests":"No competing interests reported.","formattedTitle":"Presenting a new method for identifying and extracting keywords on Twitter related to Covid-19","fulltext":[{"header":"1. Introduction","content":"\u003cp\u003eRecently, a vast amount of textual data has been generated on social\u0026nbsp;media platforms for social purposes. Twitter has been a rich environment for analysis, enabling research to delve into real-world phenomena on a large global scale. This platform has evolved into one of the premier microblogging platforms globally, with over 556 million monthly active users. Users can connect by sending brief messages known as tweets, with a maximum length of 280 characters. 
Despite the compact nature of these messages, the platform's extensive user base sends 6000 tweets per second, so a remarkable volume of information is shared on the internet daily [1]. As microblogging sites like Twitter continue to thrive, effective categorization and search of posts become harder, and finding pertinent information on a particular subject within this platform has become challenging. Hashtags act as approximate indicators of tweet topics, as their purpose is to reference previously defined or emerging content. Due to character limitations on this platform, hashtags are heavily utilized and are often created to save space. Users began adding hashtags, marked with the \"#\" sign, to their tweets to categorize them and make them easier for other interested users to find. While hashtags offer numerous benefits to their users, the ease and flexibility of their use still cause problems when searching for content on a specific topic. A single hashtag may refer to various topics. Since individuals are free to choose their own hashtags, discovering the tweets relevant to another user's search can prove challenging. Hashtags are frequently ambiguous, and the concepts they convey may be subsets of broader topics or might be better expressed through alternative hashtags [1]. Hashtags are also dynamic: their popularity and interpretations evolve constantly. Given that only one hashtag can be searched at a time, identifying the most effective phrase to explore a particular topic may require some deliberation. Keyword extraction, in turn, is a critical step in analyzing data obtained from Twitter. It is a fundamental task for representing the main content of a post or a set of posts and has therefore become an essential and urgent research topic. 
Keywords in a text are usually the few words or phrases that best correlate with its content; if properly extracted, they support a comprehensive analysis of the text's topic and sound decisions based on it. Keywords play a pivotal role across diverse domains, including information retrieval, topic identification, text mining, SEO, text classification, recommendation systems, NLP, and more [2]. The primary importance of keywords is summarizing the textual information of a document: from a few keywords, the main idea of a document can be predicted. Keywords also help readers quickly understand the information in the text. Existing graph-based methods for keyword extraction have limitations and require parameters to construct the graph. A graph-based method for extracting keywords from Twitter data, which employs closeness and farness centrality to ascertain node importance, has been proposed. Closeness and eccentricity centralities tend to be less reliable when dealing with disconnected graphs. This issue commonly arises in graphs generated from tweets, as the diversity of tweet content often leads to disconnection within the graph. Therefore, an effective graph-based keyword extraction method is needed to overcome many of the limitations of graph-based models, including those mentioned above [4]. Although it has been proposed to consider a simultaneous relationship between keywords, ultimately, this relationship uses language that is not supported by NLP tools. Graph-based keyword extraction can be conducted without the need for sophisticated linguistic knowledge, yet this simplicity often results in inferior analysis of Twitter data [2]. Therefore, meaningful relationships among the extracted nodes need to be considered. For a while, keywords were considered the only tool for content presentation. However, the use of hashtags, especially on Twitter, became prevalent. There is a symbiotic relationship between keywords and hashtags. 
Hashtags work like keywords but are more commonly used and require less time for followers. Keywords are successful tools for indicating the text's topic. Whether searching for a specific hashtag on Twitter or typing a phrase in the search box, if targeted words and phrases are used correctly, posts and content will be placed in front of users' attention. Additionally, researching and analyzing hashtags can help with keyword strategy. For example, by examining popular hashtags, some keywords and popular topics of tweets can be identified. Then, similar content can be shared with those hashtags. In this study, we address the following questions:\u003c/p\u003e\n\u003col\u003e\n \u003cli\u003eWhat techniques can be employed to enhance keyword extraction processes?\u003c/li\u003e\n \u003cli\u003eDoes the combination of semantic graphs and numerical metrics increase the quality of extracted keywords?\u003c/li\u003e\n \u003cli\u003eHow can the relationship between extracted keywords and hashtags be calculated?\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003eIn section 2, we explore related works and research background. Section 3 explains the necessary actions for selecting the optimal keywords and keyword similarity measurement with hashtags, followed by segregating relevant and irrelevant tweets. Section 4 introduces the dataset, and Section 5 elaborates on evaluating the implemented methods for tweet classification. Finally, the results of this research are presented in Section 6.\u003c/p\u003e"},{"header":"2. Background of the research","content":"\u003cp\u003eIn a study by Figueiredo and Jorge, one of the fundamental challenges related to identifying hashtags relevant to topics in Twitter streams was discussed. This challenge arises due to the complexity and high volume of Twitter data. The paper presents TORHID (Topic-Oriented et al.), a technique aimed at modeling topics to retrieve and identify relevant hashtags for specific topics on Twitter. 
It initiates with a seed hashtag and employs a classifier to filter out irrelevant hashtags [\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e]. In 2019, Devika R. and colleagues introduced an efficient approach known as Graph-based Keyword Extraction from Twitter (SKEM), which leverages ranking methods. In this suggested model, following thorough preprocessing, a semantic graph model was created. Then, numerical graph metrics were used to measure the importance of nodes in the semantic graph. The PageRank algorithm was also employed to rank the nodes, identifying the top ten nodes that effectively represent the most influential ones [\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e]. In another study published by Shin et al. in 2022, the challenge presented for keyword extraction was examined. The primary aim of this article is to introduce a model or approach for unsupervised keyword extraction in text analysis. This research proposed a general keyword extraction model using logistic regression and the Least Absolute Shrinkage and Selection Operator (LASSO). This model uses a classification-based structure to learn words that distinguish document groups and, unlike conventional models, is less sensitive to repetitive word occurrences. Experiments show that this model performs acceptably well in keyword extraction with high representation and distinction from documents, even when applied to unlabeled document classes [\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e]. Bordoloi and Biswas introduced an innovative unsupervised technique for graph-based keyword extraction, termed Keyword Extraction with Collective Node Weight (KECNW). This method assesses the importance of a keyword collectively by taking into account various influential parameters. 
In this article, the authors describe keyword extraction based on node weight allocation, which includes centrality, neighbor power, and node position [\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e]. Pan and Rui addressed challenges in automatic keyword extraction from Chinese text by proposing an algorithm based on word clustering. They employed statistical methods to calculate keyword features, measured interdependence with pointwise mutual information, and constructed a quantification matrix for keyword features. Semantic similarity was assessed, and low-importance single words were eliminated. They utilized a Bayesian framework to reduce dimensionality, and clustering and classification methods to refine keyword extraction [\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e]. AlShammari and colleagues developed a keyword extraction program using TF-IDF in Python. This program performed the main stages of keyword extraction: text preprocessing, creating a word list, etc. The program was then tested on a sample text from Wikipedia and yielded the required outcomes [\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e]. Ning et al. proposed an enhanced TextRank keyword extraction algorithm, RDD-WRank, which leverages rough data reasoning in conjunction with word vector clustering. Initially, the method employs rough data reasoning to extract connections between candidate terms, broadening the exploration scope and producing more comprehensive outcomes. Subsequently, utilizing Wikipedia, word embedding techniques are integrated into the enhanced algorithm, and the word representations of TextRank nodes are grouped to adjust the voting significance of nodes within the cluster [\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e]. A new article by Ma et al. 
addressed the challenge of finding suitable optimization algorithms for keyword extraction from English texts using cluster analysis. The authors proposed a new method for community detection in networks using complex network theory. This method utilizes communities present in complex networks for text clustering. They also proposed a new keyword extraction algorithm based on complex networks, which improves the connectivity of textual vocabulary using eigenvalue computations. The study reported that this algorithm extracts text topics with high accuracy and speed [\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e]. L\u0026iacute;via and Michal explored the use of keyword extraction to identify trending COVID-19-related search queries and evaluate their potential inclusion in models for predicting fake news. Using the KeyBERT machine learning technique, they extracted keywords from true and fake news articles, subsequently employing them to generate relevant search queries using the Google Trends API [\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eThe fundamental challenges associated with keyword extraction from scientific reports related to earth and space sciences were examined by Qiu et al. Their main objective was to present a practical model or method for keyword extraction from scientific reports in this field using a graph-based algorithm. This paper presents an unsupervised algorithm known as Graph-based Keyword Extraction with Error Propagation (GKEEP), which improves upon graph-based keyword extraction methods through the utilization of an error propagation mechanism similar to feedback propagation [\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e]. Huang and Xie proposed an improved model of TextRank for extracting keywords from patent texts. 
This method extracts essential and relevant keywords from patent texts using the TextRank algorithm and prior common knowledge. The methodology involves establishing a TextRank framework for each patent document and a foundational knowledge graph sourced from public lexical resources. Subsequently, a metric is devised to assess the significance ranking of nodes within the TextRank frameworks of patent documents, incorporating prior contextual insights from the foundational knowledge graph. Finally, the extraction of patent keywords involves identifying the top k nodes ranked by their value [\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e]. A 2020 study focused on text classification methodologies utilizing keywords for feature extraction and introduced a text classification technique based on automatic extraction of keywords with genetic algorithm optimization. The study encompasses five distinct methods for keyword extraction (synchronicity statistical information, maximum frequency, TF-IDF, reactivity-driven keyword extraction, and TextRank) and assesses their performance against four standard text classification techniques: Naive Bayes, Random Forest, Support Vector Machine, and Linear Regression [\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e]. Kelsea et al. applied novel machine-learning techniques to social survey data, demonstrating their usefulness in identifying key variables in large, complex datasets. They used random forest classification and regression models to find significant predictors of household migration decisions. The results showed that random forests could capture nuances in migration predictors and outperform logistic regression and support vector machines in all cases analyzed. 
Thus, random forests and other machine-learning methods can enhance the predictive accuracy of migration models and uncover patterns in complex social datasets [\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e]. In 2018, Shi and colleagues presented Hashtagger+, an improved version of their earlier work, Hashtagger, which uses cold-start algorithms to expedite the recommendation process. Additionally, by proposing new data collection and feature computation techniques, they enhanced the performance and coverage of the advanced hashtag model [\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e]. Hashtagger employed a rank-based learning approach to model hashtag relationships and find relevant hashtags that stimulate the dissemination of news articles. In today's context, with the vast amount of data available and the diverse range of research information, data analysis has emerged as a crucial subject. These analyses themselves produce a large amount of data and information.\u003c/p\u003e\u003cp\u003eVOSviewer is a tool capable of generating networks representing scientific publications, academic journals, researchers, research institutions, countries, keywords, or phrases [\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e]. Figure\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e illustrates the results of evaluating a research network on \"Hashtag validation.\" This figure represents the evaluation results of over 1000 studies conducted in this field. The size of each circle reflects the number of studies carried out in the respective area, with larger circles indicating a greater volume of research conducted within that domain. In Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e, the examination results are displayed in a clustered manner. 
Seven clusters are formed and displayed with different colors. The size of each label is directly related to the number of studies conducted in that area: the more studies conducted in an area, the larger its label. Also, the thicker the connecting lines between the labels, the stronger the connection between them. Figure\u0026nbsp;3 indicates which areas have seen more research in recent years. This assessment is based on studies conducted in the past ten years. The areas with yellow and orange labels indicate that research has been directed towards these areas in recent years.\u003c/p\u003e\u003cp\u003eFigure 3. Results of network analysis of research conducted on hashtags from 2014 to 2024\u003c/p\u003e\u003cp\u003eIn our study, to foster innovation, we have emphasized the adoption of an efficient algorithm for keyword extraction, contributing to enhancements in the quality and relevance of tweets. Upon analyzing the words within tweets represented in a graph-like structure, we ascertain that the prominence and significance of a keyword on Twitter are contingent on several factors, including the eigenvector centrality, closeness centrality, betweenness centrality, and degree centrality of that keyword. These factors are regarded as pivotal influencers in our proposed extraction methodology.\u003c/p\u003e"},{"header":"3. Implementation Method","content":"\u003cp\u003eFigure 4 provides an overview of this process. The objective of this study is to develop a method for validating hashtags related to a specific topic with the utmost accuracy. 
Our proposed approach is segmented into four stages.\u003c/p\u003e\n\u003cul\u003e\n \u003cli\u003eKeyword extraction and graph construction\u003c/li\u003e\n \u003cli\u003eCalculation of the semantic relationship between keywords and the seed hashtag using three similarity metrics\u003c/li\u003e\n \u003cli\u003eLabeling and categorizing tweets based on their relevance to the seed hashtag\u003c/li\u003e\n \u003cli\u003eTraining the classifier on the data to classify tweets appropriately\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e\u003cstrong\u003e3.1 Keyword Extraction\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe most straightforward approach might be using frequency metrics to select essential keywords in a document. Given the typically subpar results associated with this method, our study presents a model for keyword extraction based on a semantic graph structure. This model incorporates various parameters such as closeness centrality, betweenness centrality, eigenvector centrality, and degree centrality of the graph nodes to calculate node weights.\u0026nbsp;The advantage of this model is that the extracted keywords are not only semantically relevant but also structurally important according to the graph analysis. The implementation of this model includes the following steps:\u003c/p\u003e\n\u003cp\u003ea) Preprocessing\u003c/p\u003e\n\u003cp\u003eb) Construction of a semantic graph\u003c/p\u003e\n\u003cp\u003ec) Calculation of node scores\u003c/p\u003e\n\u003cp\u003ed) Node selection\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e3.1.1 Preprocessing Phase\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eTwitter is a microblogging platform whose texts are usually unstructured because individuals mostly write conversationally, so tweets contain various symbols that carry little useful information. Therefore, their raw content is highly noisy for text mining tasks. 
Preprocessing is a process where surface-level text modifications are made. This stage is crucial as it reduces input noise. Moreover, efficient preprocessing significantly increases prediction accuracy. This is one of the critical reasons for enhancing the accuracy of the implemented model.\u003c/p\u003e\n\u003cp\u003ea) \u0026nbsp; Username, Retweets, and Links Removal:\u003c/p\u003e\n\u003cp\u003eTweets frequently include usernames preceded by the \u0026quot;@\u0026quot; symbol, and sometimes, they are retweeted by others and feature the \u0026quot;RT\u0026quot; symbol. These usernames, retweet symbols, and URL links are irrelevant for extracting keywords and introduce noise. Consequently, usernames and retweet symbols are eliminated.\u003c/p\u003e\n\u003cp\u003eb) \u0026nbsp; Numbers and Emphasized Words Removal:\u003c/p\u003e\n\u003cp\u003eEliminating numerical data from Twitter content improves the efficiency of extracting meaningful keywords from documents. Additionally, informal online communications frequently express emotions and sentiments by elongating words. For instance, \u0026quot;Gooooddd\u0026quot; instead of \u0026quot;Good\u0026quot; is a common occurrence on social media platforms. This elongation not only wastes space but also affects the word count. Given that frequency is the parameter of interest, it is crucial to handle such elongations appropriately.\u003c/p\u003e\n\u003cp\u003ec) \u0026nbsp; Converting All Letters to Lowercase:\u003c/p\u003e\n\u003cp\u003eWe convert text data to lowercase to achieve uniformity in the text. Similar words written in lowercase and uppercase can be considered separate words by the computer.\u003c/p\u003e\n\u003cp\u003ed) \u0026nbsp; Tokenization:\u003c/p\u003e\n\u003cp\u003eTokenization entails parsing text into smaller components, such as sentences or words, considering each component as a separate token. 
Various tokenization techniques can be executed based on language and modeling goals.\u003c/p\u003e\n\u003cp\u003ee) \u0026nbsp; Removing Punctuation Marks and Stop Words:\u003c/p\u003e\n\u003cp\u003ePunctuation marks are usually removed in keyword extraction models because they can substantially affect the models\u0026apos; performance. Similarly, stop words such as \u0026quot;the\u0026quot; are frequently excluded, although removing them can hinder the accurate identification of phrases such as \u0026quot;the who.\u0026quot;\u003c/p\u003e\n\u003cp\u003ef) \u0026nbsp; Replacing Negators with Antonyms:\u003c/p\u003e\n\u003cp\u003eSome expressions are formed with a negator, and treating the negator and the following word as separate tokens may not be the most effective approach. For instance, words like \u0026quot;not\u0026quot; and \u0026quot;never\u0026quot; should be considered together with the subsequent word, rather than separately. To address this, the negator and the word it modifies can be replaced with the relevant antonym to ensure the inclusion of appropriate and meaningful words in the dataset.\u003c/p\u003e\n\u003cp\u003eg) \u0026nbsp; Stemming and Lemmatization:\u003c/p\u003e\n\u003cp\u003eStemming reduces a word to its root form, while lemmatization groups the various forms of a word under a common base form, known as a lemma. For instance, words such as \u0026quot;gone,\u0026quot; \u0026quot;going,\u0026quot; and \u0026quot;went\u0026quot; are all transformed to \u0026quot;go.\u0026quot;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e3.1.2 Keyword Extraction\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eTerm Frequency-Inverse Document Frequency (TF-IDF) is a widely recognized technique for assessing the significance of a word in a document. Term Frequency (TF) measures how often a word appears in a document relative to the total number of words in that document. 
Since documents differ in length, a word may appear many more times in a longer document than in a shorter one. Therefore, this metric is normalized over the length of the document. Inverse Document Frequency (IDF) is the logarithm of the total number of documents divided by the number of documents containing a particular word. The importance of a word in a document is then expressed as its weight, which is equal to the product of the term frequency and the inverse document frequency. After calculating the weights of all words in the input document, words with the highest weights are introduced as candidate keywords. Table 1 presents information related to the weights of keywords.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e3.1.3 Semantic Graph Construction Phase\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eExtracting keywords from a graph representation of free text has several advantages. The clearest is that the applied algorithms and their expected results are easy to understand visually. Another advantage is the adaptability of graph representation, which encompasses much more than simple word co-occurrences, such as semantic relationships, syntactic dependencies, contextual associations, and more. Incorporating these pieces of information enriches the representation explicitly and allows algorithms to use them independently or specifically. In this stage, constructing the semantic graph involves assigning vertices and connecting edges between vertices. Let G = (V, E) be a weighted undirected graph. A graph is a structure that delineates relationships among elements of a set, where V represents a collection of elements known as vertices or nodes, and E denotes a set of connections between vertices referred to as edges. In our semantic graph, graph nodes represent unique keywords, and edges connect keywords with semantic relationships. 
The weight of an edge is equal to the level of similarity between two keywords. This similarity is calculated through WordNet. The desired graph is depicted in Figure 2. WordNet seamlessly integrates traditional lexical information with modern computational methods, establishing semantic relationships that connect sets of synonyms.\u003c/p\u003e\n\u003cp\u003eTable 1. Weight of pre-processed keywords calculated by TF-IDF method\u003c/p\u003e\n\u003cdiv align=\"Left\"\u003e\n \u003ctable dir=\"rtl\" border=\"1\" cellspacing=\"0\" cellpadding=\"0\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 180px;\"\u003e\n \u003cp dir=\"LTR\"\u003e\u003cstrong\u003eTf-idf\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 66px;\"\u003e\n \u003cp dir=\"LTR\"\u003e\u003cstrong\u003eWords\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 204px;\"\u003e\n \u003cp dir=\"LTR\"\u003e\u003cstrong\u003eIdf-weight\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 84px;\"\u003e\n \u003cp dir=\"LTR\"\u003e\u003cstrong\u003eWords\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 30px;\"\u003e\n \u003cp dir=\"LTR\"\u003e\u003cstrong\u003eNo.\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 180px;\"\u003e\n \u003cp dir=\"LTR\"\u003e0.469551\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 66px;\"\u003e\n \u003cp dir=\"LTR\"\u003evaccin\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 204px;\"\u003e\n \u003cp dir=\"LTR\"\u003e13.114407\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 84px;\"\u003e\n \u003cp dir=\"LTR\"\u003evaccin\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 30px;\"\u003e\n \u003cp dir=\"LTR\"\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 180px;\"\u003e\n \u003cp 
dir=\"LTR\"\u003e0.36006\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 66px;\"\u003e\n \u003cp dir=\"LTR\"\u003epeopl\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 204px;\"\u003e\n \u003cp dir=\"LTR\"\u003e13.114407\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 84px;\"\u003e\n \u003cp dir=\"LTR\"\u003epeopl\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 30px;\"\u003e\n \u003cp dir=\"LTR\"\u003e2\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 180px;\"\u003e\n \u003cp dir=\"LTR\"\u003e0.323352\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 66px;\"\u003e\n \u003cp dir=\"LTR\"\u003eshot\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 204px;\"\u003e\n \u003cp dir=\"LTR\"\u003e13.114407\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 84px;\"\u003e\n \u003cp dir=\"LTR\"\u003eshot\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 30px;\"\u003e\n \u003cp dir=\"LTR\"\u003e3\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 180px;\"\u003e\n \u003cp dir=\"LTR\"\u003e0.312904\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 66px;\"\u003e\n \u003cp dir=\"LTR\"\u003eday\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 204px;\"\u003e\n \u003cp dir=\"LTR\"\u003e13.114407\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 84px;\"\u003e\n \u003cp dir=\"LTR\"\u003eday\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 30px;\"\u003e\n \u003cp dir=\"LTR\"\u003e4\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 180px;\"\u003e\n \u003cp dir=\"LTR\"\u003e0.304243\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 66px;\"\u003e\n \u003cp dir=\"LTR\"\u003ecovid\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 204px;\"\u003e\n \u003cp dir=\"LTR\"\u003e13.114407\u003c/p\u003e\n 
\u003c/td\u003e\n \u003ctd style=\"width: 84px;\"\u003e\n \u003cp dir=\"LTR\"\u003ecovid\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 30px;\"\u003e\n \u003cp dir=\"LTR\"\u003e5\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 180px;\"\u003e\n \u003cp dir=\"LTR\"\u003e0.0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 66px;\"\u003e\n \u003cp dir=\"LTR\"\u003esay\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 204px;\"\u003e\n \u003cp dir=\"LTR\"\u003e3.671171\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 84px;\"\u003e\n \u003cp dir=\"LTR\"\u003esay\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 30px;\"\u003e\n \u003cp dir=\"LTR\"\u003e6\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 180px;\"\u003e\n \u003cp dir=\"LTR\"\u003e0.0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 66px;\"\u003e\n \u003cp dir=\"LTR\"\u003eyou\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 204px;\"\u003e\n \u003cp dir=\"LTR\"\u003e3.309552\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 84px;\"\u003e\n \u003cp dir=\"LTR\"\u003eu\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 30px;\"\u003e\n \u003cp dir=\"LTR\"\u003e7\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 180px;\"\u003e\n \u003cp dir=\"LTR\"\u003e0.0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 66px;\"\u003e\n \u003cp dir=\"LTR\"\u003ego\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 204px;\"\u003e\n \u003cp dir=\"LTR\"\u003e2.940911\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 84px;\"\u003e\n \u003cp dir=\"LTR\"\u003ego\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 30px;\"\u003e\n \u003cp dir=\"LTR\"\u003e8\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd 
style=\"width: 180px;\"\u003e\n \u003cp dir=\"LTR\"\u003e0.0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 66px;\"\u003e\n \u003cp dir=\"LTR\"\u003eamp\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 204px;\"\u003e\n \u003cp dir=\"LTR\"\u003e2.452431\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 84px;\"\u003e\n \u003cp dir=\"LTR\"\u003eamp\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 30px;\"\u003e\n \u003cp dir=\"LTR\"\u003e9\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 180px;\"\u003e\n \u003cp dir=\"LTR\"\u003e0.0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 66px;\"\u003e\n \u003cp dir=\"LTR\"\u003ejab\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 204px;\"\u003e\n \u003cp dir=\"LTR\"\u003e2.113741\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 84px;\"\u003e\n \u003cp dir=\"LTR\"\u003ejab\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 30px;\"\u003e\n \u003cp dir=\"LTR\"\u003e10\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n\u003c/div\u003e\n\u003cp\u003e\u003cstrong\u003e3.1.4 Node Score Calculation Phase\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eIn graph theory, centrality measures identify the most important nodes in a graph and are used to rank them. As previously stated, the node weight is determined from four metrics that are well suited to gauging the significance of a keyword: (a) betweenness centrality, (b) eigenvector centrality, (c) closeness centrality, and (d) degree centrality. In general, centrality expresses how important a node is within a network. Degree centrality quantifies the number of links a node possesses. 
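\u003c/p\u003e\n\u003cp\u003eTwo of these measures can be sketched in a few lines of Python. The snippet below is only an illustration under stated assumptions: the graph is an adjacency dictionary, edge weights are ignored (distances are hop counts), degree is normalized by the number of other nodes, and closeness is taken as the reciprocal of the mean distance to reachable nodes. The function names and the toy star graph are hypothetical and do not reproduce the graph used in this study.\u003c/p\u003e

```python
from collections import deque


def degree_centrality(graph):
    """Fraction of the other nodes each node is directly linked to."""
    n = len(graph)
    if n < 2:
        return {node: 0.0 for node in graph}
    return {node: len(nbrs) / (n - 1) for node, nbrs in graph.items()}


def closeness_centrality(graph):
    """Reciprocal of the mean hop distance to the nodes reachable from each node."""
    scores = {}
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from the source node
            node = queue.popleft()
            for nbr in graph[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
        total = sum(dist.values())
        scores[source] = (len(dist) - 1) / total if total else 0.0
    return scores


# Hypothetical star graph: "vaccin" is linked to every other keyword.
toy = {
    "vaccin": {"dose", "shot", "week"},
    "dose": {"vaccin"},
    "shot": {"vaccin"},
    "week": {"vaccin"},
}
print(degree_centrality(toy))
print(closeness_centrality(toy))
```

\u003cp\u003eIn the star graph the hub receives the highest score under both measures, while each leaf is penalized for reaching the other leaves only through the hub.\u003c/p\u003e\n\u003cp\u003e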
Betweenness centrality assesses a node\u0026apos;s importance by the fraction of shortest paths between other pairs of nodes that pass through it. Closeness centrality is based on the average shortest-path distance from a node to all other nodes in the graph: the smaller this average distance, the higher the closeness. Eigenvector centrality measures the influence of a node in a network. Connections to nodes with higher scores carry more weight in determining the score of the target node, while connections to nodes with lower scores have a comparatively lesser impact. This principle guides the allocation of relative scores to all nodes in the graph (Table 2).\u003c/p\u003e\n\u003cp\u003eTable 2. Criteria calculated for each node along with their final weight\u003c/p\u003e\n\u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\" width=\"623\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 80px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eNo.\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 88px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eKeywords\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 96px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eEigenvector\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 84px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eDegree\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 87px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eCloseness\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 101px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eBetweenness\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 87px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eNode weight\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 80px;\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 88px;\"\u003e\n \u003cp\u003evaccin\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 96px;\"\u003e\n 
\u003cp\u003e0.83\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 84px;\"\u003e\n \u003cp\u003e0.81\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 87px;\"\u003e\n \u003cp\u003e0.83\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 101px;\"\u003e\n \u003cp\u003e0.60\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 87px;\"\u003e\n \u003cp\u003e3.07\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 80px;\"\u003e\n \u003cp\u003e2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 88px;\"\u003e\n \u003cp\u003ecovid\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 96px;\"\u003e\n \u003cp\u003e0.83\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 84px;\"\u003e\n \u003cp\u003e0.81\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 87px;\"\u003e\n \u003cp\u003e0.83\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 101px;\"\u003e\n \u003cp\u003e0.60\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 87px;\"\u003e\n \u003cp\u003e3.07\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 80px;\"\u003e\n \u003cp\u003e3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 88px;\"\u003e\n \u003cp\u003epeopl\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 96px;\"\u003e\n \u003cp\u003e0.78\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 84px;\"\u003e\n \u003cp\u003e0.68\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 87px;\"\u003e\n \u003cp\u003e0.58\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 101px;\"\u003e\n \u003cp\u003e0.27\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 87px;\"\u003e\n \u003cp\u003e2.63\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 80px;\"\u003e\n \u003cp\u003e4\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 88px;\"\u003e\n \u003cp\u003eneed\u003c/p\u003e\n \u003c/td\u003e\n 
\u003ctd style=\"width: 96px;\"\u003e\n \u003cp\u003e0.55\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 84px;\"\u003e\n \u003cp\u003e0.78\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 87px;\"\u003e\n \u003cp\u003e0.55\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 101px;\"\u003e\n \u003cp\u003e0.66\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 87px;\"\u003e\n \u003cp\u003e2.41\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 80px;\"\u003e\n \u003cp\u003e5\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 88px;\"\u003e\n \u003cp\u003edose\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 96px;\"\u003e\n \u003cp\u003e0.67\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 84px;\"\u003e\n \u003cp\u003e0.62\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 87px;\"\u003e\n \u003cp\u003e0.67\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 101px;\"\u003e\n \u003cp\u003e0.40\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 87px;\"\u003e\n \u003cp\u003e2.36\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 80px;\"\u003e\n \u003cp\u003e6\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 88px;\"\u003e\n \u003cp\u003eweek\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 96px;\"\u003e\n \u003cp\u003e0.67\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 84px;\"\u003e\n \u003cp\u003e0.62\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 87px;\"\u003e\n \u003cp\u003e0.67\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 101px;\"\u003e\n \u003cp\u003e0.40\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 87px;\"\u003e\n \u003cp\u003e2.36\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 80px;\"\u003e\n \u003cp\u003e7\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 88px;\"\u003e\n 
\u003cp\u003ehealth\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 96px;\"\u003e\n \u003cp\u003e0.41\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 84px;\"\u003e\n \u003cp\u003e0.77\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 87px;\"\u003e\n \u003cp\u003e0.78\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 101px;\"\u003e\n \u003cp\u003e0.30\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 87px;\"\u003e\n \u003cp\u003e2.26\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 80px;\"\u003e\n \u003cp\u003e8\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 88px;\"\u003e\n \u003cp\u003eeffect\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 96px;\"\u003e\n \u003cp\u003e0.41\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 84px;\"\u003e\n \u003cp\u003e0.77\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 87px;\"\u003e\n \u003cp\u003e0.78\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 101px;\"\u003e\n \u003cp\u003e0.30\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 87px;\"\u003e\n \u003cp\u003e2.26\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 80px;\"\u003e\n \u003cp\u003e9\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 88px;\"\u003e\n \u003cp\u003ereceiv\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 96px;\"\u003e\n \u003cp\u003e0.67\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 84px;\"\u003e\n \u003cp\u003e0.62\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 87px;\"\u003e\n \u003cp\u003e0.67\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 101px;\"\u003e\n \u003cp\u003e0.00\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 87px;\"\u003e\n \u003cp\u003e2.26\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 80px;\"\u003e\n \u003cp\u003e10\u003c/p\u003e\n \u003c/td\u003e\n 
\u003ctd style=\"width: 88px;\"\u003e\n \u003cp\u003eshot\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 96px;\"\u003e\n \u003cp\u003e0.34\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 84px;\"\u003e\n \u003cp\u003e0.53\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 87px;\"\u003e\n \u003cp\u003e0.62\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 101px;\"\u003e\n \u003cp\u003e0.05\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 87px;\"\u003e\n \u003cp\u003e1.99\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 80px;\"\u003e\n \u003cp\u003e11\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 88px;\"\u003e\n \u003cp\u003eday\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 96px;\"\u003e\n \u003cp\u003e0.67\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 84px;\"\u003e\n \u003cp\u003e0.62\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 87px;\"\u003e\n \u003cp\u003e0.67\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 101px;\"\u003e\n \u003cp\u003e0.00\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 87px;\"\u003e\n \u003cp\u003e1.96\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 80px;\"\u003e\n \u003cp\u003e12\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 88px;\"\u003e\n \u003cp\u003eamp\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 96px;\"\u003e\n \u003cp\u003e0.67\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 84px;\"\u003e\n \u003cp\u003e0.62\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 87px;\"\u003e\n \u003cp\u003e0.67\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 101px;\"\u003e\n \u003cp\u003e0.00\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 87px;\"\u003e\n \u003cp\u003e1.96\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 80px;\"\u003e\n 
\u003cp\u003e13\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 88px;\"\u003e\n \u003cp\u003etake\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 96px;\"\u003e\n \u003cp\u003e0.62\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 84px;\"\u003e\n \u003cp\u003e0.59\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 87px;\"\u003e\n \u003cp\u003e0.69\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 101px;\"\u003e\n \u003cp\u003e0.00\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 87px;\"\u003e\n \u003cp\u003e1.96\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 80px;\"\u003e\n \u003cp\u003e14\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 88px;\"\u003e\n \u003cp\u003eget\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 96px;\"\u003e\n \u003cp\u003e0.62\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 84px;\"\u003e\n \u003cp\u003e0.67\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 87px;\"\u003e\n \u003cp\u003e0.67\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 101px;\"\u003e\n \u003cp\u003e0.00\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 87px;\"\u003e\n \u003cp\u003e1.96\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 80px;\"\u003e\n \u003cp\u003e15\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 88px;\"\u003e\n \u003cp\u003eplea\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 96px;\"\u003e\n \u003cp\u003e0.78\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 84px;\"\u003e\n \u003cp\u003e0.40\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 87px;\"\u003e\n \u003cp\u003e0.78\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 101px;\"\u003e\n \u003cp\u003e0.30\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 87px;\"\u003e\n \u003cp\u003e1.90\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n 
\u003ctd style=\"width: 80px;\"\u003e\n \u003cp\u003e16\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 88px;\"\u003e\n \u003cp\u003ework\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 96px;\"\u003e\n \u003cp\u003e0.58\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 84px;\"\u003e\n \u003cp\u003e0.64\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 87px;\"\u003e\n \u003cp\u003e0.58\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 101px;\"\u003e\n \u003cp\u003e0.10\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 87px;\"\u003e\n \u003cp\u003e1.90\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 80px;\"\u003e\n \u003cp\u003e17\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 88px;\"\u003e\n \u003cp\u003etoday\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 96px;\"\u003e\n \u003cp\u003e0.58\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 84px;\"\u003e\n \u003cp\u003e0.64\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 87px;\"\u003e\n \u003cp\u003e0.58\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 101px;\"\u003e\n \u003cp\u003e0.10\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 87px;\"\u003e\n \u003cp\u003e1.90\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 80px;\"\u003e\n \u003cp\u003e18\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 88px;\"\u003e\n \u003cp\u003elike\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 96px;\"\u003e\n \u003cp\u003e0.59\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 84px;\"\u003e\n \u003cp\u003e0.62\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 87px;\"\u003e\n \u003cp\u003e0.64\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 101px;\"\u003e\n \u003cp\u003e0.00\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 87px;\"\u003e\n \u003cp\u003e1.90\u003c/p\u003e\n \u003c/td\u003e\n 
\u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 80px;\"\u003e\n \u003cp\u003e19\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 88px;\"\u003e\n \u003cp\u003esay\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 96px;\"\u003e\n \u003cp\u003e0.62\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 84px;\"\u003e\n \u003cp\u003e0.62\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 87px;\"\u003e\n \u003cp\u003e0.64\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 101px;\"\u003e\n \u003cp\u003e0.00\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 87px;\"\u003e\n \u003cp\u003e1.85\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 80px;\"\u003e\n \u003cp\u003e20\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 88px;\"\u003e\n \u003cp\u003eshow\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 96px;\"\u003e\n \u003cp\u003e0.55\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 84px;\"\u003e\n \u003cp\u003e0.48\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 87px;\"\u003e\n \u003cp\u003e0.50\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 101px;\"\u003e\n \u003cp\u003e0.06\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 87px;\"\u003e\n \u003cp\u003e1.83\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 80px;\"\u003e\n \u003cp\u003e21\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 88px;\"\u003e\n \u003cp\u003egot\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 96px;\"\u003e\n \u003cp\u003e0.58\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 84px;\"\u003e\n \u003cp\u003e0.64\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 87px;\"\u003e\n \u003cp\u003e0.58\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 101px;\"\u003e\n \u003cp\u003e0.10\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 87px;\"\u003e\n 
\u003cp\u003e1.59\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 80px;\"\u003e\n \u003cp\u003e22\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 88px;\"\u003e\n \u003cp\u003ethrough\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 96px;\"\u003e\n \u003cp\u003e0.58\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 84px;\"\u003e\n \u003cp\u003e0.33\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 87px;\"\u003e\n \u003cp\u003e0.58\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 101px;\"\u003e\n \u003cp\u003e0.01\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 87px;\"\u003e\n \u003cp\u003e1.50\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 80px;\"\u003e\n \u003cp\u003e23\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 88px;\"\u003e\n \u003cp\u003ego\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 96px;\"\u003e\n \u003cp\u003e0.12\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 84px;\"\u003e\n \u003cp\u003e0.40\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 87px;\"\u003e\n \u003cp\u003e0.45\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 101px;\"\u003e\n \u003cp\u003e0.20\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 87px;\"\u003e\n \u003cp\u003e1.17\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 80px;\"\u003e\n \u003cp\u003e24\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 88px;\"\u003e\n \u003cp\u003ejab\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 96px;\"\u003e\n \u003cp\u003e0.64\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 84px;\"\u003e\n \u003cp\u003e0.62\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 87px;\"\u003e\n \u003cp\u003e0.64\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 101px;\"\u003e\n \u003cp\u003e0.00\u003c/p\u003e\n \u003c/td\u003e\n 
\u003ctd style=\"width: 87px;\"\u003e\n \u003cp\u003e0.91\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 80px;\"\u003e\n \u003cp\u003e25\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 88px;\"\u003e\n \u003cp\u003eappoint\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 96px;\"\u003e\n \u003cp\u003e0.33\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 84px;\"\u003e\n \u003cp\u003e0.27\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 87px;\"\u003e\n \u003cp\u003e0.25\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 101px;\"\u003e\n \u003cp\u003e0.00\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 87px;\"\u003e\n \u003cp\u003e0.85\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 80px;\"\u003e\n \u003cp\u003e26\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 88px;\"\u003e\n \u003cp\u003eu\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 96px;\"\u003e\n \u003cp\u003e0.35\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 84px;\"\u003e\n \u003cp\u003e0.07\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 87px;\"\u003e\n \u003cp\u003e0.27\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 101px;\"\u003e\n \u003cp\u003e0.00\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 87px;\"\u003e\n \u003cp\u003e0.69\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 80px;\"\u003e\n \u003cp\u003e27\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 88px;\"\u003e\n \u003cp\u003eone\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 96px;\"\u003e\n \u003cp\u003e0.07\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 84px;\"\u003e\n \u003cp\u003e0.35\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 87px;\"\u003e\n \u003cp\u003e0.23\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 101px;\"\u003e\n 
\u003cp\u003e0.00\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 87px;\"\u003e\n \u003cp\u003e0.65\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 80px;\"\u003e\n \u003cp\u003e28\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 88px;\"\u003e\n \u003cp\u003ethank\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 96px;\"\u003e\n \u003cp\u003e0.00\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 84px;\"\u003e\n \u003cp\u003e0.00\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 87px;\"\u003e\n \u003cp\u003e0.00\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 101px;\"\u003e\n \u003cp\u003e0.00\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 87px;\"\u003e\n \u003cp\u003e0.00\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003eTable 3. Keywords selected by PageRank algorithm\u003c/p\u003e\n\u003cdiv align=\"Left\"\u003e\n \u003ctable dir=\"rtl\" border=\"1\" cellspacing=\"0\" cellpadding=\"0\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 209px;\"\u003e\n \u003cp dir=\"LTR\"\u003e\u003cstrong\u003ePR val\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 208px;\"\u003e\n \u003cp dir=\"LTR\"\u003e\u003cstrong\u003eTop words\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 69px;\"\u003e\n \u003cp dir=\"LTR\"\u003e\u003cstrong\u003eNo.\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 209px;\"\u003e\n \u003cp dir=\"LTR\"\u003e0.032715\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 208px;\"\u003e\n \u003cp dir=\"LTR\"\u003evaccin\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 69px;\"\u003e\n \u003cp dir=\"LTR\"\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 209px;\"\u003e\n \u003cp 
dir=\"LTR\"\u003e0.032520\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 208px;\"\u003e\n \u003cp dir=\"LTR\"\u003ecovid\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 69px;\"\u003e\n \u003cp dir=\"LTR\"\u003e2\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 209px;\"\u003e\n \u003cp dir=\"LTR\"\u003e0.032026\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 208px;\"\u003e\n \u003cp dir=\"LTR\"\u003epeopl\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 69px;\"\u003e\n \u003cp dir=\"LTR\"\u003e3\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 209px;\"\u003e\n \u003cp dir=\"LTR\"\u003e0.032026\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 208px;\"\u003e\n \u003cp dir=\"LTR\"\u003eneed\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 69px;\"\u003e\n \u003cp dir=\"LTR\"\u003e4\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 209px;\"\u003e\n \u003cp dir=\"LTR\"\u003e0.030517\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 208px;\"\u003e\n \u003cp dir=\"LTR\"\u003edose\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 69px;\"\u003e\n \u003cp dir=\"LTR\"\u003e5\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 209px;\"\u003e\n \u003cp dir=\"LTR\"\u003e0.028796\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 208px;\"\u003e\n \u003cp dir=\"LTR\"\u003eweek\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 69px;\"\u003e\n \u003cp dir=\"LTR\"\u003e6\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 209px;\"\u003e\n \u003cp dir=\"LTR\"\u003e0.026645\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 208px;\"\u003e\n \u003cp dir=\"LTR\"\u003ehealth\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 69px;\"\u003e\n \u003cp 
dir=\"LTR\"\u003e7\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 209px;\"\u003e\n \u003cp dir=\"LTR\"\u003e0.026645\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 208px;\"\u003e\n \u003cp dir=\"LTR\"\u003ereceiv\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 69px;\"\u003e\n \u003cp dir=\"LTR\"\u003e8\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 209px;\"\u003e\n \u003cp dir=\"LTR\"\u003e0.026162\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 208px;\"\u003e\n \u003cp dir=\"LTR\"\u003eget\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 69px;\"\u003e\n \u003cp dir=\"LTR\"\u003e9\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 209px;\"\u003e\n \u003cp dir=\"LTR\"\u003e0.025675\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 208px;\"\u003e\n \u003cp dir=\"LTR\"\u003eshot\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 69px;\"\u003e\n \u003cp dir=\"LTR\"\u003e10\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 209px;\"\u003e\n \u003cp dir=\"LTR\"\u003e0.023077\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 208px;\"\u003e\n \u003cp dir=\"LTR\"\u003eamp\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 69px;\"\u003e\n \u003cp dir=\"LTR\"\u003e11\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 209px;\"\u003e\n \u003cp dir=\"LTR\"\u003e0.023077\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 208px;\"\u003e\n \u003cp dir=\"LTR\"\u003etake\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 69px;\"\u003e\n \u003cp dir=\"LTR\"\u003e12\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 209px;\"\u003e\n \u003cp dir=\"LTR\"\u003e0.023036\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 208px;\"\u003e\n \u003cp 
dir=\"LTR\"\u003eplea\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 69px;\"\u003e\n \u003cp dir=\"LTR\"\u003e13\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 209px;\"\u003e\n \u003cp dir=\"LTR\"\u003e0.021601\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 208px;\"\u003e\n \u003cp dir=\"LTR\"\u003eday\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 69px;\"\u003e\n \u003cp dir=\"LTR\"\u003e14\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 209px;\"\u003e\n \u003cp dir=\"LTR\"\u003e0.020829\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 208px;\"\u003e\n \u003cp dir=\"LTR\"\u003eeffect\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 69px;\"\u003e\n \u003cp dir=\"LTR\"\u003e15\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n\u003c/div\u003e\n\u003cp\u003e\u003cstrong\u003e3.1.5 Node Selection Phase\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eIn this stage of the algorithm, PageRank is used to find the top keywords based on their importance. PageRank is an algorithm for determining the importance of nodes in a graph. Under PageRank, the score of a node depends on the links (votes) it receives and on the scores of the nodes casting those votes. Moreover, given the multifaceted nature of keyword importance on Twitter, we used the four centrality metrics calculated in the previous stage to assess the significance of the nodes. 
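\u003c/p\u003e\n\u003cp\u003eThe voting idea can be illustrated with a short power-iteration sketch in Python. This is the standard weighted PageRank under assumed settings (damping factor 0.85, nested-dictionary graph, a hypothetical toy graph); it does not attempt to reproduce how the centrality metrics are combined in the proposed method, and the names \u003ccode\u003epagerank\u003c/code\u003e and \u003ccode\u003etoy\u003c/code\u003e are illustrative.\u003c/p\u003e

```python
def pagerank(graph, damping=0.85, tol=1e-10, max_iter=200):
    """Weighted PageRank by power iteration on {node: {neighbor: weight}}."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    out_weight = {node: sum(graph[node].values()) for node in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without edges is spread evenly over all nodes.
        dangling = sum(rank[v] for v in nodes if out_weight[v] == 0)
        new_rank = {}
        for node in nodes:
            # Each neighbor votes with its own rank, split by edge weight.
            incoming = sum(
                rank[nbr] * graph[nbr][node] / out_weight[nbr]
                for nbr in graph[node]
            )
            new_rank[node] = (1 - damping) / n + damping * (incoming + dangling / n)
        converged = sum(abs(new_rank[v] - rank[v]) for v in nodes) < tol
        rank = new_rank
        if converged:
            break
    return rank


# Hypothetical undirected toy graph with unit edge weights.
toy = {
    "vaccin": {"dose": 1.0, "shot": 1.0, "week": 1.0},
    "dose": {"vaccin": 1.0},
    "shot": {"vaccin": 1.0},
    "week": {"vaccin": 1.0},
}
ranks = pagerank(toy)
top = sorted(ranks, key=ranks.get, reverse=True)
print(top[0], round(ranks[top[0]], 4))
```

\u003cp\u003eSorting the nodes by their final rank, as in the last lines of the sketch, is one simple way to select the top keywords.\u003c/p\u003e\n\u003cp\u003e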
In our proposed method, these four metrics are combined and passed to the PageRank algorithm so that it can identify the top keywords by considering all aspects of the relationships between nodes (Table 3).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e3.2 Calculating Semantic Relationship Between Keywords and Hashtag\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eAfter the most commonly used hashtags were extracted from the dataset with the NLTK library, the semantic relationship between hashtags and keywords was calculated for validation using three methods:\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e3.2.1 WordNet\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eAs mentioned earlier, WordNet is a lexical resource that groups words into synonym sets (synsets). In its sentiment-annotated extension, SentiWordNet, each synset is associated with three numerical scores describing how objective, positive, or negative the terms in the set are. One of its simplest uses is to enrich text representations in text mining tasks with information about the sentiment-related properties of the terms in the text.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e3.2.2 Cosine Similarity\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eCosine similarity is widely used in natural language processing to approximate the similarity of document vectors regardless of their magnitude. Mathematically, it is the cosine of the angle between two vectors in a multi-dimensional space. Simply put, we determine the similarity between a hashtag and a keyword by examining the angle between their respective vectors. 
A cosine similarity score of +1 signifies perfect alignment between the vectors, while a score of -1 indicates that they are diametrically opposed.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e3.2.3 Pearson Correlation\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe Pearson correlation coefficient evaluates the relationship between two variables by quantifying both the direction and the strength of their linear correlation. It serves as a mathematical index for distributions involving more than one variable. If the values of two variables change together, so that as one increases or decreases the other also increases or decreases in a way that can be expressed by an equation, we say that the two variables are correlated. The Pearson correlation score measures how closely two data sets follow a straight line, on a scale from -1 to +1. The coefficient has two parts: a sign and a numerical value. The sign indicates whether the relationship between the two variables is positive or negative, while the magnitude indicates the strength of the linear relationship.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e3.3 Separation of relevant and irrelevant tweets with hashtags\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eOnce the similarity of each keyword to the hashtags has been determined in the previous stage, the data are separated.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e3.3.1 Labeling tweets\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eSince our dataset consists of many tweets with various hashtags, human annotators were needed to label the tweets manually. Because our focus is specifically on the coronavirus vaccine, tweets whose content aligned with the hashtag received a \u0026quot;yes\u0026quot; label, while those whose content did not align received a \u0026quot;no\u0026quot; label. 
This stage was implemented to see whether tweet content could be predicted from the extracted keywords.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e3.3.2 Classifier training\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eIn the previous stage, the dataset was divided into two classes labeled \u0026quot;yes\u0026quot; and \u0026quot;no.\u0026quot; This stage therefore requires binary classification algorithms. Initially, we employed a Support Vector Classifier (SVC), the classification variant of the support vector machine, a supervised machine learning algorithm. SVC categorizes input data by mapping it to a high-dimensional space and determining the optimal hyperplane that separates the data into two classes. In a second attempt, we used Random Forest, expecting it to perform better. This algorithm comprises multiple decision trees trained on different subsets of the dataset and aggregates their predictions to enhance accuracy. Unlike a single decision tree, Random Forest predicts the final outcome by majority voting among the trees. Increasing the number of trees in the forest generally improves accuracy and reduces the risk of overfitting.\u003c/p\u003e"},{"header":"4. Dataset and evaluation","content":"\u003cp\u003eThe dataset used in this research consists of 364,964 tweets with various hashtags related to the coronavirus and was obtained for analysis and evaluation from reputable sources such as the Kaggle website. The oldest tweets in this dataset date back to October 1, 2019. The dataset underwent a redesign on March 20, 2020, and includes features such as usernames, follower count, friend count, like count, retweet count, active hashtags, and geographic location. Because our research focuses explicitly on the coronavirus vaccine, we conducted the study on the #CovidVaccine hashtag, which is the most frequent hashtag in this dataset. 
The purpose of the evaluation is to examine to what extent the proposed method successfully validates the desired hashtag and its association with tweets. The algorithms were implemented in Python 3.9, with Jupyter Notebook as the development environment, on Windows 11. We start by evaluating keyword extraction with TF-IDF, then move on to the two classification methods described in the previous section.\u003c/p\u003e\n\u003cp\u003eThe values of the matrix (Figure 9) indicate that the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm identified 3,250 keywords in the \u0026quot;Yes\u0026quot; category. According to the report in Figure 5, this algorithm achieved an accuracy of only 70%. Likewise, precision and recall are 0.72 and 0.71, respectively (Table 4). TF-IDF alone is therefore not sufficient for distinguishing between the two categories, which is why we used centrality metrics.\u003c/p\u003e\n\u003cp\u003eNext, we evaluate the performance of the Support Vector Classifier (SVC). Overall, this algorithm made 23,437 correct predictions and 8,563 incorrect predictions (Figure 10), that is, 23,437 of 32,000, or about 73%. In our training set, we had 12,544 tweets with the \u0026quot;No\u0026quot; label and 19,456 tweets with the \u0026quot;Yes\u0026quot; label (about 39% and 61%), a relatively balanced distribution. The ROC curve is a standard method for evaluating the quality of binary classification. It plots the true positive rate against the false positive rate across decision thresholds, and the AUC (Area Under the Curve) measures separability. The closer the AUC is to 1, the better the model. According to Figure 11, our model\u0026apos;s AUC is 0.96, indicating good results. 
Unlike accuracy, ROC curves are insensitive to class imbalance, although our training set was relatively balanced.\u003c/p\u003e\n\u003cp\u003eThe accuracy score indicates that the Support Vector Classifier classified 73% of our data correctly (Figures 11 and 13). The precision value means that 84% of the tweets predicted as \u0026quot;Yes\u0026quot; were truly relevant to the desired hashtag. The recall value indicates that our model identified 69% of the tweets with the \u0026quot;Yes\u0026quot; label (Table 4).\u003c/p\u003e\n\u003cp\u003eTable 4. Classification report results for the TF-IDF, SVC, and Random Forest algorithms\u003c/p\u003e\n\u003cdiv align=\"Left\"\u003e\n \u003ctable dir=\"rtl\" border=\"1\" cellspacing=\"0\" cellpadding=\"0\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 110px;\"\u003e\n \u003cp dir=\"LTR\"\u003eF1-score\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 121px;\"\u003e\n \u003cp dir=\"LTR\"\u003eRecall\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 142px;\"\u003e\n \u003cp dir=\"LTR\"\u003ePrecision\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 140px;\"\u003e\n \u003cp dir=\"LTR\"\u003eAccuracy\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 143px;\"\u003e\n \u003cp dir=\"LTR\"\u003eClassifiers\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 110px;\"\u003e\n \u003cp dir=\"LTR\"\u003e0.71\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 121px;\"\u003e\n \u003cp dir=\"LTR\"\u003e0.71\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 142px;\"\u003e\n \u003cp dir=\"LTR\"\u003e0.72\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 140px;\"\u003e\n \u003cp dir=\"LTR\"\u003e0.70\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 143px;\"\u003e\n \u003cp dir=\"LTR\"\u003eTF-IDF\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 110px;\"\u003e\n \u003cp 
dir=\"LTR\"\u003e0.76\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 121px;\"\u003e\n \u003cp dir=\"LTR\"\u003e0.69\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 142px;\"\u003e\n \u003cp dir=\"LTR\"\u003e0.84\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 140px;\"\u003e\n \u003cp dir=\"LTR\"\u003e0.73\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 143px;\"\u003e\n \u003cp dir=\"LTR\"\u003eSVC\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 110px;\"\u003e\n \u003cp dir=\"LTR\"\u003e0.97\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 121px;\"\u003e\n \u003cp dir=\"LTR\"\u003e0.96\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 142px;\"\u003e\n \u003cp dir=\"LTR\"\u003e0.97\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 140px;\"\u003e\n \u003cp dir=\"LTR\"\u003e0.96\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 143px;\"\u003e\n \u003cp dir=\"LTR\"\u003eRandom-Forest\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n\u003c/div\u003e\n\u003cp\u003e\u003cbr\u003e\u003c/p\u003e"},{"header":"5. Conclusion","content":"\u003cp\u003eCurrently, users need to quickly review many Twitter posts to find topics relevant to their interests. Analyzing such vast amounts of data is easier given a subset of words (keywords) that captures the main features, concepts, and subjects of the documents. Hashtags, which have been widely used since 2007, are used to mark keywords in tweets, to categorize messages, and to initiate conversations on Twitter, primarily helping to attract attention to user posts and encourage interaction with them. For keyword ranking, a graph-based statistical approach was adopted. Text graph construction using WordNet involves node allocation and edge creation between nodes with semantic proximity to keywords. 
Not all keywords are equally important in determining the sentiment of opinions. We propose to determine the weight of each keyword from several influential parameters, such as position, centrality, and neighbor importance, and then to rank the nodes with the PageRank algorithm. Semantic similarity between keywords and hashtags was computed using widely used measures such as WordNet and cosine similarity. Additionally, since the Pearson correlation score measures how closely two data sets follow a straight line, we also used it to calculate the level of similarity. One of the significant achievements of this research was the high classification accuracy, which reached 96% with Random Forest. The 70% accuracy obtained with the Term Frequency-Inverse Document Frequency method indicated that this method alone is not a good criterion for classification. Therefore, two other machine learning methods, Support Vector Classifier and Random Forest, were used for classifier training. Although the accuracy of SVC is lower than that of Random Forest (73% and 96%, respectively), overall, our proposed approach for extracting, correlating, and classifying tweets related to a specific hashtag in Twitter streams provided acceptable results. The hashtag validation approach applied in this research can be useful for recommending impactful hashtags. However, several challenges and limitations in this process may significantly affect the final results and performance. First, only one hashtag was selected from the extensive dataset, which may reduce the diversity of topics and limit the results to a specific field. Additionally, the tweet labeling process (based on the presence or absence of hashtags) requires high accuracy and thorough examination for the results to be reliable. 
Another issue is the large volume of the dataset: it can support more accurate results, but it increases computational time and requires effective management. In general, the research results show that the classification algorithms performed well. However, careful selection and tuning of parameters is still needed to improve algorithm performance, accuracy, and efficiency. Despite these challenges, this research demonstrates that semantic graphs and classification algorithms can improve the process of extracting keywords from tweets. Future research should pay particular attention to addressing these limitations and enhancing algorithm performance.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eData availability\u003c/strong\u003e The majority of the data are presented in this manuscript. Additional raw data may be made available upon request.\u003cbr\u003e\u003cstrong\u003eEthical approval\u003c/strong\u003e This article does not contain any studies with human participants or animals performed by any of the authors.\u003cbr\u003e \u003cstrong\u003eConsent to participate\u003c/strong\u003e Not applicable.\u003cbr\u003e \u003cstrong\u003eConsent for publication\u003c/strong\u003e Not applicable.\u003cbr\u003e \u003cstrong\u003eConflict of interest\u003c/strong\u003e The authors declare no competing interests.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFunding Statement\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors declare that no funds, grants, or other financial support were received during the preparation of this manuscript.\u003c/p\u003e\n\u003cp\u003e---------------------------------------------------\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthor Contributions\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eFatemeh L. reviewed the main text and developed the model. Hassan S. 
performed critical review and revisions of the manuscript. Meraj H., as the corresponding author, carried out the editing and coordinated the submission process. Hassan A. was responsible for data acquisition. All authors critically reviewed the manuscript, approved the final version for submission, and agree to be accountable for all aspects of the work.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eFigueiredo F, Jorge A (2019) Identifying topic relevant hashtags in Twitter streams. Inf Sci 512:1\u0026ndash;16. https://doi.org/10.1016/j.ins.2019.07.062\u003c/li\u003e\n\u003cli\u003eDevika R, Subramaniyaswamy V (2019) A semantic graph-based keyword extraction model using the ranking method on big social data. Wirel Netw 27(8):5447\u0026ndash;5459. https://doi.org/10.1007/s11276-019-02128-x\u003c/li\u003e\n\u003cli\u003eShin H, Lee HJ, Cho S (2022) General-use unsupervised keyword extraction model for text analysis. SSRN Electron J. https://doi.org/10.2139/ssrn.4201176\u003c/li\u003e\n\u003cli\u003eBordoloi M, Biswas SK (2020) Graph-based sentiment analysis using keyword rank-based polarity assignment. Multimed Tools Appl 79(47):36033\u0026ndash;36062. https://doi.org/10.1007/s11042-020-09289-4\u003c/li\u003e\n\u003cli\u003ePan R (2023) Automatic keyword extraction algorithm for Chinese text based on word clustering. ACM Trans Asian Low-Resour Lang Inf Process. https://doi.org/10.1145/3592793\u003c/li\u003e\n\u003cli\u003eAlShammari AF (2023) Implementing keyword extraction using term frequency-inverse document frequency (TF-IDF) in Python. Int J Comput Appl 185(35):9\u0026ndash;14. https://doi.org/10.5120/ijca2023923137\u003c/li\u003e\n\u003cli\u003eNing X, Zhang H, Zhao X, Liu Y (2023) Retracted: TextRank keyword extraction algorithm using word vector clustering based on rough data-deduction. Comput Intell Neurosci 2023:9861397. 
https://doi.org/10.1155/2023/9861397\u003c/li\u003e\n\u003cli\u003eMa J (2022) Research on keyword extraction algorithm in English text based on cluster analysis. Comput Intell Neurosci 2022:4293102. https://doi.org/10.1155/2022/4293102\u003c/li\u003e\n\u003cli\u003eKelebercov\u0026aacute; L, Munk M (2022) Search queries related to COVID-19 based on keyword extraction. Procedia Comput Sci 207:2618\u0026ndash;2627. https://doi.org/10.1016/j.procs.2022.09.320\u003c/li\u003e\n\u003cli\u003eQiu Q, Xie Z, Wang B (2021) GKEEP: An enhanced graph-based keyword extractor with error-feedback propagation for geoscience reports. Earth Space Sci 8(2). https://doi.org/10.1029/2020EA001602\u003c/li\u003e\n\u003cli\u003eHuang Z, Xie Z (2022) A patent keywords extraction method using TextRank model with prior public knowledge. Complex Intell Syst 8(1):1. https://doi.org/10.1007/s40747-021-00343-8\u003c/li\u003e\n\u003cli\u003eNi P, Li Y, Chang V (2020) Research on text classification based on automatically extracted keywords. Int J Enterp Inf Syst 16(4):1\u0026ndash;16. https://doi.org/10.4018/IJEIS.2020100101\u003c/li\u003e\n\u003cli\u003eBest KB, Gilligan JM, Baroud H, Carrico AR, Donato KM, Ackerly BA, Mallick B (2021) Random forest analysis of two household surveys can identify important predictors of migration in Bangladesh. J Comput Soc Sci 4(1):77\u0026ndash;100. https://doi.org/10.1007/s42001-020-00066-9\u003c/li\u003e\n\u003cli\u003eShi B, Poghosyan G, Ifrim G, Hurley N (2018) Hashtagger+: Efficient high-coverage social tagging of streaming news. IEEE Trans Knowl Data Eng 30(1):43\u0026ndash;58. https://doi.org/10.1109/TKDE.2017.2754253\u003c/li\u003e\n\u003cli\u003eShah SHH, Lei S, Ali M, Doronin D, Hussain ST (2020) Prosumption: bibliometric analysis using HistCite and VOSviewer. Kybernetes 49(3):1020\u0026ndash;1045. 
https://doi.org/10.1108/K-12-2018-0696\u003c/li\u003e\n\u003cli\u003eGraph-based keyword extraction for Twitter data (2022) In: Emerging research in computing, information, communication and applications. pp 863\u0026ndash;871. https://doi.org/10.1007/978-981-16-1342-5_68\u003c/li\u003e\n\u003cli\u003eKumar N, Baskaran E, Konjengbam A, Singh M (2021) Hashtag recommendation for short social media texts using word embeddings and external knowledge. Knowl Inf Syst 63(11). https://doi.org/10.1007/s10115-020-01515-7\u003c/li\u003e\n\u003cli\u003eCabezas J, Moctezuma D, Isabel A, Mart\u0026iacute;n de Diego I (2021) Detecting emotional evolution on Twitter during the COVID-19 pandemic using text analysis. Int J Environ Res Public Health 18(13):6981. https://doi.org/10.3390/ijerph18136981\u003c/li\u003e\n\u003cli\u003eChakrabarti P, Malvi E, Bansal S, Kumar N (2023) Hashtag recommendation for enhancing the popularity of social media posts. Soc Netw Anal Min 13(1). https://doi.org/10.1007/s13278-023-01024-9\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"scientific-reports","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"scirep","sideBox":"Learn more about [Scientific Reports](http://www.nature.com/srep/)","snPcode":"","submissionUrl":"","title":"Scientific Reports","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"Scientific 
Reports","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"Semantic graph, Keyword extraction, Hashtag validation, Classification, Twitter, Covid vaccine","lastPublishedDoi":"10.21203/rs.3.rs-7832005/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-7832005/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eThe relationship between hashtags and keywords in content generated on social media platforms is considered essential and fundamental. Retrieving information related to a specific topic on Twitter and categorizing them encounters difficulties. Hashtags act as approximate indicators of tweet topics, but due to their ambiguity and flexibility in usage, challenges still exist in searching for content related to a specific topic. Extracting keywords from Twitter is a vital step in displaying the main content of a post or a set of posts. These keywords usually have the best correlation with the textual content. Correctly extracting these keywords can provide the ability to analyze the text's topic and make critical decisions comprehensively. Therefore, research on extracting relationships between hashtags and keywords is of significant importance. It has been transformed into a necessity due to its fundamental role in improving the search and categorization of content on social media platforms. Consequently, this study evaluated the semantic relationship between keyword sets of a Twitter dataset regarding the coronavirus disease and the embedded hashtags in tweets. The dataset tweets amounted to 364,964 in English, each containing up to 280 characters without any images. Also, hashtag validation has been explicitly focused on the coronavirus vaccine; hence, we assume this topic is identified only with a specific hashtag. In this regard, a novel method is introduced, which utilizes semantic graph visualization and ranking techniques for keyword extraction. 
The proposed model involves the construction of a semantic graph, followed by the application of centrality measures to assign weights to its nodes. Subsequently, the similarity between keywords and hashtags was evaluated using three methods. Finally, two machine learning algorithms were implemented to distinguish between relevant and irrelevant tweets with the hashtag. The results of the two classification algorithms, with 73% and 96% accuracy, respectively, indicate that this approach can effectively validate the relationship between keywords and hashtags.\u003c/p\u003e","manuscriptTitle":"Presenting a new method for identifying and extracting keywords on Twitter related to Covid-19","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-11-06 16:51:12","doi":"10.21203/rs.3.rs-7832005/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Revision requested","date":"2026-01-08T06:17:57+00:00","index":"","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-12-19T23:40:39+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"120545203348994366394578384530027531767","date":"2025-12-19T23:36:56+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-12-12T14:49:13+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"180397964042122123743424775952945086232","date":"2025-11-22T11:57:55+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"193117007494436270702966207574561281486","date":"2025-11-06T05:53:47+00:00","index":"hide","fulltext":""},{"type":"reviewersInvited","content":"","date":"2025-10-27T15:34:36+00:00","index":"","fulltext":""},{"type":"editorInvited","content":"","date":"2025-10-24T13:55:13+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2025-10-15T07:29:07+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2025-10-15T07:28:54+00:00","i
ndex":"","fulltext":""},{"type":"submitted","content":"Scientific Reports","date":"2025-10-11T05:37:30+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"scientific-reports","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"scirep","sideBox":"Learn more about [Scientific Reports](http://www.nature.com/srep/)","snPcode":"","submissionUrl":"","title":"Scientific Reports","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"Scientific Reports","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"7091c31a-c79e-4346-90e4-6f6afd3f3f43","owner":[],"postedDate":"November 6th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"in-revision","subjectAreas":[{"id":57513845,"name":"Biological sciences/Computational biology and bioinformatics"},{"id":57513846,"name":"Physical sciences/Mathematics and computing"}],"tags":[],"updatedAt":"2026-01-08T06:24:28+00:00","versionOfRecord":[],"versionCreatedAt":"2025-11-06 16:51:12","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-7832005","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-7832005","identity":"rs-7832005","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}

