Effect of dimension size and window size on word embedding in classification tasks

Dávid Držík, Jozef Kapusta

Research Square. This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-4532901/v1
This work is licensed under a CC BY 4.0 License.
Status: Posted. Version 1 posted.

Abstract

In natural language processing, there are several approaches to transforming text into multi-dimensional word vectors, such as TF-IDF (term frequency - inverse document frequency), Word2Vec, and GloVe (Global Vectors), which are widely used to this day. In Word2Vec and GloVe models, the meaning of a word is represented by its context. Syntactic or semantic relationships between words are preserved, and the vector distances between individual words correspond to human perception of the relationship between words. Word2Vec and GloVe generate a vector for each word, which can be further utilized.
Unlike with GPT, ELMo, or BERT, the trained model itself is not needed for further text processing. It is important to know how to set the size of the context window and the dimension size for Word2Vec and GloVe models, as an improper combination of these parameters can lead to low-quality word vectors. In our article, we experimented with these parameters. The results show that it is necessary to choose an appropriate window size based on the embedding method used. In terms of dimension size, according to our results, dimensions smaller than 50 are no longer suitable. On the other hand, with dimensions larger than 150, the results did not significantly improve.

Keywords: Spam Classification, Text Mining, Natural Language Processing, Word Embeddings, Dimension Size, Context Window Size

1 Introduction

Nowadays, artificial intelligence algorithms are used in almost every area of life. Most of these methods work with data in numerical form. The same applies to the field of natural language processing, whose main task is to understand human language as humans understand it. Machine learning algorithms cannot work directly with words or sentences of text, so there are several approaches to transforming individual words into numerical form, most commonly into multidimensional word vectors. Several techniques are used for this transformation, from the most trivial, such as one-hot encoding or TF-IDF (term frequency - inverse document frequency) [1], to Word2Vec [2] and GloVe (Global Vectors) [3], which are still widely used today. The word vectors created by these methods are simply words mapped into a multidimensional space of continuous floating-point numbers that represent the meaning of the word [4], [5], [6].
The meaning of a word in the Word2Vec model is represented by its context, i.e., the words that come before and after it. As a result, this model creates one multidimensional vector for each word. The syntactic or semantic relations between words are preserved, and the vector distances of individual words correspond to the human conception of the relation between words [7], [8]. In addition to GloVe and Word2Vec, newer methods such as BERT, GPT, and the like are currently being utilized. GloVe and Word2Vec provide context-independent word embeddings: they generate a single vector (embedding) for each word, amalgamating all its various senses into one representation. This numeric representation remains the same regardless of the word's position within a sentence or its multiple meanings. Conversely, GPT, ELMo, and BERT are capable of producing different embeddings for the same word that capture its contextual nuances, including its syntactic and semantic context within a sentence. This difference has practical implications: Word2Vec and GloVe vectors, trained on extensive corpora, can be used directly in downstream tasks without the original model used for training, whereas GPT, ELMo, and BERT require access to the model after training, because they rely on context to generate word vectors. Consequently, for many projects, Word2Vec and GloVe may prove more advantageous due to their context-independent nature.

It is important to know how to set the context window size and dimension size parameters for Word2Vec and GloVe models, because the wrong combination of these parameters can result in low-quality word vectors. Adewumi et al. [9] tried to determine the correct setting of these parameters. They empirically investigated dimension sizes from 100 to 3000 and window sizes of 4 and 8.
By evaluating the results, they showed that the best model is usually task-specific, and that a high analogy score does not necessarily correlate positively with the F1 score. They further found that increasing the vector dimension too much (above 400) resulted in poor-quality word vectors. In addition, they report that, using a small corpus, they obtained a better WordSim score (with the corresponding Spearman correlation) than the original model by Mikolov et al. [10], which was trained on a corpus of 100 billion words. Yang et al. [11] addressed the classification of English-language Twitter posts, deciding whether a given post was related to the topic of elections or not. They used a convolutional neural network model and, comparing the results, found that the corpus used to create the vectors should be similar to the corpus used for classification. They achieved their best classification result, an F1 score of 80.8%, with the dimension set to 800 and the window size set to 10. Nazir et al. [12] formed vectors with different dimensions and window sizes, focusing on Urdu word embeddings. They created their best model using a dimension size of 500 and a window size of 5.

In our article, we want to verify the findings achieved by these authors. Like them, we want to experiment with the window size and the dimension size. The aim of the paper is to find the most suitable values of the window size and the dimension size for the calculation of predictive word models. To verify the suitability of these parameters, we proceeded according to the following methodology (Fig. 1):

1. Pre-processing of the input text corpus: The input text corpus underwent preprocessing procedures.
2. Creation of word embedding models: A total of 225 word models were generated, resulting from the combination of three embedding methods, three window sizes, and 25 dimensions.
3. Generation of input vectors for classification task: Each word model was used to generate input vectors for the classification task, derived from the input texts of the dataset.
4. Creation of classifiers: Input vectors of words were used to develop classifiers. Specifically, 225 classifiers were created, one for each set of input vectors generated by the word models.
5. Performance evaluation: Performance metrics including accuracy, precision, recall, and F1 score were computed for each classifier on the test set. The time taken for word vector creation was also recorded for result analysis.
6. Results assessment: The outcomes were systematically evaluated.

There are several approaches to comparing the success of a proposed method. Nazir et al. [12] evaluate success using the wordsim-353 and SimLex-999 datasets, which contain similarity scores for selected word pairs. Calculating the similarity between two words is often presented in educational examples focused on word vectors. However, from a practical point of view, word vectors are used in a variety of practical tasks, and classification is one of them. For this reason, we verify the suitability of the parameter settings by using the trained models to create vectors for a classification task. We then evaluate the performance measures of the classification itself, i.e., we do not evaluate the word vectors directly, but use them to solve the classification task and evaluate the success of the classification.

The structure of the paper is as follows. The current state of research on the influence of the window size and dimension size parameters is summarized in the second part.
The spam datasets used in the research, as well as the related text pre-processing techniques and text vectorization models, are described in the third section. The most important results are summarized in the fourth section. Discussion and conclusions form the last part of the paper.

2 Related works

A model for creating multidimensional word vectors called Word2vec was proposed by Mikolov et al. in 2013 [7], [13]. The Word2vec model is based on the idea that the meaning of a word is represented by its context, i.e., the words that come before and after it. As a result, this model creates one multidimensional vector for each word. The syntactic or semantic relations between words are preserved, and the vector distances of individual words correspond to the human idea of the relation between words. In this model, the context window size parameter is very important. Levy and Goldberg [14] claim that the larger the window size, the more the model tends to capture semantic information, and conversely, the smaller the window size, the more syntactic information is captured in the vectors.

Another very important parameter is the dimension of the word vectors. Small dimensions cannot capture some significant relationships between words, while large dimensions allow the construction of a more complete vector that captures information between words. However, dimensions that are too high cause the resulting word vectors to contain redundant information without added value. In general, the most commonly used dimension size for word vectors is between 50 and 500 [7], [15]. The creators of the Word2vec model describe two basic architectures: CBOW (Continuous Bag-of-Words) and Skip-gram.
Both of these architectures consist of an artificial neural network that contains an input layer (where input words are one-hot encoded), one hidden layer (also called an embedding layer), and an output layer. After training, the result is not the output of the output layer of the neural network but the weights of the hidden layer. These weights represent the multidimensional word vectors [4], [7], [8], [15], [16]. The difference between these architectures is that in CBOW, a specific word is predicted based on its context words. In general, the CBOW architecture performs slightly better than Skip-gram for very frequent words. Unlike CBOW, Skip-gram learns to predict context words from a given word. If two words (one occurring rarely and the other occurring more often) are placed next to each other, they will be treated very similarly. Otherwise, each word will be treated as both a target and a context word [4], [7], [8], [13].

Another well-known model, similar to the previous one, is the GloVe (Global Vectors) model. This model also captures syntactic or semantic relations between words when creating word vectors. However, it works on a slightly different principle than Word2Vec, because the GloVe model is based on matrix factorization of the word-context matrix. First, a large matrix of co-occurrence information (words × contexts) is created, i.e., for each word it is counted how often this word appears in some context in a large corpus. Then the matrix is factorized to obtain a lower-dimensional matrix [17], [18], [19]. The goal of training word vectors with the GloVe model is to learn word vectors in such a way that the scalar product of the represented words is equal to the logarithm of the probability of their co-occurrence.
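In symbols, this training goal can be sketched as follows. The notation is an assumption borrowed from the usual presentation of GloVe rather than taken from this paper: $w_i$ is the vector of word $i$, $\tilde{w}_k$ is the vector of context word $k$, $P_{ik}$ is the probability that $k$ appears in the context of $i$, and bias terms are omitted for simplicity.

```latex
w_i^{\top}\tilde{w}_k = \log P_{ik}
\quad\Longrightarrow\quad
(w_i - w_j)^{\top}\tilde{w}_k
  = \log P_{ik} - \log P_{jk}
  = \log\frac{P_{ik}}{P_{jk}}
```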
Given that the logarithm of a ratio is equal to the difference of the logarithms, the differences between vector representations are associated with the ratios of the co-occurrence probabilities. The resulting word vectors thus perform very well on word analogy tasks, as the model can capture a variety of linguistic patterns [17].

The motivation for our research was mainly two studies [11], [20], which we already mentioned in the Introduction section. These studies investigated the influence of the window size and dimension size parameters of word vectors. We add one more study, by Chugh et al. [21], who investigated the stability of the Word2vec model, i.e., how consistent the word vectors are with respect to word frequency in the corpus and the dimension size parameter. They examined three different corpora, selecting exactly the 10,000 most frequent words from each corpus during pre-processing. From these 10,000 words, they selected 10 words each from the upper (0–10), middle (4995–5005) and lower (9990–10,000) frequency ranks. The dimension size used as a model parameter ranged from 1 to 377, with the individual values being elements of the Fibonacci sequence. At each run of the word vector training algorithm, they kept the 10 nearest neighbor words in the vector space for the words labelled as high-, medium-, and low-frequency. To calculate the stability of the model, they used the Jaccard similarity coefficient between pairs of sets of neighbors of all words. By comparing the obtained results, they found that the dimension size of the word vectors has a significant effect on the consistency of the model and depends on the corpus.
In addition, they found that high-frequency words show a higher degree of stability than medium- and low-frequency words, but since any English corpus will always contain low- and medium-frequency words, the overall stability of the model will always be reduced by these words [21].

Nazir et al. [12] focused on creating Urdu word embeddings. They formed vectors with different dimensions and window sizes. During training, the word vectors were updated and the trained model was evaluated with the Spearman coefficient. They used the commonly used vector dimensions of 100, 200, 300, 400, and 500, as well as the standard vector dimensions. They used the word2vec model with window sizes of 3, 5, and 7. To evaluate the word embedding models, they used two benchmark datasets that represent similarities of different English word pairs: SimLex-999 [22] and wordsim-353 [23]. On wordsim-353, models with vector dimensions of 100 and 128 performed better than the rest of the models. On SimLex-999, however, the word embedding model with a vector dimension of 500 and a window size of 5 performed better.

3 Methodology

3.1 Datasets and text data pre-processing

We used two datasets for this research. The first, the Enron Email Dataset, is a freely available text dataset (https://www.kaggle.com/datasets/wcukierski/enron-email-dataset/discussion/116088) containing 500,000 English-language emails sent between 150 users. We used this dataset as the input text corpus to create word vectors. The second, the SMS Spam Collection dataset, is also freely available (https://github.com/mohitgupta-omg/Kaggle-SMS-Spam-Collection-Dataset-/blob/master/spam.csv) and served in the evaluation phase as a validation tool to verify the quality of the generated word vectors on the problem of spam classification [24], [25], [26]. This dataset consisted of a total of 5565 records of correctly labelled short text messages (747 spam and 4819 "ham", non-spam).
Both datasets had to be pre-processed before use. The pre-processing was done in Python and consisted of the following steps: we split the whole text into sentences, converted the characters of the individual sentences to lowercase, removed words that contained special characters, numbers, or whitespace characters, removed the so-called stop words, which occur frequently in the language but carry little meaning on their own, and lemmatized all the remaining words.

3.2 Used word embedding methods

Word2Vec - CBOW

The goal of the CBOW (Continuous Bag of Words) architecture of the Word2Vec method is to predict a target word based on its neighboring or context words within a context window. This architecture consists of a simple feed-forward neural network with three layers. Context words encoded by one-hot encoding enter the input layer, and the hidden layer contains as many neurons as the dimension size we set for the word vectors. The output layer uses a non-linear Softmax activation function to produce a probability for each word in the vocabulary. The CBOW architecture is shown in Fig. 2 along with the Skip-gram architecture. The result of neural network training is not the output of the output layer but the weights (word vectors) of the hidden layer. After training the network, these weights are set so that they capture the meaning of the input words [4], [7], [8], [13].

Word2Vec - Skip-gram

The second architecture is Skip-gram (Fig. 2), and its aim, unlike CBOW, is to predict context words based on the target word within the context window. This architecture also consists of a simple neural network, but its input is only one target word at a time, and the output layer contains as many neurons as the context words we want to predict.
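The difference between the two architectures is easiest to see in the training examples each one draws from a context window. The following is a minimal sketch in plain Python, not the implementation used in the experiments: the helper names and the example sentence are hypothetical, the window is kept fixed, and details of real Word2Vec implementations such as subsampling or randomly shrunk windows are left out.

```python
from typing import List, Tuple


def context_window(tokens: List[str], index: int, window: int) -> List[str]:
    """Return up to `window` tokens on each side of tokens[index]."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right


def cbow_examples(tokens: List[str], window: int) -> List[Tuple[List[str], str]]:
    """CBOW: the context words are the input, the target word is the label."""
    return [(context_window(tokens, i, window), tokens[i]) for i in range(len(tokens))]


def skipgram_examples(tokens: List[str], window: int) -> List[Tuple[str, str]]:
    """Skip-gram: the target word is the input, each context word is a label."""
    return [
        (tokens[i], context)
        for i in range(len(tokens))
        for context in context_window(tokens, i, window)
    ]


# Hypothetical pre-processed sentence (lowercased, lemmatized, stop words removed).
sentence = ["meeting", "move", "friday", "morning", "conference", "room"]

cbow = cbow_examples(sentence, window=2)
skipgram = skipgram_examples(sentence, window=2)

print(cbow[2])       # the context of "friday" and the target it should predict
print(skipgram[:3])  # the first (target, context) pairs
```

With a larger window, each CBOW example carries more context words and Skip-gram produces more pairs per target word, which is one reason training time tends to grow with the window size.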
The output layer uses the Softmax activation function, but in practice Word2Vec uses its algorithmic enhancements, Hierarchical Softmax and Negative Sampling. The disadvantage of the Softmax function is that, to compute the probability of each word, it is necessary to go through all the words in the vocabulary, which is computationally very demanding [4], [7], [8], [13].

GloVe

While the Word2Vec model is a predictive model consisting of a feed-forward neural network, the GloVe (Global Vectors) model is a count-based model. The first step is to create a word co-occurrence matrix, which contains information on how often each word occurs in some context. The matrix is later factorized to reduce its dimension. The goal of the GloVe model is to train word vectors so that their scalar product is equal to the logarithm of the probability of co-occurrence of the words [16], [17], [18].

Using the Gensim library [27], we implemented these three embedding methods in Python by creating the corresponding models. The input to the models was the pre-processed text data and the input parameters, the most important for us being the window size and the dimension size.

3.3 Sequential neural network classifier

We verified the quality of the word vectors trained by the Word2Vec and GloVe models on spam classification using a sequential neural network classifier from the Keras library [28] in Python. The neural network consists of three layers: Embedding, Flatten and Dense. As input parameters to the first Embedding layer, we set the size of the vocabulary, the dimension size of the word vectors, the word vectors themselves, which represent the weights of the hidden layer, and the length of the input sequences. The output Dense layer contains one neuron with a Sigmoid activation function. We used the Flatten layer to connect the Embedding layer with the output Dense layer.
We compiled the network model using the RMSprop optimizer and the BinaryCrossentropy loss function. The pre-processed second dataset contains, in addition to each short message, a label indicating whether it is spam or not. We encoded each message (all words in it) into a sequence of numbers. The Keras library requires these sequences to be vectorized and to have the same length. We trained the network model on the training portion of these sequences and, using basic statistical metrics, evaluated the classification performance on the test data.

3.4 Validation and used evaluation metrics

To assess the performance of our classification model, and thereby of the word vectors supplied to it, we use basic performance evaluation metrics: accuracy, precision, recall and F1 score. Accuracy is the ratio of correct predictions to the total number of samples and is computed as in Eq. (1):

$$acc=\frac{TP+TN}{TP+TN+FP+FN}, \tag{1}$$

where TP represents the number of True Positive results, FP represents the number of False Positive results, TN represents the number of True Negative results, and FN represents the number of False Negative results. To evaluate, analyze and describe the results of our model, we also calculated precision (Eq. (2)), recall (Eq. (3)) and F1 score (Eq. (4)) as follows:

$$precision=\frac{TP}{TP+FP}, \tag{2}$$

$$recall=\frac{TP}{TP+FN}, \tag{3}$$

$$F1\ score=2\cdot\frac{precision\cdot recall}{precision+recall} \tag{4}$$

Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. Recall is the ratio of correctly predicted positive observations to all observations in the actual positive class. F1 score is the harmonic mean of Precision and Recall. Therefore, this score takes both false positives and false negatives into account. We used the Scikit-learn [29] Python library to implement all these evaluation metrics.
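To make Eq. (1) to (4) concrete, the sketch below computes the four metrics directly from the confusion counts. It is written in plain Python so that the formulas stay visible; it is not the Scikit-learn code used in the study, the function name is hypothetical, and the labels are made up purely for illustration (1 = spam, 0 = ham).

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, precision, recall and F1 for binary labels (1 = spam)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam predicted as spam
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # ham predicted as ham
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham predicted as spam
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam predicted as ham

    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Fall back to 0.0 when a denominator is zero (e.g. nothing predicted as spam).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels for illustration only; these are not results from the paper.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this toy example there are 3 true positives, 5 true negatives, 1 false positive and 1 false negative, so accuracy is 0.8 while precision, recall and F1 are all 0.75. On an imbalanced dataset such as a spam collection, accuracy alone can look high even when recall on the spam class is modest, which is why all four metrics are reported.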
4 Results

In the research, we focused on finding the most suitable window size and dimension size for computing the predictive word model Word2Vec and the count-based model GloVe. The basic principle of the Word2Vec model is to train a neural network to predict the target word based on neighboring words from the selected context window, or vice versa, to predict neighboring words based on the target word. However, the trained neural network is not subsequently used for prediction. The main goal of this process is to learn the weights of the projection layer, which are the desired word vectors. It is assumed that the weight vectors, which are trained for prediction, can also successfully represent the meaning of individual words. The goal of the GloVe model is to create word vectors so that their scalar product is equal to the logarithm of the probability of co-occurrence of words, obtained from a previously created and factorized matrix that contains information on how often each word occurs in some context.

We determine the appropriate window size and dimension size for the generation of word vectors by these models by analysing the performance metrics of classifiers built on the resulting input vectors. The classification problem we are solving is spam classification, using the dataset we have already described.

4.1. Window size

We present the results individually for each performance metric. From the results for the accuracy metric (Fig. 3), it is clear that the best values were observed for the Word2Vec - Skip-gram method, which reached the highest median, lower and upper quartile, and maximum values. We achieved the most homogeneous results with the Word2Vec - CBOW method.
For the Word2Vec - Skip-gram method, the highest median value was observed at window size 5 (0.9449), the maximum value at window sizes 5 and 6 (0.9536), the highest upper quartile value at window size 5 (0.9487), and the highest lower quartile value also at window size 5 (0.9391). The graph suggests that the differences in the measured values for the Word2Vec - Skip-gram method across all three window sizes are probably not statistically significant. For this reason, we did not test statistical significance. It is interesting to look at the remaining two methods, Word2Vec - CBOW and GloVe. Both achieved the highest median values for a window size of 7 (0.9410 for Word2Vec - CBOW and 0.9304 for GloVe). The values for the lower and upper quartiles similarly favored a window size of 7.

The same trend can be observed in the results for the precision metric (Fig. 4). The best results were recorded for the Word2Vec - Skip-gram method in all three window sizes. Within this method, the smallest window size, 5 (median = 0.8956), appears to be the most appropriate. Conversely, in the remaining two methods, the trend was opposite, and the median, upper and lower quartiles were highest at window size 7.

In our case, the recall metric determines how many spam messages were identified as spam. The results here (Fig. 5) differ from those for the accuracy and precision metrics. For the Word2Vec - Skip-gram method, the highest median value (0.8664) is still at window size 5, but the differences between the median or upper quartile values are very small. Interestingly, however, the highest median value overall (0.8767) was observed for the Word2Vec - CBOW method at a window size of 7. For completeness, we also present the box plot (Fig. 6) for the F1 score, which represents the harmonic mean of Precision and Recall.
In addition to performance metrics, we also measured the time taken to train the word vectors for all window sizes and dimension sizes (Fig. 7). Time was measured using the time module in Python, and the result (in seconds) was the difference between the time recorded before training and the time recorded after training the word vectors for a particular parameter setting. The lowest times were measured for window size 5 and the highest for window size 7. From the time point of view, Word2Vec - CBOW with window size 5 appears to be the most efficient configuration (median = 9.83). Within the Word2Vec - CBOW method, low median values (10.82 and 10.20) were also observed for the other window sizes, 6 and 7. Compared to Word2Vec - Skip-gram (median = 27.89 for window size 5) and GloVe (median = 32.82 for window size 5), the Word2Vec - CBOW method appears to be time efficient.

4.2. Dimension size

The dimension size parameter was investigated for 25 different dimensions (10, 20, ..., 240, 250), for 3 different window sizes (5, 6, 7) and three embedding methods. In the graphs (Fig. 8, Fig. 9, Fig. 10) we present the measured values of the accuracy metric for the 3 different window sizes. It is clear from the results that the biggest differences, i.e., the most significant increases, were recorded for dimension sizes up to 50. At a dimension size of 50 and more, the accuracy was already higher than 0.92 for almost all methods (except GloVe at dimension sizes 60 and 80). Interestingly, from dimension size 110 upward, the most successful method, Word2Vec - Skip-gram, had the highest accuracy of all methods for every window size. Beyond dimension size 110, the Word2Vec - CBOW method was also among the more successful. In the case of the recall metric, the results (Fig. 11, Fig. 12, Fig. 13) are very similar. Here, with a dimension size of 130 and more, recall values above 0.80 were achieved even for the GloVe method, and above 0.85 for the remaining two methods.
For completeness, we also present the results (Fig. 14, Fig. 15, Fig. 16) for the F1 score metric. These results also make clear that dimensions lower than 50 show significantly lower values. On the other hand, for a dimension above 10, we achieve values greater than 0.82 for the GloVe method and values greater than 0.86 for the remaining two methods.

Discussion

From the results, it is possible to observe appropriate parameter settings for the dimension size and the window size. We investigated the differences between window sizes 5, 6 and 7. For more reliable results it would be necessary to work with more window sizes. However, this calculation is time-consuming, so we chose only three window sizes, selected on the basis of our previous experience with the investigated methods. The results nevertheless indicate that these three sizes were chosen appropriately, because for one method the most suitable size appears to be 5 and for the remaining two it appears to be 7. The results therefore show that the appropriate window size must be chosen based on the embedding method used. On the other hand, looking at the results for each method separately, the difference between window sizes was not very large. For the Word2Vec - Skip-gram method, the best results were observed for a window size of 5, and for the Word2Vec - CBOW and GloVe methods for a window size of 7.

From the perspective of dimension size, based on our results, dimensions smaller than 50 are no longer suitable. On the other hand, for dimensions larger than 150, the results did not improve significantly.

The motivation for our research was the papers [20] and [11]. Several results of these works were also confirmed in our research. We agree with Adewumi et al. [9] that the best model is usually task-specific.
In terms of dimension size, the authors report that increasing the vector dimension too much (above 400) resulted in poor-quality word vectors. Our results show that a dimension of 150 is fully sufficient. Yang et al. [11] used a convolutional neural network model, achieving a similar classification success rate of approximately 0.81. Their best classification result was achieved by setting the dimension to 800 and the window size to 10. According to our findings, these values are too high. In our case, we already achieved acceptable accuracy with a dimension of 150 and a window size of 7.

It is worth noting that there are two perspectives when it comes to evaluating the created models: intrinsic and extrinsic evaluation. Intrinsic evaluation (e.g., computing the similarity of word pairs) is justified when Word2Vec and GloVe models are used for identifying semantically similar words. In our article, we focused on extrinsic evaluation. We believe that in practical applications, these models will mainly be used for preparing word vectors for classification tasks. For this reason, we evaluated the Word2Vec and GloVe parameter settings by the performance measures they achieve in a classification task.

Conclusion

In this research, we focused on the empirical investigation of the context window size and dimension size input parameters for Word2Vec and GloVe word embedding models. We created 3 types of models (W2V CBOW, W2V Skip-gram and GloVe), with different window size settings (5, 6, 7) and different dimension size settings, for a total of 225 (3 x 3 x 25) models. With each of these models, we obtained word vectors based on the same pre-processed input text corpus. To determine which parameter settings were better than others, we used a neural network classifier along with a spam classification dataset.
Using classification performance metrics, we analyzed the results through graphs and found that the context window size must be chosen according to the embedding method used. We achieved the best results with the W2V Skip-gram method for a window size of 5. Our examination of the dimension of the word vectors indicates that dimensions smaller than 50 are not suitable for our classification task, and that dimensions larger than 150 brought little further improvement. Like other authors, we assume that the parameter setting depends on the specific classification task. In the future, we want to investigate how a suitable setting of the studied parameters depends on the size of the input training corpus and on the dataset used for the classification problem.

Declarations

Data availability. The datasets and codes used and/or analyzed during the current study are available from the corresponding author (Jozef Kapusta; Email: [email protected] ) upon reasonable request.

Acknowledgments. This work was supported by the Scientific Grant Agency of the Ministry of Education of the Slovak Republic and Slovak Academy of Sciences under Contract VEGA-1/0821/21, also by the Slovak Research and Development Agency under the contract no. APVV-18-0473.

References

[1] M. Liang and T. Niu, “Research on Text Classification Techniques Based on Improved TF-IDF Algorithm and LSTM Inputs,” Procedia Comput Sci, vol. 208, pp. 460–470, 2022, doi: 10.1016/j.procs.2022.10.064.
[2] A. Sharma and S. Kumar, “Ontology-based semantic retrieval of documents using Word2vec model,” Data Knowl Eng, vol. 144, p. 102110, Mar. 2023, doi: 10.1016/j.datak.2022.102110.
[3] N. Badri, F. Kboubi, and A. H. Chaibi, “Combining FastText and Glove Word Embedding for Offensive and Hate speech Text Detection,” Procedia Comput Sci, vol. 207, pp. 769–778, 2022, doi: 10.1016/j.procs.2022.09.132.
[4] E. M. Dharma, F. Lumban Gaol, H. Leslie, H. S. Warnars, and B. Soewito, “THE ACCURACY COMPARISON AMONG WORD2VEC, GLOVE, AND FASTTEXT TOWARDS CONVOLUTION NEURAL NETWORK (CNN) TEXT CLASSIFICATION,” J Theor Appl Inf Technol, vol. 31, no. 2, 2022, [Online]. Available: www.jatit.org
[5] J. M. Wyatt, G. J. Booth, and A. H. Goldman, “Natural Language Processing and Its Use in Orthopaedic Research,” Curr Rev Musculoskelet Med, vol. 14, no. 6, pp. 392–396, Dec. 2021, doi: 10.1007/s12178-021-09734-3.
[6] D. Khurana, A. Koli, K. Khatter, and S. Singh, “Natural language processing: state of the art, current trends and challenges,” Multimed Tools Appl, vol. 82, no. 3, pp. 3713–3744, Jan. 2023, doi: 10.1007/s11042-022-13428-4.
[7] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient Estimation of Word Representations in Vector Space,” Jan. 2013.
[8] H. D. Abubakar and M. Umar, “Sentiment Classification: Review of Text Vectorization Methods: Bag of Words, Tf-Idf, Word2vec and Doc2vec,” SLU Journal of Science and Technology, vol. 4, no. 1 & 2, pp. 27–33, Aug. 2022, doi: 10.56471/slujst.v4i.266.
[9] T. P. Adewumi, F. Liwicki, and M. Liwicki, “Word2Vec: Optimal Hyper-Parameters and Their Impact on NLP Downstream Tasks,” Mar. 2020.
[10] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, “Distributed Representations of Words and Phrases and their Compositionality,” Oct. 2013.
[11] X. Yang, C. Macdonald, and I. Ounis, “Using word embeddings in Twitter election classification,” Information Retrieval Journal, vol. 21, no. 2–3, pp. 183–207, Jun. 2018, doi: 10.1007/s10791-017-9319-5.
[12] S. Nazir, M. Asif, S. A. Sahi, S. Ahmad, Y. Y. Ghadi, and M. H. Aziz, “Toward the Development of Large-Scale Word Embedding for Low-Resourced Language,” IEEE Access, vol. 10, pp. 54091–54097, 2022, doi: 10.1109/ACCESS.2022.3173259.
[13] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, “Distributed Representations of Words and Phrases and their Compositionality,” Oct. 2013.
[14] O. Levy and Y. Goldberg, “Dependency-Based Word Embeddings,” in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Stroudsburg, PA, USA: Association for Computational Linguistics, 2014, pp. 302–308. doi: 10.3115/v1/P14-2050.
[15] Y. Goldberg and O. Levy, “word2vec Explained: deriving Mikolov et al.’s negative-sampling word-embedding method,” Feb. 2014.
[16] Md. A. H. Wadud, M. F. Mridha, and M. M. Rahman, “Word Embedding Methods for Word Representation in Deep Learning for Natural Language Processing,” Iraqi Journal of Science, pp. 1349–1361, Mar. 2022, doi: 10.24996/ijs.2022.63.3.37.
[17] J. Pennington, R. Socher, and C. Manning, “Glove: Global Vectors for Word Representation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Stroudsburg, PA, USA: Association for Computational Linguistics, 2014, pp. 1532–1543. doi: 10.3115/v1/D14-1162.
[18] G. Muñetón-Santa, D. Escobar-Grisales, F. O. López-Pabón, P. A. Pérez-Toro, and J. R. Orozco-Arroyave, “Classification of Poverty Condition Using Natural Language Processing,” Soc Indic Res, vol. 162, no. 3, pp. 1413–1435, Aug. 2022, doi: 10.1007/s11205-022-02883-z.
[19] J. Kapusta, M. Drlik, and M. Munk, “Using of n-grams from morphological tags for fake news classification,” PeerJ Comput Sci, vol. 7, p. e624, Jul. 2021, doi: 10.7717/peerj-cs.624.
[20] T. P. Adewumi, F. Liwicki, and M. Liwicki, “Word2Vec: Optimal Hyper-Parameters and Their Impact on NLP Downstream Tasks,” Mar. 2020.
[21] M. Chugh, P. A. Whigham, and G. Dick, “Stability of Word Embeddings Using Word2Vec,” 2018, pp. 812–818. doi: 10.1007/978-3-030-03991-2_73.
[22] F. Hill, R. Reichart, and A. Korhonen, “SimLex-999: Evaluating Semantic Models With (Genuine) Similarity Estimation,” Computational Linguistics, vol. 41, no. 4, pp. 665–695, Dec. 2015, doi: 10.1162/COLI_a_00237.
[23] E. Agirre, E. Alfonseca, K. Hall, J. Kravalova, M. Paşca, and A. Soroa, “A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches,” in Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Boulder, Colorado: Association for Computational Linguistics, Jun. 2009, pp. 19–27. [Online]. Available: https://aclanthology.org/N09-1003
[24] O. Abayomi-Alli, S. Misra, A. Abayomi-Alli, and M. Odusami, “A review of soft techniques for SMS spam classification: Methods, approaches and applications,” Eng Appl Artif Intell, vol. 86, pp. 197–212, Nov. 2019, doi: 10.1016/j.engappai.2019.08.024.
[25] G. Waja, G. Patil, C. Mehta, and S. Patil, “How AI Can be Used for Governance of Messaging Services: A Study on Spam Classification Leveraging Multi-Channel Convolutional Neural Network,” International Journal of Information Management Data Insights, vol. 3, no. 1, p. 100147, Apr. 2023, doi: 10.1016/j.jjimei.2022.100147.
[26] S. Dutta, A. K. Das, S. Ghosh, and D. Samanta, “Attribute selection to improve spam classification,” in Data Analytics for Social Microblogging Platforms, Elsevier, 2023, pp. 95–127. doi: 10.1016/B978-0-32-391785-8.00016-0.
[27] R. Řehůřek and P. Sojka, “Software Framework for Topic Modelling with Large Corpora,” in Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, ELRA, Feb. 2010, pp. 45–50.
[28] F. Chollet, “Keras.” Accessed: Dec. 28, 2022. [Online]. Available: https://keras.io
[29] F. Pedregosa et al., “Scikit-learn: Machine Learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.

Additional Declarations

No competing interests reported.

Supplementary Files: results.zip
As a division of Research Square Company, we’re committed to making research communication faster, fairer, and more useful. We do this by developing innovative software and high quality services for the global research community. Our growing team is made up of researchers and industry professionals working together to solve the most critical problems facing scientific publishing. Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-4532901","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Article","associatedPublications":[],"authors":[{"id":318650512,"identity":"3355de97-0369-47ae-b771-9501c452e9d7","order_by":0,"name":"Dávid Držík","email":"","orcid":"","institution":"Constantine the Philosopher University in Nitra","correspondingAuthor":false,"prefix":"","firstName":"Dávid","middleName":"","lastName":"Držík","suffix":""},{"id":318650513,"identity":"069241b3-464d-4767-a1e3-d030d99a0af6","order_by":1,"name":"Jozef Kapusta","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAAyElEQVRIiWNgGAWjYBACxh4og5+BjeEDA0MC0VoMGCQb2BhnEKWFgQeqxeAAsVqYew4/e8zz54+88Y20xAaGP2lEOKy3zdyYt83AcNuNtIMNjG05RGjpZzCT5m0wYNx2I739AWNDBTFa2L9J8/wxsN88I70R6DBitPT2mEnzsBkkbpAAOoyBjRiH9ZwpN5zbZpw848yzxIbENiK8b9iTvu3Bmz9ytv3taYYNH/4kE6EF6BgEL4GwBgYGeQZkLaNgFIyCUTAKsAEA/c07D5Bb8BEAAAAASUVORK5CYII=","orcid":"","institution":"Constantine the Philosopher University in Nitra","correspondingAuthor":true,"prefix":"","firstName":"Jozef","middleName":"","lastName":"Kapusta","suffix":""}],"badges":[],"createdAt":"2024-06-05 
09:23:08","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-4532901/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-4532901/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":59873618,"identity":"d543ea27-d2e4-4623-b256-f422e21fcd0d","added_by":"auto","created_at":"2024-07-08 17:28:50","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":386682,"visible":true,"origin":"","legend":"\u003cp\u003eMethodological steps in the diagram\u003c/p\u003e","description":"","filename":"floatimage1.png","url":"https://assets-eu.researchsquare.com/files/rs-4532901/v1/0bbc71fa09d1e490904bbb1a.png"},{"id":59872348,"identity":"07bcacaa-e741-49ac-b905-69a1393a2735","added_by":"auto","created_at":"2024-07-08 17:12:50","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":27410,"visible":true,"origin":"","legend":"\u003cp\u003eCBOW and Skip-gram architectures\u003c/p\u003e","description":"","filename":"floatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-4532901/v1/e36ffc3e429cba896a0c4b46.png"},{"id":59872350,"identity":"e50a2162-c079-4e08-a1b7-361aa85065d3","added_by":"auto","created_at":"2024-07-08 17:12:50","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":82758,"visible":true,"origin":"","legend":"\u003cp\u003eBoxplot of Accuracy metric for window sizes\u003c/p\u003e","description":"","filename":"floatimage3.png","url":"https://assets-eu.researchsquare.com/files/rs-4532901/v1/dc977c8d1a0ac737f075474d.png"},{"id":59873285,"identity":"da54715e-24bb-4e4c-962d-dd9a5f8f8db7","added_by":"auto","created_at":"2024-07-08 17:20:50","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":77856,"visible":true,"origin":"","legend":"\u003cp\u003eBoxplot of Precision metric for window 
sizes\u003c/p\u003e","description":"","filename":"floatimage4.png","url":"https://assets-eu.researchsquare.com/files/rs-4532901/v1/8b9012470aae9c450cd0f769.png"},{"id":59873288,"identity":"e3867297-99e2-4595-8f98-bf50287c8783","added_by":"auto","created_at":"2024-07-08 17:20:50","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":78636,"visible":true,"origin":"","legend":"\u003cp\u003eBoxplot of Recall metric for window sizes\u003c/p\u003e","description":"","filename":"floatimage5.png","url":"https://assets-eu.researchsquare.com/files/rs-4532901/v1/d3773a39ba77fc9950f7c6d3.png"},{"id":59872354,"identity":"09462cbf-eec7-43ed-90ae-8c42f7fe18d8","added_by":"auto","created_at":"2024-07-08 17:12:50","extension":"png","order_by":6,"title":"Figure 6","display":"","copyAsset":false,"role":"figure","size":76506,"visible":true,"origin":"","legend":"\u003cp\u003eBoxplot of F1-score metric for window sizes\u003c/p\u003e","description":"","filename":"floatimage6.png","url":"https://assets-eu.researchsquare.com/files/rs-4532901/v1/e9e36432e914296d690fe7fd.png"},{"id":59872362,"identity":"12b920f6-9416-48a2-ba2f-e0a088896c34","added_by":"auto","created_at":"2024-07-08 17:12:51","extension":"png","order_by":7,"title":"Figure 7","display":"","copyAsset":false,"role":"figure","size":69896,"visible":true,"origin":"","legend":"\u003cp\u003eBoxplot of word vector training time for window sizes\u003c/p\u003e","description":"","filename":"floatimage7.png","url":"https://assets-eu.researchsquare.com/files/rs-4532901/v1/a58373f606e8330dfd6d21de.png"},{"id":59873620,"identity":"bf65eac5-d7c4-421d-996c-a516abd53313","added_by":"auto","created_at":"2024-07-08 17:28:51","extension":"png","order_by":8,"title":"Figure 8","display":"","copyAsset":false,"role":"figure","size":28741,"visible":true,"origin":"","legend":"\u003cp\u003eDot plots of Accuracy metric on different dimensions of the window size 
5\u003c/p\u003e","description":"","filename":"floatimage8.png","url":"https://assets-eu.researchsquare.com/files/rs-4532901/v1/ecf8c95a8a9e77dfea5714a0.png"},{"id":59873289,"identity":"6f4c434d-6729-4268-842b-3ca3ced9973a","added_by":"auto","created_at":"2024-07-08 17:20:51","extension":"png","order_by":9,"title":"Figure 9","display":"","copyAsset":false,"role":"figure","size":28283,"visible":true,"origin":"","legend":"\u003cp\u003eDot plots of Accuracy metric on different dimensions of the window size 6\u003c/p\u003e","description":"","filename":"floatimage9.png","url":"https://assets-eu.researchsquare.com/files/rs-4532901/v1/603075c80c257b3a25912eb8.png"},{"id":59872365,"identity":"c339e25d-c0e4-4f74-8803-0f741644eb72","added_by":"auto","created_at":"2024-07-08 17:12:51","extension":"png","order_by":10,"title":"Figure 10","display":"","copyAsset":false,"role":"figure","size":29333,"visible":true,"origin":"","legend":"\u003cp\u003eDot plots of Accuracy metric on different dimensions of the window size 7\u003c/p\u003e","description":"","filename":"floatimage10.png","url":"https://assets-eu.researchsquare.com/files/rs-4532901/v1/fd9fd9508d76ad1f1881ce4a.png"},{"id":59872361,"identity":"6715288a-bd2c-4856-a214-2c61fc52f533","added_by":"auto","created_at":"2024-07-08 17:12:51","extension":"png","order_by":11,"title":"Figure 11","display":"","copyAsset":false,"role":"figure","size":28716,"visible":true,"origin":"","legend":"\u003cp\u003eDot plots of Recall metric on different dimensions of the window size 5\u003c/p\u003e","description":"","filename":"floatimage11.png","url":"https://assets-eu.researchsquare.com/files/rs-4532901/v1/ccc3eb248f720cba0e7bdbdb.png"},{"id":59872357,"identity":"356623f1-d851-4e5c-b5d8-0a473caed49f","added_by":"auto","created_at":"2024-07-08 17:12:51","extension":"png","order_by":12,"title":"Figure 12","display":"","copyAsset":false,"role":"figure","size":27650,"visible":true,"origin":"","legend":"\u003cp\u003eDot plots of Recall metric on 
different dimensions of the window size 6\u003c/p\u003e","description":"","filename":"floatimage12.png","url":"https://assets-eu.researchsquare.com/files/rs-4532901/v1/4dc0a36a46164c62a93fdd19.png"},{"id":59872364,"identity":"0944967c-ecdc-4b5d-9f4f-6e72c1562cab","added_by":"auto","created_at":"2024-07-08 17:12:51","extension":"png","order_by":13,"title":"Figure 13","display":"","copyAsset":false,"role":"figure","size":25765,"visible":true,"origin":"","legend":"\u003cp\u003eDot plots of Recall metric on different dimensions of the window size 7\u003c/p\u003e","description":"","filename":"floatimage13.png","url":"https://assets-eu.researchsquare.com/files/rs-4532901/v1/efc1728ce81626cd23b30ac9.png"},{"id":59872358,"identity":"eeef2a9b-b88c-481a-a1b5-a9e7d51252d0","added_by":"auto","created_at":"2024-07-08 17:12:51","extension":"png","order_by":14,"title":"Figure 14","display":"","copyAsset":false,"role":"figure","size":31205,"visible":true,"origin":"","legend":"\u003cp\u003eDot plots of F1-score metric on different dimensions of the window size 5\u003c/p\u003e","description":"","filename":"floatimage14.png","url":"https://assets-eu.researchsquare.com/files/rs-4532901/v1/983938a002e2ab0006684888.png"},{"id":59872356,"identity":"7e3b5820-4ace-473c-a75c-941e3b45ecab","added_by":"auto","created_at":"2024-07-08 17:12:50","extension":"png","order_by":15,"title":"Figure 15","display":"","copyAsset":false,"role":"figure","size":30516,"visible":true,"origin":"","legend":"\u003cp\u003eDot plots of F1-score metric on different dimensions of the window size 6\u003c/p\u003e","description":"","filename":"floatimage15.png","url":"https://assets-eu.researchsquare.com/files/rs-4532901/v1/5965fd25b87921e219c72677.png"},{"id":59872363,"identity":"0d352a78-b436-44f2-9253-a267b0e32ae3","added_by":"auto","created_at":"2024-07-08 17:12:51","extension":"png","order_by":16,"title":"Figure 
16","display":"","copyAsset":false,"role":"figure","size":29739,"visible":true,"origin":"","legend":"\u003cp\u003eDot plots of F1-score metric on different dimensions of the window size 7\u003c/p\u003e","description":"","filename":"floatimage16.png","url":"https://assets-eu.researchsquare.com/files/rs-4532901/v1/d97429b82cbc3cc1477f4683.png"},{"id":63850933,"identity":"fddd562c-216f-4bf6-b11e-f4b73c6d0265","added_by":"auto","created_at":"2024-09-03 04:02:47","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":1496637,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-4532901/v1/17e313cd-75ed-4358-b614-065078ce3992.pdf"},{"id":59873953,"identity":"34f9ec49-2ae9-496b-8838-7f63d31a5dc6","added_by":"auto","created_at":"2024-07-08 17:36:50","extension":"zip","order_by":1,"title":"","display":"","copyAsset":false,"role":"supplement","size":60888,"visible":true,"origin":"","legend":"","description":"","filename":"results.zip","url":"https://assets-eu.researchsquare.com/files/rs-4532901/v1/41f1ad49884d780520287aff.zip"}],"financialInterests":"No competing interests reported.","formattedTitle":"Effect of dimension size and window size on word embedding in classification tasks","fulltext":[{"header":"1 Introduction","content":"\u003cp\u003eNowadays, artificial intelligence algorithms are used in almost every area of life. Most of these methods work with data that are in numerical form. The same applies to the field of natural language processing, whose main task is to understand human language as humans understand it. Machine learning algorithms cannot deal with words or sentences of text, therefore there are several approaches to transform individual words into digital form, most commonly into the form of multidimensional word vectors. 
Several techniques are used for this transformation, from the most trivial, such as One-hot-encoding or TF-IDF (term frequency - inverse document frequency) [\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e], to Word2Vec [\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e] and GloVe (Global Vectors) [\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e], which are still widely used today. The word vectors created by these methods are nothing more than word vectors mapped into a multidimensional space using continuous floating-point numbers that represent the meaning of the word [\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e], [\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e], [\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eThe meaning of a word in the Word2Vec model is represented by its context, i.e., the words that come before and after it. As a result, this model creates one multidimensional vector for each word. The syntactic or semantic relations between words are preserved, and the vector distances of individual words correspond to the human conception of the relation between words [\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e], [\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eIn addition to GloVe and Word2Vec, other newer methods such as BERT, GPT, and the like are currently being utilized. Unlike GloVe and Word2Vec, which provide context-independent word embeddings, these models generate a single vector (embedding) for each word, amalgamating all its various senses into one representation. This numeric representation, termed embedding or vector, remains consistent regardless of the word's position within a sentence or its multiple meanings. 
Conversely, GPT, ELMo, and BERT are capable of producing diverse word embeddings that encapsulate the contextual nuances of a word, including its syntactic and semantic context within a sentence. This disparity has practical implications; while word2vec and GloVe vectors, trained on extensive corpora, can be directly employed for downstream tasks without necessitating the original model used for training, GPT, ELMo, and BERT require access to the model post-training, as they rely on context to generate word vectors. Consequently, for numerous projects, employing Word2Vec and GloVe may prove more advantageous due to their context-independent nature.\u003c/p\u003e \u003cp\u003eIt is important to know how to set the context window size and dimension size parameters for Word2Vec and GloVe models, because the wrong combination of these parameters can result in low quality word vectors.\u003c/p\u003e \u003cp\u003eAdewumi et al. [\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e] tried to determine the correct setting of these parameters. He empirically investigated the parameters dimension size from 100 to 3000 and window sizes with values of 4 and 8. By evaluating the results, he proved that the best model is usually task-specific, and a high analogy score does not necessarily correlate positively with the F1 score metric. He further found out that increasing the size of the vector dimension too much (above 400) resulted in poor quality word vectors. And in addition, he claims that using a small corpus, he obtained a better WordSim score, corresponding Spearman correlation, compared to the original model by Mikolov et al. [\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e], trained on a corpus of 100\u0026nbsp;billion words.\u003c/p\u003e \u003cp\u003eAnother author Yang et al. 
[\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e] solved the task of classification on posts from Twitter in English and classified whether the given post was related to the topic of elections or not. He used a convolutional neural network model and, comparing the results, found that the input corpus for creating vectors should be similar to the corpus used for classification, and he achieved the best classification accuracy by setting the dimension to 800, the window size to 10, at the level of 80.8% by F1 score metric.\u003c/p\u003e \u003cp\u003eNazir et al. [\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e] formed vectors with different dimensions and window sizes. They focused on creating Urdu word embedding. They created the best model using 500 dimension size and 5 window size.\u003c/p\u003e \u003cp\u003eIn our article, we want to verify the findings achieved by both authors. Like them, we want to experiment with window size and the dimension size. The aim of the paper is to find out the most suitable values of the window size and the dimension size for the calculation of predictive word models.\u003c/p\u003e \u003cp\u003eTo verify the suitability of the parameters of the window size and the size of the dimension for the calculation of predictive word models, we proceeded according to the following methodology (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e):\u003c/p\u003e \u003cp\u003e \u003col\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003e1. \u003cb\u003ePre-processing of the input text corpus\u003c/b\u003e: The input text corpus underwent preprocessing procedures.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003e2. 
\u003cb\u003eCreation of word embedding models\u003c/b\u003e: A total of 225 word models were generated, resulting from the combination of three embedding methods, three window sizes, and 25 dimensions.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003e3. \u003cb\u003eGeneration of input vectors for classification task\u003c/b\u003e: Each word model was utilized to generate input vectors for the classification task, derived from the input texts of the dataset.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003e4. \u003cb\u003eCreation of classifiers\u003c/b\u003e: Input vectors of words were employed to develop classifiers. Specifically, 225 classifiers were crafted for individual input vectors generated using predictive models.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003e5. \u003cb\u003ePerformance evaluation\u003c/b\u003e: Performance metrics including accuracy, precision, recall, and F1 score were computed for each classifier created, based on the test set. Simultaneously, the time taken for word vector creation was recorded for result analysis.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003e6. \u003cb\u003eResults assessment\u003c/b\u003e: The outcomes were systematically evaluated.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003c/ol\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eThere are several approaches to compare the success of the proposed method. Nazir et al. [\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e] evaluate success using datasets wordsim-353 and Lexsim-999, which contain calculated similarities of selected pairs. Calculating the similarities between two words is often presented in educational examples focused on word vectors. 
However, from a practical point of view, word vectors are used in different practical tasks. Classification is one of them. For this reason, we will verify the suitability of our method by using the trained methods to create vectors in the classification task. Subsequently, we will evaluate the performance measures of the classification itself, i.e. we will not evaluate the word vectors themselves directly, but we will use them to solve the classification task and evaluate the success of the classification.\u003c/p\u003e \u003cp\u003eThe structure of the paper is as follows. The current state of research in the area of influence of window size and dimension size parameters is summarized in the second part. Spam datasets used in the research, as well as related text pre-processing techniques and text vectorization models used, are described in the third section. The most important results are summarized in the fourth section. Discussion and conclusions form the content of the last part of the paper.\u003c/p\u003e"},{"header":"2 Related works","content":"\u003cp\u003eA model for creating multiple word vectors called Word2vec was proposed relatively recently by Mikolov et al. [\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e], [\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e]. The Word2vec model is based on the fact that the very meaning of a word is represented by its context, i.e., the words that come before and after it. As a result, this model creates one multiple vectors for each word. The syntactic or semantic relations between words are preserved, and the vector distances of individual words correspond to the human idea of the relation between words.\u003c/p\u003e \u003cp\u003eIn this model, the context window size parameter is very important. 
Authors Levy and Goldberg [\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e] in their work claim that the larger the value of the window size, the more the model tends to capture semantic information, and conversely, the smaller the window size parameter, the more syntactic information is captured in the vectors.\u003c/p\u003e \u003cp\u003eAnother very important parameter is the dimension of word vectors itself. Small dimensions cannot capture some significant relationships between words, while large dimensions allow the construction of a complete vector that captures information between words. However, some dimensions that are too high will cause the resulting word vectors to contain redundant information without further added value. In general, the most used dimension size value for the creation of word vectors is between 50 and 500 [\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e], [\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eThe creators of the Word2vec model themselves state two basic architectures: CBOW (Continuous Bag-of-Words) and Skip-gram. Both of these architectures consist of an artificial neural network that contains an input layer (where input words are encoded as One-hot-encoding), one hidden layer (also called a embedding layer), and an output layer. After training the network, the result will not actually be the output from the output layer of the neural network, but the weights from the hidden layer will be the result. 
These weights will represent multiple word vectors [\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e], [\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e], [\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e], [\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e], [\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eThe difference between these architectures is that in CBOW, a specific word is predicted based on its context words. In general, the CBOW architecture performs slightly better than Skip-gram for very frequent words. Unlike CBOW, Skip-gram learns to predict context words from a given word. If two words (one occurring rarely and the other occurring more often) are placed next to each other, they will be treated very similarly. Otherwise, each word will be treated as both a target and a context word [\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e], [\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e], [\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e], [\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eAnother well-known model that is similar to the previous model is the GloVe (Global Vectors) model. This model also captures syntactic or semantic relations between words when creating word vectors. But it works on a slightly different principle than Word2Vec because the GloVe model is based on the matrix factorization technique on the word context matrix. So, first a large matrix of co-occurrence information (words x context) is created, i.e., for each word it is counted how often we see this word in some context in a large corpus. 
Then the matrix is factored to obtain a lower dimensional matrix [\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e], [\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e], [\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eThe goal of training word vectors with the GloVe model consists of learning word vectors in such a way that the scalar product of the represented words is equal to the logarithm of the probability of their co-occurrence. Given that the logarithm of the ratio is equal to the difference of the logarithms, the differences of their vector representation are associated with the ratios of the probabilities of their co-occurrence. The resulting word vectors thus perform very well on verbal analogy tasks, as the model can capture a variety of linguistic patterns [\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eThe motivation for our research was mainly two studies [\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e], [\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e], which we already mentioned in the \u003cspan refid=\"Sec1\" class=\"InternalRef\"\u003eIntroduction\u003c/span\u003e section. These studies investigated the influence of window size and dimension size parameters of word vectors. We attach one more study, by the author Chugh et al [\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e], who in his study investigated the stability of the Word2vec model in terms of how word vectors are consistent with respect to word frequency in the corpus and the dimension size parameter. He examined three different corpora, selecting exactly 10,000 of the most frequent words from each corpus in the pre-processing process. 
In addition, from these 10,000 words, he selected 10 words each of upper (0\u0026ndash;10), middle (4995\u0026ndash;5005) and lower (9990\u0026ndash;10,000) frequencies. The dimension size used as a model parameter ranged from 1 to 377, with the individual values taken from the Fibonacci sequence. At each run of the word vector training algorithm, he kept the 10 nearest neighbor words in the vector space for the words labelled as high, medium, and low frequency. To calculate the stability of the model, he used the Jaccard similarity coefficient between pairs of sets of neighbors of all words.\u003c/p\u003e \u003cp\u003eBy comparing the obtained results, he found that the dimension size of the word vectors has a significant effect on the consistency of the model and that this effect depends on the corpus. He also found that high-frequency words show a higher degree of stability than medium- and low-frequency words, but since any English corpus will always contain low- and medium-frequency words, the overall stability of the model will always decrease with these words [\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eNazir et al. (2022) focus on creating Urdu word embeddings. They trained vectors with different dimensions and window sizes using the word2vec model, and evaluated the trained models with the Spearman correlation coefficient. They used the commonly used vector dimensions of 100, 200, 300, 400, and 500, as well as the standard vector dimensions. In their experiment, they used window sizes of 3, 5, and 7. To evaluate the word embedding models, they used two benchmark datasets that represent similarities of different English word pairs: SimLex-999 [\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e] and wordsim-353 [\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e]. 
On wordsim-353, the models with vector dimensions of 100 and 128 performed better than the rest of the models. On SimLex-999, however, the word embedding model with 500 vector dimensions and a window size of 5 performed better.\u003c/p\u003e"},{"header":"3 Methodology","content":"\u003cdiv id=\"Sec4\" class=\"Section2\"\u003e \u003ch2\u003e3.1 Datasets and text data pre-processing\u003c/h2\u003e \u003cp\u003eWe used two datasets for this research. The first, the Enron Email Dataset, is a freely available text dataset (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.kaggle.com/datasets/wcukierski/enron-email-dataset/discussion/116088\u003c/span\u003e\u003cspan address=\"https://www.kaggle.com/datasets/wcukierski/enron-email-dataset/discussion/116088\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e) containing 500,000 English-language emails sent between 150 users. We used this dataset as the input text corpus for creating word vectors. The second, the SMS Spam Collection dataset, is also freely available (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://github.com/mohitgupta-omg/Kaggle-SMS-Spam-Collection-Dataset-/blob/master/spam.csv\u003c/span\u003e\u003cspan address=\"https://github.com/mohitgupta-omg/Kaggle-SMS-Spam-Collection-Dataset-/blob/master/spam.csv\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e) and served in the evaluation phase to verify the quality of the generated word vectors on the problem of spam classification [\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e], [\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e], [\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e]. 
This dataset consisted of a total of 5565 records of correctly labelled short text messages (747 spam and 4819 \"ham\", non-spam).\u003c/p\u003e \u003cp\u003eBoth datasets had to be pre-processed before use. The pre-processing was done in Python: we split the whole text into sentences, converted all characters to lowercase, removed words that contained special characters, numbers, or whitespace characters, removed stop words (words that occur frequently in the language but carry little meaning on their own), and lemmatized all the remaining words.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec5\" class=\"Section2\"\u003e \u003ch2\u003e3.2 Used word embedding methods\u003c/h2\u003e \u003cp\u003e \u003cb\u003eWord2Vec \u0026ndash; CBOW\u003c/b\u003e \u003c/p\u003e \u003cp\u003eThe goal of the CBOW (Continuous Bag of Words) architecture of the Word2Vec method is to predict a target word based on its neighboring or context words within a context window. This architecture consists of a simple feedforward neural network with three layers. One-hot encoded context words enter the input layer, and the hidden layer contains as many neurons as the dimension size we set for the word vectors. The output layer contains one neuron per vocabulary word and applies a non-linear Softmax activation function. The CBOW architecture is shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e along with the Skip-gram architecture. The result of training is not the output of the output layer but the weights (word vectors) of the hidden layer. 
After training the network, these weights are set so that they can capture the meaning of the input words [\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e], [\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e], [\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e], [\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e].\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003cb\u003eWord2Vec - Skip-gram\u003c/b\u003e \u003c/p\u003e \u003cp\u003eThe second architecture is Skip-gram (in Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e) and its aim, unlike CBOW, is to predict context words based on the target word within the context window. This architecture also consists of a simple neural network, but the input to it is only one target word in each, and the output layer contains as many neurons as the context words we want to predict. The output layer contains the Softmax activation function, but in practice Word2Vec uses its algorithmic enhancements Hierarchical Softmax and Negative Sampling. The disadvantage of the Softmax function is that to represent the probability of each word, it is necessary to go through all the words in the dictionary, which is computationally very demanding [\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e], [\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e], [\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e], [\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e].\u003c/p\u003e \u003cp\u003e \u003cb\u003eGloVe\u003c/b\u003e \u003c/p\u003e \u003cp\u003eWhile the Word2Vec model is a predictive model consisting of a feedforward neural network, the GloVe (Global Vectors) model is a count-based model. 
The first step is to create a word co-occurrence matrix, which records how often each word occurs in a given context. This matrix is then factorized to reduce its dimension. The goal of the GloVe model is to train word vectors so that their scalar product is equal to the logarithm of the probability of co-occurrence of words [\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e], [\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e], [\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eUsing the Gensim library [\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e], we implemented these 3 embedding methods in Python by creating the corresponding models. The input to the models was the pre-processed text data and the input parameters, the most important for us being the window size and the dimension size.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec6\" class=\"Section2\"\u003e \u003ch2\u003e3.3 Sequential neural network classifier\u003c/h2\u003e \u003cp\u003eWe verified the quality of the word vectors trained by the Word2Vec and GloVe models on spam classification, using a sequential neural network classifier built with the Keras library [\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e] in Python. The neural network itself consists of three layers: Embedding, Flatten and Dense.\u003c/p\u003e \u003cp\u003eAs input parameters to the first Embedding layer, we set the size of the vocabulary, the dimension size of the word vectors, the word vectors themselves, which represent the weights of the hidden layer, and the length of the input sequences. The output Dense layer contains one neuron with a Sigmoid activation function. We used the Flatten layer to connect the Embedding layer with the output Dense layer. 
We compiled the network model using the RMSprop optimizer and the BinaryCrossentropy loss function.\u003c/p\u003e \u003cp\u003eThe pre-processed second data file contains, in addition to a short message, a label of whether it is spam or not. We encoded each message (all words in it) into a sequence of numbers. The Keras library requires these sequences to be vectorized and to have the same length. We trained the network model on the training data of these sequences and using basic statistical metrics we obtained the classification accuracy based on the test data.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec7\" class=\"Section2\"\u003e \u003ch2\u003e3.4 Validation and used evaluation metrics\u003c/h2\u003e \u003cp\u003eTo assess the performance of our classification model, i.e. our input word vectors to that model, we will use basic performance evaluation metrics, which are accuracy, precision, recall and f1-score. Accuracy is the ratio of correct predictions to the total number of samples and is computed as in Eq.\u0026nbsp;(\u003cspan refid=\"Equ1\" class=\"InternalRef\"\u003e1\u003c/span\u003e):\u003cdiv id=\"Equ1\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ1\" name=\"EquationSource\"\u003e\n$$acc=\\frac{TP+TN}{TP+TN+FP+FN} ,$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e1\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003ewhere TP represents the number of True Positive results, FP represents the number of False Positive results, TN represents the number of True Negative results, and FN represents the number of False Negative results. 
To evaluate, analyze and describe the results of our model, we also calculated precision (Eq.\u0026nbsp;(\u003cspan refid=\"Equ2\" class=\"InternalRef\"\u003e2\u003c/span\u003e)), recall (Eq.\u0026nbsp;(\u003cspan refid=\"Equ3\" class=\"InternalRef\"\u003e3\u003c/span\u003e)) and F1 score (Eq.\u0026nbsp;(\u003cspan refid=\"Equ4\" class=\"InternalRef\"\u003e4\u003c/span\u003e)) as follows:\u003cdiv id=\"Equ2\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ2\" name=\"EquationSource\"\u003e\n$$precision=\\frac{TP}{TP+FP} ,$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e2\u003c/div\u003e\u003c/div\u003e\u003cdiv id=\"Equ3\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ3\" name=\"EquationSource\"\u003e\n$$recall=\\frac{TP}{TP+FN} ,$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e3\u003c/div\u003e\u003c/div\u003e\u003cdiv id=\"Equ4\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ4\" name=\"EquationSource\"\u003e\n$$F1 score=2*\\frac{precision*recall}{precision+recall}$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e4\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003ePrecision is the ratio of correctly predicted positive observations to all predicted positive observations. Recall is the ratio of correctly predicted positive observations to all observations that are actually positive. The F1 score is the harmonic mean of precision and recall. Therefore, this score takes both false positives and false negatives into account. 
We used the Scikit-learn [\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e] Python library to implement all these evaluation metrics.\u003c/p\u003e \u003c/div\u003e"},{"header":"4 Results","content":"\u003cp\u003eIn the research, we focused on finding the most suitable window size and dimension size for computing the predictive word model Word2Vec and the count-based model GloVe. The basic principle of the Word2Vec model is to train a neural network to predict the target word based on neighboring words from the selected context window, or conversely to predict neighboring words based on the target word. However, the trained neural network itself is not subsequently used for prediction. The main goal of this process is to learn the weights of the projection layer, which are the word vectors we are looking for. It is assumed that the weight vectors, which are trained for prediction, can also successfully represent the meaning of individual words. The goal of the GloVe model is to create word vectors so that their scalar product is equal to the logarithm of the probability of co-occurrence of words from a previously created and factorized matrix, which contains information on how often each of the words occurs in some context.\u003c/p\u003e \u003cp\u003eWe determine the appropriate window size and dimension size for generating word vectors with these models by analyzing the performance metrics of a classifier that takes the resulting vectors as input. The classification problem we are solving is spam classification, using the dataset we have already described.\u003c/p\u003e \u003cdiv id=\"Sec9\" class=\"Section2\"\u003e \u003ch2\u003e4.1. Window size\u003c/h2\u003e \u003cp\u003eWe present the results individually for each performance metric. 
From the results for the accuracy metric, it is clear (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e) that the best values were observed for the Word2Vec - Skip-gram method, which reached the highest median, lower quartile, upper quartile and maximum values. We achieved the most homogeneous results with the Word2Vec - CBOW method. For the Word2Vec - Skip-gram method, the highest median value was observed at window size 5 (0.9449), the maximum value at window sizes 5 and 6 (0.9536), the highest upper quartile value at window size 5 (0.9487), and the highest lower quartile value also at window size 5 (0.9391). The graph suggests that the differences in the measured values for the Word2Vec - Skip-gram method across all three window sizes are probably not statistically significant. For this reason, we did not verify statistical significance.\u003c/p\u003e \u003cp\u003eIt is interesting to look at the remaining two methods, Word2Vec - CBOW and GloVe. Both achieved the highest median values for a window size of 7 (0.9410 for Word2Vec - CBOW and 0.9304 for GloVe). The values for the lower and upper quartile were similarly in favor of a window size of 7.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eThe same trend can be observed in the results for the performance measure precision (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e). The best results were recorded for the Word2Vec - Skip-gram method in all three window sizes. Within this method, the smallest window size, 5 (median = 0.8956), appears to be the most appropriate. Conversely, in the remaining two methods, the trend was opposite and the median, upper and lower quartile were highest for window size 7.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eIn our case, the recall metric determines how many spam messages were identified as spam. 
The results (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e) here differ from those for the accuracy and precision metrics. For the Word2Vec - Skip-gram method, the highest median value (0.8664) is still at window size 5, but the differences between the median or upper quartile values are very small. Interestingly, however, the highest median value (0.8767) was observed in the Word2Vec – CBOW method for a window size of 7.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eFor completeness, we also present the box plot (Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003e) for the F1 score, which is the harmonic mean of precision and recall.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eIn addition to performance metrics, we also measured the training time of the word vectors for all window sizes and dimension sizes (Fig.\u0026nbsp;\u003cspan refid=\"Fig7\" class=\"InternalRef\"\u003e7\u003c/span\u003e). Time was measured using the time library in Python, and the result (in seconds) was the difference between the time recorded before training and the time recorded after training the word vectors for a particular parameter setting. In general, the lowest training times were measured for window size 5 and the highest for window size 7. From the time point of view, Word2Vec – CBOW appears to be the most efficient method for window size 5 (median = 9.83). Within the Word2Vec – CBOW method, low median values (10.82 and 10.20) were also observed for the other window sizes 6 and 7. Compared to Word2Vec - Skip-gram (median = 27.89 for window size 5) and GloVe (median = 32.82 for window size 5), the Word2Vec - CBOW method appears to be time efficient.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec10\" class=\"Section2\"\u003e \u003ch2\u003e4.2. 
Dimension size\u003c/h2\u003e \u003cp\u003eThe dimension size parameter was investigated for 25 different dimensions (10, 20, ..., 240, 250), for 3 different window sizes (5, 6, 7) and three embedding methods. In the graphs (Fig.\u0026nbsp;\u003cspan refid=\"Fig7\" class=\"InternalRef\"\u003e7\u003c/span\u003e, Fig.\u0026nbsp;\u003cspan refid=\"Fig8\" class=\"InternalRef\"\u003e8\u003c/span\u003e, Fig.\u0026nbsp;\u003cspan refid=\"Fig9\" class=\"InternalRef\"\u003e9\u003c/span\u003e) we present the measured values of the accuracy metric for the 3 different window sizes. It is clear from the results that the biggest differences, i.e., the most significant increases, were recorded for dimension sizes up to 50. At a size of 50 and more, the accuracy was already higher than 0.92 (except GloVe for 60 and 80) for almost all methods.\u003c/p\u003e \u003cp\u003eIt is interesting that the most successful method, Word2Vec - Skip-gram, had the highest accuracy from size 110 for all methods and window sizes. After dimension size 110, the Word2Vec - CBOW method was among the more successful.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eIn the case of the recall metric, the results (Fig.\u0026nbsp;\u003cspan refid=\"Fig11\" class=\"InternalRef\"\u003e11\u003c/span\u003e, Fig.\u0026nbsp;\u003cspan refid=\"Fig12\" class=\"InternalRef\"\u003e12\u003c/span\u003e, Fig.\u0026nbsp;\u003cspan refid=\"Fig13\" class=\"InternalRef\"\u003e13\u003c/span\u003e) are very similar. 
Here, with a dimension size of 130 and more, recall values above 0.80 were achieved even for the GloVe method and above 0.85 for the remaining two methods.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eFor completeness, we also present the results (Fig.\u0026nbsp;\u003cspan refid=\"Fig14\" class=\"InternalRef\"\u003e14\u003c/span\u003e, Fig.\u0026nbsp;\u003cspan refid=\"Fig15\" class=\"InternalRef\"\u003e15\u003c/span\u003e, Fig.\u0026nbsp;\u003cspan refid=\"Fig16\" class=\"InternalRef\"\u003e16\u003c/span\u003e) of the F1-score metric. These results also make it clear that dimensions lower than 50 show significantly lower values. On the other hand, for a dimension above 10, we achieve values greater than 0.82 for the GloVe method and values greater than 0.86 for the remaining two methods.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e"},{"header":"Discussion","content":"\u003cp\u003eFrom the results, it is possible to observe the appropriate parameter settings of the dimension size and the window size. We investigated the differences between only three window sizes: 5, 6 and 7. More reliable results would require working with more window sizes. However, this calculation is time-consuming, so we limited ourselves to three sizes, chosen on the basis of our previous experience with the investigated methods. The results suggest that these three sizes were chosen appropriately, because in one method the most suitable size appears to be 5 and in the remaining two it is 7. The results therefore show that the appropriate window size must be chosen based on the embedding method used. On the other hand, looking at the results for each method separately, the difference between window sizes was not very large. 
For the Word2Vec - Skip-gram method, the best results were observed for a window size of 5, and for the Word2Vec - CBOW and GloVe methods for a window size of 7.\u003c/p\u003e\u003cp\u003eFrom the perspective of dimension size, our results indicate that dimensions smaller than 50 are not suitable. On the other hand, for dimensions larger than 150, the results did not improve significantly. The motivation for our research was the papers [\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e] and [\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e]. Several results of these works were also confirmed in our research.\u003c/p\u003e\u003cp\u003eWe agree with Adewumi et al. [\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e] that the best model is usually task specific. In terms of dimension size, the authors report that increasing the size of the vector dimension (above 400) resulted in poor quality word vectors. Our results show that dimension 150 is fully sufficient.\u003c/p\u003e\u003cp\u003eYang et al. [\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e] used a convolutional neural network model, achieving a similar classification success rate of approximately 0.81. Their best classification accuracy was achieved by setting the dimension to 800 and the window size to 10. According to our findings, these values are too high. In our case, we already achieved acceptable accuracy with a dimension of 150 and a window size of 7.\u003c/p\u003e\u003cp\u003eIt is worth noting that there are two perspectives when it comes to evaluating created models: intrinsic and extrinsic evaluation. Intrinsic evaluation (e.g., by computing the similarity of word pairs) is justified when using Word2Vec and GloVe models for identifying semantically similar words. In our article, we focused on extrinsic evaluation. 
We believe that in practical applications, these models will be mainly used for preparing word vectors for classification tasks. For this reason, we aimed to evaluate our Word2Vec and GloVe model improvements based on improved performance measures in classification tasks.\u003c/p\u003e"},{"header":"Conclusion","content":"\u003cp\u003eIn this research, we focused on the empirical investigation of setting the values of the context window size and dimension size input parameters for Word2Vec and GloVe word vector embedding models. We created 3 types of models (W2V CBOW, W2V Skip-gram and GloVe), with different window size settings (5, 6, 7) and with different dimension size settings, for a total of 225 (3 x 3 x 25) models. With each of these models, we obtained word vectors based on the same pre-processed input text corpus. To tell which parameter settings were better than others, we used a neural network classifier along with a spam classification dataset. Using classification performance evaluation metrics, we analyzed the achieved results through graphs and found that the context window size must be chosen based on the embedding method used. We achieved the best results with the W2V Skip-gram method for a window size of 5. The results we obtained by examining the dimension size of word vectors indicate that dimensions smaller than 50 are not suitable for our classification task, and that the results for dimensions larger than 150 did not improve much with increasing dimension.\u003c/p\u003e \u003cp\u003eSimilar to other authors, we also assume that the parameter setting depends on the specific classification task. 
In the future, we want to carry out research in which we will try to find a suitable setting of the studied parameters depending on the size of the input training text corpus as well as the dataset for the classification problem.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e\u003c/p\u003e "},{"header":"Declarations","content":"\u003cp\u003eData availability\u003c/p\u003e\n\u003cp\u003eThe datasets and code used and/or analyzed during the current study are available from the corresponding author (Jozef Kapusta; Email: [email protected]) upon reasonable request.\u003c/p\u003e\n\u003cp\u003e\u0026nbsp;Acknowledgments. This work was supported by the Scientific Grant Agency of the Ministry of Education of the Slovak Republic and Slovak Academy of Sciences under Contract VEGA-1/0821/21, also by the Slovak Research and Development Agency under the contract no. APVV-18-0473.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eM. Liang and T. Niu, \u0026ldquo;Research on Text Classification Techniques Based on Improved TF-IDF Algorithm and LSTM Inputs,\u0026rdquo; \u003cem\u003eProcedia Comput Sci\u003c/em\u003e, vol. 208, pp. 460\u0026ndash;470, 2022, doi: 10.1016/j.procs.2022.10.064.\u003c/li\u003e\n\u003cli\u003eA. Sharma and S. Kumar, \u0026ldquo;Ontology-based semantic retrieval of documents using Word2vec model,\u0026rdquo; \u003cem\u003eData Knowl Eng\u003c/em\u003e, vol. 144, p. 102110, Mar. 2023, doi: 10.1016/j.datak.2022.102110.\u003c/li\u003e\n\u003cli\u003eN. Badri, F. Kboubi, and A. H. Chaibi, \u0026ldquo;Combining FastText and Glove Word Embedding for Offensive and Hate speech Text Detection,\u0026rdquo; \u003cem\u003eProcedia Comput Sci\u003c/em\u003e, vol. 207, pp. 769\u0026ndash;778, 2022, doi: 10.1016/j.procs.2022.09.132.\u003c/li\u003e\n\u003cli\u003eE. M. Dharma, F. Lumban Gaol, H. Leslie, H. S. Warnars, and B. 
Soewito, \u0026ldquo;THE ACCURACY COMPARISON AMONG WORD2VEC, GLOVE, AND FASTTEXT TOWARDS CONVOLUTION NEURAL NETWORK (CNN) TEXT CLASSIFICATION,\u0026rdquo; \u003cem\u003eJ Theor Appl Inf Technol\u003c/em\u003e, vol. 31, no. 2, 2022, [Online]. Available: www.jatit.org\u003c/li\u003e\n\u003cli\u003eJ. M. Wyatt, G. J. Booth, and A. H. Goldman, \u0026ldquo;Natural Language Processing and Its Use in Orthopaedic Research,\u0026rdquo; \u003cem\u003eCurr Rev Musculoskelet Med\u003c/em\u003e, vol. 14, no. 6, pp. 392\u0026ndash;396, Dec. 2021, doi: 10.1007/s12178-021-09734-3.\u003c/li\u003e\n\u003cli\u003eD. Khurana, A. Koli, K. Khatter, and S. Singh, \u0026ldquo;Natural language processing: state of the art, current trends and challenges,\u0026rdquo; \u003cem\u003eMultimed Tools Appl\u003c/em\u003e, vol. 82, no. 3, pp. 3713\u0026ndash;3744, Jan. 2023, doi: 10.1007/s11042-022-13428-4.\u003c/li\u003e\n\u003cli\u003eT. Mikolov, K. Chen, G. Corrado, and J. Dean, \u0026ldquo;Efficient Estimation of Word Representations in Vector Space,\u0026rdquo; Jan. 2013.\u003c/li\u003e\n\u003cli\u003eH. D. Abubakar and M. Umar, \u0026ldquo;Sentiment Classification: Review of Text Vectorization Methods: Bag of Words, Tf-Idf, Word2vec and Doc2vec,\u0026rdquo; \u003cem\u003eSLU Journal of Science and Technology\u003c/em\u003e, vol. 4, no. 1 \u0026amp; 2, pp. 27\u0026ndash;33, Aug. 2022, doi: 10.56471/slujst.v4i.266.\u003c/li\u003e\n\u003cli\u003eT. P. Adewumi, F. Liwicki, and M. Liwicki, \u0026ldquo;Word2Vec: Optimal Hyper-Parameters and Their Impact on NLP Downstream Tasks,\u0026rdquo; Mar. 2020.\u003c/li\u003e\n\u003cli\u003eT. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, \u0026ldquo;Distributed Representations of Words and Phrases and their Compositionality,\u0026rdquo; Oct. 2013.\u003c/li\u003e\n\u003cli\u003eX. Yang, C. Macdonald, and I. 
Ounis, \u0026ldquo;Using word embeddings in Twitter election classification,\u0026rdquo; \u003cem\u003eInformation Retrieval Journal\u003c/em\u003e, vol. 21, no. 2\u0026ndash;3, pp. 183\u0026ndash;207, Jun. 2018, doi: 10.1007/s10791-017-9319-5.\u003c/li\u003e\n\u003cli\u003eS. Nazir, M. Asif, S. A. Sahi, S. Ahmad, Y. Y. Ghadi, and M. H. Aziz, \u0026ldquo;Toward the Development of Large-Scale Word Embedding for Low-Resourced Language,\u0026rdquo; \u003cem\u003eIEEE Access\u003c/em\u003e, vol. 10, pp. 54091\u0026ndash;54097, 2022, doi: 10.1109/ACCESS.2022.3173259.\u003c/li\u003e\n\u003cli\u003eT. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, \u0026ldquo;Distributed Representations of Words and Phrases and their Compositionality,\u0026rdquo; Oct. 2013.\u003c/li\u003e\n\u003cli\u003eO. Levy and Y. Goldberg, \u0026ldquo;Dependency-Based Word Embeddings,\u0026rdquo; in \u003cem\u003eProceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)\u003c/em\u003e, Stroudsburg, PA, USA: Association for Computational Linguistics, 2014, pp. 302\u0026ndash;308. doi: 10.3115/v1/P14-2050.\u003c/li\u003e\n\u003cli\u003eY. Goldberg and O. Levy, \u0026ldquo;word2vec Explained: deriving Mikolov et al.\u0026rsquo;s negative-sampling word-embedding method,\u0026rdquo; Feb. 2014.\u003c/li\u003e\n\u003cli\u003eMd. A. H. Wadud, M. F. Mridha, and M. M. Rahman, \u0026ldquo;Word Embedding Methods for Word Representation in Deep Learning for Natural Language Processing,\u0026rdquo; \u003cem\u003eIraqi Journal of Science\u003c/em\u003e, pp. 1349\u0026ndash;1361, Mar. 2022, doi: 10.24996/ijs.2022.63.3.37.\u003c/li\u003e\n\u003cli\u003eJ. Pennington, R. Socher, and C. 
Manning, \u0026ldquo;Glove: Global Vectors for Word Representation,\u0026rdquo; in \u003cem\u003eProceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)\u003c/em\u003e, Stroudsburg, PA, USA: Association for Computational Linguistics, 2014, pp. 1532\u0026ndash;1543. doi: 10.3115/v1/D14-1162.\u003c/li\u003e\n\u003cli\u003eG. Mu\u0026ntilde;et\u0026oacute;n-Santa, D. Escobar-Grisales, F. O. L\u0026oacute;pez-Pab\u0026oacute;n, P. A. P\u0026eacute;rez-Toro, and J. R. Orozco-Arroyave, \u0026ldquo;Classification of Poverty Condition Using Natural Language Processing,\u0026rdquo; \u003cem\u003eSoc Indic Res\u003c/em\u003e, vol. 162, no. 3, pp. 1413\u0026ndash;1435, Aug. 2022, doi: 10.1007/s11205-022-02883-z.\u003c/li\u003e\n\u003cli\u003eJ. Kapusta, M. Drlik, and M. Munk, \u0026ldquo;Using of n-grams from morphological tags for fake news classification,\u0026rdquo; \u003cem\u003ePeerJ Comput Sci\u003c/em\u003e, vol. 7, p. e624, Jul. 2021, doi: 10.7717/peerj-cs.624.\u003c/li\u003e\n\u003cli\u003eT. P. Adewumi, F. Liwicki, and M. Liwicki, \u0026ldquo;Word2Vec: Optimal Hyper-Parameters and Their Impact on NLP Downstream Tasks,\u0026rdquo; Mar. 2020.\u003c/li\u003e\n\u003cli\u003eM. Chugh, P. A. Whigham, and G. Dick, \u0026ldquo;Stability of Word Embeddings Using Word2Vec,\u0026rdquo; 2018, pp. 812\u0026ndash;818. doi: 10.1007/978-3-030-03991-2_73.\u003c/li\u003e\n\u003cli\u003eF. Hill, R. Reichart, and A. Korhonen, \u0026ldquo;SimLex-999: Evaluating Semantic Models With (Genuine) Similarity Estimation,\u0026rdquo; \u003cem\u003eComputational Linguistics\u003c/em\u003e, vol. 41, no. 4, pp. 665\u0026ndash;695, Dec. 2015, doi: 10.1162/COLI_a_00237.\u003c/li\u003e\n\u003cli\u003eE. Agirre, E. Alfonseca, K. Hall, J. Kravalova, M. Paşca, and A. 
Soroa, \u0026ldquo;A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches,\u0026rdquo; in \u003cem\u003eProceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics\u003c/em\u003e, Boulder, Colorado: Association for Computational Linguistics, Jun. 2009, pp. 19\u0026ndash;27. [Online]. Available: https://aclanthology.org/N09-1003\u003c/li\u003e\n\u003cli\u003eO. Abayomi-Alli, S. Misra, A. Abayomi-Alli, and M. Odusami, \u0026ldquo;A review of soft techniques for SMS spam classification: Methods, approaches and applications,\u0026rdquo; \u003cem\u003eEng Appl Artif Intell\u003c/em\u003e, vol. 86, pp. 197\u0026ndash;212, Nov. 2019, doi: 10.1016/j.engappai.2019.08.024.\u003c/li\u003e\n\u003cli\u003eG. Waja, G. Patil, C. Mehta, and S. Patil, \u0026ldquo;How AI Can be Used for Governance of Messaging Services: A Study on Spam Classification Leveraging Multi-Channel Convolutional Neural Network,\u0026rdquo; \u003cem\u003eInternational Journal of Information Management Data Insights\u003c/em\u003e, vol. 3, no. 1, p. 100147, Apr. 2023, doi: 10.1016/j.jjimei.2022.100147.\u003c/li\u003e\n\u003cli\u003eS. Dutta, A. K. Das, S. Ghosh, and D. Samanta, \u0026ldquo;Attribute selection to improve spam classification,\u0026rdquo; in \u003cem\u003eData Analytics for Social Microblogging Platforms\u003c/em\u003e, Elsevier, 2023, pp. 95\u0026ndash;127. doi: 10.1016/B978-0-32-391785-8.00016-0.\u003c/li\u003e\n\u003cli\u003eR. Řehůřek and P. Sojka, \u0026ldquo;Software Framework for Topic Modelling with Large Corpora,\u0026rdquo; in \u003cem\u003eProceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks\u003c/em\u003e, ELRA, Feb. 2010, pp. 45\u0026ndash;50.\u003c/li\u003e\n\u003cli\u003eF. Chollet, \u0026ldquo;Keras.\u0026rdquo; Accessed: Dec. 28, 2022. [Online]. Available: https://keras.io\u003c/li\u003e\n\u003cli\u003eF. 
Pedregosa \u003cem\u003eet al.\u003c/em\u003e, \u0026ldquo;Scikit-learn: Machine Learning in Python,\u0026rdquo; \u003cem\u003eJournal of Machine Learning Research\u003c/em\u003e, vol. 12, pp. 2825\u0026ndash;2830, 2011.\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"Spam Classification, Text Mining, Natural Language Processing, Word Embeddings, Dimension Size, Context Window Size","lastPublishedDoi":"10.21203/rs.3.rs-4532901/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-4532901/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eIn natural language processing, there are several approaches to transform text into multi-dimensional word vectors, such as TF-IDF (term frequency - inverse document frequency), Word2Vec, GloVe (Global Vectors), which are widely used to this day. The meaning of a word in Word2Vec and GloVe models represents its context. Syntactic or semantic relationships between words are preserved, and the vector distances between individual words correspond to human perception of the relationship between words. 
Word2Vec and GloVe generate a vector for each word, which can be further utilized. Unlike GPT, ELMo, or BERT, we don't need a model trained on a corpus for further text processing. It's important to know how to set the size of the context window and the dimension size for Word2Vec and GloVe models, as an improper combination of these parameters can lead to low-quality word vectors. In our article, we experimented with these parameters. The results show that it's necessary to choose an appropriate window size based on the embedding method used. In terms of dimension size, according to our results, dimensions smaller than 50 are no longer suitable. On the other hand, with dimensions larger than 150, the results did not significantly improve.\u003c/p\u003e","manuscriptTitle":"Effect of dimension size and window size on word embedding in classification tasks","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2024-07-08 17:12:45","doi":"10.21203/rs.3.rs-4532901/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"8340a8a9-8270-45f1-bd8e-910237a5c515","owner":[],"postedDate":"July 8th, 2024","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[],"tags":[],"updatedAt":"2024-09-03T03:54:39+00:00","versionOfRecord":[],"versionCreatedAt":"2024-07-08 
17:12:45","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-4532901","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-4532901","identity":"rs-4532901","version":["v1"]},"buildId":"qtupq5eGEP_6zYnWcrvyt","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}

