Sentiment Analysis of Movie Review Using Deep Learning Techniques

Research Article

Sweta Solanki, Krutika Trivedi
Sardar Patel College of Engineering, GTU

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-4388226/v1
This work is licensed under a CC BY 4.0 License.
Status: Posted. Version 1.

Abstract

Text mining is the process of identifying meaningful information in text-based data. A large amount of data in the form of reviews and tweets is available on the web. It is difficult to read these reviews manually and assign sentiments to them; therefore, an automated system can be created that analyzes the text and extracts user perceptions. The implemented system performs sentiment analysis on collected restaurant review data and labels each opinion as positive or negative.
In the baseline system, we demonstrate two feature extraction techniques. First, the bag-of-words model is used for feature extraction. Second, term frequency-inverse document frequency (TF-IDF) scores are used. We examined the effectiveness of several N-gram ranges using the naïve Bayes classifier. We analyze the results of the baseline system and the scope for improving them using word embeddings. The CNN model efficiently extracts higher-level features using convolutional layers and max pooling layers. The LSTM model is capable of capturing long-term dependencies between word sequences. We propose a hybrid model that combines an LSTM and a CNN, named the Hybrid CNN-LSTM Model, to address the sentiment analysis problem. We obtained improved accuracy for sentiment analysis on the IMDB movie review dataset.

Keywords: Sentiment analysis, NLP, RNN, LSTM model, CNN

Figures
Figure 1: Abstract baseline system approach
Figure 2: CNN for text classification
Figure 3: Methodology of the proposed hybrid CNN-LSTM model
Figure 4: The hybrid model Adam optimizer
Figure 5: Hybrid model RMSprop optimizer

I. INTRODUCTION

Natural language processing (NLP) is the ability of a computer program to understand human language as it is spoken. Sentiment analysis is one of the most popular applications of NLP. The term sentiment analysis first appeared in (Nasukawa and Yi, 2003), and the term opinion mining first appeared in (Dave, Lawrence, and Pennock, 2003). However, research on sentiment and opinions appeared earlier (Das and Chen, 2001; Morinaga et al., 2002; Pang, Lee and Vaithyanathan, 2002; Tong, 2001; Turney, 2002; Wiebe, 2000) [2]. A sentiment analysis system for text analysis combines natural language processing and machine learning techniques to assign weighted sentiment scores to entities, topics, themes, and categories within a sentence or phrase [1]. The field also goes by many names covering slightly different tasks, e.g., sentiment analysis, opinion mining, opinion extraction, sentiment mining, subjectivity analysis, affect analysis, emotion analysis, and review mining [2].
In this study, we propose a hybrid model that combines an LSTM and a CNN, named the Hybrid CNN-LSTM Model, to address the sentiment analysis problem. First, we use an embedding layer to train initial word embeddings. Word embedding translates text strings into vectors of numeric values. After word embedding, the proposed model combines the features extracted by the convolution and max pooling layers with the long-term dependencies captured by the LSTM. The proposed model also uses dropout to improve accuracy. Our results show that the proposed Hybrid CNN-LSTM Model outperforms traditional deep learning and machine learning techniques in terms of precision, recall, F-measure, and accuracy. The proposed system performs sentiment analysis on the IMDB movie review dataset, classifying reviews into two categories: positive and negative.

II. METHODOLOGY

Step 1: Define the Objective

Identify exactly what you want to achieve with sentiment analysis. For movie reviews, this often involves classifying reviews as positive, negative, or neutral. You might also be interested in more fine-grained analysis, such as different levels of positivity or negativity.

Step 2: Data collection and preparation

• Dataset: Obtain a dataset of movie reviews. A popular choice is the IMDb dataset, which is widely used for sentiment analysis and is available on platforms such as Kaggle.
• Preprocessing:
  • Cleaning Text: Remove HTML tags, special characters, and numbers. Convert all text to lowercase to maintain consistency.
  • Tokenization: Convert sentences into words or tokens to simplify the analysis.
  • Stopword removal: Eliminate common words that may not contribute to sentiment analysis (e.g., "the", "is", "at").
  • Lemmatization/Stemming: Reduce words to their base or root form.
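The preprocessing steps listed above can be sketched in a few lines of standard-library Python. This is only an illustration, not the authors' script: the function names `clean_text` and `preprocess` and the tiny stopword set are assumptions made for the example, and lemmatization/stemming is left out because it normally relies on an external NLP library.

```python
import re

# Tiny illustrative stopword set (an assumption; the paper does not list the one it used).
STOPWORDS = {"the", "is", "at", "a", "an", "and", "of", "to", "in", "this", "it"}

def clean_text(text: str) -> str:
    """Strip HTML tags, drop everything except letters and whitespace, and lowercase."""
    text = re.sub(r"<[^>]+>", " ", text)       # remove HTML tags such as <br />
    text = re.sub(r"[^A-Za-z\s]", " ", text)   # remove digits, punctuation, special characters
    return text.lower()

def preprocess(text: str) -> list[str]:
    """Clean the text, tokenize on whitespace, and remove stopwords."""
    tokens = clean_text(text).split()
    return [tok for tok in tokens if tok not in STOPWORDS]

if __name__ == "__main__":
    print(preprocess("This movie is GREAT!<br />I watched it 2 times."))
    # ['movie', 'great', 'i', 'watched', 'times']
```

Whitespace tokenization is the simplest possible choice here; a real pipeline would more likely use a tokenizer and stopword list from an NLP toolkit.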
• Vectorization: Since deep learning models work with numerical data, text is converted to vectors using techniques such as the following:
  • Word Embeddings: Pretrained models such as Word2Vec, GloVe, or fastText provide a dense representation of words and capture semantic meanings.
  • Embedding Layer: An embedding layer is used in the neural network to learn an embedding for all words in the dataset.

Step 3: Choose a Model Architecture

• Recurrent Neural Networks (RNNs): Good for sequence data such as text.
• Long short-term memory (LSTM): An RNN variant that is better at capturing long-range dependencies and mitigating the vanishing gradient problem.
• Convolutional Neural Networks (CNNs): Although traditionally used for image processing, CNNs can be effective for sentence classification tasks.

Step 4: Model training

• Split the Data: Typically, the data are split into training (80%) and validation (20%) sets.
• Model Configuration: Set up the neural network architecture. When using LSTMs or GRUs, configure the number of layers and the number of units per layer.
• Compile the Model: Choose an optimizer (such as Adam), a loss function (typically binary cross entropy for binary classification or categorical cross entropy for multiclass classification), and evaluation metrics (such as accuracy).
• Training: Train the model on the training data in batches and validate it on the validation set. Techniques such as early stopping and model checkpointing help to avoid overfitting.

Step 5: Evaluation

• Testing: After training, evaluate the model on a separate test dataset to assess its real-world performance.
• Metrics: Use accuracy, precision, recall, F1-score, and perhaps ROC-AUC to evaluate performance.
• Confusion matrix: This matrix helps in understanding true positives, false positives, true negatives, and false negatives.

III. PROPOSED WORK

Baseline System

For the baseline model, we first load the data. Next, we perform preprocessing.
After that, the text is converted to vectors using a bag-of-words model, and the count vector is then transformed into a TF-IDF vector. In this study, different classifier algorithms were used to classify the reviews.

In Fig. 1, we show how to construct a baseline model for sentiment analysis in a few simple steps:

• First is the preprocessing step. We only need to remove punctuation and special characters, tokenize the text, and convert everything to lowercase.
• The vectorization step, which produces numerical features for the classifier, follows. For this purpose, we used bag of words and TF-IDF, a simple vectorization technique that consists of computing word frequencies and downscaling them for words that are too common.
• This method is applied to the Restaurant Reviews dataset to train a sentiment analysis classifier that uses multinomial naïve Bayes.

Machine learning algorithms cannot work with raw text directly; they operate on numbers. Therefore, we needed to convert our text into numbers. In this case, we use the bag-of-words model to do so.

The baseline script uses the CountVectorizer class from the sklearn.feature_extraction.text module.

Some important parameters must be passed to the constructor of the class. The first is the max_features parameter, which is set to 1500. When we convert words to numbers using the bag-of-words approach, every unique word across all the documents becomes a feature, and a collection of documents can contain tens of thousands of unique words. However, words that occur very rarely are typically not good features for classifying documents.

Therefore, we set the max_features parameter to 1500, which means that we use the 1500 most common words as features for training our classifier. The fit_transform function of the CountVectorizer class converts text documents into the corresponding numeric features.
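To make the bag-of-words step concrete, the following is a minimal standard-library sketch of the idea behind a capped vocabulary and TF-IDF rescaling. It is not scikit-learn's code and not the authors' script: the function names, the alphabetical tie-breaking rule, and the three toy reviews are assumptions made for illustration, and `max_features=3` stands in for the paper's 1500. The idf weight used is one common smoothed variant, ln((1 + n) / (1 + df)) + 1; library implementations typically also normalize each row, which is omitted here.

```python
import math
from collections import Counter

def build_vocabulary(docs, max_features):
    """Keep only the max_features most frequent tokens across the corpus."""
    counts = Counter(tok for doc in docs for tok in doc.split())
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = sorted(tok for tok, _ in ranked[:max_features])
    return {tok: idx for idx, tok in enumerate(kept)}

def bag_of_words(docs, vocab):
    """One row per document, one column per vocabulary word, holding raw counts."""
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for tok in doc.split():
            if tok in vocab:
                row[vocab[tok]] += 1
        rows.append(row)
    return rows

def tf_idf(counts):
    """Rescale raw counts by a smoothed inverse document frequency."""
    n_docs = len(counts)
    n_terms = len(counts[0]) if counts else 0
    # Document frequency: number of documents in which each term appears.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    return [[c * w for c, w in zip(row, idf)] for row in counts]

if __name__ == "__main__":
    reviews = ["good food good service", "bad food", "good place"]  # toy data
    vocab = build_vocabulary(reviews, max_features=3)
    counts = bag_of_words(reviews, vocab)
    print(vocab)    # {'bad': 0, 'food': 1, 'good': 2}
    print(counts)   # [[0, 1, 2], [1, 1, 0], [0, 0, 1]]
    print(tf_idf(counts))
```

In the toy run, "service" and "place" fall outside the three-word vocabulary and are simply ignored, which is the effect the max_features cap has on rare words.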
The values obtained using the bag-of-words model were then converted into TF-IDF values. We divided our data into 70% training and 30% testing sets and used the naïve Bayes algorithm to train our model.

To train our machine learning model, we used the MultinomialNB class (multinomial naïve Bayes) from the sklearn.naive_bayes module. The fit method of this class trains the algorithm; the training data and the training targets are passed to it. Finally, to predict the sentiment of the documents in our test set, we use the predict method of the classifier. Beyond evaluating this classifier, we also considered transfer learning approaches that leverage pretrained models to enhance the system's predictive capabilities.

Convolutional neural networks for text classification

CNNs are a class of deep, feed-forward artificial neural networks. They are used for image classification, and text analysis methods based on CNNs can obtain important features of text through pooling [4].

The figures illustrate how such a convolution works. It starts by taking a patch of the input features that is the size of the filter. The dot product of this patch with the filter weights is then taken: the values are multiplied elementwise and summed.

A one-dimensional convnet is invariant to translation, meaning that a certain sequence can be recognized at a different position. This can be helpful for identifying certain patterns in text classification. To obtain the full convolution, this process is repeated for each position by sliding the filter over the whole matrix [3]. Instead of image pixels, the inputs for most NLP tasks are sentences or documents represented as a matrix. Each row of the matrix corresponds to one token, typically a word. That is, each row is a vector that represents a word. Typically, these vectors are word embeddings, although they could also be one-hot vectors that index the word into a vocabulary.
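The sliding-filter computation described above can be written out directly. The sketch below is a plain-Python illustration under stated assumptions, not the paper's model: it assumes the filter spans the full width of each word vector (so it only slides along the word axis), uses no padding or bias, and the names `conv1d_text` and `max_over_time` as well as the toy 4×2 embedding matrix are made up for the example.

```python
def conv1d_text(embeddings, kernel):
    """Slide a filter that spans full word vectors down the sentence.

    embeddings: list of n word vectors, each of dimension d
    kernel:     list of h rows, each of dimension d (h is the region size)
    Returns n - h + 1 values, one per window position.
    """
    n, h = len(embeddings), len(kernel)
    feature_map = []
    for start in range(n - h + 1):
        window = embeddings[start:start + h]
        # Multiply window and filter elementwise, then sum all products.
        total = sum(w * k
                    for w_row, k_row in zip(window, kernel)
                    for w, k in zip(w_row, k_row))
        feature_map.append(total)
    return feature_map

def max_over_time(feature_map):
    """Max pooling over all positions: keep the strongest filter response."""
    return max(feature_map)

if __name__ == "__main__":
    sentence = [[1, 0], [0, 1], [1, 1], [0, 0]]   # 4 words, 2-dimensional toy embeddings
    bigram_filter = [[1, 0], [0, 1]]              # region size 2
    fmap = conv1d_text(sentence, bigram_filter)
    print(fmap)                  # [2, 1, 1]
    print(max_over_time(fmap))   # 2
```

Because pooling keeps only the maximum response, the same word pattern produces the same pooled feature wherever it occurs in the sentence, which is the position-independence the text refers to.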
For a 10-word sentence using a 100-dimensional embedding, we use a 10×100 matrix as our input [4]. In NLP, we typically use filters that slide over full rows of the matrix (words). Thus, the width of our filters is usually the same as the width of the input matrix. The height, or region size, may vary, but sliding windows over 2–5 words at a time is typical [3]. Recurrent neural networks can obtain contextual information, but the order of words leads to bias. A text analysis method based on a CNN can obtain important features of text through pooling, but it is difficult for it to obtain contextual information.

IV. EXPERIMENTAL RESULTS

We conducted several experiments to compare the performance of these models, using metrics such as the confusion matrix, F1 measure, and accuracy. To compute these values, we use the classification report, confusion matrix, and accuracy score utilities.

Table I: Performance of Model

Comparing the training and test accuracies, we see that the test accuracy of the hybrid CNN-LSTM model is greater than that of the LSTM model alone.

Analysis of misclassified reviews from the test data

Analyzing the misclassified reviews in the test data reveals a word ambiguity problem for sentiment analysis.
actual output: 1, predicted output: 0.00012990832
wow this wa a great italian zombie movie by two great director s fulci zombie and bruno mattie hell of the living dead lucio started this movie and wa ill so the great bruno took over and it turned out surprisingly better than i expected it to turn out so that you have seen hell of the living dead directed by bruno mattie and if you saw zombie directed by lucio fulci and liked both or one of theme then this is a movie you must watch it ha great zombie make up which equal great looking zombie ha a funny zombie flying head and zombie bird that spit acid at you and turn you into a zombie that only to two people but they are mainly just the great toxic zombie like in bruno hell of the living dead so if you like italian zombie movie or just zombie movie s in general than check this one out it a great italian zombie movie

actual output: 1, predicted output: 0.004727751
One of the more sensible comeds to hit the hindi film screen, a remake of s Malayalam hit a boeing boeing, which in turn, wa a remake of the s hit of the same name, garam masala, elevates the standard of comedy in hindi cinema akshay kumar ha once again, proving his is one of the best superstars of hindi cinema. He can do comedy he ha combined well with the new hunk john abraham; however, the john still remains in shadow and fails to rise to the occasion the new gal is cute and does complete justice to their role a mustwatch comedy leave your brain away and laugh for hr after all laughter is the best medicine ask, priyadarshan and akshay kumar.

Another problem is that people express negative sentiments using positive words, or vice versa.

actual output: 0, predicted output: 0.99998474
Despite the fact that the excellent cast is an unremarkable film, especially from the aviation perspective, it may be somewhat better than the egregious von and Brown, but not by much blue max remains the best of a small market over the last year. Similarly, darling lilli is fun if not taken seriously. It is interesting to speculate what ilm could do with zeppelin and, in a new high-quality ww i film,

actual output: 1, predicted output: 0.00018617511
This is simply the funniest movie i ve seen in a long time the bad-acting bad script bad scenery bad costume bad camera work and bad special effect are so stupid that you find yourself reeling with laughter so it is not gonna win an oscar but if you ve got beer and friend round, then you can t go wrong.

V. CONCLUSION

In this work, we first studied the basic concepts and techniques related to sentiment analysis. We analyzed various datasets and selected one of them for the experiment. The baseline system was implemented with N-gram and TF-IDF features and a naïve Bayes classifier. The proposed model combines a CNN and an LSTM. Thanks to its convolution and pooling layers, the CNN can automatically extract local features and reduce the computational complexity, while the LSTM learns sequence characteristics. We propose a hybrid CNN-LSTM model for sentiment analysis to address the word negation problem. As a result, our model achieves good performance in sentiment classification.

VI. FUTURE WORK

Considering the results of this work, we plan to apply pretrained embeddings (Word2vec and GloVe) to this model in the hope of improving classification performance on the sentiment analysis task. The GloVe algorithm can be viewed as an extension of word2vec that draws on matrix factorization techniques, which were used earlier to develop vector space model representations of words. This unsupervised learning algorithm was developed at Stanford to generate word embeddings by aggregating the global word-word co-occurrence matrix from a corpus. Our proposed model could also be improved by training on larger datasets. In future work, the system could be improved by using other variants of RNN architectures, such as GRUs and Bi-LSTMs.
Declarations

Author Contribution
Sweta Solanki wrote the main manuscript text, and Krutika Trivedi prepared Figures 1–3.

References

[1] Haseena Rahmath, P., Ahmad, T.: Sentiment analysis techniques: a comparative study (2014)
[2] Liu, B.: Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies 5(1), 1–167 (2012)
[3] https://towardsdatascience.com/understanding-how-convolutional-neural-network-cnn-perform-text-classification-with-word-d2ee64b9dd0b
[4] Huang, Q., et al.: Deep sentiment representation based on CNN and LSTM. In: 2017 International Conference on Green Informatics (ICGI). IEEE (2017)
[5] Vohra, S.M., Teraiya, J.B.: A comparative study of sentiment analysis techniques. J. JIKRCE 2(2), 313–317 (2013)
[6] Deho, B.O., Agangiba, A.W., Aryeh, L.F., Ansah, A.J.: Sentiment analysis with word embedding. In: 2018 IEEE 7th International Conference on Adaptive Science & Technology (ICAST), pp. 1–4. IEEE (2018)
[7] Rezaeinia, S.M., Ghodsi, A., Rahmani, R.: Improving the accuracy of pretrained word embeddings for sentiment analysis. arXiv preprint arXiv:1711.08609 (2017)
[8] Joshi, V.C., Vekariya, V.M.: An approach to sentiment analysis on Gujarati tweets. Adv. Comput. Sci. Technol. 10(5), 1487–1493 (2017)
[9] Rakholia, R.M., Saini, J.R.: Classification of Gujarati documents using naïve Bayes classifier. Indian J. Sci. Technol. 5, 1–9 (2017)
[10] Gohil, L., Patel, D.: A sentiment analysis of Gujarati text using Gujarati SentiWordNet.
[11] Cliche, M.: BB_twtr at SemEval-2017 Task 4: Twitter sentiment analysis with CNNs and LSTMs. arXiv preprint arXiv:1704.06125 (2017)
[12] Riyadh, A.Z., Alvi, N., Talukder, K.H.: Exploring human emotion via Twitter. In: 2017 20th International Conference of Computer and Information Technology (ICCIT). IEEE (2017)

Additional Declarations
No competing interests reported.
Cite Share Download PDF Status: Posted Version 1 posted You are reading this latest preprint version Research Square lets you share your work early, gain feedback from the community, and start making changes to your manuscript prior to peer review in a journal. As a division of Research Square Company, we’re committed to making research communication faster, fairer, and more useful. We do this by developing innovative software and high quality services for the global research community. Our growing team is made up of researchers and industry professionals working together to solve the most critical problems facing scientific publishing. Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-4388226","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":300670127,"identity":"046080d5-996d-4a88-8987-bf9a92dcd8bc","order_by":0,"name":"Sweta Solanki","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAA/klEQVRIiWNgGAWjYBACxgYGAwaGChsI7wMQs7ETpeVMGpDJzMA4A6SFmbBFBgyMbYfBWph5GMA0fsDcfnjjZx62w/YGx/sPfrb5tU2eD2jbh485eBzWk1YszcOTnrjhzGFm6dy+24ZtQNskZ27D55ccA8kZEtYJZjeSGaRze24zArWwMfPi09L/xvjnDANme7P7j5l/W/bctiesZUaOmcSHBGfGbTeY2aQZftxOJELLszKLDwfSEvefSTaz7G24ndzGzNiM1y+G/cmbbyT+s7GXbD/4+MaPP7dt57c3H/zwEZ+WBhQ728BkAzaVcCCPyv2DV/EoGAWjYBSMUAAApb5SV7hpYA8AAAAASUVORK5CYII=","orcid":"","institution":"Sardar Patel College of Engineering, 
GTU","correspondingAuthor":true,"prefix":"","firstName":"Sweta","middleName":"","lastName":"Solanki","suffix":""},{"id":300670128,"identity":"a7434735-6549-400d-9600-2c85da1b7d81","order_by":1,"name":"Krutika Trivedi","email":"","orcid":"","institution":"Sardar Patel College of Engineering, GTU","correspondingAuthor":false,"prefix":"","firstName":"Krutika","middleName":"","lastName":"Trivedi","suffix":""}],"badges":[],"createdAt":"2024-05-08 09:42:26","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-4388226/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-4388226/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":56779417,"identity":"db6e9e35-3a32-4000-8728-702612cbde31","added_by":"auto","created_at":"2024-05-20 11:10:10","extension":"jpg","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":44165,"visible":true,"origin":"","legend":"\u003cp\u003eAbstract baseline system approach\u003c/p\u003e","description":"","filename":"fig1.jpg","url":"https://assets-eu.researchsquare.com/files/rs-4388226/v1/9dc81456db8bf8339dd60f81.jpg"},{"id":56779418,"identity":"a2ae8202-3646-46ce-8dec-7552acaf3ad7","added_by":"auto","created_at":"2024-05-20 11:10:10","extension":"jpg","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":55346,"visible":true,"origin":"","legend":"\u003cp\u003eCNN for text classification\u003c/p\u003e","description":"","filename":"fig2.jpg","url":"https://assets-eu.researchsquare.com/files/rs-4388226/v1/68dbf71d6a6c43989a8a26f2.jpg"},{"id":56779825,"identity":"9e7f408a-70d8-4514-8f36-99fb3c6808d6","added_by":"auto","created_at":"2024-05-20 11:18:10","extension":"jpg","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":45059,"visible":true,"origin":"","legend":"\u003cp\u003eMethodology of the proposed hybrid CNN-LSTM 
model\u003c/p\u003e","description":"","filename":"fig3.jpg","url":"https://assets-eu.researchsquare.com/files/rs-4388226/v1/d9a9d3c23c063b77c0acc38d.jpg"},{"id":56779420,"identity":"e72f47d1-dfcb-4957-8c60-5d99671458bb","added_by":"auto","created_at":"2024-05-20 11:10:10","extension":"jpg","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":40323,"visible":true,"origin":"","legend":"\u003cp\u003eThe hybrid model Adam optimizer\u003c/p\u003e","description":"","filename":"fig4.jpg","url":"https://assets-eu.researchsquare.com/files/rs-4388226/v1/970aa0d7d11c81a6085269b7.jpg"},{"id":56779421,"identity":"73e8fa19-5d9f-415f-8b87-e008854c867c","added_by":"auto","created_at":"2024-05-20 11:10:10","extension":"jpg","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":40431,"visible":true,"origin":"","legend":"\u003cp\u003eHybrid model RMSprop optimizer\u003c/p\u003e","description":"","filename":"fig5.jpg","url":"https://assets-eu.researchsquare.com/files/rs-4388226/v1/e3956ff7d9a637b26507fe78.jpg"},{"id":57253607,"identity":"a7232fda-081e-4587-8b27-b54f36f8e17e","added_by":"auto","created_at":"2024-05-28 07:56:10","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":571627,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-4388226/v1/4830d5bb-b499-4e53-84fc-ef9d02c75183.pdf"}],"financialInterests":"No competing interests reported.","formattedTitle":"\u003cp\u003eSentiment Analysis of Movie Review Using Deep Learning Techniques\u003c/p\u003e","fulltext":[{"header":"I. INTRODUCTION","content":"\u003cp\u003eNLP is the ability of a computer program to understand human language as it is spoken. Sentiment analysis is one of the most popular applications of NLP. 
The term sentiment analysis first appeared in (Nasukawa and Yi, 2003), and the term opinion mining first appeared in (Dave, Lawrence, and Pennock, 2003). However, research on sentiment and opinions appeared earlier (Das and Chen, 2001; Morinaga et al., 2002; Pang, Lee and Vaidyanathan, 2002; Tong, 2001; Turney, 2002; Wiebe, 2000) [\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e]. A sentiment analysis system for text analysis combines natural language processing and machine learning techniques to assign weighted sentiment scores to entities, topics, themes, and categories within a sentence or phrase [\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e]. There are also many names and slightly different tasks, e.g., sentiment analysis, opinion mining, opinion extraction, sentiment mining, subjectivity analysis, effect analysis, emotion analysis, and review mining.[\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eIn this study, we propose a hybrid model using LSTM and a CNN model named the Hybrid CNN-LSTM Model to overcome the sentiment analysis problem. First, we use an embedding layer to train initial word embeddings. Word embedding translates text strings into a vector of numeric values. Afterword embedding is performed, in which the proposed model combines sets of features that are extracted by convolution and max pooling layers with long-term dependencies. The proposed model also uses dropout technology to improve the accuracy. Our results show that the proposed Hybrid CNN-LSTM Model outperforms traditional deep learning and machine learning techniques in terms of precision, recall, F-measure, and accuracy. We propose a system that performs sentiment analysis on the IMDB movie review dataset two categories: positive and negative.\u003c/p\u003e"},{"header":"II. 
METHODOLOGY","content":"\u003cp\u003e \u003cb\u003eStep 1: Define the Objective\u003c/b\u003e \u003c/p\u003e \u003cp\u003eIdentify exactly what you want to achieve with sentiment analysis. For movie reviews, this often involves classifying reviews as positive, negative, or neutral. You might also be interested in more fine-grained analysis, such as different levels of positivity or negativity.\u003c/p\u003e \u003cp\u003e \u003cb\u003eStep 2: Data collection and preparation\u003c/b\u003e \u003c/p\u003e \u003cp\u003e \u003cul\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eDataset\u003c/b\u003e: Obtain a dataset of movie reviews. A popular choice is the IMDb dataset, which is widely used for sentiment analysis and is available on platforms such as Kaggle.\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003ePreprocessing\u003c/b\u003e:\u003c/p\u003e \u003cp\u003e \u003cul\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eCleaning Text\u003c/b\u003e: Remove HTML tags, special characters, and numbers. 
All text was converted to lowercase to maintain consistency.\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eTokenization\u003c/b\u003e: Convert sentences into words or tokens to simplify the analysis.\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eStopword removal\u003c/b\u003e: Eliminate common words that may not contribute to sentiment analysis (e.g., \"the\", \"is\", \"at\").\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eLemmatization/Stemming\u003c/b\u003e: Reduce words to their base or root form.\u003c/p\u003e \u003c/li\u003e \u003c/ul\u003e \u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eVectorization\u003c/b\u003e: Since deep learning models work with numerical data, text is converted to vectors using techniques such as the following:\u003c/p\u003e \u003cp\u003e \u003cul\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eWord Embeddings\u003c/b\u003e: Pretrained models such as Word2Vec, GloVe, or fastText provide a dense representation of words and capture semantic meanings.\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eEmbedding Layer\u003c/b\u003e: An embedding layer is used in the neural network to learn an embedding for all words in the dataset.\u003c/p\u003e \u003c/li\u003e \u003c/ul\u003e \u003c/p\u003e \u003c/li\u003e \u003c/ul\u003e \u003c/p\u003e \u003cp\u003e \u003cb\u003eStep 3: Choose a Model Architecture\u003c/b\u003e \u003c/p\u003e \u003cp\u003e \u003cul\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eRecurrent Neural Networks (RNNs)\u003c/b\u003e: Good for sequence data such as text.\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eLong short-term memory (LSTM)\u003c/b\u003e: RNN variants are better at capturing long-range dependencies and avoiding the vanishing gradient problem.\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eConvolutional Neural Networks 
(CNNs)\u003c/b\u003e: Although traditionally used for image processing, CNNs can be effective for sentence classification tasks.\u003c/p\u003e \u003c/li\u003e \u003c/ul\u003e \u003c/p\u003e \u003cp\u003e \u003cb\u003eStep 4: Model training\u003c/b\u003e \u003c/p\u003e \u003cp\u003e \u003cul\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eSplit the Data\u003c/b\u003e: Typically, the data are split into training (80%) and validation (20%) sets.\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eModel Configuration\u003c/b\u003e: Set up the neural network architecture. When using LSTMs or GRUs, the number of layers and units per layer are configured.\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eCompile the Model\u003c/b\u003e: Choose an optimizer (such as Adam), a loss function (typically binary cross entropy for binary classification or categorical cross entropy for multiclass classification), and evaluation metrics (such as accuracy).\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eTraining\u003c/b\u003e: The model was trained on the training data using batch processing and validated using the validation set. 
Techniques such as early stopping and model checkpointing are used to avoid overfitting.\u003c/p\u003e \u003c/li\u003e \u003c/ul\u003e \u003c/p\u003e \u003cp\u003e \u003cb\u003eStep 5: Evaluation\u003c/b\u003e \u003c/p\u003e \u003cp\u003e \u003cul\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eTesting\u003c/b\u003e: After training, the model was evaluated on a separate test dataset to assess its real-world performance.\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eMetrics\u003c/b\u003e: Use accuracy, precision, recall, F1-score, and perhaps ROC-AUC to evaluate performance.\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eConfusion\u003c/b\u003e matrix: This matrix helps in understanding true positives, false positives, true negatives, and false negatives.\u003c/p\u003e \u003c/li\u003e \u003c/ul\u003e \u003c/p\u003e"},{"header":"III. PROPOSED WORK","content":"\u003cp\u003e \u003col\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eBaseline System\u003c/b\u003e \u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003c/ol\u003e \u003c/p\u003e \u003cp\u003eWe propose a baseline model for this system's first load of data. Next, we perform the preprocessing task. After that, the text is converted to vectors using a bag of and the vector is transformed into a TF-IDF vector. In this study, different classifier algorithms were used to classify the reviews.\u003c/p\u003e \u003cp\u003eIn Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e, we show how to construct a baseline model for sentiment analysis following a few simple steps:\u003c/p\u003e \u003cp\u003e \u003cul\u003e \u003cli\u003e \u003cp\u003e\u0026bull; First is the preprocessing step. 
The only thing we need to do is tokenize the text, remove punctuation and special characters, and convert everything to lowercase.\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003e\u0026bull; The vectorization step, which produces numerical features for the classifier, follows. For this purpose, we used a bag of words and TF-IDF, a simple vectorization technique that consists of computing word frequencies and downscaling them for words that are too common.\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003e\u0026bull; Finally, the resulting features are used to train a sentiment analysis classifier based on multinomial na\u0026iuml;ve Bayes on the Restaurant Reviews dataset.\u003c/p\u003e \u003c/li\u003e \u003c/ul\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eMachine learning algorithms cannot work with raw text; they operate on numbers. Therefore, we need to convert our text into numbers. In this case, we use the bag-of-words model to do so.\u003c/p\u003e \u003cp\u003eThe script uses the CountVectorizer class from the sklearn.feature_extraction.text module.\u003c/p\u003e \u003cp\u003eSome important parameters must be passed to the constructor of the class. The first is the max_features parameter, which is set to 1500. This is because when we convert words to numbers using the bag-of-words approach, all the unique words in all the documents are converted into features. Together, the documents can contain tens of thousands of unique words. However, words that have a very low frequency of occurrence are typically not good features for classifying documents.\u003c/p\u003e \u003cp\u003eTherefore, we set the max_features parameter to 1500, which means that we use the 1500 most common words as features for training our classifier. The fit_transform method of the CountVectorizer class converts text documents into corresponding numeric features. 
The values obtained using the bag-of-words model were then converted into TF-IDF values. We divided our data into training (70%) and testing (30%) sets and used the na\u0026iuml;ve Bayes algorithm to train our model.\u003c/p\u003e \u003cp\u003eTo train our machine learning model, we used the multinomial na\u0026iuml;ve Bayes classifier (the MultinomialNB class) from the sklearn.naive_bayes module. The fit method of this class is used to train the algorithm; the training data and training targets are passed to this method. Finally, to predict the sentiment of the documents in our test set, we use the predict method of the na\u0026iuml;ve Bayes classifier class. To improve on this baseline, we also considered transfer learning approaches that leverage pretrained models to enhance the system\u0026rsquo;s predictive capabilities.\u003c/p\u003e \u003cp\u003e \u003cb\u003eConvolutional neural networks for text classification\u003c/b\u003e \u003c/p\u003e \u003cp\u003eCNNs are a class of deep, feed-forward artificial neural networks. Originally developed for image classification, CNN-based text analysis methods can obtain important features of text through pooling [\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e].\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eThe figures below show how such a convolution works. It starts by taking a patch of input features with the size of the filter. The dot product of this patch with the filter weights is then taken; that is, the values are multiplied elementwise and then summed.\u003c/p\u003e \u003cp\u003eA one-dimensional convnet is invariant to translations, which means that a certain sequence can be recognized at a different position. This approach can be helpful for identifying certain patterns in text classification. 
To obtain the full convolution, this process is performed for each element by sliding the filter over the whole matrix [\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eInstead of image pixels, the inputs for most NLP tasks are sentences or documents represented as a matrix. Each row of the matrix corresponds to one token, typically a word; that is, each row is a vector that represents a word. Typically, these vectors are word embeddings (dense, low-dimensional representations), although they could also be one-hot vectors that index the word into a vocabulary. For a 10-word sentence using a 100-dimensional embedding, we have a 10\u0026times;100 matrix as our input [\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e].\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eIn NLP, we typically use filters that slide over the full rows of the matrix (words). Thus, the width of our filters is usually the same as the width of the input matrix. The height, or region size, may vary, but sliding windows over 2\u0026ndash;5 words at a time is typical [\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e]. Recurrent neural networks can obtain contextual information, but the order of words leads to bias. A text analysis method based on a CNN can obtain important features of text through pooling, but it is difficult for it to obtain contextual information.\u003c/p\u003e"},{"header":"IV. EXPERIMENTAL RESULTS","content":"\u003cp\u003eWe conducted several experiments to compare the performance of these models, using metrics such as the confusion matrix, F1 measure, and accuracy. 
To find these values, we use the classification report, confusion matrix, and accuracy score utilities.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eTable I: Performance of Model\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eComparing the training and test accuracies, we see that the testing accuracy of the hybrid CNN-LSTM model is greater than that of the LSTM model.\u003c/p\u003e \u003cp\u003e \u003cb\u003eAnalysis of misclassified reviews from the test data\u003c/b\u003e \u003c/p\u003e \u003cp\u003eAnalyzing the misclassified reviews from the test data reveals a word ambiguity problem for sentiment analysis.\u003c/p\u003e \u003cp\u003eactual output: 1\u003c/p\u003e \u003cp\u003epredicted output: 0.00012990832\u003c/p\u003e \u003cp\u003ewow this wa a great italian zombie movie by two great director s fulci zombie and bruno mattie hell of the living dead lucio started this movie and wa ill so the great bruno took over and it turned out surprisingly better than i expected it to turn out so that you have seen hell of the living dead directed by bruno mattie and if you saw zombie directed by lucio fulci and liked both or one of theme then this is a movie you must watch it ha great zombie make up which equal great looking zombie ha a funny zombie flying head and zombie bird that spit acid at you and turn you into a zombie that only to two people but they are mainly just the great toxic zombie like in bruno hell of the living dead so if you like italian zombie movie or just zombie movie s in general than check this one out it a great italian zombie movie\u003c/p\u003e \u003cp\u003eactual output: 1\u003c/p\u003e \u003cp\u003epredicted output: 0.004727751\u003c/p\u003e \u003cp\u003eOne of the more sensible comeds to hit the hindi film screen, a remake of s Malayalam hit a boeing boeing, which in turn, wa a remake of the s hit of the same name, garam masala, elevates the standard of comedy in hindi cinema 
akshay kumar ha once again, proving his is one of the best superstars of hindi cinema. He can do comedy he ha combined well with the new hunk john abraham; however, the john still remains in shadow and fails to rise to the occasion the new gal is cute and does complete justice to their role a mustwatch comedy leave your brain away and laugh for hr after all laughter is the best medicine ask, priyadarshan and akshay kumar.\u003c/p\u003e \u003cp\u003eHere, another problem is that people express their negative sentiments using positive words or vice versa.\u003c/p\u003e \u003cp\u003eactual output: 0\u003c/p\u003e \u003cp\u003epredicted output: 0.99998474\u003c/p\u003e \u003cp\u003eDespite the fact that the excellent cast is an unremarkable film, especially from the aviation perspective, it may be somewhat better than the egregious von and Brown, but not by much blue max remains the best of a small market over the last year. Similarly, darling lilli is fun if not taken seriously. It is interesting to speculate what ilm could do with zeppelin and, in a new high-quality ww i film,\u003c/p\u003e \u003cp\u003eactual output: 1\u003c/p\u003e \u003cp\u003epredicted output: 0.00018617511\u003c/p\u003e \u003cp\u003eThis is simply the funniest movie i ve seen in a long time the bad-acting bad script bad scenery bad costume bad camera work and bad special effect are so stupid that you find yourself reeling with laughter so it is not gonna win an oscar but if you ve got beer and friend round, then you can t go wrong.\u003c/p\u003e"},{"header":"V. CONCLUSION","content":"\u003cp\u003eIn this work, we first studied basic concepts and techniques related to sentiment analysis. We analyzed various datasets and selected one of the datasets for the experiment. The baseline system was implemented with N-gram and TF-IDF features with a na\u0026iuml;ve Bayes classifier. The proposed model combines CNN and LSTM. Due to the convolution layer and pooling layer, the CNN can automatically extract local features and reduce the computational complexity. 
The LSTM, in turn, learns sequence characteristics. We propose a hybrid CNN-LSTM model for sentiment analysis to address the word negation problem. Our model achieves good performance in sentiment classification.\u003c/p\u003e"},{"header":"VII. Future Work","content":"\u003cp\u003eConsidering the results of this work, we plan to apply pretrained embeddings (word2vec and GloVe) to this model for the task of sentiment analysis, with the hope of improving classification performance. The GloVe algorithm extends the ideas behind word2vec by combining them with classical vector space model representations of words, which were developed using matrix factorization techniques. This unsupervised learning algorithm was developed at Stanford to generate word embeddings by aggregating global word-word co-occurrence statistics from a corpus.\u003c/p\u003e \u003cp\u003eOur proposed model could be further improved by training on larger datasets. In future work, systems could also be improved by using other variants of RNN architectures, such as GRUs and Bi-LSTM.\u003c/p\u003e"},{"header":"Declarations","content":"\u003ch2\u003eAuthor Contribution\u003c/h2\u003e\u003cp\u003eSweta Solanki wrote the main manuscript text and Krutika Trivedi prepared Figures 1\u0026ndash;3.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eHaseena Rahmath, P., Ahmad, T.: Sentiment Analysis Techniques - A Comparative Study (2014)\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLiu, B.: Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies \u003cb\u003e5\u003c/b\u003e(1), 1\u0026ndash;167 
(2012)\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ehttps://\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003etowardsdatascience.com/understanding-how-convolutional-neural-network-cnn-perform-text-classification-with-word-d2ee64b9dd0b\u003c/span\u003e\u003cspan address=\"http://towardsdatascience.com/understanding-how-convolutional-neural-network-cnn-perform-text-classification-with-word-d2ee64b9dd0b\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHuang, Q., et al.: Deep sentiment representation based on CNN and LSTM. 2017 International Conference on Green Informatics (ICGI). IEEE (2017)\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eVohra, S.M., Teraiya, J.B.: A comparative study of sentiment analysis techniques. J. JIKRCE. \u003cb\u003e2\u003c/b\u003e(2), 313\u0026ndash;317 (2013)\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDeho, B.O., Agangiba, A.W., Aryeh, L.F., Ansah, A.J.: Sentiment Analysis with Word Embedding. In 2018 IEEE 7th International Conference on Adaptive Science \u0026amp; Technology (ICAST) (pp. 1\u0026ndash;4). IEEE (2018)\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRezaeinia, S.M., Ghodsi, A., Rahmani, R.: Improving the accuracy of pretrained word embeddings for sentiment analysis. arXiv preprint arXiv:1711.08609 (2017)\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eJoshi, V.C., Vekariya, V.M.: An approach to sentiment analysis on Gujarati tweets. Adv. Comput. Sci. Technol. \u003cb\u003e10\u003c/b\u003e(5), 1487\u0026ndash;1493 (2017)\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRakholia, R.M., Saini, J.R.: Classification of Gujarati Documents using Na\u0026iuml;ve Bayes Classifier. Indian J. Sci. Technol. 
\u003cb\u003e5\u003c/b\u003e, 1\u0026ndash;9 (2017)\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGohil, L., Patel, D.: A Sentiment Analysis of Gujarati Text using Gujarati Senti word Net.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCliche, M.: Bb_twtr at semeval-2017 task 4: Twitter sentiment analysis with cnns and lstms. arXiv preprint arXiv:1704.06125 (2017)\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRiyadh, A.Z., Alvi, N., Talukder, K.H.: Exploring human emotion via Twitter. 2017 20th International Conference of Computer and Information Technology (ICCIT). IEEE (2017)\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"Sentiment analysis, NLP, RNN, LSTM model, CNN","lastPublishedDoi":"10.21203/rs.3.rs-4388226/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-4388226/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eText mining is a process of identifying meaningful information from text-based data. 
A large amount of data in the form of reviews and tweets is available on the web. It is difficult to manually read these reviews and assign sentiments to them; therefore, an automated system can be created that analyses the text and extracts user precepts. In this system, sentiment analysis is performed on collected restaurant reviews. The implemented system performs sentiment analysis on the available restaurant review data. This result can provide an opinion that is positive or negative.\u003c/p\u003e \u003cp\u003eIn the baseline system, we demonstrate two different feature extraction techniques. First, the bag-of-words model is used for feature extraction. The second technique used was the term frequency-inverse document frequency scores. We examined the effectiveness of several N-gram ranges using the na\u0026iuml;ve Bayes classifier. We analyze the results of the baseline system and the scope of improvement in the results using the word embedding technique. The CNN model efficiently extracts higher level features using convolutional layers and max pooling layers. The LSTM model is capable of capturing long-term dependencies between word sequences. We propose a hybrid model using LSTM and a CNN model, named the Hybrid CNN-LSTM Model, to overcome the sentiment analysis problem. 
We obtained improved accuracy for sentiment analysis on the IMDB movie review dataset.\u003c/p\u003e","manuscriptTitle":"Sentiment Analysis of Movie Review Using Deep Learning Techniques","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2024-05-20 11:10:05","doi":"10.21203/rs.3.rs-4388226/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"3d3f22d4-b889-4961-9405-20fb3a90916e","owner":[],"postedDate":"May 20th, 2024","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[],"tags":[],"updatedAt":"2024-05-28T07:48:03+00:00","versionOfRecord":[],"versionCreatedAt":"2024-05-20 11:10:05","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-4388226","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-4388226","identity":"rs-4388226","version":["v1"]},"buildId":"qtupq5eGEP_6zYnWcrvyt","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
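The baseline pipeline described in Section III (bag-of-words counts capped at 1500 features, TF-IDF re-weighting, a 70/30 split, and a multinomial naïve Bayes classifier) might look roughly like the sketch below. This is an illustration, not the authors' script: the ten-review toy corpus, its labels, and the random seed are hypothetical stand-ins for the restaurant review data, and scikit-learn is assumed to be installed.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus standing in for the restaurant reviews (1 = positive).
reviews = [
    "the food was great and the service was friendly",
    "loved the place, tasty food",
    "amazing pasta and great staff",
    "best dinner we had in years",
    "great value and delicious dessert",
    "terrible service and cold food",
    "the waiter was rude and the soup was bland",
    "worst meal ever, never again",
    "overpriced and tasteless",
    "bad experience, dirty tables",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# Bag of words: keep at most the 1500 most frequent terms (lowercasing is the default).
vectorizer = CountVectorizer(max_features=1500)
counts = vectorizer.fit_transform(reviews)

# Re-weight the raw counts into TF-IDF values.
tfidf = TfidfTransformer().fit_transform(counts)

# 70/30 split. Vectorizing before splitting mirrors the order given in the text;
# fitting the vectorizer on the training split only would avoid information leakage.
X_train, X_test, y_train, y_test = train_test_split(
    tfidf, labels, test_size=0.3, random_state=0, stratify=labels
)

clf = MultinomialNB()
clf.fit(X_train, y_train)      # train on the training features and targets
y_pred = clf.predict(X_test)   # predict the sentiment of the held-out reviews

print("accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred, labels=[0, 1]))
```

On a corpus this small the reported accuracy is not meaningful; the point is only the order of the calls and the shapes of the intermediate matrices.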

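The convolution over a sentence matrix described in Section III can be illustrated with a small pure-Python sketch. It assumes a single filter whose width equals the embedding dimension and whose height is the region size, followed by max pooling over positions; the function names, the 3-dimensional toy embeddings, and the filter weights are made up for illustration, and a real model would learn many such filters.

```python
def conv_over_words(sentence_matrix, filt):
    """Slide a (region_size x embedding_dim) filter down the rows (words).

    At each position the window and the filter are multiplied elementwise
    and the products are summed, giving one value of the feature map.
    """
    region_size = len(filt)
    feature_map = []
    for start in range(len(sentence_matrix) - region_size + 1):
        window = sentence_matrix[start:start + region_size]
        total = 0.0
        for word_vec, filt_row in zip(window, filt):
            total += sum(x * w for x, w in zip(word_vec, filt_row))
        feature_map.append(total)
    return feature_map


def max_pool(feature_map):
    """Keep only the strongest response, wherever it occurred."""
    return max(feature_map)


# Hypothetical 4-word sentence with 3-dimensional embeddings (one row per word).
sentence = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]
# Same words, but the two-word pattern the filter responds to starts one row later.
shifted = [
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
]
# Filter with region size 2 spanning the full embedding width.
filt = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
]

feature_map = conv_over_words(sentence, filt)
shifted_map = conv_over_words(shifted, filt)
print(feature_map, max_pool(feature_map))   # [2.0, 0.0, 1.0] 2.0
print(shifted_map, max_pool(shifted_map))   # [0.0, 2.0, 1.0] 2.0
```

The pooled value is the same for both inputs even though the matching word pair sits at a different position, which is the translation property the text refers to.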
License: CC-BY-4.0