Advancing Sentiment Analysis on Product Reviews: A Comparative Evaluation of Classical, Deep Learning, and Transformer Models

Kshiti Deshpande, Jyoti Yadav

Research Article, Research Square. This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-6885027/v1
This work is licensed under a CC BY 4.0 License.
Status: Posted. Version 1.

Abstract

The proliferation of e-commerce platforms has led to an explosion of user-generated content, particularly in the form of product reviews. These reviews offer valuable insights into consumer sentiment, influencing business strategies and customer satisfaction initiatives. This study undertakes a comprehensive evaluation of sentiment analysis models applied to product reviews, spanning traditional machine learning algorithms, deep learning architectures, and state-of-the-art transformer-based models.
The models evaluated include Logistic Regression, Support Vector Machines, Decision Trees, Random Forests, Naïve Bayes, Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM), and advanced transformers such as ELMo, DistilBERT, ELECTRA, T5, BERT, and RoBERTa. Using a standardized Amazon product review dataset, models were assessed on accuracy, precision, recall, and F1-score. Results indicate that transformer-based models significantly outperform their predecessors, with RoBERTa achieving the highest accuracy of 96.36%. These findings underscore the growing importance of transformer architectures in sentiment classification, offering promising directions for real-time applications in e-commerce, social analytics, and recommendation systems.

Keywords: Sentiment Analysis, Natural Language Processing, Transformer Models, Product Reviews, Machine Learning, Deep Learning, Text Classification, RoBERTa, Opinion Mining

1 Introduction

The digital transformation of commerce, communication, and consumer behavior has dramatically reshaped the landscape of data generation and analysis. In this dynamic environment, user-generated content, particularly in the form of online product reviews, has emerged as a rich source of sentiment data. Customers routinely express their experiences, preferences, and grievances through these textual reviews, making them valuable assets for business intelligence. Leveraging this content to understand customer sentiment has become critical for organizations aiming to enhance customer satisfaction, tailor marketing efforts, and stay competitive in a crowded marketplace.

Sentiment analysis, also referred to as opinion mining, is a subfield of Natural Language Processing (NLP) that aims to computationally identify and categorize opinions expressed in text to determine whether the writer's attitude is positive, negative, or neutral.
This technique plays a pivotal role in various domains, including e-commerce, healthcare, finance, entertainment, and politics. For instance, by analyzing sentiments from reviews and social media posts, businesses can make data-driven decisions about product launches, service improvements, and brand positioning.

Despite its practical significance, sentiment analysis presents numerous challenges. Human language is inherently ambiguous and context-dependent, often laden with sarcasm, idioms, and evolving slang. Words can carry different sentiments depending on the context in which they are used. For example, the phrase “This phone is sick” could express admiration or dissatisfaction, depending on the speaker’s intent and the cultural context. Accurately capturing such nuances is a complex task that necessitates sophisticated computational models.

Early approaches to sentiment analysis relied heavily on lexicon-based methods and rule-based systems. While straightforward to implement, these approaches were limited in their ability to generalize across domains and handle context-sensitive language. As a result, researchers turned to traditional machine learning (ML) methods, which introduced statistical learning to text classification. Models such as Support Vector Machines (SVM), Logistic Regression (LR), Naïve Bayes (NB), Decision Trees (DT), and Random Forests (RF) offered significant improvements over rule-based systems by learning patterns from annotated datasets.

These classical ML models typically rely on feature engineering, where raw text is converted into numerical representations using techniques such as Bag-of-Words (BoW) and Term Frequency–Inverse Document Frequency (TF-IDF). Although effective in certain settings, these representations are sparse and fail to capture the syntactic structure or semantics of language. As a result, they struggle with complex sentence constructs, negations, and idiomatic expressions.
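The limitation described above can be illustrated with a minimal sketch. It is not the study's implementation: the two example sentences and the helper `bag_of_words` are hypothetical, and the code uses only the Python standard library. It shows that a count-based BoW vector discards word order, so two reviews with opposite meanings can map to the same vector.

```python
from collections import Counter


def bag_of_words(text, vocabulary):
    """Return word counts for `text`, ordered by `vocabulary` (illustrative helper)."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


# Two hypothetical reviews with opposite sentiment but identical words.
review_a = "the battery is good not bad"
review_b = "the battery is bad not good"

# Shared vocabulary built from both reviews.
vocab = sorted(set(review_a.split()) | set(review_b.split()))

vec_a = bag_of_words(review_a, vocab)
vec_b = bag_of_words(review_b, vocab)

print(vocab)
print(vec_a)
print(vec_b)
print("identical vectors:", vec_a == vec_b)
```

Because both vectors are equal, any classifier trained on these features alone cannot separate the two reviews, which is one reason negation is hard for sparse representations.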
The advent of deep learning (DL) brought a paradigm shift in the field of NLP. Models like Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks began to dominate sentiment analysis tasks due to their ability to learn features automatically from raw data. CNNs are particularly useful for capturing local features and n-gram-like patterns in text, while LSTM networks excel in modeling sequential dependencies, making them suitable for language modeling tasks.

However, these architectures come with their own limitations. While LSTMs are effective at learning temporal dependencies, they process data sequentially, which restricts parallelization and increases computational time. Additionally, both CNNs and LSTMs have difficulty modeling long-range dependencies and understanding bidirectional context without complex architectural enhancements, such as attention mechanisms.

To overcome these challenges, researchers developed attention-based models, culminating in the introduction of the Transformer architecture by Vaswani et al. in 2017. Unlike recurrent models, transformers process entire sequences in parallel and utilize self-attention mechanisms to learn the relationships between words, regardless of their position in the sentence. This innovation enabled models to better capture long-range dependencies and complex linguistic relationships.

Building on the transformer architecture, a new generation of pre-trained language models was introduced, starting with BERT (Bidirectional Encoder Representations from Transformers). BERT's ability to consider context from both directions, left and right, marked a significant advancement over traditional word embedding techniques such as Word2Vec and GloVe.
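The self-attention mechanism mentioned above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the architecture used in the study: queries, keys, and values are all set to the raw token vectors (real transformers apply learned projection matrices and multiple heads), and the three 2-dimensional "token" vectors are made up for the example.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x (simplifying assumption)."""
    dim = len(x[0])
    weights = []
    for query in x:
        # Every token scores every other token, regardless of distance.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in x
        ]
        weights.append(softmax(scores))
    # Each output vector is a weighted average of all token vectors.
    output = [
        [sum(w * value[j] for w, value in zip(row, x)) for j in range(dim)]
        for row in weights
    ]
    return weights, output


# Hypothetical 2-dimensional embeddings for a three-token sentence.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, output = self_attention(tokens)

for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1 and is computed independently of the other rows, which is why the whole sequence can be processed in parallel rather than token by token.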
BERT was soon followed by several related models, including RoBERTa (Robustly Optimized BERT Approach), DistilBERT (a lightweight version of BERT), ELECTRA (which focuses on replaced token detection), and T5 (Text-To-Text Transfer Transformer), each bringing unique architectural or training innovations. ELMo (Embeddings from Language Models), an earlier contextual embedding model built on bidirectional LSTMs rather than self-attention, is grouped with these models in this study for comparison. These models are pre-trained on massive corpora using unsupervised learning objectives and subsequently fine-tuned on downstream tasks like sentiment classification. This pretraining-finetuning paradigm allows them to generalize effectively across tasks and domains, even with limited labeled data. The success of these models has led to their widespread adoption in sentiment analysis, outperforming both traditional ML and deep learning methods on benchmark datasets.

However, the superior performance of transformer models comes at the cost of increased computational requirements. Training and deploying such models demand substantial memory and processing power, which can hinder real-time applications in resource-constrained environments like mobile devices or edge computing platforms. As a result, there is a growing interest in developing optimized or distilled transformer models that balance accuracy and efficiency.

In this study, we aim to address several key questions in sentiment analysis research. First, how do traditional ML methods compare with modern deep learning and transformer-based models in classifying product review sentiments? Second, which model offers the best trade-off between performance and computational feasibility? Third, what are the practical implications of using state-of-the-art NLP models for sentiment analysis in real-world settings?

To answer these questions, we conduct a comprehensive comparative evaluation of a wide spectrum of sentiment analysis models on a standardized dataset of Amazon product reviews.
Our analysis covers classical models like Logistic Regression, SVM, Naïve Bayes, Decision Trees, and Random Forests; deep learning models such as CNNs and LSTMs; and advanced transformers including ELMo, DistilBERT, ELECTRA, T5, BERT, and RoBERTa. All models are trained or fine-tuned under consistent experimental conditions to ensure fair comparison. Evaluation metrics include accuracy, precision, recall, and F1-score, providing a holistic view of each model’s capabilities.

The results reveal a clear performance advantage of transformer-based models over their traditional and deep learning counterparts. Among them, RoBERTa consistently achieves the highest scores across all metrics, highlighting its robustness and practical utility. These findings underscore the need to adopt advanced NLP models in real-world applications such as customer feedback analysis, brand monitoring, product recommendation systems, and policy-making.

In summary, this study not only benchmarks sentiment analysis techniques across multiple paradigms but also provides actionable insights for selecting the most effective model based on accuracy, computational efficiency, and deployment requirements. The transition from classical approaches to deep learning and ultimately to transformers represents an evolution that mirrors the increasing complexity and scale of textual data in modern digital ecosystems. By systematically analyzing this progression, we aim to contribute to the development of more intelligent, adaptable, and efficient sentiment analysis systems.

2 Literature review

Sentiment analysis has evolved significantly over the past two decades, transitioning from rule-based and classical machine learning approaches to sophisticated deep learning and transformer-based architectures.
This section provides a comprehensive review of prior research contributions, methodological advancements, and observed limitations, with a focus on sentiment analysis of product reviews, especially within e-commerce domains.

2.1 Traditional Machine Learning Approaches

One of the earliest studies by Pang et al. (2002) pioneered sentiment classification using machine learning, employing algorithms like Naïve Bayes, Maximum Entropy, and SVM. Their work revealed that SVM yielded better accuracy for sentiment polarity tasks on movie reviews, laying the groundwork for future text classification research. Fang and Zhan (2015) further investigated sentiment polarity classification using SVM, Naïve Bayes, and Random Forests. Their work on Amazon product reviews highlighted the effectiveness of ensemble methods, although they noted issues such as dataset noise due to spam and fake reviews. Anto et al. (2016) explored sentiment analysis using SVM, part-of-speech tagging, and sentiment dictionaries applied to Twitter data, achieving moderate accuracy. However, they acknowledged the limited generalization ability of models trained on microblogging content. Sindhu et al. (2017) conducted a comparative study between SVM and Naïve Bayes on shopping reviews, emphasizing the need for robust models capable of handling sarcasm and domain-specific language. Shivaprasad and Shetty (2017) reviewed ML techniques such as Max Entropy, NB, and SVM, concluding that traditional methods lacked the capability to capture deeper contextual semantics. Singh et al. (2017) demonstrated improved accuracy by optimizing ML classifiers like Decision Trees and SVM using hyperparameter tuning. Despite gains, these approaches remained reliant on hand-crafted features like TF-IDF and BoW. Bose et al. (2019) applied Naïve Bayes to political sentiment analysis on Twitter, revealing domain-specific variations in language sentiment interpretation. Pratama et al.
(2019) implemented a similar Naïve Bayes model for Indonesian tweets, echoing limitations in language portability. Georgios and Thelwall (2012) proposed an unsupervised sentiment analysis method across platforms like Twitter and MySpace. Though useful for exploratory analysis, the method suffered from poor sentiment polarity resolution.

2.2 Deep Learning Models

The introduction of deep learning in NLP brought substantial improvements. Kalchbrenner et al. (2014) proposed a CNN for sentence modeling, showing it could automatically extract useful local features for sentiment classification. Zhou et al. (2015) proposed a C-LSTM architecture, combining convolutional and recurrent layers to capture both spatial and sequential features, resulting in enhanced accuracy on sentence classification tasks. Jain et al. (2021) employed a hybrid CNN-LSTM model for consumer sentiment analysis, demonstrating that combined architectures yield better context awareness than standalone networks. Simanihuruk and Suparwito (2025) compared LSTM and BiLSTM on Shopee skincare product reviews. BiLSTM showed superior performance (95.91% accuracy), reinforcing the importance of bidirectional context in sentiment analysis. Dong et al. (2025) introduced a BiLSTM-CNN model tailored for Chinese product reviews. Despite improved classification performance, the domain specificity and language constraints limited its generalization. Yang et al. (2020) proposed a hybrid approach combining CNN, BiGRU, and sentiment lexicons, demonstrating effective sentiment categorization in Chinese e-commerce contexts. Priyadarshini and Cotton (2021) developed a grid search-based LSTM-CNN architecture optimized for movie review sentiment, illustrating the benefits of hyperparameter tuning for deep architectures. Feizollah et al. (2019) introduced a stacked DL model integrating CNN and LSTM layers for halal product sentiment analysis on Twitter, validating that model ensembling improves robustness.
Abid et al. (2019) used a CNN-LSTM hybrid for Twitter sentiment classification, while Zhao et al. (2020) adopted LSTM for predicting user personality based on sentiment-laden preferences. Haque et al. (2018) evaluated DL models on large-scale Amazon reviews, identifying scalability challenges and domain variability in sentiment accuracy.

2.3 Word Embeddings and Representation Learning

Pre-trained word embeddings have significantly improved sentiment classification. Li et al. (2017) highlighted the benefits of learning domain-adaptive embeddings. Qorich and El Ouazzani (2023) used Word2Vec in conjunction with CNNs, reporting strong results on Amazon reviews. Muhammad et al. (2021) applied Word2Vec and LSTM to Indonesian hotel reviews, capturing both word-level and temporal sentiment dependencies. Barry (2017) compared BoW and LSTM models on online reviews, confirming the superiority of context-aware neural networks. Similarly, Abdi et al. (2019) explored multi-feature fusion with DL models for enhanced accuracy.

2.4 Transformer-Based Architectures

Transformer models represent a significant leap forward. Vaswani et al. (2017) introduced the original Transformer architecture, leveraging self-attention to handle long-range dependencies with greater efficiency. BERT (Devlin et al., 2018) brought bidirectional pre-training to sentiment classification. Noriega et al. (2023) compared BERT, XLNet, ULMFiT, and RoBERTa on Amazon reviews, with RoBERTa achieving the highest performance (82%), albeit with high computational demand. RoBERTa (Liu et al., 2019), optimized with dynamic masking and longer training sequences, showed improved accuracy across benchmarks. Our study confirms RoBERTa’s dominance with an accuracy of 96.36%, the highest among all models evaluated. Wang (2025) fine-tuned LLaMa-3 for sentiment analysis, outperforming previous transformer models but requiring significant hardware resources.
Similarly, Cambray and Podsadowski (2019) used BiRNNs for offensive content classification, highlighting the value of recurrent bidirectionality. Khan et al. (2023) developed a hybrid DNN model with attention, integrating semantic context and syntactic structure. Cheng et al. (2021) proposed a capsule network with CNN-BiGRU fusion for nuanced sentiment classification. Hakimi et al. (2025) conducted sentiment analysis with abstractive summarization, combining content understanding with classification. Kumar et al. (2025) provided a meta-review of evolving techniques in sentiment analysis, particularly the role of transfer learning. Sun et al. (2019) used auxiliary sentence construction with BERT for aspect-based sentiment analysis, showing how task-specific tuning improves model relevance. Zheng et al. (2023) proposed lightweight attention-based architectures for faster inference without compromising accuracy. Derbentsev et al. (2022) evaluated multiple DL models on social media texts and emphasized the need for domain adaptability. Dhaoui et al. (2017) contrasted lexicon-based and ML-based approaches, finding the latter to be more scalable and accurate.

2.5 Summary and Research Gap

While traditional ML models offer interpretability and efficiency, they lack contextual depth. Deep learning models improve sentiment representation but struggle with computational constraints and long-range dependencies. Transformer-based models like BERT and RoBERTa excel in both context modeling and performance, but often require fine-tuning and substantial computational resources. Despite numerous studies, a unified, systematic benchmark across ML, DL, and multiple transformer variants on standardized datasets remains underexplored. Furthermore, many prior studies focus on specific domains or languages, limiting generalizability.
The proposed research bridges this gap by benchmarking a broad set of models (Logistic Regression, SVM, Decision Tree, Random Forest, Naïve Bayes, CNN, LSTM, ELMo, DistilBERT, ELECTRA, T5, BERT, and RoBERTa) on the Amazon product review dataset under identical experimental settings. The experimental results confirm RoBERTa's superiority, validating its ability to handle contextual sentiment variations with high accuracy, precision, recall, and F1-score.

3 Research Methodology

The methodological framework employed in this study is designed to systematically evaluate the performance of traditional machine learning, deep learning, and transformer-based models for sentiment analysis of product reviews. The objective is to ensure consistent preprocessing, training, and evaluation procedures across all models to facilitate a fair comparison. The methodology is structured into the following key phases: dataset acquisition and preparation, feature extraction and tokenization, model development and training, and evaluation and analysis.

3.1 Proposed Methodological Framework

To conduct a fair and comprehensive evaluation of various sentiment analysis models, a structured methodology was designed to ensure consistency across preprocessing, model training, and evaluation. The framework integrates traditional machine learning, deep learning, and transformer-based models, facilitating comparative analysis under a unified experimental setup. The following workflow diagram outlines the sequential phases adopted in this study, from data acquisition to model assessment.

3.2 Dataset Description

The dataset used in this study comprises publicly available Amazon product reviews obtained from Kaggle. It contains textual reviews along with sentiment labels categorized as positive, negative, or neutral. The dataset includes reviews from diverse product categories, ensuring robustness and generalizability of the model evaluations.
3.3 Text Preprocessing

Raw text data is often noisy and inconsistent. To standardize the inputs for learning algorithms, the following preprocessing steps were applied:

- Lowercasing: All text was converted to lowercase to ensure uniformity.
- Punctuation Removal: Punctuation marks were removed to focus on word-level semantics.
- Stopword Removal: Common stopwords (e.g., “is”, “the”, “and”) were eliminated to reduce noise.
- Tokenization: Text was broken down into individual tokens (words) using NLTK and HuggingFace tokenizers, depending on the model requirements.
- Sequence Padding/Truncation: For transformer models, each sequence was padded or truncated to a uniform length of 128 tokens.

3.4 Feature Engineering and Representation

Feature engineering is a critical step in sentiment analysis, as it transforms raw textual data into structured numerical formats that can be interpreted by machine learning and deep learning algorithms. The effectiveness of a model heavily depends on the quality and nature of input representations. In this study, three distinct strategies were adopted based on the modeling approach: traditional machine learning, deep learning, and transformer-based models.

3.4.1 Traditional Feature Extraction (for ML Models)

For classical machine learning algorithms, the textual data was converted into sparse vector representations using the following methods:

- Bag-of-Words (BoW): This technique creates a vocabulary of all words in the corpus and represents each document as a vector of word occurrence counts. Although simple and effective for small datasets, BoW fails to capture semantic relationships or word ordering.
- Term Frequency–Inverse Document Frequency (TF-IDF): TF-IDF improves upon BoW by down-weighting frequently occurring terms and giving importance to words that are rare and thus potentially more informative.
Each word in a review is weighted by its frequency in that review (TF) and its inverse frequency across all reviews (IDF), resulting in a more meaningful feature space.

Both BoW and TF-IDF were implemented using Scikit-learn’s CountVectorizer and TfidfVectorizer, respectively. The resulting vectors were used to train models like SVM, Logistic Regression, Naïve Bayes, and Random Forest.

3.4.2 Embedding-Based Representations (for DL Models)

Deep learning models like CNN and LSTM require dense and continuous representations that retain syntactic and semantic information. For this purpose:

- Pre-trained Word Embeddings: GloVe embeddings (Global Vectors for Word Representation) were used to convert tokens into 100-dimensional vectors. These embeddings capture relationships between words in a vector space, where semantically similar words are closer together.
- Embedding Layer: An embedding layer was initialized with GloVe weights and trained further during model optimization. Out-of-vocabulary tokens were assigned random vectors and updated via backpropagation.

The use of embeddings allows deep learning models to understand relationships between words, enhancing their ability to detect subtle sentiment cues such as sarcasm or comparative sentiment.

3.4.3 Contextualized Representations (for Transformer Models)

Unlike static embeddings, transformer models generate contextual embeddings, which adjust the representation of each word depending on its surrounding words in the sentence. The following steps were adopted for transformer-based models:

- Model-specific Tokenization: Each transformer architecture (BERT, RoBERTa, DistilBERT, ELECTRA, T5) has its own tokenizer. These tokenizers split text into subword tokens using Byte Pair Encoding (BPE), WordPiece, or SentencePiece algorithms.
- Input Construction: For each sentence, input IDs, attention masks, and segment IDs (where applicable) were generated.
Padding or truncation was applied to ensure all input sequences were of uniform length (128 tokens).
- Positional Embeddings: Transformers do not inherently understand word order. Therefore, positional information was added automatically by each model to encode the position of tokens in the sequence.

This rich and dynamic representation enables transformer models to understand deeper semantic and syntactic relationships, allowing them to excel in sentiment analysis tasks.

3.5 Model Development

A diverse set of models from three learning paradigms (traditional machine learning, deep learning, and transformer-based learning) was implemented and evaluated. This diversity ensures a robust benchmark and allows identification of the best trade-offs between accuracy, interpretability, and computational cost.

3.5.1 Traditional Machine Learning Models

The following classical models were developed using Scikit-learn:

- Logistic Regression (LR): A linear classifier that uses the sigmoid function to estimate class probabilities. It is fast, interpretable, and works well with linearly separable data.
- Support Vector Machine (SVM): Employs hyperplanes to separate sentiment classes with maximum margin. The radial basis function (RBF) kernel was tested for non-linear separation.
- Naïve Bayes (NB): A probabilistic classifier based on Bayes’ theorem. It assumes feature independence and is particularly effective for text classification due to its simplicity.
- Decision Tree (DT): Builds a hierarchical tree structure based on entropy or Gini impurity to classify data. It is prone to overfitting but provides intuitive rules.
- Random Forest (RF): An ensemble of decision trees built using bootstrap aggregation. It improves generalization and reduces variance, often outperforming individual trees.

Each model was trained using TF-IDF vectors and evaluated using 5-fold cross-validation. Hyperparameters (e.g., C in SVM, max_depth in RF) were optimized using grid search.
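The combination of k-fold cross-validation and grid search described above can be sketched with the standard library alone. This is an illustrative sketch, not the study's code (which used Scikit-learn): the helpers `k_fold_indices`, `cross_val_score`, and `grid_search`, the toy data, and the single `threshold` hyperparameter are all hypothetical. The folds here are contiguous and unshuffled, whereas practical setups usually shuffle or stratify.

```python
from itertools import product


def k_fold_indices(n_samples, k=5):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_val_score(fit_and_score, n_samples, params, k=5):
    """Average the held-out score over k folds for one hyperparameter setting."""
    folds = k_fold_indices(n_samples, k)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(fit_and_score(train_idx, test_idx, **params))
    return sum(scores) / k


def grid_search(fit_and_score, n_samples, grid, k=5):
    """Try every combination in `grid` and keep the best mean CV score."""
    best_params, best_score = None, float("-inf")
    keys = sorted(grid)
    for values in product(*(grid[key] for key in keys)):
        params = dict(zip(keys, values))
        score = cross_val_score(fit_and_score, n_samples, params, k)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Toy data: one made-up feature per sample and a binary label.
xs = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4, 0.6, 0.15, 0.85]
ys = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]


def threshold_accuracy(train_idx, test_idx, threshold):
    # The threshold rule has nothing to fit, so train_idx is unused here;
    # a real model would be trained on train_idx before scoring.
    hits = sum((xs[i] > threshold) == bool(ys[i]) for i in test_idx)
    return hits / len(test_idx)


best_params, best_score = grid_search(
    threshold_accuracy, len(xs), {"threshold": [0.05, 0.5, 0.95]}
)
print(best_params, best_score)
```

In Scikit-learn, the same pattern is typically expressed with `GridSearchCV` and a `cv=5` argument; the sketch only makes the nested loops explicit.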
3.5.2 Deep Learning Models

Implemented using TensorFlow and Keras, these models exploit neural architectures to automatically learn hierarchical features from text:

- Convolutional Neural Network (CNN): A one-dimensional CNN architecture was used with multiple filters (kernel sizes = 2, 3, 4) and ReLU activations, followed by global max-pooling and a dense softmax output layer. CNNs are efficient in capturing local features and sentiment-bearing n-grams.
- Long Short-Term Memory (LSTM): A sequence model capable of capturing long-term dependencies. The architecture consisted of a single LSTM layer followed by dropout and a dense classification head. LSTM was chosen for its ability to retain context across longer input sequences.

Both models used pre-trained GloVe embeddings and were trained for 5 epochs using the Adam optimizer, binary cross-entropy loss, and batch size of 64. Early stopping was applied to avoid overfitting.

3.5.3 Transformer-Based Models

Transformer models were fine-tuned using the HuggingFace Transformers library with the PyTorch backend. The following models were explored:

- ELMo (Embeddings from Language Models): Generates context-aware embeddings using deep bi-directional LSTMs. Implemented via AllenNLP for comparative analysis.
- BERT (Bidirectional Encoder Representations from Transformers): Fine-tuned using bert-base-uncased. It utilizes bidirectional self-attention and is pre-trained using masked language modeling and next sentence prediction.
- RoBERTa: A robust optimization of BERT that removes next sentence prediction and trains with dynamic masking on longer sequences. It achieved the highest performance across all models.
- DistilBERT: A compressed version of BERT with fewer parameters, providing faster inference while retaining ~97% of BERT’s performance.
- ELECTRA: Trained using a novel pre-training task (replaced token detection) instead of masked language modeling, making it faster and more efficient.
- T5 (Text-to-Text Transfer Transformer): Casts every NLP task into a text-to-text format, allowing for greater flexibility. Fine-tuned using the t5-small checkpoint for sentiment classification as a text generation problem.

4 Results and Discussion

The performance evaluation of the sentiment classification models was conducted using four primary metrics: accuracy, precision, recall, and F1-score. Each model was tested under identical preprocessing and training conditions to ensure fairness. The results are structured across three categories: traditional machine learning (ML) models, deep learning (DL) architectures, and transformer-based language models.

4.1 Performance of Traditional Machine Learning Models

The traditional ML models were trained using TF-IDF representations of the Amazon product reviews. These models performed moderately well in terms of baseline sentiment classification accuracy, as shown in Table 1.

Table 1 Performance of Traditional ML Models

| Model | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|
| Logistic Regression | 0.8811 | 0.8811 | 0.8807 | 0.8810 |
| Decision Tree | 0.7508 | 0.7508 | 0.7508 | 0.7508 |
| Random Forest | 0.8533 | 0.8536 | 0.8533 | 0.8533 |
| Naïve Bayes | 0.8376 | 0.8378 | 0.8376 | 0.8376 |
| SVM | 0.8803 | 0.8804 | 0.8803 | 0.8803 |

Observation: Logistic Regression and SVM yielded the highest accuracy (~88%) among traditional models, demonstrating strong generalization on high-dimensional TF-IDF vectors. Decision Trees underperformed due to overfitting and lack of contextual learning. Overall, traditional models lacked the capacity to capture sequential and semantic relationships in text.

4.2 Performance of Deep Learning Models

The deep learning models (CNN and LSTM) were trained using pre-trained GloVe embeddings to capture word semantics and positional information. Their results are shown in Table 2.

Table 2 Performance of DL Models

| Model | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|
| CNN | 0.9091 | 0.9091 | 0.9091 | 0.9091 |
| LSTM | 0.9137 | 0.9137 | 0.9137 | 0.9137 |

Observation: Both CNN and LSTM outperformed all traditional ML models.
LSTM slightly edged over CNN, reflecting its superior ability to capture long-range dependencies and sequential patterns. However, both models required more computational resources and longer training time.

4.3 Performance of Transformer-Based Models

Transformer models were fine-tuned on the same dataset using contextualized token embeddings. With the exception of ELMo, these models scored higher than the traditional and deep learning models on all four evaluation metrics, as shown in Table 3.

Table 3 Performance of Transformer-based Models

| Model | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|
| ELMo | 0.8014 | 0.8015 | 0.8014 | 0.8014 |
| DistilBERT | 0.9498 | 0.9501 | 0.9498 | 0.9498 |
| ELECTRA | 0.9513 | 0.9514 | 0.9513 | 0.9513 |
| T5 | 0.9465 | 0.9468 | 0.9465 | 0.9465 |
| BERT | 0.9594 | 0.9600 | 0.9594 | 0.9594 |
| RoBERTa | 0.9636 | 0.9639 | 0.9636 | 0.9636 |

Observation: RoBERTa achieved the highest scores of all models, ahead of BERT, ELECTRA, and DistilBERT. Its improved pretraining strategy with dynamic masking and a larger training corpus enabled it to learn deeper contextual associations, crucial for accurate sentiment detection. ELMo (80.14% accuracy) was the exception in this group, scoring below every traditional model except the Decision Tree.

4.4 Comparative Analysis and Best Model Justification

Fig. 2 visualizes the comparative performance (accuracy) of all models across categories:

- Traditional ML models serve as fast and lightweight baselines but are inherently limited by sparse vector representations and lack of contextual understanding.
- DL models capture better semantic features and achieve higher accuracy (~91%) but still lag behind transformer models due to limitations in bidirectional context modeling.
- Transformer-based models show a significant leap in performance, with RoBERTa achieving the highest accuracy (96.36%). This demonstrates the advantage of attention mechanisms, bidirectional context, and large-scale pretraining.

RoBERTa's superior performance can be attributed to several key factors.
First, RoBERTa keeps BERT's architecture but improves the pretraining procedure with dynamic masking, in which the masking pattern is regenerated each time a sequence is fed to the model instead of being fixed once during preprocessing, and with a larger training corpus, enabling it to learn richer contextual relationships in the data. Second, RoBERTa is pretrained on longer sequences, which improves its ability to capture long-range dependencies. These improvements in training methodology, combined with its robust architecture, allow RoBERTa to better handle the intricacies of sentiment classification, especially with complex or ambiguous text.

This result positions RoBERTa as the most effective model in this study, with accuracy gains of roughly 0.4 to 1.4 percentage points over BERT, ELECTRA, and DistilBERT. It also leads on precision, recall, and F1-score, and its four metrics lie within 0.0003 of one another, indicating balanced performance.

4.5 Benefits and Applications

Sentiment analysis of product reviews holds significant value across multiple domains. In e-commerce, businesses can use sentiment classification models to assess customer feedback, identify trends, and refine marketing strategies. Automated sentiment analysis allows companies to process vast amounts of user-generated content, providing insights into consumer satisfaction and potential areas of product improvement. Furthermore, customer service departments can use sentiment analysis tools to prioritize and address negative reviews efficiently, enhancing user experience and brand reputation.

Beyond e-commerce, sentiment analysis has applications in social media monitoring, financial market analysis, political opinion mining, and healthcare. Social media platforms can employ sentiment classification to detect public opinion on brands, policies, or global events. Financial analysts can use sentiment trends to predict stock market fluctuations, while political organizations can gauge public sentiment towards candidates or policies.
Additionally, in healthcare, sentiment analysis can be applied to patient reviews of medical services, aiding in quality assessment and service enhancements.

4.6 Importance of Transformer-Based Approaches

The evolution of sentiment analysis from traditional models to deep learning-based methods has led to significant advancements in accuracy and efficiency. Transformer-based models address limitations of earlier techniques and are better equipped to handle contextual nuances, sarcasm, and complex linguistic structures. Unlike recurrent neural networks (RNNs), which process text sequentially, transformers apply self-attention to all positions of a sequence in parallel, which shortens training on parallel hardware and improves scalability.

The findings of this study emphasize the growing need for adopting transformer-based architectures in sentiment analysis applications. As the volume of user-generated textual data continues to expand, efficient and accurate sentiment classification models become crucial for data-driven decision-making. The ability to automate and analyze customer sentiments in real time has far-reaching implications, empowering organizations to make informed strategic decisions based on consumer insights.

5 Conclusions

This study presents a comprehensive evaluation of sentiment analysis techniques across three major paradigms: traditional machine learning, deep learning, and transformer-based models. By standardizing the experimental setup, including preprocessing, feature representation, and evaluation metrics, the study offers a fair and rigorous comparison of models on the Amazon product review dataset.

Among traditional models, Logistic Regression and SVM delivered strong baseline performance, though their reliance on sparse feature vectors limited their ability to model complex linguistic dependencies. Deep learning models such as CNN and LSTM improved upon these limitations by learning semantic features directly from data.
LSTM, in particular, demonstrated the ability to capture sequential dependencies in sentiment-bearing phrases, achieving over 91% accuracy.

Transformer-based models emerged as the most robust and accurate class of models. With their attention mechanisms and deep contextual understanding, models like BERT and ELECTRA outperformed the best deep learning model by about four to five percentage points in accuracy. Notably, RoBERTa achieved the highest performance across all metrics, including an accuracy of 96.36%, supporting the effectiveness of its optimized training strategy.

These findings reaffirm the importance of context-aware language models in sentiment analysis and suggest that transformer architectures are currently the most suitable tools for real-world sentiment classification applications. Their deployment in e-commerce platforms, customer service systems, and social media analytics can improve the quality and scalability of consumer insight extraction.

6 Future Directions

While this study establishes the effectiveness of transformer-based models for sentiment analysis, several areas remain open for further exploration. One promising direction is multi-modal sentiment analysis, integrating text, images, and audio cues to enhance classification accuracy. Additionally, fine-tuning transformer models on domain-specific datasets could improve performance by adapting them to specialized language patterns and sentiment expressions.

Another crucial aspect is explainability and interpretability in transformer-based models. As sentiment analysis becomes increasingly integrated into decision-making processes, developing methods to explain and justify model predictions is essential. Future research could explore techniques such as attention visualization, feature attribution, and explainable AI (XAI) frameworks to enhance transparency in sentiment classification.
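Feature attribution, one of the explainability techniques mentioned above, can be illustrated with a minimal occlusion (leave-one-token-out) sketch in plain Python. The scorer below is a hypothetical stand-in for a trained classifier: `TOY_WEIGHTS`, `score`, and `occlusion_attribution` are illustrative names and values, not part of the study. In practice the same loop would presumably wrap a fine-tuned model's positive-class probability.

```python
import math

# Hypothetical word weights standing in for a trained sentiment model.
TOY_WEIGHTS = {"great": 2.0, "love": 1.5, "terrible": -2.5, "not": -1.0}


def score(tokens):
    """Toy positive-class probability: sigmoid of the summed word weights."""
    logit = sum(TOY_WEIGHTS.get(tok, 0.0) for tok in tokens)
    return 1.0 / (1.0 + math.exp(-logit))


def occlusion_attribution(tokens, score_fn=score):
    """Attribute the prediction to each token by removing it and re-scoring.

    A positive value means the token pushed the score towards "positive";
    a negative value means it pushed the score towards "negative".
    """
    baseline = score_fn(tokens)
    attributions = []
    for i, tok in enumerate(tokens):
        occluded = tokens[:i] + tokens[i + 1:]
        attributions.append((tok, baseline - score_fn(occluded)))
    return attributions


review = "the battery is great but the screen is terrible".split()
for token, delta in occlusion_attribution(review):
    print(f"{token:>10s} {delta:+.3f}")
```

Occlusion needs only black-box access to the model's score, at the cost of one extra forward pass per token, which is why it is used here instead of gradient-based or attention-based methods.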
Furthermore, real-time sentiment analysis presents an exciting challenge, particularly in resource-constrained environments such as mobile devices and edge computing. Optimizing transformer models for low-latency inference and reduced computational complexity could enable real-time applications in customer feedback monitoring, social media analysis, and personalized recommendation systems.

Overall, the advancements in deep learning and NLP will continue to drive improvements in sentiment analysis, enabling more accurate, context-aware, and scalable AI-driven solutions for diverse applications.

Declarations

Competing Interests
The authors have no relevant financial or non-financial interests to disclose.

Funding
The authors declare that no funds, grants, or other support were received during the preparation of this manuscript.

Data Availability
The datasets analysed during the current study are available in the TensorFlow-Sentiment-Analysis-on-Amazon-Reviews-Data repository, https://github.com/MuhammedBuyukkinaci/TensorFlow-Sentiment-Analysis-on-Amazon-Reviews-Data/tree/master/dataset

References

Pang B, Lee L, Vaithyanathan S (2002) Thumbs up? Sentiment classification using machine learning techniques. arXiv preprint cs/0205070
Fang X, Zhan J (2015) Sentiment analysis using product review data. J Big Data 2(1):1–14
Anto MP et al (2016) Product rating using sentiment analysis, in Proc. IEEE Int. Conf. Electrical, Electronics, and Optimization Techniques (ICEEOT), pp. 3458–3462
Sindhu C, Vyas DV, Pradyoth K (2017) Sentiment analysis based product rating using textual reviews, in Proc. IEEE Int. Conf. Electronics, Communication and Aerospace Technology (ICECA), vol. 2, pp. 727–731
Shivaprasad TK, Shetty J (2017) Sentiment analysis of product reviews: A review, in Proc. IEEE Int. Conf. Inventive Communication and Computational Technologies (ICICCT), pp. 298–301
Singh J, Singh G, Singh R (2017) Optimization of sentiment analysis using machine learning classifiers. Human-centric Comput Inf Sci 7(1):1–12
Bose R et al (2019) Analyzing political sentiment using Twitter data, in Proc. ICTIS 2018, Springer, Singapore, pp. 427–436
Pratama Y et al (2019) Implementation of sentiment analysis on Twitter using Naïve Bayes algorithm to know the people responses to debate of DKI Jakarta governor election. J Phys Conf Ser 1175:012102
Paltoglou G, Thelwall M (2012) Twitter, MySpace, Digg: Unsupervised Sentiment Analysis in Social Media. ACM Trans Intell Syst Technol 3(4):1–19
Kalchbrenner N, Grefenstette E, Blunsom P (2014) A convolutional neural network for modelling sentences. arXiv preprint arXiv:1404.2188
Zhou C, Sun C, Liu Z, Lau F (2015) A C-LSTM neural network for text classification. arXiv preprint arXiv:1511.08630
Jain PK, Saravanan V, Pamula R (2021) A hybrid CNN-LSTM: A deep learning approach for consumer sentiment analysis using qualitative user-generated contents. ACM Trans Asian Low-Resour Lang Inf Process 20(5):1–15
Simanihuruk L, Suparwito H (2025) Long Short-Term Memory and Bidirectional Long Short-Term Memory Algorithms for Sentiment Analysis of Skintific Product Reviews, in ITM Web Conf., vol. 71, p. 01016
Dong Y et al (2025) DC-BiLSTM-CNN Algorithm for Sentiment Analysis of Chinese Product Reviews. Appl Artif Intell 39(1)
Yang L, Li Y, Wang J, Sherratt RS (2020) Sentiment analysis for E-commerce product reviews in Chinese based on sentiment lexicon and deep learning. IEEE Access 8:23522–23530
Priyadarshini I, Cotton C (2021) A novel LSTM–CNN–grid search-based deep neural network for sentiment analysis. J Supercomput 77:13911–13932
Feizollah A et al (2019) Halal products on Twitter: Data extraction and sentiment analysis using stack of deep learning algorithms. IEEE Access 7:83354–83362
Abid F et al (2019) Sentiment analysis through recurrent variants latterly on convolutional neural network of Twitter. Future Gener Comput Syst 95:292–308
Zhao J et al (2020) User personality prediction based on topic preference and sentiment analysis using LSTM model. Pattern Recognit Lett 138:397–402
Haque TU, Saber NN, Shah FM (2018) Sentiment analysis on large scale Amazon product reviews, in Proc. IEEE Int. Conf. Innov. Res. Dev. (ICIRD), pp. 1–6
Li Y et al (2017) Learning word representations for sentiment analysis. Cogn Comput 9:843–851
Qorich M, Ouazzani RE (2023) Text sentiment classification of Amazon reviews using word embeddings and convolutional neural networks. J Supercomput 79(10):11029–11054
Muhammad PF, Kusumaningrum R, Wibowo A (2021) Sentiment analysis using Word2vec and long short-term memory (LSTM) for Indonesian hotel reviews. Procedia Comput Sci 179:728–735
Barry J (2017) Sentiment Analysis of Online Reviews Using Bag-of-Words and LSTM Approaches, in Proc. AICS, pp. 272–274
Abdi A et al (2019) Deep learning-based sentiment classification of evaluative text based on Multi-feature fusion. Inf Process Manag 56(4):1245–1259
Vaswani A et al (2017) Attention is all you need, in Proc. Advances in Neural Information Processing Systems (NeurIPS), pp. 5998–6008
Devlin J, Chang M-W, Lee K, Toutanova K (2018) BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805
Liu Y et al (2019) RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692
Noriega I et al (2023) Sentiment Analysis of Amazon Reviews using Deep Learning Techniques, RG Publication, [Online]. Available: https://doi.org/10.13140/RG.2.2.35547.75046
Wang Y (2025) Sentiment Analysis of Product Reviews Using Fine-Tuned LLaMa-3 Model, in ITM Web Conf., vol. 70, p. 04021
Sun C, Huang L, Qiu X (2019) Utilizing BERT for aspect-based sentiment analysis via constructing auxiliary sentence. arXiv preprint arXiv:1903.09588
Zheng W et al (2023) Lightweight Multilayer Interactive Attention Network for Aspect-Based Sentiment Analysis. Connection Sci 35(1)
Derbentsev VD et al (2022) A comparative study of deep learning models for sentiment analysis of social media texts, in M3E2-MLPEED, pp. 168–188
Dhaoui C, Webster CM, Tan LP (2017) Social media sentiment analysis: lexicon versus machine learning. J Consum Mark 34(6):480–488
Wang (\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e2025\u003c/span\u003e) fine-tuned LLaMa-3 for sentiment analysis, outperforming previous transformer models but requiring significant hardware resources. Similarly, Cambray and Podsadowski (2019) used BiRNNs for offensive content classification, highlighting the value of recurrent bidirectionality.\u003c/p\u003e \u003cp\u003eKhan et al. (2023) developed a hybrid DNN model with attention, integrating semantic context and syntactic structure. Cheng et al. (2021) proposed a capsule network with CNN-BiGRU fusion for nuanced sentiment classification. Hakimi et al. (2025) conducted sentiment analysis with abstractive summarization, combining content understanding with classification. Kumar et al. (2025) provided a meta-review of evolving techniques in sentiment analysis, particularly the role of transfer learning. Sun et al. (2019) used auxiliary sentence construction with BERT for aspect-based sentiment analysis, showing how task-specific tuning improves model relevance. Zheng et al. (\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e2023\u003c/span\u003e) proposed lightweight attention-based architectures for faster inference without compromising accuracy. Derbentsev et al. (\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e2022\u003c/span\u003e) evaluated multiple DL models on social media texts and emphasized the need for domain adaptability. Dhaoui et al. (\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e2017\u003c/span\u003e) contrasted lexicon-based and ML-based approaches, finding the latter to be more scalable and accurate.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec7\" class=\"Section2\"\u003e \u003ch2\u003e2.5 Summary and Research Gap\u003c/h2\u003e \u003cp\u003eWhile traditional ML models offer interpretability and efficiency, they lack contextual depth. 
Deep learning models improve sentiment representation but struggle with computational constraints and long-range dependencies. Transformer-based models like BERT and RoBERTa excel in both context modeling and performance, but often require fine-tuning and substantial computational resources. Despite numerous studies, unified, systematic benchmarks across ML, DL, and multiple transformer variants on standardized datasets remain rare. Furthermore, many prior studies focus on specific domains or languages, limiting generalizability.\u003c/p\u003e \u003cp\u003eThis study addresses that gap by benchmarking a broad set of models (Logistic Regression, SVM, Decision Tree, Random Forest, Na\u0026iuml;ve Bayes, CNN, LSTM, ELMo, DistilBERT, ELECTRA, T5, BERT, and RoBERTa) on the Amazon product review dataset under identical experimental settings. In these experiments RoBERTa achieved the highest accuracy, precision, recall, and F1-score, indicating that it handles contextual sentiment variations well.\u003c/p\u003e \u003c/div\u003e"},{"header":"3 Research Methodology","content":"\u003cp\u003eThe methodological framework employed in this study is designed to systematically evaluate the performance of traditional machine learning, deep learning, and transformer-based models for sentiment analysis of product reviews. The objective is to ensure consistent preprocessing, training, and evaluation procedures across all models to facilitate a fair comparison. 
The methodology is structured into the following key phases: dataset acquisition and preparation, feature extraction and tokenization, model development and training, and evaluation and analysis.\u003c/p\u003e \u003cdiv id=\"Sec9\" class=\"Section2\"\u003e \u003ch2\u003e3.1 Proposed Methodological Framework\u003c/h2\u003e \u003cp\u003eTo conduct a fair and comprehensive evaluation of various sentiment analysis models, a structured methodology was designed to ensure consistency across preprocessing, model training, and evaluation. The framework integrates traditional machine learning, deep learning, and transformer-based models, facilitating comparative analysis under a unified experimental setup. The following workflow diagram outlines the sequential phases adopted in this study, from data acquisition to model assessment.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec10\" class=\"Section2\"\u003e \u003ch2\u003e3.2 Dataset Description\u003c/h2\u003e \u003cp\u003eThe dataset used in this study comprises publicly available Amazon product reviews obtained from Kaggle. It contains textual reviews along with sentiment labels categorized as positive, negative, or neutral. The dataset includes reviews from diverse product categories, ensuring robustness and generalizability of the model evaluations.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec11\" class=\"Section2\"\u003e \u003ch2\u003e3.3 Text Preprocessing\u003c/h2\u003e \u003cp\u003eRaw text data is often noisy and inconsistent. 
To standardize the inputs for learning algorithms, the following preprocessing steps were applied:\u003c/p\u003e \u003cp\u003e \u003cul\u003e \u003cli\u003e \u003cp\u003eLowercasing: All text was converted to lowercase to ensure uniformity.\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003ePunctuation Removal: Punctuation marks were removed to focus on word-level semantics.\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003eStopword Removal: Common stopwords (e.g., \u0026ldquo;is\u0026rdquo;, \u0026ldquo;the\u0026rdquo;, \u0026ldquo;and\u0026rdquo;) were eliminated to reduce noise.\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003eTokenization: Text was broken down into individual tokens (words) using NLTK and HuggingFace tokenizers, depending on the model requirements.\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003eSequence Padding/Truncation: For transformer models, each sequence was padded or truncated to a uniform length of 128 tokens.\u003c/p\u003e \u003c/li\u003e \u003c/ul\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec12\" class=\"Section2\"\u003e \u003ch2\u003e3.4 Feature Engineering and Representation\u003c/h2\u003e \u003cp\u003eFeature engineering is a critical step in sentiment analysis, as it transforms raw textual data into structured numerical formats that can be interpreted by machine learning and deep learning algorithms. The effectiveness of a model heavily depends on the quality and nature of input representations. 
In this study, three distinct strategies were adopted based on the modeling approach: traditional machine learning, deep learning, and transformer-based models.\u003c/p\u003e \u003cdiv id=\"Sec13\" class=\"Section3\"\u003e \u003ch2\u003e3.4.1 Traditional Feature Extraction (for ML Models)\u003c/h2\u003e \u003cp\u003eFor classical machine learning algorithms, the textual data was converted into sparse vector representations using the following methods:\u003c/p\u003e \u003cp\u003e \u003cstrong\u003eBag-of-Words (BoW)\u003c/strong\u003e \u003cp\u003eThis technique creates a vocabulary of all words in the corpus and represents each document as a vector of word occurrence counts. Although simple and effective for small datasets, BoW fails to capture semantic relationships or word ordering.\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003eTerm Frequency\u0026ndash;Inverse Document Frequency (TF-IDF)\u003c/strong\u003e \u003cp\u003eTF-IDF improves upon BoW by down-weighting frequently occurring terms and giving importance to words that are rare and thus potentially more informative. Each word in a review is weighted by its frequency in that review (TF) and its inverse frequency across all reviews (IDF), resulting in a more meaningful feature space.\u003c/p\u003e \u003c/p\u003e \u003cp\u003eBoth BoW and TF-IDF were implemented using Scikit-learn\u0026rsquo;s CountVectorizer and TfidfVectorizer, respectively. The resulting vectors were used to train models like SVM, Logistic Regression, Na\u0026iuml;ve Bayes, and Random Forest.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec14\" class=\"Section3\"\u003e \u003ch2\u003e3.4.2 Embedding-Based Representations (for DL Models)\u003c/h2\u003e \u003cp\u003eDeep learning models like CNN and LSTM require dense and continuous representations that retain syntactic and semantic information. 
For this purpose:\u003c/p\u003e \u003cp\u003e \u003cstrong\u003ePre-trained Word Embeddings\u003c/strong\u003e \u003cp\u003eGloVe embeddings (Global Vectors for Word Representation) were used to convert tokens into 100-dimensional vectors. These embeddings capture relationships between words in a vector space, where semantically similar words are closer together.\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003eEmbedding Layer\u003c/strong\u003e \u003cp\u003eAn embedding layer was initialized with GloVe weights and trained further during model optimization. Out-of-vocabulary tokens were assigned random vectors and updated via backpropagation.\u003c/p\u003e \u003c/p\u003e \u003cp\u003eThe use of embeddings allows deep learning models to exploit relationships between words, which may improve their ability to detect subtle sentiment cues such as sarcasm or comparative sentiment.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec15\" class=\"Section3\"\u003e \u003ch2\u003e3.4.3 Contextualized Representations (for Transformer Models)\u003c/h2\u003e \u003cp\u003eUnlike static embeddings, transformer models generate contextual embeddings, which adjust the representation of each word depending on its surrounding words in the sentence. The following steps were adopted for transformer-based models:\u003c/p\u003e \u003cp\u003e \u003cstrong\u003eModel-specific Tokenization\u003c/strong\u003e \u003cp\u003eEach transformer architecture (BERT, RoBERTa, DistilBERT, ELECTRA, T5) has its own tokenizer. These tokenizers split text into subword tokens: WordPiece for BERT, DistilBERT, and ELECTRA, byte-level Byte Pair Encoding (BPE) for RoBERTa, and SentencePiece for T5.\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003eInput Construction\u003c/strong\u003e \u003cp\u003eFor each sentence, input IDs, attention masks, and segment IDs (where applicable) were generated. 
Padding or truncation was applied to ensure all input sequences were of uniform length (128 tokens).\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003ePositional Embeddings\u003c/strong\u003e \u003cp\u003eSelf-attention on its own is insensitive to word order. Each model therefore adds positional information to its token representations internally, so no manual positional features were required.\u003c/p\u003e \u003c/p\u003e \u003cp\u003eThis rich and dynamic representation enables transformer models to capture deeper semantic and syntactic relationships, which helps them excel in sentiment analysis tasks.\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv id=\"Sec16\" class=\"Section2\"\u003e \u003ch2\u003e3.5 Model Development\u003c/h2\u003e \u003cp\u003eA diverse set of models from three learning paradigms (traditional machine learning, deep learning, and transformer-based learning) was implemented and evaluated. This diversity supports a broad benchmark and helps identify the best trade-offs between accuracy, interpretability, and computational cost.\u003c/p\u003e \u003cdiv id=\"Sec17\" class=\"Section3\"\u003e \u003ch2\u003e3.5.1 Traditional Machine Learning Models\u003c/h2\u003e \u003cp\u003eThe following classical models were developed using Scikit-learn:\u003c/p\u003e \u003cp\u003eLogistic Regression (LR): A linear classifier that uses the sigmoid function to estimate class probabilities. It is fast, interpretable, and works well with linearly separable data.\u003c/p\u003e \u003cp\u003eSupport Vector Machine (SVM): Employs hyperplanes to separate sentiment classes with maximum margin. The radial basis function (RBF) kernel was tested for non-linear separation.\u003c/p\u003e \u003cp\u003eNa\u0026iuml;ve Bayes (NB): A probabilistic classifier based on Bayes\u0026rsquo; theorem. 
It assumes feature independence and is particularly effective for text classification due to its simplicity.\u003c/p\u003e \u003cp\u003eDecision Tree (DT): Builds a hierarchical tree structure based on entropy or Gini impurity to classify data. It is prone to overfitting but provides intuitive rules.\u003c/p\u003e \u003cp\u003eRandom Forest (RF): An ensemble of decision trees built using bootstrap aggregation. It improves generalization and reduces variance, often outperforming individual trees.\u003c/p\u003e \u003cp\u003eEach model was trained using TF-IDF vectors and evaluated using 5-fold cross-validation. Hyperparameters (e.g., C in SVM, max_depth in RF) were optimized using grid search.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec18\" class=\"Section3\"\u003e \u003ch2\u003e3.5.2 Deep Learning Models\u003c/h2\u003e \u003cp\u003eImplemented using TensorFlow and Keras, these models exploit neural architectures to automatically learn hierarchical features from text:\u003c/p\u003e \u003cp\u003eConvolutional Neural Network (CNN): A one-dimensional CNN architecture was used with multiple filters (kernel sizes\u0026thinsp;=\u0026thinsp;2, 3, 4) and ReLU activations, followed by global max-pooling and a dense softmax output layer. CNNs are efficient in capturing local features and sentiment-bearing n-grams.\u003c/p\u003e \u003cp\u003eLong Short-Term Memory (LSTM): A sequence model capable of capturing long-term dependencies. The architecture consisted of a single LSTM layer followed by dropout and a dense classification head. LSTM was chosen for its ability to retain context across longer input sequences.\u003c/p\u003e \u003cp\u003eBoth models used pre-trained GloVe embeddings and were trained for 5 epochs using the Adam optimizer, binary cross-entropy loss, and batch size of 64. 
Early stopping was applied to avoid overfitting.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec19\" class=\"Section3\"\u003e \u003ch2\u003e3.5.3 Transformer-Based Models\u003c/h2\u003e \u003cp\u003eTransformer models were fine-tuned using the HuggingFace Transformers library with the PyTorch backend. The following models were explored:\u003c/p\u003e \u003cp\u003eELMo (Embeddings from Language Models): Generates context-aware embeddings using deep bi-directional LSTMs. Implemented via AllenNLP for comparative analysis.\u003c/p\u003e \u003cp\u003eBERT (Bidirectional Encoder Representations from Transformers): Fine-tuned using bert-base-uncased. It utilizes bidirectional self-attention and is pre-trained using masked language modeling and next sentence prediction.\u003c/p\u003e \u003cp\u003eRoBERTa: A robust optimization of BERT that removes next sentence prediction and trains with dynamic masking on longer sequences. It achieved the highest performance across all models.\u003c/p\u003e \u003cp\u003eDistilBERT: A compressed version of BERT with fewer parameters, providing faster inference while retaining\u0026thinsp;~\u0026thinsp;97% of BERT\u0026rsquo;s performance.\u003c/p\u003e \u003cp\u003eELECTRA: Trained using a novel pre-training task (replaced token detection) instead of masked language modeling, making it faster and more efficient.\u003c/p\u003e \u003cp\u003eT5 (Text-to-Text Transfer Transformer): Casts every NLP task into a text-to-text format, allowing for greater flexibility. Fine-tuned using the t5-small checkpoint for sentiment classification as a text generation problem.\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e"},{"header":"4 Results and Discussion","content":"\u003cp\u003eThe performance evaluation of the sentiment classification models was conducted using four primary metrics: accuracy, precision, recall, and F1-score. Each model was tested under identical preprocessing and training conditions to ensure fairness. 
The results are structured across three categories\u0026mdash;traditional machine learning (ML) models, deep learning (DL) architectures, and transformer-based language models.\u003c/p\u003e \u003cdiv id=\"Sec21\" class=\"Section2\"\u003e \u003ch2\u003e4.1 Performance of Traditional Machine Learning Models\u003c/h2\u003e \u003cp\u003eThe traditional ML models were trained using TF-IDF representations of the Amazon product reviews. These models performed moderately well in terms of baseline sentiment classification accuracy as shown in Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003ePerformance of Traditional ML Models\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"5\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eModel\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eAccuracy\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003ePrecision\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eRecall\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e 
\u003cp\u003eF1-score\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLogistic Regression\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.8811\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.8811\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.8807\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.8810\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eDecision Tree\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.7508\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.7508\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.7508\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.7508\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eRandom Forest\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.8533\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.8536\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.8533\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.8533\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eNa\u0026iuml;ve Bayes\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.8376\u003c/p\u003e 
\u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.8378\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.8376\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.8376\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eSVM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.8803\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.8804\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.8803\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.8803\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cb\u003eObservation\u003c/b\u003e:\u003c/p\u003e \u003cp\u003eLogistic Regression and SVM yielded the highest accuracy (~\u0026thinsp;88%) among traditional models, demonstrating strong generalization on high-dimensional TF-IDF vectors. Decision Trees underperformed due to overfitting and lack of contextual learning. 
Overall, traditional models lacked the capacity to capture sequential and semantic relationships in text.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec22\" class=\"Section2\"\u003e \u003ch2\u003e4.2 Performance of Deep Learning Models\u003c/h2\u003e \u003cp\u003eThe deep learning models, CNN and LSTM, were trained on pre-trained GloVe embeddings, which supply word semantics, while the convolutional and recurrent layers model word order. Results are shown in Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003ePerformance of DL Models\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"5\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eModel\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eAccuracy\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003ePrecision\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eRecall\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eF1-score\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e 
\u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCNN\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.9091\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.9091\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.9091\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9091\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLSTM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.9137\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.9137\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.9137\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9137\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cb\u003eObservation\u003c/b\u003e:\u003c/p\u003e \u003cp\u003eBoth CNN and LSTM outperformed all traditional ML models. LSTM slightly edged over CNN, reflecting its superior ability to capture long-range dependencies and sequential patterns. However, both models required more computational resources and longer training time.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec23\" class=\"Section2\"\u003e \u003ch2\u003e4.3 Performance of Transformer-Based Models\u003c/h2\u003e \u003cp\u003eTransformer models were fine-tuned on the same dataset using contextualized token embeddings. 
With the exception of ELMo, these models outperformed every ML and DL baseline on all four evaluation metrics, as shown in Table\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab3\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003ePerformance of Transformer-based Models\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"5\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eModel\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eAccuracy\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003ePrecision\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eRecall\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eF1-score\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eELMo\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.8014\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.8015\u003c/p\u003e \u003c/td\u003e \u003ctd 
align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.8014\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.8014\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eDistilBERT\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.9498\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.9501\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.9498\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9498\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eELECTRA\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.9513\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.9514\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.9513\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9513\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eT5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.9465\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.9468\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.9465\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9465\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eBERT\u003c/p\u003e \u003c/td\u003e \u003ctd 
align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.9594\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.9600\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.9594\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9594\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eRoBERTa\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003e0.9636\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e\u003cb\u003e0.9639\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e\u003cb\u003e0.9636\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e\u003cb\u003e0.9636\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cb\u003eObservation\u003c/b\u003e:\u003c/p\u003e \u003cp\u003eRoBERTa achieved the highest score on every metric, ahead of all other models, including BERT, ELECTRA, and DistilBERT.
Its improved pretraining strategy, which uses dynamic masking and a larger training corpus, likely enabled it to learn deeper contextual associations, which are crucial for accurate sentiment detection.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec24\" class=\"Section2\"\u003e \u003ch2\u003e4.4 Comparative Analysis and Best Model Justification\u003c/h2\u003e \u003cp\u003eThe comparative performance (accuracy) of all models across categories is shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eTraditional ML models serve as fast and lightweight baselines but are inherently limited by sparse vector representations and a lack of contextual understanding.\u003c/p\u003e \u003cp\u003eDeep learning (DL) models capture richer semantic features and achieve higher accuracy (~\u0026thinsp;91%) but still lag behind transformer models because of their limited ability to model bidirectional context.\u003c/p\u003e \u003cp\u003eTransformer-based models show a marked leap in performance, with RoBERTa achieving the highest accuracy (96.36%). This result is consistent with the advantages of attention mechanisms, bidirectional context, and large-scale pretraining. RoBERTa's lead can be attributed to several factors. First, RoBERTa keeps BERT's architecture but replaces static masking with dynamic masking and trains on a larger corpus, enabling it to learn richer contextual relationships in the data. Second, its pretraining uses more data and longer sequences, which improves its ability to capture long-range dependencies. Together, these changes to the training methodology allow RoBERTa to better handle the intricacies of sentiment classification, especially in complex or ambiguous text.
This result positions RoBERTa as the most effective model in this study, ahead of the other transformers evaluated, including BERT, ELECTRA, and DistilBERT. It not only achieves the highest accuracy but also maintains balanced performance across precision, recall, and F1-score.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec25\" class=\"Section2\"\u003e \u003ch2\u003e4.5 Benefits and Applications\u003c/h2\u003e \u003cp\u003eSentiment analysis of product reviews holds significant value across multiple domains. In e-commerce, businesses can use sentiment classification models to assess customer feedback, identify trends, and refine marketing strategies. Automated sentiment analysis allows companies to process vast amounts of user-generated content, providing insights into consumer satisfaction and potential areas of product improvement. Furthermore, customer service departments can use sentiment analysis tools to prioritize and address negative reviews efficiently, enhancing user experience and brand reputation.\u003c/p\u003e \u003cp\u003eBeyond e-commerce, sentiment analysis has applications in social media monitoring, financial market analysis, political opinion mining, and healthcare. Social media platforms can employ sentiment classification to detect public opinion on brands, policies, or global events. Financial analysts can use sentiment trends to help predict stock market fluctuations, while political organizations can gauge public sentiment towards candidates or policies.
Additionally, in healthcare, sentiment analysis can be applied to patient reviews of medical services, aiding in quality assessment and service enhancements.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec26\" class=\"Section2\"\u003e \u003ch2\u003e4.6 Importance of Transformer-Based Approaches\u003c/h2\u003e \u003cp\u003eThe evolution of sentiment analysis from traditional models to deep learning-based methods has led to significant advancements in accuracy and efficiency. Transformer-based models address limitations of earlier techniques by better capturing contextual nuances, sarcasm, and complex linguistic structures. Unlike recurrent neural networks (RNNs), which process text one token at a time, transformers use self-attention to process all tokens in a sequence in parallel, which shortens training time on parallel hardware and improves scalability.\u003c/p\u003e \u003cp\u003eThe findings of this study emphasize the growing need to adopt transformer-based architectures in sentiment analysis applications. As the volume of user-generated textual data continues to expand, efficient and accurate sentiment classification models become crucial for data-driven decision-making. The ability to automate and analyze customer sentiments in real time has far-reaching implications, empowering organizations to make informed strategic decisions based on consumer insights.\u003c/p\u003e \u003c/div\u003e"},{"header":"5 Conclusions","content":"\u003cp\u003eThis study presents a comprehensive evaluation of sentiment analysis techniques across three major paradigms: traditional machine learning, deep learning, and transformer-based models.
By standardizing the experimental setup, including preprocessing, feature representation, and evaluation metrics, the study offers a fair and rigorous comparison of models on the Amazon product review dataset.\u003c/p\u003e \u003cp\u003eAmong traditional models, Logistic Regression and SVM delivered strong baseline performance, though their reliance on sparse feature vectors limited their ability to model complex linguistic dependencies. Deep learning models such as CNN and LSTM addressed these limitations by learning semantic features directly from data. LSTM, in particular, demonstrated the ability to capture sequential dependencies in sentiment-bearing phrases, achieving over 91% accuracy.\u003c/p\u003e \u003cp\u003eTransformer-based models emerged as the most robust and accurate class of models. With their attention mechanisms and deep contextual understanding, models like BERT and ELECTRA significantly outperformed previous methods. Notably, RoBERTa achieved the highest performance across all metrics, including an accuracy of 96.36%, which supports the effectiveness of its optimized training strategy.\u003c/p\u003e \u003cp\u003eThese findings reaffirm the importance of context-aware language models in sentiment analysis and suggest that transformer architectures are currently the most suitable tools for real-world sentiment classification applications. Their deployment in e-commerce platforms, customer service systems, and social media analytics can substantially improve the quality and scalability of consumer insight extraction.\u003c/p\u003e"},{"header":"6 Future Directions","content":"\u003cp\u003eWhile this study establishes the effectiveness of transformer-based models for sentiment analysis, several areas remain open for further exploration. One promising direction is multi-modal sentiment analysis, which integrates text, images, and audio cues to enhance classification accuracy.
Additionally, fine-tuning transformer models on domain-specific datasets could improve performance by adapting them to specialized language patterns and sentiment expressions.\u003c/p\u003e \u003cp\u003eAnother crucial aspect is explainability and interpretability in transformer-based models. As sentiment analysis becomes increasingly integrated into decision-making processes, developing methods to explain and justify model predictions is essential. Future research could explore techniques such as attention visualization, feature attribution, and explainable AI (XAI) frameworks to enhance transparency in sentiment classification.\u003c/p\u003e \u003cp\u003eFurthermore, real-time sentiment analysis presents an exciting challenge, particularly in resource-constrained environments such as mobile devices and edge computing. Optimizing transformer models for low-latency inference and reduced computational complexity could enable real-time applications in customer feedback monitoring, social media analysis, and personalized recommendation systems.\u003c/p\u003e \u003cp\u003eOverall, the advancements in deep learning and NLP will continue to drive improvements in sentiment analysis, enabling more accurate, context-aware, and scalable AI-driven solutions for diverse applications.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e \u003ch2\u003eCompeting Interests\u003c/h2\u003e \u003cp\u003eThe authors have no relevant financial or non-financial interests to disclose.\u003c/p\u003e \u003c/p\u003e\u003ch2\u003eFunding\u003c/h2\u003e \u003cp\u003eThe authors declare that no funds, grants, or other support were received during the preparation of this manuscript.\u003c/p\u003e\u003ch2\u003eData Availability\u003c/h2\u003e \u003cp\u003eThe datasets analysed during the current study are available in the TensorFlow-Sentiment-Analysis-on-Amazon-Reviews-Data repository,\u003c/p\u003e \u003cp\u003e \u003cspan class=\"ExternalRef\"\u003e\u003cspan 
class=\"RefSource\"\u003ehttps://github.com/MuhammedBuyukkinaci/TensorFlow-Sentiment-Analysis-on-Amazon-Reviews-Data/tree/master/dataset\u003c/span\u003e\u003cspan address=\"https://github.com/MuhammedBuyukkinaci/TensorFlow-Sentiment-Analysis-on-Amazon-Reviews-Data/tree/master/dataset\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e \u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003ePang B, Lee L, Vaithyanathan S (2002) Thumbs up? Sentiment classification using machine learning techniques. arXiv preprint cs/0205070\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFang X, Zhan J (2015) Sentiment analysis using product review data. J Big Data 2(1):1\u0026ndash;14\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAnto MP et al (2016) Product rating using sentiment analysis, in \u003cem\u003eProc. IEEE Int. Conf. Electrical, Electronics, and Optimization Techniques (ICEEOT)\u003c/em\u003e, pp. 3458\u0026ndash;3462\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSindhu C, Vyas DV, Pradyoth K (2017) Sentiment analysis based product rating using textual reviews, in \u003cem\u003eProc. IEEE Int. Conf. Electronics, Communication and Aerospace Technology (ICECA)\u003c/em\u003e, vol. 2, pp. 727\u0026ndash;731\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eShivaprasad TK, Shetty J (2017) Sentiment analysis of product reviews: A review, in \u003cem\u003eProc. IEEE Int. Conf. Inventive Communication and Computational Technologies (ICICCT)\u003c/em\u003e, pp. 298\u0026ndash;301\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSingh J, Singh G, Singh R (2017) Optimization of sentiment analysis using machine learning classifiers. Human-centric Comput Inf Sci 7(1):1\u0026ndash;12\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBose R et al (2019) Analyzing political sentiment using Twitter data, in \u003cem\u003eProc. 
ICTIS 2018\u003c/em\u003e, Springer, Singapore, pp. 427\u0026ndash;436\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePratama Y et al (2019) Implementation of sentiment analysis on Twitter using Na\u0026iuml;ve Bayes algorithm to know the people responses to debate of DKI Jakarta governor election. J Phys Conf Ser 1175:012102\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePaltoglou G, Thelwall M (2012) Twitter, MySpace, Digg: Unsupervised Sentiment Analysis in Social Media. ACM Trans Intell Syst Technol 3(4):1\u0026ndash;19\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKalchbrenner N, Grefenstette E, Blunsom P (2014) A convolutional neural network for modelling sentences. arXiv preprint arXiv:1404.2188\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhou C, Sun C, Liu Z, Lau F (2015) A C-LSTM neural network for text classification. arXiv preprint arXiv:1511.08630\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eJain PK, Saravanan V, Pamula R (2021) A hybrid CNN-LSTM: A deep learning approach for consumer sentiment analysis using qualitative user-generated contents. ACM Trans Asian Low-Resour Lang Inf Process 20(5):1\u0026ndash;15\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSimanihuruk L, Suparwito H (2025) Long Short-Term Memory and Bidirectional Long Short-Term Memory Algorithms for Sentiment Analysis of Skintific Product Reviews, in \u003cem\u003eITM Web Conf.\u003c/em\u003e, vol. 71, p. 01016\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDong Y et al (2025) DC-BiLSTM-CNN Algorithm for Sentiment Analysis of Chinese Product Reviews. Appl Artif Intell, 39, 1\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eYang L, Li Y, Wang J, Sherratt RS (2020) Sentiment analysis for E-commerce product reviews in Chinese based on sentiment lexicon and deep learning. 
IEEE Access 8:23522\u0026ndash;23530\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePriyadarshini I, Cotton C (2021) A novel LSTM\u0026ndash;CNN\u0026ndash;grid search-based deep neural network for sentiment analysis. J Supercomput 77:13911\u0026ndash;13932\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFeizollah A et al (2019) Halal products on Twitter: Data extraction and sentiment analysis using stack of deep learning algorithms. IEEE Access 7:83354\u0026ndash;83362\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAbid F et al (2019) Sentiment analysis through recurrent variants latterly on convolutional neural network of Twitter. Future Gener Comput Syst 95:292\u0026ndash;308\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhao J et al (2020) User personality prediction based on topic preference and sentiment analysis using LSTM model. Pattern Recognit Lett 138:397\u0026ndash;402\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHaque TU, Saber NN, Shah FM (2018) Sentiment analysis on large scale Amazon product reviews, in \u003cem\u003eProc. IEEE Int. Conf. Innov. Res. Dev. (ICIRD)\u003c/em\u003e, pp. 1\u0026ndash;6\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLi Y et al (2017) Learning word representations for sentiment analysis. Cogn Comput 9:843\u0026ndash;851\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eQorich M, Ouazzani RE (2023) Text sentiment classification of Amazon reviews using word embeddings and convolutional neural networks. J Supercomput 79(10):11029\u0026ndash;11054\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMuhammad PF, Kusumaningrum R, Wibowo A (2021) Sentiment analysis using Word2vec and long short-term memory (LSTM) for Indonesian hotel reviews. 
Procedia Comput Sci 179:728\u0026ndash;735\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBarry J (2017) Sentiment Analysis of Online Reviews Using Bag-of-Words and LSTM Approaches, in \u003cem\u003eProc. AICS\u003c/em\u003e, pp. 272\u0026ndash;274\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAbdi A et al (2019) Deep learning-based sentiment classification of evaluative text based on Multi-feature fusion. Inf Process Manag 56(4):1245\u0026ndash;1259\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eVaswani A et al (2017) Attention is all you need, in \u003cem\u003eProc. Advances in Neural Information Processing Systems (NeurIPS)\u003c/em\u003e, pp. 5998\u0026ndash;6008\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDevlin J, Chang M-W, Lee K, Toutanova K (2018) BERT: Pre-training of deep bidirectional transformers for language understanding, \u003cem\u003earXiv preprint\u003c/em\u003e arXiv:1810.04805\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLiu Y et al (2019) RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNoriega I et al (2023) Sentiment Analysis of Amazon Reviews using Deep Learning Techniques, \u003cem\u003eRG Publication\u003c/em\u003e, [Online]. Available: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.13140/RG.2.2.35547.75046\u003c/span\u003e\u003cspan address=\"10.13140/RG.2.2.35547.75046\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWang Y (2025) Sentiment Analysis of Product Reviews Using Fine-Tuned LLaMa-3 Model, in \u003cem\u003eITM Web Conf.\u003c/em\u003e, vol. 70, p. 
04021\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSun C, Huang L, Qiu X (2019) Utilizing BERT for aspect-based sentiment analysis via constructing auxiliary sentence. arXiv preprint arXiv:1903.09588\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZheng W et al (2023) Lightweight Multilayer Interactive Attention Network for Aspect-Based Sentiment Analysis. Connection Sci, 35, 1\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDerbentsev VD et al (2022) A comparative study of deep learning models for sentiment analysis of social media texts, in \u003cem\u003eM3E2-MLPEED\u003c/em\u003e, pp. 168\u0026ndash;188\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDhaoui C, Webster CM, Tan LP (2017) Social media sentiment analysis: lexicon versus machine learning. J Consum Mark 34(6):480\u0026ndash;488\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"Sentiment Analysis, Natural Language Processing, Transformer Models, Product Reviews, Machine Learning, Deep Learning, Text Classification, RoBERTa, Opinion Mining","lastPublishedDoi":"10.21203/rs.3.rs-6885027/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-6885027/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eThe proliferation of e-commerce platforms has led to an explosion of user-generated content, particularly in the form of product reviews. These reviews offer valuable insights into consumer sentiment, influencing business strategies and customer satisfaction initiatives. This study undertakes a comprehensive evaluation of sentiment analysis models applied to product reviews, spanning traditional machine learning algorithms, deep learning architectures, and state-of-the-art transformer-based models. The models evaluated include Logistic Regression, Support Vector Machines, Decision Trees, Random Forests, Na\u0026iuml;ve Bayes, Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM), and advanced transformers such as ELMo, DistilBERT, ELECTRA, T5, BERT, and RoBERTa. Using a standardized Amazon product review dataset, models were assessed on accuracy, precision, recall, and F1-score. Results indicate that transformer-based models significantly outperform their predecessors, with RoBERTa achieving the highest accuracy of 96.36%. 
These findings underscore the growing importance of transformer architectures in sentiment classification, offering promising directions for real-time applications in e-commerce, social analytics, and recommendation systems.\u003c/p\u003e","manuscriptTitle":"Advancing Sentiment Analysis on Product Reviews: A Comparative Evaluation of Classical, Deep Learning, and Transformer Models","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2026-01-14 19:05:50","doi":"10.21203/rs.3.rs-6885027/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"df1919e1-8796-4d53-b3d7-94381038dc06","owner":[],"postedDate":"January 14th, 2026","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[],"tags":[],"updatedAt":"2026-02-10T09:02:59+00:00","versionOfRecord":[],"versionCreatedAt":"2026-01-14 19:05:50","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-6885027","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-6885027","identity":"rs-6885027","version":["v1"]},"buildId":"XKTyCvWXoU3ODBz1xrDgd","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}