Detection of Adult Content in Arabic Tweets Using Machine Learning Models

Aram Ibrahim Al-anazi

Research Article, Research Square preprint. This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-7579505/v1
This work is licensed under a CC BY 4.0 License.
Status: Posted (Version 1)

Abstract

This study evaluates the effectiveness of various machine learning and deep learning models in detecting adult content in Arabic tweets, addressing unique linguistic and cultural challenges. Using a dataset of 33,691 Arabic tweets, we implemented and compared Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), Long Short-Term Memory networks (LSTM), and AraBERT. The data underwent thorough preprocessing, including cleaning, tokenization, and segmentation into training, validation, and test sets. Performance metrics such as accuracy, F1 score, and confusion matrices were used to assess model efficacy.
AraBERT achieved the highest accuracy (100%), demonstrating a superior capability to capture contextual patterns for content classification. CNN and RNN also performed well, with accuracies of 94.27% and 94.22%, respectively, while LSTM achieved an accuracy of 88.37%. These findings highlight AraBERT's potential for effective content moderation in Arabic digital spaces, contributing to safer online environments.

Keywords: Artificial Intelligence and Machine Learning, Arabic Tweets, AraBERT, Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM), Natural Language Processing (NLP), Recurrent Neural Networks (RNN), Text Classification

INTRODUCTION

Adult content detection has become an increasingly critical issue for online platforms due to the rapid expansion of data exchanged across digital spaces. Social media platforms, websites, and other online environments handle vast amounts of information every second, resulting in a dynamic yet demanding content moderation ecosystem. The availability of explicit information online, often without adequate age controls or permission procedures, raises substantial concerns, particularly for vulnerable users such as children and teenagers. The risks of exposure to inappropriate content extend beyond immediate harm, potentially affecting psychological development, behavior, and societal norms [1].

Traditional approaches for identifying adult content have evolved significantly, from manual moderation to automated systems powered by machine learning (ML) and deep learning (DL) technologies. Early methods, such as keyword filtering and image-based detection, were limited in their ability to handle the complexities of modern content. Advances in ML and DL have led to the development of sophisticated algorithms capable of analyzing large datasets, recognizing subtle patterns, and providing real-time results [2].
Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM) networks, and transformer-based architectures like BERT and its variants have shown promising results in text and image classification tasks, including the detection of explicit content [3], [4].

Despite these advancements, there remains a significant gap in research focused on detecting adult content in the Arabic digital realm compared to other languages. Arabic presents unique linguistic challenges, including its rich morphology, diverse dialects, and frequent use of colloquialisms, which complicate automated content moderation [5]. Additionally, the inclusion of emoticons, URLs, and hashtags in Arabic tweets further complicates text processing, necessitating more advanced and adaptive algorithms [6].

This study aims to address these challenges by evaluating the effectiveness of various ML and DL models in detecting adult content in Arabic tweets. The study investigates models including CNN, RNN, LSTM, and AraBERT to determine the most effective methods for achieving this aim. The results reveal the capabilities and limitations of current systems and highlight the need for continued innovation in content moderation. Finally, our study contributes to developing safer digital environments, especially for Arabic-speaking users, by improving our understanding of how to efficiently and reliably filter explicit information in text content.

RELATED WORK

The dataset is organized into several categories to capture the diverse linguistic and cultural aspects of Arabic social media content. It includes tweets in various Arabic dialects, such as Gulf, Levantine, and Egyptian, showcasing the linguistic diversity of the Arabic-speaking world. The dataset also features informal text styles common on social media, including emojis, hashtags, and abbreviations.
Additionally, it contains sensitive language, including vulgar words and satirical and metaphorical references to adult entertainment, reflecting the nuanced ways in which adult content is expressed in Arabic. Contextual information, such as hashtags, mentions, and links, is retained to enhance the accuracy of natural language processing models in identifying semantic and syntactic patterns.

Several studies have focused on the automatic identification of adult content, whether in image-based, video-based, or text-based formats. Appati et al. (2021) provided a comprehensive review of image analysis techniques for adult content detection, focusing on both traditional and deep learning methods [1]. They categorized the approaches into Region of Interest (ROI) techniques and deep learning methods. ROI techniques, such as skin pixel ratio and explicit content weighting, utilize algorithms like Support Vector Machines (SVM) and k-Nearest Neighbors (KNN) for classification. In contrast, deep learning methods, particularly Convolutional Neural Networks (CNNs), have shown superior performance in detecting explicit content due to their ability to learn complex patterns from large datasets.

Gajula et al. (2020) proposed a model based on supervised learning using the Support Vector Machine (SVM) algorithm to detect and blur explicit content in images. The model was trained on a dataset containing 7,000 pornographic images and 7,000 normal images, achieving an accuracy of 97.8% during testing. The classification process was based on the percentage of skin exposed in the images. If an image was classified as pornographic, it was then processed to blur the explicit content using image processing techniques. This approach ensured that end-users were not exposed to inappropriate material, making it a robust solution for protecting children from adult content on the internet [7].

Dubettier et al.
(2023) conducted a comparative study to evaluate the efficacy of various methods for detecting sexual content in images, aiming to protect children and enhance digital forensic investigations. The study assessed five tools: the nsfw model, NudeNet, NuDetective, SkinDetection, and DeepPornDetection, each employing distinct techniques such as skin detection, deep learning, or transfer learning. Using three datasets with varying degrees of explicit content and complexity, the researchers found that the nsfw model and NudeNet achieved the highest accuracy across datasets, while DeepPornDetection performed best on the dataset it was trained on, indicating a training bias. The study highlighted several challenges, including the insufficiency of skin detection alone, as high skin exposure does not always indicate sexual content, and some explicit images have low skin exposure. Additionally, the findings underscored the subjectivity in distinguishing between acceptable and harmful content due to cultural and environmental factors. While the nsfw model and NudeNet showed potential, the authors concluded that further enhancements are needed. They recommended future research to explore adaptive criteria based on cultural variables for more precise screening [4].

Ochoa et al. (2012) explored machine learning-based strategies for recognizing pornographic video material, emphasizing the integration of spatial and temporal data [2]. Their study demonstrated the effectiveness of combining spatial features, such as skin detection and color histograms, with temporal features like shot duration and camera motion. The use of SVM classifiers achieved an accuracy rate of up to 94.44%, highlighting the potential of machine learning techniques in this domain.

Barrientos et al. (2020) conducted a comprehensive study on the use of machine learning techniques to detect inappropriate erotic content in text, with a focus on protecting children from exposure to such material.
The study highlighted the growing need for automated moderation tools due to the impracticality of manual moderation in the face of increasing user-generated content. The researchers evaluated twelve models, formed by pairing each of three text encoders (Bag of Words, TF-IDF, and Word2Vec) with each of four classifiers (SVM, Logistic Regression, k-Nearest Neighbors, and Random Forest), to identify unsuitable material. Using a dataset of over 110,000-word samples from Reddit, categorized as sexual or neutral, they found that the combination of TF-IDF and SVM (linear kernel) achieved the highest performance, with an accuracy of 97% and an F-score of 0.96. This study underscored the effectiveness of machine learning approaches in real-time content filtering, particularly for social networks. The authors also suggested future research directions, including the exploration of deep learning models and feature reduction techniques to further enhance content detection systems. The findings demonstrated the practicality and potential of automated tools in maintaining a safer online environment for children [3].

Hamdy et al. (2021) conducted a study focused on identifying explicit content in Arabic tweets. The researchers created a dataset comprising 50,000 Twitter accounts, with 6,000 identified as adult content accounts. This dataset was meticulously annotated using Arabic-related keywords and hashtags. The analysis revealed that adult tweets are generally shorter, use fewer words, and contain more URLs and emojis compared to non-adult tweets. Significant patterns were identified in the use of words, emojis, and hashtags. The study evaluated several machine learning models, including Support Vector Machines (SVM), FastText, multilingual BERT, and AraBERT. Among these, AraBERT outperformed the others, achieving an F1 score of 96.8% when combining user data with tweet content.
The researchers concluded that even basic information, such as usernames and brief descriptions, can be effective in identifying sexual content accounts. They suggested that future research could explore multimodal content analysis to further enhance detection accuracy [8].

These studies provided a foundation for our research, highlighting the strengths and limitations of existing approaches. Our study aims to build on this foundation by evaluating the performance of various machine learning and deep learning models in detecting adult content in Arabic tweets, addressing the identified gaps in the literature (Table 1).

Table 1: Summary of the key findings from the related work, comparing the performance of different models and techniques in detecting adult content

| Study | Approach | Dataset | Accuracy | Key Findings |
|---|---|---|---|---|
| Appati et al. | ROI Techniques, CNNs | Image Data | 94.44% | CNNs outperform traditional methods |
| Ochoa et al. | SVM, Spatial and Temporal Features | Video Data | 94.44% | Integration of spatial and temporal data improves accuracy |
| Gajula et al. | SVM, Skin Percentage | Image Data | 97.8% | High accuracy in detecting and blurring explicit content |
| Barrientos et al. | TF-IDF, SVM | Text Data | 97% | Effective real-time content filtering for social networks |
| Dubettier et al. | nsfw model, NudeNet | Image Data | High accuracy | Challenges in skin detection and cultural subjectivity |
| Hamdy et al. | AraBERT | Arabic Tweets | 96.8% | Effective in identifying explicit content in Arabic tweets |

DATASET

In this study, we used the dataset from the previous study by Hamdy et al. (2021) [8] to evaluate adult content detection on Arabic Twitter. The dataset consists of 33,691 samples with two main columns, named "text2" and "categories." The "text2" column contains the full tweet text or main textual content, which includes various linguistic structures ranging from Fusha (formal Arabic) to Ammiya (colloquial/spoken Arabic) and code-switching between Arabic and English.
The "categories" column indicates whether the tweet contains adult content or not, with two classes: ADULT (1) and NOT_ADULT (0). The dataset was meticulously collected and annotated by expert linguists and content reviewers to ensure high quality and reliability. They manually evaluated and annotated tweets using Arabic-related keywords and hashtags to accurately identify and classify adult content.

The dataset encompasses a wide range of linguistic, artistic, and expressive features, including tweets in Gulf, Levantine, and Egyptian Arabic dialects, reflecting the linguistic diversity of the Arabic-speaking world. It also includes informal text styles common on social media, such as emojis, hashtags, and abbreviations. Additionally, the dataset contains vulgar words and satirical and metaphorical references to adult entertainment, capturing the nuanced ways in which adult content is expressed in Arabic.

Contextual information typically found in social media content, such as hashtags, mentions, and links, is retained in the dataset. This contextual data is crucial for advanced natural language processing (NLP) models, as it helps identify semantic and syntactic patterns that improve the accuracy of adult content classification.

Data Preprocessing

To prepare the dataset for model training, several preprocessing steps were undertaken (Fig. 1):

Data Cleaning: Duplicate tweets and language flaws that could skew the study findings were removed.
Tokenization: The text input was tokenized using a tokenizer to convert the tweets into a format suitable for model training.
Label Encoding: The labels were converted to numerical values using LabelEncoder to facilitate model training and evaluation.
Data Splitting: The dataset was split into training, validation, and test sets using a 70/30 ratio, ensuring a robust evaluation of the models.
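The preprocessing steps listed above can be illustrated with a minimal sketch. This is not the study's code: the function names are hypothetical, a whitespace tokenizer stands in for the unspecified tokenizer, a fixed dictionary stands in for LabelEncoder, and the 30% held out from training is assumed to be divided evenly between validation and test, which the text does not state.

```python
import random

# Label convention taken from the dataset description: ADULT (1), NOT_ADULT (0).
LABEL_TO_ID = {"NOT_ADULT": 0, "ADULT": 1}


def deduplicate(rows):
    """Keep the first occurrence of each tweet text and drop later duplicates."""
    seen = set()
    unique = []
    for text, label in rows:
        if text not in seen:
            seen.add(text)
            unique.append((text, label))
    return unique


def tokenize(text):
    """Whitespace tokenizer, a stand-in for the unspecified tokenizer (assumption)."""
    return text.split()


def encode_labels(labels):
    """Convert class names to integers using the fixed mapping above."""
    return [LABEL_TO_ID[name] for name in labels]


def split_dataset(rows, train_percent=70, seed=42):
    """Shuffle, then split into train/validation/test sets.

    Assumption: the 30% not used for training is halved between
    validation and test; the source only states a 70/30 ratio.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_percent // 100
    n_val = (len(shuffled) - n_train) // 2
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, validation, test


if __name__ == "__main__":
    # Placeholder rows, not real tweets; the last row duplicates the first.
    rows = [(f"placeholder tweet {i}", "ADULT" if i % 2 else "NOT_ADULT")
            for i in range(20)]
    rows.append(rows[0])
    unique = deduplicate(rows)
    train, validation, test = split_dataset(unique)
    print(len(unique), len(train), len(validation), len(test))
```

Integer arithmetic is used for the split sizes so that the partition is exact and reproducible for a given seed; a stratified split, which keeps the ADULT/NOT_ADULT proportions equal across the three sets, would be a natural refinement.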
Dataset Size and Diversity

The dataset's large size and diversity enable robust machine-learning training for the automatic identification and classification of adult content on Arabic social media sites. This wealth of cultural and expressive dimensions helps in crafting strong, generalizable models that are well-suited to modeling the complexity and nuances of Arabic text.

Generalizability

With such diverse and realistic data, the models trained on this dataset can generalize well across various scenarios, making this dataset an essential part of the solution for building tools to monitor and filter out adult content on Arabic-language social media websites. The inclusion of contextual information and the representation of different dialects and informal text styles further enhance the dataset's utility for developing effective content moderation systems.

METHODOLOGY

In this study, we utilized a dataset of 33,691 tweets classified as adult or non-adult material. The methodology involved several key processes to ensure the accuracy and reliability of the results:

(2) Data Preprocessing

1. Data Cleaning: The dataset was meticulously cleaned to remove duplicate tweets and correct language flaws that could skew the study findings. This step ensured that the data was of high quality and free from inconsistencies.
2. Tokenization: The text input was tokenized using a tokenizer, which converted the tweets into a format suitable for model training. This process involved breaking down the text into individual tokens or words, which could then be analyzed by the models.
3. Label Encoding: Labels were converted to numerical values using LabelEncoder. This step facilitated the training and evaluation of the models by providing a standardized format for the classification labels.
4. Data Splitting: The dataset was split into training, validation, and test sets using a 70/30 ratio.
This approach ensured that the models were trained on a substantial portion of the data while retaining enough data for validation and testing to assess model performance accurately.

(3) Models Used

We employed several machine learning and deep learning models to evaluate their effectiveness in detecting adult content in Arabic tweets:

1. Convolutional Neural Network (CNN): The CNN model consisted of embedding layers, convolutional layers (Conv1D), and pooling layers. This architecture enabled the capture of spatial patterns in the text, making it suitable for text classification tasks.
2. Long Short-Term Memory Network (LSTM): The LSTM model was designed to handle sequential dependencies in the text. It included layers for long-term memory, allowing it to capture contextual meaning across long sequences, which is particularly beneficial for lengthier sentences.
3. Recurrent Neural Network (RNN): A simple RNN was used to analyze text sequentially using tanh activation functions. This model was appropriate for short to medium-length texts. For binary classification, we employed a dense output layer with sigmoid activation.
4. AraBERT Model: Specifically designed for Arabic, the AraBERT model was trained on a large corpus of Arabic texts and optimized to handle the unique features of the language, including its morphology and syntax. We used the pretrained aubmindlab/bert-base-arabertv02 model, configured with Trainer to use a low learning rate and a small batch size. Few-shot learning was also investigated by evaluating performance with a small collection of examples.

(4) Model Training and Evaluation

1. Training: Each model was trained on the training set, with hyperparameters tuned to optimize performance. The training process involved adjusting the model parameters to minimize the loss function and improve classification accuracy.
2. Validation: The validation set was used to monitor the models' performance during training and prevent overfitting. Techniques such as dropout and batch normalization were employed to enhance model generalization.
3. Testing: The final evaluation of the models was conducted on the test set. Performance metrics such as accuracy, F1 score, and confusion matrices were used to assess how successfully each model distinguished explicit content from safe material.

(5) Evaluation Metrics

1. Accuracy: The proportion of correctly classified tweets out of the total number of tweets.
2. F1 Score: The harmonic mean of precision and recall, providing a balanced measure of model performance.
3. Confusion Matrices: Graphical representations of the true positives, false positives, true negatives, and false negatives, allowing a detailed comparison of model performance across categories.

Table 2: A structured overview of the methodology steps, models used, and evaluation metrics

| Stage | Step | Description |
|---|---|---|
| Data Preprocessing | Data Cleaning | Removing duplicate tweets and correcting language flaws. |
| Data Preprocessing | Tokenization | Converting text input into tokens suitable for model training. |
| Data Preprocessing | Label Encoding | Converting labels to numerical values. |
| Data Preprocessing | Data Splitting | Splitting the dataset into training (70%), validation, and test sets (30%). |
| Models Used | Convolutional Neural Network (CNN) | Embedding layers, convolutional layers (Conv1D), and pooling layers to capture spatial patterns. |
| Models Used | Long Short-Term Memory Network (LSTM) | Handling sequential dependencies with layers for long-term memory. |
| Models Used | Recurrent Neural Network (RNN) | Analyzing text sequentially using tanh activation functions. |
| Models Used | AraBERT | Pretrained on a large corpus of Arabic texts, optimized for Arabic morphology and syntax. |
| Model Training and Evaluation | Training | Adjusting model parameters to minimize loss and improve accuracy. |
| Model Training and Evaluation | Validation | Monitoring performance during training to prevent overfitting. |
| Model Training and Evaluation | Testing | Final evaluation using accuracy, F1 score, and confusion matrices. |
| Evaluation Metrics | Accuracy | The proportion of correctly classified tweets. |
| Evaluation Metrics | F1 Score | Harmonic mean of precision and recall. |
| Evaluation Metrics | Confusion Matrices | Graphical representation of true positives, false positives, true negatives, and false negatives. |

RESULTS

The findings of this study demonstrated the varying efficacy of different machine learning and deep learning models in classifying adult content in Arabic tweets. The models evaluated include AraBERT, Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and Long Short-Term Memory networks (LSTM). Each model's performance was assessed using metrics such as accuracy and F1 score, and confusion matrices were constructed to provide a detailed comparison.

AraBERT emerged as the most successful model, achieving an accuracy of 100%. This model's ability to efficiently capture contextual patterns in the text makes it highly effective for content classification. The high accuracy indicates that AraBERT can reliably distinguish between adult and non-adult content in Arabic tweets, making it a valuable tool for content moderation.

The CNN model also performed well, with an accuracy of 94.27%. This model's architecture, which includes embedding layers, convolutional layers (Conv1D), and pooling layers, enables it to capture spatial patterns in the text effectively. The high accuracy of the CNN model suggests that it is well suited to handling short text sequences and can be a robust option for detecting adult content in Arabic tweets.

The RNN model achieved an accuracy of 94.22%, comparable to that of the CNN model. The RNN's ability to analyze text sequentially using tanh activation functions makes it suitable for short to medium-length texts. The model's performance indicates that it can effectively handle brief text sequences and classify them accurately.

The LSTM model, designed to handle sequential dependencies in the text, achieved an accuracy of 88.37%.
While this model is good at capturing contextual meaning across long sequences, its performance was somewhat lower than that of the CNN and RNN models. The LSTM's ability to manage temporal dependencies makes it beneficial for lengthier sentences, but it may be less effective for shorter text sequences compared to the other models.

The results highlighted the strengths and limitations of each model in classifying adult content in Arabic tweets. AraBERT's superior performance can be attributed to its design, which was specifically optimized for the Arabic language, including its morphology and syntax. The CNN and RNN models also demonstrated strong performance, indicating their suitability for text classification tasks involving short to medium-length texts. The LSTM model, while effective in handling longer sequences, showed lower accuracy, suggesting that it may be more suitable for tasks requiring more context or longer text analysis.

Confusion matrices were constructed for each model to provide a detailed comparison of their performance across categories. These matrices illustrate the true positives, false positives, true negatives, and false negatives, offering insights into each model's classification capabilities. The high accuracy and low error rates of the AraBERT, CNN, and RNN models indicate their reliability in distinguishing between adult and non-adult content.

CONCLUSION

In conclusion, the study demonstrates that AraBERT is the most effective model for detecting adult content in Arabic tweets, with the CNN model being a close second. The RNN model also performs well, while the LSTM model, although effective in handling longer sequences, shows slightly lower performance levels. These findings underscore the importance of selecting the appropriate model based on the specific characteristics of the text data and the classification task at hand.
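The accuracy and F1 values on which these comparisons rest follow from the four confusion-matrix counts defined in the methodology. A minimal sketch of that arithmetic is given below; the function names and the toy labels are hypothetical and are not taken from the study, 1 denotes ADULT and 0 denotes NOT_ADULT as in the dataset description, and F1 is computed for the ADULT class only because the source does not say which averaging mode was used.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for a binary task with the given positive class."""
    tp = fp = tn = fn = 0
    for truth, prediction in zip(y_true, y_pred):
        if prediction == positive and truth == positive:
            tp += 1
        elif prediction == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """Proportion of correctly classified items out of all items."""
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall for the positive class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy labels for illustration only (not results from the study).
    y_true = [1, 1, 1, 0, 0, 0, 0, 1]
    y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    print("confusion counts (tp, fp, tn, fn):", (tp, fp, tn, fn))
    print("accuracy:", accuracy(tp, fp, tn, fn))
    print("F1 (ADULT class):", f1_score(tp, fp, fn))
```

On a balanced toy set like this one, accuracy and F1 coincide; on an imbalanced dataset they can diverge, which is one reason to report both.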
By providing a detailed analysis of each model's performance, this study contributes to the development of more accurate and reliable content moderation systems for Arabic social media platforms. The results highlight the need for continued innovation and improvement in machine learning and deep learning models to enhance their effectiveness in detecting explicit content (Table 3, Fig. 2).

Table 3: Comparison of the performance of each model in terms of accuracy and F1 score

| Model | Accuracy (%) | F1 Score |
|---|---|---|
| AraBERT | 100.00 | 1.00 |
| CNN | 94.27 | 0.94 |
| RNN | 94.22 | 0.94 |
| LSTM | 88.37 | 0.88 |

FUTURE WORK

Future research in this area could focus on several key aspects to enhance model performance and improve the accuracy of detecting adult content in Arabic tweets. Advanced preprocessing techniques for Arabic text, such as handling diacritics and addressing colloquial idioms and spelling variations, can significantly improve model accuracy. Exploring hybrid and ensemble models by combining different models to leverage their strengths can lead to improved outcomes. Fine-tuning pretrained models specifically for Arabic or domain-specific datasets can increase their ability to understand nuanced language characteristics. Incorporating multimodal data, such as images or videos, along with text can provide additional context and improve classification accuracy. Expanding the dataset and incorporating more diverse examples can improve the models' generalizability. Developing models optimized for real-time performance and ensuring scalability is crucial for practical content moderation. Investigating adaptive criteria based on cultural variables and addressing ethical implications, including privacy concerns and potential biases, is essential for developing responsible and fair systems. By focusing on these areas, future research can significantly enhance the accuracy and effectiveness of models for detecting adult content in Arabic tweets, contributing to the development of safer and more accurate content moderation systems.

References

[1] J. K. Appati, K. Y. Lodonu, and R. Chris-Koka, “A Review of Image Analysis Techniques for Adult Content Detection: Child Protection,” International Journal of Software Innovation, vol. 9, no. 2, pp. 102–121, 2021, doi: 10.4018/IJSI.2021040106.
[2] V. M. T. Ochoa, S. Y. Yayilgan, and F. A. Cheikh, “Adult video content detection using machine learning techniques,” 8th International Conference on Signal Image Technology and Internet Based Systems, SITIS 2012, pp. 967–974, 2012, doi: 10.1109/SITIS.2012.143.
[3] G. M. Barrientos, R. Alaiz-Rodríguez, V. González-Castro, and A. C. Parnell, “Machine learning techniques for the detection of inappropriate erotic content in text,” International Journal of Computational Intelligence Systems, vol. 13, no. 1, pp. 591–603, Jun. 2020, doi: 10.2991/IJCIS.D.200519.003.
[4] A. Dubettier, T. Gernot, E. Giguet, and C. Rosenberger, “A Comparative Study of Tools for Explicit Content Detection in Images,” Proceedings - 2023 International Conference on Cyberworlds, CW 2023, pp. 464–471, 2023, doi: 10.1109/CW58918.2023.00077.
[5] H. Mubarak, A. Rashed, K. Darwish, Y. Samih, and A. Abdelali, “Arabic Offensive Language on Twitter: Analysis and Experiments,” WANLP 2021 - 6th Arabic Natural Language Processing Workshop, Proceedings of the Workshop, pp. 126–135, Apr. 2020. Accessed: Jan. 11, 2025. [Online]. Available: https://arxiv.org/abs/2004.02192v3
[6] Y. Mao, Q. Liu, and Y. Zhang, “Sentiment analysis methods, applications, and challenges: A systematic literature review,” Journal of King Saud University - Computer and Information Sciences, vol. 36, no. 4, p. 102048, Apr. 2024, doi: 10.1016/J.JKSUCI.2024.102048.
[7] G. Gajula, A. Hundiwale, S. Mujumdar, and L. R. Saritha, “A machine learning based adult content detection using support vector machine,” Proceedings of the 7th International Conference on Computing for Sustainable Global Development, INDIACom 2020, pp. 181–185, Mar. 2020, doi: 10.23919/INDIACOM49435.2020.9083700.
[8] H. Mubarak, S. Hassan, and A. Abdelali, “Adult Content Detection on Arabic Twitter: Analysis and Experiments,” 2021. Accessed: Jan. 11, 2025. [Online]. Available: https://aclanthology.org/2021.wanlp-1.14/

Additional Declarations

The authors declare no competing interests.

Supplementary Files

Highlights.docx
has become an increasingly critical issue for online platforms due to the rapid expansion of data exchanged across digital spaces. Social media platforms, websites, and other online environments handle vast amounts of information every second, resulting in a dynamic yet demanding content moderation ecosystem. The availability of explicit information online, often without adequate age controls or permission procedures, raises substantial concerns, particularly for vulnerable users such as children and teenagers. The risks of exposure to inappropriate content extend beyond immediate harm, potentially affecting psychological development, behavior, and societal norms [\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eTraditional approaches for identifying adult content have evolved significantly, from manual moderation to automated systems powered by machine learning (ML) and deep learning (DL) technologies. Early methods, such as keyword filtering and image-based detection, were limited in their ability to handle the complexities of modern content. Advances in ML and DL have led to the development of sophisticated algorithms capable of analyzing large datasets, recognizing subtle patterns, and providing real-time results [\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e]. 
Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM) networks, and transformer-based architectures like BERT and its variants have shown promising results in text and image classification tasks, including the detection of explicit content [\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e], [\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eDespite these advancements, there remains a significant gap in research focused on detecting adult content in the Arabic digital realm compared to other languages. Arabic presents unique linguistic challenges, including its rich morphology, diverse dialects, and frequent use of colloquialisms, which complicate automated content moderation [\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e]. Additionally, the inclusion of emoticons, URLs, and hashtags in Arabic tweets further complicates text processing, necessitating more advanced and adaptive algorithms [\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eThis study aims to address these challenges by evaluating the effectiveness of various ML and DL models in detecting adult content in Arabic tweets. The study investigates models including CNN, RNN, LSTM, and AraBERT to determine the most effective methods for achieving this aim. The results reveal the capabilities and limitations of current systems and highlight the need for continued innovation in content moderation. 
Finally, our study contributes to developing safer digital environments, especially for Arabic-speaking users, by improving our understanding of how to efficiently and reliably filter explicit information in text content.\u003c/p\u003e"},{"header":"RELATED WORK","content":"\u003cp\u003eThe dataset is organized into several categories to capture the diverse linguistic and cultural aspects of Arabic social media content. It includes tweets in various Arabic dialects, such as Gulf, Levantine, and Egyptian, showcasing the linguistic diversity of the Arabic-speaking world. The dataset also features informal text styles common on social media, including emojis, hashtags, and abbreviations. Additionally, it contains sensitive language, including vulgar words and satirical and metaphorical references to adult entertainment, reflecting the nuanced ways in which adult content is expressed in Arabic. Contextual information, such as hashtags, mentions, and links, is retained to enhance the accuracy of natural language processing models in identifying semantic and syntactic patterns. Several studies have focused on the automatic identification of adult content, whether in image-based, video-based, or text-based formats.\u003c/p\u003e\u003cp\u003eAppati et al. (2021) provided a comprehensive review of image analysis techniques for adult content detection, focusing on both traditional and deep learning methods [\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e]. They categorized the approaches into Region of Interest (ROI) techniques and deep learning methods. ROI techniques, such as skin pixel ratio and explicit content weighting, use algorithms like Support Vector Machines (SVM) and k-Nearest Neighbors (KNN) for classification.
In contrast, deep learning methods, particularly Convolutional Neural Networks (CNNs), have shown superior performance in detecting explicit content due to their ability to learn complex patterns from large datasets.\u003c/p\u003e\u003cp\u003eGajula et al. (2020) proposed a model based on supervised learning using the Support Vector Machine (SVM) algorithm to detect and blur explicit content in images. The model was trained on a dataset containing 7,000 pornographic images and 7,000 normal images, achieving an accuracy of 97.8% during testing. The classification process was based on the amount of skin percentage exposed in the images. If an image was classified as pornographic, it was then processed to blur the explicit content using image processing techniques. This approach ensured that end-users were not exposed to inappropriate material, making it a robust solution for protecting children from adult content on the internet [\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eDubettier et al. (2023) conducted a comparative study to evaluate the efficacy of various methods for detecting sexual content in images, aiming to protect children and enhance digital forensic investigations. The study assessed five tools: the nsfw model, NudeNet, NuDetective, SkinDetection, and DeepPornDetection, each employing distinct techniques such as skin detection, deep learning, or transfer learning. Using three datasets with varying degrees of explicit content and complexity, the researchers found that the nsfw model and NudeNet achieved the highest accuracy across datasets, while DeepPornDetection performed best on the dataset it was trained on, indicating a training bias. The study highlighted several challenges, including the insufficiency of skin detection alone, as high skin exposure does not always indicate sexual content, and some explicit images have low skin exposure. 
Additionally, the findings underscored the subjectivity in distinguishing between acceptable and harmful content due to cultural and environmental factors. While the nsfw model and NudeNet showed potential, the authors concluded that further enhancements are needed. They recommended future research to explore adaptive criteria based on cultural variables for more precise screening [\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e]. Ochoa et al. (2012) explored machine learning-based strategies for recognizing pornographic video material, emphasizing the integration of spatial and temporal data [\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e]. Their study demonstrated the effectiveness of combining spatial features, such as skin detection and color histograms, with temporal features like shot duration and camera motion. The use of SVM classifiers achieved an accuracy rate of up to 94.44%, highlighting the potential of machine learning techniques in this domain.\u003c/p\u003e\u003cp\u003eBarrientos et al. (2020) conducted a comprehensive study on the use of machine learning techniques to detect inappropriate erotic content in text, with a focus on protecting children from exposure to such material. The study highlighted the growing need for automated moderation tools due to the impracticality of manual moderation in the face of increasing user-generated content. The researchers employed twelve different models, formed by pairing three text encoders (Bag of Words, TF-IDF, and Word2Vec) with four classifiers (SVM, Logistic Regression, k-Nearest Neighbors, and Random Forest), to identify unsuitable material. Using a dataset of over 110,000-word samples from Reddit, categorized as sexual or neutral, they found that the combination of TF-IDF and SVM (linear kernel) achieved the highest performance, with an accuracy of 97% and an F-score of 0.96.
This study underscored the effectiveness of machine learning approaches in real-time content filtering, particularly for social networks. The authors also suggested future research directions, including the exploration of deep learning models and feature reduction techniques to further enhance content detection systems. The findings demonstrated the practicality and potential of automated tools in maintaining a safer online environment for children [\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eMubarak et al. (2021) conducted a study focused on identifying explicit content in Arabic tweets. The researchers created a dataset comprising 50,000 Twitter accounts, with 6,000 identified as adult content accounts. This dataset was annotated using Arabic-related keywords and hashtags. The analysis revealed that adult tweets are generally shorter, use fewer words, and contain more URLs and emojis compared to non-adult tweets. Significant patterns were identified in the use of words, emojis, and hashtags. The study evaluated several machine learning models, including Support Vector Machines (SVM), FastText, multilingual BERT, and AraBERT. Among these, AraBERT outperformed the others, achieving an F1 score of 96.8% when combining user data with tweet content. The researchers concluded that even basic information, such as usernames and brief descriptions, can be effective in identifying sexual content accounts. They suggested that future research could explore multimodal content analysis to further enhance detection accuracy [\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eThese studies provided a foundation for our research, highlighting the strengths and limitations of existing approaches.
Our study aims to build on this foundation by evaluating the performance of various machine learning and deep learning models in detecting adult content in Arabic tweets, addressing the identified gaps in the literature (Table\u0026nbsp;1).\u003c/p\u003e\u003cp\u003eTable 1: Summary of the key findings from the related work, comparing the performance of different models and techniques in detecting adult content\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cdiv class=\"gridtable\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e\u003ctable float=\"No\" id=\"Taba\" border=\"1\"\u003e\u003ccolgroup cols=\"5\"\u003e\u003c/colgroup\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eStudy\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eApproach\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eDataset\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003eAccuracy\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c5\"\u003e\u003cp\u003eKey Findings\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eAppati et al.\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eROI Techniques, CNNs\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eImage Data\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\"
colname=\"c4\"\u003e\u003cp\u003e94.44%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eCNNs outperform traditional methods\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eOchoa et al.\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eSVM, Spatial and Temporal Features\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eVideo Data\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e94.44%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eIntegration of spatial and temporal data improves accuracy\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eGajula et al.\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eSVM, Skin Percentage\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eImage Data\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e97.8%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eHigh accuracy in detecting and blurring explicit content\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\"
colname=\"c1\"\u003e\u003cp\u003eBarrientos et al.\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eTF-IDF, SVM\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eText Data\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e97%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eEffective real-time content filtering for social networks\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eDubettier et al.\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003ensfw model, NudeNet\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eImage Data\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eHigh accuracy\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eChallenges in skin detection and cultural subjectivity\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eMubarak et al.\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eAraBERT\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eArabic Tweets\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e96.8% (F1 score)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eEffective in identifying explicit content in Arabic tweets\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/table\u003e\u003c/div\u003e"},{"header":"DATASET","content":"\u003cp\u003eIn this study, we used the dataset from the previous study by Mubarak et al.
(2021) [\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e], which evaluated adult content on Arabic Twitter. The dataset consists of 33,691 samples with two main columns, named \"text2\" and \"categories.\" The \"text2\" column contains the full tweet text or main textual content, which includes various linguistic structures ranging from Fusha (formal Arabic) to Ammiya (colloquial/spoken Arabic) and code-switching between Arabic and English. The \"categories\" column indicates whether the tweet contains adult content or not, with two classes: ADULT (1) and NOT_ADULT (0).\u003c/p\u003e\u003cp\u003eThe dataset was collected and annotated by expert linguists and content reviewers to ensure high quality and reliability. They manually evaluated and annotated tweets using Arabic-related keywords and hashtags to accurately identify and classify adult content.\u003c/p\u003e\u003cp\u003eThe dataset encompasses a wide range of linguistic, artistic, and expressive features, including tweets in Gulf, Levantine, and Egyptian Arabic dialects, reflecting the linguistic diversity of the Arabic-speaking world. It also includes informal text styles common on social media, such as emojis, hashtags, and abbreviations. Additionally, the dataset contains vulgar words and satirical and metaphorical references to adult entertainment, capturing the nuanced ways in which adult content is expressed in Arabic.\u003c/p\u003e\u003cp\u003eContextual information typically found in social media content, such as hashtags, mentions, and links, is retained in the dataset.
This contextual data is crucial for advanced natural language processing (NLP) models, as it helps identify semantic and syntactic patterns that improve the accuracy of adult content classification.\u003c/p\u003e\u003col\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003eData Preprocessing\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003c/ol\u003e\u003cp\u003eTo prepare the dataset for model training, several preprocessing steps were undertaken (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e):\u003c/p\u003e\u003col\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003eData Cleaning: Duplicate tweets and language flaws that could skew the study findings were removed.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003eTokenization: The text input was tokenized using a tokenizer to convert the tweets into a format suitable for model training.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003eLabel Encoding: The labels were converted to numerical values using LabelEncoder to facilitate model training and evaluation.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003eData Splitting: The dataset was split into training, validation, and test sets using a 70/30 ratio, ensuring a robust evaluation of the models.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003c/ol\u003e\u003col style=\"list-style-type: upper-alpha;\"\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eDataset Size and Diversity\u003c/b\u003e The dataset's large size and diversity enable robust machine-learning training for the automatic identification and classification of adult content on Arabic social media sites. 
This wealth of cultural and expressive dimensions helps in crafting strong, generalizable models that are well-suited to modeling the complexity and nuances of Arabic text.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eGeneralizability\u003c/b\u003e With such diverse and realistic data, the models trained on this dataset can generalize well across various scenarios, making this dataset an essential part of the solution for building tools to monitor and filter out adult content in Arabic language social media websites. The inclusion of contextual information and the representation of different dialects and informal text styles further enhance the dataset's utility for developing effective content moderation systems.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003c/ol\u003e"},{"header":"METHODOLOGY","content":"\u003cp\u003eIn this study, we utilized a dataset of 33,691 tweets classified as adult or non-adult material. The methodology involved several key processes to ensure the accuracy and reliability of the results:\u003c/p\u003e\n\u003cp\u003e(2)\u0026nbsp;Data Preprocessing\u003c/p\u003e\n\u003cp\u003e1.\u0026nbsp;\u0026nbsp;Data Cleaning:\u0026nbsp;The dataset was meticulously cleaned to remove duplicate tweets and correct language flaws that could skew the study findings. This step ensured that the data was of high quality and free from inconsistencies.\u003c/p\u003e\n\u003cp\u003e2.\u0026nbsp;\u0026nbsp;Tokenization:\u0026nbsp;The text input was tokenized using a tokenizer, which converted the tweets into a format suitable for model training. This process involved breaking down the text into individual tokens or words, which could then be analyzed by the models.\u003c/p\u003e\n\u003cp\u003e3.\u0026nbsp;\u0026nbsp;Label Encoding:\u0026nbsp;Labels were converted to numerical values using LabelEncoder. 
This step facilitated the training and evaluation of the models by providing a standardized format for the classification labels.\u003c/p\u003e\n\u003cp\u003e4.\u0026nbsp;\u0026nbsp;Data Splitting:\u0026nbsp;The dataset was split into training, validation, and test sets using a 70/30 ratio. This approach ensured that the models were trained on a substantial portion of the data while retaining enough data for validation and testing to assess model performance accurately.\u003c/p\u003e\n\u003cp\u003e(3)\u0026nbsp;Models Used\u003c/p\u003e\n\u003cp\u003eWe employed several machine learning and deep learning models to evaluate their effectiveness in detecting adult content in Arabic tweets:\u003c/p\u003e\n\u003cp\u003e1.\u0026nbsp; \u0026nbsp;Convolutional Neural Network (CNN):\u0026nbsp;The CNN model consisted of embedding layers, convolutional layers (Conv1D), and pooling layers. This architecture enabled the capture of spatial patterns in the text, making it suitable for text classification tasks.\u003c/p\u003e\n\u003cp\u003e2.\u0026nbsp; \u0026nbsp;Long Short-Term Memory Network (LSTM): The LSTM model was designed to handle sequential dependencies in the text. It included layers for long-term memory, allowing it to capture contextual meaning across long sequences, which is particularly beneficial for lengthier sentences.\u003c/p\u003e\n\u003cp\u003e3.\u0026nbsp; \u0026nbsp;Recurrent Neural Network (RNN):\u0026nbsp;A simple RNN was used to analyze text sequentially using tanh activation functions. This model was appropriate for short to medium-length texts. For binary classification, we employed a dense output layer with sigmoid activation.\u003c/p\u003e\n\u003cp\u003e4.\u0026nbsp; \u0026nbsp;AraBERT Model:\u0026nbsp;Specifically designed for Arabic, the AraBERT model was trained on a large corpus of Arabic texts and optimized to handle the unique features of the language, including its morphology and syntax. 
Natural language processing tasks were performed using the pretrained aubmindlab/bert-base-arabertv02 model, configured with Trainer to use a low learning rate and a small batch size. Few-shot learning was also investigated by evaluating performance with a small collection of examples.\u003c/p\u003e\n\u003cp\u003e(4)\u0026nbsp;Model Training and Evaluation\u003c/p\u003e\n\u003cp\u003e1.\u0026nbsp; \u0026nbsp;Training:\u0026nbsp;Each model was trained on the training set, with hyperparameters tuned to optimize performance. The training process involved adjusting the model parameters to minimize the loss function and improve classification accuracy.\u003c/p\u003e\n\u003cp\u003e2.\u0026nbsp; \u0026nbsp;Validation:\u0026nbsp;The validation set was used to monitor the models\u0026apos; performance during training and prevent overfitting. Techniques such as dropout and batch normalization were employed to enhance model generalization.\u003c/p\u003e\n\u003cp\u003e3.\u0026nbsp; \u0026nbsp;Testing:\u0026nbsp;The final evaluation of the models was conducted on the test set. Performance metrics such as accuracy, F1 score, and confusion matrices were used to assess how successfully each model distinguished explicit content from safe material.\u003c/p\u003e\n\u003cp\u003e(5)\u0026nbsp;Evaluation Metrics\u003c/p\u003e\n\u003cp\u003e1. Accuracy:\u0026nbsp;The proportion of correctly classified tweets out of the total number of tweets.\u003c/p\u003e\n\u003cp\u003e2.\u0026nbsp; \u0026nbsp;F1 Score:\u0026nbsp;The harmonic mean of precision and recall, providing a balanced measure of model performance.\u003c/p\u003e\n\u003cp\u003e3.
\u0026nbsp; Confusion Matrices: Graphical representations of the true positives, false positives, true negatives, and false negatives, allowing a detailed comparison of model performance across categories.\u003c/p\u003e\n\u003cp\u003eTable 2: A structured overview of the methodology steps, models used, and evaluation metrics\u003c/p\u003e\n\u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\" width=\"66%\"\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eStep\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eDescription\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eData Preprocessing\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\u003cbr\u003e\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eData Cleaning\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eRemoving duplicate tweets and correcting language flaws.\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eTokenization\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eConverting text input into tokens suitable for model training.\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eLabel Encoding\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eConverting labels to numerical values.\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eData Splitting\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eSplitting the dataset into training (70%), validation, and test sets (30%).\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eModels Used\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\u003cbr\u003e\u003c/td\u003e\n
\u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eConvolutional Neural Network (CNN)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eEmbedding layers, convolutional layers (Conv1D), and pooling layers to capture spatial patterns.\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eLong Short-Term Memory Network (LSTM)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eHandling sequential dependencies with layers for long-term memory.\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eRecurrent Neural Network (RNN)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eAnalyzing text sequentially using tanh activation functions.\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eAraBERT\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003ePretrained on a large corpus of Arabic texts, optimized for Arabic morphology and syntax.\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eModel Training and Evaluation\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\u003cbr\u003e\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eTraining\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eAdjusting model parameters to minimize loss and improve accuracy.\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eValidation\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eMonitoring performance during training to prevent overfitting.\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eTesting\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eFinal evaluation using accuracy, F1 score, and confusion
matrices.\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eEvaluation Metrics\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\u003cbr\u003e\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eAccuracy\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eThe proportion of correctly classified tweets.\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eF1 Score\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eHarmonic mean of precision and recall.\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eConfusion Matrices\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eGraphical representation of true positives, false positives, true negatives, and false negatives.\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e"},{"header":"RESULTS","content":"\u003cp\u003eThe findings of this study demonstrated the varying efficacy of different machine learning and deep learning models in classifying adult content in Arabic tweets. The models evaluated include AraBERT, Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and Long Short-Term Memory networks (LSTM). Each model's performance was assessed using metrics such as accuracy and F1 score, and confusion matrices were constructed to provide a detailed comparison.\u003c/p\u003e\u003cp\u003eAraBERT emerged as the most successful model, achieving an accuracy of 100%. This model's ability to capture contextual patterns in the text makes it highly effective for content classification.
This accuracy indicates that AraBERT can reliably distinguish between adult and non-adult content in Arabic tweets, making it a valuable tool for content moderation.\u003c/p\u003e\u003cp\u003eThe CNN model also performed well, with an accuracy of 94.27%. Its architecture, which includes embedding layers, convolutional layers (Conv1D), and pooling layers, enables it to capture spatial patterns in the text effectively. The high accuracy of the CNN model suggests that it is well suited to short text sequences and can be a robust option for detecting adult content in Arabic tweets.\u003c/p\u003e\u003cp\u003eThe RNN model achieved an accuracy of 94.22%, nearly identical to that of the CNN model. The RNN's ability to analyze text sequentially using tanh activation functions makes it suitable for short to medium-length texts. The model's performance indicates that it can effectively handle brief text sequences and classify them accurately.\u003c/p\u003e\u003cp\u003eThe LSTM model, designed to handle sequential dependencies in the text, achieved an accuracy of 88.37%. While this model is good at capturing contextual meaning across long sequences, its accuracy was about six percentage points lower than that of the CNN and RNN models. The LSTM's ability to manage temporal dependencies makes it beneficial for lengthier sentences, but it may be less effective than the other models on shorter text sequences.\u003c/p\u003e\u003cp\u003eThe results highlight the strengths and limitations of each model in classifying adult content in Arabic tweets. AraBERT's superior performance can be attributed to its design, which is optimized for the Arabic language, including its morphology and syntax. The CNN and RNN models also demonstrated strong performance, indicating their suitability for text classification tasks involving short to medium-length texts.
The LSTM model, while effective in handling longer sequences, showed lower accuracy, suggesting that it may be more suitable for tasks requiring more context or longer text analysis.\u003c/p\u003e\u003cp\u003eConfusion matrices were constructed for each model to provide a detailed comparison of their performance across categories. These matrices illustrate the true positives, false positives, true negatives, and false negatives, offering insights into each model's classification capabilities. The high accuracy and low error rates of the AraBERT, CNN, and RNN models indicate their reliability in distinguishing between adult and non-adult content.\u003c/p\u003e"},{"header":"CONCLUSION","content":"\u003cp\u003eIn conclusion, the study demonstrates that AraBERT is the most effective model for detecting adult content in Arabic tweets, with the CNN model being a close second. The RNN model also performs well, while the LSTM model, although effective in handling longer sequences, shows slightly lower performance levels. These findings underscore the importance of selecting the appropriate model based on the specific characteristics of the text data and the classification task at hand. By providing a detailed analysis of each model's performance, this study contributes to the development of more accurate and reliable content moderation systems for Arabic social media platforms. 
The results highlight the need for continued innovation and improvement in machine learning and deep learning models to enhance their effectiveness in detecting explicit content (Table\u0026nbsp;3) (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e).\u003c/p\u003e\u003cp\u003eTable 3: Comparison of the performance of each model in terms of accuracy and F1 score\u003c/p\u003e\u003cdiv class=\"gridtable\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003ctable float=\"No\" id=\"Tabc\" border=\"1\"\u003e\u003ccolgroup cols=\"3\"\u003e\u003c/colgroup\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eModel\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eAccuracy (%)\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eF1 Score\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eAraBERT\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e100.00\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e1.00\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eCNN\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e94.27\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.94\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eRNN\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" 
char=\".\" colname=\"c2\"\u003e\u003cp\u003e94.22\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.94\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eLSTM\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e88.37\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.88\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/table\u003e\u003c/div\u003e"},{"header":"FUTURE WORK","content":"\u003cp\u003eFuture research in this area could focus on several key aspects to enhance model performance and improve the accuracy of detecting adult content in Arabic tweets. Advanced preprocessing techniques for Arabic text, such as handling diacritics and addressing colloquial idioms and spelling variations, can significantly improve model accuracy. Exploring hybrid and ensemble approaches that combine different models to leverage their strengths can lead to improved outcomes. Fine-tuning pretrained models specifically for Arabic or domain-specific datasets can increase their ability to understand nuanced language characteristics. Incorporating multimodal data, such as images or videos, along with text can provide additional context and improve classification accuracy. Expanding the dataset and incorporating more diverse examples can improve the models' generalizability. Developing models optimized for real-time performance and ensuring scalability are crucial for practical content moderation.
Investigating adaptive criteria based on cultural variables and addressing ethical implications, including privacy concerns and potential biases, are essential for developing responsible and fair systems. By focusing on these areas, future research can significantly enhance the accuracy and effectiveness of models for detecting adult content in Arabic tweets, contributing to the development of safer and more accurate content moderation systems.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eJ. K. Appati, K. Y. Lodonu, and R. Chris-Koka, \u0026ldquo;A Review of Image Analysis Techniques for Adult Content Detection: Child Protection,\u0026rdquo; \u003cem\u003eInternational Journal of Software Innovation\u003c/em\u003e, vol. 9, no. 2, pp. 102\u0026ndash;121, 2021, doi: 10.4018/IJSI.2021040106.\u003c/li\u003e\n\u003cli\u003eV. M. T. Ochoa, S. Y. Yayilgan, and F. A. Cheikh, \u0026ldquo;Adult video content detection using machine learning techniques,\u0026rdquo; \u003cem\u003e8\u003csup\u003eth\u003c/sup\u003e International Conference on Signal Image Technology and Internet Based Systems, SITIS 2012\u003c/em\u003e, pp. 967\u0026ndash;974, 2012, doi: 10.1109/SITIS.2012.143.\u003c/li\u003e\n\u003cli\u003eG. M. Barrientos, R. Alaiz-Rodr\u0026iacute;guez, V. Gonz\u0026aacute;lez-Castro, and A. C. Parnell, \u0026ldquo;Machine learning techniques for the detection of inappropriate erotic content in text,\u0026rdquo; \u003cem\u003eInternational Journal of Computational Intelligence Systems\u003c/em\u003e, vol. 13, no. 1, pp. 591\u0026ndash;603, Jun. 2020, doi: 10.2991/IJCIS.D.200519.003.\u003c/li\u003e\n\u003cli\u003eA. Dubettier, T. Gernot, E. Giguet, and C. Rosenberger, \u0026ldquo;A Comparative Study of Tools for Explicit Content Detection in Images,\u0026rdquo; \u003cem\u003eProceedings - 2023 International Conference on Cyberworlds, CW 2023\u003c/em\u003e, pp.
464\u0026ndash;471, 2023, doi: 10.1109/CW58918.2023.00077.\u003c/li\u003e\n\u003cli\u003eH. Mubarak, A. Rashed, K. Darwish, Y. Samih, and A. Abdelali, \u0026ldquo;Arabic Offensive Language on Twitter: Analysis and Experiments,\u0026rdquo; \u003cem\u003eWANLP 2021 - 6th Arabic Natural Language Processing Workshop, Proceedings of the Workshop\u003c/em\u003e, pp. 126\u0026ndash;135, Apr. 2020, Accessed: Jan. 11, 2025. [Online]. Available: https://arxiv.org/abs/2004.02192v3\u003c/li\u003e\n\u003cli\u003eY. Mao, Q. Liu, and Y. Zhang, \u0026ldquo;Sentiment analysis methods, applications, and challenges: A systematic literature review,\u0026rdquo; \u003cem\u003eJournal of King Saud University - Computer and Information Sciences\u003c/em\u003e, vol. 36, no. 4, p. 102048, Apr. 2024, doi: 10.1016/J.JKSUCI.2024.102048.\u003c/li\u003e\n\u003cli\u003eG. Gajula, A. Hundiwale, S. Mujumdar, and L. R. Saritha, \u0026ldquo;A machine learning based adult content detection using support vector machine,\u0026rdquo; \u003cem\u003eProceedings of the 7th International Conference on Computing for Sustainable Global Development, INDIACom 2020\u003c/em\u003e, pp. 181\u0026ndash;185, Mar. 2020, doi: 10.23919/INDIACOM49435.2020.9083700.\u003c/li\u003e\n\u003cli\u003eH. Mubarak, S. Hassan, and A. Abdelali, \u0026ldquo;Adult Content Detection on Arabic Twitter: Analysis and Experiments,\u0026rdquo; 2021. Accessed: Jan. 11, 2025. [Online]. 
Available: https://aclanthology.org/2021.wanlp-1.14/\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"Arabic Tweets, AraBERT, Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM), Natural Language Processing (NLP), Recurrent Neural Networks (RNN), Text Classification","lastPublishedDoi":"10.21203/rs.3.rs-7579505/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-7579505/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eThis study evaluates the effectiveness of various machine learning and deep learning models in detecting adult content in Arabic tweets, addressing unique linguistic and cultural challenges. Using a dataset of 33,691 Arabic tweets, we implemented and compared Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), Long Short-Term Memory networks (LSTM), and AraBERT. The data underwent thorough preprocessing, including cleaning, tokenization, and segmentation into training, validation, and test sets. Performance metrics such as accuracy, F1 score, and confusion matrices were used to assess model efficacy. AraBERT achieved the highest accuracy (100%), demonstrating superior capability in capturing spatial patterns for content classification. CNN and RNN also performed well, with accuracies of 94.27% and 94.22%, respectively, while LSTM achieved an accuracy of 88.37%. 
These findings highlight AraBERT's potential for effective content moderation in Arabic digital spaces, contributing to safer online environments.\u003c/p\u003e","manuscriptTitle":"Detection of Adult Content in Arabic Tweets Using Machine Learning Models","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-09-17 06:15:08","doi":"10.21203/rs.3.rs-7579505/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"2e528299-22a2-44fc-a8b2-8d8f13caf261","owner":[],"postedDate":"September 17th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[{"id":54486150,"name":"Artificial Intelligence and Machine Learning"}],"tags":[],"updatedAt":"2025-09-17T06:15:08+00:00","versionOfRecord":[],"versionCreatedAt":"2025-09-17 06:15:08","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-7579505","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-7579505","identity":"rs-7579505","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
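The evaluation metrics listed in the methodology table (accuracy, F1 score, and the confusion matrix counts) can be illustrated with a minimal Python sketch. This is not the author's evaluation code: the function names and the toy labels below are assumptions made for illustration, with 1 standing for adult content and 0 for non-adult content.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for a binary classification task."""
    counts = Counter()
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts["tp"], counts["fp"], counts["tn"], counts["fn"]


def accuracy(tp, fp, tn, fn):
    """Proportion of correctly classified examples."""
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall (0.0 when undefined)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels, not data from the study.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print("confusion counts (tp, fp, tn, fn):", (tp, fp, tn, fn))
print("accuracy:", accuracy(tp, fp, tn, fn))
print("F1 score:", f1_score(tp, fp, fn))
```

Reporting the four confusion counts alongside accuracy is useful when the two classes are imbalanced, since accuracy alone can hide a high false-negative rate.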