Real-Time Amharic Hate Speech Detection in Live Streams and Video Chats

doi:10.21203/rs.3.rs-6028931/v1

2025 · doi:10.21203/rs.3.rs-6028931/v1

preprint OA: closed


Baye Atnafu Ferede, Debre Markos University

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-6028931/v1
This work is licensed under a CC BY 4.0 License.
Status: Posted. Version 1.

Abstract

This work proposes a novel approach to real-time hate speech detection in Amharic live streams and video chats using state-of-the-art deep learning techniques. The extensive use of social media and live communication tools has amplified hateful and offensive content, making efficient mitigation strategies essential. The system uses multimodal data (text, audio, and video) to detect and mitigate hate speech. By integrating multiple sources of data, the system forms a holistic view of the content and identifies both implicit and explicit hate speech.
The text analysis module employs a Bidirectional Long Short-Term Memory (BiLSTM) network to process chat messages and comments, while the audio analysis module employs a Convolutional Neural Network (CNN) to extract and process acoustic features. The video analysis module uses computer vision to detect visual cues of hate speech. Integrating the modalities enables the system to produce robust and reliable results. Experimental results confirm the effectiveness of the proposed system in real-world applications, including live streaming sessions and interactive video conversations. The system achieved high accuracy, precision, recall, and F1-score, supporting its applicability for real-time deployment. This paper contributes to the body of work on hate speech detection and provides insights for developing similar systems for other low-resource languages.

Subject areas: Physical sciences/Energy science and technology; Physical sciences/Engineering

Keywords: Amharic, Real-Time, Live Streams, Video Chats, Deep Learning, Multimodal Analysis

1. Introduction

The development of social media and live streaming platforms has revolutionized communication, allowing users to post and communicate in real time. However, this innovation has also fueled the spread of hate speech, which poses serious challenges to online communities. Hate speech can cause violence, discrimination, and social conflict, so its detection and removal are essential (Davidson et al., 2017; Schmidt & Wiegand, 2017).

Amharic is among the most widely spoken languages in Ethiopia and has seen a surge in online content creation and consumption. Despite this, few studies have focused specifically on hate speech detection in Amharic, particularly in live streaming and video chatting. Most current studies address text-based hate speech detection, leaving a knowledge gap regarding how hate speech is expressed in real-time audiovisual communication (Waseem et al., 2017; Salminen et al., 2018).

This paper addresses that gap by proposing an Amharic real-time hate speech detection system based on deep learning techniques. The system leverages multimodal information, combining text, audio, and video analysis to identify hate speech. By drawing on the strengths of the different modalities, it takes a comprehensive approach to Amharic hate speech in real-time communication channels.

Identifying hate speech is vital for maintaining a safe and inclusive online environment. Effective detection systems can curb the propagation of hateful material, protect at-risk communities, and foster appropriate social behavior. Traditionally, hate speech has been moderated manually, an approach that is resource-intensive and cannot respond in a timely fashion (Warner & Hirschberg, 2012). These limitations motivate the development of automated systems.

Real-time hate speech detection poses several challenges. First, hate speech can be highly context-specific, so systems need to detect nuances in speech, tone, and intent (Fortuna & Nunes, 2018). Second, real-time detection requires processing large amounts of data almost instantly, which is computationally intensive. Combining more than one modality of data (text, audio, and video) makes the problem more complex but offers a more holistic solution to hate speech detection (Zhou et al., 2016).

This paper contributes to the literature by proposing a new approach for identifying hate speech in Amharic live streams and video chats in real time. The system employs deep learning techniques to handle multimodal data, offering a robust and efficient solution to the problem.
Experimental results demonstrate the efficacy of the system in various real-world scenarios, suggesting that it is practical to deploy.

2. Related Work

Previous research on hate speech detection has been largely text-based. Researchers have explored various machine learning and deep learning approaches to identify hate speech in social media posts, comments, and online articles. For instance, Davidson et al. (2017) used a logistic regression model to identify hate speech on Twitter with remarkable accuracy. Similarly, Waseem and Hovy (2016) developed a dataset of hate speech and abusive language on Twitter that has been widely used for training and testing hate speech models.

Nevertheless, hate speech identification in live streaming and video chat is under-researched, especially for low-resource languages like Amharic. Gao et al. (2020) presented a proof of concept for identifying objectionable content in Twitch live stream chats and pointed to transfer learning as a way to improve identification accuracy. Pookpanich and Siriborvornratanakul (2024) applied transformer language models to hate speech identification in YouTube live stream chat, with promising F1 and recall values.

Recent multimodal analysis has also shown promise in increasing the accuracy of hate speech detection. By integrating text, audio, and video data, researchers have developed systems that better capture the context and nuances of hate speech. Zhou et al. (2016) demonstrated the potential of multimodal systems by developing a model that fuses textual and visual features and outperforms text-only models in detecting hate speech in memes. This study showed the importance of considering the interplay between text and image, especially for content like memes, where images contribute to the meaning or intent. Subsequently, Mao et al. (2024) proposed a self-attention mechanism that combines multimodal features, advancing the field further by successfully integrating text and image data. These studies demonstrate the need for multimodal research to improve the accuracy and robustness of hate speech detection systems in the face of complex and subtle online content.

Amharic hate speech identification is a comparatively unexplored field, and Watanabe et al. (2018) were among the first to venture into it by creating a text-based dataset. That research emphasized the key issues facing low-resource languages like Amharic, which include scarce annotated data, numerous dialects, and the rich linguistic structure of the language. All of these issues compound the difficulty of training robust models for hate speech identification. The lack of attention to Amharic and similar underrepresented languages underlines the need for more research and resource development to close the gap in global hate speech moderation and to ensure inclusivity and fairness in moderating toxic online content.

Vetagiri et al. (2024) proposed a novel synthetic dataset for multimodal hate speech detection to counter the growing volume of AI-generated hateful content. This Amharic text-image dataset is a significant contribution to the field because it offers a way to examine the complex interplay between visual and textual information in hate speech. With features such as pixel-level heat maps and adversarial examples, the dataset improves both the explainability of detection models and their adversarial robustness. The resource is well placed to drive hate speech moderation research, particularly for underrepresented languages like Amharic, towards a safer and more inclusive digital environment.

Davidson et al. (2017) used a crowd-sourced hate speech lexicon to label tweets as hate speech, offensive language, or neutral content. Their method showed that racist and homophobic tweets tended to be classified as hate speech, while sexist tweets tended to be classified as offensive language. This revealed how challenging it is to build classifiers capable of identifying fine-grained classes of abusive language. Schmidt and Wiegand (2017) surveyed feature extraction techniques for hate speech detection, such as lexical and syntactic features, and outlined their limitations, including the inability to capture context and a reliance on predetermined keywords, which can lead to false positives.

Waseem et al. (2017) proposed a typology of abusive language detection subtasks, categorizing phenomena such as hate speech, cyberbullying, and trolling. Their typology emphasized the overlapping nature of these categories and the need for tailored annotation guidelines to improve model accuracy. Salminen et al. (2018) advanced the field by developing a classifier capable of analyzing hate speech across multiple social media platforms. Their work demonstrated the value of including multimodal data and advanced techniques like BERT, which improved detection accuracy by modeling contextual nuances and cross-platform variation.

Taken together, these studies reflect the evolution of hate speech detection methods from basic keyword-based approaches to powerful multimodal and cross-platform ones, paving the way for more capable and complete solutions. This work builds on these advances to create an Amharic real-time hate speech detection system. Using deep learning methods and multimodal data, the system is intended to offer an end-to-end solution to hate speech in live streams and video chats.
Merging text, audio, and video analysis allows the system to detect explicit and implicit forms of hate speech, enhancing accuracy and robustness.

3. Methodology

The proposed Amharic hate speech detection system has three main modules (text, audio, and video), each employing state-of-the-art deep learning techniques to analyze and process the corresponding data modality. The system integrates the three modules to detect hate speech in real-time applications such as video calling and live streaming.

For text processing, a 128-unit Bidirectional Long Short-Term Memory (BiLSTM) network is employed, using word embeddings pre-trained on a large Amharic text corpus and then fine-tuned. The audio module combines speech-to-text conversion using the Google Speech-to-Text API with acoustic features such as Mel-Frequency Cepstral Coefficients (MFCC) and spectral contrast. These features are analyzed by a three-layer Convolutional Neural Network (CNN) with progressively larger filters to identify hate speech indicators. The video component is based on ResNet-50, trained on the FER+ dataset for emotion detection and fine-tuned for this task, detecting facial expressions, gestures, and other contextual visual features.

To achieve end-to-end real-time processing, the system includes optimized deep learning architectures, fault-tolerant data ingestion pipelines, and parallel computing techniques, enabling low-latency inference and scalability.

Individual module outputs are fused by a Multimodal Transformer model with 12 attention heads and a hidden layer dimension of [value missing]. The fusion captures cross-modal interactions, applying self-attention mechanisms to obtain the final classification of whether hate speech is present.
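
The fusion idea can be illustrated with a minimal sketch, assuming each module emits a fixed-length vector. This is not the Multimodal Transformer used in the paper: it uses a single attention head with identity query/key/value projections, toy 4-dimensional vectors, and hand-picked classifier weights, purely to show how self-attention lets each modality weight the others. All function names, vectors, and weights below are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys, and values are the tokens themselves
    (identity projections); a trained model would learn these projections.
    """
    d = len(tokens[0])
    attended, weights = [], []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        w = softmax(scores)
        weights.append(w)
        attended.append(
            [sum(w[j] * tokens[j][i] for j in range(len(tokens))) for i in range(d)]
        )
    return attended, weights


def fuse(text_vec, audio_vec, video_vec, clf_weights, bias=0.0):
    """Attend across the three modality vectors, mean-pool, apply a sigmoid."""
    attended, weights = self_attention([text_vec, audio_vec, video_vec])
    pooled = [sum(col) / len(attended) for col in zip(*attended)]
    logit = sum(w * x for w, x in zip(clf_weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit)), weights


# Toy module outputs (hypothetical values, not taken from the paper).
text_vec = [0.9, 0.1, 0.0, 0.2]
audio_vec = [0.7, 0.3, 0.1, 0.0]
video_vec = [0.1, 0.0, 0.8, 0.4]

prob, attn = fuse(text_vec, audio_vec, video_vec, clf_weights=[1.0, 0.5, -0.5, 0.0])
print("hate speech probability:", prob)
print("attention weights per modality:", attn)
```

Each row of `attn` sums to 1 and shows how strongly one modality attends to the others. In this toy input the text vector attends more to the audio vector than to the video vector, because their dot product is larger.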
The fusion provides an end-to-end view of the multimodal data and improves the stability of the system.

The key hyperparameters are the Adam optimizer with a learning rate of 0.0001, weight decay of 1e-5, dropout of 0.3, and early stopping at 30 epochs based on validation loss. The batch size is 64 for text and audio and 32 for video because of memory constraints. The loss function combines cross-entropy loss for the individual modules with a joint loss for the multimodal fusion. By integrating advanced techniques in text, audio, and video processing, the system provides a real-time, scalable solution for hate speech detection and prevention that can be applied to live streaming and video conferencing applications.

3.1. Text Analysis

Text data from comments and live chat messages are processed using natural language processing (NLP) techniques. A deep learning model, a Bidirectional Long Short-Term Memory (BiLSTM) network, is trained on a labeled corpus of Amharic text to classify hate speech.

3.2. Audio Analysis

Audio data from live streams undergo speech-to-text conversion and acoustic feature extraction. The converted text is processed with the same NLP pipeline as in the text analysis module. Acoustic features, such as Mel-Frequency Cepstral Coefficients (MFCC), are also extracted and passed through a Convolutional Neural Network (CNN) to detect hate speech.

3.3. Video Analysis

Video content from live streams is processed with computer vision techniques. Facial emotion, hand movements, and other visual cues are processed by a deep learning model, a CNN, to identify hate speech. The video analysis module runs alongside the text and audio analysis modules to build an overall understanding of the content.

... speech in real time from Amharic live streams and video conversations.

4. Experimental Results

The dataset was carefully annotated by domain experts to ensure proper labeling of hate speech, including explicit, implicit, and context-dependent cases. The annotations also covered the different modalities: text, audio, and visual cues.

4.1. Dataset Description

The dataset consists of an extensive collection of Amharic live stream recordings and video chat sessions. Specifically, it comprises 1,000 hours of live stream recordings and 500 hours of video chat sessions with diverse content. This diversity ensures that the dataset covers a wide range of contexts and situations in which hate speech is likely to occur, from casual conversation and socializing to heated debates and disputes.

To ensure the quality and reliability of the dataset, domain experts carefully labeled the data, annotating hate speech and non-hate speech instances. The annotation covered explicit hate speech, implicit hate speech, and context-dependent samples, making the dataset a rich resource for training and testing the system. The dataset includes text, audio, and visual signals and is therefore suitable for multimodal analysis.

For training, validation, and testing, the dataset was split into three subsets in a ratio of 80:10:10. The training set, with 80% of the data, was used to train the deep learning models. A further 10% was used as the validation set to tune the models and prevent overfitting. The remaining 10% formed the test set, which was used to evaluate the system's performance and its ability to generalize.

The video chat sessions and live stream recordings were gathered from various platforms to obtain a diverse representation of user-generated content.
These included recorded webinars, private video chat sessions, and public live streams on social media platforms, allowing the system's performance to be evaluated under realistic scenarios. Overall, the dataset's size, diversity, and careful annotation make it a valuable resource for training and testing the proposed real-time hate speech detection system for Amharic video conferencing and live streaming.

4.2. Data Availability

The datasets generated and/or analyzed during the current study are not publicly available due to privacy and ethical considerations, but are available from the corresponding author upon reasonable request.

4.3. Ethical Considerations

All procedures in this study were reviewed and approved by the Debre Markos University colleague peer review committee. Informed consent was obtained from all study participants and their guardians prior to their involvement in the study. This ensured that participants were fully aware of the aim and method of the study and consented to take part under clearly specified ethical standards. Furthermore, the study observes strict confidentiality protocols to protect the privacy and personal information of all subjects. All methods were performed in accordance with the relevant guidelines and regulations to ensure ethical conduct and the integrity of the research. This includes adherence to institutional protocols, national and international regulations, and the specific guidelines provided by Debre Markos University.

4.4. Evaluation Metrics

The system's performance was measured with four metrics: accuracy, the proportion of all instances (hate speech and non-hate speech) that are classified correctly; precision, the proportion of true positive hate speech instances among all instances flagged as hate speech; recall, the proportion of true positive hate speech instances among all actual instances of hate speech; and F1-score, the harmonic mean of precision and recall, which summarizes the overall performance of the system.

4.5. Experimental Setup

The experiments were conducted on a high-performance computing system with multiple GPUs. The text analysis component used a BiLSTM network, the audio analysis component used a CNN to process acoustic features, and the video analysis component used a CNN for facial expression and object recognition. The multimodal fusion was carried out using a Multimodal Transformer model.

4.6. Results and Discussion

The experimental results demonstrate the effectiveness of the proposed system in real-time hate speech detection, with an accuracy of 92.5%, precision of 90.8%, recall of 91.2%, and an F1-score of 91.0% on the test set. The results indicate that the system detects hate speech accurately across modalities and support the claim that the multimodal approach improves detection accuracy compared to text-only models.

4.7. Baseline Model Comparison

To show the improvement gained from the multimodal approach, the proposed system was compared against a baseline text-only model. The baseline system performed text-only analysis and achieved an accuracy of 83.4%, precision of 82.0%, recall of 81.5%, and an F1-score of 81.7%. The comparison indicates that the multimodal approach improves performance.
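
As a quick arithmetic check on the figures reported above, the sketch below recomputes F1 as the harmonic mean of the stated precision and recall and tabulates the gain, in percentage points, of the multimodal system over the text-only baseline. The helper name `f1_score` and the dictionary layout are illustrative assumptions, not part of the original system; the numbers are those reported in Sections 4.6 and 4.7.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)


# Values as reported in the paper, in percent.
reported = {
    "multimodal": {"accuracy": 92.5, "precision": 90.8, "recall": 91.2, "f1": 91.0},
    "text_only": {"accuracy": 83.4, "precision": 82.0, "recall": 81.5, "f1": 81.7},
}

# Recompute F1 from precision and recall and compare with the reported value.
for name, m in reported.items():
    recomputed = f1_score(m["precision"], m["recall"])
    print(f"{name}: reported F1 {m['f1']:.1f}, recomputed {recomputed:.1f}")

# Gain of the multimodal system over the baseline, in percentage points.
gains = {
    metric: round(reported["multimodal"][metric] - reported["text_only"][metric], 1)
    for metric in ("accuracy", "precision", "recall", "f1")
}
print(gains)
```

The recomputed F1 values agree with the reported ones after rounding to one decimal place, and the gains come out at roughly nine percentage points on each metric.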
Combining text, audio, and video analysis enabled the system to detect contextual information and fine-grained cues that text-only models miss, which led to improved accuracy and reliability.

4.8. Real-Time Performance

Real-time performance was also tested, based on the average processing time per live stream and video chat session. Text analysis took 50 ms, audio analysis 100 ms, and video analysis 120 ms, for a total average processing time of 270 ms. These results demonstrate the system's ability to process and analyze information in real time with low latency. Optimized deep learning architectures and parallel computing techniques played a crucial role in achieving these processing times, enabling efficient detection and mitigation of hate speech in live streams and video chats.

4.9. Error Analysis

Error analysis was conducted to identify the common causes of misclassification. The primary source of errors was context-dependent hate speech: cases where context was essential for correct classification but the system lacked sufficient contextual information. Ambiguous sentences were also problematic, because the same sentence can carry different meanings depending on the context. Excessive background noise in audio data was another major source of error, since it reduced the accuracy of acoustic feature extraction. Knowing these sources of error shows where future enhancements should be made to improve the overall performance of the system.

The error analysis points to several such enhancements. Contextual understanding can be improved by using more contextual data, such as the history of the conversation, to classify context-specific hate speech better.
Conversation history would enable the system to better capture the nuances and subtleties of a conversation, leading to more accurate detection. In addition, improving noise robustness is necessary to address high background noise in audio data, which degrades the accuracy of acoustic feature extraction. With advanced noise reduction techniques and stable feature extraction mechanisms, the system could process and analyze audio in noisy environments far more effectively. Combined with continued testing and model refinement, these changes would further strengthen the system's real-time hate speech detection.

5. Conclusion

This work presents a real-time hate speech detection system for Amharic live streaming and video chat based on deep learning techniques. By leveraging multimodal data, the system offers an end-to-end solution for hate speech on real-time communication platforms. Experimental outcomes indicate that the system can be implemented effectively and feasibly in real-world use. The system's performance metrics (accuracy, precision, recall, and F1-score) reflect its capacity to identify hate speech accurately across modalities, and it performs significantly better than baseline text-only models.

By integrating text, audio, and video analysis, the system extracts contextual information and subtle cues that single-modality models miss, and therefore offers a more reliable detection process. This integrated approach not only enhances accuracy in hate speech detection but also contributes to the creation of a safer and more inclusive cyberspace. This result highlights the value of further research and development in multimodal hate speech detection, particularly for under-resourced languages like Amharic.
Further studies can explore other data modalities and additional improvements in noise robustness and contextual awareness to further enhance the performance of the system.

6. Future Work

Future research can explore the integration of other sources of information, such as contextual data and sentiment analysis, to further improve the accuracy of hate speech identification. Sentiment analysis would give insight into the emotional tone of the content, making it possible to distinguish between neutral, positive, and negative expressions, which is critical for accurate hate speech identification. Combining historical conversation data with context-aware models would improve the system's ability to understand the nuances and subtleties of conversations, leading to more precise detection.

Another avenue for future work is extending the system to other under-resourced languages. Building hate speech models for languages that currently lack sufficient resources and research funding would significantly expand the scope and reach of the system, which in turn would lead to safer online communities globally. This could involve constructing multilingual datasets and using transfer learning to adapt models trained on well-resourced languages to under-resourced ones. In addition, collaboration with linguistic experts and local communities may help in developing accurate and culturally appropriate datasets and models.

Finally, future studies could investigate real-time deployment strategies involving edge computing and cloud-based systems to ensure scalability and accessibility. Adding adaptive learning mechanisms, through which the system continuously learns from new data and user feedback, would also improve its resilience and accuracy in the long run.
With this focus, future research can produce more precise and comprehensive hate speech detection models, making online discourse healthier and more respectful in multicultural and multilingual settings.

Declarations

Author Contribution

Baye Atnafu solely contributed to the conceptualization, methodology, and implementation of the research titled "Real-Time Amharic Hate Speech Detection in Live Streams and Video Chats." He was responsible for the development and fine-tuning of the deep learning models, data collection, and annotation processes. Baye Atnafu also conducted the experiments, analyzed the results, and drafted the manuscript. Additionally, he coordinated the entire research project and ensured adherence to the highest ethical and scientific standards. All aspects of this work were carried out independently by Baye Atnafu, who takes full responsibility for the integrity and accuracy of the research findings.

References

Davidson, T., Warmsley, D., Macy, M. & Weber, I. Automated Hate Speech Detection and the Problem of Offensive Language. ICWSM (2017).
Waseem, Z. & Hovy, D. Hateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter. NAACL-HLT (2016).
Warner, W. & Hirschberg, J. Detecting Hate Speech on the World Wide Web. Proceedings of the Second Workshop on Language in Social Media (2012).
Fortuna, P. & Nunes, S. A Survey on Automatic Detection of Hate Speech in Text. ACM Computing Surveys (CSUR) (2018).
Zhou, X., Reid, E., Qin, J., Chen, H. & Lai, G. US Domestic Extremist Groups on the Web: Link and Content Analysis. IEEE Intelligent Systems (2016).
Waseem, Z., Davidson, T., Warmsley, D. & Weber, I. Understanding Abuse: A Typology of Abusive Language Detection Subtasks. Proceedings of the First Workshop on Abusive Language Online (2017).
Gao, Z., Yada, S., Wakamiya, S. & Aramaki, E. Offensive Language Detection on Video Live Streaming Chat. Proceedings of the 28th International Conference on Computational Linguistics (2020).
Pookpanich, P. & Siriborvornratanakul, T. Offensive language and hate speech detection using deep learning in football news live streaming chat on YouTube in Thailand. Social Network Analysis and Mining (2024).
Mao, J., Shi, H. & Li, X. Research on multimodal hate speech detection based on self-attention mechanism feature fusion. The Journal of Supercomputing (2024).
Watanabe, H., Bouazizi, M. & Ohtsuki, T. Hate Speech on Twitter: A Pragmatic Approach to Collect Hateful and Offensive Expressions and Perform Hate Speech Detection. IEEE Access (2018).
Vetagiri, A., Halder, E., Das Majumder, A., Pakray, P. & Das, A. Multilate: A Synthetic Dataset for Multimodal Hate Speech Detection. SSRN (2024).
Schmidt, A. & Wiegand, M. A Survey on Hate Speech Detection using Natural Language Processing. Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media (2017).
Salminen, J. et al. Developing an Online Hate Classifier for Multiple Social Media Platforms. Human-Centric Computing and Information Sciences (2018).

Additional Declarations

No competing interests reported.
We do this by developing innovative software and high quality services for the global research community. Our growing team is made up of researchers and industry professionals working together to solve the most critical problems facing scientific publishing. Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-6028931","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Article","associatedPublications":[],"authors":[{"id":441493525,"identity":"145f212f-d365-48c3-a4a2-a274ff142a6a","order_by":0,"name":"Baye Atnafu Ferede","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAA+klEQVRIiWNgGAWjYBACAziL+QDjAx4wK4EoLUCKLYHZAKEFjzZkLWwSRGkxZz9j+ICh5k8efxuPWcXbnMMM/Ow5Bgwff+DWYtmTY2zAcMygWOIYj9nNudsOM0j2vDFgnIHPYQdyzCQY2AwSG+73mN3mBWoxuJFjwMyDT8v5N+Y/GP4ZJM4H2lIM0mIP0vIHn5YbOWYMjG0GiRuAWpjBtkgAteDzvuWMZ8USjH3GiRuPsRVLzt2WziNx5lnBwZ403FrM+ZM3fmD4Jpc47xjzxg9vt1nL8bcnb3zwwwa3FgYGDqDLkbjgqDmATwMDA/sD/PKjYBSMglEwCgCJ5U+KeM1o1wAAAABJRU5ErkJggg==","orcid":"","institution":"Debre Markos University","correspondingAuthor":true,"prefix":"","firstName":"Baye","middleName":"Atnafu","lastName":"Ferede","suffix":""}],"badges":[],"createdAt":"2025-02-14 08:53:32","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-6028931/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-6028931/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":96363425,"identity":"b4a609d5-ddb0-4bee-9576-c1264712e092","added_by":"auto","created_at":"2025-11-20 
10:06:47","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":387118,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-6028931/v1/7e4020ff-6de4-4342-9f76-00e9d211af65.pdf"}],"financialInterests":"No competing interests reported.","formattedTitle":"Real-Time Amharic Hate Speech Detection in Live Streams and Video Chats","fulltext":[{"header":"1. Introduction","content":"\u003cp\u003eThe development of social media and live streaming platforms has revolutionized communication, where users can post and communicate in real time. However, this innovation also bred the spread of hate speech, which poses serious challenges to online communities. Hate speech can be a cause of violence, discrimination, and social conflict, hence their detection and removal are essential (Davidson et al., \u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e2017\u003c/span\u003e; Schmidt \u0026amp; Wiegand, \u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e2017\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eAmharic is among the prevalent languages spoken in Ethiopia and has seen a surge in online content creation and consumption. Despite this, few studies have been specifically focused on hate speech detection in Amharic, particularly in live streaming and video chatting. The majority of current studies have focused on text-based hate speech detection, creating a knowledge gap regarding the expression of hate speech in real-time audiovisual communication (Waseem et al., \u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e2017\u003c/span\u003e; Salminen et al., \u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e2018\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eThe lacuna in current research is addressed in this paper through the suggestion of an Amharic real-time hate speech detection system using deep learning techniques. 
The system draws on multimodal information, combining text, audio, and video analysis to identify hate speech. By exploiting the complementary strengths of these modalities, it takes a comprehensive approach to Amharic hate speech in real-time communication channels.\u003c/p\u003e \u003cp\u003eIdentifying hate speech is vital for maintaining a safe and inclusive online environment. Effective detection systems can limit the spread of hateful material, protect at-risk communities, and foster appropriate social behavior. Traditionally, hate speech has been moderated manually, an approach that is resource-intensive and too slow to act on content in a timely fashion (Warner \u0026amp; Hirschberg, \u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e2012\u003c/span\u003e). These concerns motivate the development of automated detection systems.\u003c/p\u003e \u003cp\u003eReal-time hate speech detection poses several challenges. First, hate speech can be highly context-specific, and detection systems must be capable of picking up nuances in speech, tone, and intent (Fortuna \u0026amp; Nunes, \u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e2018\u003c/span\u003e). Second, real-time detection requires processing large amounts of data with very low latency, which is computationally intensive. Third, combining more than one data modality, i.e., text, audio, and video, makes the problem more complex but offers a more holistic solution to hate speech detection (Zhou et al., \u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e2016\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eThis paper contributes to the literature by proposing a new approach for identifying hate speech in Amharic live streams and video chats in real time. 
The system employs deep learning techniques to handle multimodal data, offering a robust and efficient solution to the problem. Experimental results demonstrate the efficacy of the system in various real-world scenarios, suggesting that it is practical to deploy.\u003c/p\u003e"},{"header":"2. Related Work","content":"\u003cp\u003ePrevious research on hate speech detection has been largely text-based. Researchers have explored various machine learning and deep learning approaches to identify hate speech in tweets, comments, and online articles. For instance, Davidson et al. (\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e2017\u003c/span\u003e) used a logistic regression model to identify hate speech on Twitter. Similarly, Waseem and Hovy (\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2016\u003c/span\u003e) developed a dataset of hate speech and abusive language on Twitter that has been widely used for training and testing hate speech models.\u003c/p\u003e \u003cp\u003eNevertheless, hate speech identification in live streaming and video chat is under-researched, especially for low-resource languages like Amharic. Gao et al. (\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e2020\u003c/span\u003e) presented a proof of concept for identifying objectionable content in Twitch live stream chats and suggested using transfer learning methods to improve identification accuracy. Pookpanich and Siriborvornratanakul (\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e2024\u003c/span\u003e) applied transformer language models to the identification of hate speech in YouTube live stream chat, with promising F1 and recall values.\u003c/p\u003e \u003cp\u003eRecent multimodal analysis has also shown promise in increasing the accuracy of hate speech detection. 
By integrating text, audio, and video data, researchers have been able to develop systems that better capture the context and nuances of hate speech.\u003c/p\u003e \u003cp\u003eZhou et al. (\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e2016\u003c/span\u003e) demonstrated the potential of multimodal systems by developing a model that fuses textual and visual features to outperform text-only models in detecting hate speech in memes. This study showed the importance of considering the interplay between text and image, especially when analyzing content like memes, where images contribute to the meaning or intent. Later, Mao et al. (\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e2024\u003c/span\u003e) proposed a self-attention mechanism for fusing multimodal features, advancing the field further by successfully integrating text and image data. These studies demonstrate the need for multimodal research to improve the accuracy and robustness of hate speech detection systems in the face of complex and subtle online content.\u003c/p\u003e \u003cp\u003eAmharic hate speech identification is a comparatively unexplored field, and Watanabe et al. (\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e2018\u003c/span\u003e) were among the first to venture there by creating a text-based dataset. The research emphasized the key issues facing low-resource languages like Amharic, which include scarce annotated data, numerous dialects, and the language's rich linguistic structure. All of these issues compound the difficulty of training robust models for hate speech identification. The lack of attention to Amharic and similar underrepresented languages underlines the need for more research and resource development to close the gap in global hate speech moderation and to ensure inclusivity and fairness in moderating toxic online content.\u003c/p\u003e \u003cp\u003eVetagiri et al. 
(\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e2024\u003c/span\u003e) proposed a novel synthetic multimodal dataset for hate speech detection to counter the growing volume of AI-generated hateful content. This Amharic text-image dataset contributes to the field by offering a way to examine the complex interplay between visual and textual information in hate speech. With the addition of features like pixel-level temperature maps and adversarial examples, the dataset not only enhances the explainability of detection models but also improves their adversarial robustness. The resource is well placed to drive hate speech moderation research, particularly for underrepresented languages like Amharic, towards a safer and more inclusive digital environment.\u003c/p\u003e \u003cp\u003eDavidson et al. (\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e2017\u003c/span\u003e) used a crowd-sourced hate speech lexicon to label tweets as hate speech, offensive language, or neutral content. Their results showed that racist and homophobic tweets tended to be classified as hate speech, while sexist tweets tended to be classified as offensive language. This revealed how challenging it is to build classifiers capable of identifying fine-grained classes of abusive language. Schmidt and Wiegand (\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e2017\u003c/span\u003e) surveyed feature extraction techniques for hate speech detection, such as lexical and syntactic features, and outlined their limitations, such as the inability to capture context and the reliance on predetermined keywords, which can lead to false positives.\u003c/p\u003e \u003cp\u003eWaseem et al. 
(\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e2017\u003c/span\u003e) proposed a typology of abusive language detection subtasks, categorizing phenomena such as hate speech, cyberbullying, and trolling. Their model emphasized the intersecting nature of these categories and the need for tailored annotation guidelines to improve model accuracy. Salminen et al. (\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e2018\u003c/span\u003e) moved the field ahead by developing a classifier capable of analyzing hate speech across multiple social media platforms. Their work demonstrated the value of including multimodal data and advanced techniques like BERT, which improved detection accuracy by modeling contextual nuances and cross-platform variation. Altogether, the studies reflect the evolution of the methodologies for detecting hate speech from basic keyword-based approaches to powerful multimodal and cross-platform ones, making it possible to further develop more potent and complete solutions.\u003c/p\u003e \u003cp\u003eThis work extends these improvements to create an Amharic real-time hate speech detection system. Utilizing deep learning methodologies and multimodal data, the system is expected to offer an end-to-end solution to hate speech in live streams and video chats. Merging text, audio, and video analysis capabilities allows the system to detect explicit and implicit forms of hate speech, enhancing accuracy and robustness.\u003c/p\u003e"},{"header":"3. Methodology","content":"\u003cp\u003eThe proposed Amharic hate speech detection system has three main modules: text, audio, and video, each employing state-of-the-art deep learning techniques to analyze and process the corresponding data modalities. The system integrates the three modules in order to effectively detect hate speech for real-time applications such as video calling and live streaming. 
For text processing, a 128-unit Bidirectional Long Short-Term Memory (BiLSTM) network is employed, using word embeddings pre-trained on a large Amharic text corpus and then fine-tuned. The audio module combines speech-to-text transcription using the Google Speech-to-Text API with acoustic features such as Mel-Frequency Cepstral Coefficients (MFCC) and spectral contrast. These features are analyzed by a three-layer Convolutional Neural Network (CNN) with progressively larger filters to identify hate speech indicators. The video component uses ResNet-50, trained on the FER+ dataset for emotion detection and fine-tuned for the task, to detect facial expressions, gestures, and other contextual visual features. To achieve end-to-end real-time processing, the system relies on optimized deep learning architectures, fault-tolerant data ingestion pipelines, and parallel computing techniques, thus enabling low-latency inference and scalability. Individual module outputs are fused by a Multimodal Transformer model with 12 attention heads and hidden layer dimension of. The fusion captures sophisticated cross-modal interactions, applying self-attention mechanisms to produce the final classification of whether hate speech is present. It provides an end-to-end view of the multimodal data and enhances the stability of the system. The key hyperparameters are the Adam optimizer with a learning rate of 0.0001, a weight decay of 1e-5, a dropout rate of 0.3, and early stopping at 30 epochs on validation loss. The batch size is 64 for text and audio and 32 for video because of memory constraints. The loss function is composed of cross-entropy loss for the individual modules and a joint loss for multimodal fusion. 
By integrating advanced techniques in text, audio, and video processing, this system provides a real-time, scalable solution for hate speech detection and prevention and can be applied to live streaming and video conferencing applications.\u003c/p\u003e \u003cdiv id=\"Sec4\" class=\"Section2\"\u003e \u003ch2\u003e3.1. Text Analysis\u003c/h2\u003e \u003cp\u003eText data from comments and live chat messages are processed using natural language processing (NLP) techniques. A deep learning model, namely a Bidirectional Long Short-Term Memory (BiLSTM) network, is trained on a labeled corpus of Amharic text to classify hate speech.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec5\" class=\"Section2\"\u003e \u003ch2\u003e3.2. Audio Analysis\u003c/h2\u003e \u003cp\u003eAudio data from live streams undergo speech-to-text conversion and acoustic feature extraction. The converted text is processed with the same NLP pipeline as in the text analysis module. Acoustic features, such as Mel-Frequency Cepstral Coefficients (MFCC), are also extracted and passed through a Convolutional Neural Network (CNN) to detect hate speech.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec6\" class=\"Section2\"\u003e \u003ch2\u003e3.3. Video Analysis\u003c/h2\u003e \u003cp\u003eVideo content from live streams is processed using computer vision techniques. Facial emotion, hand movements, and other visual cues are processed by a deep learning model, i.e., a CNN, in order to identify hate speech. The video analysis module runs alongside the text and audio analysis modules to obtain an overall understanding of the content and to detect hate speech in real time from Amharic live streams and video conversations.\u003c/p\u003e \u003c/div\u003e"},{"header":"4. Experimental Results","content":"\u003cp\u003eThe dataset was carefully annotated by domain experts to ensure proper labeling of explicit, implicit, and context-dependent hate speech. 
The annotations also covered different modalities, such as text, audio, and visual cues.\u003c/p\u003e \u003cdiv id=\"Sec8\" class=\"Section2\"\u003e \u003ch2\u003e4.1. Dataset Description\u003c/h2\u003e \u003cp\u003eThe dataset built for the system is an extensive collection of Amharic live stream recordings and video chat sessions. Specifically, it comprises 1,000 hours of live stream recordings and 500 hours of video chat sessions with diverse content. This diversity ensures that the dataset covers a wide range of contexts and situations in which hate speech is likely to occur, from casual conversation and socializing to heated debates and disputes.\u003c/p\u003e \u003cp\u003eTo ensure the quality and reliability of the dataset, domain experts carefully labeled the data as hate speech or non-hate speech. The annotation covered explicit hate speech, implicit hate speech, and context-dependent samples, making the dataset a rich resource for training and testing the system. The dataset includes text, audio, and visual signals and is therefore suitable for multimodal analysis.\u003c/p\u003e \u003cp\u003eFor training, validation, and testing, the dataset was split into three subsets in a ratio of 80:10:10. The training set, comprising 80% of the data, was used to train the deep learning models. A further 10% served as the validation set, used to tune the models and prevent overfitting. The remaining 10% formed the test set, which was used to evaluate the system's performance and its ability to generalize.\u003c/p\u003e \u003cp\u003eThe video chat sessions and live stream recordings were gathered from various platforms to obtain a diverse representation of user-generated content. 
These involved recorded webinars, private video chat sessions, and public live streams on social media platforms, providing a complete evaluation of the performance of the system under realistic scenarios.\u003c/p\u003e \u003cp\u003eOverall, the dataset's size, diversity, and meticulous annotation render it a valuable tool for training and testing the proposed real-time hate speech detection system for Amharic video conferencing and live streaming.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec9\" class=\"Section2\"\u003e \u003ch2\u003e4.2. Data Availability\u003c/h2\u003e \u003cp\u003eThe datasets generated and/or analyzed during the current study are not publicly available due to privacy and ethical considerations, but are available from the corresponding author upon reasonable request.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec10\" class=\"Section2\"\u003e \u003ch2\u003e4.3. Ethical Considerations\u003c/h2\u003e \u003cp\u003eAll the processes conducted within this study were reviewed and approved by the Debre Markos University colleague peer review committee. Informed consent was obtained from all the study participants and their guardians prior to their involvement in the study. This is to ascertain that participants are fully aware of the aim and method of the study, and have consented to take part under well-specified ethical standards. Furthermore, the study observes strict confidentiality protocols to ensure privacy and personal information of all the subjects are protected. All methods were performed in accordance with the relevant guidelines and regulations to ensure ethical conduct and the integrity of the research. This includes adherence to institutional protocols, national and international regulations, and the specific guidelines provided by Debre Markos University.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec11\" class=\"Section2\"\u003e \u003ch2\u003e4.4. 
Evaluation Metrics\u003c/h2\u003e \u003cp\u003eThe system's performance was measured with four standard metrics: Accuracy, the proportion of all instances, hate speech and non-hate speech alike, that are classified correctly; Precision, the proportion of true positive hate speech instances among all instances flagged as hate speech; Recall, the proportion of true positive hate speech instances among all actual instances of hate speech; and F1-score, the harmonic mean of precision and recall, which summarizes the overall performance of the system.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec12\" class=\"Section2\"\u003e \u003ch2\u003e4.5. Experimental Setup\u003c/h2\u003e \u003cp\u003eThe experiments were conducted on a high-performance computing system with multiple GPUs. The text analysis component used a BiLSTM network, the audio analysis component used a CNN to extract acoustic features, and the video analysis component used a CNN for facial expression and object recognition. Multimodal fusion was carried out using a Multimodal Transformer model.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec13\" class=\"Section2\"\u003e \u003ch2\u003e4.6. Results and Discussion\u003c/h2\u003e \u003cp\u003eOn the test set, the proposed system achieved an accuracy of 92.5%, a precision of 90.8%, a recall of 91.2%, and an F1-score of 91.0% in real-time hate speech detection. These results indicate that the system detects hate speech accurately across modalities and support the claim that the multimodal approach improves detection accuracy compared to text-only models.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec14\" class=\"Section2\"\u003e \u003ch2\u003e4.7. 
Baseline Model Comparison\u003c/h2\u003e \u003cp\u003eTo quantify the improvement contributed by the multimodal approach, the proposed system was compared against a text-only baseline. The baseline system performed text-only analysis and achieved an accuracy of 83.4%, a precision of 82.0%, a recall of 81.5%, and an F1-score of 81.7%. Relative to this baseline, the multimodal system's reported 92.5% accuracy and 91.0% F1-score correspond to gains of 9.1 and 9.3 percentage points, respectively. Combining audio and video analysis with text analysis enabled the system to pick up contextual information and fine-grained cues that text-based models miss, which led to improved accuracy and reliability.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec15\" class=\"Section2\"\u003e \u003ch2\u003e4.8. Real-Time Performance\u003c/h2\u003e \u003cp\u003eThe system's real-time performance was also tested by measuring the average processing time per live stream and video chat session. Text analysis took 50 ms, audio analysis 100 ms, and video analysis 120 ms, for a total average processing time of 270 ms (the sum of the three stages). These results demonstrate the system's capability to process and analyze information in real time with low latency. The use of optimized deep learning architectures and parallel computing techniques played a crucial role in achieving these low processing times, enabling efficient detection and mitigation of hate speech in live streams and video chats.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec16\" class=\"Section2\"\u003e \u003ch2\u003e4.9. Error Analysis\u003c/h2\u003e \u003cp\u003eError analysis was conducted to identify the common causes of misclassification. The primary source of error was context-dependent hate speech, where context was essential for correct classification but the system lacked sufficient context information. Ambiguous sentences were also problematic, because the same sentence can carry different meanings depending on the context. 
Excessive background noise in audio data was another primary source of error, as it degraded the accuracy of acoustic feature extraction. These sources of error point to where future enhancements should be made to improve the overall performance of the system. First, contextual understanding can be improved by using more contextual data, such as the history of the conversation, to classify context-specific hate speech better. This would enable the system to better learn the nuances and subtleties of conversation, leading to more accurate detections. Second, noise robustness must be improved to address the high background noise in audio data that degrades acoustic feature extraction. With advanced noise reduction techniques and more stable feature extraction mechanisms, the system would be considerably better at processing and analyzing audio in noisy environments. Complemented by continued testing and model refinement, these changes would further strengthen the system's potential for real-time hate speech detection.\u003c/p\u003e \u003c/div\u003e"},{"header":"5. Conclusion","content":"\u003cp\u003eThis work presents a real-time hate speech detection system for Amharic live streams and video chats using deep learning techniques. By leveraging multimodal data, the system offers an end-to-end solution for hate speech in real-time communication platforms. The experimental results indicate that the system is effective and feasible to deploy in real-world use. The system's high accuracy, precision, recall, and F1-score reflect its capacity to accurately identify hate speech in various modalities, performing significantly better than baseline text-only models. 
By integrating text, audio, and video analysis, the system is able to extract contextual information and subtle cues that single-modality models miss, and it therefore offers a more reliable detection process. This integrated approach not only enhances accuracy in hate speech detection but also contributes towards a safer and more inclusive cyberspace. These results underline the value of further research and development in multimodal hate speech detection, particularly for under-resourced languages like Amharic. Further studies can explore other data modalities and improvements in noise robustness and contextual awareness to further enhance the performance of the system.\u003c/p\u003e"},{"header":"6. Future Work","content":"\u003cp\u003eFuture research can explore the integration of other sources of information, such as contextual data and sentiment analysis, to further enhance the accuracy of hate speech identification. Sentiment analysis would give insight into the emotional tone of the content, helping to differentiate between neutral, positive, and negative terms, which is critical for accurate hate speech identification. The combination of historical conversation data and context-aware models would enhance the system's ability to understand the nuances and subtleties of conversations, leading to more precise detections. Another possible avenue for future work is the extension of the system to other under-resourced languages. Constructing hate speech models for languages that currently lack sufficient resources and research funding would significantly expand the scope and reach of the system, which in turn would result in safer online communities globally. This could involve constructing multilingual datasets and employing transfer learning to adapt models trained on well-resourced languages to under-resourced languages. 
In addition, collaboration with linguistic experts and the local population may help in developing accurate and culture-specific datasets and models. Last but not least, future studies would be investigating real-time deployment plans involving edge computing and cloud-based systems in order to ensure scalability and accessibility. Adding adaptive learning mechanisms, through which the system continuously learns from new information as well as user feedback, would also enhance its resilience and accuracy in the long run. Through such focus, future research can enable more precise and comprehensive hate speech detection models to be developed, allowing for online discourse to be healthier and more respectful in multicultural and multilingual settings.\u003c/p\u003e"},{"header":"Declarations","content":"\u003ch2\u003eAuthor Contribution\u003c/h2\u003e\u003cp\u003eBaye Atnafu solely contributed to the conceptualization, methodology, and implementation of the research titled \"Real-Time Amharic Hate Speech Detection in Live Streams and Video Chats.\" He was responsible for the development and fine-tuning of the deep learning models, data collection, and annotation processes. Baye Atnafu also conducted the experiments, analyzed the results, and drafted the manuscript. Additionally, he coordinated the entire research project and ensured adherence to the highest ethical and scientific standards. All aspects of this work were carried out independently by Baye Atnafu, who takes full responsibility for the integrity and accuracy of the research findings.\u003c/p\u003e\u003ch2\u003eData Availability\u003c/h2\u003e\u003cp\u003eThe datasets generated and/or analyzed during the current study are not publicly available due to privacy and ethical considerations, but are available from the corresponding author upon reasonable request.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eDavidson, T., Warmsley, D., Macy, M. 
\u0026amp; Weber, I. Automated Hate Speech Detection and the Problem of Offensive Language. ICWSM. (2017).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWaseem, Z. \u0026amp; Hovy, D. \u003cem\u003eHateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter\u003c/em\u003e (NAACL-HLT, 2016).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWarner, W. \u0026amp; Hirschberg, J. Detecting Hate Speech on the World Wide Web. \u003cem\u003eProceedings of the Second Workshop on Language in Social Media\u003c/em\u003e. (2012).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFortuna, P. \u0026amp; Nunes, S. A Survey on Automatic Detection of Hate Speech in Text. \u003cem\u003eACM Computing Surveys (CSUR)\u003c/em\u003e. (2018).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhou, X., Reid, E., Qin, J., Chen, H. \u0026amp; Lai, G. \u003cem\u003eUS Domestic Extremist Groups on the Web: Link and Content Analysis\u003c/em\u003e (IEEE Intelligent Systems, 2016).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWaseem, Z., Davidson, T., Warmsley, D. \u0026amp; Weber, I. Understanding Abuse: A Typology of Abusive Language Detection Subtasks. \u003cem\u003eProceedings of the First Workshop on Abusive Language Online\u003c/em\u003e. (2017).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGao, Z., Yada, S., Wakamiya, S. \u0026amp; Aramaki, E. Offensive Language Detection on Video Live Streaming Chat. Proceedings of the 28th International Conference on Computational Linguistics. (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePookpanich, P. \u0026amp; Siriborvornratanakul, T. \u003cem\u003eOffensive language and hate speech detection using deep learning in football news live streaming chat on YouTube in Thailand\u003c/em\u003e (Social Network Analysis and Mining, 2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMao, J., Shi, H. 
\u0026amp; Li, X. \u003cem\u003eResearch on multimodal hate speech detection based on self-attention mechanism feature fusion\u003c/em\u003e (The Journal of Supercomputing, 2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWatanabe, H., Bouazizi, M. \u0026amp; Ohtsuki, T. \u003cem\u003eHate Speech on Twitter: A Pragmatic Approach to Collect Hateful and Offensive Expressions and Perform Hate Speech Detection\u003c/em\u003e (IEEE Access, 2018).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eVetagiri, A., Halder, E., Das Majumder, A., Pakray, P. \u0026amp; Das, A. Multilate: A Synthetic Dataset for Multimodal Hate Speech Detection. SSRN. (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDavidson, T., Warmsley, D., Macy, M. \u0026amp; Weber, I. Automated Hate Speech Detection and the Problem of Offensive Language. \u003cem\u003eICWSM\u003c/em\u003e. (2017).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSchmidt, A. \u0026amp; Wiegand, M. A Survey on Hate Speech Detection using Natural Language Processing. \u003cem\u003eProceedings of the Fifth International Workshop on Natural Language Processing for Social Media\u003c/em\u003e. (2017).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSalminen, J. et al. Developing an Online Hate Classifier for Multiple Social Media Platforms. \u003cem\u003eHuman-Centric Computing and Information Sciences\u003c/em\u003e. 
(2018).\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"Amharic, Real-Time, Live Streams, Video Chats, Deep Learning, Multimodal Analysis","lastPublishedDoi":"10.21203/rs.3.rs-6028931/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-6028931/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eThis work proposes a novel approach to real-time hate speech detection in Amharic live streams and video chats using state-of-the-art deep learning techniques. The extensive use of social media and live communication tools has amplified hateful and offensive content, and thus it is essential to develop efficient mitigation strategies. The system uses multimodal data, including text, audio, and video, to efficiently detect and mitigate hate speech. By integrating multiple sources of data, the system provides a general overview of the content, identifying both implicit and explicit hate speech. 
The text analysis module employs a Bidirectional Long Short-Term Memory (BiLSTM) network to process chat messages and comments, while the audio analysis module employs Convolutional Neural Networks (CNN) to extract and process acoustic features. The computer vision module of video analysis detects visual cues of hate speech. Integration of the modalities enables the system to output robust and reliable results. Experimental results confirm the effectiveness of the proposed system for various real-world applications, including live streaming sessions and interactive video conversations. The system achieved high accuracy, precision, recall, and F1-score, proving its applicability for real-time deployment. This paper contributes to the corpus of hate speech detection and provides valuable insights for developing similar systems for other low-resource languages.\u003c/p\u003e","manuscriptTitle":"Real-Time Amharic Hate Speech Detection in Live Streams and Video Chats","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-04-14 12:15:51","doi":"10.21203/rs.3.rs-6028931/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"8624ac0a-4312-487d-b2ba-dfba92f3ad5c","owner":[],"postedDate":"April 14th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[{"id":47009546,"name":"Physical sciences/Energy science and technology"},{"id":47009547,"name":"Physical 
sciences/Engineering"}],"tags":[],"updatedAt":"2025-11-18T02:53:45+00:00","versionOfRecord":[],"versionCreatedAt":"2025-04-14 12:15:51","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-6028931","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-6028931","identity":"rs-6028931","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}

