Hybrid Deep Learning Architecture for Efficient Human Activity Recognition: A CNN-Attention-BiLSTM Framework

Research Square | Research Article

Purba Mukhopadhyay, Sudipta Saha, Koushik Majumder, Saikat Basu

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-6536118/v1
This work is licensed under a CC BY 4.0 License.
Status: Posted (Version 1)

Abstract

Human Activity Recognition (HAR) has emerged as a critical research area in Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL) because of its extensive applications across many domains. Demand is growing for robust HAR models that can accurately identify human activities.
This study aims to advance the field by introducing a novel hybrid model that integrates Convolutional Neural Networks (CNN), attention mechanisms, and Bidirectional Long Short-Term Memory (BiLSTM) networks. This “CNN-Attention-BiLSTM” model is designed to capture both spatial and temporal features, improving feature extraction and focusing the model on the most relevant features. We evaluated the proposed model on the widely used UCI-HAR dataset. The results show that the model achieves an activity classification accuracy of 93%. To assess the reliability and validity of our findings, we applied cross-validation and detailed classification reports. The model met these validation criteria, which supports its effectiveness.

Keywords: Human Activity Recognition, Deep Learning, Spatio-Temporal, Convolutional Block Attention Module, Accuracy, Validation

1 Introduction

Human Activity Recognition (HAR) has become one of the most widely studied research areas (Ariza-Colpas et al. 2022). It has attracted considerable attention because of its numerous applications in fields such as medical care, disease prediction, robotics, sports, and video surveillance. Automatically identifying and understanding human actions from sensory data is a critical task in artificial intelligence, and it has the potential to transform several industries. HAR-enabled individualized health monitoring, behavioral analysis, and real-time activity monitoring foster a deeper understanding of human behavior and an improved quality of life. According to a report published by the United Nations (UN), there are expected to be 2 billion elderly people worldwide by 2050.
However, elderly individuals require extra care and attention, since most of them live with multiple diseases. Real-time monitoring of individuals’ physical activities, especially their daily living activities (DLAs) (Gu et al. 2021), is an essential component of smart healthcare and can significantly improve eldercare and medical rehabilitation. Daily activities strongly influence many serious diseases, so monitoring day-to-day physical activity also provides a crucial health indicator. Classifying and identifying human physical activities is a common way to track, evaluate, and interpret postures across a wide range of systems and applications (Chen et al. 2021). HAR recognizes activities such as running, walking, sleeping, sitting, and standing. Experimental data can be acquired from sensors (wearable, wireless, etc.), accelerometers, images, or video frames.

Sensor-based HAR frameworks include smartphone-sensor-based, audio/video-based, and body-worn-sensor-based systems (Ramanujam et al. 2021). Among these, body-worn sensors may be uncomfortable for users because they must be placed at several locations on the body, and collecting audio or video input raises privacy issues. Moreover, signals from both body-worn devices and audio/video sources require complex pre-processing to remove unwanted noise. Long-range audio signals are often corrupted by white noise or background noise, so an audio input at a given moment may fail to provide valuable insight, and differentiating between two audio signals also becomes difficult (Fu et al. 2020).
Therefore, audio input alone is not always sufficient for distinctly identifying even basic human activities. Collecting video data, especially in crowded locations, may be problematic because of physical obstacles or low brightness (Pareek and Thakkar, 2021). Sensor data for inferring transportation modes and human activities can also be acquired from smartphones. HAR systems based on smartphone sensors are motivated by their discretion, ubiquity, low deployment cost, usefulness, and non-invasiveness (Straczkiewicz et al. 2021; Ronao and Cho, 2016). Smartphones can gather input continuously while any kind of physical action is performed. In addition, the built-in sensors of modern phones have made tracking health-related data more accurate and convenient, and these sensors supply the measurements that HAR models rely on. Among mobile sensors, gyroscopes and accelerometers are the most widely used (Bulbul et al. 2018; Voicu et al. 2019).

Datasets derived from smartphone sensors are multivariate time series, whose basic feature is local dependency. Furthermore, human activity signals are translation-invariant and hierarchical, and they carry dynamic information about the underlying system (Li et al. 2023). Accurate modeling of these high-dimensional datasets is therefore increasingly needed. Physical activities also have several unique characteristics, so HAR involves various methodological issues, including imbalanced datasets, inter-class similarity, intra-class variability, and the empty class problem (Yao et al. 2018).
Current HAR systems based on smartphone sensor data rely on traditional machine learning (ML) algorithms as well as deep learning (DL) methods to recognize human activities efficiently (Cvetković et al. 2018). In conventional ML methods, feature engineering is one of the most important phases, because it extracts the underlying features that differentiate distinct activity patterns. The performance of such HAR systems depends heavily on effective feature engineering of the raw input signals. The extracted features are then fed to classifiers to identify human activities. Without relevant features, traditional classifiers perform poorly and fail to identify human activities accurately (Atalaa et al. 2021). Complex pre-processing is therefore needed to bring sensory data into a proper form, and extracting handcrafted features from sensory data requires considerable domain expertise. Several studies also report that handcrafted features do not work for all models and can perform poorly in recognition tasks (Tasmin et al. 2020). In addition, different research domains require distinct handcrafted feature vectors to handle classification problems properly. For these reasons, most researchers currently prefer DL algorithms.
HAR has a wide range of applications. It is most extensively applied in the medical domain and in caring for elderly people, where tracking their records supports a better and more secure lifestyle (Yadav et al. 2021). HAR can also be applied to monitoring and controlling crime. Recognizing everyday activities can support smart home technology, and detecting driving behaviors helps promote safe transportation. HAR can also be used to identify military operations. Other application domains include entertainment, autonomous driving, surveillance and security, and human-robot interaction (Dang et al. 2020). The basic objective of HAR is to recognize different kinds of human actions and activities in both controlled and uncontrolled settings.

1.1 Motivation

Many recent HAR models use deep learning rather than traditional ML algorithms, because ML algorithms require handcrafted features (Cao et al. 2021). DL models, by contrast, extract and learn features automatically, which is an added advantage in HAR systems. DL algorithms can extract important features efficiently without manual intervention while simultaneously recognizing human actions (Côté-Allard et al. 2020). DL methods have proved outstanding at prediction in various domains, such as intelligent gaming systems and image and speech recognition (Sultana et al. 2020). They have also made an outstanding contribution to the HAR literature, and most current HAR research applies and investigates different types of DL algorithms.
Recognizing distinct, complex human activities requires several steps that identify all the relevant features accurately, so that the HAR model can reach its desired outcome. One of the most important of these steps is the extraction of relevant features. Human activities consist of two types of features, spatial and temporal, and both are equally important for recognizing a specific activity (Lee et al. 2020). Extracting these spatio-temporal characteristics from smartphone-based sensory input is therefore essential. This calls for proper feature extraction procedures together with data pre-processing techniques that convert raw, noisy input signals into clean, usable data.

Numerous DL methods have been applied in HAR architectures to obtain the spatial and temporal characteristics needed to identify human actions. Among them, the CNN is one of the most widely used models and is well suited to extracting spatial and local trends in data. In the HAR domain, a CNN architecture brings several advantages that improve the classification cost and accuracy of activity classifiers. CNNs learn spatial characteristics of data points very efficiently and can interpret spatial hierarchies from input data such as images or sensor signals (accelerometers or gyroscopes) automatically, without manual pre-processing. Because the input to HAR models is generally image data or sensory data, CNNs are a natural fit. CNNs can also learn relevant features that are invariant to minor translations in the input data.
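The translation invariance mentioned above can be illustrated with a minimal sketch. It is not the paper's model: the signal values, the hand-picked filter, and the function names (`conv1d`, `relu`, `global_max_pool`) are assumptions chosen for illustration. The sketch shows that a convolution followed by global max pooling gives the same feature value when the same motion pattern occurs earlier or later in a sensor window.

```python
def conv1d(signal, kernel):
    """Valid 1D cross-correlation of a signal with a kernel."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def global_max_pool(values):
    """Collapse the time axis by keeping only the strongest response."""
    return max(values)


# Hypothetical single-axis accelerometer window: a short bump in a flat signal.
bump = [1.0, 3.0, 1.0]
early = [0.0] * 2 + bump + [0.0] * 7  # bump starts at sample 2
late = [0.0] * 6 + bump + [0.0] * 3   # same bump, shifted by 4 samples

kernel = [1.0, 2.0, 1.0]  # hand-picked filter that responds to the bump shape

resp_early = relu(conv1d(early, kernel))
resp_late = relu(conv1d(late, kernel))

# The peak response moves with the bump ...
peak_early = resp_early.index(max(resp_early))
peak_late = resp_late.index(max(resp_late))

# ... but the pooled feature is identical for both windows.
feat_early = global_max_pool(resp_early)
feat_late = global_max_pool(resp_late)
print(peak_early, peak_late, feat_early, feat_late)
```

In a trained CNN the filters are learned rather than hand-picked, and local pooling layers give invariance only to small shifts; global pooling is used here to keep the effect easy to see.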
For HAR models, this invariance matters for capturing slight body movements, so that even small motions are captured and understood by the model (Islam et al. 2022). To extract higher-level features, CNNs stack layers that perform operations such as convolution and pooling. This lets a CNN capture both low-level spatial patterns (e.g., corners) and high-level abstract patterns (e.g., movement sequences, postures), which are equally important to model learning. CNNs can also handle multimodal data: by using several data streams (gyroscope, accelerometer) or by integrating sensory data with image-based data, they can process several types of input (Sinha et al. 2020). In short, building HAR models on CNNs is advantageous because CNNs learn discriminative spatial patterns automatically, tolerate translations, and process multimodal inputs, all without manual feature extraction.

Besides extracting local characteristics of sensory data, it is equally important to retrieve the temporal dependencies of the data accurately. RNNs are widely used deep networks in the HAR domain, and among them LSTMs play a major role in capturing temporal relationships. However, the temporal dependencies of human body movements also include bi-directional patterns, and a standard LSTM may fall short on complex activities that involve them. To mitigate this problem, a Bi-LSTM can be used, since it captures temporal trends from both directions, past and future (Jang et al. 2020). Capturing bi-directional trends is necessary for identifying activities that involve forward and backward movements.
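The difference between a one-directional and a bi-directional recurrent layer can be sketched in a few lines. The example below is a simplification under stated assumptions: it uses a single-unit tanh recurrent cell instead of a gated LSTM cell, and the weights, the input sequence, and the function names (`run_rnn`, `run_bidirectional`) are hypothetical. It only illustrates the wiring: one pass reads the window forward, a second pass reads it backward, and the two states are paired at every time step.

```python
import math


def run_rnn(sequence, w_in, w_rec):
    """Run a one-unit tanh recurrent cell and return the state at every step."""
    h = 0.0
    states = []
    for x in sequence:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states


def run_bidirectional(sequence, fwd_weights, bwd_weights):
    """Pair a forward pass with a backward pass, aligned by time step."""
    forward = run_rnn(sequence, *fwd_weights)
    # Read the sequence from the end, then flip the states back into time order.
    backward = run_rnn(sequence[::-1], *bwd_weights)[::-1]
    return list(zip(forward, backward))


# Hypothetical sensor window with a single spike in the middle.
sequence = [0.0, 0.0, 1.0, 0.0, 0.0]
states = run_bidirectional(sequence, (1.0, 0.5), (1.0, 0.5))

for t, (fwd, bwd) in enumerate(states):
    print(t, round(fwd, 4), round(bwd, 4))
```

At the first time step the forward state is still zero, because nothing has happened yet, while the backward state is already non-zero because it has seen the upcoming spike. A Bi-LSTM gives the classifier this kind of future context at every step, with gated cells in place of the plain tanh cell used here.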
Hence, Bi-LSTM models are now applied alongside LSTM models. It is also important to extract features selectively, so that unwanted features do not enter model training. For this purpose, attention mechanisms have recently been used in most cases (Li et al. 2020). Attention mechanisms are DL methods that assign more weight to relevant features and less weight to irrelevant ones. Motivated by these components (Bi-LSTM, CNN, and attention), this article proposes an integrated model for identifying human activities. The literature reviewed in this study also indicates that the proposed combination is novel. The step-by-step architecture of the proposed model involves the following three phases:

- CNN-Attention-Bi-LSTM module
- Understanding CBAM operation
- A GAP layer followed by dropout and Softmax layers

1.2 Contribution

The main contributions of this article are as follows:

- The paper presents and discusses several aspects of the domain of human activity recognition.
- An extensive literature review of DL-based HAR frameworks is performed to help readers understand the topic and to identify potential gaps in the literature.
- An efficient hybrid DL-based model is proposed, consisting of three DL components (attention, CNN, and Bi-LSTM), to extract essential spatio-temporal characteristics of smartphone-based sensor data.
- The effectiveness of the proposed system is justified through experiments, validation techniques, and performance metrics.
- Finally, the obtained results are compared with the existing literature.

2 Related Work

This section presents a comprehensive review of the related literature.
Various works in the field of HAR are evaluated in order to identify research gaps and possible future directions that can make this domain more productive. HAR research includes both ML and DL approaches. In earlier work, researchers applied classic ML algorithms to measure the recognition accuracy of HAR models (Barna et al. 2019). In these ML-based systems, numerous feature selection and/or extraction procedures were applied before the collected data was fed to classifiers that identify human behaviors. However, ML models depend on handcrafted feature extraction, which requires domain expertise and manual intervention and therefore increases time complexity (Wang et al. 2021). To overcome these disadvantages, researchers have turned to DL-based mechanisms, which extract features automatically without human interference. This part therefore surveys DL-based human activity recognition systems.

DL for HAR

Thakur et al. (2022) proposed a smartphone-based hybrid DL architecture for recognizing human activities. The hybrid architecture, “ConvAE-LSTM”, consists of three deep learning models: CNN, auto-encoders (AE), and LSTM. CNNs extract useful features automatically and capture spatial features, AEs reduce dimensionality, and LSTMs capture temporal sequences. The unified model thus forms a complementary architecture that covers spatio-temporal characteristics and dimensionality reduction.
Four standard public datasets are used in the experiments. Two of them are smartphone-based (WISDM, UCI), and the other two are based on body-worn sensors (OPPORTUNITY, PAMAP2). The results are validated using recall, F1 score, precision, and accuracy together with leave-one-subject-out (LOSO) cross-validation. However, the model could be improved by addressing bi-directional dynamic activities more precisely.

Wang et al. (2020) presented a deep architecture capable of learning local features and modeling time dependencies between features automatically, without manual intervention. The authors built a hybrid model that combines a convolutional neural network (CNN) with a long short-term memory (LSTM) recurrent network. The CNN extracts relevant features from the collected sensor data, and the LSTM captures long-term dependencies between two activities to further improve the HAR identification rate. The resulting CNN-LSTM model, based on wearable sensors, is proposed to detect several human activities and the associated transitions accurately. Smartphone accelerometer and gyroscope data are used, and the experiment was performed on the “HAPT” dataset. However, the handling of transitions could be developed further by tuning the model’s parameters and components.

Xia et al. (2020) proposed a deep network that combines convolutional layers with an LSTM model. LSTM is a variant of the recurrent neural network (RNN) that is better at processing temporal features or sequences, while the CNN captures local spatial dependencies of the data. This hybrid “LSTM-CNN” model automatically extracts activity features and classifies them with fewer model parameters.
Three widely used public mobile-sensor datasets, WISDM, UCI, and OPPORTUNITY, are used for the experiments and analysis. However, the model does not address bidirectional activities or the selection of proper features, so it leaves room for further development and improved results.

Mim et al. (2023) proposed a DL model, “GRU-INC”, for recognizing human activities. The model is an “Inception-Attention” based method combined with a Gated Recurrent Unit (GRU). The combination is effective at capturing the spatial and temporal information of time-series data. A combination of GRU and attention extracts temporal features, while Inception together with the Convolutional Block Attention Module (CBAM) extracts spatial representations. Several human activities were examined using the public datasets OPPORTUNITY, WISDM, PAMAP2, UCI-HAR, and Daphnet. However, the model does not adequately address long-range dependencies and bi-directional movements, and only one metric is used for performance analysis. The model could be improved further by tuning its hyperparameters.

Mutegeki and Han (2020) presented a novel deep activity recognition model, the “Convolutional neural network-long short-term memory network” (CNN-LSTM) architecture. It is a hybrid of two deep architectures, CNN and LSTM. The proposed method is applied to two datasets, iSPL (3-activity) and UCI HAR (6-activity), to evaluate its applicability and performance. Performance is evaluated using metrics such as accuracy and cross-entropy.
However, the model is time-consuming and less effective at selecting essential features, so it can be improved further.

Xu et al. (2019) proposed a deep neural architecture, InnoHAR, that combines a Gated Recurrent Unit (GRU) with an Inception network. The model accepts waveform data from multi-channel sensing devices end-to-end. The GRU is employed to model time-series data and features effectively. Among RNNs, the GRU is popular for its simple architecture and its ability to sense temporal relationships between data points. To retrieve spatial features from the sensor waveform data, the Inception module of GoogLeNet is used. The experiment was performed on three datasets, OPPORTUNITY, PAMAP2, and SMARTPHONE, and performance was evaluated using the F-measure, which accounts for both recall and precision. The experimental outcomes indicate that InnoHAR, based on the Inception-like model, outperforms both CNN (9% improvement on OPPORTUNITY and 3% on PAMAP2) and DeepConvLSTM (5% improvement on OPPORTUNITY and 3% on PAMAP2).

Khan et al. (2022) proposed a hybrid DL model that combines two commonly used DL networks, CNN and LSTM, to achieve better recognition performance in indoor environments. CNNs are mainly used to extract spatial features, whereas LSTMs focus on learning temporal dependencies. With this in mind, the authors presented a hybrid of the two in order to obtain improved performance.
For evaluation, a self-collected dataset is used. Its instances were recorded from 20 participants with a Kinect V2 sensor, which can extract 25 distinct joints of the human body, and it covers 12 distinct human activity classes. The proposed model obtained an accuracy of 90.89%, which the authors compared with other existing deep models.

Nafea et al. (2021) introduced a method that combines a convolutional deep model (CNN) with differing kernel dimensions and a bi-directional long short-term memory network to capture features at several resolutions. The work focuses on effectively extracting temporal and spatial patterns from sensory data, and on an optimal representation of video using a classic CNN and BiLSTM. Two datasets (WISDM and UCI) are used, whose data collection involves sensors, accelerometers, and gyroscopes. However, the time consumption of this model could be reduced, and features could be selected more effectively by applying suitable mechanisms.

Abdel-Basset et al. (2020) proposed a dual-channel supervised model, “ST-deepHAR”, consisting of an LSTM network followed by an attention mechanism that fuses the temporal nature of inertial sensory data, together with a convolutional ResNet that extracts the spatial dependencies of the data. In addition, an adaptive channel-squeezing operation is introduced to fine-tune the convolutional feature extraction ability of the network by exploiting multi-channel dependency. After the spatio-temporal features are retrieved, they are concatenated and passed through a multilayer perceptron and a softmax layer for the final classification decision. Two publicly available HAR datasets (WISDM, UCI HAR) are used to evaluate the performance of the proposed architecture.
However, the model does not address the data imbalance problem, and its time consumption is also a concern. Future work could take these factors into account.

Challa et al. (2023) developed a hybrid DL model that recognizes several human movements captured with IMU sensors. The hybrid model consists of a CNN and Bi-LSTM units that extract temporal sequences and spatial characteristics simultaneously from the raw sensory data. In addition, a meta-heuristic optimization method, “Rao-3”, is adopted to identify suitable hyperparameter values for the hybrid architecture and thereby enhance model performance. Three widely used HAR datasets, UCI HAR, MHEALTH, and PAMAP2, are used to evaluate classification performance. However, the framework is a complex architecture, and the additional hyperparameter optimization increases computation time. Moreover, the optimization technique may not produce satisfactory outcomes in all HAR cases. The model therefore lacks robustness and interpretability.

Kumar et al. (2023) proposed a deep learning-based framework for the efficient detection of anomalous human activities. The framework combines three DL components, CNN, Bi-LSTM, and attention, to identify distinctive spatio-temporal trends in the data. The analysis was performed on three datasets: UCF50, UCF11, and subUCF crime. However, flexibility and robustness can be improved further, and more work on the datasets may increase model accuracy.

Singh et al. (2019) introduced a low-complexity two-stream DL model that uses raw RGB sequences along with their “dynamic motion images (DMIs)” to recognize complex human behaviors.
The RGB frames are processed by a pre-trained Inception-v3 network with an attached CNN-LSTM that is trained end-to-end. For the dynamic image stream, some of the last layers of the pre-trained network are fine-tuned. The features extracted by the two streams are then max-fused to increase classification accuracy. For evaluation, the authors used the dyadic SBU Interaction dataset and the single-person MIVIA Action dataset. However, the model is a complex architecture with high computing time. Modifying its components and hyperparameters could reduce dimensionality and hence computation time.

Nguyen et al. (2023) proposed CBiAM, a 1D-CNN and Bi-LSTM model followed by an attention mechanism, for recognizing the states of cyclists using smartphones. The aim is to enhance safety and promote a secure cycling experience by avoiding accident or emergency risks. A newly created dataset, “cycling safe (CySa)”, was used for the experiments. It contains data on various actions of cyclists during cycling, collected with smartphones placed in their pockets. The CBiAM system was trained on the CySa dataset with varying window sizes, learning rates, and batch sizes. However, the mechanism uses a fixed sample length that is not suitable for complex activities, so information loss may occur. Moreover, the model does not adequately address contextual information and transitional states. Future work should reduce these limitations to enhance model performance.

Ige and Noor (2023) presented a new parallel deep architecture, DLT, based on the idea of pipeline concatenation.
In the proposed system, each pipeline consists of two sub-pipelines: the first is a 1D-CNN that learns local features, and the second is a Bi-LSTM that learns temporal dependencies. The feature maps are merged and channel attention is integrated. The experiment was conducted on two publicly available HAR datasets, WISDM and PAMAP2. However, the model is a complex architecture comprising several pipelines and sub-pipelines, each with multiple layers, which increases the risk of overfitting and the time consumption. The model therefore has limitations in generalization and computational demands, and it can be improved by further modifications.

Alo et al. (2020) proposed a deep stacked model for recognizing human activities that involves an auto-encoder algorithm. The aim of the paper is a deep model based on auto-encoders together with orientation-invariant features for identifying complex human activities. The deep stacked architecture uses auto-encoders to extract crucial human behaviors, improve model accuracy, and reduce over-fitting. The data was taken from a smartphone accelerometer. The model draws on the advantages of the auto-encoder, the sparse auto-encoder, the softmax classifier, and other components to obtain better performance. To analyze model performance, the authors used several metrics, such as recall, accuracy, specificity, and the confusion matrix. The proposed model achieved an accuracy of 97.13%, compared against traditional ML algorithms and a deep belief network. However, its performance could also be checked against other DL algorithms.

Ni et al.
(2020) presented a novel deep learning framework that uses stacked denoising auto-encoders (SDAE) to recognize dynamic activities, static behaviors, and transitional activities. The experimental setup was designed to acquire three types of day-to-day activities (twelve daily activities in total) using wearable sensors. The records were collected from 10 adults in the smart lab of Ulster University. The SDAE, a deep model that extracts features automatically, was used for the experiments. Performance was measured using precision, accuracy, recall, and F1 score. However, because the experiment was held in a controlled environment, the model may lack generalization and robustness. Addressing these shortcomings could improve the model in future. Table 1 summarizes the deep learning based literature reviewed above. From the table, it can be seen that WISDM and UCI are the most widely used standard, publicly available smartphone-based datasets. It can also be observed that most of the studies use an attention mechanism to select useful features. Almost all models extract both the spatial and the temporal dependencies of the sensory inputs in order to recognize different human behaviors efficiently. For this purpose, most of the reviewed papers use a CNN for spatial feature extraction and an LSTM (sometimes a GRU) for temporal dependencies. However, to obtain better model performance, every aspect of the data points must be given equal emphasis.
Accordingly, an efficient HAR model should capture both the spatial and the temporal characteristics of the input with close attention. A hybrid model is therefore needed that covers all these aspects equally and can comprehend every representation of the data points. Although CNNs are widely used in HAR models, they are generally prone to overfitting. A Global Average Pooling (GAP) layer can mitigate this risk and also helps to improve model performance. GAP layers can also be applied in place of fully-connected layers, which reduces time consumption and can increase model accuracy. For temporal feature extraction, LSTMs are the most preferred deep learning models. To retrieve temporal features more prominently, extra attention is required to keep the focus on the most relevant features. However, Bi-LSTM and Bi-GRU models can sometimes be more effective than LSTM and GRU models. LSTMs and GRUs capture data flow in one direction only, whereas Bi-LSTMs and Bi-GRUs retain the data flow in both directions, that is, from past and future instances. In this sense, Bi-LSTM and Bi-GRU may be more beneficial, as they can capture and remember relevant patterns from both. Taking all these factors into consideration, this article proposes a hybrid model involving a CNN module, a Bi-LSTM network, and an attention mechanism to address the gaps in the literature. Among attention modules, a special kind of attention called “CBAM” is applied in the model. In addition, an LSTM is also evaluated so that its accuracy can be compared with that of the proposed model.
It is expected that the proposed architecture will achieve enhanced classification accuracy, with proper justification of the model performance.

3 Preliminaries

In this research work, a hybrid deep model is presented that combines three distinct deep learning networks: an attention mechanism, a Bi-LSTM (BLSTM) model, and a CNN model. The main idea is to retrieve the local and temporal features of the input data efficiently. To interpret the working mechanism of the model, it is necessary to understand each of its components and the associated parameters separately. The following subsections therefore describe the required components briefly. Fig. 1 shows the mechanism of the proposed hybrid DL framework.

Table 1 HAR systems based on Deep Learning (sensor entries are listed in the same order as the datasets)

Reference | Dataset | Sensor | Classifier
Thakur et al. (2022) | WISDM; UCI; OPPORTUNITY; PAMAP2 | Smartphone; Smartphone; Body-worn; Body-worn | CNN + Auto-encoder + LSTM
Wang et al. (2020) | HAPT (“Human Activities and Postural Transitions”) | Body-worn | CNN + LSTM
Xia et al. (2020) | WISDM; UCI HAR; OPPORTUNITY | Smartphone; Smartphone; Body-worn | LSTM + CNN
Mim et al. (2023) | OPPORTUNITY; WISDM; PAMAP2; UCI HAR; Daphnet | Body-worn; Smartphone; Body-worn; Smartphone; Body-worn | GRU + Attention + Inception
Mutegeki and Han (2020) | iSPL; UCI HAR | Body-worn; Smartphone | CNN + LSTM
Xu et al. (2019) | OPPORTUNITY; PAMAP2; SMARTPHONE | Body-worn; Body-worn; Body-worn | GRU + Inception
Khan et al. (2022) | Self-collected | Body-worn | CNN + LSTM
Nafea et al. (2021) | WISDM; UCI HAR | Smartphone; Smartphone | CNN + Bi-LSTM
Abdel-Basset et al. (2020) | WISDM; UCI HAR | Smartphone; Smartphone | LSTM + Attention + ResNet
Challa et al. (2023) | UCI HAR; MHEALTH; PAMAP2 | Smartphone; Body-worn; Body-worn | CNN + Bi-LSTM
Kumar et al. (2023) | UCF50; UCF11; subUCF crime | Video data; Video data; Video data | CNN + Bi-LSTM + Attention
Singh et al. (2019) | SBU Interaction; MIVIA Action | Video data; Video data | Inception-V3 + CNN + LSTM
Nguyen et al. (2023) | CySa (self-made); OPPORTUNITY; UCI HAR; WISDM; MOTIONSENSE; PAMAP2 | Body-worn; Body-worn; Smartphone; Smartphone; Smartphone; Body-worn | CNN + Bi-LSTM + Attention
Ige and Noor (2023) | WISDM; PAMAP2 | Smartphone; Body-worn | CNN + Bi-LSTM + LSTM
Alo et al. (2020) | Self-collected | Smartphone | Auto-encoder
Ni et al. (2020) | Self-collected | Body-worn | Stacked Auto-encoder

3.1 CNN Model

The Convolutional Neural Network (CNN) is a popular DL model that is used extensively in domains such as speech recognition and image classification. In human activity recognition, CNNs have become one of the most suitable and efficient deep learning models, and most researchers currently prefer them for building HAR models. CNNs owe their success in several domains to their ability to learn spatial (locally connected) features (Zhang et al. 2021). A CNN generally consists of three types of layers: convolutional, pooling, and dense (fully-connected). The convolutional layers are the core of the model and perform feature extraction by applying filters to the input, the pooling layers reduce computation by down-sampling, and the fully-connected layer generates the final output. The network uses gradient descent and back-propagation to learn the most effective filters. Average pooling and max pooling layers are commonly used to perform averaging and local maximization operations on the input features, respectively. A convolutional neural network can have tens or even hundreds of layers, each trained to recognize a different characteristic of an image. Filters are applied to each training image at different resolutions, and the output of each convolution serves as the input to the next layer. The filters can begin with very basic features, such as brightness and edges, and grow in complexity to features that specifically identify the object (Alzubaidi et al. 2021). Figure 2 depicts the basic architecture of a CNN.

3.2 Bi-LSTM

Recurrent Neural Networks (RNNs) play an important role in human activity recognition. Among RNNs, LSTM networks are the most widely used; researchers apply them in HAR models to capture the temporal dependencies of the data. For effective feature selection and more accurate results, extracting temporal dependencies is as important as extracting spatial characteristics. Human actions are generally recorded as time-series sensory data, so temporal trends play a crucial role in modelling human movements. LSTMs retrieve these temporal characteristics from sensory data, including long-term dependencies. In HAR models, capturing short or long transitions is as important as capturing the actions themselves. Although LSTMs are good at capturing temporal features, they have some major drawbacks, and Bi-LSTM models were introduced to overcome them (Li et al. 2021). For recognizing complex human movements such as swimming, cycling, and walking, it is crucial to identify actions that depend on preceding and succeeding movements. LSTMs process the input in one direction only, while Bi-LSTMs process it in both forward and backward directions, allowing the model to capture contextual information from both past and future time steps (Naheliya et al. 2023).
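The difference between one-directional and bidirectional processing can be illustrated with a minimal Python sketch. It does not implement LSTM gates: a toy single-unit recurrent cell with hypothetical fixed weights (`w_x`, `w_h`) stands in for the LSTM, and the function names `simple_rnn_pass` and `bidirectional_pass` are illustrative only. The sketch only shows that, at each time step, the forward state summarizes the past while the backward state summarizes the future.

```python
import math


def simple_rnn_pass(sequence, w_x=0.5, w_h=0.8):
    """Run a toy one-unit recurrent cell (not an LSTM) over `sequence`.

    Returns one hidden state per time step. The weights are fixed,
    hypothetical values chosen only for illustration.
    """
    state, states = 0.0, []
    for x in sequence:
        state = math.tanh(w_x * x + w_h * state)
        states.append(state)
    return states


def bidirectional_pass(sequence):
    """Pair each step's forward (past-aware) and backward (future-aware) state."""
    forward = simple_rnn_pass(sequence)
    # Process the reversed sequence, then flip back to the original time order.
    backward = simple_rnn_pass(sequence[::-1])[::-1]
    return list(zip(forward, backward))


# A single "event" in the middle of an otherwise flat signal.
signal = [0.0, 0.0, 1.0, 0.0, 0.0]
states = bidirectional_pass(signal)
for step, (fwd, bwd) in enumerate(states):
    print(step, round(fwd, 3), round(bwd, 3))
```

At the first time step the forward state is still zero because the event has not yet occurred, whereas the backward state is already non-zero because it has seen the event in the "future". A one-directional model has access to the forward column only.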
Processing the sequence in both directions gives a more comprehensive view of the temporal dependencies of the input, which helps to improve the classification result. For these reasons, Bi-LSTMs are now considered more suitable than LSTM networks for most analytical tasks involving complex human activities. Figure 3 depicts the working of a Bi-LSTM, which is expressed mathematically by Eq. 1 (Anwar et al. 2023):

$$H_{i}^{l+1} = g\left(V_{f}\,\sigma\left(U_{f}\left[S_{f}^{l},\, O_{i}^{l+1}\right]\right) + V_{b}\,\sigma\left(U_{b}\left[S_{b}^{l},\, O_{i}^{l+1}\right]\right) + b\right), \quad i \in [1, N] \qquad (1)$$

Where,
\(S_{f}^{l}\): Information from past time steps of hidden states
\(S_{b}^{l}\): Information from future time steps of hidden states
\(U_{f}\): Input states embedded in two directions
\(U_{b}\): Hidden states embedded in two directions
\(g\): Activation function
\(b\): Bias
\(\sigma\): Sigmoid function

3.3 Attention Mechanism

In human activity detection, features play the most important role. Recognition accuracy and model efficiency depend on the proper selection of essential features, which makes feature identification and selection one of the most crucial steps in recognizing human behaviours efficiently. Detecting human movements requires selecting both temporal and spatial features, and it is important to evaluate and recognize the features that are crucial for the model. This is where the attention mechanism is needed: it pays extra “attention” to, and places emphasis on, the most relevant features (Niu et al. 2021). The attention mechanism is especially beneficial where not every piece of input is equally meaningful or informative.
In the HAR domain, most researchers currently prefer attention in order to concentrate on the particular time steps or movements that are most indicative of particular activities. Attention helps the model focus on the most relevant parts of the input data by highlighting the time steps or features that are crucial for distinguishing between activities. It can also improve the model's ability to recognize activities of varying duration or complexity (Chorowski et al. 2015). To leverage these advantages, a “Convolutional Block Attention Module (CBAM)” block, a special kind of attention module, is used in this paper. CBAM infers two distinct attention maps sequentially, along the channel and spatial dimensions, and multiplies each resulting attention map by its input feature map to refine the features. CBAM therefore consists of two sub-modules, the “Channel Attention Module (CAM)” and the “Spatial Attention Module (SAM)”, whose attention masks complement each other. Fig. 4 illustrates the working mechanism of CBAM. The overall attention process can be written as:

$$F' = M_{c}(F) \otimes F$$
$$F'' = M_{s}(F') \otimes F'$$

Where \(F \in \mathbb{R}^{C \times H \times W}\) is the input feature map, \(M_{c} \in \mathbb{R}^{C \times 1 \times 1}\) is the channel attention map, \(M_{s} \in \mathbb{R}^{1 \times H \times W}\) is the spatial attention map, and \(\otimes\) denotes element-wise multiplication.

3.4 GAP, Dropout, and Softmax Layers

In this paper, a “Global Average Pooling (GAP)” layer is used.
The main purpose of the GAP layer is to reduce the dimensionality of the proposed scheme. GAP layers reduce the number of model parameters and help the model converge faster. They perform an extreme form of dimensionality reduction, which lowers the computational cost of the model. Furthermore, the GAP layer itself has no parameters to optimize, so it helps to minimize the total number of model parameters. Moreover, because GAP layers aggregate spatial information, they are more robust to spatial alterations of the input. Dropout is a regularization technique that prevents overfitting by reducing the dependence of units on one another. In this study, a Dropout layer is used to mitigate the risk of overfitting. Lastly, the Softmax layer produces the final classification result of the proposed human activity recognition scheme.

4 Proposed Model

Here, a hybrid deep architecture is proposed for recognizing several human actions effectively. The proposed DL-based framework is a layered architecture that combines Bi-LSTM, CNN, and attention layers into a single classification framework. The Bi-LSTM-CNN model with an attached attention mechanism is proposed in order to obtain better predictive outcomes in the HAR domain. The working mechanism of the layered framework is elaborated step by step in this section. First, data must be collected in order to deploy the proposed algorithm and analyse the results. Multivariate time-series data are gathered from the built-in sensors of a smartphone for recognizing human activities. Sensors such as gyroscopes and tri-axial accelerometers provide fine-grained sensory data. Here, for evaluation, data collected through smartphone sensors are used.
The experiment is carried out on the UCI-HAR dataset. After data collection, the data are passed through the proposed hybrid scheme for recognition of human activities. The mechanism of the model is divided into two parts: one extracts the spatial features of the data, and the other retrieves its temporal dependencies. The data are first passed through the CNN layers for spatial feature extraction. Next, an attention layer is attached to the CNN to make sure that the desirable features are emphasized. The output of the CNN-attention stage is then sent to the Bi-LSTM layers for temporal extraction. After the spatio-temporal features have been retrieved, they are passed through the subsequent layers and finally to the softmax layer, which produces the outcome of the activity recognition model. The model is then validated using cross-validation, a comparison between the proposed model and existing approaches, and an evaluation of the performance metrics. Fig. 5 gives an overview of the proposed model with its components and layers. The model is elaborated step by step below, with an explanation of the contribution of each layer. Algorithm 1 lists the step-wise execution of the CBAM-enhanced model. The working mechanism and model definition can be divided into the following phases:

4.1 CBAM Module

The CBAM module is a special kind of attention mechanism. It incorporates channel attention and spatial attention and helps to increase the efficiency of the model. Here, CBAM is used as a main part of the proposed architecture with the aim of improving model performance.
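The two refinement steps of Section 3.3, F' = M_c(F) ⊗ F followed by F'' = M_s(F') ⊗ F', can be sketched in plain Python for a small sensor feature map of shape (time steps × channels). This is only an illustration under stated assumptions, not the implementation used in this work: the learned dense layer of the channel branch and the learned 1D convolution of the spatial branch are both replaced by a fixed sum of the pooled values, and all function names are hypothetical.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def channel_attention(feature_map):
    """M_c: one weight per channel, from average- and max-pooling over time."""
    n_steps, n_channels = len(feature_map), len(feature_map[0])
    weights = []
    for c in range(n_channels):
        column = [feature_map[t][c] for t in range(n_steps)]
        avg_pooled, max_pooled = sum(column) / n_steps, max(column)
        # Assumption: a plain sum stands in for the learned shared dense layer.
        weights.append(sigmoid(avg_pooled + max_pooled))
    return weights


def spatial_attention(feature_map):
    """M_s: one weight per time step, from average- and max-pooling over channels."""
    weights = []
    for row in feature_map:
        avg_pooled, max_pooled = sum(row) / len(row), max(row)
        # Assumption: a plain sum stands in for the learned 1D convolution.
        weights.append(sigmoid(avg_pooled + max_pooled))
    return weights


def cbam(feature_map):
    """Apply channel attention first, then spatial attention."""
    m_c = channel_attention(feature_map)
    # F' = M_c(F) * F (element-wise, broadcast over time steps)
    refined_c = [[m_c[c] * v for c, v in enumerate(row)] for row in feature_map]
    m_s = spatial_attention(refined_c)
    # F'' = M_s(F') * F' (element-wise, broadcast over channels)
    return [[m_s[t] * v for v in row] for t, row in enumerate(refined_c)]


# Toy input: 3 time steps x 2 channels; channel 1 carries the stronger signal.
feature_map = [[0.1, 2.0], [0.2, 1.5], [0.0, 3.0]]
refined = cbam(feature_map)
```

Because every attention weight lies between 0 and 1, the refined map keeps the shape of the input and only rescales it; in this toy input the second channel receives the larger channel weight.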
Channel attention and spatial attention both contribute to the performance of the model. While channel attention emphasizes or suppresses individual channels (features), spatial attention focuses on the spatial aspects, i.e. the time steps and features. Together, these two attention mechanisms allow the model to focus adaptively on both feature-level and spatial-level information, potentially enhancing the model’s performance and interpretability. In the proposed mechanism, channel attention is applied first, followed by spatial attention. With this order, the model can focus on the most relevant features before considering the spatial aspect of the input sequence, so that the subsequent spatial attention mechanism can use the most informative features more effectively. Applying channel attention first can also reduce the number of features that the spatial attention mechanism needs to process, which can improve computational efficiency and performance, especially for high-dimensional input data. This order (channel attention before spatial attention) can therefore be effective in obtaining the desired model performance.

4.1.1. Channel Attention

In the first phase of the proposed model, the Channel Attention layer is defined. It implements the channel attention part of the CBAM architecture. The module learns to suppress or emphasize individual channels (features) of the input tensor according to their importance. The Channel Attention method first computes the average-pooled and max-pooled representations of the input tensor along the sequence dimension. The average-pooled representation summarizes the overall response of each channel, while the max-pooled representation captures its most prominent response.
The average-pooled and max-pooled tensors are then concatenated and passed through a shared dense layer followed by a sigmoid activation to produce the channel attention weights. The input tensor is then multiplied element-wise by these weights, which emphasizes or suppresses specific channels (features) according to their importance. The resulting tensor, with channel attention applied, is returned as the output. The Channel Attention module thus learns to focus on the most informative features within the input tensor, potentially improving the model’s performance on sequence modeling tasks.

4.1.2. Spatial Attention

Next, the Spatial Attention part of CBAM is implemented. This is the other major component of the CBAM module. Spatial Attention learns a weighted combination of spatial features (across time steps and features) by applying a 1D convolution with a sigmoid activation along the time dimension of the input tensor. The convolution kernel learns to capture patterns and temporal dependencies in the data, and the sigmoid activation ensures that the learned weights lie between 0 and 1, so that they act as attention weights. By multiplying the input tensor by the learned weights, the model can selectively suppress or emphasize spatial locations in the input tensor. This allows the model to focus on the most relevant spatial information, potentially improving its performance and interpretability for sequence-based tasks.

4.2 CNN Layer

The CNN is another important part of the proposed architecture. It is a popular DL model for interpreting image data as well as sensory data.
Here, the “Convolutional Neural Network” is implemented using TimeDistributed layers together with Conv1D and pooling layers. The input layer of the CNN accepts a 4D tensor of shape (batch_size, n_steps, n_length, n_features), where n_steps, n_length, and n_features represent the number of time steps, the length of the sequence at each time step, and the number of features, respectively.

Algorithm 1: CBAM-enhanced CNN-BLSTM Algorithm
Input: trainX (training input data), trainy (training labels), testX (test input data), testy (test labels)
Output: Trained CBAM-enhanced model and its performance metrics
1 Initialization: load the dataset and define parameters:
  1.1 Call the load_dataset() function to obtain trainX, trainy, testX, and testy
  1.2 Extract the number of timesteps (n_timesteps), features (n_features), and output classes (n_outputs) from the data
  1.3 Define the number of steps (n_steps) and length (n_length) for reshaping the input data
2 Reshape the input data for TimeDistributed layers:
  2.1 Reshape trainX to (trainX.shape[0], n_steps, n_length, n_features)
  2.2 Reshape testX to (testX.shape[0], n_steps, n_length, n_features)
3 Build the CBAM-enhanced model:
  3.1 Create an Input layer with the shape (n_steps, n_length, n_features)
  3.2 Apply TimeDistributed Conv1D layers with ReLU activation and MaxPool1D layers
  3.3 Apply a TimeDistributed Dropout layer
  3.4 Apply the ChannelAttention module
  3.5 Apply a TimeDistributed Flatten layer
  3.6 Apply a Bidirectional LSTM layer with return_sequences = True
  3.7 Apply a Dropout layer
  3.8 Apply GlobalAveragePooling1D and BatchNormalization layers
  3.9 Apply the SpatialAttention module
  3.10 Apply a Dense layer with ReLU activation
  3.11 Apply the output Dense layer with softmax activation
  3.12 Create and compile the CBAM-enhanced model with categorical_crossentropy loss and Adam optimizer
4 Print the model summary
5 Train the model:
  5.1 Define the number of epochs, batch size, and verbose setting
  5.2 Call the fit() method on the CBAM-enhanced model, passing trainX, trainy, epochs, validation_data=(testX, testy), batch_size, and verbose

The first TimeDistributed layer applies a 1D convolution operation along the time dimension of the input tensor. It uses 128 filters with a kernel size of 4 and a ReLU activation function. The TimeDistributed layer ensures that the convolution operation is applied independently to each time step of the input tensor. After the convolution operation, a TimeDistributed MaxPool1D layer performs max pooling along the time dimension with a pool size of 2. This down-sampling operation reduces the spatial dimensions and introduces translation invariance. Next, another TimeDistributed Conv1D layer is applied with 128 filters and a kernel size of 4, followed by a ReLU activation function. This layer extracts higher-level features from the output of the previous layer. After that, a TimeDistributed Dropout layer with a dropout rate of 0.5 is applied to regularize the model and prevent overfitting. The CNN, combined with the attention mechanisms and recurrent layers, thus forms an architecture capable of capturing both local patterns and long-range dependencies in sequential data, while selectively focusing on the most relevant features and spatial locations.

4.3 Bi-LSTM Layer

Bi-LSTM layers are another important part of the proposed model. Bi-LSTM is a popular RNN model that captures time instances in both the forward and the backward pass. This property improves model performance by capturing the sequences more completely. The Bidirectional LSTM part of the model is implemented using the Bidirectional layer in combination with the LSTM layer. Before the Bi-LSTM layer is applied, the output of the convolutional and attention layers is flattened using the Flatten layer.
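To make the effect of the convolution, pooling, and flattening steps concrete, the following minimal Python sketch traces the tensor shape through the CNN block described in Section 4.2. The filter count (128), kernel size (4), and pool size (2) are taken from the text; everything else is an assumption made for illustration: convolutions without padding and with stride 1, non-overlapping pooling, and a hypothetical split of a 128-sample window into `n_steps = 4` sub-sequences of `n_length = 32` samples. The batch dimension is omitted, and the dropout and attention layers are skipped because they do not change the shape.

```python
def conv1d_out_len(length, kernel_size, stride=1):
    """Output length of a 1D convolution without padding."""
    return (length - kernel_size) // stride + 1


def maxpool1d_out_len(length, pool_size):
    """Output length of non-overlapping max pooling without padding."""
    return length // pool_size


def cnn_block_output_shape(n_steps, n_length, filters=128, kernel_size=4, pool_size=2):
    """Shape handed to the recurrent layer after the TimeDistributed CNN block."""
    length = conv1d_out_len(n_length, kernel_size)   # first Conv1D
    length = maxpool1d_out_len(length, pool_size)    # MaxPool1D
    length = conv1d_out_len(length, kernel_size)     # second Conv1D
    flattened = length * filters                     # Flatten within each sub-sequence
    return (n_steps, flattened)


# Hypothetical reshape: a 128-sample window split into 4 sub-sequences of 32.
shape = cnn_block_output_shape(n_steps=4, n_length=32)
print(shape)  # (4, 1408)
```

Under these assumptions each sub-sequence shrinks from 32 to 29, 14, and finally 11 positions, so flattening yields 11 × 128 = 1408 features per step, and the recurrent layer receives a sequence of 4 steps per sample.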
Flattening is necessary because the LSTM layer expects a 3D input tensor of shape (batch_size, time_steps, features). The flattened output is then passed through the Bidirectional LSTM layer. The Bidirectional layer wraps the LSTM layer and applies it in two directions, forward and backward, which allows the model to capture both past and future dependencies in the sequential data. The forward LSTM processes the input sequence from the first time step to the last, while the backward LSTM processes it in reverse order, from the last time step to the first. The outputs of the forward and backward LSTMs are then concatenated at each time step, creating a single output sequence that incorporates information from both directions. Finally, a Dropout layer with a rate of 0.5 is applied to regularize the model and prevent overfitting. Using a BLSTM layer can thus improve performance on tasks that require understanding the context of the entire sequence.

5 Performance Evaluation

In this section, we present the analytical results of the proposed method (CNN-Attention-BLSTM) on a smartphone-based dataset. The focus of this work is on smartphone-based sensors: the aim is to evaluate the effect of the proposed architecture on smartphone sensor data for human activity recognition. To establish the efficiency of the model, we report the experimental outcomes obtained on the UCI HAR dataset in terms of F1 score and accuracy.

5.1. Dataset Description

To evaluate the performance of the human activity detection model, a publicly available smartphone-based dataset, UCI HAR, is used in this study. The dataset is described as follows: UCI-HAR (Reyes-Ortiz et al.
2016): This standard dataset comes from the “University of California Irvine (UCI) Machine Learning” repository, which is openly accessible to the public. The dataset is balanced. It was gathered from thirty individuals, ranging in age from 19 to 48, who performed six distinct activities of daily living: “sitting”, “standing”, “walking”, “lying”, “walking upstairs”, and “walking downstairs” (Fig. 6). A “Samsung Galaxy S II” smartphone with an integrated gyroscope and accelerometer, positioned on the waist, was used to gather the data under supervision in a laboratory setting. The researchers measured the tri-axial angular velocity and tri-axial linear acceleration at a constant sampling rate of 50 Hz. The dataset consists of 10,299 instances; further details of the train and test sets are given in Table 2 and Table 3.

Table 2 Activities involved in UCI HAR train set

Activities | Samples | Percentage
Walking | 1226 | 16.7%
Sitting | 1286 | 17.5%
Standing | 1374 | 18.7%
Laying | 1407 | 19.1%
Walking Upstairs | 1073 | 14.6%
Walking Downstairs | 986 | 13.4%

Table 3 Activities involved in UCI HAR test set

Activities | Samples | Percentage
Walking | 496 | 16.8%
Sitting | 491 | 16.7%
Standing | 532 | 18.1%
Laying | 537 | 18.2%
Walking Upstairs | 471 | 16.0%
Walking Downstairs | 420 | 14.3%

From Fig. 7 a) and Fig. 7 b), it can be seen that UCI HAR is a balanced dataset, with a nearly equal percentage of data in each category for both the train and test sets. Here, class 1 represents walking, class 2 walking upstairs, class 3 walking downstairs, class 4 sitting, class 5 standing, and class 6 lying.

5.2. Experimental Setup

The proposed network was built using Keras, a high-level API for neural networks.
Keras is written in Python and can run on top of CNTK, TensorFlow, or Theano. The Keras API allows a model to go from the starting phase to the final result with the least delay. TensorFlow is used as the backend in this experiment. Training and validation were performed on a PC with an Intel(R) Core(TM) i3-7100U CPU at 2.40 GHz, 4.00 GB RAM with 16 GB memory, running a 64-bit Windows operating system. For the experiment, the dataset is first divided into a train set of 7352 samples and a test set of 2947 samples. The model is then trained in a fully supervised manner. In the training phase, a forward pass is performed on the train data to obtain the model output. The gradient is then back-propagated from the Softmax layer to the convolution layers. For each layer, the biases and weights are initialized with random values. The Adam optimizer is used to back-propagate the errors through the layers and update the model parameters (weights and biases). Adam is a stochastic optimization algorithm based on first-order gradients and is widely used as the optimizer in such models. The ReLU activation function is used in the layers of the model, with varying kernel sizes and filters. The hyper-parameters used to deploy the proposed network are listed in Table 4.

5.3. Experimental Results

In this section, the experimental results obtained by deploying the model are presented and discussed briefly. The model performance is evaluated on the UCI HAR dataset. The output of the proposed model is discussed with respect to performance metrics, the classification report, and related measures.
The results are reported in terms of accuracy, recall, precision, computational complexity, and testing time. A comparison with other commonly used deep learning models, namely CNN, LSTM, BLSTM, CNN-LSTM, and CNN-BLSTM, is also provided (Table 5). The comparison is performed in the same experimental environment in order to verify the efficiency of the model. Table 5 lists the training time, testing time, and accuracy of these DL models. The proposed model has the shortest testing time (65 ms), and its training time (273 ms) is close to that of the plain CNN (271 ms) and lower than that of the other models. In terms of accuracy, the proposed “CNN-Attention-BLSTM” model reaches 93%, which is higher than that of the other DL models compared. Table 6 presents the detailed classification results of the proposed scheme. The proposed architecture combines three popular DL models. After deploying the hybrid model, the classification results are obtained in terms of precision, recall, F1-score, and accuracy. The dataset has six classes corresponding to six different human activities, so the classification report gives a clear picture of the model performance for each class. For the dynamic activities, that is, walking, walking upstairs, and walking downstairs, the F1-scores are 99%, 93%, and 95%, respectively.
From this, it can be concluded that the proposed hybrid model distinguishes similar kinds of activity patterns very effectively, which is generally not achieved when these models are used individually. Similarly, for the static activities, namely sitting, standing, and laying, the obtained F1-scores are 84%, 87%, and 97%, respectively, which also compares well with the existing literature. These results show that the proposed mechanism not only differentiates effectively between dynamic and static activities, but also identifies similar types of patterns within each group. Apart from the F1-score, the classification report shows that the model also performs well in terms of recall and precision for both static and dynamic activities; the lowest values occur for sitting (precision 0.86, recall 0.82). The confusion matrix for the activities in the test set is shown in Fig. 8; it is obtained for the six activity classes using the multi-class classifier. Finally, it can be concluded that the proposed model successfully classifies the distinct human activities into their respective classes with an accuracy of 93%.
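The per-class precision, recall, and F1-score discussed above all follow from the confusion matrix. As a minimal sketch (not the authors' evaluation code), the function below derives them in plain Python. The function name `per_class_report`, the three-class matrix, and its counts are hypothetical and are not the values of Fig. 8.

```python
def per_class_report(cm, labels):
    """Per-class precision, recall and F1 from a square confusion matrix.

    cm[i][j] counts samples whose true class is i and predicted class is j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    report = {}
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum: predicted as k
        actual_k = sum(cm[k])                          # row sum: truly k (support)
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2.0 * precision * recall / denom if denom else 0.0
        report[labels[k]] = {"precision": precision, "recall": recall,
                             "f1": f1, "support": actual_k}
    accuracy = sum(cm[k][k] for k in range(n)) / total
    return report, accuracy

# Hypothetical counts for three static activities (illustration only).
labels = ["sitting", "standing", "laying"]
cm = [[48, 2, 0],
      [3, 40, 7],
      [0, 5, 45]]
report, accuracy = per_class_report(cm, labels)
for name, row in report.items():
    print(name, {key: round(val, 2) for key, val in row.items()})
print("accuracy", round(accuracy, 2))
```

The same relation can be checked against the reported figures: for walking, a precision of 1.00 and a recall of 0.98 give F1 = 2(1.00)(0.98)/(1.00 + 0.98) ≈ 0.99, which agrees with the value stated for that class. Recomputing from rounded precision and recall can differ from a reported F1 in the last digit.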
Table 4 Description of hyper-parameters
Phase | Layer | Hyper-parameter | Assigned value
Model architecture | Convolution_1 | Kernel size | 4
Model architecture | Convolution_1 | Filters | 128
Model architecture | Convolution_1 | Activation | ReLU
Model architecture | Max-pooling | Pool size | 2
Model architecture | Convolution_1 | Kernel size | 4
Model architecture | Convolution_1 | Filters | 128
Model architecture | Convolution_1 | Activation | ReLU
Model architecture | Bi-LSTM | Dropout | 0.5
Model architecture | Bi-LSTM | Neurons | 100
Model architecture | Bi-LSTM | Activation | ReLU
Model architecture | Output | Dropout | 0.5
Model architecture | Output | Neurons | 100
Model architecture | Output | Activation | ReLU
Training | - | Optimizer | Adam
Training | - | Epochs | 30
Training | - | Batch size | 150
Training | - | Loss | Categorical cross entropy
Training | - | Metrics | Accuracy

Table 5 Performance comparison with other models using UCI HAR dataset
Model | Training time | Testing time | Test accuracy (%)
CNN | 271 ms | 96 ms | 91.25
LSTM | 675 ms | 153 ms | 90.19
Bi-LSTM | 591 ms | 197 ms | 91.28
CNN-LSTM | 371 ms | 150 ms | 91
CNN-Attention-BLSTM | 273 ms | 65 ms | 93

Table 6 Classification report of proposed model on UCI HAR dataset
Class | Precision | Recall | F1-Score | Support
Walking | 1.00 | 0.98 | 0.99 | 496
Upstairs | 0.91 | 0.94 | 0.93 | 471
Downstairs | 0.91 | 1.00 | 0.95 | 420
Sitting | 0.86 | 0.82 | 0.84 | 491
Standing | 0.87 | 0.88 | 0.87 | 532
Laying | 1.00 | 0.95 | 0.97 | 537
Accuracy | - | - | 0.93 | 2947
Macro avg | 0.93 | 0.93 | 0.93 | 2947
Weighted avg | 0.93 | 0.93 | 0.93 | 2947

5.4. Performance Evaluation

After the experimental execution of the proposed model, it is necessary to check whether the model produces satisfactory classification output. Evaluating the model performance is therefore a crucial part of the overall procedure, since it verifies the capability of the model to classify human activities efficiently. To this end, we perform further analyses, namely cross-validation and an ablation study, to justify and analyze the model performance.

5.4.1. Cross Validation

Cross-validation is performed to verify the performance of the model on the dataset. It is a popular statistical method in machine learning for evaluating how well a predictive model performs on data that was not used to train it.
In this technique, the data is partitioned into training and test sets, where the training data is used to train the model and the held-out set is used to assess the model accuracy. The process is repeated several times with a different portion of the dataset held out each time, and the average performance is calculated. Cross-validation helps to detect overfitting and therefore yields a more reliable estimate of model performance. Among the several cross-validation methods, we apply the stratified k-fold method in this study. It is a variation of the k-fold method in which every fold maintains the same proportion of observations for each target class as the whole dataset. Because the class proportions are preserved in every fold, stratified k-fold limits the effect of class imbalance on the evaluation. Here, we take k = 5, and the outcome is presented in Table 7.

Table 7 Cross-validation result
Metric | Value
No. of folds | 5
Average train accuracy | 98.85%
Average validation accuracy | 98.35%
Average precision | 0.9840
Average recall | 0.9835
Average F1-score | 0.9835
Test accuracy | 92.91%
Test loss | 0.4581
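The stratified k-fold idea described in this subsection can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the procedure actually run for Table 7: the function name `stratified_k_fold`, the `seed` argument, and the toy label list are hypothetical. In practice a library routine such as scikit-learn's `StratifiedKFold` would normally be used.

```python
import random
from collections import defaultdict

def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs whose class mix mirrors the full label list."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)   # deal each class round-robin over the folds
    for i in range(k):
        val = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, val

# Toy labels (hypothetical, not UCI HAR data): 50 "walking" and 25 "sitting" windows.
labels = ["walking"] * 50 + ["sitting"] * 25
splits = list(stratified_k_fold(labels, k=5))
for train, val in splits:
    print(len(train), len(val), sum(labels[i] == "walking" for i in val))
```

Each validation fold keeps the 2:1 class ratio of the full toy set, which is the property that distinguishes stratified k-fold from plain k-fold. The model would be trained on `train` and scored on `val` once per fold, and the k scores averaged.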
5.4.2. Ablation Study

An ablation study is also performed in this work to demonstrate the specific contribution of each component of the hybrid architecture. The purpose of an ablation test is to investigate the performance of an AI model by removing its components one at a time, in order to understand how much each component contributes to the accuracy of the overall system. In this experimental framework, the ablation experiment is performed in three ways. First, we observe the output of each component of the proposed architecture individually, and then the performance of the combined hybrid model, to understand how the components affect the overall model performance (Table 8). In Table 8, the individual contributions of the CNN and the bidirectional LSTM are checked first. Next, the combined contribution of CNN and BLSTM with the added attention mechanism is checked, and it is clearly observed that the hybrid combination produces better predictions. After that, we also perform ablation on the CBAM mechanism (Table 9). We observe the model performance in three configurations: i) channel attention removed and spatial attention present, ii) spatial attention removed and channel attention present, and iii) both channel and spatial attention present. Table 9 shows that neither channel nor spatial attention alone is sufficient to obtain the desired model performance: the accuracy is 39.7552% with spatial attention only and 16.8178% with channel attention only, compared with 93% when both are present. Hence, the importance of each CBAM module for obtaining better model accuracy is clear.
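To make the three CBAM configurations concrete, the following is a simplified one-dimensional sketch of channel attention followed by spatial (here, temporal) attention, with flags that switch each branch off as in the ablation. It is an assumption-laden illustration and not the layer used in the proposed model: the function names, the flags `use_channel` and `use_spatial`, the toy input, and all weights (`w1`, `w2`, `kernel`) are hypothetical, and in a real network the weights would be learned.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def channel_attention(x, w1, w2):
    """Scale each channel by a weight derived from its average and max over time."""
    T, C = len(x), len(x[0])
    avg = [sum(x[t][c] for t in range(T)) / T for c in range(C)]
    mx = [max(x[t][c] for t in range(T)) for c in range(C)]

    def mlp(vec):
        # Shared two-layer perceptron with a ReLU hidden layer.
        hidden = [max(0.0, sum(w * a for w, a in zip(row, vec))) for row in w1]
        return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

    scale = [sigmoid(a + b) for a, b in zip(mlp(avg), mlp(mx))]
    return [[x[t][c] * scale[c] for c in range(C)] for t in range(T)]

def spatial_attention(x, kernel):
    """Scale each time step by a weight from a 1D convolution over avg and max maps."""
    T, C = len(x), len(x[0])
    avg = [sum(row) / C for row in x]
    mx = [max(row) for row in x]
    half = len(kernel) // 2
    out = []
    for t in range(T):
        z = 0.0
        for j, (k_avg, k_max) in enumerate(kernel):
            s = t + j - half
            if 0 <= s < T:                      # zero padding at the borders
                z += k_avg * avg[s] + k_max * mx[s]
        gate = sigmoid(z)
        out.append([value * gate for value in x[t]])
    return out

def cbam(x, w1, w2, kernel, use_channel=True, use_spatial=True):
    if use_channel:
        x = channel_attention(x, w1, w2)
    if use_spatial:
        x = spatial_attention(x, kernel)
    return x

# Toy feature map: 3 time steps x 2 channels, with hypothetical fixed weights.
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
w1 = [[0.5, 0.25]]                      # hidden layer: 1 unit, 2 inputs
w2 = [[0.2], [-0.2]]                    # output layer: 2 units, 1 input
kernel = [(0.1, 0.1), (0.2, 0.2), (0.1, 0.1)]
variants = {
    "spatial only": cbam(x, w1, w2, kernel, use_channel=False),
    "channel only": cbam(x, w1, w2, kernel, use_spatial=False),
    "channel + spatial": cbam(x, w1, w2, kernel),
}
```

The three entries of `variants` correspond to configurations i), ii), and iii) of the ablation. The sketch only shows how each branch reweights the feature map; it says nothing about the accuracies in Table 9, which depend on the trained model.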
Finally, we also examine how the classification output changes when hyper-parameters such as the batch size and the number of epochs are modified, in order to understand the importance of selecting proper model parameters (Tables 10 and 11). The proposed architecture achieves its highest accuracy (93.00%) with the selected batch size (150) and number of epochs (30).

Table 8 Ablation test result I
Model | Attention | Accuracy | F1-score
CNN | - | 91.25% | 0.91
Bi-LSTM | - | 91.28% | 0.91
CNN-BiLSTM | ✔ | 93% | 0.93

Table 9 Ablation test result II
Model | Channel Attention | Spatial Attention | Accuracy
CNN-Attention-BiLSTM | - | ✔ | 39.7552%
CNN-Attention-BiLSTM | ✔ | - | 16.8178%
CNN-Attention-BiLSTM | ✔ | ✔ | 93%

Table 10 Ablation test result III
Model | Epochs | Batch Size | Accuracy
CNN-Attention-BiLSTM | 30 | 25 | 91.8222%
CNN-Attention-BiLSTM | 30 | 50 | 92.1615%
CNN-Attention-BiLSTM | 30 | 75 | 90.6006%
CNN-Attention-BiLSTM | 30 | 100 | 92.0937%
CNN-Attention-BiLSTM | 30 | 150 | 93.00%

Table 11 Ablation test result IV
Model | Epochs | Batch Size | Accuracy
CNN-Attention-BiLSTM | 10 | 150 | 89.3790%
CNN-Attention-BiLSTM | 20 | 150 | 91.5847%
CNN-Attention-BiLSTM | 30 | 150 | 93.00%
CNN-Attention-BiLSTM | 40 | 150 | 91.9240%
CNN-Attention-BiLSTM | 50 | 150 | 91.6525%

6 Discussion

In this work, we have proposed a hybrid deep learning model for classifying human activities into different classes with the desired classification accuracy. Understanding and classifying human activities from different kinds of data has become one of the most popular research areas, and recognizing crucial activities remains a challenging task. It is therefore necessary to capture every meaningful pattern in the data so that the classification model can produce better outcomes. To this end, a hybrid DL model combining three DL mechanisms, namely CNN, attention, and bidirectional LSTM, is proposed here to predict and classify human activities into suitable classes. For the experiment, the popular, publicly available “UCI-HAR” dataset is used.
CNN models are well known for capturing spatial and local patterns, LSTMs are suitable for capturing temporal dependencies in data, and the attention mechanism is useful for selecting the most meaningful features while suppressing unwanted or less relevant ones. Moreover, unlike unidirectional LSTMs, Bi-LSTMs process the data in both the forward and backward directions, so that crucial dynamic activities can be handled better. Considering all these factors, the hybrid “CNN-Attention-Bi-LSTM” model is proposed with the aim of obtaining better classification results. The proposed model performs well on the above-mentioned dataset on both the training set and the test set, reaching 93% accuracy on the test set. To validate the model performance, we also use measures such as cross-validation, an ablation study, the confusion matrix, accuracy, recall, and precision. These analyses show that the proposed mechanism produces the desired output with high classification accuracy, which supports its efficiency in classifying human activities. The proposed architecture can be useful in several domains that involve classifying complex human actions. In future, the proposed scheme can be modified in several ways, such as adding or removing layers, modifying the hyper-parameters, or adding other components. In addition, other statistical methods can be used to assess the model more rigorously. Further work on the proposed architecture is therefore desirable so that better classification outputs can be obtained.

7 Conclusion

An attention-based hybrid deep learning architecture is proposed here with the aim of classifying different human activities into their respective categories.
To leverage the characteristics of three DL models, a hybrid model is proposed in this paper. UCI-HAR, a well-known openly available dataset, is used for the experiments. The hybrid model integrates Bi-LSTM, attention, and CNN components into a single structure that can recognize complex human activities by learning meaningful patterns from the data. Together, CNN and Bi-LSTM capture spatio-temporal patterns in the data: CNNs are well suited to capturing local trends, whereas LSTMs are effective at capturing temporal features. BiLSTM is used in this experiment rather than a plain LSTM because Bi-LSTMs can process the data in both the forward and backward directions. In this way, BiLSTM may produce better results for dynamic human activities such as walking, running, and others, which generally depend on both earlier and later time steps. Moreover, besides extracting relevant features, it is also necessary to select only the most meaningful features and to eliminate unwanted or less relevant ones. For this purpose, the attention mechanism is applied to select informative features and to reduce dimensionality, which also reduces the computation time of the model. Considering all these factors, and with the aim of obtaining better accuracy, the “CNN-Attention-BiLSTM” model is proposed in this work. In the experiment, the proposed model achieves a classification accuracy of 93%, which shows that the architecture performs well on the dataset and learns important features efficiently. To justify and validate the proposed mechanism, we also use methods such as the classification report, confusion matrix, ablation experiment, and cross-validation.
After applying these techniques, it is observed that the model performs well, which supports its effectiveness in classifying the human activities in the UCI-HAR dataset. We also compare the performance of our model with other existing algorithms with respect to several parameters to demonstrate its advantage. In future, we expect to enhance the proposed model by modifying its components and parameters in order to recognize more complex human activities more accurately.

Declarations

Funding Declaration

There is no funding for this manuscript.

References

Abdel-Basset, M., Hawash, H., Chakrabortty, R.K., Ryan, M., Elhoseny, M. and Song, H., 2020. ST-DeepHAR: Deep learning model for human activity recognition in IoHT applications. IEEE Internet of Things Journal, 8(6), pp.4969-4979.
Alo, U.R., Nweke, H.F., Teh, Y.W. and Murtaza, G., 2020. Smartphone motion sensor-based complex human activity identification using deep stacked autoencoder algorithm for enhanced smart healthcare system. Sensors, 20(21), p.6300.
Alzubaidi, L., Zhang, J., Humaidi, A.J., Al-Dujaili, A., Duan, Y., Al-Shamma, O., Santamaría, J., Fadhel, M.A., Al-Amidie, M. and Farhan, L., 2021. Review of deep learning: concepts, CNN architectures, challenges, applications, future directions. Journal of Big Data, 8, pp.1-74.
Anwar, S., Milanova, M., Adbulla, S. and Muhammed, S.A., 2023. Graph convolutional networks for pain detection via telehealth. In Artificial Intelligence in Healthcare and COVID-19 (pp. 93-104). Academic Press.
Ariza-Colpas, P.P., Vicario, E., Oviedo-Carrascal, A.I., Butt Aziz, S., Piñeres-Melo, M.A., Quintero-Linero, A. and Patara, F., 2022. Human activity recognition data analysis: History, evolutions, and new trends. Sensors, 22(9), p.3401.
Atalaa, B.A., Ziedan, I., Alenany, A. and Helmi, A., 2021. Feature Engineering for Human Activity Recognition. Int. J. Adv. Comput. Sci. Appl, 12, pp.160-167.
Barna, A., Masum, A.K.M., Hossain, M.E., Bahadur, E.H. and Alam, M.S., 2019, February. A study on human activity recognition using gyroscope, accelerometer, temperature and humidity data. In 2019 international conference on electrical, computer and communication engineering (ECCE) (pp. 1-6). IEEE.
Bhatt, D., Patel, C., Talsania, H., Patel, J., Vaghela, R., Pandya, S., Modi, K. and Ghayvat, H., 2021. CNN variants for computer vision: History, architecture, application, challenges and future scope. Electronics, 10(20), p.2470.
Bulbul, E., Cetin, A. and Dogru, I.A., 2018, October. Human activity recognition using smartphones. In 2018 2nd international symposium on multidisciplinary studies and innovative technologies (ISMSIT) (pp. 1-6). IEEE.
Cao, J., Pang, Y., Xie, J., Khan, F.S. and Shao, L., 2021. From handcrafted to deep features for pedestrian detection: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(9), pp.4913-4934.
Challa, S.K., Kumar, A., Semwal, V.B. and Dua, N., 2023. An optimized deep learning model for human activity recognition using inertial measurement units. Expert Systems, 40(10), p.e13457.
Chen, K., Zhang, D., Yao, L., Guo, B., Yu, Z. and Liu, Y., 2021. Deep learning for sensor-based human activity recognition: Overview, challenges, and opportunities. ACM Computing Surveys (CSUR), 54(4), pp.1-40.
Chorowski, J.K., Bahdanau, D., Serdyuk, D., Cho, K. and Bengio, Y., 2015. Attention-based models for speech recognition. Advances in Neural Information Processing Systems, 28.
Côté-Allard, U., Campbell, E., Phinyomark, A., Laviolette, F., Gosselin, B. and Scheme, E., 2020. Interpreting deep learning features for myoelectric control: A comparison with handcrafted features. Frontiers in Bioengineering and Biotechnology, 8, p.158.
Cvetković, B., Szeklicki, R., Janko, V., Lutomski, P. and Luštrek, M., 2018. Real-time activity monitoring with a wristband and a smartphone. Information Fusion, 43, pp.77-93.
Dang, L.M., Min, K., Wang, H., Piran, M.J., Lee, C.H. and Moon, H., 2020. Sensor-based and vision-based human activity recognition: A comprehensive survey. Pattern Recognition, 108, p.107561.
Fu, B., Damer, N., Kirchbuchner, F. and Kuijper, A., 2020. Sensing technology for human activity recognition: A comprehensive survey. IEEE Access, 8, pp.83791-83820.
Gu, F., Chung, M.H., Chignell, M., Valaee, S., Zhou, B. and Liu, X., 2021. A survey on deep learning for human activity recognition. ACM Computing Surveys (CSUR), 54(8), pp.1-34.
Ige, A.O. and Noor, M.H.M., 2023. A deep local-temporal architecture with attention for lightweight human activity recognition. Applied Soft Computing, 149, p.110954.
Islam, M.M., Nooruddin, S., Karray, F. and Muhammad, G., 2022. Human activity recognition using tools of convolutional neural networks: A state of the art review, data sets, challenges, and future prospects. Computers in Biology and Medicine, 149, p.106060.
Jang, B., Kim, M., Harerimana, G., Kang, S.U. and Kim, J.W., 2020. Bi-LSTM model to increase accuracy in text classification: Combining Word2vec CNN and attention mechanism. Applied Sciences, 10(17), p.5841.
Khan, I.U., Afzal, S. and Lee, J.W., 2022. Human activity recognition via hybrid deep learning-based model. Sensors, 22(1), p.323.
Kumar, M., Patel, A.K., Biswas, M. and Shitharth, S., 2023. Attention-based bidirectional-long short-term memory for abnormal human activity detection. Scientific Reports, 13(1), p.14442.
Lee, P., Uh, Y. and Byun, H., 2020, April. Background suppression network for weakly-supervised temporal action localization. In Proceedings of the AAAI conference on artificial intelligence (Vol. 34, No. 07, pp. 11320-11327).
Li, J., Wang, Y. and McAuley, J., 2020, January. Time interval aware self-attention for sequential recommendation. In Proceedings of the 13th international conference on web search and data mining (pp. 322-330).
Li, L., Yang, Y., Yuan, Z. and Chen, Z., 2021. A spatial-temporal approach for traffic status analysis and prediction based on Bi-LSTM structure. Modern Physics Letters B, 35(31), p.2150481.
Li, Y., Yang, G., Su, Z., Li, S. and Wang, Y., 2023. Human activity recognition based on multienvironment sensor data. Information Fusion, 91, pp.47-63.
Mim, T.R., Amatullah, M., Afreen, S., Yousuf, M.A., Uddin, S., Alyami, S.A., Hasan, K.F. and Moni, M.A., 2023. GRU-INC: An inception-attention based approach using GRU for human activity recognition. Expert Systems with Applications, 216, p.119419.
Mutegeki, R. and Han, D.S., 2020, February. A CNN-LSTM approach to human activity recognition. In 2020 international conference on artificial intelligence in information and communication (ICAIIC) (pp. 362-366). IEEE.
Nafea, O., Abdul, W., Muhammad, G. and Alsulaiman, M., 2021. Sensor-based human activity recognition with spatio-temporal deep learning. Sensors, 21(6), p.2141.
Naheliya, B., Redhu, P. and Kumar, K., 2023. MFOA-Bi-LSTM: An optimized bidirectional long short-term memory model for short-term traffic flow prediction. Physica A: Statistical Mechanics and its Applications, p.129448.
Nguyen, V.S., Kim, H. and Suh, D., 2023. Attention Mechanism-Based Bidirectional Long Short-Term Memory for Cycling Activity Recognition Using Smartphones. IEEE Access, 11, pp.136206-136218.
Ni, Q., Fan, Z., Zhang, L., Nugent, C.D., Cleland, I., Zhang, Y. and Zhou, N., 2020. Leveraging wearable sensors for human daily activity recognition with stacked denoising autoencoders. Sensors, 20(18), p.5114.
Niu, Z., Zhong, G. and Yu, H., 2021. A review on the attention mechanism of deep learning. Neurocomputing, 452, pp.48-62.
Pareek, P. and Thakkar, A., 2021. A survey on video-based human action recognition: recent updates, datasets, challenges, and applications. Artificial Intelligence Review, 54, pp.2259-2322.
Ramanujam, E., Perumal, T. and Padmavathi, S., 2021. Human activity recognition with smartphone and wearable sensors using deep learning techniques: A review. IEEE Sensors Journal, 21(12), pp.13029-13040.
Reyes-Ortiz, J.L., Oneto, L., Samà, A., Parra, X. and Anguita, D., 2016. Transition-aware human activity recognition using smartphones. Neurocomputing, 171, pp.754-767.
Ronao, C.A. and Cho, S.B., 2016. Human activity recognition with smartphone sensors using deep learning neural networks. Expert Systems with Applications, 59, pp.235-244.
Singh, T., Rustagi, S., Garg, A. and Vishwakarma, D.K., 2019, September. Deep Learning Framework for Single and Dyadic Human Activity Recognition. In 2019 IEEE Fifth International Conference on Multimedia Big Data (BigMM) (pp. 237-241). IEEE.
Sinha, H., Awasthi, V. and Ajmera, P.K., 2020. Audio classification using braided convolutional neural networks. IET Signal Processing, 14(7), pp.448-454.
Straczkiewicz, M., James, P. and Onnela, J.P., 2021. A systematic review of smartphone-based human activity recognition methods for health research. NPJ Digital Medicine, 4(1), p.148.
Sultana, J., Usha Rani, M. and Farquad, M.A.H., 2020. An extensive survey on some deep-learning applications. In Emerging Research in Data Engineering Systems and Computer Communications: Proceedings of CCODE 2019 (pp. 511-519). Singapore: Springer Singapore.
Tasmin, M., Ishtiak, T., Ruman, S.U., Suhan, A.U.R.C., Islam, N.S., Jahan, S., Ahmed, S., Zulminan, M.S., Saleheen, A.R. and Rahman, R.M., 2020, August. Comparative study of classifiers on human activity recognition by different feature engineering techniques. In 2020 IEEE 10th International Conference on Intelligent Systems (IS) (pp. 93-101). IEEE.
Thakur, D., Biswas, S., Ho, E.S. and Chattopadhyay, S., 2022. Convae-lstm: Convolutional autoencoder long short-term memory network for smartphone-based human activity recognition. IEEE Access, 10, pp.4137-4156.
Voicu, R.A., Dobre, C., Bajenaru, L. and Ciobanu, R.I., 2019. Human physical activity recognition using smartphone sensors. Sensors, 19(3), p.458.
Wang, H., Zhao, J., Li, J., Tian, L., Tu, P., Cao, T., An, Y., Wang, K. and Li, S., 2020. Wearable sensor-based human activity recognition using hybrid deep learning techniques. Security and Communication Networks, 2020, pp.1-12.
Wang, P., Fan, E. and Wang, P., 2021. Comparative analysis of image classification algorithms based on traditional machine learning and deep learning. Pattern Recognition Letters, 141, pp.61-67.
Xia, K., Huang, J. and Wang, H., 2020. LSTM-CNN architecture for human activity recognition. IEEE Access, 8, pp.56855-56866.
Xu, C., Chai, D., He, J., Zhang, X. and Duan, S., 2019. InnoHAR: A deep neural network for complex human activity recognition. IEEE Access, 7, pp.9893-9902.
Yadav, S.K., Tiwari, K., Pandey, H.M. and Akbar, S.A., 2021. A review of multimodal human activity recognition with special emphasis on classification, applications, challenges and future directions. Knowledge-Based Systems, 223, p.106970.
Yao, L., Sheng, Q.Z., Benatallah, B., Dustdar, S., Wang, X., Shemshadi, A. and Kanhere, S.S., 2018. WITS: an IoT-endowed computational framework for activity recognition in personalized smart homes. Computing, 100, pp.369-385.
Zhang, X., Wang, L. and Su, Y., 2021. Visual place recognition: A survey from deep learning perspective. Pattern Recognition, 113, p.107760.

Additional Declarations

No competing interests reported.
In general, the physical systems on HAR based on smartphone sensors are prompted by their discretion, ubiquity, and inexpensive appliance procedures, usefulness, and noninvasive properties as well (Straczkiewicz et al. \u003cspan citationid=\"CR45\" class=\"CitationRef\"\u003e2021\u003c/span\u003e; Ronao and Cho, \u003cspan citationid=\"CR42\" class=\"CitationRef\"\u003e2016\u003c/span\u003e). Utilizing smartphones, continuous input can be gathered at the time of performing any kind of physical actions. Apart from that, due to various built-in mobile sensors, keeping track of health-related data has become more accurate and elegant nowadays. These different built-in sensors of smartphones are used for collection important insights of HAR models. It is noticed that, among mobile sensors, gyroscopes and accelerometers are the most widely used sensors (Bulbul et al. \u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e2018\u003c/span\u003e; Voicu et al. \u003cspan citationid=\"CR49\" class=\"CitationRef\"\u003e2019\u003c/span\u003e). Multivariate time-series characteristics can be found in datasets derived from smartphone sensors. The basic feature of time-series data is local dependency. Furthermore, human activity signals are translation-invariant and hierarchical, and also possess dynamic information with regard to underlying systems as well (Li et al. \u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e2023\u003c/span\u003e). Therefore, the requirement of modeling these high-dimensional datasets accurately is increasing. Physical activities include several unique characteristics. Thus, HAR involves various methodological issues, including imbalanced datasets, interclass similarity, intra-class variability, empty class problem, and others as well (Yao et al. 
\u003cspan citationid=\"CR55\" class=\"CitationRef\"\u003e2018\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eIn the current days, HAR systems based on smartphone sensor data most preferably utilize the traditional machine learning (ML) algorithms as well as deep learning (DL) methods for recognizing human activities in efficient manner (Cvetković et al. \u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e2018\u003c/span\u003e). For extracting relevant underlying features responsible for differentiating distinct activity patterns, in conventional ML methods, feature engineering has become one of the most ruling phases. The desired and enhanced model performance of HAR systems hugely relies upon the efficient feature engineering of raw input signals. After extracting useful features, those are then fed to the classifiers to identify human activities. However, for obtaining enhanced classification result, extraction of relevant features accurately is highly required. Without a relevant feature engineering model, traditional classifiers cannot perform well and thus fails to identify human activities accurately and competently (Atalaa et al. \u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e2021\u003c/span\u003e). In this manner, complex techniques for data pre-processing are needed for getting sensory data in a proper form, and this extraction procedure of hand-crafted features from sensory data require high expertise domain knowledge in this sense. Finally, the extracted handcrafted features are sent to conventional classification system for identifying human activities. However, it is noticed from several research that these handcrafted features not always work for all the models and perform poorly in recognition models as well (Tasmin et al. \u003cspan citationid=\"CR47\" class=\"CitationRef\"\u003e2020\u003c/span\u003e). 
Apart from that, different research domains require distinct handcrafted feature vectors to handle classification problems properly. For this reason, most researchers currently prefer DL algorithms to overcome such problems. HAR has a wide range of applications. It is most extensively applied in the medical domain, and for caring for elderly people and tracking their records to help them lead a better and more secure lifestyle (Yadav et al. \u003cspan citationid=\"CR54\" class=\"CitationRef\"\u003e2021\u003c/span\u003e). Moreover, HAR can also be applied for controlling and monitoring crime records and rates. Apart from that, recognition of everyday activities can build an environment for smart home technology. Driving behaviors can be detected, which helps in promoting safe transportation. Military operations can also be identified using HAR. Other domains where HAR is applied include entertainment, autonomous driving, surveillance and security, and human-robot interaction (Dang et al. \u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e2020\u003c/span\u003e). The basic objective of HAR is recognizing different kinds of human actions and activities in controlled and uncontrolled settings.\u003c/p\u003e \u003cdiv id=\"Sec2\" class=\"Section2\"\u003e \u003ch2\u003e1.1 Motivation\u003c/h2\u003e \u003cp\u003eSeveral research works note that most Human Activity Recognition models now utilize Deep Learning algorithms rather than traditional ML algorithms, as ML algorithms require handcrafted features (Cao et al. \u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e2021\u003c/span\u003e). Apart from that, DL models are capable of automatic feature extraction and learning. This gives DL models an extra advantage in HAR systems. 
DL algorithms can extract important features efficiently without manual intervention while simultaneously recognizing human actions (C\u0026ocirc;t\u0026eacute;-Allard et al. \u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e2020\u003c/span\u003e). These DL methodologies have performed strongly in various domains such as intelligent gaming systems and image and speech recognition (Sultana et al. \u003cspan citationid=\"CR46\" class=\"CitationRef\"\u003e2020\u003c/span\u003e). DL methods have also made a substantial contribution to the HAR literature. Nowadays, most research works in the HAR domain apply and investigate different types of DL algorithms. Recognizing distinct complex human activities requires several steps to identify all the relevant features accurately so that the HAR model can achieve its desired outcome. Among these steps, one of the most important is the extraction of relevant features. Human activities basically consist of two types of features: spatial and temporal. Identifying both is equally important for recognizing a specific activity (Lee et al. \u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e2020\u003c/span\u003e). Extracting these spatio-temporal characteristics from smartphone-based sensory input is therefore essential. For this purpose, proper feature extraction procedures are required, along with data pre-processing techniques that convert raw and noisy input signals into clean, usable data. Numerous DL methodologies have been applied in HAR architectures to capture the spatial and temporal characteristics of data needed to identify human actions. 
Among these DL mechanisms, the CNN is one of the most widely used models and generally helps in extracting spatial as well as local trends of data. In the HAR domain, the CNN architecture offers several advantageous features that improve the cost and accuracy of activity classifiers. CNNs can learn spatial characteristics of data points very efficiently and can interpret spatial hierarchies from input data such as images or sensor signals (accelerometers or gyroscopes) automatically, without manual pre-processing. In HAR models, the input data fed to the model is generally image data or sensory data. Hence, CNNs are a good fit for defining the model. Apart from that, CNNs can learn relevant features that are generally invariant to minor translations in the input data. For HAR models, this property plays an important role in capturing slight body movements, so that every small movement is captured and understood by the model (Islam et al. \u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e2022\u003c/span\u003e). For extracting higher-level features, CNNs use various layers that perform operations like pooling and convolution. This characteristic helps in capturing both low-level spatial patterns (corners) and high-level abstract patterns (movement sequences, postures), which are equally important to model learning. Moreover, CNNs are also capable of handling multimodal data. By using several data streams (gyroscope, accelerometer) or by integrating sensory data with image-based data, CNNs can process several input data types, which makes them versatile (Sinha et al. \u003cspan citationid=\"CR44\" class=\"CitationRef\"\u003e2020\u003c/span\u003e). 
Building HAR models that incorporate CNNs is therefore advantageous in several respects, especially because CNNs learn discriminative spatial patterns automatically, handle translations, and process multimodal inputs without requiring manual feature extraction. However, besides extracting local characteristics of sensory data, it is equally necessary to accurately retrieve the temporal dependencies of the data. Generally, RNNs are widely used deep networks for this purpose in the HAR domain. Among RNNs, LSTMs play a major role in capturing temporal relationships in data. However, the temporal dependencies of human body movements also exhibit bi-directional patterns, and LSTMs may fall short in addressing complex activities that involve such patterns. To mitigate this problem, Bi-LSTMs can be utilized, as they capture temporal trends of data from both directions, past and future (Jang et al. \u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e2020\u003c/span\u003e). Capturing bi-directional trends of data is necessary for identifying activities involving forward and backward movements. Hence, Bi-LSTM models are also applied alongside LSTM models. Apart from this, features must be extracted selectively so that unwanted features do not enter model training. To pay extra \u0026ldquo;attention\u0026rdquo; in feature selection, attention mechanisms have recently been used in most cases (Li et al. \u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e2020\u003c/span\u003e). 
Generally, attention mechanisms are DL methods that weight features selectively, assigning more weight to relevant features and less to irrelevant ones.\u003c/p\u003e \u003cp\u003eHence, motivated by these components (CNN, Bi-LSTM, and Attention), this article proposes an integrated model for identifying human activities. The literature reviewed in this study also indicates that the proposed combination is novel. The step-by-step architecture of the proposed model involves the following three phases:\u003c/p\u003e \u003cp\u003e \u003cul\u003e \u003cli\u003e \u003cp\u003eCNN-Attention-Bi-LSTM module\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003eUnderstanding CBAM operation\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003eA GAP layer followed by dropout and Softmax layers\u003c/p\u003e \u003c/li\u003e \u003c/ul\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003e1.2 Contribution\u003c/h2\u003e \u003cp\u003eThe main contributions of this article are as follows:\u003c/p\u003e \u003cp\u003e \u003cul\u003e \u003cli\u003e \u003cp\u003eThe paper presents and discusses several aspects of the domain of human activity recognition.\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003eAn extensive literature review of DL-based HAR frameworks is performed to help readers understand the topic and to identify potential gaps in the literature.\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003eAn efficient hybrid DL-based model is proposed that combines three DL components, namely Attention, CNN, and Bi-LSTM, to extract essential spatio-temporal characteristics of smartphone-based sensor data.\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003eThe effectiveness and proper justification of the proposed system is presented through required experiments, 
validation techniques, and performance metric as well. Finally, we compare the obtained result with other existing literatures.\u003c/p\u003e \u003c/li\u003e \u003c/ul\u003e \u003c/p\u003e \u003c/div\u003e"},{"header":"2 Related Work","content":"\u003cp\u003eIn this section, a comprehensive literature review of related articles has been performed. Various works and research done in this particular field of HAR have been evaluated. Generally, to find out the effective research gap and the possible future directions that can make this domain of HAR more productive and more fruitful, this review work is presented. The research field of HAR domain generally consist of both ML and DL approaches. Previously, in several research activities, researchers utilized the application of classic ML algorithms in HAR domain to find out recognition accuracy of models (Barna et al. \u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e2019\u003c/span\u003e). In these ML-based systems, researchers used numerous feature selection or/and extraction procedures prior to feeding the collected data to classifiers for identifying several human behaviors. However, it is noticed that, ML models depend upon handcrafted extraction of features, and this procedure of feature retrieval requires expertise in domain knowledge and manual intervention as well, which results in increased time complexity (Wang et al. \u003cspan citationid=\"CR51\" class=\"CitationRef\"\u003e2021\u003c/span\u003e). In this sense, to overcome such disadvantages of ML-based HAR systems, researchers have focused in exploring and applying DL-based mechanisms, as DL-based architectures possess the benefits of automated extraction of features, without human interference as well. Hence, in this part, a survey on DL-based human action identification systems is presented.\u003c/p\u003e \u003cp\u003e \u003cb\u003eDL for HAR\u003c/b\u003e \u003c/p\u003e \u003cp\u003eThakur et al. 
(\u003cspan citationid=\"CR48\" class=\"CitationRef\"\u003e2022\u003c/span\u003e) proposed smartphone-based hybrid DL architecture for recognizing human activities. The hybrid architecture \u0026ldquo;ConvAE-LSTM\u0026rdquo; consists of deep learning models: CNN, auto-encoders (AE), and LSTM. CNN models perform well by extracting useful features automatically and capturing spatial features, for reducing dimensionality AEs are used, and LSTMs are popular for capturing temporal sequences as well. Thus, this hybrid unified model forms a complimentary architecture by covering all the advantageous aspects like spatiotemporal characteristics and dimensionality reduction. Four distinct standard public datasets are used for the proposed experimental purpose. Two of them are smartphone-based (WISDM, UCI), and the rest of the two are based on body-worn sensors (OPPORTUNITY, PAMAP2) as well. Using the metrics such as recall, F1 score, precision, and accuracy along with a cross-validation technique named LOSO; the acquired outcomes are cross-checked and validated. However, the model can be enhanced more by addressing bi-directional dynamic activities more precisely.\u003c/p\u003e \u003cp\u003eWang et al. (\u003cspan citationid=\"CR50\" class=\"CitationRef\"\u003e2020\u003c/span\u003e) presented a deep architecture, capable of learning local features and modeling time dependencies between features automatically, without manual intervention. In this regard, the author built a hybrid model, combination of convolutional neural network (CNN) model and long short-term memory (LSTM) recurrent deep network. CNN model is utilized here to extract relevant features from collected sensor-based experimental data. LSTM architecture is applied for capturing long-term reciprocity among two activities for further improvement purpose of the identification rate of HAR. 
Hence, by combining CNN and LSTM, a model based on wearable sensors is proposed to detect several human activities and associated transitions accurately. Accelerometer and gyroscope data from smartphones are collected for experimental purposes. The experiment was performed utilizing the \u0026ldquo;HAPT\u0026rdquo; dataset. However, the model could be developed further with respect to transitions by tuning its parameters and components to achieve more satisfactory results. Xia et al. (\u003cspan citationid=\"CR52\" class=\"CitationRef\"\u003e2020\u003c/span\u003e) proposed a deep network combining convolutional layers with an LSTM model. LSTM is a variant of the recurrent neural network (RNN) that is more capable of processing temporal features or sequences. The CNN is utilized for capturing local spatial dependencies of data. This hybrid model, \u0026ldquo;LSTM-CNN\u0026rdquo;, automatically extracts activity features and classifies them with fewer model parameters. Here, three widely used mobile sensor-based public datasets, WISDM, UCI, and OPPORTUNITY, are used for experimental and analytical purposes. However, the model falls short in addressing bidirectional activities and in selecting proper features. Hence, the model can be developed further and the results improved accordingly.\u003c/p\u003e \u003cp\u003eMim et al. (\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e2023\u003c/span\u003e) proposed a DL model, \u0026ldquo;GRU-INC\u0026rdquo;, for recognizing human activities. The model is an \u0026ldquo;Inception-Attention\u0026rdquo; based method combined with the Gated Recurrent Unit (GRU) model. The combination is effective for capturing spatial and temporal information of time-series data. Here, the combination of GRU and Attention is utilized for extracting temporal features. 
On the other hand, Inception along with the Convolutional Block Attention Module (CBAM) is exploited for extracting spatial representations. Several human activities have been examined using available public datasets such as OPPORTUNITY, WISDM, PAMAP2, UCI-HAR, and Daphnet. However, the model does not adequately address long-range dependencies and bi-directional movements. Apart from that, only one metric is used here for performance analysis. The model could therefore be improved further by tuning its hyperparameters.\u003c/p\u003e \u003cp\u003eMutegeki and Han (\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e2020\u003c/span\u003e) presented a novel deep architecture-based activity recognition model, the \u0026ldquo;Convolutional neural network-long short-term memory network\u0026rdquo; (CNN-LSTM) architecture. This is a hybrid model that combines two different deep architectures, CNN and LSTM. The proposed method is applied to two datasets, iSPL (3-activity) and UCI HAR (6-activity), to evaluate its applicability and performance. The performance of the model is evaluated using metrics such as accuracy and cross-entropy. However, the model consumes more time and is less effective in selecting essential features. Hence, the model can be improved further.\u003c/p\u003e \u003cp\u003eXu et al. (\u003cspan citationid=\"CR53\" class=\"CitationRef\"\u003e2019\u003c/span\u003e) proposed a deep neural architecture, the InnoHAR model, combining two deep architectures: the Gated Recurrent Unit (GRU) and the inception network. The model accepts waveform data from multi-channel sensing devices as input, end-to-end. The GRU is employed for effective modeling of time-series data and features. 
Among RNNs, the GRU is quite popular for its simple architecture and its ability to model temporal data. The GRU model can sense temporal relationships between data points. Apart from that, in this experiment, the Inception part of GoogLeNet is used for retrieving spatial features from sensor-based waveform data. The experiment was performed over three datasets, OPPORTUNITY, PAMAP2, and SMARTPHONE, and performance was evaluated using the F-measure, which covers both recall and precision. Considering the overall performance of the proposed structure, the experimental outcomes show that the suggested Inception-like InnoHAR model produces better output than both CNN (9% improvement for the OPPORTUNITY dataset and 3% for the PAMAP2 dataset) and DeepConvLSTM (5% improvement for the OPPORTUNITY dataset and 3% for the PAMAP2 dataset). Khan et al. (\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e2022\u003c/span\u003e) proposed a hybrid DL model combining two commonly used DL networks, CNN and LSTM, to achieve better recognition performance in indoor environments. CNNs are basically used for extracting spatial features, whereas LSTMs mainly focus on learning temporal dependencies. Keeping this in consideration, the authors presented a hybrid model combining the two with the aim of obtaining improved performance. For analytical evaluation, a self-made dataset is used, with instances collected from 20 members via a Kinect V2 sensor capable of extracting 25 distinct joints of the human body (involving 12 distinct human activity classes). The proposed model obtained an accuracy of 90.89% in comparison with other existing deep models.\u003c/p\u003e \u003cp\u003eNafea et al. 
(\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e2021\u003c/span\u003e) introduced a new method that involves convolutional deep model (CNN) with differing kernel dimensions and bi-directional long-short-term memory for capturing features at several resolutions. The main motive of this research work lies effectively in the appropriate selection in effective extraction of temporal as well as spatial patterns from sensory data and also optimal representation of video using classic CNN algorithm and BiLSTM as well. Two datasets (WISDM and UCI) are utilized in this analytical study where data collection procedures involve sensors, accelerometers, and gyroscopes. However, time consumption of this model can be reduced in future and also features can be selected more effectively by applying suitable mechanisms as well.\u003c/p\u003e \u003cp\u003eAbdel-Basset et al. (\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e2020\u003c/span\u003e) proposed a dual-channel supervised model \u0026ldquo;ST-deepHAR\u0026rdquo; consisting of LSTM network, followed by attention mechanism for fusing temporal nature of inertial sensory data along with a convolutional ResNet for extracting the spatial dependencies of sensory data as well. Apart from that, in the proposed model, an adaptive operation for channel-squeezing is introduced in order to fine-tune the convolutional feature extraction ability of the neural network exploiting the multi-channel dependency. After the retrieval of spatio-temporal data, those data were concatenated for making final classification decision by feeding through multilayer perceptron and a softmax layer as well. For experimental purpose, two publicly available HAR datasets (WISDM, UCI HAR) are utilized, and performance of the proposed architecture is evaluated. However, the model lacks in addressing the data imbalance problem and time consumption is also a concerning factor. 
In future, more works can be performed considering these factors.\u003c/p\u003e \u003cp\u003eChalla et al. (\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e2023\u003c/span\u003e) developed a hybrid DL model that can effectively recognize several human movements captured utilizing IMU sensors. The hybrid model basically consists of CNN model and Bi-LSTM units for extracting temporal sequences along with spatial characteristics simultaneously from the raw sensory data. Apart from that, a meta-heuristic optimization method, \u0026ldquo;Rao-3\u0026rdquo; is adopted for identifying ideal values of hyper-parameters for the suggested hybrid architecture for the purpose of enhancing model performance. Three widely used HAR datasets are used in this article for evaluation of classification performance. The used datasets are UCI HAR, MHEALTH, and PAMAP2 as well. However, the implemented framework is a complex architecture and use of further hyper parameter optimization increases computational time. Moreover, the optimization technique may not be able to produce satisfactory outcome for all the cases in HAR domain. Hence, the model lacks in robustness and interpretability as well.\u003c/p\u003e \u003cp\u003eKumar et al. (\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e2023\u003c/span\u003e) proposed a deep learning-based framework for the efficient detection of anomalous activities of human. The suggested framework is implemented combining three components of DL, CNN, Bi-LSTM, and Attention for identifying unique spatio-temporal trends of data. However, the analytical task has been performed in this article using three distinct datasets, UCF50, UCF11, and subUCF crime as well. However, the flexibility and robustness can be improved further and more works on the datasets may increase the model accuracy in future.\u003c/p\u003e \u003cp\u003eSingh et al. 
(\u003cspan citationid=\"CR43\" class=\"CitationRef\"\u003e2019\u003c/span\u003e) introduced a two-stream DL model having less complexity utilizing raw RGB sequences along with their \u0026ldquo;dynamic motion images (DMIs)\u0026rdquo; for recognizing complex human behaviors. The frames of RGB have been trained incorporating a pre-trained network of Inception-v3 module and having CNN-LSTM attached with end-to-end training. Moreover, for dynamic image streaming, some last layers of utilized pre-trained network are fine-tuned. Utilizing the proposed two-stream model, the features are extracted and then are max fused to get increased classification accuracy as well. For the evaluation purpose, authors used dyadic SBU Interaction as well as MIVIA Action dataset, single-person activity dataset. However, the model is a complex architecture with higher computing time. By modifying its components and hyper parameters, the dimensionality reduction and hence computational time may be reduced further.\u003c/p\u003e \u003cp\u003eNguyen et al. (\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e2023\u003c/span\u003e) proposed a 1D-CNN \u0026ndash; Bi-LSTM model followed by attention mechanism, CBiAM for specifically recognizing states of cyclists utilizing smartphones. The motto is enhancing the safety measures along with promoting secured cycling experience to avoid accidental or emergency risks as well. A new created dataset \u0026ldquo;cycling safe (CySa)\u0026rdquo; was utilized for the experimental purpose that contains data on various actions of the cyclists during cycling, where smartphones were placed in their pocket position for collecting the data. The suggested CBiAM system was trained using the CySa dataset incorporating varying window sizes, learning rates, and batch sizes as well. However, the mechanism uses fixed sample length that is not suitable for addressing complex activities, hence, information loss may happen. 
Moreover, the model does not adequately address contextual information and transitional states. Hence, more work should be done in this context in future to enhance the model performance by reducing such limitations.\u003c/p\u003e \u003cp\u003eIge and Noor (\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e2023\u003c/span\u003e) presented a new parallel deep architecture, DLT, based on the idea of pipeline concatenation. In the proposed system, each pipeline consists of two sub-pipelines: the first is a 1D-CNN that learns the local features, and the second is a Bi-LSTM that learns the temporal dependencies, with feature maps merged and channel attention integrated. The experiment was conducted on two publicly available HAR datasets, WISDM and PAMAP2. However, the model is a complex architecture comprising several pipelines and sub-pipelines, each with multiple layers, which increases the risk of overfitting and the time consumption. Thus, this model has shortcomings in generalization and computational demands, among others, and can be improved by further modifications.\u003c/p\u003e \u003cp\u003eAlo et al. (\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2020\u003c/span\u003e) proposed a deep stacked model involving an auto-encoder algorithm for recognizing human activities. The aim of this paper is to propose a deep model based on auto-encoders along with orientation-invariant features for identifying complex human activities. The proposed deep stacked architecture uses auto-encoders to extract crucial features of human behavior, improving model accuracy and reducing over-fitting. The data was taken from a smartphone accelerometer. In this model, the advantageous aspects of the auto-encoder, sparse auto-encoder, softmax classifier, and others are utilized for obtaining better model performance. 
To analyze model performance, the authors used several metrics, including recall, accuracy, specificity, and the confusion matrix. The proposed model achieved an accuracy of 97.13% in comparison with traditional ML algorithms and a deep belief network. However, its performance has yet to be compared against other DL algorithms.\u003c/p\u003e \u003cp\u003eNi et al. (2020) presented a novel deep learning based framework that uses stacked denoising auto-encoders (SDAE) to recognize dynamic activities, static behaviors, and transitional activities. The experimental setup was designed to acquire three types of day-to-day activities (twelve daily activities in total) using wearable sensors. The records were collected from 10 adults in the smart lab of Ulster University. The SDAE, a deep model that extracts features automatically, was used in the experiments. The performance of the deployed model was measured using precision, accuracy, recall, and F1 score. However, because the experiment was held in a controlled environment, the model may lack generalization and robustness. Addressing these shortcomings could improve the model in future work.\u003c/p\u003e \u003cp\u003eTable \u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e summarizes the deep learning based works reviewed above. The table shows that the WISDM and UCI datasets are the most widely used standard, publicly available smartphone-based datasets. It can also be observed that several of the reviewed works use an attention mechanism to select useful features. 
Most of the models extract both the spatial and the temporal dependencies of the sensory inputs in order to recognize different human behaviors efficiently. For this purpose, most of the reviewed papers use a CNN to extract spatial features and an LSTM (sometimes a GRU) to extract temporal dependencies. However, better model performance also requires placing emphasis on every aspect of the data points. An efficient HAR model should therefore capture both the spatial and the temporal characteristics of the input with close attention. Hence, a hybrid model is needed that covers all these aspects equally and can comprehend every representation of the data points. Although CNNs are widely used in HAR models, they are generally prone to overfitting. A GAP layer can be used to mitigate this risk and can also help improve model performance. In addition, GAP layers can be applied in place of fully-connected layers, which reduces time consumption and can increase model accuracy. For temporal extraction, LSTMs are the preferred deep learning models because they handle temporal features efficiently. To retrieve temporal features more prominently, additional attention is required to keep the focus on the most relevant features. However, Bi-LSTM and Bi-GRU models can sometimes be more effective than LSTM and GRU models. Bi-LSTMs and Bi-GRUs can remember the data flow in both directions, past and future, whereas LSTMs and GRUs capture the data flow in one direction only. 
In this sense, Bi-LSTM and Bi-GRU may be more beneficial because they can capture and remember relevant patterns from both directions. Taking all these factors into consideration, this article proposes a hybrid model that combines a CNN module, a Bi-LSTM network, and an attention mechanism to address the identified gaps in the literature. Among attention modules, a specific one called \u0026ldquo;CBAM\u0026rdquo; is applied in the model. An LSTM is also used so that its accuracy can be compared with that of the proposed model. The proposed architecture is expected to deliver improved classification accuracy, supported by a proper justification of the model performance.\u003c/p\u003e"},{"header":"3 Preliminaries","content":"\u003cp\u003eThis research work presents a hybrid deep model that combines three distinct deep learning networks. The proposed system brings together the strengths of the attention mechanism, the Bi-LSTM model, and the CNN model. The main idea is to retrieve the local and temporal features of the input data efficiently. To interpret the working mechanism of the model, it is necessary to understand each of its components and the associated parameters separately. The following sections therefore briefly describe the required components. 
In Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e, the mechanism of proposed hybrid DL framework is displayed.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eHAR systems based on Deep Learning\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"4\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eReference\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eDataset\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eSensor\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eClassifier\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eThakur et al. 
(\u003cspan citationid=\"CR48\" class=\"CitationRef\"\u003e2022\u003c/span\u003e)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eWISDM\u003c/p\u003e \u003cp\u003eUCI\u003c/p\u003e \u003cp\u003eOPPORTUNITY\u003c/p\u003e \u003cp\u003ePAMAP2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003esmartphone\u003c/p\u003e \u003cp\u003esmartphone\u003c/p\u003e \u003cp\u003eBody-worn\u003c/p\u003e \u003cp\u003eBody-worn\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eCNN+\u003c/p\u003e \u003cp\u003eAuto-encoder+\u003c/p\u003e \u003cp\u003eLSTM\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eWang et al. (\u003cspan citationid=\"CR50\" class=\"CitationRef\"\u003e2020\u003c/span\u003e)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eHAPT (\u0026ldquo;Human Activities and\u003c/p\u003e \u003cp\u003ePostural Transitions\u0026rdquo;) Dataset\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eBody-worn\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eCNN\u0026thinsp;+\u0026thinsp;LSTM\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eXia et al. 
(\u003cspan citationid=\"CR52\" class=\"CitationRef\"\u003e2020\u003c/span\u003e)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eWISDM\u003c/p\u003e \u003cp\u003eUCI HAR\u003c/p\u003e \u003cp\u003eOPPORTUNITY\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eSmartphone\u003c/p\u003e \u003cp\u003esmartphone\u003c/p\u003e \u003cp\u003eBody-worn\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eLSTM\u0026thinsp;+\u0026thinsp;CNN\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMim et al. (\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e2023\u003c/span\u003e)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eOPPORTUNITY\u003c/p\u003e \u003cp\u003eWISDM\u003c/p\u003e \u003cp\u003ePAMAP2\u003c/p\u003e \u003cp\u003eUCI HAR\u003c/p\u003e \u003cp\u003eDaphnet\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eBody-worn\u003c/p\u003e \u003cp\u003eSmartphone\u003c/p\u003e \u003cp\u003eBody-worn\u003c/p\u003e \u003cp\u003eSmartphone\u003c/p\u003e \u003cp\u003eBody-worn\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eGRU\u0026thinsp;+\u0026thinsp;Attention\u0026thinsp;+\u0026thinsp;Inception\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMutegeki and Han (\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e2020\u003c/span\u003e)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eiSPL\u003c/p\u003e \u003cp\u003eUCI HAR\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eBody-worn\u003c/p\u003e \u003cp\u003esmartphone\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e 
\u003cp\u003eCNN\u0026thinsp;+\u0026thinsp;LSTM\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eXu et al. (\u003cspan citationid=\"CR53\" class=\"CitationRef\"\u003e2019\u003c/span\u003e)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eOPPORTUNITY\u003c/p\u003e \u003cp\u003ePAMAP2\u003c/p\u003e \u003cp\u003eSMARTPHONE\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eBody-worn\u003c/p\u003e \u003cp\u003eBody-worn\u003c/p\u003e \u003cp\u003eBody-worn\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eGRU\u0026thinsp;+\u0026thinsp;Inception\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eKhan et al. (\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e2022\u003c/span\u003e)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eSelf-collected\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eBody-worn\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eCNN\u0026thinsp;+\u0026thinsp;LSTM\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eNafea et al. 
(\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e2021\u003c/span\u003e)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eWISDM\u003c/p\u003e \u003cp\u003eUCI HAR\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eSmartphone\u003c/p\u003e \u003cp\u003eSmartphone\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eCNN\u0026thinsp;+\u0026thinsp;Bi-LSTM\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eAbdel-Basset et al. (\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e2020\u003c/span\u003e)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eWISDM\u003c/p\u003e \u003cp\u003eUCI HAR\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eSmartphone\u003c/p\u003e \u003cp\u003eSmartphone\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eLSTM\u0026thinsp;+\u0026thinsp;Attention\u0026thinsp;+\u0026thinsp;ResNet\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eChalla et al. (\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e2023\u003c/span\u003e)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eUCI HAR\u003c/p\u003e \u003cp\u003eMHEALTH\u003c/p\u003e \u003cp\u003ePAMAP2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eSmartphone\u003c/p\u003e \u003cp\u003eBody-worn\u003c/p\u003e \u003cp\u003eBody-worn\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eCNN\u0026thinsp;+\u0026thinsp;Bi-LSTM\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eKumar et al. 
(\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e2023\u003c/span\u003e)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eUCF50\u003c/p\u003e \u003cp\u003eUCF11\u003c/p\u003e \u003cp\u003esubUCF crime\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eVideo Data\u003c/p\u003e \u003cp\u003eVideo Data\u003c/p\u003e \u003cp\u003eVideo Data\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eCNN\u0026thinsp;+\u0026thinsp;Bi-LSTM\u0026thinsp;+\u0026thinsp;Attention\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eSingh et al. (\u003cspan citationid=\"CR43\" class=\"CitationRef\"\u003e2019\u003c/span\u003e)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eSBU Interaction\u003c/p\u003e \u003cp\u003eMIVIA Action\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eVideo Data\u003c/p\u003e \u003cp\u003eVideo Data\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eInception-V3\u0026thinsp;+\u0026thinsp;CNN\u0026thinsp;+\u0026thinsp;LSTM\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eNguyen et al. 
(\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e2023\u003c/span\u003e)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eCySa (Self-made)\u003c/p\u003e \u003cp\u003eOPPORTUNITY\u003c/p\u003e \u003cp\u003eUCI HAR\u003c/p\u003e \u003cp\u003eWISDM\u003c/p\u003e \u003cp\u003eMOTIONSENSE\u003c/p\u003e \u003cp\u003ePAMAP2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eBody-worn\u003c/p\u003e \u003cp\u003eBody-worn\u003c/p\u003e \u003cp\u003eSmartphone\u003c/p\u003e \u003cp\u003eSmartphone\u003c/p\u003e \u003cp\u003eSmartphone\u003c/p\u003e \u003cp\u003eBody-worn\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eCNN\u0026thinsp;+\u0026thinsp;Bi-LSTM\u0026thinsp;+\u0026thinsp;Attention\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eIge and Noor (\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e2023\u003c/span\u003e)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eWISDM\u003c/p\u003e \u003cp\u003ePAMAP2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eSmartphone\u003c/p\u003e \u003cp\u003eBody-worn\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eCNN\u0026thinsp;+\u0026thinsp;Bi-LSTM\u0026thinsp;+\u0026thinsp;LSTM\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eAlo et al. 
(\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2020\u003c/span\u003e)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eSelf-collected\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eSmartphone\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eAuto-encoder\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eNi et al. (2020)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eSelf-collected\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eBody-worn\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eStacked Auto-encoder\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cdiv id=\"Sec6\" class=\"Section2\"\u003e \u003ch2\u003e3.1 CNN Model\u003c/h2\u003e \u003cp\u003eA Convolutional Neural Network (CNN) is a popular DL model used extensively in domains such as speech recognition and image classification. In human activity recognition, CNNs have become one of the most suitable and efficient deep learning models, and most researchers currently prefer them for building HAR models. Owing to their ability to learn spatial (locally connected) features, CNNs have been highly successful in several domains (Zhang et al. \u003cspan citationid=\"CR57\" class=\"CitationRef\"\u003e2021\u003c/span\u003e). A CNN generally consists of three types of layers: convolutional, pooling, and dense (fully-connected) layers. The core of a CNN is its convolutional layers, which perform feature extraction. 
The convolutional layer processes the input data by applying filters to extract relevant features, the pooling layer reduces computation by downsampling the feature maps, and the fully-connected layer generates the final output. The network uses gradient descent and back-propagation to learn the most effective filters. Average pooling and max pooling layers are commonly used to perform averaging and local maximization on the input features, respectively. A convolutional neural network can contain tens or even hundreds of layers, each trained to recognize a different characteristic of an image. Filters at various resolutions are applied to every training image, and the output of each convolution serves as the input to the subsequent layer. The filters can begin with relatively basic features, such as edges and brightness, and progress to features that specifically identify the object (Alzubaidi et al. \u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e2021\u003c/span\u003e). Figure\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e depicts the basic architecture of a CNN.\u003c/p\u003e \u003ch2\u003e3.2 Bi-LSTM\u003c/h2\u003e \u003cp\u003eRecurrent Neural Networks (RNNs) play an important role in human activity recognition. Among RNNs, LSTM networks are the most widely used. Researchers apply LSTM networks in HAR models to capture the temporal dependencies of the data. For effective feature selection and more accurate results, extracting temporal dependencies is as important as extracting the spatial characteristics of the data. Human actions are generally recorded as time-series sensory data, so temporal trends in the time series play a crucial role in modelling human movements. LSTMs retrieve temporal characteristics from sensory data because they can model long-term dependencies. 
In HAR models, capturing short and long transitions is as important as capturing the actions themselves. Although LSTMs are good at capturing temporal features, they have some major drawbacks. Bi-LSTM models were introduced to overcome these shortcomings (Li et al. \u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e2021\u003c/span\u003e). To recognize complex human movements such as swimming, cycling, and walking, it is crucial to identify actions that depend on both preceding and succeeding movements. LSTMs process the sequence in one direction only, whereas Bi-LSTMs process the input signal in both the forward and backward directions, allowing the model to capture contextual information from both past and future time steps (Naheliya et al. \u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e2023\u003c/span\u003e). This provides a more comprehensive view of the temporal dependencies of the input and helps achieve better classification results.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eGiven these advantages, Bi-LSTMs are now considered more suitable than LSTM networks for most analytical tasks involving complex human activities. Figure\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e depicts the working of the Bi-LSTM.\u003c/p\u003e \u003cp\u003eThe operation of the Bi-LSTM is expressed mathematically by Eq.\u0026nbsp;\u003cspan refid=\"Equ1\" class=\"InternalRef\"\u003e1\u003c/span\u003e (Anwar et al. 
\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e2023\u003c/span\u003e):\u003cdiv id=\"Equ1\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ1\" name=\"EquationSource\"\u003e\n$$\\:{H}_{i}^{l+1}=g\\left({V}_{f}\\,\\sigma\\left({U}_{f}\\left[{S}_{f}^{l},\\:{O}_{i}^{l+1}\\right]\\right)+{V}_{b}\\,\\sigma\\left({U}_{b}\\left[{S}_{b}^{l},\\:{O}_{i}^{l+1}\\right]\\right)+b\\right),\\:i\\in\\:[1,\\:N]$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e1\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eWhere,\u003c/p\u003e \u003cp\u003e \u003cstrong\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{S}_{f}^{l}\\)\u003c/span\u003e\u003c/span\u003e\u003c/strong\u003e \u003cp\u003eInformation from past time steps of hidden states\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{S}_{b}^{l}\\)\u003c/span\u003e\u003c/span\u003e\u003c/strong\u003e \u003cp\u003eInformation from future time steps of hidden states\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cspan class=\"InlineEquation\"\u003e \u003cspan class=\"mathinline\"\u003e\\(\\:{U}_{f}:\\)\u003c/span\u003e \u003c/span\u003eInput states embedded in two directions; \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{U}_{b}:\\)\u003c/span\u003e\u003c/span\u003e Hidden states embedded in two directions\u003c/p\u003e \u003cp\u003e \u003cstrong\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:g\\)\u003c/span\u003e\u003c/span\u003e\u003c/strong\u003e \u003cp\u003eActivation function\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:b\\)\u003c/span\u003e\u003c/span\u003e\u003c/strong\u003e \u003cp\u003eBias\u003c/p\u003e \u003c/p\u003e \u003cp\u003e 
\u003cspan class=\"InlineEquation\"\u003e \u003cspan class=\"mathinline\"\u003e\\(\\:\\sigma\\:\\)\u003c/span\u003e \u003c/span\u003e : Sigmoid function\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec7\" class=\"Section2\"\u003e \u003ch2\u003e3.3 Attention Mechanism\u003c/h2\u003e \u003cp\u003eIn human activity detection, features play the most important role. Recognition accuracy and model efficiency depend on the proper selection of essential features, so feature identification and selection is one of the most crucial steps in recognizing human behaviours efficiently. Detecting human movements requires selecting both temporal and spatial features, and it is important to identify the features that matter most to the model. An attention mechanism serves this purpose by paying extra \u0026ldquo;attention\u0026rdquo; to, and placing emphasis on, the most relevant features (Niu et al. \u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e2021\u003c/span\u003e). Attention is especially beneficial when not every piece of the input is equally meaningful or informative. In the HAR domain, most researchers currently use attention to concentrate on the time steps or movements that are most characteristic of particular activities. It helps the model focus on the most relevant parts of the input data and can highlight the time steps or features that are crucial for distinguishing between activities. It can also improve the model\u0026rsquo;s ability to recognize activities of varying duration or complexity (Chorowski et al. \u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e2015\u003c/span\u003e). 
To leverage these advantages, this paper uses a \u0026ldquo;Convolutional Block Attention Module (CBAM)\u0026rdquo; block, a specific kind of attention module. CBAM infers two distinct, sequential attention maps, one along the channel dimension and one along the spatial dimension, and multiplies each resulting attention map by the input map in order to refine the features. CBAM therefore consists of two sub-modules, the \u0026ldquo;Channel Attention module (CAM)\u0026rdquo; and the \u0026ldquo;Spatial Attention Module (SAM)\u0026rdquo;. The channel and spatial attention masks produced by these two modules complement each other. Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e illustrates the working mechanism of CBAM. The overall attention process can be written as:\u003cdiv id=\"Equa\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equa\" name=\"EquationSource\"\u003e\n$$\\:{\\text{F}}^{{\\prime\\:}}={\\text{M}}_{c}\\left(\\text{F}\\right)\\otimes\\:\\text{F}$$\u003c/div\u003e\u003c/div\u003e\u003cdiv id=\"Equb\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equb\" name=\"EquationSource\"\u003e\n$$\\:{\\text{F}}^{{\\prime\\:\\prime\\:}}={\\text{M}}_{s}\\left({\\text{F}}^{{\\prime\\:}}\\right)\\otimes\\:{\\text{F}}^{{\\prime\\:}}$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eHere, \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\text{F}\\in\\:{\\mathbb{R}}^{C\\times\\:H\\times\\:W}\\)\u003c/span\u003e\u003c/span\u003e is the input feature map, \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\({\\text{M}}_{c}\\in\\:{\\mathbb{R}}^{C\\times\\:1\\times\\:1}\\)\u003c/span\u003e\u003c/span\u003e is the channel attention feature map, 
\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\({\\text{M}}_{s}\\in\\:{\\mathbb{R}}^{1\\times\\:H\\times\\:W}\\)\u003c/span\u003e\u003c/span\u003e is the spatial attention feature map, and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\otimes\\)\u003c/span\u003e\u003c/span\u003e denotes element-wise multiplication.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec8\" class=\"Section2\"\u003e \u003ch2\u003e3.4 GAP, Dropout, and Softmax Layers\u003c/h2\u003e \u003cp\u003eThis paper uses a \u0026ldquo;Global Average Pooling (GAP)\u0026rdquo; layer. Its main purpose is to reduce the dimensionality of the proposed scheme. GAP layers reduce the number of model parameters and help the model converge faster. They perform an extreme form of dimensionality reduction, which lowers the computational cost of the model. Furthermore, the GAP layer itself has no parameters to optimize, so it helps minimize the total number of model parameters. Because GAP layers aggregate spatial information, they are also more robust to spatial alterations of the input. Dropout, a regularization technique, is used to prevent overfitting by keeping units from depending too heavily on one another. In this study, a Dropout layer is included to mitigate the risk of overfitting. Finally, a Softmax layer produces the final classification result of the proposed human activity recognition scheme.\u003c/p\u003e \u003c/div\u003e"},{"header":"4 Proposed Model","content":"\u003cp\u003eHere, a hybrid deep architecture is proposed for recognizing several human actions effectively. The proposed DL-based framework is a layered architecture that combines Bi-LSTM, CNN, and attention layers to build the classification framework. The proposed Bi-LSTM-CNN model with an attached attention mechanism is intended to obtain better predictive outcomes in the HAR domain. 
The working mechanism of the proposed layered framework is described step by step in the following sections. First, data must be collected in order to deploy the proposed algorithm and analyse the results. Multivariate time-series data are gathered from built-in smartphone sensors for recognizing several human activities. Sensors such as gyroscopes and tri-axial accelerometers provide fine-grained sensory data. Here, the evaluation uses data collected through smartphone sensors, and the experiments are run on the UCI-HAR dataset. After collection, the data are passed through the proposed hybrid scheme for activity recognition. The model works in two parts: one extracts the spatial features of the data, and the other retrieves its temporal dependencies. The data are first passed through the CNN layer for spatial feature extraction. Next, an attention layer is attached to the CNN so that the desirable features are emphasized. The output of the CNN-attention stage is then sent to the Bi-LSTM layers for temporal extraction. Once the spatio-temporal features have been retrieved, they are passed through the remaining layers and finally to the softmax layer, which produces the final activity recognition output. The model is then validated using cross-validation, a comparison between the proposed model and existing works, and an evaluation of the performance metrics. 
Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e gives an overview of the proposed model with its components and layers.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eThe proposed model is now described step by step, together with the contribution of each layer, so that every component can be understood easily. The algorithm for the CBAM-enhanced model is given below; it conveys the step-wise execution of the model components. The working mechanism and model definition can be divided into the following phases:\u003c/p\u003e \u003cdiv id=\"Sec10\" class=\"Section2\"\u003e \u003ch2\u003e4.1 CBAM Module\u003c/h2\u003e \u003cp\u003eThe CBAM module is a specific kind of attention mechanism. It combines channel attention and spatial attention and helps increase model efficiency. Here, CBAM is a main part of the proposed architecture and is used to improve model performance. Both channel attention and spatial attention contribute to this improvement. Channel Attention emphasizes or suppresses individual channels or features, whereas Spatial Attention focuses on the spatial aspects, i.e. time steps and features. Together, the two mechanisms allow the model to focus adaptively on both feature-level and spatial-level information, potentially enhancing the model\u0026rsquo;s performance and interpretability. In the proposed mechanism, channel attention is applied first, followed by spatial attention. By applying channel attention first, the model can focus on the most relevant features before considering the spatial aspect of the input sequence. 
This order allows the model to select the most informative features first, so that the subsequent spatial attention mechanism can use them more effectively. In addition, because uninformative channels are already down-weighted, applying channel attention first can effectively reduce the number of features that the spatial attention mechanism has to attend to. This can improve computational efficiency and potentially performance, especially for high-dimensional input data. Hence, applying channel attention before spatial attention is expected to be effective in obtaining the desired model performance.\u003c/p\u003e \u003cdiv id=\"Sec11\" class=\"Section3\"\u003e \u003ch2\u003e4.1.1. Channel Attention\u003c/h2\u003e \u003cp\u003eIn the first phase of our proposed model, the Channel Attention layer is defined. It implements the channel attention part of the CBAM architecture and learns to emphasize or suppress individual channels (features) of the input tensor according to their importance. The Channel Attention method first computes the average-pooled and max-pooled representations of the input tensor along the sequence dimension. The average-pooled representation summarizes the overall response of each channel, while the max-pooled representation captures its most prominent response. The two pooled tensors are then concatenated and passed through a shared dense layer followed by a sigmoid activation to produce the channel attention weights. After that, the input tensor is multiplied element-wise with the channel attention weights, effectively emphasizing or suppressing specific channels (features) based on their importance. The resulting tensor with channel attention applied is returned as the output. 
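\u003c/p\u003e \u003cp\u003eThe channel attention steps described above (pooling over the sequence dimension, concatenation, a dense layer with a sigmoid, and channel-wise rescaling) can be sketched in plain Python as follows. This is only an illustrative sketch for a single sequence, not the implementation used in the experiments: the function names, the toy input and the all-zero dense weights are hypothetical and chosen so that the result is easy to check by hand.\u003c/p\u003e

```python
import math


def sigmoid(value):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))


def channel_attention(x, dense_weights, dense_bias):
    """Apply channel attention to one sequence.

    x             : list of T time steps, each a list of C channel values
    dense_weights : 2C x C weight matrix of the dense layer (hypothetical values)
    dense_bias    : list of C biases (hypothetical values)
    Returns (rescaled sequence, attention weight per channel).
    """
    n_steps = len(x)
    n_channels = len(x[0])

    # Pool each channel over the sequence dimension.
    avg_pool = [sum(step[c] for step in x) / n_steps for c in range(n_channels)]
    max_pool = [max(step[c] for step in x) for c in range(n_channels)]

    # Concatenate the two descriptors into one vector of length 2C.
    pooled = avg_pool + max_pool

    # Dense layer followed by a sigmoid gives one weight in (0, 1) per channel.
    attention = []
    for c in range(n_channels):
        pre_activation = dense_bias[c] + sum(
            pooled[i] * dense_weights[i][c] for i in range(len(pooled))
        )
        attention.append(sigmoid(pre_activation))

    # Rescale every time step channel-wise with the attention weights.
    rescaled = [[step[c] * attention[c] for c in range(n_channels)] for step in x]
    return rescaled, attention


# Toy input: 2 time steps, 2 channels. All-zero weights give a neutral 0.5 gate.
toy_input = [[1.0, 0.0], [3.0, 2.0]]
zero_weights = [[0.0, 0.0] for _ in range(4)]
zero_bias = [0.0, 0.0]
toy_output, toy_attention = channel_attention(toy_input, zero_weights, zero_bias)
```

\u003cp\u003eWith trained weights the gate values differ per channel, so informative channels are kept close to their original magnitude while the others are attenuated.\u003c/p\u003e \u003cp\u003e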
The Channel Attention module thus learns to focus on the most informative features within the input tensor, potentially improving the model\u0026rsquo;s performance on sequence modeling tasks.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec12\" class=\"Section3\"\u003e \u003ch2\u003e4.1.2. Spatial Attention\u003c/h2\u003e \u003cp\u003eNext, the Spatial Attention part of CBAM is implemented; it is the second key component of the CBAM module. Spatial Attention learns a weighting of spatial locations (across the time steps and the associated features) by applying a 1D convolution with a sigmoid activation along the time dimension of the input tensor. The convolution kernel learns to capture patterns and temporal dependencies in the data, and the sigmoid activation ensures that the learned weights lie between 0 and 1, so that they act as attention weights. By multiplying the input tensor with the learned weights, the model can selectively suppress or emphasize different spatial locations in the input tensor. This allows the model to focus on the most relevant spatial information, potentially improving its performance and interpretability for sequence-based tasks.\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv id=\"Sec13\" class=\"Section2\"\u003e \u003ch2\u003e4.2 CNN Layer\u003c/h2\u003e \u003cp\u003eThe CNN is another important part of our proposed architecture. CNNs are a popular DL model for interpreting image data as well as sensory data. Here, the \u0026ldquo;Convolutional Neural Network\u0026rdquo; is implemented using TimeDistributed layers that wrap Conv1D and pooling layers. 
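\u003c/p\u003e \u003cp\u003eAs a rough illustration of how the sequence length changes inside this CNN stage, the sketch below applies the standard output-length formulas for an unpadded, stride-1 convolution and for non-overlapping max pooling, using the kernel size of 4, pool size of 2 and 128 filters listed in Table 4. UCI-HAR windows contain 128 samples; the split of each window into 4 sub-sequences of 32 samples, the absence of padding, and the helper names are assumptions made for illustration and are not stated in the text.\u003c/p\u003e

```python
def conv1d_output_length(length, kernel_size):
    """Output length of a stride-1 1D convolution without padding."""
    return length - kernel_size + 1


def maxpool1d_output_length(length, pool_size):
    """Output length of non-overlapping max pooling without padding."""
    return length // pool_size


def cnn_stage_lengths(n_length, kernel_size=4, pool_size=2):
    """Sequence length after Conv1D -> MaxPool1D -> Conv1D."""
    after_conv1 = conv1d_output_length(n_length, kernel_size)
    after_pool = maxpool1d_output_length(after_conv1, pool_size)
    after_conv2 = conv1d_output_length(after_pool, kernel_size)
    return after_conv1, after_pool, after_conv2


# Assumed split of a 128-sample window into 4 sub-sequences of 32 samples.
n_timesteps, n_steps = 128, 4
n_length = n_timesteps // n_steps

lengths = cnn_stage_lengths(n_length)
# Features per sub-sequence once the 128 feature maps are flattened.
flattened_features = lengths[-1] * 128
print(lengths, flattened_features)  # (29, 14, 11) 1408
```

\u003cp\u003eUnder these assumptions, each of the 4 sub-sequences would reach the recurrent part of the model as a vector of 1408 values.\u003c/p\u003e \u003cp\u003e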
First, the input layer of CNN accepts a 4D shaped tensor (batch_size, n_steps, n_length, n_features), where n_steps, n_length, and n_features represents the number of time steps, the length of the sequence at each time step, and the number of features, respectively.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"No\" id=\"Taba\" border=\"1\"\u003e \u003ccolgroup cols=\"3\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colspan=\"3\" nameend=\"c3\" namest=\"c1\"\u003e \u003cp\u003eAlgorithm 1: CBAM-enhanced CNN-BLSTM Algorithm\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c3\" namest=\"c2\"\u003e \u003cp\u003e\u003cem\u003eInput: trainX (training input data), trainy (training labels), testX (test input data), testy (test labels)\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c3\" namest=\"c2\"\u003e \u003cp\u003e\u003cem\u003eOutput: Trained CBAM-enhanced model and its performance metrics\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c3\" namest=\"c2\"\u003e \u003cp\u003e\u003cem\u003eInitialization: Load the dataset and define parameters\u003c/em\u003e:\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" 
colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cem\u003e1.1\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eCall the load_dataset() function to obtain trainX, trainy, testX, and testy\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cem\u003e1.2\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eExtract the number of timesteps (n_timesteps), features (n_features), and output classes (n_outputs) from the data\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cem\u003e1.3\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eDefine the number of steps (n_steps) and length (n_length) for reshaping the input data\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c3\" namest=\"c2\"\u003e \u003cp\u003e\u003cem\u003eReshape the input data for TimeDistributed layers\u003c/em\u003e:\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cem\u003e2.1\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eReshape trainX to (trainX.shape[0], n_steps, n_length, n_features)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e 
\u003cp\u003e\u003cem\u003e2.2\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eReshape testX to (testX.shape[0], n_steps, n_length, n_features)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c3\" namest=\"c2\"\u003e \u003cp\u003e\u003cem\u003eBuild the CBAM-enhanced model\u003c/em\u003e:\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cem\u003e3.1\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eCreate an Input layer with the shape (n_steps, n_length, n_features)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cem\u003e3.2\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eTimeDistributed Conv1D layers with ReLU activation and MaxPool1D layers is applied\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cem\u003e3.3\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eApply a TimeDistributed Dropout layer\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cem\u003e3.4\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eApply the ChannelAttention module\u003c/p\u003e \u003c/td\u003e 
\u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cem\u003e3.5\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eApply TimeDistributed Flatten layer\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cem\u003e3.6\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eApply a Bidirectional LSTM layer with return_sequences\u0026thinsp;=\u0026thinsp;True\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cem\u003e3.7\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eApply a Dropout layer\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cem\u003e3.8\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eApply GlobalAveragePooling1D and BatchNormalization layers\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cem\u003e3.9\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eApply the SpatialAttention module\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cem\u003e3.10\u003c/em\u003e\u003c/p\u003e 
\u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eApply a Dense layer with ReLU activation\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cem\u003e3.11\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eApply the output Dense layer with softmax activation\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cem\u003e3.12\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eCreate and compile the CBAM-enhanced model with categorical_crossentropy loss and Adam optimizer\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c3\" namest=\"c2\"\u003e \u003cp\u003e\u003cem\u003ePrint the model summary\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c3\" namest=\"c2\"\u003e \u003cp\u003e\u003cem\u003eTrain the model\u003c/em\u003e:\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cem\u003e5.1\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eDefine the number of epochs, batch size, and verbose setting\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" 
colname=\"c2\"\u003e \u003cp\u003e\u003cem\u003e5.2\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eCall the fit() method on the CBAM-enhanced model, passing trainX, trainy, epochs, validation_data=(testX, testy), batch_size, and verbose\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eThe first TimeDistributed layer applies a 1D convolution with 128 filters, a kernel size of 4 and a ReLU activation function. The TimeDistributed wrapper ensures that the convolution is applied independently to each time step of the input tensor, i.e., to each of the n_steps sub-sequences. After the convolution, a TimeDistributed MaxPool1D layer performs max pooling with a pool size of 2. This down-sampling reduces the length of each sub-sequence and introduces a degree of translation invariance. Next, another TimeDistributed Conv1D layer with 128 filters, a kernel size of 4 and a ReLU activation function is applied; it extracts higher-level features from the output of the previous layer. After that, a TimeDistributed Dropout layer with a dropout rate of 0.5 is applied to regularize the model and reduce overfitting. Combined with the attention mechanisms and the recurrent layers, the CNN forms an architecture capable of capturing both local patterns and long-range dependencies in sequential data, while selectively focusing on the most relevant features and spatial locations.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec14\" class=\"Section2\"\u003e \u003ch2\u003e4.3 Bi-LSTM Layer\u003c/h2\u003e \u003cp\u003eThe Bi-LSTM layer is another important part of the proposed model. Bi-LSTM is a popular RNN model that processes a sequence in both the forward and the backward direction. 
This property allows a Bi-LSTM to exploit the full context of a sequence and thereby improves model performance. The bidirectional part of the model is implemented using the Bidirectional layer in combination with the LSTM layer. Before the Bi-LSTM layer is applied, the output of the convolutional and attention layers is flattened with a TimeDistributed Flatten layer. This is necessary because the LSTM layer expects a 3D input tensor of shape (batch_size, time_steps, features). The flattened output is then passed through the Bidirectional LSTM layer, which wraps the LSTM layer and applies it in two directions, forward and backward. This allows the model to capture both past and future dependencies in the sequential data. The forward LSTM processes the input sequence from the first time step to the last, while the backward LSTM processes it in reverse order, from the last time step to the first. The outputs of the forward and backward LSTMs are then concatenated at each time step, creating a single output sequence that incorporates information from both directions. Finally, a Dropout layer with a rate of 0.5 is applied to regularize the model and reduce overfitting. Using a BLSTM layer can thus improve performance on tasks that require an understanding of the entire sequence context.\u003c/p\u003e \u003c/div\u003e"},{"header":"5 Performance Evaluation","content":"\u003cp\u003eIn this HAR experiment, we present the results of the proposed method (CNN-Attention-BLSTM) on a smartphone-based dataset. The focus of this work is on smartphone sensors: the aim is to evaluate the effect of the proposed architecture on smartphone sensor data for the recognition of human activities. 
To establish the efficiency of the model, we report the experimental outcomes obtained on the UCI HAR dataset in terms of F1-score and accuracy.\u003c/p\u003e \u003cdiv id=\"Sec16\" class=\"Section2\"\u003e \u003ch2\u003e5.1. Dataset Description\u003c/h2\u003e \u003cp\u003eTo evaluate the performance of the human activity recognition model, this study uses a publicly available smartphone-based dataset, UCI HAR. The dataset is described as follows:\u003c/p\u003e \u003cp\u003e \u003cb\u003eUCI-HAR\u003c/b\u003e (Reyes-Ortiz et al. \u003cspan citationid=\"CR41\" class=\"CitationRef\"\u003e2016\u003c/span\u003e): This standard database comes from the \u0026ldquo;University of California Irvine (UCI) Machine Learning\u0026rdquo; repository, which is openly accessible to the public. The dataset is approximately balanced. It was gathered from thirty individuals, ranging in age from 19 to 48, who performed six distinct activities of everyday living: \u0026ldquo;sitting\u0026rdquo;, \u0026ldquo;standing\u0026rdquo;, \u0026ldquo;walking\u0026rdquo;, \u0026ldquo;laying\u0026rdquo;, \u0026ldquo;walking upstairs\u0026rdquo;, and \u0026ldquo;walking downstairs\u0026rdquo; (Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003e). A \u0026ldquo;Samsung Galaxy S II\u0026rdquo; smartphone with an integrated gyroscope and accelerometer, worn on the waist, was used to gather the data. The dataset was collected under supervision in a laboratory setting. The researchers measured the tri-axial angular velocity and tri-axial linear acceleration at a constant sampling rate of 50 Hz. 
In total, the dataset consists of 10,299 instances; further details of the train set and test set are given in Table \u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e and Table \u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eActivities involved in UCI HAR train set\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"3\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eActivities\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eSamples\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003ePercentage\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eWalking\u003c/p\u003e \u003cp\u003eSitting\u003c/p\u003e \u003cp\u003eStanding\u003c/p\u003e \u003cp\u003eLaying\u003c/p\u003e \u003cp\u003eWalking Upstairs\u003c/p\u003e \u003cp\u003eWalking Downstairs\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e1226\u003c/p\u003e \u003cp\u003e1286\u003c/p\u003e \u003cp\u003e1374\u003c/p\u003e \u003cp\u003e1407\u003c/p\u003e \u003cp\u003e1073\u003c/p\u003e \u003cp\u003e986\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e16.7%\u003c/p\u003e 
\u003cp\u003e17.5%\u003c/p\u003e \u003cp\u003e18.7%\u003c/p\u003e \u003cp\u003e19.1%\u003c/p\u003e \u003cp\u003e14.6%\u003c/p\u003e \u003cp\u003e13.4%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab3\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eActivities involved in UCI HAR test set\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"3\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eActivities\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eSamples\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003ePercentage\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eWalking\u003c/p\u003e \u003cp\u003eSitting\u003c/p\u003e \u003cp\u003eStanding\u003c/p\u003e \u003cp\u003eLaying\u003c/p\u003e \u003cp\u003eWalking Upstairs\u003c/p\u003e \u003cp\u003eWalking Downstairs\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e496\u003c/p\u003e \u003cp\u003e491\u003c/p\u003e \u003cp\u003e532\u003c/p\u003e \u003cp\u003e537\u003c/p\u003e \u003cp\u003e471\u003c/p\u003e \u003cp\u003e420\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e16.8%\u003c/p\u003e \u003cp\u003e16.7%\u003c/p\u003e 
\u003cp\u003e18.1%\u003c/p\u003e \u003cp\u003e18.2%\u003c/p\u003e \u003cp\u003e16.0%\u003c/p\u003e \u003cp\u003e14.3%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eFrom Fig.\u0026nbsp;\u003cspan refid=\"Fig7\" class=\"InternalRef\"\u003e7\u003c/span\u003ea and Fig.\u0026nbsp;\u003cspan refid=\"Fig7\" class=\"InternalRef\"\u003e7\u003c/span\u003eb, it can be seen that the UCI HAR dataset is approximately balanced, with a similar percentage of samples in each category for both the train and test sets. Here, class 1 represents walking, class 2 walking upstairs, class 3 walking downstairs, class 4 sitting, class 5 standing, and class 6 laying.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec17\" class=\"Section2\"\u003e \u003ch2\u003e5.2. Experimental Setup\u003c/h2\u003e \u003cp\u003eThe proposed network was built using Keras, a high-level API for neural networks written in Python. Keras can run on top of CNTK, TensorFlow, or Theano, and it allows a model to be taken from initial design to final result with minimal delay. TensorFlow is used as the backend in this experiment. Model training and validation are performed on a PC with an Intel(R) Core(TM) i3-7100U CPU at 2.40 GHz, with 4.00 GB RAM and 16 GB memory, running a 64-bit Windows operating system.\u003c/p\u003e \u003cp\u003eFor the experiment, the dataset is first divided into a train set and a test set containing 7352 and 2947 samples, respectively. The model is then trained in a fully supervised manner. In the training phase, a forward pass is performed on the train data to obtain the model output. 
The gradient is then back-propagated from the softmax layer to the convolution layers. For each layer, the biases and weights were initialized with randomly selected values. The Adam optimizer is used to update the model parameters (weights and biases) from the back-propagated errors. Adam is a stochastic optimization algorithm based on first-order gradients and is widely used as the optimizer in such models. ReLU is used as the activation function in the layers of the model. The kernel sizes, numbers of filters and other hyper-parameters used to deploy the proposed network are listed in Table \u003cspan refid=\"Tab4\" class=\"InternalRef\"\u003e4\u003c/span\u003e.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec18\" class=\"Section2\"\u003e \u003ch2\u003e5.3. Experimental Results\u003c/h2\u003e \u003cp\u003eIn this phase, the experimental outputs obtained by deploying the model are presented and briefly discussed. The model performance is evaluated on the UCI HAR dataset. The output of the proposed model is discussed with respect to the performance metrics, the classification report, and related measures. The results are reported in terms of accuracy, recall, precision, computational cost (training time), and testing time. A comparison is also provided (Table \u003cspan refid=\"Tab5\" class=\"InternalRef\"\u003e5\u003c/span\u003e) with other existing and commonly used deep learning models such as CNN, LSTM, BLSTM, CNN-LSTM, and CNN-BLSTM. This comparison is performed in the same experimental environment in order to verify the efficiency of the model. 
Table\u0026nbsp;\u003cspan refid=\"Tab5\" class=\"InternalRef\"\u003e5\u003c/span\u003e reports the training time, testing time and accuracy of these DL models. The proposed model has the lowest testing time (65 ms), and its training time (273 ms) is comparable to that of the plain CNN (271 ms) and lower than that of the other models. In terms of accuracy, the proposed \u0026ldquo;CNN-Attention-BLSTM\u0026rdquo; model reaches 93%, which is higher than the other DL models in the comparison.\u003c/p\u003e \u003cp\u003eTable \u003cspan refid=\"Tab6\" class=\"InternalRef\"\u003e6\u003c/span\u003e presents the detailed classification results of the proposed scheme, which combines three popular DL components. The results are reported in terms of precision, recall, F1-score, and accuracy. The dataset has six classes, one for each human activity, so the classification report gives a clear picture of the model performance for every class. For the dynamic activities, walking, walking upstairs, and walking downstairs, the obtained F1-scores are 99%, 93%, and 95%, respectively. This indicates that the proposed hybrid model can distinguish similar activity patterns effectively, which is generally harder to achieve with the individual models alone. 
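\u003c/p\u003e \u003cp\u003eThe per-class scores discussed here follow the standard definitions of precision, recall and F1-score, which can be computed from a confusion matrix as in the minimal sketch below. The 3-class matrix and the function names are hypothetical and serve only to illustrate the calculation; the values are not taken from Table 6 or Fig. 8.\u003c/p\u003e

```python
def per_class_report(confusion):
    """Return (precision, recall, f1) for every class.

    confusion[i][j] is the number of samples whose true class is i
    and whose predicted class is j.
    """
    n_classes = len(confusion)
    report = []
    for k in range(n_classes):
        true_pos = confusion[k][k]
        false_pos = sum(confusion[i][k] for i in range(n_classes)) - true_pos
        false_neg = sum(confusion[k]) - true_pos
        predicted = true_pos + false_pos
        actual = true_pos + false_neg
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report.append((precision, recall, f1))
    return report


def accuracy(confusion):
    """Fraction of all samples that lie on the diagonal."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    return correct / total


# Hypothetical confusion matrix: classes 0 and 1 are sometimes confused,
# class 2 is always recognized correctly.
toy_confusion = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 0, 10],
]
toy_report = per_class_report(toy_confusion)
toy_accuracy = accuracy(toy_confusion)
```

\u003cp\u003eIn this toy example the two classes that are confused with each other receive lower F1-scores than the third, which is the same pattern the classification report shows for similar activities.\u003c/p\u003e \u003cp\u003e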
Similarly, for the static activities, sitting, standing, and laying, the obtained F1-scores are 84%, 87%, and 97%, respectively, which also compares well with the existing literature. These results indicate that the proposed mechanism not only differentiates between dynamic and static activities but also identifies similar types of patterns efficiently. Apart from the F1-score, the classification report shows that the model also performs well in terms of recall and precision for both static and dynamic human activities. The confusion matrix for the activities in the test set is shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig8\" class=\"InternalRef\"\u003e8\u003c/span\u003e; it is obtained from the multi-class classifier for the six activity classes. Finally, it can be concluded that the proposed model classifies the distinct human activities into their respective classes efficiently, with an accuracy of 93%.\u003c/p\u003e \u003cp\u003e\u003cstrong\u003eTable\u003c/strong\u003e\u003cstrong\u003e\u0026nbsp;\u003c/strong\u003e\u003cstrong\u003e4\u003c/strong\u003e\u003cstrong\u003e\u0026nbsp;\u003c/strong\u003eDescription of hyper-parameters\u003c/p\u003e\n\u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\" width=\"344\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 85px;\"\u003e\n \u003cp\u003e\u003cstrong\u003ePhase\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" valign=\"top\" style=\"width: 171px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eHyper-parameters\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 88px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eAssigned 
values\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd rowspan=\"2\" valign=\"top\" style=\"width: 85px;\"\u003e\n \u003cp\u003eModel Architecture\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" valign=\"top\" style=\"width: 171px;\"\u003e\n \u003cp\u003e\u003cstrong\u003e\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd rowspan=\"2\" valign=\"top\" style=\"width: 88px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003cp\u003e4\u003c/p\u003e\n \u003cp\u003e128\u003c/p\u003e\n \u003cp\u003eReLU\u003c/p\u003e\n \u003cp\u003e2\u003c/p\u003e\n \u003cp\u003e4\u003c/p\u003e\n \u003cp\u003e128\u003c/p\u003e\n \u003cp\u003eReLU\u003c/p\u003e\n \u003cp\u003e0.5\u003c/p\u003e\n \u003cp\u003e100\u003c/p\u003e\n \u003cp\u003eReLU\u003c/p\u003e\n \u003cp\u003e0.5\u003c/p\u003e\n \u003cp\u003e100\u003c/p\u003e\n \u003cp\u003eReLU\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 97px;\"\u003e\n \u003cp\u003eConvolution_1\u003c/p\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003cp\u003eMax-pooling\u003c/p\u003e\n \u003cp\u003eConvolution_1\u003c/p\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003cp\u003eBi-LSTM\u003c/p\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003cp\u003eOutput\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 74px;\"\u003e\n \u003cp\u003eKernel size\u003c/p\u003e\n \u003cp\u003eFilters\u003c/p\u003e\n \u003cp\u003eActivation\u003c/p\u003e\n \u003cp\u003ePool size\u003c/p\u003e\n \u003cp\u003eKernel size\u003c/p\u003e\n \u003cp\u003eFilters\u003c/p\u003e\n \u003cp\u003eActivation\u003c/p\u003e\n \u003cp\u003eDropout\u003c/p\u003e\n \u003cp\u003eneurons\u003c/p\u003e\n \u003cp\u003eActivation\u003c/p\u003e\n 
\u003cp\u003eDropout\u003c/p\u003e\n \u003cp\u003eneurons\u003c/p\u003e\n \u003cp\u003eActivation\u003c/p\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 85px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003cp\u003eTraining\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 97px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003cp\u003eOptimizer\u003c/p\u003e\n \u003cp\u003eEpochs\u003c/p\u003e\n \u003cp\u003eBatch size\u003c/p\u003e\n \u003cp\u003eLoss\u003c/p\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003cp\u003eMetrics\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 74px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 88px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003cp\u003eAdam\u003c/p\u003e\n \u003cp\u003e30\u003c/p\u003e\n \u003cp\u003e150\u003c/p\u003e\n \u003cp\u003eCategorical cross entropy\u003c/p\u003e\n \u003cp\u003eAccuracy\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003ctable id=\"Tab5\" border=\"1\"\u003e\n \u003ccaption language=\"En\"\u003e\n \u003cdiv class=\"CaptionNumber\"\u003eTable 5\u003c/div\u003e\n \u003cdiv class=\"CaptionContent\"\u003e\n \u003cp\u003ePerformance comparison with other models using UCI HAR dataset\u003c/p\u003e\n \u003c/div\u003e\n \u003c/caption\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\" rowspan=\"2\"\u003e\n \u003cp\u003eModels\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\" colspan=\"5\"\u003e\n \u003cp\u003eMethods\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eCNN\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eLSTM\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n 
\u003cp\u003eBi-LSTM\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eCNN-LSTM\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003eCNN-Attention-BiLSTM\u003c/strong\u003e\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eTraining time\u003c/p\u003e\n \u003cp\u003eTesting time\u003c/p\u003e\n \u003cp\u003eTest accuracy (%)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e271 ms\u003c/p\u003e\n \u003cp\u003e96 ms\u003c/p\u003e\n \u003cp\u003e91.25\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e675 ms\u003c/p\u003e\n \u003cp\u003e153 ms\u003c/p\u003e\n \u003cp\u003e90.19\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e591 ms\u003c/p\u003e\n \u003cp\u003e197 ms\u003c/p\u003e\n \u003cp\u003e91.28\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e371 ms\u003c/p\u003e\n \u003cp\u003e150 ms\u003c/p\u003e\n \u003cp\u003e91\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e273 ms\u003c/p\u003e\n \u003cp\u003e65 ms\u003c/p\u003e\n \u003cp\u003e93\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e\u003cstrong\u003e\u0026nbsp;\u003cstrong\u003eTable\u003c/strong\u003e\u003cstrong\u003e\u0026nbsp;\u003c/strong\u003e\u003cstrong\u003e6\u003c/strong\u003e\u003cstrong\u003e\u0026nbsp;\u003c/strong\u003eClassification report of proposed model on UCI HAR dataset\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n\u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 78px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003cp\u003eClass\u003c/p\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd 
valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003cp\u003ePrecision\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 62px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003cp\u003eRecall\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 59px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003cp\u003eF1-Score\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 67px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003cp\u003eSupport\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 78px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003cp\u003eWalking\u003c/p\u003e\n \u003cp\u003eUpstairs\u003c/p\u003e\n \u003cp\u003eDownstairs\u003c/p\u003e\n \u003cp\u003eSitting\u003c/p\u003e\n \u003cp\u003eStanding\u003c/p\u003e\n \u003cp\u003eLaying\u003c/p\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003cp\u003eAccuracy\u003c/p\u003e\n \u003cp\u003emacro avg\u003c/p\u003e\n \u003cp\u003eWeighted avg\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003cp\u003e1.00\u003c/p\u003e\n \u003cp\u003e0.91\u003c/p\u003e\n \u003cp\u003e0.91\u003c/p\u003e\n \u003cp\u003e0.86\u003c/p\u003e\n \u003cp\u003e0.87\u003c/p\u003e\n \u003cp\u003e1.00\u003c/p\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003cp\u003e0.93\u003c/p\u003e\n \u003cp\u003e0.93\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 62px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003cp\u003e0.98\u003c/p\u003e\n \u003cp\u003e0.94\u003c/p\u003e\n \u003cp\u003e1.00\u003c/p\u003e\n \u003cp\u003e0.82\u003c/p\u003e\n \u003cp\u003e0.88\u003c/p\u003e\n \u003cp\u003e0.95\u003c/p\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n 
\u003cp\u003e0.93\u003c/p\u003e\n \u003cp\u003e0.93\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 59px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003cp\u003e0.99\u003c/p\u003e\n \u003cp\u003e0.93\u003c/p\u003e\n \u003cp\u003e0.95\u003c/p\u003e\n \u003cp\u003e0.84\u003c/p\u003e\n \u003cp\u003e0.87\u003c/p\u003e\n \u003cp\u003e0.97\u003c/p\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003cp\u003e0.93\u003c/p\u003e\n \u003cp\u003e0.93\u003c/p\u003e\n \u003cp\u003e0.93\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 67px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003cp\u003e496\u003c/p\u003e\n \u003cp\u003e471\u003c/p\u003e\n \u003cp\u003e420\u003c/p\u003e\n \u003cp\u003e491\u003c/p\u003e\n \u003cp\u003e532\u003c/p\u003e\n \u003cp\u003e537\u003c/p\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003cp\u003e2947\u003c/p\u003e\n \u003cp\u003e2947\u003c/p\u003e\n \u003cp\u003e2947\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e5.4. Performance Evaluation\u003cp\u003eAfter running the experiments, we need to check whether the proposed model produces satisfactory classification output. Performance evaluation is therefore a crucial step for verifying that the model classifies human activities reliably. To this end, we perform two analyses, cross-validation and an ablation study, to justify and examine the performance of the model.\u003c/p\u003e\n\u003cdiv id=\"Sec20\" class=\"Section3\"\u003e\n \u003ch2\u003e5.4.1. Cross Validation\u003c/h2\u003e\n \u003cp\u003eIn this work, cross-validation is used to verify the performance of the model on the dataset. 
Cross-validation is a widely used statistical method in machine learning for estimating how a predictive model performs on independent data. The data is partitioned into training and validation subsets: the model is trained on the training portion and its accuracy is assessed on the held-out portion. This process is repeated several times with different partitions, and the results are averaged. Cross-validation helps to detect overfitting and therefore yields a more reliable estimate of model performance.\u003c/p\u003e\n \u003cp\u003eAmong the available cross-validation methods, we apply stratified k-fold cross-validation in this study. It is a variation of the k-fold method in which each fold preserves the proportion of samples of each target class found in the whole dataset. Stratification does not remove class imbalance, but it keeps every fold representative of the overall class distribution. We set k to 5, and the outcome is presented in Table \u003cspan class=\"InternalRef\"\u003e7\u003c/span\u003e.\u003c/p\u003e\u0026nbsp;\u003ctable id=\"Tab7\" border=\"1\"\u003e\n \u003ccaption language=\"En\"\u003e\n \u003cdiv class=\"CaptionNumber\"\u003eTable 7\u003c/div\u003e\n \u003cdiv class=\"CaptionContent\"\u003e\n \u003cp\u003eCross-validation Result\u003c/p\u003e\n \u003c/div\u003e\n \u003c/caption\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eNo. 
of Folds\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eAverage Train Accuracy\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eAverage Validation Accuracy\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eAverage Precision\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eAverage Recall\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eAverage F1-Score\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eTest Accuracy\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eTest Loss\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e5\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e98.85%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e98.35%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.9840\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.9835\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.9835\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e92.91%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.4581\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n \u003cp\u003e\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec21\" class=\"Section3\"\u003e\n \u003ch2\u003e5.4.2. Ablation Study\u003c/h2\u003e\n \u003cp\u003eAn ablation study was also performed to demonstrate the specific contribution of each component of the hybrid architecture. An ablation test examines the performance of a model after removing individual components, which indicates how much each component contributes to the accuracy of the overall system. In this work, the ablation experiments are carried out in three ways.\u003c/p\u003e\n \u003cp\u003eFirst, we evaluate the components of the proposed architecture individually and then evaluate the combined hybrid model, to understand how each part affects overall performance [Table \u003cspan class=\"InternalRef\"\u003e8\u003c/span\u003e]. In Table \u003cspan class=\"InternalRef\"\u003e8\u003c/span\u003e, the CNN and the bidirectional LSTM are first evaluated on their own. 
Next, the combination of CNN and BiLSTM with the added attention mechanism is evaluated; the hybrid reaches 93% accuracy, compared with 91.25% for the CNN alone and 91.28% for the BiLSTM alone.\u003c/p\u003e\n \u003cp\u003eWe then perform an ablation of the CBAM attention mechanism [Table \u003cspan class=\"InternalRef\"\u003e9\u003c/span\u003e] in three configurations: i) spatial attention only (channel attention removed), ii) channel attention only (spatial attention removed), and iii) both channel and spatial attention. Table \u003cspan class=\"InternalRef\"\u003e9\u003c/span\u003e shows that neither channel nor spatial attention alone is sufficient: accuracy falls to 39.76% with spatial attention only and to 16.82% with channel attention only, against 93% with both. Each CBAM module is therefore important for model accuracy. Finally, we vary training hyper-parameters such as the batch size and the number of epochs to understand the importance of selecting suitable values [Tables \u003cspan class=\"InternalRef\"\u003e10\u003c/span\u003e and \u003cspan class=\"InternalRef\"\u003e11\u003c/span\u003e]. 
The proposed architecture achieves its highest accuracy with the selected batch size (150) and number of epochs (30).\u0026nbsp;\u003c/p\u003e\u0026nbsp;\u003ctable id=\"Tab8\" border=\"1\"\u003e\n \u003ccaption language=\"En\"\u003e\n \u003cdiv class=\"CaptionNumber\"\u003eTable 8\u003c/div\u003e\n \u003cdiv class=\"CaptionContent\"\u003e\n \u003cp\u003eAblation test result I\u003c/p\u003e\n \u003c/div\u003e\n \u003c/caption\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eModel\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eAttention\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eAccuracy\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eF1-score\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCNN\u003c/p\u003e\n \u003cp\u003eBi-LSTM\u003c/p\u003e\n \u003cp\u003eCNN-BiLSTM\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e-\u003c/p\u003e\n \u003cp\u003e-\u003c/p\u003e\n \u003cp\u003e✔\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e91.25%\u003c/p\u003e\n \u003cp\u003e91.28%\u003c/p\u003e\n \u003cp\u003e93%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.91\u003c/p\u003e\n \u003cp\u003e0.91\u003c/p\u003e\n \u003cp\u003e0.93\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\u0026nbsp;\u003ctable id=\"Tab9\" border=\"1\"\u003e\n \u003ccaption language=\"En\"\u003e\n \u003cdiv class=\"CaptionNumber\"\u003eTable 9\u003c/div\u003e\n \u003cdiv class=\"CaptionContent\"\u003e\n \u003cp\u003eAblation test result II\u003c/p\u003e\n \u003c/div\u003e\n \u003c/caption\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eModel\u003c/p\u003e\n 
\u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eChannel Attention\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eSpatial Attention\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eAccuracy\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCNN-Attention-BiLSTM\u003c/p\u003e\n \u003cp\u003eCNN-Attention-BiLSTM\u003c/p\u003e\n \u003cp\u003eCNN-Attention-BiLSTM\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e-\u003c/p\u003e\n \u003cp\u003e✔\u003c/p\u003e\n \u003cp\u003e✔\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e✔\u003c/p\u003e\n \u003cp\u003e-\u003c/p\u003e\n \u003cp\u003e✔\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e39.7552%\u003c/p\u003e\n \u003cp\u003e16.8178%\u003c/p\u003e\n \u003cp\u003e93%\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\u0026nbsp;\u003ctable id=\"Tab10\" border=\"1\"\u003e\n \u003ccaption language=\"En\"\u003e\n \u003cdiv class=\"CaptionNumber\"\u003eTable 10\u003c/div\u003e\n \u003cdiv class=\"CaptionContent\"\u003e\n \u003cp\u003eAblation test result III\u003c/p\u003e\n \u003c/div\u003e\n \u003c/caption\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eModel\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eEpochs\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eBatch Size\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eAccuracy\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCNN-Attention-BiLSTM\u003c/p\u003e\n \u003cp\u003eCNN-Attention-BiLSTM\u003c/p\u003e\n 
\u003cp\u003eCNN-Attention-BiLSTM\u003c/p\u003e\n \u003cp\u003eCNN-Attention-BiLSTM\u003c/p\u003e\n \u003cp\u003eCNN-Attention-BiLSTM\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e30\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e25\u003c/p\u003e\n \u003cp\u003e50\u003c/p\u003e\n \u003cp\u003e75\u003c/p\u003e\n \u003cp\u003e100\u003c/p\u003e\n \u003cp\u003e\u003cstrong\u003e150\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e91.8222%\u003c/p\u003e\n \u003cp\u003e92.1615%\u003c/p\u003e\n \u003cp\u003e90.6006%\u003c/p\u003e\n \u003cp\u003e92.0937%\u003c/p\u003e\n \u003cp\u003e\u003cstrong\u003e93.00%\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\u0026nbsp;\u003ctable id=\"Tab11\" border=\"1\"\u003e\n \u003ccaption language=\"En\"\u003e\n \u003cdiv class=\"CaptionNumber\"\u003eTable 11\u003c/div\u003e\n \u003cdiv class=\"CaptionContent\"\u003e\n \u003cp\u003eAblation test result IV\u003c/p\u003e\n \u003c/div\u003e\n \u003c/caption\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eModel\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eEpochs\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eBatch Size\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eAccuracy\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCNN-Attention-BiLSTM\u003c/p\u003e\n \u003cp\u003eCNN-Attention-BiLSTM\u003c/p\u003e\n \u003cp\u003eCNN-Attention-BiLSTM\u003c/p\u003e\n \u003cp\u003eCNN-Attention-BiLSTM\u003c/p\u003e\n \u003cp\u003eCNN-Attention-BiLSTM\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e10\u003c/p\u003e\n \u003cp\u003e20\u003c/p\u003e\n 
\u003cp\u003e\u003cstrong\u003e30\u003c/strong\u003e\u003c/p\u003e\n \u003cp\u003e40\u003c/p\u003e\n \u003cp\u003e50\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e150\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e89.3790%\u003c/p\u003e\n \u003cp\u003e91.5847%\u003c/p\u003e\n \u003cp\u003e\u003cstrong\u003e93.00%\u003c/strong\u003e\u003c/p\u003e\n \u003cp\u003e91.9240%\u003c/p\u003e\n \u003cp\u003e91.6525%\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n \u003cp\u003e\u003c/p\u003e\n\u003c/div\u003e"},{"header":"6 Discussion","content":"\u003cp\u003eIn this work, we have proposed a hybrid deep learning model for classifying human activities. Understanding and classifying human activities from sensor data has become a popular research area, and recognizing complex activities remains a challenging task.\u003c/p\u003e \u003cp\u003eA classification model therefore needs to capture the meaningful patterns in the data to produce better outcomes. To this end, we propose a hybrid DL model that combines three mechanisms, CNN, attention, and bidirectional LSTM, to classify human activities. The publicly available \u0026ldquo;UCI-HAR\u0026rdquo; dataset is used for the experiments. CNNs are well known for capturing spatial and local patterns, LSTMs are suited to capturing temporal dependencies, and the attention mechanism emphasizes the most informative features while suppressing less relevant ones. 
Moreover, unlike unidirectional LSTMs, Bi-LSTMs process the sequence in both forward and backward directions, so that dynamic activities can be modeled better. Considering these factors, the hybrid \u0026ldquo;CNN-Attention-Bi-LSTM\u0026rdquo; model is proposed with the aim of improving classification results. The proposed model performed well on the UCI-HAR dataset, reaching an average training accuracy of 98.85% under 5-fold cross-validation and a test accuracy of 93%. To validate the model, we used cross-validation, an ablation study, the confusion matrix, and the accuracy, precision, and recall metrics. These analyses support the effectiveness of the suggested mechanism in classifying human activities. The proposed architecture can be useful in domains that involve classifying complex human actions. In future work, the scheme can be modified by adding or removing layers, tuning the hyper-parameters, or adding other components. Other statistical methods can also be used to evaluate the model more thoroughly. Further work on the proposed architecture is therefore planned to improve its classification results.\u003c/p\u003e"},{"header":"7 Conclusion","content":"\u003cp\u003eAn attention-based hybrid deep learning architecture is proposed here for classifying human activities into their respective categories. The hybrid model leverages the characteristics of three DL mechanisms. UCI-HAR, a well-known publicly available dataset, is used for the experiments. 
The hybrid model integrates CNN, attention, and Bi-LSTM components into a single structure that recognizes complex human activities by learning meaningful patterns from the data. Together, CNN and Bi-LSTM capture spatio-temporal patterns: CNNs are well suited to capturing local trends, whereas LSTMs are effective at capturing temporal features. BiLSTM is used in this experiment instead of a plain LSTM because it processes the sequence in both forward and backward directions. BiLSTM may therefore produce better results for dynamic human activities such as walking and running, whose recognition benefits from both past and future time steps. Besides extracting relevant features, it is also necessary to select the most meaningful features and to suppress less relevant ones. For this purpose, an attention mechanism is applied to select informative features and to reduce dimensionality, which also lowers the computational time of the model. Considering these factors, the \u0026ldquo;CNN-Attention-BiLSTM\u0026rdquo; model is proposed in this work with the aim of obtaining better accuracy. In the experiments, the proposed model achieved a classification accuracy of 93%, which shows that the architecture performed well on the dataset and learned important features. To justify and validate the suggested mechanism, we also used the classification report, the confusion matrix, ablation experiments, and cross-validation. 
These analyses support the effectiveness of the model in classifying the activities in the UCI-HAR dataset. We also compared the model with CNN, LSTM, Bi-LSTM, and CNN-LSTM baselines in terms of training time, testing time, and test accuracy. In future work, we expect to enhance the proposed model by modifying its components and parameters to obtain more accurate recognition of more complex human activities.\u003c/p\u003e "},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eFunding Declaration\u003c/strong\u003e\u003c/p\u003e\n\n\u003cp\u003eThere is no funding for this manuscript.\u003c/p\u003e\n"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eAbdel-Basset, M., Hawash, H., Chakrabortty, R.K., Ryan, M., Elhoseny, M. and Song, H., 2020. ST-DeepHAR: Deep learning model for human activity recognition in IoHT applications. IEEE Internet of Things Journal, 8(6), pp.4969-4979.\u003c/li\u003e\n\u003cli\u003eAlo, U.R., Nweke, H.F., Teh, Y.W. and Murtaza, G., 2020. Smartphone motion sensor-based complex human activity identification using deep stacked autoencoder algorithm for enhanced smart healthcare system. Sensors, 20(21), p.6300.\u003c/li\u003e\n\u003cli\u003eAlzubaidi, L., Zhang, J., Humaidi, A.J., Al-Dujaili, A., Duan, Y., Al-Shamma, O., Santamar\u0026iacute;a, J., Fadhel, M.A., Al-Amidie, M. and Farhan, L., 2021. Review of deep learning: concepts, CNN architectures, challenges, applications, future directions. Journal of Big Data, 8, pp.1-74.\u003c/li\u003e\n\u003cli\u003eAnwar, S., Milanova, M., Adbulla, S. and Muhammed, S.A., 2023. Graph convolutional networks for pain detection via telehealth. In Artificial Intelligence in Healthcare and COVID-19 (pp. 93-104). 
Academic Press.\u003c/li\u003e\n\u003cli\u003eAriza-Colpas, P.P., Vicario, E., Oviedo-Carrascal, A.I., Butt Aziz, S., Pi\u0026ntilde;eres-Melo, M.A., Quintero-Linero, A. and Patara, F., 2022. Human activity recognition data analysis: History, evolutions, and new trends. Sensors, 22(9), p.3401.\u003c/li\u003e\n\u003cli\u003eAtalaa, B.A., Ziedan, I., Alenany, A. and Helmi, A., 2021. Feature Engineering for Human Activity Recognition. Int. J. Adv. Comput. Sci. Appl, 12, pp.160-167.\u003c/li\u003e\n\u003cli\u003eBarna, A., Masum, A.K.M., Hossain, M.E., Bahadur, E.H. and Alam, M.S., 2019, February. A study on human activity recognition using gyroscope, accelerometer, temperature and humidity data. In 2019 international conference on electrical, computer and communication engineering (ECCE) (pp. 1-6). IEEE.\u003c/li\u003e\n\u003cli\u003eBhatt, D., Patel, C., Talsania, H., Patel, J., Vaghela, R., Pandya, S., Modi, K. and Ghayvat, H., 2021. CNN variants for computer vision: History, architecture, application, challenges and future scope. Electronics, 10(20), p.2470.\u003c/li\u003e\n\u003cli\u003eBulbul, E., Cetin, A. and Dogru, I.A., 2018, October. Human activity recognition using smartphones. In 2018 2nd international symposium on multidisciplinary studies and innovative technologies (ISMSIT) (pp. 1-6). IEEE.\u003c/li\u003e\n\u003cli\u003eCao, J., Pang, Y., Xie, J., Khan, F.S. and Shao, L., 2021. From handcrafted to deep features for pedestrian detection: A survey. IEEE transactions on pattern analysis and machine intelligence, 44(9), pp.4913-4934.\u003c/li\u003e\n\u003cli\u003eChalla, S.K., Kumar, A., Semwal, V.B. and Dua, N., 2023. An optimized deep learning model for human activity recognition using inertial measurement units. Expert Systems, 40(10), p.e13457.\u003c/li\u003e\n\u003cli\u003eChen, K., Zhang, D., Yao, L., Guo, B., Yu, Z. and Liu, Y., 2021. Deep learning for sensor-based human activity recognition: Overview, challenges, and opportunities. 
ACM Computing Surveys (CSUR), 54(4), pp.1-40.\u003c/li\u003e\n\u003cli\u003eChorowski, J.K., Bahdanau, D., Serdyuk, D., Cho, K. and Bengio, Y., 2015. Attention-based models for speech recognition. Advances in neural information processing systems, 28.\u003c/li\u003e\n\u003cli\u003eC\u0026ocirc;t\u0026eacute;-Allard, U., Campbell, E., Phinyomark, A., Laviolette, F., Gosselin, B. and Scheme, E., 2020. Interpreting deep learning features for myoelectric control: A comparison with handcrafted features. Frontiers in bioengineering and biotechnology, 8, p.158.\u003c/li\u003e\n\u003cli\u003eCvetković, B., Szeklicki, R., Janko, V., Lutomski, P. and Lu\u0026scaron;trek, M., 2018. Real-time activity monitoring with a wristband and a smartphone. Information Fusion, 43, pp.77-93.\u003c/li\u003e\n\u003cli\u003eDang, L.M., Min, K., Wang, H., Piran, M.J., Lee, C.H. and Moon, H., 2020. Sensor-based and vision-based human activity recognition: A comprehensive survey. Pattern Recognition, 108, p.107561.\u003c/li\u003e\n\u003cli\u003eFu, B., Damer, N., Kirchbuchner, F. and Kuijper, A., 2020. Sensing technology for human activity recognition: A comprehensive survey. IEEE Access, 8, pp.83791-83820.\u003c/li\u003e\n\u003cli\u003eGu, F., Chung, M.H., Chignell, M., Valaee, S., Zhou, B. and Liu, X., 2021. A survey on deep learning for human activity recognition. ACM Computing Surveys (CSUR), 54(8), pp.1-34.\u003c/li\u003e\n\u003cli\u003eIge, A.O. and Noor, M.H.M., 2023. A deep local-temporal architecture with attention for lightweight human activity recognition. Applied Soft Computing, 149, p.110954.\u003c/li\u003e\n\u003cli\u003eIslam, M.M., Nooruddin, S., Karray, F. and Muhammad, G., 2022. Human activity recognition using tools of convolutional neural networks: A state of the art review, data sets, challenges, and future prospects. Computers in Biology and Medicine, 149, p.106060.\u003c/li\u003e\n\u003cli\u003eJang, B., Kim, M., Harerimana, G., Kang, S.U. and Kim, J.W., 2020. 
Bi-LSTM model to increase accuracy in text classification: Combining Word2vec CNN and attention mechanism. Applied Sciences, 10(17), p.5841.\u003c/li\u003e\n\u003cli\u003eKhan, I.U., Afzal, S. and Lee, J.W., 2022. Human activity recognition via hybrid deep learning-based model. Sensors, 22(1), p.323.\u003c/li\u003e\n\u003cli\u003eKumar, M., Patel, A.K., Biswas, M. and Shitharth, S., 2023. Attention-based bidirectional-long short-term memory for abnormal human activity detection. Scientific Reports, 13(1), p.14442.\u003c/li\u003e\n\u003cli\u003eLee, P., Uh, Y. and Byun, H., 2020, April. Background suppression network for weakly-supervised temporal action localization. In Proceedings of the AAAI conference on artificial intelligence (Vol. 34, No. 07, pp. 11320-11327).\u003c/li\u003e\n\u003cli\u003eLi, J., Wang, Y. and McAuley, J., 2020, January. Time interval aware self-attention for sequential recommendation. In Proceedings of the 13th international conference on web search and data mining (pp. 322-330).\u003c/li\u003e\n\u003cli\u003eLi, L., Yang, Y., Yuan, Z. and Chen, Z., 2021. A spatial-temporal approach for traffic status analysis and prediction based on Bi-LSTM structure. Modern Physics Letters B, 35(31), p.2150481.\u003c/li\u003e\n\u003cli\u003eLi, Y., Yang, G., Su, Z., Li, S. and Wang, Y., 2023. Human activity recognition based on multienvironment sensor data. Information Fusion, 91, pp.47-63.\u003c/li\u003e\n\u003cli\u003eMim, T.R., Amatullah, M., Afreen, S., Yousuf, M.A., Uddin, S., Alyami, S.A., Hasan, K.F. and Moni, M.A., 2023. GRU-INC: An inception-attention based approach using GRU for human activity recognition. Expert Systems with Applications, 216, p.119419.\u003c/li\u003e\n\u003cli\u003eMutegeki, R. and Han, D.S., 2020, February. A CNN-LSTM approach to human activity recognition. In 2020 international conference on artificial intelligence in information and communication (ICAIIC) (pp. 362-366). 
IEEE.\u003c/li\u003e\n\u003cli\u003eNafea, O., Abdul, W., Muhammad, G. and Alsulaiman, M., 2021. Sensor-based human ac- tivity recognition with spatio-temporal deep learning. Sensors, 21(6), p.2141.\u003c/li\u003e\n\u003cli\u003eNaheliya, B., Redhu, P. and Kumar, K., 2023. MFOA-Bi-LSTM: An optimized bidirectional long short-term memory model for short-term traffic flow prediction. Physica A: Statistical Mechanics and its Applications, p.129448.\u003c/li\u003e\n\u003cli\u003eNguyen, V.S., Kim, H. and Suh, D., 2023. Attention Mechanism-Based Bidirectional Long Short-Term Memory for Cycling Activity Recognition Using Smartphones. IEEE Access, 11, pp.136206-136218.\u003c/li\u003e\n\u003cli\u003eNi, Q., Fan, Z., Zhang, L., Nugent, C.D., Cleland, I., Zhang, Y. and Zhou, N., 2020. Leveraging wearable sensors for human daily activity recognition with stacked denoising autoencoders. Sensors, 20(18), p.5114.\u003c/li\u003e\n\u003cli\u003eNiu, Z., Zhong, G. and Yu, H., 2021. A review on the attention mechanism of deep learning. Neurocomputing, 452, pp.48-62.\u003c/li\u003e\n\u003cli\u003ePareek, P. and Thakkar, A., 2021. A survey on video-based human action recognition: recent updates, datasets, challenges, and applications. Artificial Intelligence Review, 54, pp.2259-2322.\u003c/li\u003e\n\u003cli\u003eRamanujam, E., Perumal, T. and Padmavathi, S., 2021. Human activity recognition with smartphone and wearable sensors using deep learning techniques: A review. IEEE Sensors Journal, 21(12), pp.13029-13040.\u003c/li\u003e\n\u003cli\u003eReyes-Ortiz, J.L., Oneto, L., Sam\u0026agrave;, A., Parra, X. and Anguita, D., 2016. Transition-aware human activity recognition using smartphones. Neurocomputing, 171, pp.754-767.\u003c/li\u003e\n\u003cli\u003eRonao, C.A. and Cho, S.B., 2016. Human activity recognition with smartphone sensors using deep learning neural networks. Expert systems with applications, 59, pp.235-244.\u003c/li\u003e\n\u003cli\u003eSingh, T., Rustagi, S., Garg, A. 
and Vishwakarma, D.K., 2019, September. Deep Learning Framework for Single and Dyadic Human Activity Recognition. In 2019 IEEE Fifth International Conference on Multimedia Big Data (BigMM) (pp. 237-241). IEEE.\u003c/li\u003e\n\u003cli\u003eSinha, H., Awasthi, V. and Ajmera, P.K., 2020. Audio classification using braided convolutional neural networks. IET Signal Processing, 14(7), pp.448-454.\u003c/li\u003e\n\u003cli\u003eStraczkiewicz, M., James, P. and Onnela, J.P., 2021. A systematic review of smartphone-based human activity recognition methods for health research. NPJ Digital Medicine, 4(1), p.148.\u003c/li\u003e\n\u003cli\u003eSultana, J., Usha Rani, M. and Farquad, M.A.H., 2020. An extensive survey on some deep-learning applications. In Emerging Research in Data Engineering Systems and Computer Communications: Proceedings of CCODE 2019 (pp. 511-519). Singapore: Springer Singapore.\u003c/li\u003e\n\u003cli\u003eTasmin, M., Ishtiak, T., Ruman, S.U., Suhan, A.U.R.C., Islam, N.S., Jahan, S., Ahmed, S., Zulminan, M.S., Saleheen, A.R. and Rahman, R.M., 2020, August. Comparative study of classifiers on human activity recognition by different feature engineering techniques. In 2020 IEEE 10th International Conference on Intelligent Systems (IS) (pp. 93-101). IEEE. \u003c/li\u003e\n\u003cli\u003eThakur, D., Biswas, S., Ho, E.S. and Chattopadhyay, S., 2022. Convae-lstm: Convolution- al autoencoder long short-term memory network for smartphone-based human activity recognition. IEEE Access, 10, pp.4137-4156.\u003c/li\u003e\n\u003cli\u003eVoicu, R.A., Dobre, C., Bajenaru, L. and Ciobanu, R.I., 2019. Human physical activity recognition using smartphone sensors. Sensors, 19(3), p.458.\u003c/li\u003e\n\u003cli\u003eWang, H., Zhao, J., Li, J., Tian, L., Tu, P., Cao, T., An, Y., Wang, K. and Li, S., 2020. Wearable sensor-based human activity recognition using hybrid deep learning techniques. 
Security and communication Networks, 2020, pp.1-12.\u003c/li\u003e\n\u003cli\u003eWang, P., Fan, E. and Wang, P., 2021. Comparative analysis of image classification algorithms based on traditional machine learning and deep learning. Pattern Recognition Letters, 141, pp.61-67.\u003c/li\u003e\n\u003cli\u003eXia, K., Huang, J. and Wang, H., 2020. LSTM-CNN architecture for human activity recognition. IEEE Access, 8, pp.56855-56866.\u003c/li\u003e\n\u003cli\u003eXu, C., Chai, D., He, J., Zhang, X. and Duan, S., 2019. InnoHAR: A deep neural network for complex human activity recognition. Ieee Access, 7, pp.9893-9902.\u003c/li\u003e\n\u003cli\u003eYadav, S.K., Tiwari, K., Pandey, H.M. and Akbar, S.A., 2021. A review of multimodal human activity recognition with special emphasis on classification, applications, challenges and future directions. Knowledge-Based Systems, 223, p.106970.\u003c/li\u003e\n\u003cli\u003eYao, L., Sheng, Q.Z., Benatallah, B., Dustdar, S., Wang, X., Shemshadi, A. and Kanhere, S.S., 2018. WITS: an IoT-endowed computational framework for activity recognition in personalized smart homes. Computing, 100, pp.369-385.\u003c/li\u003e\n\u003cli\u003eZhang, X., Wang, L. and Su, Y., 2021. Visual place recognition: A survey from deep learning perspective. Pattern Recognition, 113,p.107760.\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"Human Activity Recognition, Deep Learning, Spatio-Temporal, Convolutional Block Attention Module, Accuracy, Validation","lastPublishedDoi":"10.21203/rs.3.rs-6536118/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-6536118/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eHuman Activity Recognition (HAR) has emerged as a critical research area in the domains of Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL) due to its extensive applications across various domains. The development of robust HAR models capable of accurately identifying human activities is growing in demand. This study aims to advance the field by introducing a novel hybrid model that integrates Convolutional Neural Networks (CNN), Attention mechanisms, and Bidirectional Long Short-Term Memory (BiLSTM) networks. This \u0026ldquo;CNN-Attention-BiLSTM\u0026rdquo; model is meticulously designed to capture both spatial and temporal features, thereby enhancing feature extraction and attentiveness. We have evaluated the proposed model using the widely recognized UCI-HAR dataset. The results demonstrate that our model achieves an impressive activity classification accuracy of 93%. To ensure the reliability and validity of our findings, we employed rigorous validation techniques, including cross-validation and detailed classification reports. 
The model successfully met these validation criteria, confirming its effectiveness and innovation.\u003c/p\u003e","manuscriptTitle":"Hybrid Deep Learning Architecture for Efficient Human Activity Recognition: A CNN-Attention-BiLSTM Framework","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-05-10 01:06:32","doi":"10.21203/rs.3.rs-6536118/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"202e937d-8c1f-4097-9be3-f4633e0e5a7e","owner":[],"postedDate":"May 10th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[],"tags":[],"updatedAt":"2025-06-26T08:54:06+00:00","versionOfRecord":[],"versionCreatedAt":"2025-05-10 01:06:32","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-6536118","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-6536118","identity":"rs-6536118","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}