Enhancing Network Performance Monitoring through Scalable Multi-Dimensional Metric Analysis and Pattern-Based Anomaly Detection

doi:10.21203/rs.3.rs-4914517/v1


2024 · doi:10.21203/rs.3.rs-4914517/v1

preprint OA: closed


Research Article

Salman Ahmad, Bryan Scotney, David Glass, Shuai Zhang

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-4914517/v1
This work is licensed under a CC BY 4.0 License.

Abstract

This paper presents a novel approach to network performance monitoring and improvement through profile pattern extraction-based anomaly detection in multi-dimensional network throughput metrics. As modern networks grow in complexity, traditional monitoring methods often struggle to detect subtle yet significant anomalies that can impact performance.
Our research addresses this challenge by developing an integrated framework that combines advanced data analysis techniques with machine learning algorithms to identify and interpret complex patterns in network behaviour. The proposed methodology leverages autocorrelation function (ACF) based clustering to group similar time series, and employs feature extraction methods to create profile patterns from multi-dimensional network data. These patterns serve as a baseline for normal network behaviour, against which anomalies are detected using a combination of statistical methods and machine learning, namely the Isolation Forest algorithm. Our experimental results, based on real-world data from a telecommunications network, demonstrate that the profile pattern-based approach significantly enhances anomaly detection capabilities. The best-performing model, which combines raw data and Z-scores derived from profile patterns, achieved an anomaly detection rate of 1.805% with the highest confidence (average anomaly score of -0.123). This model outperformed both raw data analysis and Z-score-only approaches in terms of selectivity and computational efficiency, completing analysis in 8.582 seconds. This research contributes to the field of network performance monitoring by offering a more sophisticated and accurate approach to anomaly detection, potentially leading to enhanced network reliability, reduced downtime, and improved user experience. The paper concludes by discussing the implications of these findings for network administrators and outlining future research directions in this rapidly evolving field.
Keywords: Network Performance, Anomaly Detection, Multi-Dimensional Data Analysis, Machine Learning, Pattern Extraction, Network Throughput Metrics

1 Introduction

In the era of digital transformation, network performance has become a critical factor in the success of businesses and organizations worldwide. Modern computer networks require accurate anomaly detection in real-time streaming data for various applications, including preventative maintenance, fraud prevention, problem detection, and monitoring. The increasing complexity and heterogeneity of these networks, along with complex patterns and real-time constraints, make detecting and diagnosing anomalies in network throughput metrics a challenging task.

Anomaly detection in network systems has long been a focus of research and practical implementation. It serves as an early warning system, alerting network administrators to potential issues that could degrade performance or indicate security threats. Anomaly detection in multivariate time series metrics is a challenging problem due to the complexity and large-scale nature of network data. Large, complex networks can face challenges in detecting anomalies due to the curse of dimensionality (Chen et al., 2020). This term describes the abundance of data originating from numerous sources, leading to diverse manifestations across various network devices. Such a situation can hinder the effective identification of anomalies. Telecommunication Internet Service Providers (ISPs) often generate multiple correlated time series, representing different aspects of network performance. These metrics provide a rich dataset that, when properly analysed, can reveal complex patterns and anomalies that may go unnoticed in simpler monitoring schemes (Z. Li et al., 2021).
The challenge lies in effectively processing and interpreting this high-dimensional data to extract meaningful insights and identify potential issues before they escalate into major problems. Anomaly detection methods must be capable of handling multivariate data, considering the correlations between dimensions to provide a more comprehensive and accurate assessment of network anomalies. Traditional anomaly detection methods often focus on datasets with a single item. However, real-world scenarios often involve multi-component datasets, where each component has its own pattern and behaviour. Developing a separate model for each component can lead to performance issues and increased complexity.

This research paper introduces a novel approach to anomaly detection in multi-dimensional network throughput metrics through profile pattern extraction. In the first step, multiple time series representing different metrics are grouped by a clustering algorithm according to the similarity of their patterns. In the second step, profile patterns are created by grouping data based on temporal features, then aggregating key metrics within these groups to identify trends and deviations. The z-score features computed from the profile patterns are then used for anomaly detection with the Isolation Forest algorithm. Thus, rather than developing a separate model for each time series, this approach places time series with similar behaviour in the same cluster and develops one model per cluster. This offers a scalable and efficient solution for anomaly detection in large-scale network environments, ultimately contributing to improved network performance and increased reliability.

The objectives of this study are threefold:
1. To develop an integrated single model-based approach for extracting profile patterns from multi-dimensional network throughput metrics.
2. To design and implement an anomaly detection mechanism based on these extracted profiles.
3. To test the proposed method on a telecommunication network monitoring case study with throughput data.

The potential impact of this research extends beyond mere academic interest. By providing network administrators with more accurate and timely information about network anomalies, this approach can contribute to proactive network management, reduced downtime, and improved user experience. Furthermore, the insights gained from profile pattern analysis can inform network optimization strategies, leading to more efficient resource allocation and enhanced network design.

2 Related Work

Telecommunication networks contain diverse data that represents various metrics of different types and scales (Bordeau-Aubert et al., 2023). When working with unsupervised techniques, it is common to handle multiple time series by either clustering them (Diez-Olivan et al., 2017) or calculating the distances between them (Benkabou et al., 2018). High dimensionality in data complicates the training process for Machine Learning Algorithms (MLAs), leading to overfitting of the model and decreased predictive performance (Anowar et al., 2022). This is due to the increased complexity of the dataset, which hinders the algorithm's ability to generalize to new data. Clustering time series data is a valuable technique for identifying underlying patterns and structures in various applications.

State-of-the-art anomaly detection algorithms proposed for heterogeneous time series data (Garg et al., 2021) include unsupervised and semi-supervised deep-learning-based methods. These methods can scale to high dimensions and model complex patterns in various domains, making them suitable for detecting anomalies in multivariate time series data. Another approach is to use robust time series decomposition, as proposed in (T. Li et al., 2020), to decouple the trend and seasonal components of the data and detect outliers in the residual.
However, the field of time series anomaly detection is constantly advancing, and several methods are available, making it a challenge to determine the most appropriate method for a specific domain (Sørbø & Ruocco, 2023).

Canizo et al. (2019) proposed a multi-head CNN-RNN architecture for anomaly detection in multi-sensor systems, a deep learning-based approach that addresses the challenge of detecting anomalies in heterogeneous data. The architecture uses independent CNNs to extract features from each sensor's data separately, allowing for a more tailored approach to each type of sensor. The RNN component processes the sensor data in a window-based manner, which allows it to focus on different phases of the sensor data. The effectiveness of this approach was demonstrated through an industrial case study involving a service elevator, where anomalies were effectively detected based on multiple sensor data. Their study reports that the approach outperforms other state-of-the-art methods for anomaly detection in multi-time series data and provides an effective solution for anomaly detection in multi-sensor systems without the need for data pre-processing.

For large internet companies, it is crucial to monitor a significant number of Key Performance Indicators (KPIs) and identify anomalies to ensure service quality and reliability. However, detecting anomalies on millions of KPIs poses significant challenges, including the extensive overhead of model selection, parameter tuning, model training, and labelling. Z. Li et al. (2018) propose a KPI clustering approach that addresses this problem: by clustering millions of KPIs into a smaller number of clusters, models can be selected and trained on a per-cluster basis. In this paper, we propose a method for anomaly detection in multi-component datasets where the patterns of components are dependent on common factors.
3 Methodology

3.1 Overview of Proposed Scalable Anomaly Detection Framework

In this section, we present our large-scale time series anomaly detection approach, which consists of two main steps. First, the multi-component time series metrics are clustered according to their underlying shapes. By identifying similar KPIs, such as the number of queries per server in a well-balanced server cluster, we can group them into a few clusters. This allows us to employ one anomaly detection model per cluster, significantly reducing the overhead mentioned above. In the second step, a simple single model-based approach is used to detect anomalies in the data. The approach incorporates temporal information into the analysis, which is not commonly done in traditional time series analysis methods. This enables us to group the data based on specific time intervals, providing valuable insights into the relationships between various KPIs and their corresponding temporal contexts. The approach detects anomalies in large-scale time series data by leveraging clustering and KPI-based pattern table extraction, as depicted in Fig. 1. The proposed approach can be extended to many application domains that share common characteristics. The details of the main steps involved in our approach are outlined in the next subsections.

3.2 Clustering of Multi-Dimensional Time Series Data

Time series data, a sequence of observations recorded at regular time intervals, presents a unique challenge for analysis because of its dynamic fluctuations. Typically, there are a vast number of Key Performance Indicators (KPIs) in a large-scale internet-based service company, and it is impossible for operators to analyse each KPI individually. By using clustering, they can analyse KPIs per cluster and create an anomaly detection model for each cluster (Z. Li et al., 2018). This significantly reduces modelling costs and improves efficiency.
Clustering such data involves grouping similar time series based on certain characteristics. One efficient way to extract features from time series data for clustering is by using the Autocorrelation Function (ACF). Clustering is a type of unsupervised learning that involves splitting a collection of unlabelled data objects into clusters or groups that are similar to each other. A number of anomaly detection methods have been proposed, e.g. (Xu et al., 2018) and (Laptev et al., 2015). However, these approaches often assume that an individual model is needed for each KPI. This poses a significant challenge for large-scale anomaly detection involving thousands to millions of KPIs: the overhead involved in developing multiple models, parameter tuning, model training, and anomaly labelling becomes a major obstacle. Fortunately, many KPIs share common characteristics and associations.

The autocorrelation function (ACF) measures the linear dependence between data points in a time series at different time lags. It provides the correlation between the series and its lagged values. In this context, the ACF is used to transform the time series data into a form that can be effectively clustered (Yakubu & Saputra, 2022). To compute ACF estimates, we determined the degree of autocorrelation between data points separated by varying time lags. The result is a feature vector representing the temporal structure of the time series. Analysing clusters instead of individual series reduces the complexity of the data and offers a concise overview of its overall structure. We clustered the time series based on similarities in their autocorrelation patterns using K-means clustering. The K-means algorithm is a popular method for partitioning a dataset into a specified number of clusters. It works by assigning each data point to the cluster whose centroid is nearest.
The centroids are then updated based on the new assignments, and the process is repeated until the assignments no longer change (Ferencz et al., 2022). However, the number of clusters needs to be specified in advance, and has to be estimated using other methods (Xia et al., 2015). In our approach, we obtain an estimate of the number of clusters by analysing the patterns in the autocorrelation plot. We first calculated the ACF for each time series, then used those ACF values as features for clustering. By using the ACF to transform the data and the K-means algorithm to cluster it, our approach can reveal patterns in the data that may not be apparent from the raw time series alone.

3.3 Context-Aware Based Pattern Construction for Multi-Dimensional Metrics

In the previous section we explored how to cluster multiple time series with similar behaviour into groups. In this section we detail the creation of the pattern table based on temporal contextual features. Data is segmented based on contextual factors such as time of day and day of the week. Each grouping represents a unique combination of time-related attributes (e.g., Monday at 10 AM, Sunday at 3 PM, etc.). This segmentation is crucial as it acknowledges that data behaviour can vary significantly depending on these contextual elements, which is often overlooked in simpler models. KPIs exhibit regular patterns based on time, making deviations potentially indicative of issues or extraordinary events. For each group, statistical measures are calculated for each KPI. The typical measures are the mean (average) and the standard deviation. The mean provides an average value of the KPI for that specific time segment, offering insight into what is typical or expected during that period. The standard deviation measures the amount of variation or dispersion of the KPI values from the mean. The results of these aggregations are compiled into a pattern table.
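The aggregation described above can be sketched with pandas. This is a minimal illustration rather than the authors' implementation: the function name `build_pattern_table`, the segment column names, the KPI name `TS_0`, and the synthetic data are all assumptions, and the weekday, hour, and 30-minute segmentation is taken from the experimental setup in Section 4.

```python
import numpy as np
import pandas as pd


def build_pattern_table(df: pd.DataFrame, kpi_cols) -> pd.DataFrame:
    """Mean and standard deviation of each KPI per (weekday, hour, half-hour) segment."""
    keyed = df.assign(
        weekday=df.index.dayofweek,        # 0 = Monday
        hour=df.index.hour,
        half_hour=df.index.minute // 30,   # 0 for :00-:29, 1 for :30-:59
    )
    table = keyed.groupby(["weekday", "hour", "half_hour"])[list(kpi_cols)].agg(["mean", "std"])
    # Flatten the (kpi, statistic) column pairs into names such as "TS_0_mean".
    table.columns = [f"{kpi}_{stat}" for kpi, stat in table.columns]
    return table.reset_index()


# Synthetic one-minute data for one week (hypothetical KPI with a daily cycle).
index = pd.date_range("2019-05-10", periods=7 * 1440, freq="min")
rng = np.random.default_rng(0)
minute_of_day = np.asarray(index.hour * 60 + index.minute)
values = 100 + 20 * np.sin(2 * np.pi * minute_of_day / 1440) + rng.normal(0, 2, index.size)
df = pd.DataFrame({"TS_0": values}, index=index)

pattern = build_pattern_table(df, ["TS_0"])
print(pattern.head())
```

With one week of one-minute data, each of the 336 segments (7 weekdays, 24 hours, 2 half-hours) is summarised from 30 observations; a longer training window would give each segment more support.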
This table serves as a reference model, capturing the normal behaviour patterns of KPIs across different temporal contexts. Each row in the pattern table corresponds to a unique combination of temporal attributes and includes the aggregated statistics (mean and standard deviation) for each KPI.

3.4 Z-score Based Features and Anomaly Detection

In the final step, after constructing the pattern table, the pattern table is merged with the original data, and z-scores are calculated for each KPI. This means that the deviation of each KPI value from its own pattern is estimated by considering the factors that determine the pattern. Each z-score expresses how many standard deviations a KPI value lies from the mean of its temporal segment. The Isolation Forest algorithm is an anomaly detection method that identifies anomalies based on the concept of isolation without employing any distance or density measure. This approach is fundamentally different from most existing methods (Madhukar Rao & Ramesh, 2021). Feeding z-score features to the Isolation Forest algorithm can enhance its anomaly detection capabilities. Therefore, the z-scores are utilized as input features for the Isolation Forest model. The model operates under the principle of isolating anomalies instead of constructing a profile of normal instances. This method is well suited to our context due to its efficiency in handling a high-dimensional feature space. It leverages the inherent property of anomalies being few and distinct, thus efficiently segregating them from normal observations. When these z-score features are fed into the Isolation Forest algorithm, the algorithm can more effectively isolate anomalies.
This is because the z-score transformation helps to highlight those data points that are significantly different from the mean, which are likely to be the anomalies that the Isolation Forest algorithm is designed to detect.

4 Experimental Setup

This section details the experimental setting, the preprocessing, and the baseline approaches used. The experiments use the same type of data as our earlier analysis. Exploratory data analysis showed that the data needs to be pre-processed before it can be used for analysis. The following subsections detail the pre-processing steps performed on the data.

4.1 Data Collection from Real Network Infrastructure (Core Peering Router)

A real-world dataset was collected from British Telecommunication (BT) network components. The data is captured from 85 BT internet peering router interfaces at a single location. Each interface provides four time series: input data rate, output data rate, input packet rate, and output packet rate. The time-series data for each interface of the router consists of input and output throughput rates covering almost 12 days, between 09 May and 21 May 2019, with observations timestamped approximately 30 seconds apart. Anomaly detection techniques can be applied at different locations in a telecommunication network, such as the customer access layer (e.g., broadband time series data) and the core network layer (e.g., a core peering router). The underlying issues are primarily potential network faults, connection issues, spikes in input traffic, and other network configuration issues that should raise an alarm for timely recovery. These metrics are extracted from an operational system and contain time series of key metrics such as OPRATE, which represents the output packet rate over time. The system metrics generated are of a diverse nature and require a domain expert to label anomalies.
To understand its basic behaviour, Fig. 2 plots the output packet rate on the y-axis against datetime on the x-axis. The experiments in this section analyse the data of a single interface, which consists of 4 metrics representing the input and output throughput. Because the core network peering router generates fairly regular data at high speed, it is challenging to predict where a time series deviates significantly from normal behaviour. Fig. 3 and Fig. 4 both show the typical pattern of network time series metrics. The figures illustrate that the time series data for each interface starts on 09 May and ends on 21 May 2019. Every observation has been recorded with a time stamp, spaced approximately 30 seconds apart. The daily patterns of each component exhibit similarities, with the main difference being their magnitude.

4.2 Data Component Selection and Filtering

Organizing the data by component type allows for a clear understanding of the different elements within the dataset. Any number of components can be selected for analysis. For this experiment, we selected 20 of the 85 components to make the results easier to interpret. We grouped the data by day and generated a bar chart of the number of data points per day. The chart shows that there are 83 observations on the first day and 1890 observations on the last day. To avoid bias in the anomaly detection analysis, we exclude these two days. For each component we selected 10 days of data and dropped the observations on 2019-05-09 and 2019-05-20. Upon examining a substantial number of metric streams, it becomes apparent that despite the variety of metrics, many share common characteristics due to their inherent associations and similarities.
The heterogeneous time series metrics are clustered into groups based on their patterns, using the autocorrelation-based clustering method described in Section 3.2.

4.3 Resampling

The raw data consisted of measurements from different components recorded at irregular intervals. The interval between consecutive measurements is not fixed: it is tightly centred around 30 seconds and fluctuates evenly between 29, 30 and 31 seconds. The initial step in our resampling process involved grouping the data by 'Component'. For each component, we rounded the index to the nearest minute to align the data points to a standardized time grid. This step is crucial to avoid any misalignment issues that could arise from data points being recorded at varying seconds within each minute. Subsequently, we resampled the data to a one-minute frequency using the mean as the aggregation function, that is, by averaging the data points that fell within each one-minute interval. Resampling by averaging is a common technique to reduce the noise and variability in the data, providing a smoother representation of the underlying trends. Details of the dataset after preprocessing are presented in Table 1.

Table 1 Data Component details after preprocessing (Resampling and Filtering)
Number of Components: 20
Start Date: 2019-05-10
End Date: 2019-05-19
Total No of Days: 10
Total Datapoints: 14,400
Frequency: 1 minute
Data Points per Hour: 60
Data Points per Day: 1440
Total number of time series: 80

4.4 Train/Test Split

In the experiments, the dataset is first divided into groups by clustering. Each of these groups is then individually subjected to the train-test split. For example, the first cluster of time series metrics has data available for a duration of 10 days, as can be seen in Fig. 6.
The first week of each group's data is used for training, and the last three days are used for testing. This approach ensures that each group's unique characteristics are represented in both the training and testing phases, allowing for a more robust evaluation of the model across different segments of the data. Of the 10 days of data available after preprocessing, this gives 7 days for training and 3 days for testing.

Profile Pattern Creation

After grouping similar metric streams into several clusters, we constructed a pattern table for each cluster to capture patterns across time segments. The table groups the data of each cluster by the combination of weekday, hour, and 30-minute interval, and aggregates the mean and standard deviation for each KPI within these segments. This design reflects the assumption that the time series metrics often exhibit temporal patterns based on day of the week and time of day.

Isolation Forest Training and Anomaly Detection

An iForest model was trained on the engineered features extracted from the training set. The model is an ensemble of randomised trees, which provides robust anomaly scoring. The contamination parameter plays a crucial role in the detection process. It represents the assumed level of contamination, or the proportion of outliers in the dataset. When fitting the model, this parameter is used to determine the threshold on the scores of the samples. Manual adjustment of the contamination parameter is necessary to achieve an optimal fit for the specific dataset and the desired outcome. The parameters used for the Isolation Forest model and the feature sets are presented in Table 2.
Table 2 Parameters and Feature Sets Used in the Isolation Forest Algorithm for Anomaly Detection

Parameters
No. | Parameter | Description | Value
1 | max_samples | Number of samples to train each estimator | 1000
2 | random_state | Seed for random number generator | 0
3 | contamination | Proportion of outliers in the dataset | 0.025
4 | max_features | Number of features for each estimator | 1.0
5 | n_estimators | Number of base estimators in the ensemble | 1000
6 | bootstrap | Whether samples are drawn with replacement | False
7 | verbose | Verbosity of the process | 0
8 | n_jobs | Number of cores for parallel processing | -1

Feature Sets
Set | Model | Features
Set 0 | Model 1 | [TS_0]
Set 1 | Model 2 | [TS_0, TS_0_zscore]
Set 2 | Model 3 | [TS_0_zscore]

Isolation Forest detects anomalies by considering their distinctiveness or deviation from normal data points: an anomaly has fewer neighbouring instances around it than a normal data point and is therefore easier to isolate. The decision_function method computes a score for every data point in the dataset, which can be used to identify and rank outliers. The underlying anomaly score indicates the level of abnormality or uniqueness of a data point and is determined by the average path length needed to isolate the data point within the Isolation Forest structure. A shorter average path length corresponds to a higher anomaly score, indicating that the data point is more likely to be an outlier. In scikit-learn, whose parameter names Table 2 follows, decision_function reports this score on an inverted and shifted scale, so that lower (negative) values mark outliers; this is the convention used for the average anomaly scores reported below.

We constructed three models using different feature sets. The first model uses only the raw values of the original time series. The second model uses both the raw data and the z-scores, to exploit both inherent patterns and standardized deviations. The third model uses only the standardized feature values, that is, the z-scores derived from the pattern table. These three models provide different perspectives on and insights into the anomaly detection process.
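The setup of this section can be sketched end to end with scikit-learn. The parameter values and feature sets are those listed in Table 2; everything else is an assumption made for illustration: the helper names (`with_segment_keys`, `add_zscore`, `fit_and_score`), the segment column names, the synthetic KPI standing in for the BT data, the two injected spikes, and the reduced number of trees used only to keep the demo fast.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Parameter values as listed in Table 2.
TABLE2_PARAMS = dict(
    max_samples=1000, random_state=0, contamination=0.025, max_features=1.0,
    n_estimators=1000, bootstrap=False, verbose=0, n_jobs=-1,
)
# Feature sets as listed in Table 2.
FEATURE_SETS = {
    "Model 1": ["TS_0"],
    "Model 2": ["TS_0", "TS_0_zscore"],
    "Model 3": ["TS_0_zscore"],
}
SEGMENT = ["weekday", "hour", "half_hour"]  # hypothetical column names


def with_segment_keys(df: pd.DataFrame) -> pd.DataFrame:
    """Attach the weekday / hour / 30-minute keys used for the pattern table."""
    return df.assign(
        weekday=df.index.dayofweek,
        hour=df.index.hour,
        half_hour=df.index.minute // 30,
    )


def add_zscore(df: pd.DataFrame, pattern: pd.DataFrame, kpi: str = "TS_0") -> pd.DataFrame:
    """Merge the segment mean/std onto the data and compute the z-score."""
    merged = with_segment_keys(df).join(pattern, on=SEGMENT)
    merged[f"{kpi}_zscore"] = (merged[kpi] - merged["mean"]) / merged["std"]
    return merged


def fit_and_score(train, test, features, params):
    model = IsolationForest(**params).fit(train[features])
    scores = model.decision_function(test[features])  # negative => flagged as anomaly
    flags = model.predict(test[features]) == -1
    return scores, flags


# Synthetic stand-in for one resampled KPI: 10 days at one-minute frequency.
index = pd.date_range("2019-05-10", periods=10 * 1440, freq="min")
rng = np.random.default_rng(0)
minute_of_day = np.asarray(index.hour * 60 + index.minute)
values = 100 + 20 * np.sin(2 * np.pi * minute_of_day / 1440) + rng.normal(0, 2, index.size)
data = pd.DataFrame({"TS_0": values}, index=index)
spike_positions = [8 * 1440 + 300, 9 * 1440 + 900]  # injected anomalies in the test period
data.iloc[spike_positions, 0] += 50

# First 7 days for training, last 3 days for testing.
train, test = data.iloc[: 7 * 1440], data.iloc[7 * 1440:]
pattern = with_segment_keys(train).groupby(SEGMENT)["TS_0"].agg(["mean", "std"])
train_z, test_z = add_zscore(train, pattern), add_zscore(test, pattern)

demo_params = {**TABLE2_PARAMS, "n_estimators": 100}  # fewer trees, only to keep the demo fast
results = {
    name: fit_and_score(train_z, test_z, cols, demo_params)
    for name, cols in FEATURE_SETS.items()
}
for name, (scores, flags) in results.items():
    print(name, f"flagged {100 * flags.mean():.2f}% of test points")
```

Note that the pattern table is built from the training period only and then applied to the test period; whether the authors computed it this way is not stated, so this is a design choice of the sketch.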
5 Results and Discussion

The heterogeneous data from multiple components containing throughput metrics are integrated into one data frame for analysis through clustering. The multiple time series may originate from different processes and exhibit distinct autocorrelation structures. We verified this by analysing their autocorrelation structure. We computed the ACF for each time series with a lag of 20. By measuring the degree of correlation between an observation and its lagged versions, the ACF reveals the temporal dependencies within the time series. Thus, time series with similar temporal structures will have similar ACFs. Figure 7 displays the autocorrelation functions (ACFs) of the 80 time series. The horizontal axis represents the lag, while the vertical axis shows the correlation value for each lag. The ACF plot suggests that a suitable number of clusters for our dataset can be determined by analysing the underlying structures in the ACFs. By using these ACF feature vectors as the basis for K-means clustering, we can group time series not by their raw values, but by the structure and dependencies in the data. This enables us to identify clusters of time series with similar temporal dynamics. The experiment supported the hypothesis that the ACF is a robust feature for clustering time series, capturing the essential dynamics of each series. In our context, each observation is a time series represented by its ACF feature vector. It can be seen from Fig. 7 that the number of clusters can be chosen visually as 4, as there are 4 distinct structures. We employed the K-means algorithm with four clusters to separate them computationally. The algorithm partitions the dataset into 4 clusters, where each observation belongs to the cluster with the nearest mean. The K-means algorithm assigns a label to each time series indicating the cluster it belongs to.
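The clustering step can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the ACF is computed directly with NumPy for lags 1 to 20 (lag 0 is always 1 and is dropped here, which the paper does not specify), the helper names are hypothetical, and the demo uses two synthetic families of series rather than the 80 BT series and four clusters.

```python
import numpy as np
from sklearn.cluster import KMeans


def acf_features(series, nlags: int = 20) -> np.ndarray:
    """Sample autocorrelation at lags 1..nlags for a non-constant series."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[:-k], x[k:]) / denom for k in range(1, nlags + 1)])


def cluster_by_acf(series_list, n_clusters: int = 4, nlags: int = 20, seed: int = 0):
    """Cluster series on their ACF feature vectors rather than on raw values."""
    features = np.vstack([acf_features(s, nlags) for s in series_list])
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return model.fit_predict(features), features


# Synthetic demo: periodic series versus white noise.
rng = np.random.default_rng(0)
t = np.arange(1440)
periodic = [np.sin(2 * np.pi * t / 60) + 0.1 * rng.standard_normal(t.size) for _ in range(5)]
noise = [rng.standard_normal(t.size) for _ in range(5)]

labels, features = cluster_by_acf(periodic + noise, n_clusters=2)
print(labels)
```

Because the features are correlations, series that differ only in magnitude receive near-identical feature vectors, which matches the observation in Section 4.1 that components share daily patterns and differ mainly in scale.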
These labels are then used to group and display the time series data by cluster. For example, Fig. 8 provides a visual representation of the inherent patterns within the first cluster, which contains the ACFs of 51 time series. From Fig. 8 to Fig. 11, it is evident that each cluster predominantly consists of time series from the same generating process, indicating the utility of the ACF in time series clustering. By plotting the ACF for each cluster, we can visually confirm the internal coherence of each cluster: time series within the same cluster exhibit similar ACF patterns, indicating similar temporal dynamics. This helps to validate the effectiveness of the clustering approach and provides insight into the fundamental characteristics of each cluster, which can be valuable for further analysis of the data. Clustering these feature vectors using K-means enables us to group time series based on the similarity of their temporal structures rather than their raw values, making this a powerful method for understanding the underlying dynamics of the data.

The pattern table successfully grouped KPI data into time segments based on weekday, hour, and 30-minute intervals. For each KPI, means and standard deviations were calculated within each time segment, providing a summary of typical behaviour during those periods. For example, Fig. 12 shows the normal pattern for a single time series in group 0. The analysis revealed diverse temporal patterns across KPIs. For example, the KPIs in group 3 (TS_58, TS_19, TS_43, TS_78) exhibited slightly different trends across weekdays and hours, with lower mean values during certain periods. These findings suggest the influence of day-to-day and hourly routines on these KPIs.
Other KPIs (TS_18, TS_50) in the second group (g1) showed minimal variation across time segments, indicating relative independence from weekday and hourly fluctuations. Their behaviour may instead be driven by factors such as external events or specific user activities that the time-based grouping does not capture.

Z-scores were calculated for each KPI so that individual data points could be compared with the temporal pattern of their time segment. Points that deviate significantly from the expected mean, relative to the standard deviation of their segment, may indicate unusual activity or potential anomalies that warrant further investigation.

We used the time-based Z-score features and the raw data as inputs to the Isolation Forest (iForest) algorithm. As described in the methodology, three models were constructed using different feature sets to examine how effectively iForest captures anomalies with each one. The distributions of anomaly scores obtained from the models are illustrated in Fig. 13, where we examined the shape and range of the scores for each feature set. Comparing the score distributions between models allows us to evaluate the relative contribution of each feature set to anomaly detection. Noticeable differences among the distributions indicate that the feature sets capture different aspects of the anomalies, since each model assigns anomaly scores according to the features it uses. A model with a wider or more skewed score distribution may be better able to distinguish between normal and anomalous data points, whereas a model whose scores are concentrated in a narrow range may be more conservative or less sensitive.

Fig. 14 depicts the test time series with detected anomalies in red. Fig. 14a shows the results of the first model, in which iForest was fitted on the raw data only. The outcome is relatively simplistic: the model primarily flags the highest extremes. Fig. 14b shows the results of the model fitted on two features (the raw value and the pattern-table-based Z-score). This model produces results that agree better with a somewhat subjective judgement of which anomalies are interesting. Finally, Fig. 14c shows the model fitted exclusively on TS_0_zscore, the Z-score derived from the pattern table.

Table 3 Comparison of Anomaly Detection Models Using Different Feature Sets for Network Throughput Metrics

Model                          Anomaly Detection Rate (%)   Average Anomaly Score   Score Distribution Skewness   Computational Time (s)
Model 1 (TS_0)                 2.522                        -0.091                  2.282                         9.046
Model 2 (TS_0, TS_0_z_score)   1.805                        -0.123                  1.824                         8.582
Model 3 (TS_0_z_score)         2.198                        -0.122                  3.247                         9.182

Table 3 shows that the percentage of data points classified as anomalies varies across the models. Model 1 (raw data) has the highest anomaly detection rate (2.522%), compared with 1.805% for Model 2 and 2.198% for Model 3. This suggests that using raw data alone may be overly sensitive, potentially flagging normal network fluctuations as anomalies. Models 2 and 3, which incorporate Z-scores, are more selective, which is consistent with the paper's goal of detecting subtle yet significant anomalies in network performance.

Models 2 and 3 also have lower (more negative) average anomaly scores, -0.123 and -0.122 respectively, against -0.091 for Model 1. This may indicate that these models are more confident in their anomaly classifications. The similarity between Models 2 and 3 suggests that the Z-score feature drives this difference, in line with this paper's emphasis on profile pattern extraction for more accurate anomaly detection.
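The three-model comparison summarised in Table 3 can be sketched as follows. This is a minimal sketch under several assumptions that the paper does not state: scikit-learn's `IsolationForest` with `contamination="auto"`, scores taken from `decision_function` (negative values are flagged as anomalies), models fitted and scored on the same synthetic data, and a hypothetical `skewness` helper. The numbers it prints are not expected to match Table 3.

```python
import time

import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-ins for the raw KPI and its pattern-table Z-score.
rng = np.random.default_rng(2)
n = 5000
raw = 10 + rng.normal(size=n)
zscore = rng.normal(size=n)
anomalous = rng.choice(n, size=50, replace=False)
raw[anomalous] += 6      # injected anomalies, for illustration only
zscore[anomalous] += 6

feature_sets = {
    "Model 1 (TS_0)": np.column_stack([raw]),
    "Model 2 (TS_0, TS_0_zscore)": np.column_stack([raw, zscore]),
    "Model 3 (TS_0_zscore)": np.column_stack([zscore]),
}


def skewness(a):
    """Population skewness: third central moment over variance to the power 1.5."""
    centred = np.asarray(a, dtype=float) - np.mean(a)
    return float(np.mean(centred**3) / np.mean(centred**2) ** 1.5)


results = {}
for name, X in feature_sets.items():
    start = time.perf_counter()
    model = IsolationForest(n_estimators=100, contamination="auto", random_state=0).fit(X)
    scores = model.decision_function(X)  # lower means more anomalous
    flags = model.predict(X) == -1       # True where the point is flagged
    results[name] = {
        "detection_rate_pct": 100 * flags.mean(),
        "mean_score": float(scores.mean()),
        "score_skewness": skewness(scores),
        "time_s": time.perf_counter() - start,
        "scores": scores,
    }

for name, r in results.items():
    print(name, round(r["detection_rate_pct"], 3), round(r["mean_score"], 3),
          round(r["score_skewness"], 3), round(r["time_s"], 3))
```

The sign and scale of the average score depend on which scoring function and offset are used, so comparisons between models are only meaningful when all of them are scored the same way.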
Model 3, which uses only Z-scores, shows the highest positive skewness (3.247). This suggests that it may be the most effective at separating normal from anomalous network behaviour, giving a clearer distinction between regular operation and potential issues. The lower skewness of Model 2 (1.824) might indicate a more balanced approach, potentially reducing extreme classifications.

6 Conclusion

In this paper, we proposed a novel approach to anomaly detection in multi-dimensional network throughput metrics through profile pattern extraction. Time series with similar behavioural patterns are grouped using an autocorrelation-based clustering method. Using a real-world dataset, we demonstrated an efficient method for anomaly detection that combines clustering with pattern-based features, and showed the effectiveness of combining profile patterns with anomaly detection algorithms. For anomaly detection, we employed Isolation Forest as the outlier detection model. Isolation Forest detected anomalies in diverse time series data, even with varying feature sets. The combination of iForest with the pattern-table-based Z-score performed better than raw features alone. The results show that the approach offers a simpler solution for detecting anomalies in multi-component metrics using simpler models. It provides a single-model-based approach capable of handling multiple metrics whose components may have different patterns. These results support the effectiveness of the profile pattern extraction approach for anomaly detection in multi-dimensional network throughput metrics. The models incorporating Z-scores (Models 2 and 3) show improvements in anomaly detection confidence and characterisation, with minimal computational overhead. In addition, the approach is scalable and adaptable to other use cases through the construction of more relevant features.
However, it is important to acknowledge the limitations of our study. The approach assumes that each cluster depends only on inferred factors (e.g. hour, weekday, weekend). Also, for detecting anomalies in multivariate data, additional features such as the correlation between features could be considered alongside the multiple raw series. The current analysis builds the pattern table from the combination of weekday, hour and 30-minute interval; further investigation could use different granularities of time segment when constructing the pattern table and calculating the Z-score. The overall approach presented in this paper can be extended to detect anomalies in multivariate data. The Isolation Forest model is particularly effective in detecting anomalies, especially for high-dimensional data.

Abbreviations

KPI: Key Performance Indicator
RNN: Recurrent Neural Network
CNN: Convolutional Neural Network
ACF: Autocorrelation Function
STL: Seasonal-Trend decomposition
ADF: Augmented Dickey–Fuller
FFT: Fast Fourier Transform
BTIIC: British Telecom Ireland Innovation Centre
AE: Autoencoder

Declarations

Ethics approval and consent to participate: Not applicable.

Funding: The authors have received no financial support for carrying out this research study.

Conflict of interest/Competing interests: The authors have no conflicts of interest to declare. All co-authors have seen and agreed with the contents of the manuscript. We certify that the submission is original work and is not under review at any other publication.

Consent for publication: Not applicable.

Availability of data and materials: Owing to British Telecom (BT) confidentiality policy, the real data collected from the core network cannot be provided. Source code can be provided once the paper is accepted.

Authors' contributions: S.A. conceived the idea of the proposed research study and designed a conceptual framework. In addition, S.A. carried out the research, from identifying the research gap in the existing literature to developing a scalable framework. S.A., B.S. and D.G. conducted formal analysis and wrote the original manuscript. D.Y. captured the real data from the telecommunication network routers. D.G. investigated the experimental results. B.S., D.G. and S.Z. supervised the overall research and reviewed the final manuscript.

Acknowledgement

This research is supported by the BTIIC (British Telecom Ireland Innovation Centre) project, funded by British Telecom and Invest Northern Ireland. In addition, I am very thankful to Prof Bryan Scotney for his immense support and encouragement to carry out this research.

References

Anowar, F., Sadaoui, S., & Dalal, H. (2022). Clustering Quality of a High-dimensional Service Monitoring Time-series Dataset. International Conference on Agents and Artificial Intelligence.

Benkabou, S. E., Benabdeslem, K., & Canitia, B. (2018). Unsupervised outlier detection for time series by entropy and dynamic time warping. Knowledge and Information Systems, 54(2), 463–486. https://doi.org/10.1007/s10115-017-1067-8

Bordeau-Aubert, K., Whatley, J., Nadeau, S., Glatard, T., & Jaumard, B. (2023). Classification of Anomalies in Telecommunication Network KPI Time Series. https://arxiv.org/abs/2308.16279v1

Canizo, M., Triguero, I., Conde, A., & Onieva, E. (2019). Multi-head CNN–RNN for multi-time series anomaly detection: An industrial case study. Neurocomputing, 363, 246–260. https://doi.org/10.1016/j.neucom.2019.07.034

Chen, F., Garrett, J., Zacks, D. N., & Yashin, V. (2020). Method and System for Anomaly Detection in Large-Scale Networks.

Diez-Olivan, A., Pagan, J. A., Sanz, R., & Sierra, B. (2017). Data-driven prognostics using a combination of constrained K-means clustering, fuzzy modeling and LOF-based score. Neurocomputing, 241, 97–107. https://doi.org/10.1016/J.NEUCOM.2017.02.024

Ferencz, K., Domokos, J., & Kovács, L. (2022). Analysis of time series data for anomaly detection. 2022 IEEE 22nd International Symposium on Computational Intelligence and Informatics and 8th IEEE International Conference on Recent Achievements in Mechatronics, Automation, Computer Science and Robotics (CINTI-MACRo), 95–100. https://api.semanticscholar.org/CorpusID:256589477

Garg, A., Zhang, W., Samaran, J., Savitha, R., & Foo, C.-S. (2021). An evaluation of anomaly detection and diagnosis in multivariate time series. IEEE Transactions on Neural Networks and Learning Systems, 33(6), 2508–2517.

Laptev, N. P., Amizadeh, S., & Flint, I. (2015). Generic and Scalable Framework for Automated Time-series Anomaly Detection. Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Li, T., Geng, Y., & Jiang, H. (2020). Anomaly Detection on Seasonal Metrics via Robust Time Series Decomposition. arXiv. https://arxiv.org/abs/2008.09245

Li, Z., Zhao, Y., Han, J., Su, Y., Jiao, R., Wen, X., & Pei, D. (2021). Multivariate Time Series Anomaly Detection and Interpretation using Hierarchical Inter-Metric and Temporal Embedding. Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining.

Li, Z., Zhao, Y., Liu, R., & Pei, D. (2018). Robust and Rapid Clustering of KPIs for Large-Scale Anomaly Detection. 2018 IEEE/ACM 26th International Symposium on Quality of Service (IWQoS), 1–10. https://doi.org/10.1109/IWQoS.2018.8624168

Madhukar Rao, G., & Ramesh, D. (2021). A Hybrid and Improved Isolation Forest Algorithm for Anomaly Detection. In V. K. Gunjan & J. M. Zurada (Eds.), Proceedings of International Conference on Recent Trends in Machine Learning, IoT, Smart Cities and Applications (pp. 589–598). Springer Singapore.

Sørbø, S., & Ruocco, M. (2023). Navigating the Metric Maze: A Taxonomy of Evaluation Metrics for Anomaly Detection in Time Series. https://arxiv.org/abs/2303.01272

Xia, H., Chen, B., Fan, J., Li, Z., & Gao, D. (2015). Mining Time Series Data with Two Dimensional Fuzzy Pattern Rules. https://api.semanticscholar.org/CorpusID:118498288

Xu, H., Chen, W., Zhao, N., Li, Z. Z., Bu, J., Li, Z. Z., Liu, Y., Zhao, Y., Pei, D., Feng, Y., Chen, J., Wang, Z., & Qiao, H. (2018). Unsupervised Anomaly Detection via Variational Auto-Encoder for Seasonal KPIs in Web Applications. The Web Conference 2018 - Proceedings of the World Wide Web Conference, WWW 2018, 187–196. https://doi.org/10.1145/3178876.3185996

Yakubu, U. A., & Saputra, M. P. A. (2022). Time Series Model Analysis Using Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) for E-wallet Transactions during a Pandemic. International Journal of Global Operations Research. https://api.semanticscholar.org/CorpusID:251462691

Additional Declarations

No competing interests reported.
Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-4914517","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":352858167,"identity":"773f330a-cb35-47e0-ae20-c41795bfdef3","order_by":0,"name":"Salman Ahmad","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAA7ElEQVRIie2RsQrCMBCGTwpxOe0aqegrnKNQ9FUiQieLPoG4OfgI+hCK4JwScNI9g4MuTh0cK4IadVAQg24O+UiW4z7uvwTA4fhHuLloDssPcvJZlh/aX5RyEaXpEz8oYYWLLxV/PGodUugjK6WUHLOwC3m19XBlGbJZz0sTUMiCDikUUX2AEXmoLcF0PA0KZ/lQQCgC6ICHh89GVcfzE96DrSjJxIXAT+0K6XgRoOlhHEmikAT8NsUSrGaU+n0XjHoKozYxvqdkYlm/YoJp82LN6lDNdlnYIN9v77bp0rL+GwzsH+lwOByOL7gCsI1L6/IwtVgAAAAASUVORK5CYII=","orcid":"","institution":"University of Ulster","correspondingAuthor":true,"prefix":"","firstName":"Salman","middleName":"","lastName":"Ahmad","suffix":""},{"id":352858168,"identity":"eec73bd1-c04a-491b-b36d-f5fa47f23c86","order_by":1,"name":"Bryan Scotney","email":"","orcid":"","institution":"University of Ulster","correspondingAuthor":false,"prefix":"","firstName":"Bryan","middleName":"","lastName":"Scotney","suffix":""},{"id":352858170,"identity":"343d1b25-40e7-498f-a4f7-f289afc4d722","order_by":2,"name":"David Glass","email":"","orcid":"","institution":"University of Ulster","correspondingAuthor":false,"prefix":"","firstName":"David","middleName":"","lastName":"Glass","suffix":""},{"id":352858172,"identity":"ad6e3cfd-6289-40ad-88a0-e8608ed231ff","order_by":3,"name":"Shuai Zhang","email":"","orcid":"","institution":"University of 
Ulster","correspondingAuthor":false,"prefix":"","firstName":"Shuai","middleName":"","lastName":"Zhang","suffix":""}],"badges":[],"createdAt":"2024-08-14 14:52:28","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-4914517/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-4914517/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":64635184,"identity":"09119e82-3f05-467b-8a04-02b9b43cde50","added_by":"auto","created_at":"2024-09-16 22:45:28","extension":"jpeg","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":43282,"visible":true,"origin":"","legend":"\u003cp\u003eHigh level Flowchart of our approach for Detecting Anomalies\u003c/p\u003e","description":"","filename":"floatimage1.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-4914517/v1/0aa7accfa805f9c673618653.jpeg"},{"id":64635182,"identity":"84c7180b-fbed-45be-8f11-2a6da111a861","added_by":"auto","created_at":"2024-09-16 22:45:28","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":84797,"visible":true,"origin":"","legend":"\u003cp\u003eVisual illustration of the weekday and weekend patterns\u003c/p\u003e","description":"","filename":"floatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-4914517/v1/83bdf4ac3da7734a67a1de06.png"},{"id":64635948,"identity":"6b3124f1-a449-41b9-ae56-e2a36ea0bfe4","added_by":"auto","created_at":"2024-09-16 23:09:28","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":83516,"visible":true,"origin":"","legend":"\u003cp\u003eTime Series Plot of Component 1 Network Time Series Metrics\u003c/p\u003e","description":"","filename":"floatimage3.png","url":"https://assets-eu.researchsquare.com/files/rs-4914517/v1/92af46c4c8eb640e7f3d0a20.png"},{"id":64635190,"identity":"e3212666-785d-4408-ac84-087ff7473b6e","added_by":"auto","created_at":"2024-09-16 
22:45:28","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":93887,"visible":true,"origin":"","legend":"\u003cp\u003eTime Series Plot of Component 2 Network Time Series Metrics\u003c/p\u003e","description":"","filename":"floatimage4.png","url":"https://assets-eu.researchsquare.com/files/rs-4914517/v1/4565024d6729cbf054ca6068.png"},{"id":64635284,"identity":"76e6fd6c-0f37-4181-9de3-102c87516016","added_by":"auto","created_at":"2024-09-16 22:53:28","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":21513,"visible":true,"origin":"","legend":"\u003cp\u003eNumber of Datapoints Each Day\u003c/p\u003e","description":"","filename":"floatimage5.png","url":"https://assets-eu.researchsquare.com/files/rs-4914517/v1/db748324244cf22334a745c2.png"},{"id":64635780,"identity":"5c3451fc-39e0-4aff-b4a7-f3b8001a7a47","added_by":"auto","created_at":"2024-09-16 23:01:28","extension":"png","order_by":6,"title":"Figure 6","display":"","copyAsset":false,"role":"figure","size":63276,"visible":true,"origin":"","legend":"\u003cp\u003eTypical pattern of one of the metrics from Cluster 0\u003c/p\u003e","description":"","filename":"floatimage6.png","url":"https://assets-eu.researchsquare.com/files/rs-4914517/v1/fa92f76cb99dc248b41368a8.png"},{"id":64635782,"identity":"5c0e1bd1-4e17-495c-845e-f7a49617c3d2","added_by":"auto","created_at":"2024-09-16 23:01:28","extension":"png","order_by":7,"title":"Figure 7","display":"","copyAsset":false,"role":"figure","size":77088,"visible":true,"origin":"","legend":"\u003cp\u003eAuto Correlation plot of 80 metrics along for each lag\u003c/p\u003e","description":"","filename":"floatimage7.png","url":"https://assets-eu.researchsquare.com/files/rs-4914517/v1/3f22cd7bb6d57bbeab6b53d0.png"},{"id":64635185,"identity":"5f656f1d-0dc2-4b6b-8f8a-45e6e97093ff","added_by":"auto","created_at":"2024-09-16 22:45:28","extension":"png","order_by":8,"title":"Figure 
8","display":"","copyAsset":false,"role":"figure","size":81619,"visible":true,"origin":"","legend":"\u003cp\u003eAuto-Correlation Function (ACF) plots of time series in Cluster 0\u003c/p\u003e","description":"","filename":"floatimage8.png","url":"https://assets-eu.researchsquare.com/files/rs-4914517/v1/6c902515f174b87bf28b9400.png"},{"id":64635189,"identity":"e8e62f06-2b8c-4b78-aa4d-14142bc9b20a","added_by":"auto","created_at":"2024-09-16 22:45:28","extension":"png","order_by":9,"title":"Figure 9","display":"","copyAsset":false,"role":"figure","size":29903,"visible":true,"origin":"","legend":"\u003cp\u003eAuto-Correlation Function (ACF) plots of time series in Cluster 1\u003c/p\u003e","description":"","filename":"floatimage9.png","url":"https://assets-eu.researchsquare.com/files/rs-4914517/v1/3bf1426578ce850447d68708.png"},{"id":64635195,"identity":"2a07f691-36ef-478c-bf83-db6004cd599a","added_by":"auto","created_at":"2024-09-16 22:45:28","extension":"png","order_by":10,"title":"Figure 10","display":"","copyAsset":false,"role":"figure","size":43398,"visible":true,"origin":"","legend":"\u003cp\u003eAuto-Correlation Function (ACF) plots of time series in Cluster 2\u003c/p\u003e","description":"","filename":"floatimage10.png","url":"https://assets-eu.researchsquare.com/files/rs-4914517/v1/b73d7ceeb6e02ef1944e8060.png"},{"id":64635194,"identity":"270d263f-19b1-4cad-bd36-3e19cfd3ef43","added_by":"auto","created_at":"2024-09-16 22:45:28","extension":"png","order_by":11,"title":"Figure 11","display":"","copyAsset":false,"role":"figure","size":80417,"visible":true,"origin":"","legend":"\u003cp\u003eAuto-Correlation Function (ACF) plots of time series in Cluster 3\u003c/p\u003e","description":"","filename":"floatimage11.png","url":"https://assets-eu.researchsquare.com/files/rs-4914517/v1/038a959503922ebd9e57c699.png"},{"id":64635949,"identity":"bbb89104-44a1-4555-861f-3522bd5cefcf","added_by":"auto","created_at":"2024-09-16 
23:09:28","extension":"png","order_by":12,"title":"Figure 12","display":"","copyAsset":false,"role":"figure","size":52768,"visible":true,"origin":"","legend":"\u003cp\u003eProfile Pattern Extraction based on temporal features.\u003c/p\u003e","description":"","filename":"floatimage12.png","url":"https://assets-eu.researchsquare.com/files/rs-4914517/v1/5113fea7c6f905d9f67eb59d.png"},{"id":64635289,"identity":"8fd855c7-a65b-4e19-85ed-40e3f7e3b309","added_by":"auto","created_at":"2024-09-16 22:53:28","extension":"png","order_by":13,"title":"Figure 13","display":"","copyAsset":false,"role":"figure","size":39001,"visible":true,"origin":"","legend":"\u003cp\u003eDistributions of Anomaly scores for each model\u003c/p\u003e","description":"","filename":"floatimage13.png","url":"https://assets-eu.researchsquare.com/files/rs-4914517/v1/46cab1c014927fe48eaa16ba.png"},{"id":64635287,"identity":"b9e4d435-8015-4da0-80e0-d5488b82e245","added_by":"auto","created_at":"2024-09-16 22:53:28","extension":"jpeg","order_by":14,"title":"Figure 14","display":"","copyAsset":false,"role":"figure","size":542203,"visible":true,"origin":"","legend":"\u003cp\u003eTest data with detected anomalies in red. 
(a) Detected anomalies with Isolation Forest Model Result for Feature Set 0 (b) Anomalies detected with Feature Set 1 (c) Anomalies detected with Feature Set 2\u003c/p\u003e","description":"","filename":"floatimage14.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-4914517/v1/2e651f0439082c245bd334d3.jpeg"},{"id":82363202,"identity":"5d4ef283-7d7b-471d-9965-7a0a68d8ab5a","added_by":"auto","created_at":"2025-05-09 12:16:47","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":2284628,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-4914517/v1/f9abaacc-0b40-4ec0-8e08-691e6f6ba24d.pdf"}],"financialInterests":"No competing interests reported.","formattedTitle":"Enhancing Network Performance Monitoring through Scalable Multi-Dimensional Metric Analysis and Pattern-Based Anomaly Detection","fulltext":[{"header":"1 Introduction","content":"\u003cp\u003eIn the era of digital transformation, network performance has become a critical factor in the success of businesses and organizations worldwide. Modern computer networks require accurate anomaly detection in real-time streaming data for various applications, including preventative maintenance, fraud prevention, problem detection, and monitoring. The increasing complexity and heterogeneity of these networks, along with complex patterns and real-time constraints, make detecting and diagnosing anomalies in network throughput metrics a challenging task.\u003c/p\u003e \u003cp\u003eAnomaly detection in network systems has long been a focus of research and practical implementation. It serves as an early warning system, alerting network administrators to potential issues that could degrade performance or indicate security threats. Anomaly detection in multivariate time series metrics in network data is a challenging problem due to the complexity and large-scale nature of network data. 
Large, complex networks can face challenges in detecting anomalies due to the curse of dimensionality(Chen et al., \u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e2020\u003c/span\u003e) This term describes the abundance of data originating from numerous sources, leading to diverse manifestations across various network devices. Such a situation can hinder the effective identification of anomalies. Telecommunication ISPs (Internet Service Providers) companies often generate multiple correlated time series, representing different aspects of network performance. These metrics provide a rich dataset that, when properly analysed, can reveal complex patterns and anomalies that may go unnoticed in simpler monitoring schemes(Z. Li et al., \u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e2021\u003c/span\u003e). The challenge lies in effectively processing and interpreting this high-dimensional data to extract meaningful insights and identify potential issues before they escalate into major problems. Anomaly detection methods must be capable of handling multivariate data, considering the correlations between dimensions to provide a more comprehensive and accurate assessment of network anomalies. Traditional anomaly detection methods often focus on datasets with a single item. However, real-world scenarios often involve multi-component datasets, where each component has its own pattern and behaviour. Developing a separate model for each component can lead to performance issues and increased complexity.\u003c/p\u003e \u003cp\u003eThis research paper introduces a novel approach to anomaly detection in multi-dimensional network throughput metrics through profile pattern extraction. In the first step multiple time series representing different metrics are grouped using clustering algorithm based on their similar pattern. 
In the second step profile patterns are created by grouping data based on temporal features, then aggregating key metrics within these groups to identify trends and deviations. The z-score features computed after the profile pattern are modelled for anomaly detection with Isolation Forest algorithm. This approach offers a scalable and efficient solution for anomaly detection in large-scale network environments, ultimately contributing to improved network performance and increased reliability. Thus, rather than developing separate model for each time series, this approach clusters similar behavioural time series into the same cluster and develop a model for each cluster.\u003c/p\u003e \u003cp\u003eThe objectives of this study are threefold:\u003c/p\u003e \u003cp\u003e \u003col\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003eTo develop an integrated single model-based approach for extracting profile patterns from multi-dimensional network throughput metrics.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003eTo design and implement an anomaly detection mechanism based on these extracted profiles.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003eA telecommunication network monitoring case study with throughput data is used to test the proposed method.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003c/ol\u003e \u003c/p\u003e \u003cp\u003eThe potential impact of this research extends beyond mere academic interest. By providing network administrators with more accurate and timely information about network anomalies, this approach can contribute to proactive network management, reduced downtime, and improved user experience. 
Furthermore, the insights gained from profile pattern analysis can inform network optimization strategies, leading to more efficient resource allocation and enhanced network design.\u003c/p\u003e"},{"header":"2 Related Work","content":"\u003cp\u003eTelecommunication networks contain diverse data that represents various metrics of different types and scales (Bordeau-Aubert et al., \u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e2023\u003c/span\u003e) When working with unsupervised techniques, it is common to handle multiple time series by either clustering them(Diez-Olivan et al., \u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e2017\u003c/span\u003e) or calculating the distances between them(Benkabou et al., \u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2018\u003c/span\u003e). High dimensionality in data complicates the training process for Machine Learning Algorithms (MLAs), leading to overfitting of the model and decreased predictive performance(Anowar et al., \u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e2022\u003c/span\u003e). This is due to the increased complexity of the dataset, which hinders the algorithm's ability to generalize to new data. Clustering time series data is a valuable technique for identifying underlying patterns and structures in various applications. State-of-the-art anomaly detection algorithms have been proposed (Garg et al., \u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e2021\u003c/span\u003e) for heterogeneous time series data include unsupervised and semi-supervised deep-learning-based methods. These methods can scale to high dimensions and model complex patterns in various domains, making them suitable for detecting anomalies in multivariate time series data. Another approach is to use robust time series decomposition, as proposed in (T. 
Li et al., \u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e2020\u003c/span\u003e)to decouple the trend and seasonal components of the data and detect outliers in the residual. However, the field of time series anomaly detection is constantly advancing, and several methods are available, making it a challenge to determine the most appropriate method for a specific domain(S\u0026oslash;rb\u0026oslash; \u0026amp; Ruocco, \u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e2023\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eCanizo et al., (\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e2019\u003c/span\u003e) proposed a multi-head CNN-RNN architecture for anomaly detection in multi-sensor systems to provide a novel deep learning-based approach that addresses the challenge of detecting anomalies in heterogeneous data. The architecture uses independent CNNs to extract features from each sensor data on a fully independent basis, allowing for a more tailored approach to each type of sensor. The RNN component processes the sensor data in a window-based method, which allows it to focus on different phases of the sensor data. The effectiveness of this approach was demonstrated through an industrial case study involving a service elevator, where anomalies were effectively detected based on multiple sensor data. This study shows that the proposed approach outperforms other state-of-the-art methods for anomaly detection in multi-time series data and provides an effective solution for anomaly detection in multi-sensor systems without the need for data pre-processing.\u003c/p\u003e \u003cp\u003eFor large internet companies, it is crucial to monitor a significant number of Key Performance Indicators (KPIs) and identify anomalies to ensure service quality and reliability. However, detecting anomalies on millions of KPIs poses significant challenges, including the extensive overhead of model selection, parameter tuning, model training, and labelling. In (Z. 
Li et al., \u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e2018\u003c/span\u003e) KPI clustering approach is proposed that can provide a solution. By clustering millions of KPIs into a smaller number of clusters, we can select and train models on a per-cluster basis.\u003c/p\u003e \u003cp\u003eIn this chapter we, proposed a method for anomaly detection in multi-component datasets where the patterns of components are dependent on common factors.\u003c/p\u003e"},{"header":"3 Methodology","content":"\u003cdiv id=\"Sec4\" class=\"Section2\"\u003e \u003ch2\u003e3.1 Overview of Proposed Scalable Anomaly Detection Framework\u003c/h2\u003e \u003cp\u003eIn this section, we presented our large-scale time series anomaly detection approach for data in two main steps. First the multicomponent time series metrics are clustered according to their underlying shapes. By identifying similar KPIs, such as the number of queries per server in a well-balanced server cluster, we can group them into a few clusters. This approach allows us that we can employ one anomaly detection model per cluster, effectively reducing the mentioned overhead significantly. In the second step a simple single model-based approach is used to detect anomalies in data. The approach incorporates temporal information into the analysis, which is not commonly done in traditional time series analysis methods. This enables us to group the data based on specific time intervals, providing valuable insights into the relationships between various KPIs and their corresponding temporal contexts. The approach detects anomalies in large-scale time series data by leveraging clustering and KPI-based pattern table extraction as depicted in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eThe proposed approach can be extended for many application domains which share common characteristics. 
The details of the main steps involved in our approach are outlined in the next subsections.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec5\" class=\"Section2\"\u003e \u003ch2\u003e3.2 Clustering of Multi-Dimensional Time Series Data\u003c/h2\u003e \u003cp\u003eTime series data, a sequence of observations recorded at regular time intervals, presents a unique challenge for analysis because of its dynamic fluctuations. Typically, there are a vast number of Key Performance Indicators (KPIs) in a large-scale internet-based service company, and it is impossible for operators to analyse each KPI individually. By using clustering, they can analyse KPIs per cluster and create an anomaly detection model for each cluster (Z. Li et al., \u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e2018\u003c/span\u003e). This significantly reduces modelling costs and improves efficiency.\u003c/p\u003e \u003cp\u003eClustering is a type of unsupervised learning that splits a collection of unlabelled data objects into groups whose members are similar to each other. Clustering time series data involves grouping similar time series based on certain characteristics. One efficient way to extract features from time series data for clustering is to use the Autocorrelation Function (ACF). A number of anomaly detection methods have been proposed, e.g. (Xu et al., \u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e2018\u003c/span\u003e) and (Laptev et al., \u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e2015\u003c/span\u003e). However, these approaches often assume that an individual model is needed for each KPI. This poses a significant challenge for large-scale anomaly detection involving thousands to millions of KPIs. The overhead involved in developing multiple models, parameter tuning, model training, and anomaly labelling becomes a major obstacle. 
Fortunately, many KPIs share common characteristics and associations.\u003c/p\u003e \u003cp\u003eThe autocorrelation function (ACF) measures the linear dependence between data points in a time series at different time lags. It gives the correlation between the series and lagged copies of itself. In this context, the ACF is used to transform the time series data into a form that can be effectively clustered (Yakubu \u0026amp; Saputra, \u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e2022\u003c/span\u003e). To compute ACF estimates, we determined the degree of autocorrelation between data points separated by varying time lags. The result is a feature vector representing the temporal structure of the time series.\u003c/p\u003e \u003cp\u003eAnalysing clusters instead of individual series reduces the complexity of the analysis and offers a concise overview of the data's overall structure. We clustered the time series based on similarities in their autocorrelation patterns using k-means clustering. The k-means algorithm is a popular method for partitioning a dataset into a specified number of clusters. It works by assigning each data point to the cluster whose centroid is nearest. The centroids are then updated based on the new assignments, and the process is repeated until the assignments no longer change (Ferencz et al., \u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e2022\u003c/span\u003e). However, the number of clusters needs to be specified in advance and has to be estimated using other methods (Xia et al., \u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e2015\u003c/span\u003e). In our approach, we obtain an estimate of the number of clusters by analysing the patterns in the autocorrelation plot. We first calculated the ACF for each time series, then used those ACF values as features for clustering. 
By using the ACF to transform the data and the k-means algorithm to cluster it, our approach can reveal patterns in the data that may not be apparent from the raw time series alone.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec6\" class=\"Section2\"\u003e \u003ch2\u003e3.3 Context-Aware Based Pattern Construction for Multi-Dimensional Metrics\u003c/h2\u003e \u003cp\u003eIn the previous section, we explored how to cluster multiple behavioural time series into groups. In this section, we detail the creation of a pattern table based on temporal contextual features. Data is segmented based on contextual factors such as time of day and day of the week. Each grouping represents a unique combination of time-related attributes (e.g., Monday at 10 AM or Sunday at 3 PM). This segmentation is crucial as it acknowledges that data behaviour can vary significantly depending on these contextual elements, which is often overlooked in simpler models. KPIs exhibit regular patterns based on time, making deviations potentially indicative of issues or extraordinary events.\u003c/p\u003e \u003cp\u003eFor each group, two statistical measures are calculated for each KPI: the mean and the standard deviation. The mean provides an average value of the KPI for that specific time segment, offering insight into what is typical or expected during that period, while the standard deviation measures the dispersion of the KPI values around the mean. The results of these aggregations are compiled into a pattern table. This table serves as a reference model, capturing the normal behaviour patterns of KPIs across different temporal contexts. 
Each row in this table corresponds to a unique combination of temporal attributes and includes the aggregated statistical figures for each KPI.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec7\" class=\"Section2\"\u003e \u003ch2\u003e3.4 Z-score Based Features and Anomaly Detection\u003c/h2\u003e \u003cp\u003eIn the final step, after the pattern table has been constructed, it is merged with the original data, and z-scores are calculated for each KPI. This means that the deviation of each KPI value from its own pattern is estimated by considering the factors that determine the pattern. Each z-score expresses how many standard deviations a KPI value lies from the mean of its temporal segment.\u003c/p\u003e \u003cp\u003eThe Isolation Forest algorithm is an anomaly detection method that identifies anomalies based on the concept of isolation without employing any distance or density measure. This approach is fundamentally different from most existing methods (Madhukar Rao \u0026amp; Ramesh, \u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e2021\u003c/span\u003e). The model operates on the principle of isolating anomalies instead of constructing a profile of normal instances. It leverages the inherent property of anomalies being few and distinct, which allows them to be separated efficiently from normal observations. We use the z-scores as input features for the Isolation Forest model, which can enhance its anomaly detection capabilities. The method is well suited to our context because it handles the high-dimensional feature space efficiently.\u003c/p\u003e \u003cp\u003eWhen these z-score features are fed into the Isolation Forest algorithm, the algorithm can more effectively isolate anomalies. 
This is because the z-score transformation helps to highlight those data points that differ significantly from the segment mean, which are likely to be the anomalies that the Isolation Forest algorithm is designed to detect.\u003c/p\u003e \u003c/div\u003e"},{"header":"4 Experimental Setup","content":"\u003cp\u003eThis section details the experimental setting, the preprocessing, and the baseline approaches used. For the experiments in this chapter, we utilised the same type of data as in the previous chapter. In the exploratory data analysis, we also found that the data needs to be pre-processed before analysis. The following subsections detail the necessary pre-processing steps performed on the data.\u003c/p\u003e \u003cdiv id=\"Sec9\" class=\"Section2\"\u003e \u003ch2\u003e4.1 Data Collection from Real Network Infrastructure (Core Peering Router)\u003c/h2\u003e \u003cp\u003eA real-world dataset was collected from British Telecommunications (BT) network components. The data was captured from 85 BT internet peering router interfaces at a single location. Each interface provides four time series: input data rate, output data rate, input packet rate, and output packet rate. The time series for each interface consist of input and output throughput rates covering almost 12 days between 09 May and 21 May 2019, with observations timestamped approximately 30 seconds apart. The anomaly detection techniques can be applied at different locations of the telecommunication network, such as the customer access layer (e.g., broadband time series data) and the core network layer (e.g., a core peering router). 
The underlying issues primarily include potential network faults, connection problems, spikes in input traffic, and other network configuration issues that should raise an alarm for timely recovery.\u003c/p\u003e \u003cp\u003eThese metrics are extracted from an operational system and include time series of key metrics such as OPRATE, which represents the output packet rate over time. The system metrics are diverse in nature and require a domain expert to label anomalies. To illustrate the basic behaviour, Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e plots the output packet rate on the y-axis against datetime on the x-axis.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eIn this chapter, data from a single interface is used for the experiments; it consists of 4 metrics that represent the input and output throughput. As the core network peering router generates more regular data at high speed, it is challenging to predict where such time series deviate significantly from normal behaviour.\u003c/p\u003e \u003cp\u003eFig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e and Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e show the typical pattern of the network time series metrics. The figures illustrate that the time series data for each interface starts on 09 May and ends on 21 May 2019. 
Every observation is timestamped, and consecutive observations are spaced approximately 30 seconds apart.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eThe daily patterns of each component exhibit similarities, with the main difference being their magnitude.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec10\" class=\"Section2\"\u003e \u003ch2\u003e4.2 Data Component Selection and Filtering\u003c/h2\u003e \u003cp\u003eOrganizing the data by component type allows for a clear understanding of the different elements within the dataset. Any number of components can be selected for analysis; for this experiment, we selected 20 of the 85 components so that the results are easier to interpret.\u003c/p\u003e \u003cp\u003eWe grouped the data frames by day and generated a bar chart of the number of data points per day. The bar chart below shows that there are 83 observations on the first day and 1890 observations on the last day. To avoid bias in the anomaly detection analysis, we excluded these two days.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eFor each component, we selected 10 days of data and dropped the observations dated 2019-05-09 and 2019-05-20.\u003c/p\u003e \u003cp\u003eUpon examining a substantial number of metric streams, it becomes apparent that, despite the variety of metrics, many share common characteristics due to their inherent associations and similarities. The heterogeneous time series metrics are clustered into groups based on their patterns, using the autocorrelation-based clustering method described in Section 3.2.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec11\" class=\"Section2\"\u003e \u003ch2\u003e4.3 Resampling\u003c/h2\u003e \u003cp\u003eThe raw data consisted of measurements from different components recorded at irregular intervals. 
The interval between consecutive measurements is not fixed: it is tightly centred around 30 seconds and fluctuates between 29, 30 and 31 seconds.\u003c/p\u003e \u003cp\u003eThe initial step in our resampling process involved grouping the data by 'Component'. For each component, we rounded the index to the nearest minute to align the data points to a standardized time grid. This step is crucial to avoid misalignment issues that could arise from data points being recorded at varying seconds within each minute. Subsequently, we resampled the data to a one-minute frequency using the mean as the aggregation function, that is, we averaged the data points that fell within each one-minute interval. Resampling by averaging is a common technique to reduce the noise and variability in the data, providing a smoother representation of the underlying trends. Details of the datasets used in this chapter after preprocessing are presented in Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eData Component details after preprocessing (Resampling and Filtering)\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"9\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" 
class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c8\" colnum=\"8\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c9\" colnum=\"9\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eNumber of Components\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eStart Date\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eEnd Date\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eTotal No of Days\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eTotal Datapoints\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003eFrequency\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c7\"\u003e \u003cp\u003eData Points per Hour\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c8\"\u003e \u003cp\u003eData Points per Day\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c9\"\u003e \u003cp\u003eTotal number of time series\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e20\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e2019-05-10\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e2019-05-19\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e10\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e14,400\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" 
colname=\"c6\"\u003e \u003cp\u003e1 minute\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e60\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e1440\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e80\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec12\" class=\"Section2\"\u003e \u003ch2\u003e4.4 Train/Test Split\u003c/h2\u003e \u003cp\u003eIn the experiments, the dataset is partitioned into the groups produced by clustering, and each group is individually subjected to the train-test split. For example, for the first cluster of time series metrics, data is available for a duration of 10 days, as can be seen in Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003e.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eThe first week of each group's data is used for training, and the last three days are used for testing. This approach ensures that each group's unique characteristics are represented in both the training and testing phases, allowing for a more robust evaluation of the model across different segments of the data. In total, the 10 days of preprocessed data are split into one week (7 days) for training and 3 days for testing.\u003c/p\u003e \u003cp\u003e \u003cb\u003eProfile Pattern Creation\u003c/b\u003e \u003c/p\u003e \u003cp\u003eAfter grouping similar metric streams into several clusters, we constructed a pattern table for each cluster to capture patterns across time segments. The table groups each cluster's data by weekday, hour, and 30-minute interval, and aggregates the mean and standard deviation of each KPI within these segments. 
This design reflects the assumption that the time series metrics often exhibit temporal patterns based on day of the week and time of day.\u003c/p\u003e \u003cp\u003e \u003cb\u003eIsolation Forest Training and Anomaly Detection\u003c/b\u003e \u003c/p\u003e \u003cp\u003eAn iForest model was trained on the engineered features extracted from the training set. We employed an ensemble of isolation trees for robust anomaly scoring. The contamination parameter plays a crucial role in the detection process. It represents the assumed proportion of outliers in the dataset. When fitting the model, this parameter is used to determine the threshold on the scores of the samples. Manual adjustment of the contamination parameter is necessary to achieve a good fit for the specific dataset and the desired outcome. The parameters used for the Isolation Forest model and the feature sets are presented in Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eParameters and Feature Sets Used in the Isolation Forest Algorithm for Anomaly Detection\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"4\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e 
\u003cp\u003eCategory\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eParameter /Feature set\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eDescription\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eValue\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eParameters\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e\u0026nbsp;\u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e\u0026nbsp;\u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003e1\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003emax_samples\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eNumber of samples to train each estimator\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e1000\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003e2\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003erandom_state\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eSeed for random number generator\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003e3\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003econtamination\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e 
\u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eProportion of outliers in the dataset\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.025\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003e4\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003emax_features\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eNumber of features for each estimator\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e1.0\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003e5\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003en_estimators\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eNumber of base estimators in the ensemble\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e1000\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003e6\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003ebootstrap\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eWhether samples are drawn with replacement\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eFalse\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003e7\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003everbose\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd 
align=\"left\" colname=\"c3\"\u003e \u003cp\u003eVerbosity of the process\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003e8\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003en_jobs\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eNumber of cores for parallel processing\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e-1\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eFeature Sets\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003eSet 0 Model 1\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e[TS_0]\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003eSet 1 Model 2\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e[TS_0, TS_0_zscore]\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003eSet 2 Model 3\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e[TS_0_zscore]\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e 
\u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eIsolation Forest detects anomalies by exploiting their distinctiveness from normal data points: because anomalies are few and different, they can be separated from the rest of the data with fewer random partitions than normal points. The decision_function method computes an anomaly score for every data point in the dataset. The scores can be used to identify and rank outliers within the dataset. The anomaly score indicates the level of abnormality of a data point and is determined by the average path length needed to isolate the data point within the Isolation Forest structure. In the original Isolation Forest formulation, a shorter average path length corresponds to a higher anomaly score, indicating that the data point is more likely to be an outlier. Note that some implementations, including scikit-learn's decision_function, report the score with the opposite sign, so that lower (negative) values mark the more abnormal points.\u003c/p\u003e \u003cp\u003eWe constructed three models using different feature sets, as listed in Table 2. The first model uses only the raw values of the original time series. The second model employs both the raw data and the pattern table-based z-scores to exploit both inherent patterns and standardized deviations. The third model uses only the standardized feature values derived from the pattern table-based z-scores.\u003c/p\u003e \u003cp\u003eThese three models provide different perspectives and insights into the anomaly detection process.\u003c/p\u003e \u003c/div\u003e"},{"header":"5 Results and Discussion","content":"\u003cp\u003eThe heterogeneous data from multiple components containing throughput metrics are integrated into one data frame so that they can be analysed through clustering. The multiple time series may originate from different processes and exhibit distinct autocorrelation structures. We verified this by analysing their autocorrelation structure. We computed the ACF for each time series up to a lag of 20. 
By measuring the degree of correlation between an observation and its lagged versions, the ACF reveals the temporal dependencies within the time series. Thus, time series with similar temporal structures will have similar ACFs. Figure\u0026nbsp;\u003cspan refid=\"Fig7\" class=\"InternalRef\"\u003e7\u003c/span\u003e displays the autocorrelation functions (ACFs) of the 80 time series. The horizontal axis represents the lag, while the vertical axis shows the correlation value for each lag. The ACF plot suggests that a suitable number of clusters for our dataset can be determined by analysing the underlying structures in the ACFs. By using these ACF feature vectors as the basis for k-means clustering, we can group time series not by their raw values, but by the structure and dependencies in the data. This enables us to identify clusters of time series with similar temporal dynamics. The experiment supports the hypothesis that the ACF is a robust feature for clustering time series, capturing the essential dynamics of each series. In our context, each observation is a time series represented by its ACF feature vector.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eIt can be seen from Fig.\u0026nbsp;\u003cspan refid=\"Fig7\" class=\"InternalRef\"\u003e7\u003c/span\u003e that the number of clusters can be chosen visually as 4, as there are 4 distinct structures. We employed the k-means algorithm with four clusters to separate them computationally. The algorithm partitions the dataset into 4 clusters, where each observation belongs to the cluster with the nearest mean. The k-means algorithm assigns a label to each time series indicating the cluster it belongs to. These labels are then used to group and display the time series data by cluster. 
For example, Fig.\u0026nbsp;\u003cspan refid=\"Fig8\" class=\"InternalRef\"\u003e8\u003c/span\u003e provides a visual representation of the inherent patterns within the first cluster, which contains the ACFs of 51 time series.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eThus, from Fig.\u0026nbsp;\u003cspan refid=\"Fig8\" class=\"InternalRef\"\u003e8\u003c/span\u003e to Fig.\u0026nbsp;\u003cspan refid=\"Fig11\" class=\"InternalRef\"\u003e11\u003c/span\u003e, it is evident that each cluster predominantly consists of time series from the same generating processes, indicating the utility of the ACF in time series clustering. By plotting the ACFs for each cluster, we can visually confirm the internal coherence of each cluster: time series within the same cluster exhibit similar ACF patterns, indicating similar temporal dynamics.\u003c/p\u003e \u003cp\u003eThis helps validate the effectiveness of the clustering approach and provides insight into the fundamental characteristics of each cluster, which can be valuable for further analysis of the data. Clustering these feature vectors using k-means enables us to group time series based on the similarity of their temporal structures rather than their raw values, making this a powerful method for understanding the underlying dynamics of the data.\u003c/p\u003e \u003cp\u003eThe pattern table successfully grouped KPI data into time segments based on weekday, hour, and 30-minute intervals. For each KPI, means and standard deviations were calculated within each time segment, providing a summary of typical behaviour during those periods. 
For example, Fig.\u0026nbsp;\u003cspan refid=\"Fig12\" class=\"InternalRef\"\u003e12\u003c/span\u003e shows the normal pattern for a single time series in group 0. The analysis revealed diverse temporal patterns across KPIs. For example, the KPIs in group 3 (TS_58, TS_19, TS_43, TS_78) exhibited slightly different trends across weekdays and hours, with lower mean values during certain periods. These findings suggest the influence of day-to-day and hourly routines on these KPIs.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eOther KPIs (TS_18, TS_50) in the second group (g1) showed minimal variation across time segments, indicating their relative independence from weekday and hourly fluctuations. This may be due to factors such as external events or specific user activities beyond the scope of the time-based grouping.\u003c/p\u003e \u003cp\u003eZ-scores were calculated for each KPI, allowing individual data points to be compared with the temporal pattern of their corresponding time segment. Points that deviate significantly from the expected mean, relative to the standard deviation of their time segment, could indicate unusual activity or potential anomalies for further investigation. We used the time-based Z-score features and the raw data as inputs to the Isolation Forest algorithm.\u003c/p\u003e \u003cp\u003eAs mentioned in section 6.5.5, three models were constructed using different features to examine the effectiveness of iForest in capturing anomalies within each feature set. The distribution of anomaly scores obtained from each model was also analysed, as illustrated in Fig.\u0026nbsp;\u003cspan refid=\"Fig13\" class=\"InternalRef\"\u003e13\u003c/span\u003e. 
We observed the shape and range of the scores associated with each feature set.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eBy comparing the score distributions between models, we can evaluate the relative performance and contribution of different features in detecting anomalies within the dataset. Noticeable differences in the score distributions among the models indicate that the different features capture different aspects of the anomalies in the data. Each model assigns different anomaly scores to the data points based on the specific feature set it uses. These distribution differences provide insights into the effectiveness of each feature set in identifying anomalies. A model with a wider or more skewed score distribution may indicate a greater ability to distinguish between normal and anomalous data points. On the other hand, a model with scores concentrated within a specific range may suggest a more conservative or less sensitive approach to anomaly detection.\u003c/p\u003e \u003cp\u003eFig.\u0026nbsp;\u003cspan refid=\"Fig14\" class=\"InternalRef\"\u003e14\u003c/span\u003e depicts the test time series data with detected anomalies in red. Figure\u0026nbsp;\u003cspan refid=\"Fig14\" class=\"InternalRef\"\u003e14\u003c/span\u003ea represents the results of the first model, in which the iForest model was fitted only on the raw data. The outcome is relatively simplistic, as the model primarily detects anomalies at the significantly higher extremes.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eFigure \u003cspan refid=\"Fig14\" class=\"InternalRef\"\u003e14\u003c/span\u003eb displays the results of a model fitted on two features (the raw value and the pattern-table-based Z-score). From the perspective of the data, this model produces outcomes that are more in line with a somewhat subjective interpretation of interesting anomalies. 
Lastly, Fig.\u0026nbsp;\u003cspan refid=\"Fig14\" class=\"InternalRef\"\u003e14\u003c/span\u003ec illustrates the model fitted exclusively on TS_0_zscore, the Z-score derived from the pattern table.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab3\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eComparison of Anomaly Detection Models Using Different Feature Sets for Network Throughput Metrics\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"5\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eModel\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eAnomaly Detection Rate (%)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eAverage Anomaly Score\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eScore Distribution Skewness\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eComputational Time (s)\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eModel 1 (TS_0)\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e 
\u003cp\u003e2.522\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e-0.091\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e2.282\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e9.046\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eModel 2 (TS_0, TS_0_z_score)\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e1.805\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e-0.123\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1.824\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e8.582\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eModel 3 (TS_0_z_score)\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e2.198\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e-0.122\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e3.247\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e9.182\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eTable\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e shows that the percentage of data points classified as anomalies varies across the models. Model 1 (raw data) has the highest anomaly detection rate (2.522%), compared with 1.805% for Model 2 and 2.198% for Model 3. 
This suggests that using raw data alone may be overly sensitive, potentially flagging normal network fluctuations as anomalies. Models 2 and 3, which incorporate Z-scores, appear more selective in identifying anomalies, which aligns with the paper's goal of detecting subtle yet significant anomalies in network performance.\u003c/p\u003e \u003cp\u003eTable\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e also shows that Models 2 and 3 have lower (more negative) average anomaly scores (-0.123 and -0.122, against -0.091 for Model 1). We interpret this as these models being more confident in their anomaly classifications. The similarity between Models 2 and 3 suggests that the Z-score feature is driving this improved confidence, in line with this paper's emphasis on profile pattern extraction for more accurate anomaly detection.\u003c/p\u003e \u003cp\u003eModel 3, using only Z-scores, shows the highest positive skewness (3.247). This suggests that it is the most effective at distinguishing between normal and anomalous network behaviour, providing a clearer separation between regular operations and potential issues. The lower skewness of Model 2 (1.824) might indicate a more balanced approach, potentially reducing extreme classifications.\u003c/p\u003e"},{"header":"6 Conclusion","content":"\u003cp\u003eIn this paper, we proposed a novel approach to anomaly detection in multi-dimensional network throughput metrics through profile pattern extraction. Similar behavioural patterns are grouped using an autocorrelation-based clustering method. Using a real-world dataset, we demonstrated an efficient method for anomaly detection based on clustering and the construction of pattern-based features. The proposed method showed the effectiveness of combining profile patterns with anomaly detection algorithms. For anomaly detection, we employed the Isolation Forest model as an outlier detection model. Isolation Forest demonstrated its capability to detect anomalies in diverse time-series data, even with varying feature sets. 
The combination of iForest with the pattern-table-based Z-score exhibits superior performance compared to raw features alone. The results show that the approach offers a simpler solution for detecting anomalies in multi-component metrics using simpler models. It provides a single-model approach capable of handling multiple metrics whose components might follow different patterns. These results strongly support the effectiveness of the profile pattern extraction approach for anomaly detection in multi-dimensional network throughput metrics. The models incorporating Z-scores (2 and 3) show improvements in anomaly detection confidence and characterization, with minimal computational overhead. In addition, the approach is scalable and adaptable to other use cases through the construction of more relevant features. However, it is important to acknowledge the limitations of our study. We assume that each cluster depends only on inferred factors (e.g. hour, weekday, weekend). Also, for detecting anomalies in multivariate data, additional features such as the correlation between features could be considered along with the multiple raw series.\u003c/p\u003e \u003cp\u003eThe current analysis builds the pattern table from each combination of weekday, hour and 30-minute interval. Further investigation could explore different granularities of time segment when constructing the pattern table and calculating the Z-score. The overall approach presented in this paper can be extended to detect anomalies in multivariate data. 
Isolation Forest Model is particularly effective in detecting anomalies, especially for high-dimensional data.\u003c/p\u003e"},{"header":"Abbreviations","content":"\u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd width=\"18.008474576271187%\" valign=\"top\"\u003e\n \u003cp\u003eKPI\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"81.99152542372882%\" valign=\"top\"\u003e\n \u003cp\u003eKey Performance indicator\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"18.008474576271187%\" valign=\"top\"\u003e\n \u003cp\u003eRNN\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"81.99152542372882%\" valign=\"top\"\u003e\n \u003cp\u003eRecurrent Neural Network\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"18.008474576271187%\" valign=\"top\"\u003e\n \u003cp\u003eCNN\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"81.99152542372882%\" valign=\"top\"\u003e\n \u003cp\u003eConvolutional neural network\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"18.008474576271187%\" valign=\"top\"\u003e\n \u003cp\u003eACF\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"81.99152542372882%\" valign=\"top\"\u003e\n \u003cp\u003eAutocorrelation Function\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"18.008474576271187%\" valign=\"top\"\u003e\n \u003cp\u003eSTL\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"81.99152542372882%\" valign=\"top\"\u003e\n \u003cp\u003eSeasonal-Trend decomposition\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"18.008474576271187%\" valign=\"top\"\u003e\n \u003cp\u003eADF\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"81.99152542372882%\" valign=\"top\"\u003e\n \u003cp\u003eaugmented Dickey\u0026ndash;Fuller\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n 
\u003ctd width=\"18.008474576271187%\" valign=\"top\"\u003e\n \u003cp\u003eFFT\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"81.99152542372882%\" valign=\"top\"\u003e\n \u003cp\u003eFast Fourier Transform\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"18.008474576271187%\" valign=\"top\"\u003e\n \u003cp\u003eBTIIC\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"81.99152542372882%\" valign=\"top\"\u003e\n \u003cp\u003eBritish Telecom Ireland Innovation Centre\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"18.008474576271187%\" valign=\"top\"\u003e\n \u003cp\u003eAE\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"81.99152542372882%\" valign=\"top\"\u003e\n \u003cp\u003eAutoencoder\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eEthics approval and consent to participate\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFunding\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors have received no financial support regarding carrying out of this research study\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConflict of Interest/Competing interest\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors have no conflicts of interest to declare. All co-authors have seen and agreed with the contents of the manuscript. We certify that the submission is original work and is not under review at any other publication.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eEthics approval and consent to participate\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable. 
\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConsent for publication\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot Applicable.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAvailability of data and materials\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eDue to British Telecom (BT) policy, the real data collected from the core network cannot be provided due to confidentiality. Source code can be provided once paper is accepted.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u0026nbsp;\u003cstrong\u003eAuthors\u0026apos; contributions\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eS.A. conceived the idea of the proposed research study and designed a conceptual framework. In addition, S.A. carried out the research from highlighting the research gap by exploring the existing literature to develop a scalable framework. S.A., B.S. and D.G. conducted formal analysis and wrote the original manuscript. D.Y. captured the real data from the telecommunication network routers. D.G. investigated the experimental results. B.S., D.G. and S.Z. supervised the overall research and reviewed the final manuscript.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAcknowledgement\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis research is supported by the BTIIC (British Telecom Ireland Innovation Centre) project, funded by British Telecom and Invest Northern Ireland. In addition, I am very thankful to Prof Bryan Scotney for his immense support and encouragement to carry out this research. \u0026nbsp;\u0026nbsp;\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eAnowar, F., Sadaoui, S., \u0026amp; Dalal, H. (2022). Clustering Quality of a High-dimensional Service Monitoring Time-series Dataset. \u003cem\u003eInternational Conference on Agents and Artificial Intelligence\u003c/em\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBenkabou, S. E., Benabdeslem, K., \u0026amp; Canitia, B. (2018). 
Unsupervised outlier detection for time series by entropy and dynamic time warping. Knowledge and Information Systems, \u003cem\u003e54\u003c/em\u003e(2), 463\u0026ndash;486. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1007/S10115-017-1067-8/FIGURES/8\u003c/span\u003e\u003cspan address=\"10.1007/S10115-017-1067-8/FIGURES/8\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBordeau-Aubert, K., Whatley, J., Nadeau, S., Glatard, T., \u0026amp; Jaumard, B. (2023). \u003cem\u003eClassification of Anomalies in Telecommunication Network KPI Time Series\u003c/em\u003e. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://arxiv.org/abs/2308.16279v1\u003c/span\u003e\u003cspan address=\"https://arxiv.org/abs/2308.16279v1\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCanizo, M., Triguero, I., Conde, A., \u0026amp; Onieva, E. (2019). Multi-head CNN\u0026ndash;RNN for multi-time series anomaly detection: An industrial case study. Neurocomputing, \u003cem\u003e363\u003c/em\u003e, 246\u0026ndash;260. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.neucom.2019.07.034\u003c/span\u003e\u003cspan address=\"10.1016/j.neucom.2019.07.034\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChen, F., Garrett, J., Zacks, D. N., \u0026amp; Yashin, V. (2020). \u003cem\u003eMETHOD AND SYSTEM FOR ANOMALY DETECTION IN LARGE-SCALE NETWORKS\u003c/em\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDiez-Olivan, A., Pagan, J. A., Sanz, R., \u0026amp; Sierra, B. (2017). 
Data-driven prognostics using a combination of constrained K-means clustering, fuzzy modeling and LOF-based score. Neurocomputing, \u003cem\u003e241\u003c/em\u003e, 97\u0026ndash;107. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/J.NEUCOM.2017.02.024\u003c/span\u003e\u003cspan address=\"10.1016/J.NEUCOM.2017.02.024\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFerencz, K., Domokos, J., \u0026amp; Kov\u0026aacute;cs, L. (2022). Analysis of time series data for anomaly detection. \u003cem\u003e2022 IEEE 22nd International Symposium on Computational Intelligence and Informatics and 8th IEEE International Conference on Recent Achievements in Mechatronics, Automation, Computer Science and Robotics (CINTI-MACRo)\u003c/em\u003e, 95\u0026ndash;100. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://api.semanticscholar.org/CorpusID:256589477\u003c/span\u003e\u003cspan address=\"https://api.semanticscholar.org/CorpusID:256589477\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGarg, A., Zhang, W., Samaran, J., Savitha, R., \u0026amp; Foo, C.-S. (2021). An evaluation of anomaly detection and diagnosis in multivariate time series. IEEE Transactions on Neural Networks and Learning Systems, \u003cem\u003e33\u003c/em\u003e(6), 2508\u0026ndash;2517.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLaptev, N. P., Amizadeh, S., \u0026amp; Flint, I. (2015). Generic and Scalable Framework for Automated Time-series Anomaly Detection. \u003cem\u003eProceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining\u003c/em\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLi, T., Geng, Y., \u0026amp; Jiang, H. (2020). 
Anomaly Detection on Seasonal Metrics via Robust Time Series Decomposition. In \u003cem\u003earXiv.org\u003c/em\u003e. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://arxiv.org/abs/2008.09245\u003c/span\u003e\u003cspan address=\"https://arxiv.org/abs/2008.09245\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLi, Z., Zhao, Y., Han, J., Su, Y., Jiao, R., Wen, X., \u0026amp; Pei, D. (2021). Multivariate Time Series Anomaly Detection and Interpretation using Hierarchical Inter-Metric and Temporal Embedding. \u003cem\u003eProceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery \\\u0026amp; Data Mining\u003c/em\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLi, Z., Zhao, Y., Liu, R., \u0026amp; Pei, D. (2018). Robust and Rapid Clustering of KPIs for Large-Scale Anomaly Detection. \u003cem\u003e2018 IEEE/ACM 26th International Symposium on Quality of Service (IWQoS)\u003c/em\u003e, 1\u0026ndash;10. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1109/IWQoS.2018.8624168\u003c/span\u003e\u003cspan address=\"10.1109/IWQoS.2018.8624168\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMadhukar Rao, G., \u0026amp; Ramesh, D. (2021). \u003cem\u003eA Hybrid and Improved Isolation Forest Algorithm for Anomaly Detection BT - Proceedings of International Conference on Recent Trends in Machine Learning, IoT, Smart Cities and Applications\u003c/em\u003e (V. K. Gunjan \u0026amp; J. M. Zurada (eds.); pp. 589\u0026ndash;598). Springer Singapore.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eS\u0026oslash;rb\u0026oslash;, S., \u0026amp; Ruocco, M. (2023). 
\u003cem\u003eNavigating the Metric Maze: A Taxonomy of Evaluation Metrics for Anomaly Detection in Time Series\u003c/em\u003e. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://arxiv.org/abs/2303.01272\u003c/span\u003e\u003cspan address=\"https://arxiv.org/abs/2303.01272\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eXia, H., Chen, B., Fan, J., Li, Z., \u0026amp; Gao, D. (2015). \u003cem\u003eMining Time Series Data with Two Dimensional Fuzzy Pattern Rules\u003c/em\u003e. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://api.semanticscholar.org/CorpusID:118498288\u003c/span\u003e\u003cspan address=\"https://api.semanticscholar.org/CorpusID:118498288\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eXu, H., Chen, W., Zhao, N., Li, Z. Z., Bu, J., Li, Z. Z., Liu, Y., Zhao, Y., Pei, D., Feng, Y., Chen, J., Wang, Z., \u0026amp; Qiao, H. (2018). Unsupervised Anomaly Detection via Variational Auto-Encoder for Seasonal KPIs in Web Applications. \u003cem\u003eThe Web Conference 2018 - Proceedings of the World Wide Web Conference, WWW 2018\u003c/em\u003e, 187\u0026ndash;196. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1145/3178876.3185996\u003c/span\u003e\u003cspan address=\"10.1145/3178876.3185996\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eYakubu, U. A., \u0026amp; Saputra, M. P. A. (2022). Time Series Model Analysis Using Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) for E-wallet Transactions during a Pandemic. International Journal of Global Operations Research. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://api.semanticscholar.org/CorpusID:251462691\u003c/span\u003e\u003cspan address=\"https://api.semanticscholar.org/CorpusID:251462691\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"Network Performance, Anomaly Detection, Multi-Dimensional Data Analysis, Machine Learning, Pattern Extraction, Network Throughput Metrics","lastPublishedDoi":"10.21203/rs.3.rs-4914517/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-4914517/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eThis paper presents a novel approach to network performance monitoring and improvement through profile pattern extraction-based anomaly detection in multi-dimensional network throughput metrics. As modern networks grow in complexity, traditional monitoring methods often struggle to detect subtle yet significant anomalies that can impact performance. 
Our research addresses this challenge by developing an integrated framework that combines advanced data analysis techniques with machine learning algorithms to identify and interpret complex patterns in network behaviour. The proposed methodology leverages autocorrelation function (ACF) based clustering to group similar time series, and employs feature extraction methods to create profile patterns from multi-dimensional network data. These patterns serve as a baseline for normal network behaviour, against which anomalies are detected using a combination of statistical methods and machine learning, specifically the Isolation Forest algorithm.\u003c/p\u003e \u003cp\u003eOur experimental results, based on real-world data from a telecommunications network, demonstrate that the profile pattern-based approach significantly enhances anomaly detection capabilities. The best-performing model, which combines raw data and Z-scores derived from profile patterns, achieved an anomaly detection rate of 1.805% with the highest confidence (average anomaly score of -0.123). This model outperformed both raw data analysis and Z-score-only approaches in terms of selectivity and computational efficiency, completing analysis in 8.582 seconds. This research contributes to the field of network performance monitoring by offering a more sophisticated and accurate approach to anomaly detection, potentially leading to enhanced network reliability, reduced downtime, and improved user experience. 
The paper concludes by discussing the implications of these findings for network administrators and outlining future research directions in this rapidly evolving field.\u003c/p\u003e","manuscriptTitle":"Enhancing Network Performance Monitoring through Scalable Multi-Dimensional Metric Analysis and Pattern-Based Anomaly Detection","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2024-09-16 22:45:23","doi":"10.21203/rs.3.rs-4914517/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"be48c708-449d-4efa-8aef-2e5a3da26559","owner":[],"postedDate":"September 16th, 2024","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[],"tags":[],"updatedAt":"2025-05-09T12:08:40+00:00","versionOfRecord":[],"versionCreatedAt":"2024-09-16 22:45:23","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-4914517","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-4914517","identity":"rs-4914517","version":["v1"]},"buildId":"qtupq5eGEP_6zYnWcrvyt","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
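As a rough illustration of the pattern-table Z-score feature described in the results, the sketch below groups samples by (weekday, hour, 30-minute slot), computes the mean and standard deviation of each segment, and scores every point against its own segment. It is a minimal standard-library Python sketch under stated assumptions, not the authors' implementation: the function names `segment_key`, `build_pattern_table` and `pattern_zscores`, the synthetic data, and the use of the population standard deviation are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean, pstdev


def segment_key(ts):
    """Map a timestamp to its (weekday, hour, 30-minute slot) segment."""
    return (ts.weekday(), ts.hour, ts.minute // 30)


def build_pattern_table(samples):
    """Return {segment: (mean, std)} from (timestamp, value) pairs."""
    groups = defaultdict(list)
    for ts, value in samples:
        groups[segment_key(ts)].append(value)
    # Population standard deviation is an assumption of this sketch.
    return {key: (mean(vals), pstdev(vals)) for key, vals in groups.items()}


def pattern_zscores(samples, table):
    """Score each sample against the mean/std of its own time segment."""
    scores = []
    for ts, value in samples:
        mu, sigma = table[segment_key(ts)]
        # A flat segment has sigma == 0; report 0.0 rather than divide by zero.
        scores.append((value - mu) / sigma if sigma > 0 else 0.0)
    return scores


# Hypothetical data: eight weeks of 30-minute samples with a daily cycle.
SAMPLES_PER_WEEK = 7 * 48
start = datetime(2024, 1, 1)  # a Monday
samples = []
for i in range(8 * SAMPLES_PER_WEEK):
    ts = start + timedelta(minutes=30 * i)
    base = 100.0 + 10.0 * ts.hour                    # daily pattern
    jitter = float((i // SAMPLES_PER_WEEK) % 3) - 1  # week-to-week variation
    samples.append((ts, base + jitter))

# Inject a single spike to act as the anomaly.
spike_index = 500
ts, value = samples[spike_index]
samples[spike_index] = (ts, value + 200.0)

table = build_pattern_table(samples)
zscores = pattern_zscores(samples, table)
most_anomalous = max(range(len(samples)), key=lambda i: abs(zscores[i]))
print(most_anomalous, round(zscores[most_anomalous], 3))
```

Because each point is compared only with its own segment, the regular daily cycle in the synthetic data does not inflate the scores, while the injected spike stands out. The resulting scores could then be supplied, alone or alongside the raw values, to an outlier detector such as Isolation Forest. Note that with only a few samples per segment the attainable Z-score is bounded, so segment granularity and history length trade off against each other.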
