New bibliometric measure - justification

doi:10.22541/au.173627403.37033260/v1

2025 · doi:10.22541/au.173627403.37033260/v1

preprint OA: closed

Abstract

The aim of this article is to present the scientific achievements of one scientist. This work is part of a project on introducing and justifying new bibliometric measures.

Selected scientific publications

Dr. Kiersztyn's scientific achievements to date are extensive. Among the selected works, it is worth mentioning \cite{Kiersztyn_2022}, \cite{Karczmarek_2020}, \cite{_opucki_2020}, \cite{Miazek_2024}, \cite{Celi_ski_2024}, \cite{Kiersztyn_2023,Kiersztyn_2024}, \cite{Kiersztyn_2023a}, \cite{_opucki_2023,Kiersztyn_2023b}, and \cite{Karczmarek_2023,Kiersztyn_2022a}. The most important of them are described below.

RCOD

The proposed approach \cite{Kiersztyn_2024} is a novel combination of existing outlier detection methods based on statistical techniques and on clustering of elements. Unlike previous methods, it uses random clustering, and the distances between the centers of the random clusters are used to classify individual elements. The basic idea of using random cluster centers for outlier detection is to explore the data space without any predefined assumptions about the cluster distribution. Elements that deviate significantly from the random cluster centers should be treated as outliers.

The motivation for the proposed approach stems from the fact that data often contain patterns and structures that may not be visible at first glance. Classical clustering methods may miss these structures because of limitations in the selection of initial cluster centers. By introducing randomness into the initialization of cluster centers, the proposed method allows previously unexplored parts of the data space to be explored, detecting potentially hidden patterns or relationships that would otherwise be overlooked. Additionally, repeating the sampling and outlier identification steps improves the stability of the classification.
It is worth noting that, according to the law of large numbers, the elements most frequently selected as cluster centers will be typical elements. An outlier may be randomly selected as a cluster center, but this should happen only rarely.

The main goal of the work was to develop and disseminate an outlier detection algorithm that meets two requirements. First, the algorithm should be highly intuitive and easy to interpret. Second, its efficiency should be high, comparable to that of other recognized algorithms. The first requirement reflects the assumption that the approach should be easy to use and modify by people who may lack extensive experience in data analysis. Both criteria are met by the innovative algorithm for detecting outliers in multidimensional data sets, based on random grouping of elements with a random distribution of the centers of individual clusters.

The proposed method combines elements of two different approaches: grouping of elements on the one hand, and probabilistic techniques and properties of distributions on the other. The novelty of the approach is the random selection of cluster centers, which, according to the law of large numbers, adapts to the distribution of the analyzed data. Moreover, because cluster centers are determined randomly, the algorithm can naturally adapt to the topology of the data set under consideration. In addition, an appropriate choice of metric by the user can significantly increase the efficiency of the method.

In short, the method can be summarized in the following steps. When the initial parameters are defined, the number of clusters and the number of algorithm iterations are set. In one iteration of the algorithm, cluster centers are randomly selected, the distance matrix between cluster centers is determined, and the distance of each point to the nearest cluster center is calculated.
Based on the calculated distance, the outlier level of a given element is determined.

The starting point for developing the outlier detector based on random clustering (RCOD) was the observation that outliers, by definition, are located far from the other elements of the set. Therefore, by making use of the law of large numbers, it was possible to propose an effective method for detecting outliers based on grouping elements with a random distribution of the centers of individual clusters.

Suppose we have a data set 𝐷 consisting of 𝑁 rows and 𝐾 columns. In other words, the input data set consists of 𝑁 records, and each record has 𝐾 attributes. First, the centers of individual clusters are selected randomly. More precisely, 𝑀 different elements are randomly selected from the analyzed set 𝐷 and become the cluster centers. These cluster centers form a set 𝐶 consisting of 𝑀 different elements. The number of clusters depends on the number of elements of the set 𝐷 and is calculated using an appropriately selected function \(M=f(N,K)\). The paper considers several functions that determine the number of clusters depending on the number of elements in the analyzed set. It should be noted that, according to the law of large numbers, the random selection of elements correctly reflects the distribution of the analyzed data. Moreover, no preprocessing is required to indicate the cluster centers.

After the cluster centers are indicated, the distances between all elements of the set 𝐶 are determined. This creates a matrix 𝐷(𝐶) whose entries are the distances between successive cluster centers. It should be noted that the matrix 𝐷(𝐶) is symmetric, and all elements on its main diagonal are equal to zero.
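The iteration described so far (random centers, the matrix 𝐷(𝐶), nearest-center distances) can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the published implementation: the choice \(f(N,K)=\lfloor\sqrt{N}\rfloor\), the Euclidean metric, the rule "level = number of five-number-summary thresholds of the off-diagonal entries of 𝐷(𝐶) that the nearest-center distance exceeds", and averaging the levels over iterations are all placeholders chosen for the example. The names `rcod_iteration` and `rcod_scores` are hypothetical.

```python
import math
import numpy as np

def rcod_iteration(D, M, rng):
    """One RCOD-style iteration; returns an integer level in 0..5 per row of D."""
    N = D.shape[0]
    # Pick M distinct rows of D as the random cluster centers (the set C).
    centers = D[rng.choice(N, size=M, replace=False)]
    # Distance matrix D(C): symmetric, with zeros on the main diagonal.
    DC = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    # Five-number summary (Q0..Q4) of the off-diagonal entries of D(C).
    off_diag = DC[~np.eye(M, dtype=bool)]
    thresholds = np.quantile(off_diag, [0.0, 0.25, 0.5, 0.75, 1.0])
    # Distance from every element of D to its nearest cluster center.
    nearest = np.linalg.norm(
        D[:, None, :] - centers[None, :, :], axis=2
    ).min(axis=1)
    # ASSUMPTION: the level is the number of thresholds the distance exceeds.
    return (nearest[:, None] > thresholds[None, :]).sum(axis=1)

def rcod_scores(D, iterations=50, seed=0):
    """Average level over repeated random draws of the cluster centers."""
    D = np.asarray(D, dtype=float)
    rng = np.random.default_rng(seed)
    # ASSUMPTION: M = f(N, K) = floor(sqrt(N)), with at least two centers.
    M = max(2, math.isqrt(D.shape[0]))
    levels = [rcod_iteration(D, M, rng) for _ in range(iterations)]
    return np.mean(levels, axis=0)
```

Calling `rcod_scores(X)` on an array `X` of shape (𝑁, 𝐾) returns one score per record; under the assumptions above, higher scores indicate records that repeatedly lie far from the randomly drawn centers.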
For the elements of 𝐷(𝐶) located outside the main diagonal, basic measures of location are determined: the minimum (𝑀𝑖𝑛 or 𝑄0), the first quartile (𝑄1), the median (𝑄2), the third quartile (𝑄3), and the maximum (𝑀𝑎𝑥 or 𝑄4). These statistics serve as thresholds that assign individual elements of the input data set 𝐷 to a specific class. Additional thresholds based on the three-sigma rule can also be introduced.

The Concept of Detecting and Classifying Anomalies in Large Data Sets Based on Information Granules

The primary idea of the solution presented in \cite{Kiersztyn_2020a} is to elevate the analysis to a higher, more abstract level. The proposed method for detecting anomalies and outliers is based mainly on transforming raw data into an abstract, discrete cube. The transformation used in the study is based on a modified three-sigma rule. Each coordinate of a dataset element is transformed into one of \(2N+1\) discrete values \(\{-N, -N+1, \ldots, -1, 0, 1, \ldots, N-1, N\}\). A value of 0 indicates that the analyzed element is typical, negative values correspond to an underestimation of the element (below the mean), and positive values indicate an overestimation (above the mean). The absolute value denotes the degree of deviation and depends directly on the distance from the mean.

It should be noted that a dataset element does not need to be transformed dimension by dimension into the new abstract cube. In other words, the number of dimensions in the resulting cube can exceed that of the original dataset, because contextual transformation is possible during the data transformation process. The concept of contextual transformation was demonstrated in numerical experiments involving real-world taxi trip data from New York City.
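The coordinate-wise discretization into \(2N+1\) values can be illustrated with a short Python sketch. The text does not spell out the details of the modified three-sigma rule, so the code below substitutes a plain z-score binning as a stand-in: each column is standardized, divided by an assumed bin width `step` (in standard deviations), truncated toward zero, and clipped to \([-N, N]\). The function name `discretize` and the parameters `n_levels` (playing the role of \(N\)) and `step` are hypothetical.

```python
import numpy as np

def discretize(X, n_levels=3, step=1.0):
    """Map each coordinate to an integer in {-n_levels, ..., n_levels}."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    # Constant columns have zero spread; treat every value there as typical.
    std = np.where(std == 0, 1.0, std)
    z = (X - mean) / std
    # ASSUMPTION: bins are `step` standard deviations wide. np.fix truncates
    # toward zero, so |z| < step maps to 0 (a typical value).
    return np.clip(np.fix(z / step), -n_levels, n_levels).astype(int)
```

For a single column with values 0, 0, 0, 0, 100 (mean 20, standard deviation 40), the first four entries map to 0 and the last one maps to 2, which matches the reading that the sign encodes the side of the mean and the absolute value encodes the degree of deviation.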
Each trip record included various data, such as the precise start and end times, the start and end locations, the number of passengers, the trip duration in seconds, the trip length in miles, and additional information identifying the driver and vehicle. The analysis focused on the data that enable geographic and temporal localization of trips. The selected data were transformed into a discrete cube with additional dimensions derived from a contextual view of the data. Auxiliary variables were calculated from the available data, such as the trip duration (the difference between the end and start times). Additionally, variables such as the straight-line distance, the difference in distance, the percentage of additional distance, the time difference, the speed based on available data, the speed based on calculated data, and the distance from Times Square were computed.

Applying the contextual transformation revealed interesting cases that did not appear suspicious in the raw data. For instance, the dataset included taxi rides with an average speed of 102,960 miles per hour. Some trips started more than 16,000 miles away from Times Square. Furthermore, a significant portion of trips exhibited a high percentage of additional distance, calculated as the difference between the declared trip length and the straight-line distance between the start and end points. The street layout of New York City suggests that routes should align closely with distances in the taxicab (Manhattan) metric.

In the analyzed example, five dimensions of information granules were distinguished. Clustering centers were limited to points where one coordinate equals 5 and to the points with coordinates (0; 0; 0; 0; 0) and (5; 5; 5; 5; 5), corresponding to elements with a significant anomaly in one dimension, anomaly-free elements, and elements with large anomalies in all dimensions, respectively. The membership degrees to individual clusters were calculated as normalized inverse distances from the cluster centers.
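The membership rule stated above can be written as \(u_{ij} = d_{ij}^{-1} / \sum_{k} d_{ik}^{-1}\), where \(d_{ij}\) is the distance from element \(i\) to cluster center \(j\). A minimal Python sketch follows. It assumes the Euclidean distance, assumes that the centers with one coordinate equal to 5 have all other coordinates equal to 0, and uses a small floor `eps` to avoid division by zero when an element coincides with a center; none of these details is given in the text, and the function name `membership` is hypothetical.

```python
import numpy as np

def membership(points, centers, eps=1e-12):
    """Membership degrees as normalized inverse distances to the centers."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Pairwise Euclidean distances, shape (n_points, n_centers).
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    # Floor the distances so a point lying on a center gets membership ~1 there.
    inv = 1.0 / np.maximum(d, eps)
    # Normalize each row so the degrees of one element sum to 1.
    return inv / inv.sum(axis=1, keepdims=True)

# ASSUMPTION: single-anomaly centers are 5 on one axis and 0 on the others.
centers = np.vstack([np.zeros(5), 5 * np.eye(5), np.full(5, 5.0)])
```

With these seven centers, `membership(points, centers)` returns one row per element and one column per center; each row sums to 1, and the largest entry in a row belongs to the closest center.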
This approach ensures that an element can belong to multiple clusters, with higher membership values for elements closer to a cluster center. The proposed method for determining membership degrees provides an alternative to other fuzzy clustering methods, such as Fuzzy C-Means, which may fail to perform properly with large datasets and high-dimensional spaces.

This work is licensed under a Non Exclusive No Reuse License.

Sławomir Borewicz. New bibliometric measure - justification. Authorea. 07 January 2025. DOI: https://doi.org/10.22541/au.173627403.37033260/v1
