An Improved Method for Gene Function Network Term Similarity Calculation

Research Article (Research Square preprint)

Zhi Tang, Xiuli Jing, Jian Wang, Furong Lin, Yanbing Zhou, Dingwei Yang
Jinan University (Shenzhen Campus). Corresponding author: Xiuli Jing.

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-6537896/v1
This work is licensed under a CC BY 4.0 License.
Status: Posted (Version 1)

Abstract

Background

Gene Ontology (GO) is a bioinformatics ontology that uses its structure to represent biological knowledge and describe the functions of genes and gene products. The computation of term similarity within Gene Ontology plays a critical role in various biological research areas, such as gene function analysis, comparison, and prediction. However, existing algorithms for term similarity calculation have several limitations and fail to fully exploit available information.
In recent years, some studies have incorporated gene functional networks into term similarity calculations; however, these approaches typically consider only directly connected genes, overlooking indirect relationships within the gene network and leaving part of the available data unused.

Results

In this study, we propose a novel Gene Ontology term similarity algorithm based on a Random Walk with Restart (RWR) framework, enhanced by a Gaussian kernel function (RWRSM). This algorithm not only incorporates structural and annotation information from Gene Ontology but also captures global structural information from gene functional networks. We performed multiple experiments on yeast and Arabidopsis datasets using Enzyme Commission (EC) classification numbers.

Conclusion

The experimental results demonstrate that our proposed algorithm outperforms existing methods across all measures for both yeast and Arabidopsis datasets. Specifically, the logged fold change (LFC) results are more stable, and our method uncovers a greater number of meaningful gene associations.

Keywords: Gene Ontology; term similarity; Random Walk with Restart; gene functional network

1. Background

Gene Ontology (GO) is one of the most successful ontologies in the field of biomedical research. It provides a standardized and precise vocabulary to describe various aspects of genes and gene products, including molecular functions, biological processes, and cellular components. GO is widely used in biomedical research [1], including gene function comparison and analysis, protein–protein interaction prediction, and gene set enrichment analysis, making it an indispensable resource among biomedical ontologies. Research has shown that if the functions of two gene products are similar, the GO terms annotating them are likely to be closer in the ontology [2].
The computation of semantic similarity between two entities based on an ontology has long been studied in computer science [3, 4]. This concept has been extensively applied in areas such as natural language processing [5], audio signal processing [6], and information retrieval [7]. One important application of GO is to measure the semantic similarity between terms, which in turn allows the similarity of gene products to be calculated [8]. Therefore, designing efficient and accurate algorithms for calculating the similarity between genes based on Gene Ontology, and uncovering hidden associations between gene pairs, has become a key research direction in knowledge engineering and bioinformatics. These efforts provide essential tools for exploring functional relationships between genes [9].

Several methods have been developed to measure the similarity between GO terms. They can be broadly categorized into edge-based, path-based, information content (IC)-based, node-based, and hybrid approaches [10, 11]. Edge-based methods calculate similarity from the topological structure of GO, usually by counting the edges along the shortest path between two GO terms. However, edge-based methods rely entirely on the topology of the GO Directed Acyclic Graph (DAG) and cannot distinguish between terms at the same topological level [12]. Additionally, edges at the same depth may not reflect the same semantic distance.

Node-based methods, on the other hand, use the properties of the query node and its ancestor or descendant nodes to indicate similarity. These similarities are often represented by the most informative common ancestor (MICA) and its information content (IC) [13]. Evaluation tests have shown that these results correlate with protein sequence similarity, but node-based methods only consider annotations and ignore the topological information of GO.
Hybrid approaches incorporate more information from GO. For example, the InteGO method integrates multiple existing similarity methods, called seed methods, using a rank-based approach in order to consider more aspects of GO [14]. The InteGO2 method selects the most suitable methods using a voting mechanism and integrates them with a metaheuristic search method [15]. Evaluation tests show that the hybrid approach outperforms the seed methods. However, all of these methods rely solely on GO structural information and fail to address issues such as inaccurate representation and missing data in GO [16].

Peng et al. proposed a network-based similarity measure (NETSIM) that integrates gene associations with GO topology and annotations to address these issues [17]. Experimental results based on metabolic reaction networks showed that integrating gene associations enhances semantic similarity, but NETSIM only considers direct links in the network and therefore uses only a portion of the information in the gene co-function network. In reality, it is essential to consider not only directly connected gene pairs but also indirect gene interactions within the functional gene network. To address this limitation, a new network-based approach, NETSIM2, was proposed, which incorporates random walk methods to consider both direct and indirect interactions in the gene co-function network. Experiments on metabolic reaction networks showed that NETSIM2 outperforms all previous algorithms. However, the logged fold change (LFC) values measured by NETSIM2 were not stable, with most LFC values being very small.

To resolve this issue, this paper proposes using a Gaussian kernel function to compute edge weights and generate a normalized weighted matrix. The random walk algorithm then uses this matrix to calculate the correlation scores between gene pairs. Finally, term pair similarity is computed following the original NETSIM2 algorithm [18].
Compared to previous methods, our approach combines the Gaussian kernel function and random walk to calculate gene correlation scores, considering both direct and indirect interactions. This integration not only significantly improves the stability of LFC values but also uncovers more effective hidden gene pair associations.

2. Methods

The RWRSM algorithm computes the semantic similarity between two genes in three steps. First, a weight matrix is constructed by fusing the Gaussian kernel function with the gene co-function network, and a correlation score matrix between genes is calculated from it using random walk with restart (RWR). Second, the similarity between two GO terms is computed by combining information from the co-function network and GO. Finally, a similarity measure based on the Z-score method is applied to assess the similarity between the two genes based on the selected GO term pairs.

Fusion of the Gaussian Kernel Function to Calculate the Correlation Score Matrix Between Two Genes

The basic idea of constructing the graph for random walk with restart is as follows: each data point \(d \in D\) in the training set \(D\) is mapped to a node in the graph. Whether two nodes are connected by an edge depends on the degree of similarity between them. For computational convenience, a complete graph is typically constructed, in which the similarity between two nodes serves as the weight of the edge between them. When the similarity between two nodes is very small or zero, the edge weight is set to zero. Let the weighted graph be \(G = (V, E)\), where \(V\) is the set of nodes and \(E\) is the set of edges, with \(n\) nodes in total.
The edge weights are calculated using the Gaussian kernel function:

$$W(i, j) = \exp\left(-\frac{d_{ij}^{2}}{\sigma^{2}}\right) \tag{1}$$

where \(\sigma\) is a parameter set empirically as \(\sigma = \sum_{i \ne j} \lVert x_i - x_j \rVert / [n(n-1)]\), and \(d_{ij}\) is the Euclidean distance between nodes \(i\) and \(j\) in the graph:

$$d_{ij}^{2} = \sum_{t=1}^{k} \left(x_i^{t} - x_j^{t}\right)^{2} \tag{2}$$

Before applying the random walk with restart (RWR) algorithm, the transition probability matrix \(P\) of the graph \(G\) is computed from the weight matrix:

$$P(i, j) = \begin{cases} W(i, j) \big/ \sum_{t=1}^{n} W(i, t), & (V_i, V_j) \in E \\ 0, & \text{otherwise} \end{cases} \tag{3}$$

The correlation score between genes is then computed as follows. First, the transition probability matrix \(P\) is calculated, and its transpose \(P^{T}\) is taken as the normalized weighted matrix. The random walk with restart (RWR) iteration is then:

$$D_i(t + 1) = a \cdot P^{T} D_i(t) + (1 - a)\, e_i \tag{4}$$

where \(e_i\) represents the initial state. The steady-state solution of Eq. (4) is:

$$D_i = (1 - a)\left(I - a P^{T}\right)^{-1} e_i \tag{5}$$

where \(D_i\) and the initial vector \(e_i\) are both \(|V| \times 1\) vectors, \(I\) is the identity matrix, and \(1 - a\) is the restart probability, which takes a value between 0 and 1.
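As a rough illustration of Eqs. (1) to (4), the following Python sketch builds Gaussian-kernel weights from a few made-up 2-D points, row-normalizes them into a transition matrix, and iterates the RWR update until it stops changing. The point coordinates, the function names, the zeroed diagonal, the value a = 0.5, and the reading of σ as the mean pairwise distance are all assumptions of this sketch; the paper's implementation is in Java with JUNG and may differ.

```python
import math

def gaussian_weights(points):
    """Eqs. (1)-(2): W(i, j) = exp(-d_ij^2 / sigma^2) for n >= 2 points."""
    n = len(points)
    dist = [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    # Assumption: sigma is the mean Euclidean distance over ordered pairs i != j.
    sigma = sum(dist[i][j] for i in range(n) for j in range(n) if i != j) / (n * (n - 1))
    # Assumption: no self-loops, so the diagonal is set to zero.
    return [[math.exp(-dist[i][j] ** 2 / sigma ** 2) if i != j else 0.0
             for j in range(n)] for i in range(n)]

def transition_matrix(weights):
    """Eq. (3): divide each row of W by its sum so that rows of P sum to 1."""
    rows = []
    for row in weights:
        total = sum(row)
        rows.append([w / total if total > 0 else 0.0 for w in row])
    return rows

def rwr(P, seed, a=0.5, tol=1e-10, max_iter=10_000):
    """Eq. (4): iterate D(t+1) = a * P^T D(t) + (1 - a) * e_seed until it converges."""
    n = len(P)
    e = [1.0 if k == seed else 0.0 for k in range(n)]
    d = e[:]
    for _ in range(max_iter):
        # (P^T d)[j] = sum_i P[i][j] * d[i]
        nxt = [a * sum(P[i][j] * d[i] for i in range(n)) + (1 - a) * e[j]
               for j in range(n)]
        if max(abs(x - y) for x, y in zip(nxt, d)) < tol:
            return nxt
        d = nxt
    return d

points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]  # made-up feature vectors
P = transition_matrix(gaussian_weights(points))
scores = rwr(P, seed=0)  # scores[j]: steady-state relevance of node j to node 0
```

The iteration is used here instead of the matrix inverse in Eq. (5) only to keep the sketch free of third-party dependencies; for a < 1 both should converge to the same vector.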
Collecting the steady-state vectors for all genes gives the matrix \(S\), which stores the correlation scores between each pair of genes in the network \(N(V, E)\).

Calculating the similarity between two GO terms

Following the method proposed by Peng et al. in their previous work [19], the similarity between two terms is calculated by combining information from the gene functional network and GO. Let \(t_1\) and \(t_2\) be the two terms, and let \(G_1\) and \(G_2\) be the gene sets annotated by \(t_1\) and \(t_2\), respectively. The gene set distance \(D(t_1, t_2)\), which is used to compute the similarity between the gene sets annotated by \(t_1\) and \(t_2\), is defined as:

$$D(t_1, t_2) = \frac{\sum_{g_i \in G_1} \prod_{g_j \in G_2} d_{ij} + \sum_{g_j \in G_2} \prod_{g_i \in G_1} d_{ij}}{2\left|G_1 \cup G_2\right| - \sum_{g_i \in G_1} \prod_{g_j \in G_2} d_{ij} - \sum_{g_j \in G_2} \prod_{g_i \in G_1} d_{ij}} \tag{6}$$

Here \(d_{ij} = 1 - S_{ij}\) is the distance score between genes \(i\) and \(j\), and \(S_{ij}\) is their correlation score calculated with the random walk with restart (RWR) method. The gene set distance for all term pairs is normalized between 0 and 1. The similarity between two terms is then computed based on path-constrained annotations, denoted \(U\).
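The gene set distance of Eq. (6) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the inner products are read as running over the genes of the other set, a gene is taken to have distance 0 to itself, gene pairs missing from the score table are given a correlation score of 0, and the function name and the dictionary layout of the scores are hypothetical. The final normalization across all term pairs is not shown.

```python
from math import prod

def gene_set_distance(genes1, genes2, scores):
    """Eq. (6) with d_ij = 1 - S_ij.

    genes1, genes2: sets of gene identifiers annotated by t1 and t2.
    scores: dict mapping (gene_i, gene_j) to the RWR correlation score S_ij.
    """
    def d(i, j):
        if i == j:
            return 0.0                          # assumption: zero self-distance
        return 1.0 - scores.get((i, j), 0.0)    # assumption: missing pairs score 0

    # Sum over genes of one set of the product of distances to the other set.
    cross = (sum(prod(d(i, j) for j in genes2) for i in genes1)
             + sum(prod(d(i, j) for i in genes1) for j in genes2))
    return cross / (2 * len(genes1 | genes2) - cross)

# Two single-gene terms whose genes have correlation score 0.5.
example = gene_set_distance({"g1"}, {"g2"}, {("g1", "g2"): 0.5})
```

With these assumptions, identical gene sets get distance 0 and disjoint gene sets with no network association get distance 1, which matches the intended range of the measure.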
In traditional methods based on the lowest common ancestor (LCA), all descendants of the LCA are considered. The path-constrained annotation method, in contrast, only uses the terms most relevant to the terms being compared. The path-constrained annotation set \(U(t_1, t_2, p)\) consists of three components: the genes annotated by terms \(t_1\) and \(t_2\), the genes annotated by their common ancestor \(p\), and the genes annotated by the terms on the paths from \(t_1\) or \(t_2\) to \(p\). Let \(t_1\) and \(t_2\) be the two GO terms and \(p\) their common ancestor. The similarity between \(t_1\) and \(t_2\) is defined by the equation proposed by Peng et al. in their previous work [19]:

$$S(t_1, t_2) = \frac{2\log|G| - 2\log f(t_1, t_2, p)}{2\log|G| - \left(\log|G_1| + \log|G_2|\right)} \times \left(1 - \frac{h(t_1, t_2)}{|G|} \times \frac{|G_p|}{|G|}\right) \tag{7}$$

where \(G_p\) (or \(G\)) is the gene set annotated by the common ancestor term \(p\) (or the root term) and its descendants. The function \(f(t_1, t_2, p)\) calculates the similarity based on path-constrained annotations and is defined as:

$$f(t_1, t_2, p) = D(t_1, t_2)^{2} \times \left|U(t_1, t_2, p)\right| + \left(1 - D(t_1, t_2)^{2}\right) \times \sqrt{|G_1| \times |G_2|} \tag{8}$$

The function \(h(t_1, t_2)\) measures the specificity of the common parent and is defined as:

$$h(t_1, t_2) = D(t_1, t_2)^{2} \times |G| + \left(1 - D(t_1, t_2)^{2}\right) \times \max\left(|G_1|, |G_2|\right) \tag{9}$$

In Eq. (7), the left factor measures the distance from terms \(t_1\) and \(t_2\) to \(p\), while the right factor measures the distance from \(p\) to the root. If there are multiple lowest common ancestors, the highest score is chosen as the similarity between \(t_1\) and \(t_2\).

Measuring the similarity between two genes

Once the similarity between all Gene Ontology terms for a given species has been calculated, the similarity between any two genes, each annotated with at least one GO term, can be computed. Given two genes \(g_i\) and \(g_j\) and the sets of terms \(T_i\) and \(T_j\) annotating them, the method proposed in this study uses a "leave-one-out" approach to avoid errors caused by data circularity. Gene similarity is calculated with the method proposed by Wang et al. [20], in which the similarities of multiple term pairs are accumulated, as shown in Eq. (10):

$$GS(g_i, g_j) = \frac{\sum_{t \in T_i} \mathrm{Sim}(t, T_j) + \sum_{t \in T_j} \mathrm{Sim}(t, T_i)}{|T_i| + |T_j|} \tag{10}$$

For each term \(t\) belonging to the set \(T_x\), \(\mathrm{Sim}(t, T_y)\) is the maximum similarity between \(t\) and the terms in the set \(T_y\), as shown in Eq. (11).
$$\mathrm{Sim}(t, T_y) = \max_{t_y \in T_y} S(t, t_y) \tag{11}$$

Here \(S(t, t_y)\) is the maximum value of \(S(t, t_y, p)\) over all lowest common ancestors \(p\) of \(t\) and \(t_y\).

3. Results and Discussion

Data Preparation

The RWRSM algorithm was implemented primarily in Java using the JUNG library [21] (jung.sourceforge.net). The Gene Ontology (GO) data was downloaded from the official Gene Ontology website (www.geneontology.org/GO.downloads.shtml), using the November 2024 version. Additional data files for this study were downloaded from several bioinformatics databases. GO annotation files for Saccharomyces cerevisiae (yeast) and Arabidopsis were obtained from the current Gene Ontology Consortium product page. Gene functional network files for yeast and Arabidopsis were downloaded from the YeastNet [22] and AraNet [23] websites, respectively. The EC group files for yeast and Arabidopsis were downloaded from www.yeastgenome.org/ and http://ftp.plantcyc.org/Pathways.

From the annotation files for yeast and Arabidopsis, the study selected the identifier IDs, GO term IDs, and evidence codes. Detailed information for each term was extracted from the GO files, and EC numbers along with their corresponding gene names were extracted from the EC files. For Arabidopsis, gene names were additionally read from both the annotation and EC files. Using the go-basic namespace, genes annotated with all three GO domains (Molecular Function, Biological Process, and Cellular Component) were filtered to ensure the comprehensiveness and consistency of the dataset.
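The extraction step described above can be sketched as follows, assuming the annotation files are in the tab-separated GAF 2.x format distributed by the Gene Ontology Consortium, where column 2 is the object identifier, column 5 the GO term ID, column 7 the evidence code, and column 9 the aspect (P, F, or C). The function names and the sample rows are placeholders invented for illustration, and treating the filter as "keep genes that have terms in all three domains" is one reading of the text rather than a confirmed detail.

```python
import csv
import io

GO_ASPECTS = {"P", "F", "C"}  # biological process, molecular function, cellular component

def read_gaf(handle):
    """Yield (gene_id, go_id, evidence_code, aspect) tuples from GAF 2.x text."""
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not row or row[0].startswith("!") or len(row) < 9:
            continue  # skip blank lines, "!" header comments, and truncated rows
        yield row[1], row[4], row[6], row[8]

def terms_by_gene(records):
    """Group GO term IDs per gene and per aspect."""
    grouped = {}
    for gene_id, go_id, _evidence, aspect in records:
        grouped.setdefault(gene_id, {}).setdefault(aspect, set()).add(go_id)
    return grouped

def genes_with_all_aspects(grouped):
    """Return genes that carry at least one term in each of the three GO domains."""
    return {gene for gene, by_aspect in grouped.items()
            if GO_ASPECTS.issubset(by_aspect)}

# Placeholder rows (not real annotations), truncated to the first nine GAF columns.
sample = "\n".join([
    "!gaf-version: 2.2",
    "\t".join(["DB", "GENE_A", "a", "", "GO:0000001", "REF:1", "IDA", "", "P"]),
    "\t".join(["DB", "GENE_A", "a", "", "GO:0000002", "REF:1", "IEA", "", "F"]),
    "\t".join(["DB", "GENE_A", "a", "", "GO:0000003", "REF:1", "IDA", "", "C"]),
    "\t".join(["DB", "GENE_B", "b", "", "GO:0000001", "REF:1", "IMP", "", "P"]),
])
grouped = terms_by_gene(read_gaf(io.StringIO(sample)))
complete = genes_with_all_aspects(grouped)
```

For a real file, the same functions could be applied to an opened text file handle in place of the `io.StringIO` wrapper.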
Performance Evaluation Metrics

This study tests the model's performance on metabolic reaction networks by comparing the functional similarity between genes in non-adjacent metabolic reactions and genes in adjacent metabolic reactions. Performance scores are evaluated based on EC (Enzyme Commission) group information, where genes sharing the same EC number are assumed to have similar functions. Genes are grouped into categories according to their EC numbers (the complete 4-digit code). The next step is to test whether genes within the same category exhibit higher similarity. The logged fold change (LFC) metric [19] is used for quantitative evaluation. The LFC score for EC number category \(e_i\) is calculated as follows:

$$\mathrm{LFC}(e_i) = \frac{1}{|EC|} \times \sum_{e_j \in EC;\; G(e_j) \cap G(e_i) = \varnothing} \frac{\sum_{g \in G(e_i)} \mathrm{diff}_g(e_i, e_j)}{|G(e_i)|} \tag{12}$$

where \(G(e_i)\) is the set of genes labeled with category \(e_i\), the outer sum runs over the EC categories \(e_j\) whose gene sets share no gene with \(G(e_i)\), and \(\mathrm{diff}_g(e_i, e_j)\) is defined as follows:
$$\mathrm{diff}_g(e_i, e_j) = \ln \frac{\left|G(e_i)\right| \times \sum_{g' \in G(e_j)} \left(1 - GS(g, g') + c\right)}{\left|G(e_j)\right| \times \sum_{g^{*} \in G(e_i)} \left(1 - GS(g, g^{*}) + c\right)} \tag{13}$$

In Eq. (13), \(G(e_i)\) denotes the set of \(e_i\)-labeled genes excluding gene \(g\); \(c\) is the Laplace smoothing parameter; and \(GS(g, g')\) and \(GS(g, g^{*})\) are defined by Eq. (10). Eq. (13) measures the difference between the EC inter-group distance and the EC intra-group distance. By the definition of the LFC score in Eq. (12), a higher LFC score indicates better performance of the model.

Performance Evaluation Results

The performance of the proposed improved algorithm is evaluated by comparing, based on GO similarity, the relationships between genes in different EC categories with those between genes in the same category. Evaluation tests were conducted using the gene associations included in YeastNet and AraNet for Saccharomyces cerevisiae (yeast) and Arabidopsis. The LFC scores were used as the standard for comparison with the three leading algorithms in the field: NETSIM, NETSIM2, and RWRSM-2019 (based on 2019 data). The RWRSM-2024 algorithm performed the best across all tests.
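To make the LFC metric of Eqs. (12) and (13) concrete, here is a small Python sketch on a toy example. The function names, the toy gene-similarity function, the EC labels, and the choice c = 1.0 are all assumptions made for illustration (the value of the smoothing parameter is not stated in the text), and the sketch requires every EC group to contain at least two genes so that the intra-group sum is non-empty.

```python
import math

def diff_g(g, group_i, group_j, gs, c=1.0):
    """Eq. (13): log ratio of mean inter-group to mean intra-group distance for gene g."""
    rest = [x for x in group_i if x != g]            # e_i-labeled genes excluding g
    inter = sum(1 - gs(g, x) + c for x in group_j)   # smoothed distances to group e_j
    intra = sum(1 - gs(g, x) + c for x in rest)      # smoothed distances within e_i
    return math.log((len(rest) * inter) / (len(group_j) * intra))

def lfc(ei, groups, gs, c=1.0):
    """Eq. (12): per-gene average of diff_g, summed over disjoint groups, over |EC|."""
    group_i = groups[ei]
    total = 0.0
    for ej, group_j in groups.items():
        if ej == ei or set(group_i) & set(group_j):
            continue                                 # only groups sharing no gene with e_i
        total += sum(diff_g(g, group_i, group_j, gs, c) for g in group_i) / len(group_i)
    return total / len(groups)

# Toy data: two made-up EC groups and a similarity that favors same-group pairs.
groups = {"EC_A": ["a", "b"], "EC_B": ["c", "d"]}
label = {gene: ec for ec, genes in groups.items() for gene in genes}

def toy_gs(x, y):
    return 0.9 if label[x] == label[y] else 0.1

score = lfc("EC_A", groups, toy_gs)  # positive: same-group genes are closer than cross-group genes
```

A similarity measure that places same-EC genes closer together than genes from different EC groups yields a positive score, which is the sense in which a higher LFC indicates better performance.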
Comparing the LFC scores for each EC group across NETSIM, NETSIM2, RWRSM-2019, and the improved algorithm (RWRSM-2024) shows that RWRSM-2024 achieved the highest LFC scores in 75 out of 83 EC-encoded groups for biological process and 78 out of 83 for molecular function, accounting for 90.4% and 94.0% of all groups, respectively. In contrast, the other algorithms achieved the highest LFC score in less than 10% of the EC groups. The number of EC groups with the best LFC scores for yeast data under NETSIM, NETSIM2, RWRSM-2019, and RWRSM-2024, for the biological process and molecular function categories, is shown in Fig. 1.

For the yeast EC groups, the median, 75th percentile, and 25th percentile of the RWRSM-2024 LFC scores are the highest across all evaluation metrics. For biological process, these values are 2.32, 21.38, and 1.19, respectively, and for molecular function, they are 21.34, 21.82, and 1.83, all significantly higher than those of the other algorithms, as shown in Fig. 2. The line statistics for the three algorithms are shown in Fig. 3. The RWRSM-2024 line for biological process is stable around 2.3, while the line for molecular function stabilizes around 1.3. Compared with the NETSIM algorithm, the proposed algorithm identified more gene pairs with a similarity greater than 0.5, 0.6, and 0.7 in both biological process and molecular function, thus uncovering more effective gene associations, as shown in Fig. 4.

The RWRSM-2024 algorithm also outperforms the other algorithms on the Arabidopsis dataset. The median, 75th percentile, and 25th percentile LFC scores for RWRSM-2024 in Arabidopsis are 2.3, 22.1, and 1.2 for biological process, and 2.4, 22.3, and 1.5 for molecular function, respectively. These results show a clear advantage over the other algorithms, as shown in Fig. 5.
The performance is significantly higher than that of the previous NETSIM2 algorithm, indicating that incorporating the Gaussian kernel function into the random walk enhances the algorithm's performance.

4. Conclusion

Gene Ontology (GO) is one of the most widely used bioinformatics resources for describing the characteristics of genes and gene products. The calculation of gene functional similarity based on GO has been extensively applied in multiple research fields. However, low-quality similarity results can arise from incomplete GO information and limited annotations. NETSIM addresses these issues by incorporating gene associations, GO structure, and annotations. However, because NETSIM considers only direct links within the network, it uses only local association information from the gene co-function network. NETSIM2 was subsequently proposed to take the global structure of the network into account, but experimental results have shown that its LFC scores are not sufficiently stable.

In this study, we propose an improved algorithm based on NETSIM2, which incorporates the global structure of the co-function network using a random walk with restart (RWR) approach and integrates a Gaussian kernel function to generate the weight matrix. The algorithm consists of three steps. First, the weight matrix is obtained by fusing the Gaussian kernel function with the co-function network, and a correlation score matrix between genes is computed using the random walk with restart method. Second, the similarity between two GO terms is calculated by combining information from the co-function network and GO. Finally, the similarity between two genes is measured using a standard score-based method on the selected GO term pairs. Experimental results based on the EC classification show that the proposed algorithm performs best across all measurements for both yeast and Arabidopsis datasets.
The LFC results are more stable, and the algorithm uncovers a greater number of meaningful gene associations.

Declarations

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Availability of data and material
The source code and experimental data of this project can be accessed at the following URL: https://github.com/Zhouyanbing171/code-and-experimental-data

Competing interests
The authors declare no competing financial interests.

Funding
This research was supported by a research grant funded by the Science and Technology Major Project of Shenzhen (Project No. KJZD20230923114500002) and the Shenzhen Philosophy and Social Science Planning Project (Project No. SZ2022B014). The research was also supported by the "2023 Guangdong Province Undergraduate Teaching Quality and Teaching Reform Project - Curriculum Teaching and Research Office (Virtual Teaching and Research Office) - Teaching and Research Office on Business Digitalization and Intelligence Curriculum Group".

Authors' contributions
ZT and XJ designed the study and prepared the manuscript; JW, FL, and DY performed computational analyses; YZ vetted and polished the manuscript and submitted it. All authors have read and approved the final version of the manuscript.

Acknowledgements
Not applicable.

References

1. Gene Ontology Consortium. The Gene Ontology project in 2008. Nucleic Acids Research. 2008;36(Database issue):D440–D444.
2. Jiang Z, Han J, Wang Z. A dynamic QoS-aware Web service selection and composition optimization model. Chinese Journal of Computers. 2009;32(5):1014–1025. (in Chinese)
3. Li YH, Bandar ZA, McLean D. An approach for measuring semantic similarity between words using multiple information sources. IEEE T Knowl Data En. 2003;15(4):871–882.
4. Collins AM, Loftus EF. A spreading-activation theory of semantic processing. Psychological Review. 1975;82(6):407.
5. Herrero-Zazo M, Segura-Bedmar I, Hastings J, et al. Application of domain ontologies to natural language processing: a case study for drug-drug interactions. International Journal of Information Retrieval Research (IJIRR). 2015;5(3):19–38.
6. Inkpen D, Désilets A. Semantic similarity for detecting recognition errors in automatic speech transcripts. In: Proceedings of the Human Language Technology Conference. 2005.
7. Almasri M, Tan K, Berrut C, et al. Integrating semantic term relations into information retrieval systems based on language models. In: Information Retrieval Technology. Springer; 2014. p. 136–147.
8. Couto FM, Silva MJ, Coutinho PM. Measuring semantic similarity between Gene Ontology terms. Data & Knowledge Engineering. 2007;61(1):137–152.
9. Li Z, Liao X, Li Y, et al. Bayesian network disease sample classification algorithm based on gene association analysis. Journal of Computer Applications. 2024;44(11):3449–3458. (in Chinese)
10. Wu H, Su Z, Mao F, Olman V, Xu Y. Prediction of functional modules based on comparative genome analysis and gene ontology application. Nucleic Acids Res. 2005;33(9):2822–37.
11. Wu X, Pang E, Lin K, Pei Z-M. Improving the measurement of semantic similarity between gene ontology terms and gene products: insights from an edge- and IC-based hybrid method. PLoS ONE. 2013;8(5):66745.
12. Pesquita C, Faria D, Falcao AO, Lord P, Couto FM. Semantic similarity in biomedical ontologies. PLoS Comput Biol. 2009;5(7):1000443.
13. Sevilla JL, Segura V, Podhorski A, Guruceaga E, Mato JM, Martinez-Cruz LA, Corrales FJ, Rubio A. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB). 2005;2(4):330–8.
14. Peng J, Wang Y, Chen J. Towards integrative gene functional similarity measurement. BMC Bioinformatics. 2014;15(2):5.
15. Peng J, Li H, Liu Y, Juan L, Jiang Q, Wang Y, Chen J. InteGO2: a web tool for measuring and visualizing gene semantic similarities using gene ontology. BMC Genomics. 2016;17(5):530.
16. Lamesch P, Berardini TZ, Li D, Swarbreck D, Wilks C, Sasidharan R, Muller R, et al. The Arabidopsis Information Resource (TAIR): improved gene annotation and new tools. Nucleic Acids Res. 2011;40(D1):D1202–10.
17. Peng J, Uygun S, Kim T, et al. Measuring semantic similarities by combining Gene Ontology annotations and gene co-function networks. BMC Bioinformatics. 2015;16(1):1–14.
18. Peng J, Li H, Jiang Q, et al. An integrative approach for measuring semantic similarities using Gene Ontology. BMC Systems Biology. 2014;8(S5):S8.
19. Peng J, Zhang X, Hui W, et al. Improving the measurement of semantic similarity by combining Gene Ontology and co-functional network: a random walk based approach. BMC Systems Biology. 2018;(12):18.
20. Wang JZ, Du Z, Payattakool R, et al. A new method to measure the semantic similarity of GO terms. Bioinformatics. 2007;23(10):1274–1281.
21. O'Madadhain J, Fisher D, Smyth P, et al. Analysis and visualization of network data using JUNG. Journal of Statistical Software. 2005;10(2):1–35.
22. Lee I, Li Z, Marcotte EM. An improved, bias-reduced probabilistic functional gene network of baker's yeast, Saccharomyces cerevisiae. PLoS ONE. 2007;2(10):e0000988.
23. Lee I, Ambaru B, Thakkar P, et al. Rational association of genes with traits using a genome-scale gene network for Arabidopsis. Nature Biotechnology. 2010;28(2):149–156.
{"props":{"pageProps":{"initialData":{"identity":"rs-6537896","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":453660548,"identity":"eccb5a93-2005-4f63-b6de-9aacf0fde37e","order_by":0,"name":"Zhi Tang","email":"","orcid":"","institution":"Jinan University (Shenzhen Campus)","correspondingAuthor":false,"prefix":"","firstName":"Zhi","middleName":"","lastName":"Tang","suffix":""},{"id":453660549,"identity":"87150f4f-ff3b-4266-b8bb-3da4396f1d93","order_by":1,"name":"Xiuli Jing","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAA5ElEQVRIiWNgGAWjYBACxmYwZSEHIg3A7APEaZEwJl4LFEgkNsDZhLQwt/MefsHYJpE+v/8AQ3FhG4Mc340Exs8FeB3Gl2bBcEYid8OBAwzGM9sYjCVvJDBLz8CrhcfMgKECqIWxgcGYt40hccONBDZmHoJaDCTS5YHhANJST4wW4wdAWxIYjkG0JBgQYwsD0C+GG84wNhjznJMwnHnmYbM0Pi2G/WeMPzC22cjL9x8+ZsxTZiPPdzz54Ge8WhoY2KT/QCxsA0alBIjRgEcDA4M8MGo+QNnMD/AqHQWjYBSMghELANn7P7Y98fpDAAAAAElFTkSuQmCC","orcid":"","institution":"Jinan University (Shenzhen Campus)","correspondingAuthor":true,"prefix":"","firstName":"Xiuli","middleName":"","lastName":"Jing","suffix":""},{"id":453660550,"identity":"5db1dc06-8667-4cce-a65d-0eb3c206824e","order_by":2,"name":"Jian Wang","email":"","orcid":"","institution":"Jinan University (Shenzhen Campus)","correspondingAuthor":false,"prefix":"","firstName":"Jian","middleName":"","lastName":"Wang","suffix":""},{"id":453660551,"identity":"a8e3dfdf-5f31-461e-9b38-b6e4e0cfaf5b","order_by":3,"name":"Furong Lin","email":"","orcid":"","institution":"Jinan University (Shenzhen 
Campus)","correspondingAuthor":false,"prefix":"","firstName":"Furong","middleName":"","lastName":"Lin","suffix":""},{"id":453660552,"identity":"3ba3203d-e1b3-4c33-a9f9-683389203eb8","order_by":4,"name":"Yanbing Zhou","email":"","orcid":"","institution":"Jinan University (Shenzhen Campus)","correspondingAuthor":false,"prefix":"","firstName":"Yanbing","middleName":"","lastName":"Zhou","suffix":""},{"id":453660553,"identity":"387c1e75-f05f-4963-aadf-3718020de567","order_by":5,"name":"Dingwei Yang","email":"","orcid":"","institution":"Jinan University (Shenzhen Campus)","correspondingAuthor":false,"prefix":"","firstName":"Dingwei","middleName":"","lastName":"Yang","suffix":""}],"badges":[],"createdAt":"2025-04-27 04:08:07","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-6537896/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-6537896/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":82574804,"identity":"f597827a-7e23-4953-9dea-6c37c95e9c29","added_by":"auto","created_at":"2025-05-13 05:12:18","extension":"jpg","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":47493,"visible":true,"origin":"","legend":"\u003cp\u003eTop ECs in Saccharomyces cerevisiae (a) biological process and (b) molecular function.\u003c/p\u003e","description":"","filename":"1.jpg","url":"https://assets-eu.researchsquare.com/files/rs-6537896/v1/5ea57b1a93a7fad65de33f53.jpg"},{"id":82574807,"identity":"328b25fb-395e-451b-8d11-955f4f9ae90f","added_by":"auto","created_at":"2025-05-13 05:12:19","extension":"jpg","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":31845,"visible":true,"origin":"","legend":"\u003cp\u003eMethods comparison for (a) biological process and (b) molecular function in Saccharomyces 
cerevisiae.\u003c/p\u003e","description":"","filename":"2.jpg","url":"https://assets-eu.researchsquare.com/files/rs-6537896/v1/1f22dfa43c9a8dce80fa3723.jpg"},{"id":82574832,"identity":"c16da7af-efaa-4eda-bac7-fc4d118b495a","added_by":"auto","created_at":"2025-05-13 05:12:20","extension":"jpg","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":34749,"visible":true,"origin":"","legend":"\u003cp\u003eLFC score performance of 4 algorithms for (a) biological process and (b) molecular function in Saccharomyces cerevisiae.\u003c/p\u003e","description":"","filename":"3.jpg","url":"https://assets-eu.researchsquare.com/files/rs-6537896/v1/3a2c2f54394059538fd819c8.jpg"},{"id":82574811,"identity":"ecf139ea-f5b4-4368-abcf-d69ac8ee2d6a","added_by":"auto","created_at":"2025-05-13 05:12:19","extension":"jpg","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":33877,"visible":true,"origin":"","legend":"\u003cp\u003eNumber of valid gene similarity values in Saccharomyces cerevisiae for (a) biological process and (b) molecular function.\u003c/p\u003e","description":"","filename":"4.jpg","url":"https://assets-eu.researchsquare.com/files/rs-6537896/v1/9a21357b7aea0259012c2695.jpg"},{"id":82574824,"identity":"70a917ab-4d96-40e7-8c82-afadfe706f5f","added_by":"auto","created_at":"2025-05-13 05:12:19","extension":"jpg","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":35315,"visible":true,"origin":"","legend":"\u003cp\u003eMethods comparison for (a) biological process and (b) molecular function in Arabidopsis thaliana.\u003c/p\u003e","description":"","filename":"5.jpg","url":"https://assets-eu.researchsquare.com/files/rs-6537896/v1/dd289b75aef3f44ba676c72d.jpg"},{"id":83030603,"identity":"8f3615b6-c996-49da-8624-3db4a760ec04","added_by":"auto","created_at":"2025-05-19 
09:02:09","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":682546,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-6537896/v1/38f15119-8b9c-44cd-9c7b-a73d42e1a139.pdf"}],"financialInterests":"No competing interests reported.","formattedTitle":"An Improved Method for Gene Function Network Term Similarity Calculation","fulltext":[{"header":"1. Background","content":"\u003cp\u003eGene Ontology (GO) is one of the most successful ontologies in the field of biomedical research. It provides a standardized and precise vocabulary to describe various aspects of genes and gene products, including molecular functions, biological processes, and cellular components. GO has become widely used in various areas of biomedical research\u003csup\u003e[\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e]\u003c/sup\u003e, including gene function comparison and analysis, protein\u0026ndash;protein interaction prediction, and gene set enrichment analysis, making it an indispensable resource in biomedical ontologies. Research has shown that if the functions of two gene products are similar, the GO terms annotating them are likely to be closer in the ontology\u003csup\u003e[\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e]\u003c/sup\u003e. The computation of semantic similarity between two entities based on ontology has long been a popular research topic in computer science\u003csup\u003e[\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e]\u003c/sup\u003e and has a long history of study\u003csup\u003e[\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e]\u003c/sup\u003e. 
This concept has been extensively applied in areas such as natural language processing\u003csup\u003e[\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e]\u003c/sup\u003e, audio signal processing\u003csup\u003e[\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e]\u003c/sup\u003e, and information retrieval\u003csup\u003e[\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e]\u003c/sup\u003e. One important application of GO is to measure the semantic similarity between terms, which allows for the calculation of gene product similarity\u003csup\u003e[\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e]\u003c/sup\u003e. Therefore, designing efficient and accurate algorithms for calculating the similarity between genes based on Gene Ontology and uncovering hidden associations between gene pairs has become a key research direction in knowledge engineering and bioinformatics. These efforts provide essential tools for exploring functional relationships between genes\u003csup\u003e[\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e]\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eCurrently, several methods have been developed to measure the similarity between GO terms, which can be broadly categorized into edge-based, path-based, information content (IC)-based, node-based, and hybrid approaches\u003csup\u003e[\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e, \u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e]\u003c/sup\u003e. Edge-based methods calculate similarity based on the topological structure of GO, usually by determining the number of edges along the shortest path between two GO terms. 
However, edge-based methods rely entirely on the topology of the GO Directed Acyclic Graph (DAG) and cannot distinguish between terms at the same topological level\u003csup\u003e[\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e]\u003c/sup\u003e. Additionally, edges at the same depth may not reflect the same semantic distance. Node-based methods, on the other hand, use the properties of the query node and its ancestor or descendant nodes to indicate similarity. These similarities are often represented by the most informative common ancestor (MICA) and its information content (IC)\u003csup\u003e[\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e]\u003c/sup\u003e. Evaluation tests have shown that these results correlate with protein sequence similarity, but node-based methods only consider annotations and ignore the topological information of GO. In hybrid approaches, methods that incorporate more information from GO have been proposed. For example, the InteGO method integrates multiple existing similarity methods into a rank-based approach, called the seed method, to consider more aspects of GO\u003csup\u003e[\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e]\u003c/sup\u003e. The InteGO2 method selects the most suitable method using a voting mechanism and integrates them based on a metaheuristic search method\u003csup\u003e[\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e]\u003c/sup\u003e. Evaluation tests show that the hybrid approach outperforms the seed method. However, all of these methods rely solely on GO structural information and fail to address issues such as inaccurate representation and missing data in GO\u003csup\u003e[\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e]\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003ePeng et al. 
proposed a network-based similarity measure (NETSIM) that integrates gene associations with GO topology and annotations to address these issues\u003csup\u003e[\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e]\u003c/sup\u003e. Experimental results based on metabolic reaction networks showed that integrating gene associations enhances semantic similarity, but NETSIM only considers direct links in the network, using only a portion of the information from the gene co-function network. In reality, it is essential to consider not only directly connected gene pairs but also indirect gene interactions within the functional gene network. To address this limitation, a new network-based approach, NETSIM2, was proposed, which incorporates random walk methods to consider both direct and indirect interactions in the gene co-function network. Experiments on metabolic reaction networks showed that NETSIM2 outperforms all previous algorithms. However, the Local Function Consistency (LFC) values measured by NETSIM2 were not stable, with most LFC values being very small. To resolve this issue, this paper proposes the use of a Gaussian kernel function to compute edge weights and generate a normalized weighted matrix. This allows the random walk algorithm to calculate the correlation scores between gene pairs. Finally, based on the original NETSIM2 algorithm, term pair similarity is computed\u003csup\u003e[\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e]\u003c/sup\u003e. Compared to previous methods, our approach combines the Gaussian kernel function and random walk to calculate gene correlation scores, considering both direct and indirect interactions. This integration not only significantly improves the stability of LFC values but also uncovers more effective hidden gene pair associations.\u003c/p\u003e"},{"header":"2. 
Methods","content":"\u003cp\u003eThe RWRSM algorithm computes the semantic similarity between two genes in three steps. First, a weight matrix is constructed by applying the Gaussian kernel function to the gene co-function network, and random walk with restart (RWR) is run on it to obtain a matrix of correlation scores between genes. Second, the similarity between two GO terms is computed by combining information from the co-function network and GO. Finally, a similarity measure based on the Z-score method is applied to assess the similarity between the two genes based on the selected GO term pairs.\u003c/p\u003e \u003cp\u003e \u003cb\u003eFusion of the Gaussian Kernel Function to Calculate the Correlation Score Matrix Between Two Genes\u003c/b\u003e \u003c/p\u003e \u003cp\u003eThe graph for the random walk with restart is constructed as follows: each data point \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\text{d}\\in\\:\\text{D}\\)\u003c/span\u003e\u003c/span\u003e in the training set D is mapped to a node in the graph. Whether two nodes are connected by an edge depends on the degree of similarity between them. For computational convenience, a complete graph is typically constructed, where the similarity between two nodes serves as the weight of the edge between them; when that similarity is very small or zero, the edge weight is set to zero. Let the weighted graph be represented by \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\text{G}=(\\text{V},\\:\\text{E})\\)\u003c/span\u003e\u003c/span\u003e, where V is the set of nodes and E is the set of edges, with n nodes in total. 
The edge weights are calculated using the Gaussian kernel function, as shown in the following formula:\u003c/p\u003e \u003cdiv id=\"Equ1\" class=\"Equation\"\u003e \u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ1\" name=\"EquationSource\"\u003e\n$$W(i,j)=\\exp\\left(-\\frac{d_{ij}^{2}}{\\sigma^{2}}\\right)$$\u003c/div\u003e \u003cdiv class=\"EquationNumber\"\u003e1\u003c/div\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eWhere \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\sigma\\)\u003c/span\u003e\u003c/span\u003e is a parameter set empirically as the mean pairwise distance, \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\sigma=\\sum_{i\\ne j}\\left\\|x_{i}-x_{j}\\right\\|/\\left[n(n-1)\\right]\\)\u003c/span\u003e\u003c/span\u003e, and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(d_{ij}\\)\u003c/span\u003e\u003c/span\u003e is the Euclidean distance between nodes i and j in the graph, given by:\u003c/p\u003e \u003cdiv id=\"Equ2\" class=\"Equation\"\u003e \u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ2\" name=\"EquationSource\"\u003e\n$$d_{ij}^{2}=\\sum_{t=1}^{k}\\left(x_{i}^{t}-x_{j}^{t}\\right)^{2}$$\u003c/div\u003e \u003cdiv class=\"EquationNumber\"\u003e2\u003c/div\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eBefore applying the random walk with restart (RWR) algorithm, the transition probability matrix P of the graph G is first computed from the weight matrix:\u003cdiv id=\"Equ3\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ3\" 
name=\"EquationSource\"\u003e\n$$P(i,j)=\\left\\{\\begin{array}{l}W(i,j)/\\sum_{t=1}^{n}W(i,t),\\quad (V_{i},V_{j})\\in E\\\\ 0,\\quad \\text{otherwise}\\end{array}\\right.$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e3\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eThe correlation score between genes is computed as follows. First, the transition probability matrix P is calculated, which yields the normalized weight matrix \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(P^{T}\\)\u003c/span\u003e\u003c/span\u003e. The random walk with restart (RWR) is then described by:\u003cdiv id=\"Equ4\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ4\" name=\"EquationSource\"\u003e\n$$D_{i}(t+1)=a\\cdot P^{T}D_{i}(t)+(1-a)\\,e_{i}$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e4\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eWhere \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(e_{i}\\)\u003c/span\u003e\u003c/span\u003e represents the initial state. 
The steady-state solution of Eq.\u0026nbsp;(\u003cspan refid=\"Equ4\" class=\"InternalRef\"\u003e4\u003c/span\u003e) is:\u003cdiv id=\"Equ5\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ5\" name=\"EquationSource\"\u003e\n$$D_{i}=(1-a)\\left(I-aP^{T}\\right)^{-1}e_{i}$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e5\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eWhere \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(I\\)\u003c/span\u003e\u003c/span\u003e is the identity matrix, \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(D_{i}\\)\u003c/span\u003e\u003c/span\u003e is a \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\left|V\\right|\\times 1\\)\u003c/span\u003e\u003c/span\u003e vector, and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(e_{i}\\)\u003c/span\u003e\u003c/span\u003e is the \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\left|V\\right|\\times 1\\)\u003c/span\u003e\u003c/span\u003e initial vector. The quantity (1\u0026thinsp;\u0026minus;\u0026thinsp;a) is defined as the restart probability, which takes a value between 0 and 1. Subsequently, the matrix S is obtained, which stores the correlation scores between each pair of genes in the network \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\text{G}=(\\text{V},\\:\\text{E})\\)\u003c/span\u003e\u003c/span\u003e.\u003c/p\u003e \u003cp\u003e \u003cb\u003eCalculating the similarity between two GO terms\u003c/b\u003e \u003c/p\u003e \u003cp\u003eAccording to the method proposed by Peng et al. in their previous work\u003csup\u003e[\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e]\u003c/sup\u003e, the similarity between two terms is calculated by combining information from the gene functional network and GO. 
Let \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{t}}_{1}\\)\u003c/span\u003e\u003c/span\u003e and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{t}}_{2}\\)\u003c/span\u003e\u003c/span\u003e represent the two terms. The gene set distance \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\text{D}({\\text{t}}_{1},\\:{\\text{t}}_{2})\\)\u003c/span\u003e\u003c/span\u003e, which is used to compute the similarity between the gene sets annotated by \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{t}}_{1}\\)\u003c/span\u003e\u003c/span\u003e and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{t}}_{2}\\)\u003c/span\u003e\u003c/span\u003e, is defined as:\u003cdiv id=\"Equ6\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ6\" name=\"EquationSource\"\u003e\n$$D(t_{1},t_{2})=\\frac{\\sum_{g_{i}\\in G_{1}}\\prod_{g_{j}\\in G_{2}}d_{ij}+\\sum_{g_{j}\\in G_{2}}\\prod_{g_{i}\\in G_{1}}d_{ij}}{2\\left|G_{1}\\cup G_{2}\\right|-\\sum_{g_{i}\\in G_{1}}\\prod_{g_{j}\\in G_{2}}d_{ij}-\\sum_{g_{j}\\in G_{2}}\\prod_{g_{i}\\in G_{1}}d_{ij}}$$\u003c/div\u003e\u003cdiv 
class=\"EquationNumber\"\u003e6\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eLet \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{G}}_{1}\\)\u003c/span\u003e\u003c/span\u003e and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{G}}_{2}\\)\u003c/span\u003e\u003c/span\u003e represent the gene sets annotated by \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{t}}_{1}\\)\u003c/span\u003e\u003c/span\u003e and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{t}}_{2}\\)\u003c/span\u003e\u003c/span\u003e, respectively. \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{d}}_{\\text{i}\\text{j}}\\)\u003c/span\u003e\u003c/span\u003e is the distance score between two genes, where \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{d}}_{\\text{i}\\text{j}}=1-{\\text{S}}_{\\text{i}\\text{j}}\\)\u003c/span\u003e\u003c/span\u003e, and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{S}}_{\\text{i}\\text{j}}\\)\u003c/span\u003e\u003c/span\u003e is the correlation score between genes i and j calculated by the random walk with restart (RWR) method. The gene set distance for all term pairs is normalized between 0 and 1. The similarity between two terms is then computed based on path-constrained annotations labeled as U. In traditional methods based on the lowest common ancestor (LCA), all descendants of the LCA are considered. However, the path-constrained annotation method only uses terms most relevant to the terms being compared. 
The set of related terms consists of three components: the gene sets annotated by terms \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{t}}_{1}\\)\u003c/span\u003e\u003c/span\u003e and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{t}}_{2}\\)\u003c/span\u003e\u003c/span\u003e, the gene set \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\text{p}\\)\u003c/span\u003e\u003c/span\u003e annotated by their common parent node, and the descendant terms on the path from \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{t}}_{1}\\)\u003c/span\u003e\u003c/span\u003e or \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{t}}_{2}\\)\u003c/span\u003e\u003c/span\u003e to \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\text{p}\\)\u003c/span\u003e\u003c/span\u003e. Let \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{t}}_{1}\\)\u003c/span\u003e\u003c/span\u003e and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{t}}_{2}\\)\u003c/span\u003e\u003c/span\u003e represent the two GO terms, and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\text{p}\\)\u003c/span\u003e\u003c/span\u003e be their common ancestor. The similarity between \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{t}}_{1}\\)\u003c/span\u003e\u003c/span\u003e and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{t}}_{2}\\)\u003c/span\u003e\u003c/span\u003e is defined by the equation proposed by Peng et al. 
in their previous work\u003csup\u003e[\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e]\u003c/sup\u003e.\u003cdiv id=\"Equ7\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ7\" name=\"EquationSource\"\u003e\n$$S\\left(t_{1},t_{2},p\\right)=\\frac{2\\log\\left|G\\right|-2\\log f(t_{1},t_{2},p)}{2\\log\\left|G\\right|-(\\log\\left|G_{1}\\right|+\\log\\left|G_{2}\\right|)}\\times\\left(1-\\frac{h(t_{1},t_{2})}{\\left|G\\right|}\\times\\frac{\\left|G_{p}\\right|}{\\left|G\\right|}\\right)$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e7\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eWhere \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{G}}_{\\text{p}}\\)\u003c/span\u003e\u003c/span\u003e (or \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\text{G}\\)\u003c/span\u003e\u003c/span\u003e) is the gene set annotated by the common ancestor term \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\text{p}\\)\u003c/span\u003e\u003c/span\u003e (or the root term) and its descendants. 
In the equation, \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\text{f}\\:\\left({\\text{t}}_{1},\\:{\\text{t}}_{2},\\:\\text{p}\\right)\\)\u003c/span\u003e\u003c/span\u003ecalculates the similarity based on path-constrained annotations, and is defined as follows:\u003cdiv id=\"Equ8\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ8\" name=\"EquationSource\"\u003e\n$$\\:f\\:\\left({t}_{1},\\:{t}_{2},\\:p\\right)=D\\:{({t}_{1},\\:{t}_{2})}^{2}\\times\\:\\left|U\\:({t}_{1},\\:{t}_{2},\\:p)\\right|+(1-D\\:{({t}_{1},\\:{t}_{2})}^{2})\\times\\:\\sqrt{\\left|{G}_{1}\\right|\\times\\:\\left|{G}_{2}\\right|}$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e8\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003e \u003cspan class=\"InlineEquation\"\u003e \u003cspan class=\"mathinline\"\u003e\\(\\:\\text{h}\\:({\\text{t}}_{1},\\:{\\text{t}}_{2})\\)\u003c/span\u003e \u003c/span\u003e measures the specificity of the common parent, and is defined as follows:\u003cdiv id=\"Equ9\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ9\" name=\"EquationSource\"\u003e\n$$\\:\\text{h}\\:({\\text{t}}_{1},\\:{\\text{t}}_{2})=D\\:{({t}_{1},\\:{t}_{2})}^{2}\\times\\:\\left|G\\right|+(1-D\\:{({t}_{1},\\:{t}_{2})}^{2})\\times\\:max(\\left|{G}_{1}\\right|,\\:\\left|{G}_{2}\\right|)$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e9\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eIn Eq.\u0026nbsp;(\u003cspan refid=\"Equ9\" class=\"InternalRef\"\u003e9\u003c/span\u003e), the left side measures the distance from terms \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{t}}_{1}\\)\u003c/span\u003e\u003c/span\u003e and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{t}}_{2}\\)\u003c/span\u003e\u003c/span\u003e to \u003cspan class=\"InlineEquation\"\u003e\u003cspan 
class=\"mathinline\"\u003e\\(\\:\\text{p}\\)\u003c/span\u003e\u003c/span\u003e, while the right side calculates the distance from \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\text{p}\\)\u003c/span\u003e\u003c/span\u003e to the root. If there are multiple lowest common ancestors, the highest score is chosen as the similarity between \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{t}}_{1}\\)\u003c/span\u003e\u003c/span\u003e and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{t}}_{2}\\)\u003c/span\u003e\u003c/span\u003e.\u003c/p\u003e \u003cp\u003e \u003cb\u003eMeasuring the similarity between two genes\u003c/b\u003e \u003c/p\u003e \u003cp\u003eOnce the similarity between all Gene Ontology terms for a given species has been calculated, the similarity between any two genes, each annotated with at least one GO term, can be computed. Given two genes \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{g}}_{\\text{i}}\\)\u003c/span\u003e\u003c/span\u003e and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{g}}_{\\text{j}}\\)\u003c/span\u003e\u003c/span\u003e and the sets of terms \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{T}}_{\\text{i}}\\)\u003c/span\u003e\u003c/span\u003e and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{T}}_{\\text{j}}\\)\u003c/span\u003e\u003c/span\u003e annotating them, the method proposed in this study uses a \"leave-one-out\" approach to avoid errors caused by data circularity. 
To calculate gene similarity, the method proposed by Wang et al.\u003csup\u003e[\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e]\u003c/sup\u003e is used, where the similarity between multiple term pairs is accumulated to form the gene similarity, as shown in Eq.\u0026nbsp;(\u003cspan refid=\"Equ10\" class=\"InternalRef\"\u003e10\u003c/span\u003e).\u003cdiv id=\"Equ10\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ10\" name=\"EquationSource\"\u003e\n$$\\:\\text{G}\\text{S}\\:({\\text{g}}_{\\text{i}},\\:{\\text{g}}_{\\text{j}})=\\frac{{\\sum\\:}_{\\text{t}\\in\\:{\\text{T}}_{\\text{i}}}\\text{S}\\text{i}\\text{m}\\:(\\text{t},\\:{\\text{T}}_{\\text{j}})+{\\sum\\:}_{\\text{t}\\in\\:{\\text{T}}_{\\text{j}}}\\text{S}\\text{i}\\text{m}\\:(\\text{t},\\:{\\text{T}}_{\\text{i}})}{\\left|{\\text{T}}_{\\text{i}}\\right|+\\left|{\\text{T}}_{\\text{j}}\\right|}$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e10\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eIn the equation, for each term belonging to the set \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{T}}_{\\text{x}}\\)\u003c/span\u003e\u003c/span\u003e, \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\text{S}\\text{i}\\text{m}\\:(\\text{t},\\:{\\text{T}}_{\\text{y}})\\)\u003c/span\u003e\u003c/span\u003e represents the maximum similarity between term \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\text{t}\\)\u003c/span\u003e\u003c/span\u003e and the terms in the set \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{T}}_{\\text{y}}\\)\u003c/span\u003e\u003c/span\u003e, as shown in Eq.\u0026nbsp;(\u003cspan refid=\"Equ11\" class=\"InternalRef\"\u003e11\u003c/span\u003e).\u003cdiv id=\"Equ11\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" 
id=\"FileID_Equ11\" name=\"EquationSource\"\u003e\n$$\\:\\text{S}\\text{i}\\text{m}\\:(\\text{t},\\:{\\text{T}}_{\\text{y}})={\\text{m}\\text{a}\\text{x}}_{{\\text{t}}_{\\text{y}}\\in\\:{\\text{T}}_{\\text{y}}}\\:\\text{S}\\:(\\text{t},\\:{\\text{t}}_{\\text{y}})$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e11\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eIn the equation, \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\text{S}\\:(\\text{t},\\:{\\text{t}}_{\\text{y}})\\)\u003c/span\u003e\u003c/span\u003e represents the maximum value of \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\text{S}\\:(\\text{t},\\:{\\text{t}}_{\\text{y}},\\:\\text{p})\\)\u003c/span\u003e\u003c/span\u003e over all the lowest common ancestors of \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\text{t}\\)\u003c/span\u003e\u003c/span\u003e and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{t}}_{\\text{y}}\\)\u003c/span\u003e\u003c/span\u003e.\u003c/p\u003e"},{"header":"3. Results and Discussion","content":"\u003cp\u003e \u003cb\u003eData Preparation\u003c/b\u003e \u003c/p\u003e \u003cp\u003eThe RWRSM algorithm was primarily implemented using Java and the JUNG library\u003csup\u003e[\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e]\u003c/sup\u003e (jung.sourceforge.net). The Gene Ontology (GO) data were downloaded from the official Gene Ontology website (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ewww.geneontology.org/GO.downloads.shtml\u003c/span\u003e\u003cspan address=\"http://www.geneontology.org/GO.downloads.shtml\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e), using the November 2024 version. Additionally, key data files for this study were downloaded from multiple bioinformatics databases. 
GO annotation files for Saccharomyces cerevisiae (yeast) and Arabidopsis were obtained from the current Gene Ontology Consortium product page. Gene functional network files for yeast and Arabidopsis were downloaded from the YeastNet\u003csup\u003e[\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e]\u003c/sup\u003e and AraNet\u003csup\u003e[\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e]\u003c/sup\u003e websites, respectively. The EC group files for yeast and Arabidopsis used in the study were downloaded from \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ewww.yeastgenome.org/\u003c/span\u003e\u003cspan address=\"http://www.yeastgenome.org/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e and \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttp://ftp.plantcyc.org/Pathways\u003c/span\u003e\u003cspan address=\"http://ftp.plantcyc.org/Pathways\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/p\u003e \u003cp\u003eThe study selected key information from the annotation files for yeast and Arabidopsis, including identifier IDs, GO term IDs, and evidence codes. Detailed information for each term was extracted from the GO files, and EC numbers along with their corresponding gene names were extracted from the EC files. Specifically, for Arabidopsis, gene names from both the annotation and EC files were additionally read. 
Using the go-basic nameSpace, genes annotated with all three GO domains\u0026mdash;Molecular Function, Biological Process, and Cellular Component\u0026mdash;were filtered to ensure the comprehensiveness and consistency of the dataset.\u003c/p\u003e \u003cp\u003e \u003cb\u003ePerformance Evaluation Metrics\u003c/b\u003e \u003c/p\u003e \u003cp\u003eThis study tests the model's performance based on metabolic reaction networks by comparing the functional similarity between genes in non-adjacent metabolic reactions and genes in adjacent metabolic reactions. Performance scores are evaluated based on EC (Enzyme Commission) group information, where genes sharing the same EC number are assumed to have similar functions. Genes are grouped into different categories according to their EC numbers (the complete 4-digit code). The next step is to test whether genes within the same category exhibit higher similarity. Mathematically, the logged fold change (LFC) metric\u003csup\u003e[\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e]\u003c/sup\u003e is used for quantitative evaluation. 
The LFC score for EC number category \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{e}}_{\\text{i}}\\)\u003c/span\u003e\u003c/span\u003e is calculated as follows:\u003cdiv id=\"Equ12\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ12\" name=\"EquationSource\"\u003e\n$$\\:\\text{L}\\text{F}\\text{C}\\:\\left({\\text{e}}_{\\text{i}}\\right)=\\frac{1}{\\left|\\text{E}\\text{C}\\right|}\\times\\:\\sum\\:_{{\\text{e}}_{\\text{j}}\\in\\:\\text{E}\\text{C};\\text{G}\\left({\\text{e}}_{\\text{j}}\\right)\\cap\\:\\text{G}\\left({\\text{e}}_{\\text{i}}\\right)={\\varnothing}}\\frac{{\\sum\\:}_{\\text{g}\\in\\:\\text{G}\\left({\\text{e}}_{\\text{i}}\\right)}{\\text{d}\\text{i}\\text{f}\\text{f}}_{\\text{g}}\\:({\\text{e}}_{\\text{i}},\\:{\\text{e}}_{\\text{j}})}{\\left|\\:\\text{G}\\:\\left({\\text{e}}_{\\text{i}}\\right)\\:\\right|}$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e12\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eWhere \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\text{G}\\:\\left(\\:{\\text{e}}_{\\text{i}}\\:\\right)\\)\u003c/span\u003e\u003c/span\u003e is the set of genes labeled with the \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{e}}_{\\text{i}}\\)\u003c/span\u003e\u003c/span\u003e category, and the outer sum runs over the EC categories \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{e}}_{\\text{j}}\\)\u003c/span\u003e\u003c/span\u003e whose gene sets satisfy \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\text{G}\\:\\left({\\text{e}}_{\\text{i}}\\right)\\:\\cap\\:\\:\\text{G}\\left({\\text{e}}_{\\text{j}}\\right)\\:=\\:{\\varnothing}\\)\u003c/span\u003e\u003c/span\u003e; \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{d}\\text{i}\\text{f}\\text{f}}_{\\text{g}}\\:({\\text{e}}_{\\text{i}},\\:{\\text{e}}_{\\text{j}})\\)\u003c/span\u003e\u003c/span\u003e is defined as follows:\u003cdiv id=\"Equ13\" 
class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ13\" name=\"EquationSource\"\u003e\n$$\\:{\\text{d}\\text{i}\\text{f}\\text{f}}_{\\text{g}}\\:({\\text{e}}_{\\text{i}},\\:{\\text{e}}_{\\text{j}})\\:=\\:\\text{ln}\\frac{\\left|\\text{G}\\:\\left({\\text{e}}_{\\text{i}}\\right)\\right|\\times\\:{\\sum\\:}_{{\\text{g}}^{{\\prime\\:}}\\in\\:\\text{G}\\left({\\text{e}}_{\\text{j}}\\right)}(1-\\text{G}\\text{S}(\\text{g},\\:{\\text{g}}^{{\\prime\\:}})+\\text{c})}{\\left|\\text{G}\\:\\left({\\text{e}}_{\\text{j}}\\right)\\right|\\times\\:{\\sum\\:}_{{\\text{g}}^{\\ast\\:}\\in\\:\\text{G}\\left({\\text{e}}_{\\text{i}}\\right)}(1-\\text{G}\\text{S}(\\text{g},\\:{\\text{g}}^{\\ast\\:})+\\text{c})}$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e13\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eWhere \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\text{G}\\left({\\text{e}}_{\\text{i}}\\right)\\)\u003c/span\u003e\u003c/span\u003e is the set of \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:{\\text{e}}_{\\text{i}}\\)\u003c/span\u003e\u003c/span\u003e-labeled genes excluding gene \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\text{g}\\)\u003c/span\u003e\u003c/span\u003e; \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\text{c}\\)\u003c/span\u003e\u003c/span\u003e is the Laplace smoothing parameter; \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\text{G}\\text{S}\\:(\\text{g},{\\text{g}}^{{\\prime\\:}})\\)\u003c/span\u003e\u003c/span\u003e and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\text{G}\\text{S}\\:(\\text{g},{\\text{g}}^{\\ast\\:})\\)\u003c/span\u003e\u003c/span\u003e are defined by Eq.\u0026nbsp;(\u003cspan refid=\"Equ10\" class=\"InternalRef\"\u003e10\u003c/span\u003e). 
Eq.\u0026nbsp;(\u003cspan refid=\"Equ13\" class=\"InternalRef\"\u003e13\u003c/span\u003e) measures the log ratio of the mean EC inter-group distance to the mean EC intra-group distance. Based on the definition of the logged fold change (LFC) score in Eq.\u0026nbsp;(\u003cspan refid=\"Equ12\" class=\"InternalRef\"\u003e12\u003c/span\u003e), a higher LFC score indicates better performance of the model.\u003c/p\u003e \u003cp\u003e \u003cb\u003ePerformance Evaluation Results\u003c/b\u003e \u003c/p\u003e \u003cp\u003eThe performance of the proposed improved algorithm is evaluated by comparing the relationships between genes in different EC categories and within the same category, based on GO similarity. Evaluation tests were conducted using the gene associations included in YeastNet and AraNet for Saccharomyces cerevisiae (yeast) and Arabidopsis. The LFC scores were used as a standard for comparison with the top three algorithms in the field\u0026mdash;NETSIM, NETSIM2, and RWRSM-2019 (based on 2019 data).\u003c/p\u003e \u003cp\u003eThe RWRSM-2024 algorithm performed best across all tests. Specifically, when comparing the LFC scores for each EC group between NETSIM, NETSIM2, RWRSM-2019, and the improved algorithm (RWRSM-2024), the results showed that RWRSM-2024 achieved the highest LFC scores in 75 out of 83 EC-encoded groups for biological process and 78 out of 83 for molecular function, accounting for 90.4% and 94.0% of all groups, respectively. In contrast, the other algorithms achieved the highest LFC score in less than 10% of the EC groups.\u003c/p\u003e \u003cp\u003eThe number of EC groups with the best LFC scores for yeast data in NETSIM, NETSIM2, RWRSM-2019, and RWRSM-2024, for the biological process and molecular function categories, is shown in Fig.\u0026nbsp;1. For the yeast EC groups, the median, 75th, and 25th percentiles of the RWRSM-2024 LFC scores are the highest across all evaluation metrics. 
Specifically, for biological process, these values are 2.32, 21.38, and 1.19, respectively, and for molecular function, they are 21.34, 21.82, and 1.83, all significantly higher than those of other algorithms, as shown in Fig.\u0026nbsp;2. The line statistics for the three algorithms are shown in Fig.\u0026nbsp;3. The RWRSM-2024 algorithm's line for biological process is stable around 2.3, while the line for molecular function stabilizes around 1.3. Compared with the NETSIM algorithm, the proposed algorithm identified more gene pairs with a similarity greater than 0.5, 0.6, and 0.7 in both biological process and molecular function, thus uncovering more effective gene associations, as shown in Fig.\u0026nbsp;4.\u003c/p\u003e \u003cp\u003eThe RWRSM-2024 algorithm outperforms other algorithms in the Arabidopsis dataset. Specifically, the median, 75th, and 25th percentile LFC scores for RWRSM-2024 in Arabidopsis are 2.3, 22.1, and 1.2 for biological process, and 2.4, 22.3, and 1.5 for molecular function, respectively. These results show a clear advantage over the other algorithms, as shown in Fig.\u0026nbsp;5. The performance is significantly higher than that of the previous NETSIM2 algorithm, indicating that incorporating the Gaussian kernel function into the random walk enhances the algorithm's performance.\u003c/p\u003e "},{"header":"4. Conclusion","content":"\u003cp\u003eGene Ontology (GO) is one of the most widely used bioinformatics resources for describing the characteristics of genes and gene products. The calculation of gene functional similarity based on GO has been extensively applied in multiple research fields. However, low-quality similarity results can arise due to incomplete GO information and limited annotations. NETSIM addresses these issues by incorporating gene associations, GO structure, and annotations. However, it only uses local association information from the gene co-function network, as NETSIM considers only direct links within the network. 
Subsequently, NETSIM2 was proposed, which takes into account the global structure of the network. However, experimental results have shown that the LFC scores are not sufficiently stable.\u003c/p\u003e \u003cp\u003eIn this study, we propose a new improved algorithm based on the NETSIM2 network, which incorporates the global structure of the co-function network using a random walk with restart (RWR) approach. The algorithm integrates a Gaussian kernel function to generate the weight matrix. The algorithm consists of three steps: First, the weight matrix is obtained by fusing the Gaussian kernel function with the co-function network, and a correlation score matrix between two genes is computed using the random walk with restart method. Second, the similarity between two GO terms is calculated by combining information from the co-function network and GO. Finally, the similarity between two genes is measured using a standard score-based method on the selected GO term pairs.\u003c/p\u003e \u003cp\u003eExperimental results based on the EC classification show that the proposed algorithm performs best across all measurements for both yeast and Arabidopsis datasets. 
The LFC results are more stable, and the algorithm uncovers a greater number of meaningful gene associations.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003eEthics approval and consent to participate\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eNot applicable.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eConsent for publication\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eNot applicable.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eAvailability of data and material\u003c/p\u003e\n\u003cp\u003eThe source code and experimental data of this project can be accessed at the following URL: https://github.com/Zhouyanbing171/code-and-experimental-data\u003c/p\u003e\n\u003cp\u003eCompeting interests\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eThe authors declare no competing financial interests.\u003c/p\u003e\n\u003cp\u003eFunding\u003c/p\u003e\n\u003cp\u003eThis research was supported by a research grant funded by the Science and Technology Major Project of Shenzhen (Project No. KJZD20230923114500002) and the Shenzhen Philosophy and Social Science Planning Project (Project No. SZ2022B014). The research was also supported by \u0026ldquo;2023 Guangdong Province Undergraduate Teaching Quality and Teaching Reform Project - Curriculum Teaching and Research Office (Virtual Teaching and Research Office) - Teaching and Research Office on Business Digitalization and Intelligence Curriculum Group\u0026rdquo;.\u003c/p\u003e\n\u003cp\u003eAuthors\u0026apos; contributions\u003c/p\u003e\n\u003cp\u003eZT and XJ designed the study and prepared the manuscript; JW, FL, and DY performed computational analyses; YZ vetted and polished the manuscript and submitted it. All authors have read and approved the final version of the manuscript.\u003c/p\u003e\n\u003cp\u003eAcknowledgements\u003c/p\u003e\n\u003cp\u003eNot applicable. 
\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eGene Ontology Consortium. The Gene Ontology Project in 2008 [J]. Nucleic Acids Research, 2008, 36(Database Issue): D440-D444.\u003c/li\u003e\n\u003cli\u003eJiang Zheyuan, Han Jianghong, Wang Zhao. A dynamic QoS-aware Web service selection and composition optimization model [J]. Chinese Journal of Computers, 2009, 32(5): 1014-1025.\u003c/li\u003e\n\u003cli\u003eLI Y H, BANDAR Z A, MCLEAN D. An approach for measuring semantic similarity between words using multiple information sources [J]. Ieee T Knowl Data En, 2003, 15(4): 871-882.\u003c/li\u003e\n\u003cli\u003eCOLLINS A M, LOFTUS E F. A spreading-activation theory of semantic processing [J]. Psychological review, 1975, 82(6): 407.\u003c/li\u003e\n\u003cli\u003eHERRERO-ZAZO M, SEGURA-BEDMAR I, HASTINGS J, et al. Application of Domain Ontologies to Natural Language Processing: A Case Study for Drug-Drug Interactions [J]. International Journal of Information Retrieval Research (IJIRR),2015, 5(3): 19-38.\u003c/li\u003e\n\u003cli\u003eINKPEN D, D\u0026Eacute;SILETS A. Semantic similarity for detecting recognition errors in automatic speech transcripts; proceedings of the Human Language Technology Conference, F, 2005 [C].\u003c/li\u003e\n\u003cli\u003eALMASRI M, TAN K, BERRUT C, et al. Integrating Semantic Term Relations into Information Retrieval Systems Based on Language Models [M]. Information Retrieval Technology. Springer. 2014: 136-147.\u003c/li\u003e\n\u003cli\u003eCouto FM, Silva MJ, Coutinho PM. Measuring Semantic Similarity between Gene Ontology Terms [J]. Data \u0026amp; Knowledge Engineering, 2007, 61(1): 137-152.\u003c/li\u003e\n\u003cli\u003eLi Zhijie, Liao Xuhong, Li Yuanxiang, et al. A Bayesian network disease sample classification algorithm based on gene association analysis [J]. Journal of Computer Applications, 2024, 44(11): 3449-3458.\u003c/li\u003e\n\u003cli\u003eWu H, Su Z, Mao F, Olman V, Xu Y. Prediction of functional modules based on comparative genome analysis and gene ontology application. Nucleic Acids Res. 2005;33(9):2822\u0026ndash;37. \u003c/li\u003e\n\u003cli\u003eWu X, Pang E, Lin K, Pei Z-M. 
Improving the measurement of semantic similarity between gene ontology terms and gene products: insights from an edge-and ic-based hybrid method. PloS ONE. 2013;8(5):66745.\u003c/li\u003e\n\u003cli\u003ePesquita C, Faria D, Falcao AO, Lord P, Couto FM. Semantic similarity in biomedical ontologies. PLoS Comput Biol. 2009;5(7):1000443.\u003c/li\u003e\n\u003cli\u003eSevilla JL, Segura V, Podhorski A, Guruceaga E, Mato JM, Martinez-Cruz LA, Corrales FJ, Rubio A. Correlation between gene expression and GO semantic similarity. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB). 2005;2(4):330\u0026ndash;8.\u003c/li\u003e\n\u003cli\u003ePeng J, Wang Y, Chen J. Towards integrative gene functional similarity measurement. BMC Bioinformatics. 2014;15(2):5.\u003c/li\u003e\n\u003cli\u003ePeng J, Li H, Liu Y, Juan L, Jiang Q, Wang Y, Chen J. Intego2: a web tool for measuring and visualizing gene semantic similarities using gene ontology. BMC Genomics. 2016;17(5):530.\u003c/li\u003e\n\u003cli\u003eLamesch P, Berardini TZ, Li D, Swarbreck D, Wilks C, Sasidharan R, Muller R, et al. The arabidopsis information resource (TAIR): improved gene annotation and new tools. Nucleic Acids Res. 2011;40(D1):D1202\u0026ndash;10.\u003c/li\u003e\n\u003cli\u003ePeng J, Uygun S, Kim T, et al. Measuring Semantic Similarities by Combining Gene Ontology Annotations and Gene Co-function Networks [J]. BMC Bioinformatics, 2015, 16(1):I-14.\u003c/li\u003e\n\u003cli\u003ePeng J, Li H, Jiang Q, et al. An Integrative Approach for Measuring Semantic Similarities Using Gene Ontology [J]. BMC Systems Biology, 2014, 8(S5):S8.\u003c/li\u003e\n\u003cli\u003ePeng J, Zhang X, Hui W, et al. Improving the Measurement of Semantic Similarity by Combining Gene Ontology and Co-functional Network: a random walk based approach [J]. BMC Systems Biology, 2018(12):18.\u003c/li\u003e\n\u003cli\u003eWANG J Z, DU Z, PAYATTAKOOL R, et al. A new method to measure the semantic similarity of GO terms [J]. 
Bioinformatics, 2007, 23(10): 1274-1281.\u003c/li\u003e\n\u003cli\u003eO\u0026rsquo;MADADHAIN J, FISHER D, SMYTH P, et al. Analysis and visualization of network data using JUNG [J]. Journal of Statistical Software, 2005, 10(2): 1-35.\u003c/li\u003e\n\u003cli\u003eLEE I, LI Z, MARCOTTE E M. An improved, bias-reduced probabilistic functional gene network of baker\u0026apos;s yeast, Saccharomyces cerevisiae [J]. PloS one, 2007, 2(10): e0000988.\u003c/li\u003e\n\u003cli\u003eLEE I, AMBARU B, THAKKAR P, et al. Rational association of genes with traits using a genome-scale gene network for Arabidopsis [J]. Nature biotechnology, 2010, 28(2): 149-156.\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"Gene Ontology, term similarity, Random Walk with Restart, gene functional network","lastPublishedDoi":"10.21203/rs.3.rs-6537896/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-6537896/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003ch2\u003eBackground\u003c/h2\u003e \u003cp\u003eGene Ontology (GO) is an ontology based on bioinformatics resources that utilizes 
its structure to represent biological knowledge and describe the functions of genes and gene products. The computation of term similarity within Gene Ontology plays a critical role in various biological research areas, such as gene function analysis, comparison, and prediction. However, existing algorithms for term similarity calculation have several limitations and fail to fully exploit available information. In recent years, some studies have incorporated gene functional networks into term similarity calculations; however, these approaches typically focus only on directly connected genes, overlooking indirect relationships within the gene network and failing to make optimal use of all available data.\u003c/p\u003e\u003ch2\u003eResults\u003c/h2\u003e \u003cp\u003eIn this study, we propose a novel Gene Ontology term similarity algorithm based on a Random Walk with Restart (RWR) framework, enhanced by a Gaussian kernel function (RWRSM). This algorithm not only incorporates structural and annotation information from Gene Ontology but also captures global structural information from gene functional networks. We performed multiple experiments on yeast and Arabidopsis datasets using Enzyme Commission (EC) classification numbers.\u003c/p\u003e\u003ch2\u003eConclusion\u003c/h2\u003e \u003cp\u003eThe experimental results demonstrate that our proposed algorithm outperforms existing methods across all measures for both yeast and Arabidopsis datasets. 
Specifically, the logged fold change (LFC) results are more stable, and our method uncovers a greater number of meaningful gene associations.\u003c/p\u003e","manuscriptTitle":"An Improved Method for Gene Function Network Term Similarity Calculation","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-05-13 05:12:13","doi":"10.21203/rs.3.rs-6537896/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"c935b6e5-6518-4219-b56c-3fb2e1e6f569","owner":[],"postedDate":"May 13th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[],"tags":[],"updatedAt":"2025-05-19T08:53:46+00:00","versionOfRecord":[],"versionCreatedAt":"2025-05-13 05:12:13","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-6537896","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-6537896","identity":"rs-6537896","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
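
The gene similarity of Eqs. (10) and (11) and the LFC score of Eqs. (12) and (13) can be illustrated with a minimal Python sketch. This is not the authors' Java/JUNG implementation: the function names, the callable interfaces for term and gene similarity, and the default smoothing value c = 1.0 are assumptions made for illustration, since the paper does not state the value of c.

```python
import math
from typing import Callable, Mapping, Sequence

Sim = Callable[[str, str], float]  # pairwise similarity, assumed to lie in [0, 1]


def sim_to_set(t: str, terms: Sequence[str], term_sim: Sim) -> float:
    """Eq. (11): best match between term t and any term in the set."""
    return max(term_sim(t, ty) for ty in terms)


def gene_similarity(ti: Sequence[str], tj: Sequence[str], term_sim: Sim) -> float:
    """Eq. (10): best-match average over both annotation sets."""
    if not ti or not tj:
        raise ValueError("each gene needs at least one GO term")
    total = sum(sim_to_set(t, tj, term_sim) for t in ti)
    total += sum(sim_to_set(t, ti, term_sim) for t in tj)
    return total / (len(ti) + len(tj))


def diff_g(g: str, group_i: Sequence[str], group_j: Sequence[str],
           gs: Sim, c: float = 1.0) -> float:
    """Eq. (13): log ratio of mean inter-group to mean intra-group distance.

    group_i is used with g itself left out; c is the smoothing parameter
    (its value here is an assumption).
    """
    others = [x for x in group_i if x != g]
    if not others or not group_j:
        raise ValueError("both groups must be non-empty once g is excluded")
    inter = sum(1.0 - gs(g, x) + c for x in group_j)
    intra = sum(1.0 - gs(g, x) + c for x in others)
    return math.log((len(others) * inter) / (len(group_j) * intra))


def lfc(ei: str, ec_groups: Mapping[str, Sequence[str]],
        gs: Sim, c: float = 1.0) -> float:
    """Eq. (12): average diff_g over genes of e_i and over disjoint groups e_j."""
    gi = ec_groups[ei]
    total = 0.0
    for gj in ec_groups.values():
        if set(gi) & set(gj):  # skips e_i itself and any overlapping group
            continue
        total += sum(diff_g(g, gi, gj, gs, c) for g in gi) / len(gi)
    return total / len(ec_groups)


if __name__ == "__main__":
    # Toy gene similarity: genes whose names share a first letter are similar.
    toy_gs = lambda a, b: 0.8 if a[0] == b[0] else 0.2
    toy_groups = {"e1": ["g1", "g2", "g3"], "e2": ["h1", "h2"]}
    print(lfc("e1", toy_groups, toy_gs))
```

In this sketch a positive LFC means that genes in a category are, on average, closer to one another than to genes in disjoint categories, which matches the statement that a higher LFC score indicates better performance.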
