Generation of the 12-GO-Subsets to Interpretate Human Cellular Process | Research Square

Yirui Liu, Ruiqi Liu, Jiaming Hu, Yating Wang, Jingfang Zhang

This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-4581229/v1 This work is licensed under a CC BY 4.0 License. Status: Posted. Version 1 posted.

Abstract

As the Gene Ontology (GO) knowledgebase becomes more and more complicated, it is difficult for researchers to follow and get a comprehensive overview of biological processes. Here, we generated a classification strategy through carefully investigating the genes any two terms shared. Using this strategy, we categorized the 66 direct child terms of the cellular process into 12 major subsets, and the interactions among them were further confirmed by studying the protein-protein interaction based networks.
Subsequently, these 12 subsets were used to investigate the distribution of transcription factors, kinases, and several cancer genomes. Overall, the 12-GO-subsets give researchers a more comprehensive overview of the cellular process, and the categorizing strategy developed here can be used to characterize other large GO terms.

Biological sciences/Computational biology and bioinformatics Biological sciences/Systems biology gene ontology terms subsets cellular process

Introduction

The Gene Ontology (GO) is a robust knowledgebase that offers precisely defined ontology terms and associated annotations for a wide range of organisms [1–3]. The GO is organized into three main categories: molecular function (MF), cellular component (CC), and biological process (BP). Within each category, terms are linked through meaningful relationships such as ‘is_a’, ‘part_of’, ‘has_part’ or ‘regulates’. These terms and relationships collectively form a directed acyclic graph (DAG), a more flexible structure than a traditional hierarchy, as each term can have multiple parents. GO annotations associate genes or their products (such as proteins or noncoding RNAs) with GO terms. Each gene may have multiple annotations, reflecting its multifunctional nature. Since its inception, the GO knowledgebase has been diligently maintained and updated by the GO Consortium. It has become a gold-standard database for gene function reference, which empowers researchers to uncover biological significance through large-scale data analysis [3]. However, as biological data continue to expand rapidly, an increasing number of gene annotations have been added, which has made the GO database more and more complicated, with numerous terms and a deep, multi-branched structure [4].
This complexity poses a challenge for enrichment analysis, as the results often contain a long list of redundant GO terms and require further effort for proper interpretation. Many tools have been developed to reduce the redundancy in enrichment analysis, either by visualizing enriched terms in a local context of the DAG to clarify their relationships, such as GOMiner [5], BiNGO [6], GOLEM [7], GOrilla [8], NaviGO [9], WebGestalt [10, 11], and GOnet [12], or by clustering enriched terms into more condensed and representative categories using various algorithms, such as GOstats [13], RedundancyMiner [14], REVIGO [15], Enrichment Map [16], ClueGO [17], the parent-child approach [18], and GO-Proxy [19]. Unfortunately, very few bioinformatics studies have focused on a comprehensive overview of the entire structure of the GO itself [3, 4], which impedes researchers’ ability to gain a holistic understanding of biological systems. The GO Consortium has recently generated a manually trimmed database known as the ‘Generic GO subset’, which contains only 75 BP terms, 40 MF terms and 29 CC terms [3]. Manjang et al. reduced the DAG of human BP to a simplified one with only 39 nodes by classifying GO terms into three categories: leaf nodes, regular nodes, and jump nodes [4]. However, previous research has not attempted to group GO terms into larger categories to offer a comprehensive overview of biological systems. Here, we acquired the 66 direct child terms of human ‘cellular process’ (CP, GO:0009987) from AmiGO, 60 of which were non-empty, and regrouped them into 12 major GO subsets. These 12 subsets encompass most of the genes involved in CP, exhibit a degree of independence, and remain interconnected, suggesting their potential as the representative major subsets of CP.
Finally, we used our new categories to investigate the distribution of human transcription factors (TFs), kinases, and several types of cancer genome, each of which displayed a specific pattern. Overall, our study establishes a novel category system based on GO terms to support a comprehensive understanding of cellular functions.

Results

Retrieving all direct child terms of CP

We concentrated on CP instead of BP for the following reasons: 1) CP is the largest child term of BP and contains most of the genes in BP (76%). 2) Some child terms of BP also have counterparts in CP, for example ‘developmental process’ (GO:0032502) versus ‘cellular developmental process’ (GO:0048869), which results in a lot of redundancy and complicates the subsequent analysis. 3) Considering only terms directly connected with CP increases consistency, because all the terms discussed are at the same level. To get a general view of the whole cellular system, we omitted the deeply branched terms and downloaded only the 66 direct child terms of CP for Homo sapiens from AmiGO (https://amigo.geneontology.org/amigo/dd_browse). After excluding the empty ones, 60 terms were left. In addition, each term contains many repeats, because one gene can be annotated to a term with multiple annotations [2]. To avoid confusion, we defined the term size as the number of genes a term contains rather than its number of annotations. The 60 non-empty terms, all of which are direct children of CP, were then ordered by size (supplemental table 1). To our surprise, the sizes of the 60 terms differed greatly: the smallest term contained only one gene, while the biggest term was associated with more than half of the total genes in CP.
This uneven distribution is partially due to the fact that the GO describes the cellular functions of all organisms in one structure, so some features of other organisms are present here, e.g., ‘light absorption’ (GO:0016037). Poor characterization also leaves some terms small. In addition, the DAG structure allows deeply branched terms to connect with CP directly; however, many direct child terms of CP that have no branches of their own also contribute to this unevenness. The relationships among these direct child terms therefore need further consideration in order to find a better categorizing strategy that summarizes the major properties of CP.

Network visualization of term relationships under CP

To investigate the interactions among the 60 terms under CP, we generated a term interaction matrix by calculating the number of overlapped genes between any two terms (supplemental table 2). We then defined the relative dot link (RDL) between two terms as the ratio of the number of overlapped genes to the size of each term, respectively (see methods). By comparing the RDL with three predefined thresholds, 30%, 50%, and 70%, three levels of interaction were obtained. An RDL over 30% but less than 50% was defined as a weak connection, an RDL of 50–70% as a moderate connection, and an RDL over 70% as a strong connection. After excluding the small terms (fewer than 20 genes), as well as ‘positive regulation of cellular process’ (GO:0048522) and ‘negative regulation of cellular process’ (GO:0048523), which are two subsets redundant with ‘regulation of cellular process’ (GO:0050794), 43 terms were left. Additionally, we combined terms sharing more than 90% of their genes, e.g., the term ‘cell-cell fusion’ (GO:0140253) was combined with ‘syncytium formation’ (GO:0006949), and the term ‘signal transduction’ (GO:0007165) was combined with ‘cell communication’ (GO:0007154). In total, 41 terms were left and visualized as a network, in which nodes indicate terms and edges represent RDLs (Fig.
1). As shown in Fig. 1, most of the small terms are isolated from each other but have connections with large terms, while large terms connect more extensively with each other.

Generation of the 12 major GO subsets of CP

Although small terms are meaningful, juxtaposing them with bigger ones sacrifices the general view of the whole cellular process. We therefore further investigated the term interaction matrix (supplemental table 2) to generate major GO subsets, which were intended to cover most genes in CP and to be relatively independent, with overlap allowed. To achieve this goal, we applied the three rules described below.

First, exclusion. Exclude the terms with fewer than 20 genes. In addition, exclude the term ‘regulation of cellular process’ (GO:0050794) and its two subsets, because they refer to the regulation of any cellular process and, more importantly, their genes are already included in the corresponding cellular process.

Second, coverage. To start with, we defined terms with over 3000 genes as seed terms, because large terms are more comprehensive and representative (supplemental table 2, black box). All the smaller terms were then compared with the seed terms. If over 70% of the genes of a small term fall into one or more of the seed terms and, in addition, its function can be summarized by the seed terms, the small term is represented by the seed terms. In other words, the small term is omitted. Based on this rule, most of the small terms could be ignored. However, some terms did not fall into any of the seed terms, such as ‘cell recognition’ (GO:0008037) and ‘cell killing’ (GO:0001906). Also, the function of some terms, such as ‘cell cycle’ (GO:0007049), cannot be represented by any of the seed terms. These terms were therefore retained.

Third, combination. Terms of similar size and function were combined to reduce redundancy.
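The RDL comparison and the exclusion and coverage rules can be sketched in Python as follows. This is a minimal illustration and not the scripts used in the study: the function names, the toy gene sets, the treatment of values lying exactly on a threshold, and the choice to test each seed term separately are all assumptions. The functional judgement in the coverage rule and the whole combination rule were applied manually and are not modelled.

```python
def rdl(set_a, set_b):
    """Return (RDL_ab/a, RDL_ab/b): shared genes divided by each term's size."""
    shared = len(set_a & set_b)
    return shared / len(set_a), shared / len(set_b)


def connection_level(value):
    """Map an RDL value to weak / moderate / strong (boundary handling assumed)."""
    if value > 0.7:
        return "strong"
    if value >= 0.5:
        return "moderate"
    if value > 0.3:
        return "weak"
    return None  # 30% or less: no edge is drawn


def apply_exclusion_and_coverage(terms, min_size=20, seed_size=3000, cutoff=0.7):
    """Apply the size filter, pick seed terms, and flag small terms covered by a seed.

    `terms` maps a term name to its set of genes. A flagged term is only a
    candidate for omission; whether a seed term also summarizes its function
    is a manual decision that this sketch does not attempt.
    """
    kept = {name: genes for name, genes in terms.items() if len(genes) >= min_size}
    seeds = [name for name, genes in kept.items() if len(genes) > seed_size]
    flagged, retained = {}, []
    for name, genes in kept.items():
        if name in seeds:
            continue
        # Seed terms that contain more than `cutoff` of this term's genes.
        covering = [s for s in seeds if rdl(genes, kept[s])[0] > cutoff]
        if covering:
            flagged[name] = covering
        else:
            retained.append(name)
    return seeds, flagged, retained


# Toy data with scaled-down thresholds so the example stays small.
toy_terms = {
    "big": set(range(100)),
    "inside": set(range(10)),
    "outside": set(range(200, 210)),
    "tiny": {1},
}
seeds, flagged, retained = apply_exclusion_and_coverage(
    toy_terms, min_size=5, seed_size=50
)
print(seeds, flagged, retained)
```

With the thresholds reported in the text, the defaults `min_size=20`, `seed_size=3000`, and `cutoff=0.7` would be used instead of the scaled-down toy values.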
Under the combination rule, ‘signal transduction’ (GO:0007165), ‘cell communication’ (GO:0007154), and ‘cellular response to stimulus’ (GO:0051716) were merged into a new subset containing 6917 genes. Likewise, ‘cell division’ (GO:0051301), ‘cell cycle process’ (GO:0022402), and ‘cell cycle’ (GO:0007049) were merged into a new subset with 1302 genes. One exception was ‘cellular localization’ (GO:0051641) and ‘cellular component organization or biogenesis’ (GO:0071840): as their functions cannot be combined, both terms were retained. In the end, we obtained 12 major GO subsets to interpret CP (supplemental table 3). These 12 subsets covered most of the genes in CP (supplemental Fig. 1A). The overlapped genes between any two of these 12 subsets were calculated (supplemental table 4) and visualized as an RDL network with 12 nodes and 28 edges (Fig. 2). As seen in Fig. 2, most of the interactions are weak or moderate, suggesting that the 12 newly generated subsets are relatively independent. In sum, our results indicate that CP can be represented by the 12-GO-subsets, which are fairly independent while still allowing interconnections.

Investigation of relationships among the 12 major subsets based on protein-protein interactions

So far, we assessed the interactions among the 12 subsets mainly on the basis of the genes they share. To further determine the relationships at the protein-protein interaction (PPI) level, we used STRING [20] to measure the edge number of each subset and of any two subsets combined. Suppose that sets A and B are put together and the newly formed set is C. The edge number of set C can then be calculated using the following equation:

$$|C_e| = |A_e| + |B_e| - |S_e| + |N_e|$$

where |A_e|, |B_e|, and |C_e| are the edge numbers of sets A, B, and C, respectively, |S_e| is the edge number of set A \(\cap\) B, and |N_e| is the number of new edges generated when sets A and B are combined (Fig. 3A).
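The edge-count equation can be checked on a toy network with a short Python sketch. It assumes that each edge number is the count of interactions whose two endpoints both lie in the given gene set (an induced subnetwork), which may differ in detail from what a STRING query returns; the gene names, the edge list, and the function names are hypothetical.

```python
def induced_edges(genes, edges):
    """Return the edges whose endpoints both belong to `genes`."""
    return {edge for edge in edges if edge <= genes}


def edge_decomposition(set_a, set_b, edge_list):
    """Count |A_e|, |B_e|, |S_e|, |N_e| and |C_e| for two gene sets."""
    edges = {frozenset(pair) for pair in edge_list}  # undirected, de-duplicated
    a_e = induced_edges(set_a, edges)
    b_e = induced_edges(set_b, edges)
    s_e = induced_edges(set_a & set_b, edges)  # edges inside the overlap
    c_e = induced_edges(set_a | set_b, edges)  # edges of the combined set
    n_e = c_e - a_e - b_e                      # edges that exist only after combining
    return {"A_e": len(a_e), "B_e": len(b_e), "S_e": len(s_e),
            "N_e": len(n_e), "C_e": len(c_e)}


# Hypothetical gene sets and interactions.
set_a = {"g1", "g2", "g3", "g4"}
set_b = {"g3", "g4", "g5"}
edge_list = [
    ("g1", "g2"), ("g2", "g3"), ("g3", "g4"), ("g1", "g4"),  # within A
    ("g4", "g5"),                                            # within B
    ("g2", "g5"),                                            # bridges A-only and B-only genes
    ("g1", "g6"),                                            # g6 lies outside both sets
]

counts = edge_decomposition(set_a, set_b, edge_list)
# The identity |C_e| = |A_e| + |B_e| - |S_e| + |N_e| should hold.
identity_holds = (
    counts["C_e"]
    == counts["A_e"] + counts["B_e"] - counts["S_e"] + counts["N_e"]
)
print(counts, identity_holds)
```

In this toy case the edge between g3 and g4 is counted in both A and B, so it is subtracted once as a shared edge, while the edge between g2 and g5 appears only after the two sets are combined.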
Next, we generated an edge matrix of the 12-GO-subsets by measuring the edge number of each of the 12 subnetworks and the number of new edges formed when combining any two subsets (supplemental table 5). The ratio of the number of new edges to the edge number of each subset was then calculated and defined as the relative new-edge link (RNL, see methods) (Fig. 3B). Based on the results in Fig. 3B and on subset size, the 12 subsets can be classified into 3 groups (Figs. 2 & 3B). Group 1 contained the two smallest subsets, ‘cell recognition’ (GO:0008037) and ‘cell killing’ (GO:0001906). When the subsets in group 1 were combined with larger subsets, the new edges were several times as numerous as their own edges, which means that, if visualized as a network, the small subset would melt into the big one. Group 2 contained five intermediate subsets of around 1000 genes each: ‘cell adhesion’ (GO:0007155), ‘cell motility’ (GO:0048870), ‘cell death’ (GO:0008219), ‘cell division’ (GO:0051301)+‘cell cycle’ (GO:0007049), and ‘transmembrane transport’ (GO:0055085). When the subsets in group 2 were combined with the five largest ones (the subsets in group 3, each with over 3000 genes), the RNLs were moderate, between 0.4 and 1.3, indicating that even when combined with the five biggest subsets, the subsets in group 2 still maintain their own PPI network structures. For all three groups, the RNLs among groups were relatively low, suggesting independence among these subsets. Fig. 3A also shows that the shared edges already exist in both set A and set B and belong to the network formed by the overlapped genes. Therefore, a higher ratio of shared edges to new edges indicates that few new edges form when sets A and B are combined, suggesting that sets A and B are relatively saturated networks. We therefore named |S_e|/|N_e| the saturation ratio (SR, see methods). Subsequently, we calculated the SRs among different groups.
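As a small worked example of the two ratios, the Python sketch below computes the RNL relative to each subset and the SR from edge counts. The counts are made-up numbers rather than values from supplemental table 5, the function names are hypothetical, and returning infinity when no new edges form is an assumption of this sketch.

```python
import math


def relative_new_edge_link(new_edges, own_edges):
    """RNL: new edges formed on combination, relative to a subset's own edges."""
    return new_edges / own_edges


def saturation_ratio(shared_edges, new_edges):
    """SR: shared edges relative to new edges (infinite if no new edge forms)."""
    if new_edges == 0:
        return math.inf
    return shared_edges / new_edges


# Hypothetical edge counts for two subsets A and B.
a_edges, b_edges = 400, 100
shared_edges, new_edges = 50, 200

rnl_a = relative_new_edge_link(new_edges, a_edges)  # RNL_ab/a
rnl_b = relative_new_edge_link(new_edges, b_edges)  # RNL_ab/b
sr_ab = saturation_ratio(shared_edges, new_edges)   # SR_ab
log10_rnl_b = math.log10(rnl_b)                     # Fig. 3B reports log10(RNL)

print(rnl_a, rnl_b, sr_ab, round(log10_rnl_b, 3))
```

Here the smaller subset B gains twice as many new edges as it has of its own, whereas the larger subset A gains only half as many, which mirrors the asymmetry between small and large subsets described in the text.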
Figure 3C shows that new edges account for a large share of the edges when any two subsets in group 2 are combined. The ratios within group 3 were much higher, suggesting that shared edges represent a large proportion there, and the ratios were intermediate when subsets in group 2 were combined with those in group 3. Overall, this result indicates that the five largest subsets in group 3, which include ‘cellular developmental process’ (GO:0048869), ‘cellular localization’ (GO:0051641), ‘cellular response to stimulus’ (GO:0051716)+‘signal transduction’ (GO:0007165)+‘cell communication’ (GO:0007154), ‘cellular component organization or biogenesis’ (GO:0071840), and ‘cellular metabolic process’ (GO:0044237), are relatively saturated, whereas the five intermediate subsets in group 2 form many more new edges when combined with each other or with subsets in group 3.

The distribution of TFs and kinases in the 12 major subsets

To dissect the distribution of TFs and kinases in the 12 major GO subsets, we first downloaded human TFs and kinases from AnimalTFDB [21] and Kinome Atlas [22], respectively. We obtained 1659 TFs and 1024 transcriptional cofactors (TcoFs), and a chi-square test showed that they were enriched in the subsets ‘cell motility’, ‘cell death’, ‘cell cycle’, ‘cellular developmental process’, and ‘cellular response to stimulus’ (Fig. 4A). In addition, 538 kinases were acquired and a similar analysis was conducted. Figure 4B shows that kinases were enriched in the subsets ‘cell adhesion’, ‘cell motility’, ‘cell death’, ‘cell cycle’, ‘cellular developmental process’, ‘cellular response to stimulus’ and ‘cellular metabolic process’.

Several cancer genomes shared similar distribution

Furthermore, we investigated in which subsets cancer genomes would be enriched.
Several types of cancer genome were collected from The Cancer Genome Atlas (TCGA) database, including breast cancer [23], lung adenocarcinoma [24], colon and rectal cancer [25] and acute myeloid leukemia [26] (Table 1). To investigate the distribution of these cancer genomes in the 12 major GO subsets, we performed enrichment analysis and determined statistical significance, again by chi-square test. Not surprisingly, the mutated genes of all 4 types of cancer genome displayed a similar distribution, being highly enriched in all five subsets in group 2 and in two subsets in group 3 (Fig. 5), which might be caused by the large number of mutated genes shared among different cancers (supplemental Fig. 2). These results support the commonly accepted view that, although they arise in different types of tissue, cancer mutations may disturb similar underlying mechanisms [27].

Table 1 Information of 4 types of cancer genomes.
Cancer type | Number of patients | Mutations | Methods | Ref
Breast cancer | 463 | 14130 | Whole-exome sequencing | [23]
Lung adenocarcinoma | 230 | 18253 | Whole-exome sequencing, mRNA sequencing | [24]
Colon and rectal cancer | 224 | 15995 | Exome capture DNA sequencing | [25]
Acute myeloid leukemia | 200 | 1996 | Whole genome sequencing (50 cases), RNA sequencing, whole-exome sequencing (150 cases) | [26]

Discussion

The GO knowledgebase has become more and more complicated, not only because of its deep vertical branching but also because of imbalanced structures at the same level. In this study, we examined the factors that affect the latter and visualized the interactions among the direct child terms of CP through shared-gene networks (Fig. 1). To streamline the categorization of terms in CP, three rules (exclusion, coverage, and combination) were applied manually to condense the structure into a simplified one containing only 12 subsets.
These 12 subsets encompass most of the genes in CP, are relatively independent, and allow for connections between any two of them (supplemental Fig. 1A & Fig. 2). Additionally, PPI networks were used to further examine the interconnections among these 12 subsets, resulting in their classification into 3 groups. The five biggest subsets in group 3 collectively accounted for over 90% of the genes in CP (supplemental Fig. 1B) and exhibited saturated PPI networks, with few new edges formed when any two of them were combined (Fig. 3C). The five intermediate subsets in group 2 displayed many new edges when combined with each other and with the five largest ones in group 3 (Fig. 3C). In conclusion, the 12 GO subsets better represent the entire cellular process. Notably, an uneven distribution of terms of varying sizes at one level is very common in the GO system. Our newly developed categorization strategy relies primarily on the number of genes shared between any two terms, and its accuracy was validated by PPI-based network analysis. This strategy shows promise for application to other large GO terms. This study also introduced several novel metrics for network analysis, namely RDL, RNL, and SR, which enhance our understanding of the network structure. Finally, applying this categorization system provided valuable insights into the distribution of TFs, kinases, and cancer-related genetic alterations, which suggests its potential utility in improving our understanding of biological systems.

Methods

Data collection

The GO and GO annotation

The GO (last file loaded on 2023-11-17, AmiGO 2 version: 2.5.17) is a regularly updated online knowledgebase that serves as a standardized vocabulary for annotating genes as well as gene products [1]. In this study, the GO was browsed using AmiGO (https://amigo.geneontology.org/amigo/dd_browse) [28].
All human-related terms and annotations under ‘cellular process’ (GO:0009987) were manually downloaded from https://www.geneontology.org.

TFs, TcoFs, protein kinases, and cancer mutated genes

The Animal Transcription Factor Database (AnimalTFDB) [21] is the most comprehensive database for animal TFs. In this study, data for 1659 human TFs and 1024 TcoFs were downloaded from the AnimalTFDB database (https://guolab.wchscu.cn/AnimalTFDB4/#/). The Kinome Atlas (http://cellimagelibrary.org/pages/kinome_atlas) [22] is a comprehensive imaging database of the kinome with compartmental annotations at the subcellular level. We downloaded 538 kinases encoded by the human genome [22]. TCGA (https://www.cancer.gov/tcga) is one of the most successful cancer genomics programs and generates and analyzes cancer genomes. Here, 1996 mutated genes in AML [26], 14130 mutated genes in breast cancer [23], 18253 mutated genes in lung adenocarcinoma [24] and 15995 mutated genes in colon and rectal cancer [29] were collected. The mutation types sorted out include “Missense_Mutation”, “Nonsense_Mutation”, “Splice_Site”, and “Frame_Shift_Del”, etc.

Protein-protein interaction

The Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) module in Cytoscape 3.10.1 [30] was used with default settings to obtain the human PPIs within one subset or two combined subsets of genes.

Calculation of the RDL between two terms

To obtain the number of overlapped genes, we developed Python scripts to process the terms obtained from the GO. These scripts ran in a Python environment and required pandas and itertools [31]. We introduced the concept of the RDL as the basis for comparing interactions between two terms. Assume two terms ‘a’ and ‘b’ with annotated gene sets ‘A’ and ‘B’, respectively.
The RDL of term ‘a’ is:

$$\text{RDL}_{ab/a} = \frac{|A \cap B|}{|A|}$$

The RDL of term ‘b’ is:

$$\text{RDL}_{ab/b} = \frac{|A \cap B|}{|B|}$$

Here, |A∩B| denotes the number of overlapped genes between gene sets ‘A’ and ‘B’. A network was visualized using Gephi 0.10 [32], where nodes represent terms and edges represent the RDLs.

Calculation of the RNL and SR based on PPI networks

The RNL is defined as the ratio of the number of new edges formed when two subsets are combined to the number of edges in set A or set B:

$$\text{RNL}_{ab/a} = \frac{|N_e|}{|A_e|}$$

$$\text{RNL}_{ab/b} = \frac{|N_e|}{|B_e|}$$

The SR is defined as the ratio of the number of shared edges of the two subsets to the number of new edges:

$$\text{SR}_{ab} = \frac{|S_e|}{|N_e|}$$

The application of the 12 major GO subsets

We calculated the distribution of TFs, TcoFs, protein kinases, and cancer genomes in the 12 major GO subsets. The expected value (expectation) was calculated by multiplying the total number of tested genes by the percentage of each subset in the 12 major subsets. The number of genes shared between the test data and each subset was taken as the observed value (observation). Chi-square tests were used to compare the observation and expectation groups. When expected values were less than 5, Yates’ continuity correction was applied.

Declarations

Competing interests

The authors declare no competing interests.

Author Contribution

J.Z. conceived and designed the study. Y.L., J.H., and Y.W. acquired the data. Y.L., J.Z., and R.L. conducted data analysis and result interpretation. J.Z., Y.L., and R.L. wrote and revised the manuscript.
Acknowledgement

We would like to thank Beijing University of Chinese Medicine for use of its shared services to complete this research. This work was supported by the Young Scientists Fund of the National Natural Science Foundation of China (No. 31701280 to J. Zhang).

Data Availability

Data is provided within the manuscript or supplementary information files.

References

[1] ASHBURNER M, BALL C A, BLAKE J A, et al. Gene Ontology: tool for the unification of biology [J]. Nature Genetics, 2000, 25(1): 25–9.
[2] DU PLESSIS L, SKUNCA N, DESSIMOZ C. The what, where, how and why of gene ontology–a primer for bioinformaticians [J]. Briefings in bioinformatics, 2011, 12(6): 723–35.
[3] ALEKSANDER S A, BALHOFF J, CARBON S, et al. The Gene Ontology knowledgebase in 2023 [J]. Genetics, 2023, 224(1).
[4] MANJANG K, TRIPATHI S, YLI-HARJA O, et al. Graph-based exploitation of gene ontology using GOxploreR for scrutinizing biological significance [J]. Scientific Reports, 2020, 10(1): 16672.
[5] ZEEBERG B R, FENG W, WANG G, et al. GoMiner: a resource for biological interpretation of genomic and proteomic data [J]. Genome biology, 2003, 4(4): R28.
[6] MAERE S, HEYMANS K, KUIPER M. BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks [J]. Bioinformatics (Oxford, England), 2005, 21(16): 3448–9.
[7] SEALFON R S, HIBBS M A, HUTTENHOWER C, et al. GOLEM: an interactive graph-based gene-ontology navigation and analysis tool [J]. BMC bioinformatics, 2006, 7: 443.
[8] EDEN E, NAVON R, STEINFELD I, et al. GOrilla: a tool for discovery and visualization of enriched GO terms in ranked gene lists [J]. BMC bioinformatics, 2009, 10(1): 48.
[9] WEI Q, KHAN I K, DING Z, et al. NaviGO: interactive tool for visualization and functional similarity and coherence analysis with gene ontology [J]. BMC bioinformatics, 2017, 18(1): 177.
[10] ZHANG B, KIROV S, SNODDY J. WebGestalt: an integrated system for exploring gene sets in various biological contexts [J]. Nucleic acids research, 2005, 33(Web Server issue): W741-8.
[11] WANG J, VASAIKAR S, SHI Z, et al. WebGestalt 2017: a more comprehensive, powerful, flexible and interactive gene set enrichment analysis toolkit [J]. Nucleic acids research, 2017, 45(W1): W130-w7.
[12] POMAZNOY M, HA B, PETERS B. GOnet: a tool for interactive Gene Ontology analysis [J]. BMC bioinformatics, 2018, 19(1): 470.
[13] FALCON S, GENTLEMAN R. Using GOstats to test gene lists for GO term association [J]. Bioinformatics (Oxford, England), 2007, 23(2): 257–8.
[14] ZEEBERG B R, LIU H, KAHN A B, et al. RedundancyMiner: De-replication of redundant GO categories in microarray and proteomics analysis [J]. BMC bioinformatics, 2011, 12(1): 52.
[15] SUPEK F, BOŠNJAK M, ŠKUNCA N, et al. REVIGO summarizes and visualizes long lists of gene ontology terms [J]. PloS one, 2011, 6(7): e21800.
[16] MERICO D, ISSERLIN R, STUEKER O, et al. Enrichment map: a network-based method for gene-set enrichment visualization and interpretation [J]. PloS one, 2010, 5(11): e13984.
[17] BINDEA G, MLECNIK B, HACKL H, et al. ClueGO: a Cytoscape plug-in to decipher functionally grouped gene ontology and pathway annotation networks [J]. Bioinformatics (Oxford, England), 2009, 25(8): 1091–3.
[18] GROSSMANN S, BAUER S, ROBINSON P N, et al. Improved detection of overrepresentation of Gene-Ontology annotations with parent child analysis [J]. Bioinformatics (Oxford, England), 2007, 23(22): 3024–31.
[19] MARTIN D, BRUN C, REMY E, et al. GOToolBox: functional analysis of gene datasets based on Gene Ontology [J]. Genome biology, 2004, 5(12): R101.
[20] SZKLARCZYK D, KIRSCH R, KOUTROULI M, et al. The STRING database in 2023: protein-protein association networks and functional enrichment analyses for any sequenced genome of interest [J]. Nucleic acids research, 2023, 51(D1): D638-d46.
[21] SHEN W K, CHEN S Y, GAN Z Q, et al. AnimalTFDB 4.0: a comprehensive animal transcription factor database updated with variation and expression annotations [J]. Nucleic acids research, 2023, 51(D1): D39-d45.
[22] ZHANG H, CAO X, TANG M, et al. A subcellular map of the human kinome [J]. eLife, 2021, 10.
[23] KOBOLDT D C, FULTON R S, MCLELLAN M D, et al. Comprehensive molecular portraits of human breast tumours [J]. Nature, 2012, 490(7418): 61–70.
[24] COLLISSON E A, CAMPBELL J D, BROOKS A N, et al. Comprehensive molecular profiling of lung adenocarcinoma [J]. Nature, 2014, 511(7511): 543–50.
[25] Comprehensive molecular characterization of human colon and rectal cancer [J]. Nature, 2012, 487(7407): 330–7.
[26] LEY T J, MILLER C, DING L, et al. Genomic and epigenomic landscapes of adult de novo acute myeloid leukemia [J]. The New England journal of medicine, 2013, 368(22): 2059–74.
[27] KANDOTH C, MCLELLAN M D, VANDIN F, et al. Mutational landscape and significance across 12 major cancer types [J]. Nature, 2013, 502(7471): 333–9.
[28] CARBON S, IRELAND A, MUNGALL C J, et al. AmiGO: online access to ontology and annotation data [J]. Bioinformatics, 2008, 25(2): 288–9.
[29] MUZNY D M, BAINBRIDGE M N, CHANG K, et al. Comprehensive molecular characterization of human colon and rectal cancer [J]. Nature, 2012, 487(7407): 330–7.
[30] SHANNON P, MARKIEL A, OZIER O, et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks [J]. Genome research, 2003, 13(11): 2498–504.
[31] MCKINNEY W. Data Structures for Statistical Computing in Python; proceedings of the SciPy, F, 2010 [C].
[32] BASTIAN M, HEYMANN S, JACOMY M. Gephi: an open source software for exploring and manipulating networks; proceedings of the Proceedings of the international AAAI conference on web and social media, F, 2009 [C].
Supplementary Files Table1.docx supplementaltables.xlsx supplementaltables.xlsx supplementalfigures.docx supplementalfigures.docx supplementaltables.xlsx Cite Share Download PDF Status: Posted Version 1 posted You are reading this latest preprint version Research Square lets you share your work early, gain feedback from the community, and start making changes to your manuscript prior to peer review in a journal. As a division of Research Square Company, we’re committed to making research communication faster, fairer, and more useful. We do this by developing innovative software and high quality services for the global research community. Our growing team is made up of researchers and industry professionals working together to solve the most critical problems facing scientific publishing. Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-4581229","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Article","associatedPublications":[],"authors":[{"id":321303373,"identity":"8405b7f6-ba4f-4b56-9b3d-2200e0022d1d","order_by":0,"name":"Yirui Liu","email":"","orcid":"","institution":"Beijing University of Chinese Medicine","correspondingAuthor":false,"prefix":"","firstName":"Yirui","middleName":"","lastName":"Liu","suffix":""},{"id":321303374,"identity":"2f72a169-e37c-4644-89e7-ab5ac9faf748","order_by":1,"name":"Ruiqi Liu","email":"","orcid":"","institution":"Beijing University of Chinese Medicine","correspondingAuthor":false,"prefix":"","firstName":"Ruiqi","middleName":"","lastName":"Liu","suffix":""},{"id":321303375,"identity":"e50c15b1-1665-49e6-bfe6-934abeda5350","order_by":2,"name":"Jiaming 
Hu","email":"","orcid":"","institution":"Beijing University of Chinese Medicine","correspondingAuthor":false,"prefix":"","firstName":"Jiaming","middleName":"","lastName":"Hu","suffix":""},{"id":321303376,"identity":"6a04ea8d-2aea-450f-a521-8be7622ae34e","order_by":3,"name":"Yating Wang","email":"","orcid":"","institution":"Beijing University of Chinese Medicine","correspondingAuthor":false,"prefix":"","firstName":"Yating","middleName":"","lastName":"Wang","suffix":""},{"id":321303377,"identity":"4c01d571-e9c1-4471-8d23-42cca7dcea71","order_by":4,"name":"Jingfang Zhang","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAABGElEQVRIiWNgGAWjYDACCRDBAySZDwAZFQcgojxEaWFLADLOEK0FBEBaGNuI0CI/u/nZwy8yFnnybszPHn6ddyexf9oBxgdv2xjkzXFoYZxzzNxYhkei2PAYm7mx7LZniTNuJzAbzm1jMNzZgF0Ls0SCmbQEj0TixvkNZtKS2w4nbpBOYJPmbWNIMDiAXQubRPo3iJY29m/SknPAWth/49PCI5FjJvkBqGU+G4+Z5McGiC3M+LRISOSUSQM1Jm5g4wEyjj0znnE7sVlyzjkJww04tMjPSN8m+bOnLnF+G/s2yR81d2T7Zycf/PCmzEYely3gIODtYWAAKWCGRAdjAwNSfGEFjD9+AK1rADHwqhsFo2AUjIKRCgCz9lobYOg+dgAAAABJRU5ErkJggg==","orcid":"","institution":"Beijing University of Chinese Medicine","correspondingAuthor":true,"prefix":"","firstName":"Jingfang","middleName":"","lastName":"Zhang","suffix":""}],"badges":[],"createdAt":"2024-06-14 10:03:57","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-4581229/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-4581229/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":60019633,"identity":"d7bd8fe3-1c79-4988-bb4e-208e3f13b283","added_by":"auto","created_at":"2024-07-10 15:27:02","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":237067,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eNetwork visualization of the term relationships of cellular process. 
\u003c/strong\u003eThe relative dot link (RDL) was calculated between any two terms in supplemental table 2 and visualized in network. The size of nodes indicates the number of genes associated with that term. The edges indicate the RDL, with arrows pointing to the terms being compared. The blue, yellow, and red edges represent weak, moderate, and strong relationships, respectively. The thickness of edge indicates the number of overlapped genes between two terms.\u003c/p\u003e","description":"","filename":"fig1.png","url":"https://assets-eu.researchsquare.com/files/rs-4581229/v1/b987ea76002fa0401cc129ef.png"},{"id":60019632,"identity":"93a12cd1-38cb-4a4e-8c0e-3f82170d8979","added_by":"auto","created_at":"2024-07-10 15:27:02","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":71506,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eNetwork visualization of the 12 major subsets. \u003c/strong\u003eThe relative dot link (RDL) was calculated between any two subsets in supplemental table 4 and visualized in network. The 12 subsets were divided into three groups based on the number of genes they contained. The edges indicate the RDL, with arrows pointing to the terms being compared. The blue, yellow, and red edges represent weak, moderate, and strong relationships, respectively.\u003c/p\u003e","description":"","filename":"fig2.png","url":"https://assets-eu.researchsquare.com/files/rs-4581229/v1/b5210f50784526da2bad5853.png"},{"id":60019634,"identity":"a6606b10-e153-456d-b1cb-22976763ecea","added_by":"auto","created_at":"2024-07-10 15:27:02","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":1568773,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eThe relationships of the 12 major subsets based on protein-protein interaction (PPI) networks. 
\u003c/strong\u003e(A) Schematic diagram illustrating the generation of new edges when combining two sets of genes. (B) The relative new-edge links (RNLs) were calculated between any two of the 12 subsets and the common logarithm of RNL is presented. (C) The saturation ratios (SRs) were calculated among subsets in group 2, among subsets in group 3, and between subsets in group 2 and those in group 3.\u003c/p\u003e","description":"","filename":"fig3.png","url":"https://assets-eu.researchsquare.com/files/rs-4581229/v1/0f98819f561f8d0c4af97751.png"},{"id":60019637,"identity":"881fcc84-e8b5-4330-b424-87555478c057","added_by":"auto","created_at":"2024-07-10 15:27:02","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":426920,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eDistribution of TFs, TcoFs, and kinases in the 12 subsets. \u003c/strong\u003e(A) The distribution of transcription factors (TFs) and transcriptional cofactors (TcoFs) in the 12 subsets. (B) The distribution of kinases in the 12 subsets. A chi-square test was performed. When expected values were less than 5, Yates's continuity correction was applied. *p\u0026lt;0.05, **p\u0026lt;0.01, ***p\u0026lt;0.001.\u003c/p\u003e","description":"","filename":"fig4.png","url":"https://assets-eu.researchsquare.com/files/rs-4581229/v1/e99b2d9b118601984dd1ccf1.png"},{"id":60019639,"identity":"7fc8eccc-9580-4f1d-9573-3530d4ddf9bb","added_by":"auto","created_at":"2024-07-10 15:27:03","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":844265,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eDistribution of 4 cancer genomes in the 12 subsets. \u003c/strong\u003eThe distribution of 4 types of cancer genome in the 12 subsets. (A) Acute myeloid leukemia. (B) Breast cancer. (C) Lung adenocarcinoma. (D) Colon and rectal cancer. A chi-square test was performed.
*p\u0026lt;0.05, **p\u0026lt;0.01, ***p\u0026lt;0.001.\u003c/p\u003e","description":"","filename":"fig5.png","url":"https://assets-eu.researchsquare.com/files/rs-4581229/v1/ae605bd1b7de56c51e6d241c.png"},{"id":71301229,"identity":"f94e59ae-0779-47fd-b080-f65854c02f95","added_by":"auto","created_at":"2024-12-13 05:32:51","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":3609105,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-4581229/v1/96287eb7-cecc-44f2-b3c9-f92b86a18ca4.pdf"},{"id":60019631,"identity":"6eb7c8fa-0a4f-4468-9845-af64bffbfa32","added_by":"auto","created_at":"2024-07-10 15:27:02","extension":"docx","order_by":0,"title":"","display":"","copyAsset":false,"role":"supplement","size":28478,"visible":true,"origin":"","legend":"","description":"","filename":"Table1.docx","url":"https://assets-eu.researchsquare.com/files/rs-4581229/v1/c425f522382364cc1d872606.docx"},{"id":60020393,"identity":"e88e1740-f6eb-4f41-86b6-a31188891a6d","added_by":"auto","created_at":"2024-07-10 15:43:02","extension":"xlsx","order_by":1,"title":"","display":"","copyAsset":false,"role":"supplement","size":4096747,"visible":true,"origin":"","legend":"","description":"","filename":"supplementaltables.xlsx","url":"https://assets-eu.researchsquare.com/files/rs-4581229/v1/77822d572ded9fac986cd050.xlsx"},{"id":60019636,"identity":"c5d7c749-a654-4780-9942-27639a56f8cf","added_by":"auto","created_at":"2024-07-10 15:27:02","extension":"xlsx","order_by":1,"title":"","display":"","copyAsset":false,"role":"supplement","size":4096747,"visible":true,"origin":"","legend":"","description":"","filename":"supplementaltables.xlsx","url":"https://assets-eu.researchsquare.com/files/rs-4581229/v1/e7fd2e9bb0c73fa6c5e78fbe.xlsx"},{"id":60019641,"identity":"34edac1b-8098-4967-90b1-124b481a6f4b","added_by":"auto","created_at":"2024-07-10 
15:27:03","extension":"docx","order_by":2,"title":"","display":"","copyAsset":false,"role":"supplement","size":261092,"visible":true,"origin":"","legend":"","description":"","filename":"supplementalfigures.docx","url":"https://assets-eu.researchsquare.com/files/rs-4581229/v1/a52b6e38cc3e4601bddc1f45.docx"},{"id":60019642,"identity":"90d5f59c-a797-4511-aadc-f1f7e5e81fae","added_by":"auto","created_at":"2024-07-10 15:27:03","extension":"docx","order_by":8,"title":"","display":"","copyAsset":false,"role":"supplement","size":261092,"visible":true,"origin":"","legend":"","description":"","filename":"supplementalfigures.docx","url":"https://assets-eu.researchsquare.com/files/rs-4581229/v1/95a87708e95bf4575db0e006.docx"},{"id":60020089,"identity":"2f35cc11-dafa-4bbb-a0be-5cfc48392ded","added_by":"auto","created_at":"2024-07-10 15:35:03","extension":"xlsx","order_by":9,"title":"","display":"","copyAsset":false,"role":"supplement","size":4096747,"visible":true,"origin":"","legend":"","description":"","filename":"supplementaltables.xlsx","url":"https://assets-eu.researchsquare.com/files/rs-4581229/v1/77503610147ae0b73f1bb7b5.xlsx"}],"financialInterests":"No competing interests reported.","formattedTitle":"Generation of the 12-GO-Subsets to Interpretate Human Cellular Process","fulltext":[{"header":"Introduction","content":"\u003cp\u003eThe Gene Ontology (GO) is a robust knowledgebase that offers precisely defined ontology terms and associated annotations for a wide range of organisms\u003csup\u003e[\u003cspan additionalcitationids=\"CR2\" citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e]\u003c/sup\u003e. The GO is organized into three main categories: molecular function (MF), cellular component (CC), and biological process (BP). 
Within each category, terms are linked through meaningful relationships such as \u0026lsquo;is_a\u0026rsquo;, \u0026lsquo;part_of\u0026rsquo;, \u0026lsquo;has_part\u0026rsquo; or \u0026lsquo;regulates\u0026rsquo;. These terms and relationships collectively form a directed acyclic graph (DAG), a more flexible structure than a traditional hierarchy, as each term can have multiple parents. GO annotations associate genes or their products (such as proteins or noncoding RNAs) with GO terms. Each gene may have multiple annotations, reflecting its multiple functions. Since its inception, the GO knowledgebase has been diligently maintained and updated by the GO Consortium. It has now become a gold standard database for gene function reference, which empowers researchers to uncover biological significance through large-scale data analysis\u003csup\u003e[\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e]\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eHowever, as biological data continue to expand rapidly, an increasing number of gene annotations have been added, which has made the GO database increasingly complicated, with numerous terms and a deep, multi-branched structure\u003csup\u003e[\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e]\u003c/sup\u003e. This complexity poses a challenge for enrichment analysis, as the results often contain a long list of redundant GO terms and require further effort for proper interpretation.
Many tools have been developed to reduce the redundancy in enrichment analysis, either by visualizing enriched terms in a local context of the DAG to enhance comprehension of their relationships, such as GOMiner\u003csup\u003e[\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e]\u003c/sup\u003e, BiNGO\u003csup\u003e[\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e]\u003c/sup\u003e, GOLEM\u003csup\u003e[\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e]\u003c/sup\u003e, GOrilla\u003csup\u003e[\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e]\u003c/sup\u003e, NaviGO\u003csup\u003e[\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e]\u003c/sup\u003e, WebGestalt\u003csup\u003e[\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e, \u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e]\u003c/sup\u003e, and GOnet\u003csup\u003e[\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e]\u003c/sup\u003e, or by clustering enriched terms into more condensed and representative categories using various algorithms, such as GOstats\u003csup\u003e[\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e]\u003c/sup\u003e, RedundancyMiner\u003csup\u003e[\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e]\u003c/sup\u003e, REVIGO\u003csup\u003e[\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e]\u003c/sup\u003e, Enrichment Map\u003csup\u003e[\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e]\u003c/sup\u003e, ClueGO\u003csup\u003e[\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e]\u003c/sup\u003e, the parent-child approach\u003csup\u003e[\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e]\u003c/sup\u003e, and
GO-Proxy\u003csup\u003e[\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e]\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eUnfortunately, very few bioinformatics studies have focused on a comprehensive overview of the entire structure of the GO itself\u003csup\u003e[\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e, \u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e]\u003c/sup\u003e, impeding researchers\u0026rsquo; ability to gain a holistic understanding of biological systems. The GO Consortium has recently generated a manually trimmed database known as the \u0026lsquo;Generic GO subset\u0026rsquo;, which contains only 75 BP terms, 40 MF terms and 29 CC terms\u003csup\u003e[\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e]\u003c/sup\u003e. K. Manjang reduced the DAG of human BP to a simplified one with only 39 nodes by combining GO terms into three categories: leaf nodes, regular nodes, and jump nodes\u003csup\u003e[\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e]\u003c/sup\u003e. However, previous research has not attempted to group GO terms into larger categories to offer a comprehensive overview of biological systems. Here, we acquired about 60 direct child terms of human \u0026lsquo;cellular process\u0026rsquo; (CP, GO:0009987) from AmiGO and regrouped them into 12 major GO subsets. These 12 subsets encompass most of the genes involved in CP, exhibit a degree of independence, and demonstrate interconnectedness, suggesting their potential as representative major subsets of CP. Finally, we used our new categories to investigate the distribution of human transcription factors (TFs), kinases, and several types of cancer genome, each of which displayed a specific pattern.
Overall, our research built a novel category system based on GO terms to support a comprehensive understanding of cellular functions.\u003c/p\u003e"},{"header":"Results","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e\n \u003ch2\u003eRetrieving all direct child terms of CP\u003c/h2\u003e\n \u003cp\u003eWe concentrated on CP instead of BP for the following reasons: 1) CP is the largest child term under BP and contains most of the genes in BP (76%). 2) Some child terms of BP also have counterparts in CP, for example \u0026lsquo;developmental process\u0026rsquo; (GO:0032502) versus \u0026lsquo;cellular developmental process\u0026rsquo; (GO:0048869), which results in substantial redundancy and complicates the subsequent analysis. 3) Considering only terms directly connected with CP increases consistency because all the terms discussed are at the same level.\u003c/p\u003e\n \u003cp\u003eTo get a general view of the whole cellular system, we omitted the deeply branched terms and instead downloaded only the 66 direct child terms of CP for Homo sapiens from AmiGO (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://amigo.geneontology.org/amigo/dd_browse\u003c/span\u003e\u003c/span\u003e). After excluding the empty ones, 60 terms remained. Additionally, each term contains many repeated entries, because one gene can be annotated to a term with multiple annotations\u003csup\u003e[\u003cspan class=\"CitationRef\"\u003e2\u003c/span\u003e]\u003c/sup\u003e. To avoid confusion, we defined the term size as the number of genes a term contains rather than its number of annotations.
Then the 60 non-empty terms, all direct children of CP, were ordered by size (supplemental table \u003cspan class=\"InternalRef\"\u003e1\u003c/span\u003e).\u003c/p\u003e\n \u003cp\u003eTo our surprise, the sizes of the 60 terms varied widely: the smallest term contains only one gene, whereas the largest is associated with more than half of all genes in CP. This uneven distribution is partially due to the fact that the GO interprets cellular functions of all organisms in one structure, so some features of other organisms are presented here, e.g., \u0026lsquo;light absorption\u0026rsquo; (GO:0016037). Additionally, poor characterization leaves some terms small. The DAG structure also allows deeply branched terms to connect directly to CP. However, many direct child terms of CP that have no branches of their own also contribute to this imbalance. It therefore requires further consideration of the relationships among these direct child terms of CP in order to arrive at a better categorizing strategy that summarizes the major properties of CP.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec4\" class=\"Section2\"\u003e\n \u003ch2\u003eNetwork visualization of term relationships under CP\u003c/h2\u003e\n \u003cp\u003eTo investigate the interactions among the 60 terms under CP, we generated a term interaction matrix by calculating the number of overlapping genes between any two terms (supplemental table \u003cspan class=\"InternalRef\"\u003e2\u003c/span\u003e). Then we defined the relative dot link (RDL) between two terms as the ratio of the number of overlapping genes to the size of each term, respectively (see methods). By comparing the RDL with three predefined thresholds, 30%, 50%, and 70%, three levels of interaction were obtained. An RDL over 30% but below 50% was defined as a weak connection, 50\u0026ndash;70% as a moderate connection, and over 70% as a strong connection.
After excluding the small terms (fewer than 20 genes), as well as \u0026lsquo;positive regulation of cellular process\u0026rsquo; (GO:0048522) and \u0026lsquo;negative regulation of cellular process\u0026rsquo; (GO:0048523), which are two subsets redundant with \u0026lsquo;regulation of cellular process\u0026rsquo; (GO:0050794), 43 terms remained. Additionally, we combined terms sharing more than 90% of their genes, e.g., the term \u0026lsquo;cell-cell fusion\u0026rsquo; (GO:0140253) was combined with \u0026lsquo;syncytium formation\u0026rsquo; (GO:0006949), and the term \u0026lsquo;signal transduction\u0026rsquo; (GO:0007165) was combined with \u0026lsquo;cell communication\u0026rsquo; (GO:0007154). In total, 41 terms were left and visualized as a network, in which nodes indicate terms and edges represent RDLs (Fig. \u003cspan class=\"InternalRef\"\u003e1\u003c/span\u003e). As shown in Fig. \u003cspan class=\"InternalRef\"\u003e1\u003c/span\u003e, most of the small terms are isolated from each other but have connections with large terms, while large terms connect more extensively with each other.\u003c/p\u003e\n \u003cp\u003e\u003cbr\u003e\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec5\" class=\"Section2\"\u003e\n \u003ch2\u003eGeneration of the 12 major GO subsets of CP\u003c/h2\u003e\n \u003cp\u003eAlthough small terms are meaningful, juxtaposing them with larger ones obscures the general view of the whole cellular process. So, we further investigated the term interaction matrix (supplemental table \u003cspan class=\"InternalRef\"\u003e2\u003c/span\u003e) to generate major GO subsets, which were intended to cover most genes in CP and to be relatively independent, with some overlap allowed. To achieve this goal, we formulated three rules, described below.\u003c/p\u003e\n \u003cp\u003eFirst, exclusion. Exclude the terms with fewer than 20 genes.
In addition, exclude the term \u0026lsquo;regulation of cellular process\u0026rsquo; (GO:0050794) and its two subsets, because they refer to the regulation of any cellular process and, more importantly, their genes are already included in the corresponding cellular processes. Second, coverage. To start with, we defined terms with over 3000 genes as seed terms, because large terms are more comprehensive and representative (supplemental table \u003cspan class=\"InternalRef\"\u003e2\u003c/span\u003e, black box). Then all the smaller terms were compared with the seed terms. If over 70% of the genes of a smaller term fell into one or more of the seed terms and its function could be summarized by those seed terms, the smaller term was considered to be represented by the seed terms. In other words, the smaller term was omitted. Based on this rule, most of the small terms could be ignored. However, some terms did not fall into any of the seed terms, such as \u0026lsquo;cell recognition\u0026rsquo; (GO:0008037) and \u0026lsquo;cell killing\u0026rsquo; (GO:0001906). Also, the function of some terms, such as \u0026lsquo;cell cycle\u0026rsquo; (GO:0007049), cannot be represented by any of the seed terms. So, these terms were retained. Third, combination. Terms of similar size and function were combined to reduce redundancy. For example, \u0026lsquo;signal transduction\u0026rsquo; (GO:0007165), \u0026lsquo;cell communication\u0026rsquo; (GO:0007154), and \u0026lsquo;cellular response to stimulus\u0026rsquo; (GO:0051716) were merged into a new subset containing 6917 genes. Likewise, \u0026lsquo;cell division\u0026rsquo; (GO:0051301), \u0026lsquo;cell cycle process\u0026rsquo; (GO:0022402), and \u0026lsquo;cell cycle\u0026rsquo; (GO:0007049) were merged into a new one with 1302 genes. One exception was \u0026lsquo;cellular localization\u0026rsquo; (GO:0051641) and \u0026lsquo;cellular component organization or biogenesis\u0026rsquo; (GO:0071840).
As their functions cannot be combined, both terms were retained.\u003c/p\u003e\n \u003cp\u003eIn the end, we obtained 12 major GO subsets to interpret CP (supplemental table 3). These 12 subsets covered most of the genes in CP (supplemental Fig. 1A). The overlapping genes between any two of these 12 subsets were calculated (supplemental table 4) and visualized as an RDL network with 12 nodes and 28 edges (Fig. \u003cspan class=\"InternalRef\"\u003e2\u003c/span\u003e). As seen in Fig. \u003cspan class=\"InternalRef\"\u003e2\u003c/span\u003e, most of the interactions are weak or moderate, suggesting that the 12 newly generated subsets are relatively independent. In sum, our result indicates that CP could be represented by the 12-GO-subsets, which are fairly independent yet interconnected.\u003c/p\u003e\n \u003cp\u003e\u003cbr\u003e\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec6\" class=\"Section2\"\u003e\n \u003ch2\u003eInvestigation of relationships among the 12 major subsets based on protein-protein interactions\u003c/h2\u003e\n \u003cp\u003eAbove, we assessed the interactions among the 12 subsets mainly based on the genes they shared. To further determine the relationships at the protein-protein interaction (PPI) level, we used STRING\u003csup\u003e[\u003cspan class=\"CitationRef\"\u003e20\u003c/span\u003e]\u003c/sup\u003e to measure the number of edges in each subset and in each pair of subsets combined.\u003c/p\u003e\n \u003cp\u003eSuppose sets A and B are combined to form a new set C.
Then the edge number of set C can be calculated using the following equation:\u003c/p\u003e\n \u003cp\u003e|C\u003csub\u003ee\u003c/sub\u003e| = |A\u003csub\u003ee\u003c/sub\u003e| + |B\u003csub\u003ee\u003c/sub\u003e| - |S\u003csub\u003ee\u003c/sub\u003e| + |N\u003csub\u003ee\u003c/sub\u003e|\u003c/p\u003e\n \u003cp\u003e|A\u003csub\u003ee\u003c/sub\u003e|, |B\u003csub\u003ee\u003c/sub\u003e|, and |C\u003csub\u003ee\u003c/sub\u003e| are the edge numbers of sets A, B, and C, respectively. |S\u003csub\u003ee\u003c/sub\u003e| is the edge number of set A\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\cap\\)\u003c/span\u003e\u003c/span\u003eB. |N\u003csub\u003ee\u003c/sub\u003e| is the number of new edges generated when combining sets A and B (Fig. \u003cspan class=\"InternalRef\"\u003e3\u003c/span\u003eA).\u003c/p\u003e\n \u003cp\u003eNext, we generated an edge matrix of the 12-GO-subsets by measuring the edge number of each of the 12 subnetworks and the number of new edges when combining any two subsets (supplemental table 5). Then the ratio of the new edge number to the edge number of each subset was calculated and defined as the relative new-edge link (RNL, see methods) (Fig. \u003cspan class=\"InternalRef\"\u003e3\u003c/span\u003eB). Based on the result in Fig. \u003cspan class=\"InternalRef\"\u003e3\u003c/span\u003eB and the size of the subsets, these 12 subsets can be classified into 3 groups (Figs. \u003cspan class=\"InternalRef\"\u003e2\u003c/span\u003e \u0026amp; \u003cspan class=\"InternalRef\"\u003e3\u003c/span\u003eB). Group 1 contained the 2 smallest subsets, \u0026lsquo;cell recognition\u0026rsquo; (GO:0008037) and \u0026lsquo;cell killing\u0026rsquo; (GO:0001906). When combining the subsets in group 1 with other larger subsets, the new edges were several times the number of their own edges, which means that, if visualized as a network, the small subset would merge into the large ones.
Group 2 contained five intermediate subsets, each with around 1000 genes: \u0026lsquo;cell adhesion\u0026rsquo; (GO:0007155), \u0026lsquo;cell motility\u0026rsquo; (GO:0048870), \u0026lsquo;cell death\u0026rsquo; (GO:0008219), \u0026lsquo;cell division\u0026rsquo; (GO:0051301)+\u0026lsquo;cell cycle\u0026rsquo; (GO:0007049), and \u0026lsquo;transmembrane transport\u0026rsquo; (GO:0055085). When the subsets in group 2 were combined with the five largest ones (subsets in group 3, with over 3000 genes), the RNLs were moderate, between 0.4 and 1.3, indicating that even when combined with the five largest subsets, the subsets in group 2 still maintain their own PPI network structures. For all three groups, the RNLs among groups were relatively low, suggesting independence among these subsets.\u003c/p\u003e\n \u003cp\u003eFig. \u003cspan class=\"InternalRef\"\u003e3\u003c/span\u003eA also shows that shared edges already exist in both sets A and B and belong to the network formed by the overlapping genes. Therefore, a higher ratio of shared edges to new edges indicates that when sets A and B are combined, few new edges form, suggesting that sets A and B are relatively saturated networks. We therefore defined |S\u003csub\u003ee\u003c/sub\u003e|/|N\u003csub\u003ee\u003c/sub\u003e| as the saturation ratio (SR, see methods). Subsequently, we calculated the SRs among different groups. Figure \u003cspan class=\"InternalRef\"\u003e3\u003c/span\u003eC shows that new edges account for a large proportion when combining any two subsets in group 2. However, the ratios among group 3 were much higher, suggesting that shared edges represent a large proportion. The ratios were intermediate when combining subsets in group 2 with those in group 3.
Generally speaking, this result indicates that the 5 largest subsets in group 3, which include \u0026lsquo;cellular developmental process\u0026rsquo; (GO:0048869), \u0026lsquo;cellular localization\u0026rsquo; (GO:0051641), \u0026lsquo;cellular response to stimulus\u0026rsquo; (GO:0051716)+\u0026lsquo;signal transduction\u0026rsquo; (GO:0007165)+\u0026lsquo;cell communication\u0026rsquo; (GO:0007154), \u0026lsquo;cellular component organization or biogenesis\u0026rsquo; (GO:0071840), and \u0026lsquo;cellular metabolic process\u0026rsquo; (GO:0044237), are relatively saturated. The 5 medium-sized subsets in group 2 form many more new edges when combined with each other or with subsets in group 3.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec7\" class=\"Section2\"\u003e\n \u003ch2\u003eThe distribution of TFs and kinases in the 12 major subsets\u003c/h2\u003e\n \u003cp\u003eTo dissect the distribution of TFs and kinases in the 12 major GO subsets, we first downloaded human TFs and kinases from AnimalTFDB\u003csup\u003e[\u003cspan class=\"CitationRef\"\u003e21\u003c/span\u003e]\u003c/sup\u003e and Kinome Atlas\u003csup\u003e[\u003cspan class=\"CitationRef\"\u003e22\u003c/span\u003e]\u003c/sup\u003e, respectively. In total, 1659 TFs and 1024 transcriptional cofactors (TcoFs) were obtained, and a chi-square test showed that they were more likely to be enriched in the subsets of \u0026lsquo;cell motility\u0026rsquo;, \u0026lsquo;cell death\u0026rsquo;, \u0026lsquo;cell cycle\u0026rsquo;, \u0026lsquo;cellular developmental process\u0026rsquo;, and \u0026lsquo;cellular response to stimulus\u0026rsquo; (Fig. \u003cspan class=\"InternalRef\"\u003e4\u003c/span\u003eA). In addition, 538 kinases were acquired and a similar analysis was conducted.
Figure \u003cspan class=\"InternalRef\"\u003e4\u003c/span\u003eB shows that kinases were enriched in the subsets of \u0026lsquo;cell adhesion\u0026rsquo;, \u0026lsquo;cell motility\u0026rsquo;, \u0026lsquo;cell death\u0026rsquo;, \u0026lsquo;cell cycle\u0026rsquo;, \u0026lsquo;cellular developmental process\u0026rsquo;, \u0026lsquo;cellular response to stimulus\u0026rsquo; and \u0026lsquo;cellular metabolic process\u0026rsquo;.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec8\" class=\"Section2\"\u003e\n \u003ch2\u003eSeveral cancer genomes shared a similar distribution\u003c/h2\u003e\n \u003cp\u003eFurthermore, we investigated in which subsets cancer genomes would be enriched. Several types of cancer genome were collected from The Cancer Genome Atlas (TCGA) database, including breast cancer\u003csup\u003e[\u003cspan class=\"CitationRef\"\u003e23\u003c/span\u003e]\u003c/sup\u003e, lung adenocarcinoma\u003csup\u003e[\u003cspan class=\"CitationRef\"\u003e24\u003c/span\u003e]\u003c/sup\u003e, colon and rectal cancer\u003csup\u003e[\u003cspan class=\"CitationRef\"\u003e25\u003c/span\u003e]\u003c/sup\u003e and acute myeloid leukemia\u003csup\u003e[\u003cspan class=\"CitationRef\"\u003e26\u003c/span\u003e]\u003c/sup\u003e (Table \u003cspan class=\"InternalRef\"\u003e1\u003c/span\u003e). To investigate the distribution of these cancer genomes in the 12 major GO subsets, we performed enrichment analysis and again determined statistical significance by chi-square test. Not surprisingly, the mutated genes of all 4 types of cancer genome displayed similar distributions, highly enriched in all five subsets in group 2 and also two subsets in group 3 (Fig. \u003cspan class=\"InternalRef\"\u003e5\u003c/span\u003e), which might be caused by a large number of shared mutated genes among different cancers (supplemental Fig.\u0026nbsp;2).
These results support the commonly accepted view that, although they arise in different tissue types, cancer mutations may disturb similar underlying mechanisms\u003csup\u003e[\u003cspan class=\"CitationRef\"\u003e27\u003c/span\u003e]\u003c/sup\u003e.\u003c/p\u003e\n \u003cp\u003e\u003c/p\u003e\u0026nbsp;\u003ctable id=\"Tab2\" border=\"1\"\u003e\n \u003ccaption language=\"En\"\u003e\n \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e\n \u003cdiv class=\"CaptionContent\"\u003e\n \u003cp\u003eInformation on the 4 types of cancer genome.\u003c/p\u003e\n \u003c/div\u003e\n \u003c/caption\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eCancer type\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eNumber of patients\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eMutations\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eMethods\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eRef\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eBreast cancer\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e463\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e14130\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eWhole-exome sequencing\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003csup\u003e[\u003cspan class=\"CitationRef\"\u003e23\u003c/span\u003e]\u003c/sup\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eLung adenocarcinoma\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e230\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e18253\u003c/p\u003e\n
\u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eWhole-exome sequencing, mRNA sequencing\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003csup\u003e[\u003cspan class=\"CitationRef\"\u003e24\u003c/span\u003e]\u003c/sup\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eColon and rectal cancer\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e224\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e15995\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eExome capture DNA sequencing\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003csup\u003e[\u003cspan class=\"CitationRef\"\u003e25\u003c/span\u003e]\u003c/sup\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eAcute myeloid leukemia\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e200\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e1996\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eWhole genome sequencing (50 cases), RNA sequencing,\u003c/p\u003e\n \u003cp\u003ewhole-exome sequencing (150 cases)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003csup\u003e[\u003cspan class=\"CitationRef\"\u003e26\u003c/span\u003e]\u003c/sup\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n \u003cp\u003e\u003c/p\u003e\n \u003cp\u003e\u003cbr\u003e\u003c/p\u003e\n\u003c/div\u003e"},{"header":"Discussion","content":"\u003cp\u003eThe GO knowledgebase has become increasingly complicated, not only because of its deep vertical branching but also because of imbalanced structures at the same level.
In this study, we examined the factors that affect the latter and visualized the interactions among CP\u0026rsquo;s direct child terms through shared gene networks (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e). To streamline the categorization of terms in CP, three rules (exclusion, coverage, and combination) were applied manually to condense the structure into a simplified one containing only 12 subsets. These 12 subsets encompass most of the genes in CP, are relatively independent, and allow for connections between any two of them (supplemental Fig.\u0026nbsp;1A \u0026amp; Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e). Additionally, PPI networks were used to further examine the interconnections among these 12 subsets, resulting in their classification into 3 groups. The 5 largest subsets, in group 3, collectively accounted for over 90% of the genes in CP (supplemental Fig.\u0026nbsp;1B) and exhibited saturated PPI networks, with few new edges appearing when any two of them were combined (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eC). The 5 medium-sized subsets, in group 2, displayed many new edges when combined with each other and with the 5 largest subsets in group 3 (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eC). In conclusion, the 12 GO subsets provide a more comprehensive overview of the entire cellular process.\u003c/p\u003e \u003cp\u003eNotably, an uneven distribution of terms of varying sizes at one level is very common in the GO system. Our newly developed categorization strategy relies primarily on the number of genes shared between any two terms, and its accuracy was validated by PPI-based network analysis. This strategy shows promise for application to other large GO terms. 
This study also introduced several novel metrics for network analysis, including RDL, RNL, and SR, which enhance our understanding of the network structure. Finally, the implementation of this categorization system provided valuable insights into the distribution of TFs, kinases, and cancer-related genetic alterations, which suggests its potential utility in improving our understanding of biological systems.\u003c/p\u003e"},{"header":"Methods","content":"\u003cdiv id=\"Sec11\" class=\"Section2\"\u003e \u003ch2\u003eData collection\u003c/h2\u003e \u003cp\u003e \u003cem\u003eThe GO and GO annotation\u003c/em\u003e The GO (last file loaded on 2023-11-17, AmiGO 2 version: 2.5.17) is a regularly updated online knowledgebase that serves as a standardized vocabulary for annotating genes as well as gene products\u003csup\u003e[\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e]\u003c/sup\u003e. In this study, the GO was browsed using AmiGO (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://amigo.geneontology.org/amigo/dd_browse\u003c/span\u003e\u003cspan address=\"https://amigo.geneontology.org/amigo/dd_browse\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e)\u003csup\u003e[\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e]\u003c/sup\u003e. 
All human-related terms and annotations under \u0026lsquo;cellular process\u0026rsquo; (GO:0009987) were manually downloaded from \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.geneontology.org\u003c/span\u003e\u003cspan address=\"https://www.geneontology.org\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/p\u003e \u003cp\u003e \u003cem\u003eTFs, TcoFs, protein kinases, and cancer-mutated genes\u003c/em\u003e The Animal Transcription Factor Database (AnimalTFDB)\u003csup\u003e[\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e]\u003c/sup\u003e is a comprehensive database of animal TFs. In this study, data for 1659 human TFs and 1024 human TcoFs were downloaded from the AnimalTFDB database (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://guolab.wchscu.cn/AnimalTFDB4/#/\u003c/span\u003e\u003cspan address=\"https://guolab.wchscu.cn/AnimalTFDB4/#/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e). The Kinome Atlas (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttp://cellimagelibrary.org/pages/kinome_atlas\u003c/span\u003e\u003cspan address=\"http://cellimagelibrary.org/pages/kinome_atlas\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e)\u003csup\u003e[\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e]\u003c/sup\u003e is a comprehensive imaging database of the kinome with compartmental annotations at the subcellular level. 
We downloaded 538 kinases encoded by the human genome\u003csup\u003e[\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e]\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eTCGA (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.cancer.gov/tcga\u003c/span\u003e\u003cspan address=\"https://www.cancer.gov/tcga\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e) is one of the most successful cancer genomic programs that generates and analyzes cancer genomes. Here, 1996 mutated genes in AML\u003csup\u003e[\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e]\u003c/sup\u003e, 14130 mutated genes in breast cancer\u003csup\u003e[\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e]\u003c/sup\u003e, 18253 mutated genes in lung adenocarcinoma\u003csup\u003e[\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e]\u003c/sup\u003e and 15995 mutated genes in colon and rectal cancer\u003csup\u003e[\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e]\u003c/sup\u003e were collected. 
The mutation types included were \u0026ldquo;Missense_Mutation\u0026rdquo;, \u0026ldquo;Nonsense_Mutation\u0026rdquo;, \u0026ldquo;Splice_Site\u0026rdquo;, and \u0026ldquo;Frame_Shift_Del\u0026rdquo;, among others.\u003c/p\u003e \u003cp\u003e \u003cem\u003eProtein-protein\u0026ensp;interaction\u003c/em\u003e The Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) module in Cytoscape 3.10.1\u003csup\u003e[\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e]\u003c/sup\u003e was used with default settings to obtain the human PPIs among the genes of one or two subsets.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec12\" class=\"Section2\"\u003e \u003ch2\u003eCalculation of the RDL between two terms\u003c/h2\u003e \u003cp\u003eTo obtain the number of overlapping genes, we developed Python scripts to process terms obtained from the GO. These scripts require a Python environment with pandas and itertools\u003csup\u003e[\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e]\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eWe introduced the concept of the RDL as the basis for comparing interactions between two terms. Consider two terms \u0026lsquo;a\u0026rsquo; and \u0026lsquo;b\u0026rsquo; with annotated gene sets \u0026lsquo;A\u0026rsquo; and \u0026lsquo;B\u0026rsquo;, respectively. 
The RDL of term \u0026lsquo;a\u0026rsquo; is:\u003cdiv id=\"Equa\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equa\" name=\"EquationSource\"\u003e\n$${\\text{RDL}}_{\\text{ab/a}}=\\frac{|\\text{A}\\cap\\text{B}|}{|\\text{A}|}$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eThe RDL of term \u0026lsquo;b\u0026rsquo; is:\u003cdiv id=\"Equb\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equb\" name=\"EquationSource\"\u003e\n$${\\text{RDL}}_{\\text{ab/b}}=\\frac{|\\text{A}\\cap\\text{B}|}{|\\text{B}|}$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eHere, |A\u0026cap;B| denotes the number of overlapping genes between gene sets \u0026lsquo;A\u0026rsquo; and \u0026lsquo;B\u0026rsquo;.\u003c/p\u003e \u003cp\u003eThe resulting network was visualized using Gephi 0.10\u003csup\u003e[\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e]\u003c/sup\u003e, with nodes representing terms and edges representing the RDLs.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec13\" class=\"Section2\"\u003e \u003ch2\u003eCalculation of the RNL and SR based on PPI networks\u003c/h2\u003e \u003cp\u003eWe compared the number of new edges of two subsets with the number of edges in set A or B, and defined the ratio as the RNL. 
The formulas are as follows:\u003cdiv id=\"Equc\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equc\" name=\"EquationSource\"\u003e\n$${\\text{RNL}}_{\\text{ab/a}}=\\frac{|\\text{Ne}|}{|\\text{Ae}|}$$\u003c/div\u003e\u003c/div\u003e\u003cdiv id=\"Equd\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equd\" name=\"EquationSource\"\u003e\n$${\\text{RNL}}_{\\text{ab/b}}=\\frac{|\\text{Ne}|}{|\\text{Be}|}$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eWe further compared the number of shared edges of two subsets with the number of new edges, and denoted the ratio as SR. The formula is as follows:\u003cdiv id=\"Eque\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Eque\" name=\"EquationSource\"\u003e\n$${\\text{SR}}_{\\text{ab}}=\\frac{|\\text{Se}|}{|\\text{Ne}|}$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eHere, |Ne| denotes the number of new edges, |Ae| and |Be| denote the number of edges in set A and set B, respectively, and |Se| denotes the number of shared edges.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec14\" class=\"Section2\"\u003e \u003ch2\u003eThe application of the 12 major GO subsets\u003c/h2\u003e \u003cp\u003eWe calculated the distribution of TFs, TcoFs, protein kinases, and cancer genomes across the 12 major GO subsets. For each subset, the expected count (expectation) was calculated by multiplying the total number of tested genes by the percentage that the subset accounts for among the 12 major subsets. The observed count (observation) was the number of genes shared between the test data and that subset. Chi-square tests were used to compare the observation and expectation groups. 
When expected values were less than 5, Yates\u0026rsquo; continuity correction was applied.\u003c/p\u003e \u003c/div\u003e"},{"header":"Declarations","content":"\u003cp\u003e \u003ch2\u003eCompeting interests\u003c/h2\u003e \u003cp\u003eThe authors declare no competing interests.\u003c/p\u003e \u003c/p\u003e\u003ch2\u003eAuthor Contribution\u003c/h2\u003e\u003cp\u003eJ.Z. conceived and designed the study. Y.L., J.H., and Y.W. acquired the data. Y.L., J.Z., and R.L. conducted data analysis and result interpretation. J.Z., Y.L., and R.L. wrote and revised the manuscript.\u003c/p\u003e\u003ch2\u003eAcknowledgement\u003c/h2\u003e\u003cp\u003eWe would like to thank Beijing University of Chinese Medicine for use of its shared services to complete this research. This work was supported by the Young Scientists Fund of the National Natural Science Foundation of China (No. 31701280 to J. Zhang).\u003c/p\u003e\u003ch2\u003eData Availability\u003c/h2\u003e\u003cp\u003eData is provided within the manuscript or supplementary information files.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eASHBURNER M, BALL C A, BLAKE J A, et al. Gene Ontology: tool for the unification of biology [J]. Nature Genetics, 2000, 25(1): 25\u0026ndash;9.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDU PLESSIS L, SKUNCA N, DESSIMOZ C. The what, where, how and why of gene ontology\u0026ndash;a primer for bioinformaticians [J]. Briefings in bioinformatics, 2011, 12(6): 723\u0026ndash;35.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eALEKSANDER S A, BALHOFF J, CARBON S, et al. The Gene Ontology knowledgebase in 2023 [J]. Genetics, 2023, 224(1).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMANJANG K, TRIPATHI S, YLI-HARJA O, et al. Graph-based exploitation of gene ontology using GOxploreR for scrutinizing biological significance [J]. 
Scientific Reports, 2020, 10(1): 16672.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZEEBERG B R, FENG W, WANG G, et al. GoMiner: a resource for biological interpretation of genomic and proteomic data [J]. Genome biology, 2003, 4(4): R28.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMAERE S, HEYMANS K, KUIPER M. BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks [J]. Bioinformatics (Oxford, England), 2005, 21(16): 3448\u0026ndash;9.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSEALFON R S, HIBBS M A, HUTTENHOWER C, et al. GOLEM: an interactive graph-based gene-ontology navigation and analysis tool [J]. BMC bioinformatics, 2006, 7: 443.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eEDEN E, NAVON R, STEINFELD I, et al. GOrilla: a tool for discovery and visualization of enriched GO terms in ranked gene lists [J]. BMC bioinformatics, 2009, 10(1): 48.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWEI Q, KHAN I K, DING Z, et al. NaviGO: interactive tool for visualization and functional similarity and coherence analysis with gene ontology [J]. BMC bioinformatics, 2017, 18(1): 177.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZHANG B, KIROV S, SNODDY J. WebGestalt: an integrated system for exploring gene sets in various biological contexts [J]. Nucleic acids research, 2005, 33(Web Server issue): W741\u0026ndash;8.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWANG J, VASAIKAR S, SHI Z, et al. WebGestalt 2017: a more comprehensive, powerful, flexible and interactive gene set enrichment analysis toolkit [J]. Nucleic acids research, 2017, 45(W1): W130\u0026ndash;7.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePOMAZNOY M, HA B, PETERS B. GOnet: a tool for interactive Gene Ontology analysis [J]. 
BMC bioinformatics, 2018, 19(1): 470.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFALCON S, GENTLEMAN R. Using GOstats to test gene lists for GO term association [J]. Bioinformatics (Oxford, England), 2007, 23(2): 257\u0026ndash;8.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZEEBERG B R, LIU H, KAHN A B, et al. RedundancyMiner: De-replication of redundant GO categories in microarray and proteomics analysis [J]. BMC bioinformatics, 2011, 12(1): 52.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSUPEK F, BOŠNJAK M, ŠKUNCA N, et al. REVIGO summarizes and visualizes long lists of gene ontology terms [J]. PloS one, 2011, 6(7): e21800.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMERICO D, ISSERLIN R, STUEKER O, et al. Enrichment map: a network-based method for gene-set enrichment visualization and interpretation [J]. PloS one, 2010, 5(11): e13984.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBINDEA G, MLECNIK B, HACKL H, et al. ClueGO: a Cytoscape plug-in to decipher functionally grouped gene ontology and pathway annotation networks [J]. Bioinformatics (Oxford, England), 2009, 25(8): 1091\u0026ndash;3.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGROSSMANN S, BAUER S, ROBINSON P N, et al. Improved detection of overrepresentation of Gene-Ontology annotations with parent child analysis [J]. Bioinformatics (Oxford, England), 2007, 23(22): 3024\u0026ndash;31.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMARTIN D, BRUN C, REMY E, et al. GOToolBox: functional analysis of gene datasets based on Gene Ontology [J]. Genome biology, 2004, 5(12): R101.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSZKLARCZYK D, KIRSCH R, KOUTROULI M, et al. The STRING database in 2023: protein-protein association networks and functional enrichment analyses for any sequenced genome of interest [J]. 
Nucleic acids research, 2023, 51(D1): D638\u0026ndash;46.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSHEN W K, CHEN S Y, GAN Z Q, et al. AnimalTFDB 4.0: a comprehensive animal transcription factor database updated with variation and expression annotations [J]. Nucleic acids research, 2023, 51(D1): D39\u0026ndash;45.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZHANG H, CAO X, TANG M, et al. A subcellular map of the human kinome [J]. eLife, 2021, 10.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKOBOLDT D C, FULTON R S, MCLELLAN M D, et al. Comprehensive molecular portraits of human breast tumours [J]. Nature, 2012, 490(7418): 61\u0026ndash;70.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCOLLISSON E A, CAMPBELL J D, BROOKS A N, et al. Comprehensive molecular profiling of lung adenocarcinoma [J]. Nature, 2014, 511(7511): 543\u0026ndash;50.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMUZNY D M, BAINBRIDGE M N, CHANG K, et al. Comprehensive molecular characterization of human colon and rectal cancer [J]. Nature, 2012, 487(7407): 330\u0026ndash;7.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLEY T J, MILLER C, DING L, et al. Genomic and epigenomic landscapes of adult de novo acute myeloid leukemia [J]. The New England journal of medicine, 2013, 368(22): 2059\u0026ndash;74.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKANDOTH C, MCLELLAN M D, VANDIN F, et al. Mutational landscape and significance across 12 major cancer types [J]. Nature, 2013, 502(7471): 333\u0026ndash;9.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCARBON S, IRELAND A, MUNGALL C J, et al. AmiGO: online access to ontology and annotation data [J]. Bioinformatics, 2008, 25(2): 288\u0026ndash;9.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMUZNY D M, BAINBRIDGE M N, CHANG K, et al. Comprehensive molecular characterization of human colon and rectal cancer [J]. 
Nature, 2012, 487(7407): 330\u0026ndash;7.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSHANNON P, MARKIEL A, OZIER O, et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks [J]. Genome research, 2003, 13(11): 2498\u0026ndash;504.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMCKINNEY W. Data Structures for Statistical Computing in Python [C]. Proceedings of the 9th Python in Science Conference (SciPy), 2010.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBASTIAN M, HEYMANN S, JACOMY M. Gephi: an open source software for exploring and manipulating networks [C]. Proceedings of the International AAAI Conference on Web and Social Media, 2009.\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"gene ontology, terms, subsets, cellular process","lastPublishedDoi":"10.21203/rs.3.rs-4581229/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-4581229/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eAs the Gene Ontology (GO) knowledgebase becomes more and more complicated, it is difficult for researchers to follow and get a comprehensive overview of biological processes. Here, we generated a classification strategy through carefully investigating the genes any two terms shared. Using this strategy, we categorized the 66 direct child terms of the cellular process into 12 major subsets, and the interactions among them were further confirmed by studying the protein-protein interaction based networks. Subsequently, these 12 subsets were used to investigate the distribution of transcription factors, kinases and also several cancer genomes. Above all, the 12-GO-subsets provide researchers a more comprehensive overview of the cellular process, and the categorizing strategy developed herein can be utilized to characterize other large GO terms.\u003c/p\u003e","manuscriptTitle":"Generation of the 12-GO-Subsets to Interpretate Human Cellular Process","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2024-07-10 15:26:57","doi":"10.21203/rs.3.rs-4581229/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"f242ee4a-3d2c-4fac-a079-eb50c61618cf","owner":[],"postedDate":"July 10th, 2024","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[{"id":33969240,"name":"Biological sciences/Computational biology and bioinformatics"},{"id":33969241,"name":"Biological sciences/Systems biology"}],"tags":[],"updatedAt":"2024-12-13T05:23:53+00:00","versionOfRecord":[],"versionCreatedAt":"2024-07-10 15:26:57","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-4581229","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-4581229","identity":"rs-4581229","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
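The RDL defined in the Methods, that is, the share of one term's annotated genes that it has in common with another term, can be illustrated with a short Python sketch. This is a minimal illustration under stated assumptions and is not the authors' script: the function name `rdl`, the dictionary input format, and the toy term and gene names are all hypothetical, and only the standard-library `itertools` module is used, whereas the paper's scripts also rely on pandas.

```python
from itertools import permutations


def rdl(annotations):
    """Return {(a, b): |A & B| / |A|} for every ordered pair of terms.

    `annotations` maps a term name to the set of genes annotated to it
    (hypothetical input format). The key (a, b) corresponds to RDL_ab/a,
    i.e. the overlap is normalised by the size of the first term's gene set.
    """
    ratios = {}
    for a, b in permutations(annotations, 2):
        genes_a, genes_b = annotations[a], annotations[b]
        if genes_a:  # skip terms without annotations to avoid division by zero
            ratios[(a, b)] = len(genes_a & genes_b) / len(genes_a)
    return ratios


# Toy data: two made-up terms sharing two genes (G3 and G4).
toy = {
    "term_a": {"G1", "G2", "G3", "G4"},
    "term_b": {"G3", "G4", "G5"},
}
result = rdl(toy)
print(result[("term_a", "term_b")])  # RDL_ab/a
print(result[("term_b", "term_a")])  # RDL_ab/b
```

Because the two ratios are normalised by different set sizes, the measure is asymmetric: a small term that is largely contained in a large term has a high RDL with respect to itself, while the large term has a low one. A directed edge list of this kind could then be exported for visualization in a tool such as Gephi.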