CoMAP: A Program to Cluster Pathways Overrepresented with Specific Cofactors in Human, Mouse, and Yeast Biology | Research Square
Isabella K. Barichello, Travis J. A. Craddock
This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-6605908/v1
This work is licensed under a CC BY 4.0 License
Status: Posted (Version 1)

Abstract
Background: Protein cofactors, such as metal ions, are an essential part of many proteins, playing key structural, regulatory, and enzymatic roles. Without these cofactors, roughly one-third of all proteins would cease functioning properly. Deficiencies of these cofactors can have detrimental effects on health, contributing to the development of various diseases. The same is true for mice and yeast. It is evident then that the concentrations of cofactors must be carefully maintained so as not to damage the organism.
The Cofactor Mapping & Analysis Program (CoMAP) allows users to determine the pathways in human, mouse, and yeast biology where any cofactor crucial to protein function is significantly involved. This tool enables the identification of biological processes and specific pathways within an organism that are influenced by changes in cofactor concentrations, providing deeper insight into cofactor-dependent proteins and their involvement in biology. Results: To our knowledge, no other bioinformatics tool exists with the same functionality as CoMAP. Via a graphical user interface, CoMAP constructs a list of the Ensembl gene IDs encoding proteins that contain the specified cofactor(s), performs an overrepresentation analysis on these genes to identify the significant pathways in the organism, and hierarchically clusters these pathways based on similarities in their gene sets. A use case investigating ferroptosis in humans was carried out with this method. The example application took iron(II), iron(III), and later iron-sulfur clusters as the selected cofactors, identifying the pathways in human biology that would be affected by ferroptosis. The program returned pathways that have been experimentally shown to be impacted by ferroptosis in addition to novel pathways. The CoMAP Python script is available at https://github.com/tcraddock/CoMAP. Conclusions: CoMAP provides insight into the cellular functions most likely to be affected by a depletion or augmentation of the cofactor(s). CoMAP has potential applications in bioengineering biological pathways, advancing personalized medicine, and elucidating new ways to treat diseases resulting from cofactor deficiency and/or imbalance.
Keywords: Protein cofactor, biological pathways, overrepresentation analysis, hierarchical clustering, ferroptosis

Background
Protein cofactors, which are non-protein chemical compounds or metal ions, play a fundamental role in sustaining life on Earth, serving as essential components in a wide array of biochemical and cellular processes across all domains of life [1]. The evolution of cofactor utilization in biological systems is thought to reflect the availability and functionality of these molecules in various environments [1]. Approximately one-third of all proteins depend on cofactors, ranging from metal ions to organic molecules, to carry out their proper biological functions [2]. These cofactors may serve catalytic, regulatory, structural, or transport-related roles within proteins. The study of protein-cofactor interactions, known broadly as cofactor biology or cofactor proteomics, includes understanding the uptake, processing, binding, and physiological roles of these molecules in cellular function [3]. Cofactors are broadly categorized into two types: organic cofactors (such as vitamins, flavins, and nucleotide derivatives like FAD and NAD) and inorganic cofactors (commonly metal ions like magnesium, zinc, and iron) [4]. These molecules are widely conserved among organisms, although their specific usage and abundance can vary across species [5]. Cofactors are involved in a diverse set of biological functions, including enzyme catalysis, signal transduction, gene expression, cellular respiration and photosynthesis, structural stabilization of proteins and nucleic acids, and the metabolism of drugs and nutrients [2, 6, 7]. In human biology, a range of cofactors, both metallic and non-metallic, are known to be essential. For example, NAD and FAD are key electron carriers in metabolic pathways [4], while metal ions such as Mg²⁺, Ca²⁺, Zn²⁺, and Fe²⁺ are indispensable for structural and enzymatic roles [8].
Deficiencies in these cofactors can lead to widespread physiological dysfunction. For instance, magnesium is required for over 300 enzymatic reactions, including those involved in nerve function and energy production [8, 9], while iron, as part of the heme cofactor in hemoglobin, is essential for oxygen transport [8]. Animal models such as mice show a similar dependence on both metal and organic cofactors for cellular and systemic function. Disruption in cofactor availability has been linked to metabolic impairments, developmental issues, and organ toxicity [6, 10-12]. Similarly, in microorganisms like yeast, cofactors such as zinc, magnesium, iron, and flavins are critical for enzyme activity, membrane potential, and stress responses [13, 14]. For instance, the yeast proteome contains hundreds of zinc-binding proteins, and deficiencies in zinc or magnesium lead to impaired growth and cellular stress [13, 15]. Importantly, dysregulation of cofactor homeostasis, whether metal-based or organic, has been implicated in a wide range of diseases, including neurodegenerative disorders (such as Alzheimer's and Parkinson's), cardiovascular diseases, cancer, and psychiatric conditions [7]. The precise regulation of cofactor concentrations within the cell is therefore a key determinant of health and resilience against disease. Maintaining this balance involves complex cellular networks that tightly control the synthesis, uptake, transport, and recycling of both metal ions and organic cofactors. The difficulty is that, with the vast number of molecules playing a role in biological pathways, it is hard to isolate the pathways in an organism in which a specific cofactor is disproportionately involved.
To our knowledge, existing resources do not provide a method to select cofactors of interest, identify the significant pathways involving them, and provide several cluster combinations that indicate the biological processes of an organism most likely to be influenced by these cofactors. The Cofactor Mapping & Analysis Program (CoMAP) provides these functions. This simple and user-friendly resource takes in the cofactors and organism of interest and clusters the overrepresented pathways based on their protein-coding genes, which encode the cofactor-dependent molecules. CoMAP is designed as a versatile tool for mapping relevant pathways within an organism and enabling the straightforward visualization of their relationships, offering a novel approach to exploring cofactor biology. As mentioned previously, cofactor levels must be carefully maintained to avoid detrimental effects. This program makes it possible to see both the broader biological functions and specific pathways in an organism that would be affected by a surplus or deficiency of certain cofactors, improving our understanding of cofactor-dependent polymers.

Implementation
CoMAP was created in-house using Python 3.10 and was both developed and run in the Google Colaboratory environment [16]. The large language model ChatGPT was occasionally used to assist with source code development and served as a tool for troubleshooting when necessary. The source code can be accessed at https://github.com/tcraddock/CoMAP. The execution of CoMAP can be summarized as follows.

Main Program
Curation of Protein List using the Protein Data Bank: Using the rcsb-api package in Python, a specific query is constructed consisting of a source organism, polymer entity type(s), and cofactor name(s). Polymer entity types include protein, DNA, RNA, nucleic acid-hybrid (NA-hybrid), and other.
The application programming interface (API) combs through the entries in the Protein Data Bank (PDB), saving the results that meet all user-specified criteria. Custom reports for each saved result are then obtained from the PDB. The reports include the entry ID, PDB ID, gene name, macromolecule name, and accession code(s) of the polymers that comprise each molecule [17]. Returning this exact report format was achieved through using the corresponding GraphQL query provided by PDB. A total of 5,000 entries can be processed in a single batch. If more than 5,000 entries are returned the batches are combined into one. Conversion of Accession Codes to Ensembl IDs: A unique list of accession codes is created from the returned PDB results. These codes are converted using the DAVID gene ID conversion tool [18, 19]. Selenium WebDriver, an automation tool that simulates a person interacting with a website, uploads and submits the list to the conversion tool [18, 19]. Prior to submitting the list, Selenium WebDriver selects both the desired ID type—Ensembl gene ID in this case—and the source organism. To avoid any performance issues or errors when submitting large lists, accession codes are uploaded in batches of 500. If more than one batch is needed, the converted results are combined into one at the completion of this step. Creation of Unique Ensembl Gene List: The PDB file from the first step is updated to include the corresponding Ensembl gene ID(s) for each accession code. A list of the unique Ensembl gene IDs is generated. Both the PDB file updated with Ensembl gene IDs, and the unique gene list can be selected for download. Overrepresentation Analysis : Selenium WebDriver is used to navigate to the correct ConsensusPathDB (CPDB) biological pathways databank for the user-specified source organism. The biological pathways databank is downloaded to a temporary folder with the pathway genes being identified by Ensembl ID [20-23]. 
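The batching described for the accession-code conversion step can be sketched as follows. This is a minimal illustration under stated assumptions, not CoMAP's actual code: `convert_batch` is a hypothetical stand-in for the Selenium-driven DAVID submission, and the function names are invented for this example. Only the batch size of 500 and the merging of batch results come from the description above.

```python
from typing import Callable, Dict, Iterable, Iterator, List


def batched(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def convert_in_batches(
    accessions: Iterable[str],
    convert_batch: Callable[[List[str]], Dict[str, List[str]]],
    size: int = 500,
) -> Dict[str, List[str]]:
    """Convert unique accession codes batch by batch and merge the results.

    `convert_batch` is a hypothetical callable that maps each accession
    code in a batch to a list of Ensembl gene IDs.
    """
    # Drop duplicates while keeping the original order.
    unique = list(dict.fromkeys(accessions))
    merged: Dict[str, List[str]] = {}
    for batch in batched(unique, size):
        merged.update(convert_batch(batch))
    return merged
```

Passing the converter in as a callable keeps the batching logic independent of how the conversion is performed, so the same loop would work whether the IDs are converted through a browser session or some other service.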
An overrepresentation analysis is then performed on these pathways using several imported Python functions. Because “Service Unavailable” errors were encountered when working with CPDB’s SOAP/WSDL interface, scripting the overrepresentation analysis directly proved the most suitable solution. To calculate the p-value for each pathway, the hypergeom function is imported from the scipy.stats module and used with its survival function. The p-value is the probability of observing, by chance alone, an overlap between the pathway and the unique gene list at least as large as the one found. A smaller p-value indicates the result is less likely to be due to chance and more likely to reflect a real biological effect. The calculation takes four variables: the total number of genes in the specific pathway being analyzed (n), the number of overlapping genes between the specific pathway and the unique gene list (x), the total background size (N), and the number of mapped entities (M). N and M are pulled from CPDB using the Selenium WebDriver tool. In this program, N represents the total number of genes that are present in at least one CPDB pathway and identifiable by Ensembl ID [20-23]. This value is dependent on the source organism. M is the number of genes from the unique gene list that are present in at least one CPDB pathway, and it varies for each unique gene list [20-23]. To calculate the q-values, the multipletests function is imported from the statsmodels.stats.multitest module and the Bonferroni method is used. The q-value adjusts the p-value for multiple testing. In this overrepresentation analysis, many gene sets are tested, which increases the chance of false positives. The q-value accounts for this, determining whether a result is still significant after correcting for multiple tests. The Bonferroni method was selected to calculate the q-values because it is highly conservative, which reduces false positives. The pathways are then filtered depending on their q-values.
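The p-value and q-value calculation described above can be sketched as follows. This is an illustrative sketch rather than CoMAP's source: the function names and toy numbers are hypothetical, and it assumes the conventional upper-tail probability P(X ≥ x), which is why the survival function is evaluated at x - 1. Note that SciPy's own parameter names differ from the symbols used in the text. The Bonferroni step is written with NumPy to keep the example small; it yields the same adjusted values that `multipletests(..., method="bonferroni")` returns.

```python
import numpy as np
from scipy.stats import hypergeom


def ora_pvalue(x: int, n: int, N: int, M: int) -> float:
    """Upper-tail hypergeometric p-value, P(X >= x).

    Symbols follow the text: x = overlap, n = pathway size,
    N = background size, M = mapped genes from the unique gene list.
    SciPy's argument order is sf(k, population, successes, draws).
    """
    return float(hypergeom.sf(x - 1, N, n, M))


def bonferroni(pvalues):
    """Bonferroni adjustment: multiply by the number of tests, cap at 1."""
    p = np.asarray(pvalues, dtype=float)
    return np.minimum(p * p.size, 1.0)


# Hypothetical toy input: pathway name -> (overlap x, pathway size n)
pathways = {"pathway_A": (4, 10), "pathway_B": (1, 50), "pathway_C": (0, 20)}
N, M = 1000, 20          # assumed background size and mapped list size
threshold = 0.05         # user-supplied q-value threshold

names = list(pathways)
pvals = [ora_pvalue(x, n, N, M) for x, n in pathways.values()]
qvals = bonferroni(pvals)

# Keep pathways whose q-value is at or below the threshold.
kept = [name for name, q in zip(names, qvals) if q <= threshold]
```

With these toy numbers only `pathway_A` survives the filter, since an overlap of 4 genes in a 10-gene pathway is far above the expected overlap of 0.2.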
Any pathway with a q-value less than or equal to the user input value is retained, the others are discarded. The list of filtered overrepresented pathways can be selected for download. This file provides information on the pathway name, the source the pathway was retrieved from, the identifier used in said source to identify the pathway, the Ensembl IDs of the genes from the unique gene list that are found in the pathway, the percentage of genes in the pathway that are found in the unique gene list, and the p- and q- values of the pathway. Hierarchical Clustering and Cluster Analysis Output : The filtered overrepresented pathways are clustered to account for similarities between pathways from the multiple sources compiled into CPDB. To assess the similarities between different biological pathways, agglomerative hierarchical clustering is done. In this program, the similarity between two pathways is defined as the ratio of the number of genes shared by both pathways to the total number of unique genes when the gene sets of both pathways are combined. For example, if pathway A has 10 genes, pathway B has 15 genes, and 5 genes are present in both pathway A and B, the similarity of the two pathways would be calculated as 5/20=0.25. A similarity matrix is created using the calculated similarity for every pair of overrepresented pathways. To ensure the proper matrix shape, the squareform function is used and imported from the scipy.spatial.distance module. The similarity matrix is then converted to a distance matrix by subtracting each entry from the number one, so that a smaller number signifies a greater relation between pathways. The hierarchical clustering is performed with the linkage function using the average method. The fcluster function assigns the pathways a unique cluster label from the given distance matrix. Both functions are imported from the scipy.cluster.hierarchy module. There are two variables of importance to note. 
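The similarity, distance, and clustering steps described above can be sketched as follows. This is a minimal sketch under assumptions, not the CoMAP implementation: the function names and toy gene sets are hypothetical, and `criterion="distance"` in `fcluster` is an assumed choice that the text does not state. The similarity measure itself (shared genes divided by the union of genes) is the Jaccard index.

```python
from itertools import combinations

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def jaccard(a, b) -> float:
    """Shared genes divided by the union of both gene sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_pathways(gene_sets: dict, cutoff: float):
    """Cluster pathways (name -> gene set) by average-linkage on 1 - Jaccard."""
    names = list(gene_sets)
    k = len(names)
    sim = np.eye(k)
    for i, j in combinations(range(k), 2):
        sim[i, j] = sim[j, i] = jaccard(gene_sets[names[i]], gene_sets[names[j]])
    dist = 1.0 - sim                      # smaller value = more closely related
    condensed = squareform(dist)          # square matrix -> condensed vector
    tree = linkage(condensed, method="average")
    labels = fcluster(tree, t=cutoff, criterion="distance")  # assumed criterion
    return dict(zip(names, labels)), sim


# The worked example from the text: 10 genes, 15 genes, 5 shared.
example = jaccard(range(10), range(5, 20))   # 5 / 20 = 0.25

# Hypothetical toy pathways (at least two are needed for linkage).
toy = {"A": {1, 2, 3, 4}, "B": {1, 2, 3, 5}, "C": {10, 11}}
labels, sim = cluster_pathways(toy, cutoff=0.5)
```

In the toy data, pathways A and B share three of five genes (distance 0.4) and fall into one cluster at a cutoff of 0.5, while C shares no genes with either and stays on its own.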
The first is the cutoff value, which is the maximum allowed percentage of dissimilarity between two pathways. A cutoff value of 0.2 means that the pathways in each cluster can be at most 20% different. The second is the minimum number of pathways variable. This variable dictates how many pathways must be grouped together to be considered a cluster. A heatmap is generated and output to the terminal using every combination of these two variables to provide a simple visualization of how many clusters are produced for each combination. The cutoff value ranges from 0.05 to 0.95 in 0.05 increments and the minimum number of pathways ranges from 1 to 8 in whole number increments. For simplicity, each combination of cutoff value and minimum number of pathways will be termed a set. At the completion of the program, a final cluster analysis file can be downloaded. This file contains information on each set. Specifically, it details the number of clusters, percentage of pathways expressed, percentage of genes expressed, average compactness, and word frequency of each set. The compactness of a cluster is the sum of the similarities (taken from the similarity matrix) for every unique pair of pathways in the cluster, divided by the number of pairs. For example, if a cluster consists of pathways A, B, and C, the cluster compactness will be the average of the similarities between pathways A and B, A and C, and B and C. The average compactness of a set is then the average of the compactness for every cluster in that set. Unique pairs are found using the combinations function from the itertools module. Finally, the word frequency reported in the file lists the most frequent word(s) in the pathway names of each cluster. The program excludes common filler words and other terms deemed unhelpful, including human, mouse, budding yeast, and their scientific names. Words from the same cluster are joined by commas and words from different clusters are separated by semicolons.
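The compactness and word-frequency calculations described above can be sketched as follows. This is an illustrative sketch, not CoMAP's code: the function names, the stop-word list, and the tokenizing pattern are assumptions, and returning `None` for a single-pathway cluster (which has no pairs) is an assumed convention that the text does not specify. Hyphenated words are kept whole, as the program treats them as a single word.

```python
import re
from collections import Counter
from itertools import combinations

# Illustrative stop-word list only; the program's actual list is not given here.
STOP_WORDS = {
    "of", "and", "the", "in", "to", "by", "human", "mouse", "budding", "yeast",
    "homo", "sapiens", "mus", "musculus", "saccharomyces", "cerevisiae",
}


def cluster_compactness(members, sim):
    """Mean pairwise similarity over every unique pair of pathway indices."""
    pairs = list(combinations(members, 2))
    if not pairs:
        return None  # assumed convention for single-pathway clusters
    return sum(sim[i][j] for i, j in pairs) / len(pairs)


def cluster_label(pathway_names, stop_words=STOP_WORDS) -> str:
    """Most frequent word(s) across the pathway names, ties joined by commas."""
    words = []
    for name in pathway_names:
        # Hyphenated terms such as "iron-sulfur" stay together as one word.
        tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", name.lower())
        words.extend(t for t in tokens if t not in stop_words)
    if not words:
        return ""
    counts = Counter(words)
    top = max(counts.values())
    return ", ".join(sorted(w for w, c in counts.items() if c == top))


def set_word_frequency(clusters) -> str:
    """Join the label of each cluster in a set with semicolons."""
    return "; ".join(cluster_label(names) for names in clusters)


# Toy similarity matrix for pathways A, B, C (indices 0, 1, 2).
sim = [[1.0, 0.6, 0.2], [0.6, 1.0, 0.4], [0.2, 0.4, 1.0]]
compactness_abc = cluster_compactness([0, 1, 2], sim)  # mean of 0.6, 0.2, 0.4
```

Sorting tied words alphabetically is a choice made here only to keep the output deterministic.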
Hyphenated words are treated as a single word.

Secondary Program
Specific Set Download: After executing the main program and reviewing the cluster analysis file, specific sets of the user’s choice can be downloaded. The downloaded file has four columns. The first is the cluster ID, the numeric value given to identify that cluster. This column does not provide helpful information on the pathways but is essential for the user to pair clusters with their corresponding genes. More information on gene cluster membership can be found in the “Graphical User Interface (GUI)” subsection. The second column is the list of pathways within the cluster. The third column is the cluster compactness score. This value is calculated as described in the “Hierarchical Clustering and Cluster Analysis Output” step. Finally, the fourth column is the cluster label. This is the most frequently appearing word(s) in the pathway names of each cluster. Clusters are determined through hierarchical clustering, following the same procedure as described in the “Hierarchical Clustering and Cluster Analysis Output” step. However, as opposed to iterating through every set, this function only develops clusters for the exact cutoff value and minimum number of pathways value entered by the user. This function can be run multiple times after the completion of the first program, so the user can download all the sets they are interested in.

Results and Discussion
Graphical User Interface (GUI)
There are two graphical user interfaces, one for each program, that open at the beginning of each program’s execution. In the Google Colaboratory environment [16], the GUIs are found at the bottom of the code block in the output terminal. Main Program GUI: The main program has several user-specified values in the GUI. The first option is a dropdown to select the source organism, the organism from which the protein or other molecule originates.
The choices are Homo sapiens (human), Mus musculus (mouse), and Saccharomyces cerevisiae (yeast), the options being restricted to the available CPDB databanks. Next is the polymer entity type selection. A default polymer entity already exists and cannot be removed. The dropdown is initially set to “protein” but can be changed to any of the following polymer types offered in the PDB: protein, DNA, RNA, NA-hybrid, and other. A green “Add Polymer Entity” button and a red “Remove Polymer Entity” button allow the user to submit between 1 and 5 polymer types. The grey “AND/OR Button” determines how multiple polymers will be queried. An “AND” selection means that the macromolecule must contain all the selected polymers in the structure. An “OR” selection means that the macromolecule must contain at least one of the specified polymers. The “Remove Polymer Entity” and “AND/OR Button” buttons are disabled to start. They are automatically enabled when a new polymer is added. The “Add Polymer Entity” button becomes disabled once 5 polymer entries exist. The cofactor name selection follows. At least one cofactor must be submitted when running the code. As such, one cofactor entry already exists and cannot be removed. The cofactor name can be input using either the dropdown option, which includes a handful of pre-selected metal and non-metal cofactors from the PDB, or the custom option. The custom option reveals a textbox where a cofactor of the user’s choosing can be entered. The cofactor text must be entered exactly as it appears in the PDB for the search to work properly. Additionally, there is a green “Add Cofactor” button and a red “Remove Cofactor” button, allowing the user to add and remove as many cofactors as desired. The grey “AND/OR Button” acts the same as for the polymer entities. An “AND” selection means that the molecules must contain all the selected cofactors. An “OR” selection means that the molecules must contain at least one of the specified cofactors.
The “Remove Cofactor” and “AND/OR Button” buttons are also disabled to begin and are enabled once a second cofactor is added. Next, a threshold q-value is entered into a textbox. This value is the user’s limit for whether an overrepresented pathway is still significant after adjusting for multiple tests. Typically, a value of 0.05 (5%) or less is used. Values can be entered by either typing out the decimal number or using scientific notation (E or e). A zero or negative value will default to 0.05 upon submission. The last four dropdowns are true or false. They each dictate whether a different file from some step in the code will (true) or will not (false) be downloaded. Secondary Program GUI: The secondary program starts with five user-specified values in the GUI. The top two options are used to specify the desired set. First is the cutoff value textbox, which takes in any positive decimal number, typed out or using scientific notation. This value is usually kept below one, as any value greater than or equal to one will return only one cluster containing all the pathways. An input of a negative decimal number or zero will default to 0.1 upon submission. Second is the minimum number of pathways textbox. This input accepts any positive integer. A zero or negative value will default to one upon submission. Though the heatmap and cluster analysis file from the “Hierarchical Clustering and Cluster Analysis Output” step only show certain combinations of variables, the textboxes can accept other values. The following three inputs are true or false dropdowns. The first is for whether to plot the dendrogram, a tree diagram showing how the pathways are clustered together. The next dictates whether to print the gene cluster memberships to the console. Gene cluster membership is output in the format of: Ensembl gene ID, number of clusters the gene is found in, and specific IDs of the clusters the gene is in.
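The gene cluster membership output described above could be assembled along the following lines. This is a hypothetical sketch: the function name, the input structure (cluster ID mapped to a set of gene IDs), and the placeholder gene identifiers are assumptions made for illustration and are not taken from the CoMAP source.

```python
from collections import defaultdict


def gene_cluster_membership(cluster_genes: dict) -> dict:
    """Map each gene ID to (number of clusters, sorted list of cluster IDs).

    `cluster_genes` is assumed to map a cluster ID to the set of Ensembl
    gene IDs found in that cluster's pathways.
    """
    membership = defaultdict(set)
    for cluster_id, genes in cluster_genes.items():
        for gene in genes:
            membership[gene].add(cluster_id)
    return {
        gene: (len(ids), sorted(ids))
        for gene, ids in sorted(membership.items())
    }


# Placeholder gene IDs for illustration only.
toy_clusters = {1: {"GENE_A", "GENE_B"}, 2: {"GENE_B", "GENE_C"}, 3: {"GENE_B"}}
table = gene_cluster_membership(toy_clusters)
for gene, (count, ids) in table.items():
    print(gene, count, ids)
```

Each printed row follows the order given in the text: gene ID, number of clusters containing the gene, and the IDs of those clusters.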
Selecting true on this option will reveal two additional input fields where the user will specify how many clusters a gene needs to be involved in to be printed to the terminal. A value of zero or less will automatically default to one upon submission. Lastly, the user can decide whether or not to download the set file.

Use Case
Ferroptosis is a non-apoptotic mode of cell death that relies on the metal iron [24-27]. This mechanism is driven by lipid peroxidation, where oxygen and polyunsaturated fatty acid lipids react with free intracellular iron or enzymes with iron cofactors to create lipid peroxides [24-27]. The build-up of these lipid peroxides and iron molecules is what becomes toxic to the cell and leads to cell death [24-26]. Thus, a surplus of iron can make a cell more susceptible to ferroptosis. Various biological pathways have been implicated in the regulation of ferroptosis, such as mitochondrial and energy production pathways and amino acid metabolism [24, 25]. In aerobic respiration, a sugar molecule is broken down through glycolysis and the citric acid cycle, producing the energy-carrying molecules NADH and FADH2. These molecules proceed to the electron transport chain (ETC), within the mitochondria, to donate electrons and synthesize ATP and water. Certain protein pumps in the ETC can generate reactive oxygen species (ROS) through the partial reduction of oxygen [25]. The accumulation of ROS in the cell leads to lipid peroxidation, making it evident that these highly reactive molecules are crucial for ferroptosis. Furthermore, cellular labile iron can catalyze the formation of ROS, resulting in oxidative stress, cellular harm, and potential ferroptosis [28]. Mitochondria work to control the levels of labile iron through the synthesis of iron-sulfur (Fe-S) clusters [25]. This process consumes iron molecules that could otherwise be used for ROS production [25].
Another cellular function that manages ferroptosis is the metabolism of amino acids. Once selenium, a micronutrient, is metabolized, it can be incorporated into the amino acid selenocysteine, which can be further used in selenoproteins [29]. Selenoproteins, such as GPX4, are used by cells to protect against oxidative damage, limit lipid peroxidation, and inhibit ferroptosis [29]. Ferroptosis has been reported to play a role in the development of multiple diseases, including cancer, neurological disorders, and cardiovascular diseases, and has also been linked to T-cell immunity [26, 27, 30]. Numerous studies have been completed on ferroptosis in cancerous cells. This non-apoptotic form of cell death has been successfully induced in pancreatic, colorectal, breast, lung, renal cell, and adrenocortical cancer cells, thereby killing tumorous cells and inhibiting the growth of tumors [26]. A whole new avenue for antitumor therapy has been opened with this discovery, showcasing the great potential ferroptosis has for cancer treatment. However, this mechanism can still result in cancer proliferation when ferroptotic damage triggers inflammation, creating a tumor-promoting environment [31]. A study by Wang et al. demonstrated that immunotherapy-activated CD8+ T cells promote lipid peroxidation in tumor cells, enhancing ferroptosis and playing a crucial role in anti-tumor immunotherapy [32]. The report also mentioned ferroptosis’s potential involvement in T cell immunity, stating that it is still unclear how exactly this cell death mechanism factors in [32]. Extensive research suggests that several neurodegenerative diseases involve the build-up of iron in localized areas of the central and peripheral nervous systems, as well as in brain tissue [26]. This build-up is frequently driven by the altered distribution of iron in the body, disrupting iron homeostasis, promoting ROS generation, and ultimately leading to severe oxidative damage and ferroptosis [26].
In short, ferroptosis, a relatively new and iron-dependent form of cell death, can contribute to the development of multiple diseases. Many questions, however, remain unanswered about this mechanism and its role in biology. Because a surplus of iron increases a cell’s susceptibility to ferroptosis, and because numerous biological pathways are influenced by ferroptotic behaviour, we chose iron (Fe) as our use case to demonstrate the functionality of our software. We first ran the main program using the following parameters: source organism of Homo sapiens, polymer entity type of protein, cofactor of Fe (II) OR Fe (III), and threshold q-value of 0.05. A total of 117 pathways were overrepresented, and they were clustered using a cutoff value of 0.25 and a minimum number of pathways of 1, resulting in 21 clusters. The complete cluster set is included as an additional file (Additional file 1). The “ferroptosis” and “cellular responses” clusters were of most interest to us. The former cluster is the exact mechanism we were looking to investigate with our use case, correctly identifying that this mode of cell death is dependent on iron-containing proteins. The latter concerns the cellular responses to chemical stress, such as ROS, and to external stimuli, such as the metal cofactor Fe. As mentioned previously, ferroptosis is linked with an accumulation of ROS and lipid peroxides; thus, pathways describing the response to these stressors are expected to be linked to the Fe (II) and Fe (III) cofactors, as our program shows. Further, we find a “renal cell carcinoma” cluster, a disease that is known to be influenced by ferroptosis. The “mitochondrial iron-sulfur cluster biogenesis” cluster is also worth exploring. By consuming Fe to produce Fe-S clusters, this pathway controls the iron levels in the cell, minimizing ROS production. It is then clear that a surplus of iron in the cell would drastically affect this function.
Noticing this, we ran the main program again for the iron/sulfur cluster cofactor with all other inputs remaining the same as above. Utilizing the same minimum number of pathways value of 1 and a cutoff of 0.45, 23 clusters were formed. The complete Fe-S cluster set is included as an additional file (Additional file 2). The “disease” cluster aligned closely with the different diseases ferroptosis may play a role in. The cluster included various neurodegenerative pathways (Alzheimer’s, Amyotrophic lateral sclerosis, Huntington’s, Prion, and Parkinson’s disease), diabetic cardiomyopathy, and non-alcoholic fatty liver disease. This cluster also included numerous pathways on the ETC and oxidative phosphorylation, pathways that have been found to generate ROS. Another cluster in this file contains the metabolism of amino acids and derivatives, as well as selenocysteine synthesis and selenoamino acid metabolism pathways, all of which are used in the protective function of cells against ferroptosis. Further, ferroptosis has been implicated in T cell differentiation, function, viability, and immunity [30, 33]. The NFAT signaling pathway is an important link between T cell receptor engagement and gene transcription regulation [34]. Activation of the signaling pathway results in the accumulation and activation of NFAT transcription factors [34]. The transcription factors regulate several T cell genes and influence the T cell responses that rid the body of tumors and infections [34]. The NFAT activation pathways are clustered in the Fe (III) or Fe (II) file, possibly explaining the role of ferroptosis in T cell immunity and the inhibition of tumor growth. The clusters mentioned so far make reasonable sense because we know how ferroptosis is linked to the pathways within them. There are, however, some clusters where this relationship is not as clear. For example, in the Fe-S file we see a cluster on nucleotide excision repair.
It is unclear how exactly ferroptosis may affect this function, making this an area of research scientists may want to explore in the future. Overall, the use case indicates that our program produces clusters of related pathways that agree with known biology and highlights potential new connections between them.

Discussion
Analyzing overrepresented pathways and gene-sets can be challenging when a substantial amount of information is output. Tools exist that attempt to make interpreting this data easier. One software package that shares some functionality with CoMAP is GeneSetCluster [35]. GeneSetCluster groups gene-sets by calculating a distance score based on the overlap of genes between two pathways [35]. Hierarchical clustering can then be applied to the gene-sets, allowing users to gain deeper insights into biological information [35]. This process is similar to the one we employ in CoMAP to group biological pathways. One difference to note is that GeneSetCluster is a code package in R [35]. This means that users need to not only develop code to accomplish their desired tasks but also retrieve the gene-sets they wish to analyze from other databanks. Our program provides improved functionality by only requiring a handful of inputs from the user to execute. Data retrieval, gene list creation, overrepresentation analyses, and clustering are performed automatically without requiring any code alterations from users. CoMAP is designed to be intuitive and accessible to non-bioinformaticians, featuring a well-developed code structure and a straightforward GUI. It is also highly efficient, typically completing tasks within minutes. Finally, this novel software surpasses the programs we know to exist in terms of functionality by supporting multiple cofactor and polymer selections and offering flexible file download options, making CoMAP valuable to a broad range of researchers. Because CoMAP can be applied to various areas of research, there are extensive uses for this program beyond the provided application.
By clustering pathways with similar genetic components, researchers can design ways to engineer biological systems utilizing specific genes. Pairing this with patient-specific gene expression data creates opportunities to advance personalized medicine. The clustering of disease and disorder pathways can highlight common genes across different conditions, not only expanding our knowledge of complex diseases but also showcasing potential drug targets common among these pathways. Overall, this research has many possible future directions.

Conclusion

Metalloproteins are essential in a wide variety of cellular functions. These proteins contain metal ions, which enable protein activity or act as structural sites to support protein folding and assembly. Found in numerous organisms, these metal-binding proteins are involved in biological pathways including, but not limited to, energy production, biological polymer synthesis, motility, skeletal growth, and immune system regulation, making them crucial for sustaining life on Earth. The concentrations of these cofactors, both metal and non-metal, must be kept in a careful balance. Dramatic shifts in the homeostasis of cofactors have been linked to neurodegenerative disorders, cardiovascular diseases, cancers, cell death, and many other harmful effects. However, because these ligands are involved in hundreds to thousands of biological pathways, it is extremely difficult to pinpoint which pathways are significant and how they group together into larger cellular functions. Our program provides a method for extracting the pathways in human, mouse, and yeast biology that are overrepresented with the selected cofactor(s) and clustering them according to similarity. The Cofactor Mapping and Analysis Program (CoMAP) is a new way to study cofactors in biology, expanding our understanding of cellular functions dependent on selected cofactors and possibly revealing new molecule-based relationships between pathways.
One of the key advantages of CoMAP is its ability to efficiently map key biological processes within an organism and to make the connections between them easy to see. To our knowledge, there are no existing resources that allow users to retrieve the pathways disproportionately represented with selected cofactors, organize these pathways into categories based on the similarity of their genes, and provide information on how closely clustered these pathways are. While other overrepresentation analysis and clustering software exists in languages such as Python and R, using these tools would require users to develop, debug, and execute code themselves. With CoMAP, these functionalities have already been implemented in the code. The user only needs to supply a few inputs through the GUI. Furthermore, the program accepts multiple polymer types and cofactor names in one run, together with the user’s preferred method of combining them, which allows diverse and unique searches to be constructed for use in numerous areas of research.

CoMAP makes use of three large and well-established databanks: the PDB, DAVID, and CPDB. These databanks contain a vast amount of information and are continually updated. As such, our program draws on current bioinformatics data, giving users as complete a data set as these sources allow.

The final advantage to consider is the option to download a variety of files from different steps in the code. The ability to select the PDB entries file, the unique Ensembl gene IDs file, and the filtered overrepresented pathways prior to clustering provides the user with additional data to use as needed in their research.

Despite its advantages, there are certain limitations inherent to this program. When querying multiple polymer types, searches can only use the “AND” or the “OR” operator. This is also true for the cofactor names.
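The effect of restricting each query to a single operator can be illustrated with a small sketch. It is not taken from the CoMAP source: the entry IDs, the `entries` mapping, and the helpers `query_all` and `query_any` are hypothetical and only model AND as a set intersection and OR as a set union. The last two expressions show how a mixed-operator input could be read in two ways that return different entries.

```python
# Hypothetical PDB-style entry IDs returned for three cofactors A, B and C.
# These are illustrative values only, not CoMAP output.
entries = {
    "A": {"1AAA", "2BBB", "3CCC"},
    "B": {"2BBB", "3CCC", "4DDD"},
    "C": {"3CCC", "5EEE"},
}


def query_all(names):
    """AND: entries containing every listed cofactor (set intersection)."""
    return set.intersection(*(entries[n] for n in names))


def query_any(names):
    """OR: entries containing at least one listed cofactor (set union)."""
    return set.union(*(entries[n] for n in names))


# A single operator applied to the whole list has exactly one meaning.
and_result = query_all(["A", "B", "C"])
or_result = query_any(["A", "B", "C"])

# Mixing operators, as in "A and B or C", allows two different readings.
reading_1 = (entries["A"] & entries["B"]) | entries["C"]  # (A and B) or C
reading_2 = entries["A"] & (entries["B"] | entries["C"])  # A and (B or C)

print("AND:", sorted(and_result))
print("OR:", sorted(or_result))
print("(A and B) or C:", sorted(reading_1))
print("A and (B or C):", sorted(reading_2))
```

With these example sets the two mixed readings differ, which is the kind of ambiguity a single-operator design avoids.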
Results are therefore limited to entries that contain either all of the input polymers/cofactors or at least one of them. If users, for example, combined multiple cofactors using both operators, there would be a high level of ambiguity regarding how the results were obtained. An input of “A and B or C” could be read as either “(A and B) or C” or “A and (B or C)”, making the search criteria difficult to determine from the initial input.

Constraints also arise from the availability of the data being used. Since the program relies on information from other sources, anything that is not included in the PDB, DAVID, CPDB, or one of the sources compiled into CPDB cannot be used in the program’s data analysis. This limits the available data, as protein structures or pathways may exist in other databanks that we do not access. Expanding on this, all the entries in the PDB have experimentally determined three-dimensional structures. A biological molecule that does not yet have an experimentally determined structure will not be incorporated into this databank and will, therefore, not be returned in a query using our program. Furthermore, the species that can be queried are restricted to those in the CPDB. Modifications to the existing code are needed to include species beyond Homo sapiens, Mus musculus, and Saccharomyces cerevisiae.

The final limitation to mention is that, because the external sources are accessed during program execution through APIs and automated web interactions, the program relies on these sources operating successfully. If one of these databanks is down for maintenance, has reached server capacity, or has encountered another error, CoMAP will not execute properly.

In this report, we detailed one potential application of our program using Fe (II), Fe (III), and iron/sulfur clusters as the selected cofactors.
We investigated the cellular functions and specific pathways that would be affected by a surplus of iron in the cell, and the results corresponded with existing experimental data. As there are numerous cofactor options, this program can explore the proteins, nucleic acids, and other biological molecules that depend on dozens of different ligands, highlighting the pathways that are expected to be affected by a shift in the homeostasis of those cofactors. This method for visualizing biological pathways can provide much-needed insights into diseases and illnesses, revealing previously unknown relationships between pathways. Clustering can lead to the discovery of commonalities across different conditions, pointing to new methods of treatment or drug targets. Furthermore, the similarities among pathways can provide information on how to bioengineer these cellular functions, advancing treatment options for a range of diseases. Personalized medicine may also benefit if biological functions can be altered to meet the specific needs of the patient. Overall, this research opens the door to numerous future possibilities.

Availability and Requirements

Project name: Cofactor Mapping & Analysis Program (CoMAP)
Project home page: https://github.com/tcraddock/CoMAP
Operating system(s): Platform independent.
Programming language: Python 3.10.
Other requirements: Run the program in the Google Colab environment with Python 3.10 or higher.
License: CoMAP is free to use for academics.
Any restrictions to use by non-academics: Commercial users should contact Dr. Travis Craddock at [email protected] .
Data used by CoMAP for academic use is available under the licence terms of each contributing databank (the Protein Data Bank, DAVID, and ConsensusPathDB).
Abbreviations

CoMAP: Cofactor Mapping & Analysis Program
NA-hybrid: Nucleic-acid hybrid
API: Application programming interface
PDB: The Protein Data Bank
CPDB: ConsensusPathDB
GUI: Graphical user interface
ETC: Electron transport chain
ROS: Reactive oxygen species

Declarations

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Availability of data and materials
All data analysed during this study are included in this published article and its supplementary information files.

Competing interests
The authors declare that they have no competing interests.

Funding
This research was undertaken thanks to funding from the Canada Research Chairs Program to Travis Craddock (CRC-2022-00204) and funding from the University of Waterloo.

Authors’ contributions
Research design, TJAC; program development, TJAC, IKB; data analysis, TJAC, IKB; writing (initial manuscript), IKB; writing (review and editing), TJAC, IKB. All authors read and approved the final manuscript.

Acknowledgements
The authors would like to thank Drs. Hadi Zadeh-Haghighi and Lea Gassab for helpful insights and advice.

References

1. Dupont CL, Yang S, Palenik B, Bourne PE: Modern proteomes contain putative imprints of ancient shifts in trace metal geochemistry. Proceedings of the National Academy of Sciences 2006, 103(47):17822-17827.
2. Li J, He X, Gao S, Liang Y, Qi Z, Xi Q, Zuo Y, Xing Y: The metal-binding protein atlas (MbPA): an integrated database for curating metalloproteins in all aspects. Journal of molecular biology 2023, 435(14):168117.
3. Shi W, Chance M: Metallomics and metalloproteomics. Cellular and Molecular Life Sciences 2008, 65:3040-3048.
4. Cantó C, Menzies KJ, Auwerx J: NAD+ metabolism and the control of energy homeostasis: a balancing act between mitochondria and the nucleus. Cell Metab 2015, 22(1):31-53.
5. Andreini C, Bertini I, Rosato A: Metalloproteomes: a bioinformatic approach. Acc Chem Res 2009, 42(10):1471-1479.
6. Dudev T, Lim C: Competition among metal ions for protein binding sites: determinants of metal ion selectivity in proteins. Chemical reviews 2014, 114(1):538-556.
7. Jomova K, Makova M, Alomar SY, Alwasel SH, Nepovimova E, Kuca K, Rhodes CJ, Valko M: Essential metals in health and disease. Chemico-biological interactions 2022, 367:110173.
8. Zoroddu MA, Aaseth J, Crisponi G, Medici S, Peana M, Nurchi VM: The essential metals for humans: a brief overview. J Inorg Biochem 2019, 195:120-129.
9. Jahnen-Dechent W, Ketteler M: Magnesium basics. Clinical kidney journal 2012, 5(Suppl_1):i3-i14.
10. Prohaska JR, Smith TL: Effect of dietary or genetic copper deficiency on brain catecholamines, trace metals and enzymes in mice and rats. The Journal of Nutrition 1982, 112(9):1706-1717.
11. Liu Yk, Xu H, Liu F, Tao R, Yin J: Effects of serum cobalt ion concentration on the liver, kidney and heart in mice. Orthopaedic Surgery 2010, 2(2):134-140.
12. Pereira M, Pereira M, Sousa J: Evaluation of nickel toxicity on liver, spleen, and kidney of mice after administration of high-dose metal ion. Journal of Biomedical Materials Research: An Official Journal of The Society for Biomaterials, The Japanese Society for Biomaterials, and the Australian Society for Biomaterials 1998, 40(1):40-47.
13. Chen Y, Li F, Mao J, Chen Y, Nielsen J: Yeast optimizes metal utilization based on metabolic network and enzyme kinetics. Proceedings of the National Academy of Sciences 2021, 118(12):e2020154118.
14. Cyert MS, Philpott CC: Regulation of cation balance in Saccharomyces cerevisiae. Genetics 2013, 193(3):677-713.
15. Wang Y, Weisenhorn E, MacDiarmid CW, Andreini C, Bucci M, Taggart J, Banci L, Russell J, Coon JJ, Eide DJ: The cellular economy of the Saccharomyces cerevisiae zinc proteome. Metallomics 2018, 10(12):1755-1776.
16. Google Colaboratory [https://colab.research.google.com/]
17. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The protein data bank. Nucleic acids research 2000, 28(1):235-242.
18. Huang DW, Sherman BT, Lempicki RA: Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc 2009, 4(1):44-57.
19. Huang DW, Sherman BT, Lempicki RA: Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic acids research 2009, 37(1):1-13.
20. Kamburov A, Herwig R: ConsensusPathDB 2022: molecular interactions update as a resource for network biology. Nucleic acids research 2022, 50(D1):D587-D595.
21. Kamburov A, Pentchev K, Galicka H, Wierling C, Lehrach H, Herwig R: ConsensusPathDB: toward a more complete picture of cell biology. Nucleic acids research 2011, 39(suppl 1):D712-D717.
22. Kamburov A, Stelzl U, Lehrach H, Herwig R: The ConsensusPathDB interaction database: 2013 update. Nucleic acids research 2013, 41(D1):D793-D800.
23. Kamburov A, Wierling C, Lehrach H, Herwig R: ConsensusPathDB—a database for integrating human functional interaction networks. Nucleic acids research 2009, 37(suppl 1):D623-D628.
24. Jiang X, Stockwell BR, Conrad M: Ferroptosis: mechanisms, biology and role in disease. Nature reviews Molecular cell biology 2021, 22(4):266-282.
25. Dixon SJ, Olzmann JA: The cell biology of ferroptosis. Nature reviews Molecular cell biology 2024, 25(6):424-442.
26. Li J, Cao F, Yin H-l, Huang Z-j, Lin Z-t, Mao N, Sun B, Wang G: Ferroptosis: past, present and future. Cell death & disease 2020, 11(2):88.
27. Yu Y, Yan Y, Niu F, Wang Y, Chen X, Su G, Liu Y, Zhao X, Qian L, Liu P: Ferroptosis: a cell death connecting oxidative stress, inflammation and cardiovascular diseases. Cell death discovery 2021, 7(1):193.
28. Cabantchik ZI: Labile iron in cells and body fluids: physiology, pathology, and pharmacology. Frontiers in pharmacology 2014, 5:45.
29. Shimada BK, Swanson S, Toh P, Seale LA: Metabolism of selenium, selenocysteine, and selenoproteins in ferroptosis in solid tumor cancers. Biomolecules 2022, 12(11):1581.
30. Xie Y, Hou W, Song X, Yu Y, Huang J, Sun X, Kang R, Tang D: Ferroptosis: process and function. Cell Death & Differentiation 2016, 23(3):369-379.
31. Chen X, Kang R, Kroemer G, Tang D: Broadening horizons: the role of ferroptosis in cancer. Nature reviews Clinical oncology 2021, 18(5):280-296.
32. Wang W, Green M, Choi JE, Gijón M, Kennedy PD, Johnson JK, Liao P, Lang X, Kryczek I, Sell A: CD8+ T cells regulate tumour ferroptosis during cancer immunotherapy. Nature 2019, 569(7755):270-274.
33. Xia X, Wu H, Chen Y, Peng H, Wang S: Ferroptosis of T cell in inflammation and tumour immunity. Clinical and Translational Medicine 2025, 15(3):e70253.
34. Wither MJ, White WL, Pendyala S, Leanza PJ, Fowler DM, Kueh HY: Antigen perception in T cells by long-term Erk and NFAT signaling dynamics. Proceedings of the National Academy of Sciences 2023, 120(52):e2308366120.
35. Ewing E, Planell-Picola N, Jagodic M, Gomez-Cabrero D: GeneSetCluster: a tool for summarizing and integrating gene-set analysis results. BMC Bioinformatics 2020, 21:1-7.

Additional Declarations

No competing interests reported.

Supplementary Files

Additional file 1.csv (Additionalfile1.csv): Fe (II) OR Fe (III) Detailed Cluster Set. Detailed cluster breakdown (including pathways in each cluster, cluster compactness scores, and cluster labels) of CoMAP results using: Homo sapiens, protein, Fe (II) OR Fe (III), 0.05. The minimum number of pathways value was 1 and the cutoff value was 0.25.

Additional file 2.csv (Additionalfile2.csv): Iron/Sulfur Cluster Detailed Cluster Set. Detailed cluster breakdown (including pathways in each cluster, cluster compactness scores, and cluster labels) of CoMAP results using: Homo sapiens, protein, iron/sulfur cluster, 0.05. The minimum number of pathways value was 1 and the cutoff value was 0.45.
© Research Square 2026 | ISSN 2693-5015 (online)
{"props":{"pageProps":{"initialData":{"identity":"rs-6605908","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"software","associatedPublications":[],"authors":[{"id":456197309,"identity":"36ef94ca-e619-4687-9014-bc8b1d41a1a9","order_by":0,"name":"Isabella K. Barichello","email":"","orcid":"","institution":"University of Waterloo","correspondingAuthor":false,"prefix":"","firstName":"Isabella","middleName":"K.","lastName":"Barichello","suffix":""},{"id":456197310,"identity":"bf6bea3c-bc89-44cd-9be7-7fa1d811528b","order_by":1,"name":"Travis J. A.
Craddock","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAABG0lEQVRIiWNgGAWjYLCCBAhlIAFkykHFLHCq5kHXYgxkMDYwMEjg18KApCWxgZAWe+nDzyQe/GGQN28/vPHGh5q09A03kp8/+LhHQo6fgfnhB2y28KWZSSS2MRjOOZNWbDnjWE7uhhtpho0znkkYSzawGWOzi4eHwdgA6JgECYYcM2nehorcDbcTDJt5DkgkbjjAg9V5PDzsnw0S/gC18L8Ba0k3uJ3+sfkPRAvzD6xaeAwfJLABtUiAbclJMLidY9jMANHChtWWMzyFDxLbJAxnSDwD+SXNcOb9N4Uzew4A/dLMZoYtdth72Dcc/PHHRl6CPxkUYsnyfGeOb/jw44CNHD978+MbOIMaeyww41Y/CkbBKBgFowA/AACKUGAQMqOhVwAAAABJRU5ErkJggg==","orcid":"","institution":"University of Waterloo","correspondingAuthor":true,"prefix":"","firstName":"Travis","middleName":"J. A.","lastName":"Craddock","suffix":""}],"badges":[],"createdAt":"2025-05-06 19:08:18","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-6605908/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-6605908/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":102902287,"identity":"702bb612-3d62-4825-879c-2b5c5ec0fa42","added_by":"auto","created_at":"2026-02-18 08:26:50","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":1640689,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-6605908/v1/56749b13-d95b-4aa3-97f9-b9df48eb3207.pdf"},{"id":82739452,"identity":"95ddb67c-9a06-4787-9b79-6a408d76696c","added_by":"auto","created_at":"2025-05-14 16:38:49","extension":"csv","order_by":1,"title":"","display":"","copyAsset":false,"role":"supplement","size":5082,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eSupplementary Information\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eAdditional file 1.csv\u003c/p\u003e\n\u003cp\u003eFe (II) OR Fe (III) Detailed Cluster Set\u003c/p\u003e\n\u003cp\u003eDetailed cluster breakdown—including pathways in each cluster, cluster compactness scores, and cluster 
labels—of CoMAP results using: \u003cem\u003eHomo sapiens\u003c/em\u003e, protein, Fe (II) OR Fe (III), 0.05. The minimum number of pathways value was 1 and the cutoff value was 0.25.\u003c/p\u003e","description":"","filename":"Additionalfile1.csv","url":"https://assets-eu.researchsquare.com/files/rs-6605908/v1/9387857aaad2834234d55bd9.csv"},{"id":82739453,"identity":"b756215f-ebd0-453d-a01f-89a88cfe9ac5","added_by":"auto","created_at":"2025-05-14 16:38:49","extension":"csv","order_by":2,"title":"","display":"","copyAsset":false,"role":"supplement","size":7670,"visible":true,"origin":"","legend":"\u003cp\u003eAdditional file 2.csv\u003c/p\u003e\n\u003cp\u003eIron/Sulfur Cluster Detailed Cluster Set\u003c/p\u003e\n\u003cp\u003eDetailed cluster breakdown—including pathways in each cluster, cluster compactness scores, and cluster labels—of CoMAP results using: \u003cem\u003eHomo sapiens\u003c/em\u003e, protein, iron/sulfur cluster, 0.05. The minimum number of pathways value was 1 and the cutoff value was 0.45.\u003c/p\u003e","description":"","filename":"Additionalfile2.csv","url":"https://assets-eu.researchsquare.com/files/rs-6605908/v1/a22baa88c80513c17a7f663d.csv"}],"financialInterests":"No competing interests reported.","formattedTitle":"CoMAP: A Program to Cluster Pathways Overrepresented with Specific Cofactors in Human, Mouse, and Yeast Biology","fulltext":[{"header":"Background","content":"\u003cp\u003eProtein cofactors, a non-protein chemical compound or metallic ion, play a fundamental role in sustaining life on Earth, serving as essential components in a wide array of biochemical and cellular processes across all domains of life [1]. The evolution of cofactor utilization in biological systems is thought to reflect the availability and functionality of these molecules in various environments [1]. Approximately one-third of all proteins depend on cofactors—ranging from metal ions to organic molecules—to carry out their proper biological functions [2]. 
These cofactors may serve catalytic, regulatory, structural, or transport-related roles within proteins. The study of protein-cofactor interactions, known broadly as cofactor biology or cofactor proteomics, includes understanding the uptake, processing, binding, and physiological roles of these molecules in cellular function [3].\u003c/p\u003e\n\u003cp\u003eCofactors are broadly categorized into two types: organic cofactors (such as vitamins, flavins, and nucleotide derivatives like FAD and NAD) and inorganic cofactors (commonly metal ions like magnesium, zinc, and iron) [4]. These molecules are widely conserved among organisms, although their specific usage and abundance can vary across species [5]. Cofactors are involved in a diverse set of biological functions, including enzyme catalysis, signal transduction, gene expression, cellular respiration and photosynthesis, structural stabilization of proteins and nucleic acids, and the metabolism of drugs and nutrients [2, 6, 7].\u003c/p\u003e\n\u003cp\u003eIn human biology, a range of cofactors—both metallic and non-metallic—are known to be essential. For example, NAD and FAD are key electron carriers in metabolic pathways [4], while metal ions such as Mg²⁺, Ca²⁺, Zn²⁺, and Fe²⁺\u0026nbsp;are indispensable for structural and enzymatic roles\u0026nbsp;[8]. Deficiencies in these cofactors can lead to widespread physiological dysfunction. For instance, magnesium is required for over 300 enzymatic reactions, including those involved in nerve function and energy production\u0026nbsp;[8, 9], while iron, as part of the heme cofactor in hemoglobin, is essential for oxygen transport\u0026nbsp;[8].\u003c/p\u003e\n\u003cp\u003eAnimal models such as mice have shown similar dependency on both metal and organic cofactors for cellular and systemic function. Disruption in cofactor availability has been linked to metabolic impairments, developmental issues, and organ toxicity [6, 10-12]. 
Similarly, in microorganisms like yeast, cofactors such as zinc, magnesium, iron, and flavins are critical for enzyme activity, membrane potential, and stress responses [13, 14]. For instance, the yeast proteome contains hundreds of zinc-binding proteins, and deficiencies in zinc or magnesium led to impaired growth and cellular stress [13, 15].\u003c/p\u003e\n\u003cp\u003eImportantly, dysregulation of cofactor homeostasis—whether metal-based or organic—has been implicated in a wide range of diseases, including neurodegenerative disorders (such as Alzheimer's and Parkinson's), cardiovascular diseases, cancer, and psychiatric conditions [7]. The precise regulation of cofactor concentrations within the cell is therefore a key determinant of health and resilience against disease. Maintaining this balance involves complex cellular networks that tightly control the synthesis, uptake, transport, and recycling of both metal ions and organic cofactors.\u003c/p\u003e\n\u003cp\u003eThe issue is that with the vast number of molecules playing a role in biological pathways, it is difficult to narrow down only the pathways in an organism in which a specific cofactor is disproportionately involved. To our knowledge, existing resources fail to provide a method to select cofactors of interest, identify the significant pathways involving them, and provide several cluster combinations, alluding to the biological processes of an organism that are most likely to be influenced by these cofactors. Cofactor Mapping \u0026amp; Analysis Program (CoMAP) has these functionalities. This simple and user-friendly resource takes in the cofactors and organism of interest and clusters the overrepresented pathways based on their protein coding genes, which encode for the cofactor-dependant molecules. 
CoMAP is designed as a versatile tool for mapping relevant pathways within an organism and enabling the straightforward visualization of their relationships, offering a novel approach to exploring cofactor biology. As mentioned, prior, cofactor levels must be carefully maintained lest detrimental effects occur. This program makes it possible to see both the broader biological functions and specific pathways in an organism that would be affected by a surplus or deficiency of certain cofactors, bettering our understanding of cofactor-dependent polymers.\u003c/p\u003e\n\u003ch2\u003eImplementation\u0026nbsp;\u003c/h2\u003e\n\u003cp\u003eCoMAP was created in-house using Python 3.10 and was both developed and run in the Google Colaboratory environment [16]. The large language model ChatGPT was utilized as a resource to assist with source code development on occasion and served as a tool for troubleshooting when necessary. The source code can be accessed at https://github.com/tcraddock/CoMAP.\u003c/p\u003e\n\u003cp\u003eThe execution of CoMAP can be summarized as follows.\u003c/p\u003e\n\u003ch3\u003eMain Program\u003c/h3\u003e\n\u003cp\u003e\u003cstrong\u003eCuration of Protein List using the Protein Data Bank:\u003c/strong\u003e Using the rcsb-api package in Python, a specific query is constructed consisting of a source organism, polymer entity type(s), and cofactor name(s). Polymer entity types include protein, DNA, RNA, nucleic acid-hybrid (NA-hybrid), and other. The application programming interface (API) combs through the entries in the Protein Data Bank (PDB), saving the results that meet all user-specified criteria. Custom reports for each saved result are then obtained from the PDB. The reports include the entry ID, PDB ID, gene name, macromolecule name, and accession code(s) of the polymers that comprise each molecule [17]. Returning this exact report format was achieved through using the corresponding GraphQL query provided by PDB. 
A total of 5,000 entries can be processed in a single batch. If more than 5,000 entries are returned the batches are combined into one.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConversion of Accession Codes to Ensembl IDs:\u003c/strong\u003e A unique list of accession codes is created from the returned PDB results. These codes are converted using the DAVID gene ID conversion tool [18, 19]. Selenium WebDriver, an automation tool that simulates a person interacting with a website, uploads and submits the list to the conversion tool [18, 19]. Prior to submitting the list, Selenium WebDriver selects both the desired ID type—Ensembl gene ID in this case—and the source organism. To avoid any performance issues or errors when submitting large lists, accession codes are uploaded in batches of 500. If more than one batch is needed, the converted results are combined into one at the completion of this step.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCreation of Unique Ensembl Gene List:\u003c/strong\u003e The PDB file from the first step is updated to include the corresponding Ensembl gene ID(s) for each accession code. A list of the unique Ensembl gene IDs is generated. Both the PDB file updated with Ensembl gene IDs, and the unique gene list can be selected for download.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eOverrepresentation Analysis\u003c/strong\u003e: Selenium WebDriver is used to navigate to the correct ConsensusPathDB (CPDB) biological pathways databank for the user-specified source organism. The biological pathways databank is downloaded to a temporary folder with the pathway genes being identified by Ensembl ID [20-23]. An overrepresentation analysis is then completed on these pathways using a variety of imported Python functions. Due to “Service Unavailable” errors encountered when working with CPDB’s SOAP/WSDL interface, scripting the overrepresentation analysis proved the most suitable solution. 
To calculate the p-value for each pathway, the hypergeom function is imported from the scipy.stats module and used in conjunction with the survival function. The p-value describes how likely a pathway is to be overrepresented due to chance rather than being a real biological effect. A smaller p-value indicates the result is less likely to be due to chance. The hypergeom function takes in four variables: the total number of genes in the specific pathway being analyzed (n), the number of overlapping genes between the specific pathway and the unique gene list (x), the total background size (N), and the number of mapped entities (M). N and M are pulled from CPDB using the Selenium WebDriver tool. In this program, N represents the total number of genes that are present in at least one CPDB pathway and identifiable by Ensembl ID [20-23]. This value is dependent on the source organism. M means the number of genes from the unique gene list that are present in at least one CPDB pathway and varies for each unique gene list [20-23]. To calculate the q-values, the multipletests function is imported from the statsmodels.stats.multitest module and the Bonferroni method is used. The q-value adjusts the p-values for multiple testing. In this overrepresentation analysis, multiple gene sets are tested, increasing the rate of false positives. The q-value accounts for this, determining if a result is still significant after correcting for multiple tests. The Bonferroni method was selected to calculate the q-values due to its highly conservative nature, reducing false positives. The pathways are then filtered depending on their q-values. Any pathway with a q-value less than or equal to the user input value is retained, the others are discarded. The list of filtered overrepresented pathways can be selected for download. 
This file provides information on the pathway name, the source the pathway was retrieved from, the identifier used in said source to identify the pathway, the Ensembl IDs of the genes from the unique gene list that are found in the pathway, the percentage of genes in the pathway that are found in the unique gene list, and the p- and q- values of the pathway.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eHierarchical Clustering and Cluster Analysis Output\u003c/strong\u003e: The filtered overrepresented pathways are clustered to account for similarities between pathways from the multiple sources compiled into CPDB. To assess the similarities between different biological pathways, agglomerative hierarchical clustering is done. In this program, the similarity between two pathways is defined as the ratio of the number of genes shared by both pathways to the total number of unique genes when the gene sets of both pathways are combined. For example, if pathway A has 10 genes, pathway B has 15 genes, and 5 genes are present in both pathway A and B, the similarity of the two pathways would be calculated as 5/20=0.25. A similarity matrix is created using the calculated similarity for every pair of overrepresented pathways. To ensure the proper matrix shape, the squareform function is used and imported from the scipy.spatial.distance module. The similarity matrix is then converted to a distance matrix by subtracting each entry from the number one, so that a smaller number signifies a greater relation between pathways. The hierarchical clustering is performed with the linkage function using the average method. The fcluster function assigns the pathways a unique cluster label from the given distance matrix. Both functions are imported from the scipy.cluster.hierarchy module. There are two variables of importance to note. First, the cutoff value which is the maximum allowed percentage of dissimilarity between two pathways. 
A cutoff value of 0.2 means that the pathways in each cluster can be at most 20% different. The second is the minimum number of pathways, which dictates how many pathways must be grouped together to be considered a cluster. A heatmap covering every combination of these two variables is generated and output to the terminal, providing a simple visualization of how many clusters each combination produces. The cutoff value ranges from 0.05 to 0.95 in 0.05 increments and the minimum number of pathways ranges from 1 to 8 in whole number increments. For simplicity, each combination of cutoff value and minimum number of pathways will be termed a set. At the completion of the program, a final cluster analysis file can be downloaded. This file contains information on each set. Specifically, it details the number of clusters, percentage of pathways expressed, percentage of genes expressed, average compactness, and word frequency of each set. The compactness of a cluster is the sum of the similarities (taken from the similarity matrix) for every unique pair of pathways in the cluster, divided by the number of pairs. For example, if a cluster consists of pathways A, B, and C, the cluster compactness will be the average of the similarities between pathways A and B, A and C, and B and C. The average compactness of a set is then the average of the compactness for every cluster in that set. Unique pairs are found using the combinations function from the itertools module. Finally, the word frequency entry for a set lists the most frequent word(s) in the pathway names of each cluster. The program excludes common filler words and other terms deemed unhelpful, including human, mouse, budding yeast, and their scientific names. Words from the same cluster are joined by commas and words from different clusters are separated by semicolons. 
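The similarity, clustering, and compactness calculations described in this step can be sketched as follows. This is an illustrative sketch rather than the CoMAP code: the function names, the toy pathways, the use of the "distance" criterion in fcluster, and the choice to report None for single-pathway clusters (which have no pairs) are all assumptions.

```python
from itertools import combinations

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def jaccard(a, b):
    """Shared genes divided by the unique genes in the combined gene sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_pathways(pathways, cutoff, min_pathways=1):
    """Hypothetical helper: cluster label -> (pathway names, compactness)."""
    names = list(pathways)
    n = len(names)
    sim = np.eye(n)  # each pathway is fully similar to itself
    for i, j in combinations(range(n), 2):
        sim[i, j] = sim[j, i] = jaccard(pathways[names[i]], pathways[names[j]])
    if n < 2:
        labels = [1] * n  # nothing to cluster
    else:
        condensed = squareform(1.0 - sim)  # distance = 1 - similarity
        tree = linkage(condensed, method="average")
        # Assumption: the cutoff is applied as a distance threshold.
        labels = fcluster(tree, t=cutoff, criterion="distance")
    members = {}
    for idx, label in enumerate(labels):
        members.setdefault(int(label), []).append(idx)
    clusters = {}
    for label, idxs in members.items():
        if len(idxs) < min_pathways:
            continue  # too few pathways to count as a cluster
        pairs = list(combinations(idxs, 2))
        # Compactness: mean similarity over every unique pair in the cluster.
        compactness = (
            float(np.mean([sim[i, j] for i, j in pairs])) if pairs else None
        )
        clusters[label] = ([names[i] for i in idxs], compactness)
    return clusters


# Toy pathways (invented): P1 and P2 share 4 of 5 unique genes, P3 shares none.
toy = {"P1": {1, 2, 3, 4}, "P2": {1, 2, 3, 4, 5}, "P3": {10, 11, 12}}
for label, (members, compactness) in cluster_pathways(toy, cutoff=0.25).items():
    print(label, members, compactness)
```

Calling the helper once per combination of cutoff value and minimum number of pathways would give the cluster counts needed for a heatmap of sets.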
Hyphenated words are treated as a single word.\u003c/p\u003e\n\u003ch3\u003eSecondary Program\u003c/h3\u003e\n\u003cp\u003e\u003cstrong\u003eSpecific Set Download:\u0026nbsp;\u003c/strong\u003eAfter executing the main program and reviewing the cluster analysis file, specific sets of the user’s choice can be downloaded. The downloaded file has four columns. The first is the cluster ID, the numeric value given to identify that cluster. This column does not provide helpful information on the pathways but is essential for the user to pair clusters with their corresponding genes. More information on gene cluster membership can be found in the “Graphical User Interface (GUI)” subsection. The second column is the list of pathways within the cluster. The third column is the cluster compactness score. This value is calculated as described in the “Hierarchical Clustering and Cluster Analysis Output” step. Finally, the fourth column is the cluster label. This is the most frequently appearing word(s) in the pathway names of each cluster. Clusters are determined through hierarchical clustering, following the same procedure as described in the “Hierarchical Clustering and Cluster Analysis Output” step. However, as opposed to iterating through every set, this function only develops clusters for the exact cutoff value and minimum number of pathways value entered by the user. This function can be run multiple times after the completion of the first program, so the user can download all the sets they are interested in.\u003c/p\u003e"},{"header":"Results and Discussion","content":"\u003cp\u003eGraphical User Interface (GUI)\u003c/p\u003e \u003cp\u003eThere are two graphical user interfaces, one for each program, that open at the beginning of each program\u0026rsquo;s execution. 
In the Google Colaboratory environment [\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e], the GUIs will be found at the bottom of the code block in the output terminal.\u003c/p\u003e \u003cp\u003e \u003cb\u003eMain Program GUI\u003c/b\u003e: The main program has several user-specified values in the GUI. The first option is a dropdown to select the source organism, the organism from which the protein or other molecule originally came. The choices are \u003cem\u003eHomo sapiens\u003c/em\u003e (human), \u003cem\u003eMus musculus\u003c/em\u003e (mouse), and \u003cem\u003eSaccharomyces cerevisiae\u003c/em\u003e (yeast), the options being restricted to the available CPDB databanks. Next is the polymer entity type selection. A default polymer entity already exists and cannot be removed. The dropdown is initially set to \u0026ldquo;protein\u0026rdquo; but can be changed to any of the following polymer types offered in the PDB: protein, DNA, RNA, NA-hybrid, and other. A green \u0026ldquo;Add Polymer Entity\u0026rdquo; button and a red \u0026ldquo;Remove Polymer Entity\u0026rdquo; button allow the user to submit between 1 and 5 polymer types. The grey \u0026ldquo;AND/OR Button\u0026rdquo; determines how multiple polymers will be queried. An \u0026ldquo;AND\u0026rdquo; selection means that the macromolecule must contain all the selected polymers in the structure. An \u0026ldquo;OR\u0026rdquo; selection means that the macromolecule must contain at least one of the specified polymers. The \u0026ldquo;Remove Polymer Entity\u0026rdquo; and \u0026ldquo;AND/OR Button\u0026rdquo; buttons are disabled to start. They are automatically enabled when a new polymer is added. The \u0026ldquo;Add Polymer Entity\u0026rdquo; button becomes disabled once 5 polymer entries exist. The cofactor name selection follows. At least one cofactor must be submitted when running the code. 
As such, one cofactor entry already exists and cannot be removed. The cofactor name can be input using either the dropdown option, which includes a handful of pre-selected metal and non-metal cofactors from the PDB, or the custom option. The custom option reveals a textbox where a cofactor of the user\u0026rsquo;s choosing can be entered. The cofactor text must be entered exactly as it appears in the PDB to search the PDB properly. Additionally, there is a green \u0026ldquo;Add Cofactor\u0026rdquo; button and a red \u0026ldquo;Remove Cofactor\u0026rdquo; button, allowing the user to add and remove as many cofactors as desired. The grey \u0026ldquo;AND/OR Button\u0026rdquo; acts the same as for the polymer entities. An \u0026ldquo;AND\u0026rdquo; selection means that the molecules must contain all the selected cofactors. An \u0026ldquo;OR\u0026rdquo; selection means that the molecules must contain at least one of the specified cofactors. The \u0026ldquo;Remove Cofactor\u0026rdquo; and \u0026ldquo;AND/OR Button\u0026rdquo; buttons are also disabled to begin and will be enabled once a second cofactor is added. Next, a threshold q-value is entered into a textbox. This value is the user\u0026rsquo;s restriction for whether an overrepresented pathway is still significant after adjusting for multiple tests. Typically, a value of 0.05 (5%) or less is used. Values can be entered by either typing out the decimal number or using scientific notation (E or e). A zero or negative value will default to 0.05 upon submission. The last four dropdowns are true or false. They each dictate whether a different file from some step in the code will (true) or will not (false) be downloaded.\u003c/p\u003e \u003cp\u003e \u003cb\u003eSecondary Program GUI\u003c/b\u003e: The secondary program starts with five user-specified values in the GUI. The top two options are used to specify the desired set. 
First is the cutoff value textbox, which takes in any positive decimal number, typed out or using scientific notation. This value is usually kept below one, as any value greater than or equal to one will return only one cluster containing all the pathways. An input of a negative decimal number or zero will default to 0.1 upon submission. Second is the minimum number of pathways textbox. This input accepts any positive integer. A zero or negative value will default to one upon submission. Though the heatmap and cluster analysis file from the \u0026ldquo;Hierarchical Clustering and Cluster Analysis Output\u0026rdquo; step only show certain combinations of variables, the textboxes can accept other values. The following three inputs are true or false dropdowns. The first is for whether to plot the dendrogram, a tree diagram that shows how the pathways are clustered together. The next dictates whether to print the gene cluster memberships to the console. Gene cluster membership is output in the following format: Ensembl gene ID, number of clusters the gene is found in, and specific IDs of the clusters the gene is in. Selecting true on this option will reveal two additional input fields where the user will specify how many clusters a gene needs to be involved in to be printed to the terminal. A value of zero or less will automatically default to one upon submission. Lastly, the user can decide to download or not download the set file.\u003c/p\u003e \u003cp\u003eUse Case\u003c/p\u003e \u003cp\u003eFerroptosis is a non-apoptotic mode of cell death that relies on the metal iron [\u003cspan additionalcitationids=\"CR25 CR26\" citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e]. 
This mechanism is driven by lipid peroxidation, where oxygen and polyunsaturated fatty acid lipids react with free intracellular iron or enzymes with iron cofactors to create lipid peroxides [\u003cspan additionalcitationids=\"CR25 CR26\" citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e]. The build-up of these lipid peroxides and iron molecules is what becomes toxic to the cell and leads to cell death [\u003cspan additionalcitationids=\"CR25\" citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e]. Thus, a surplus of iron can make a cell more susceptible to ferroptosis. Various biological pathways have been implicated in the regulation of ferroptosis, such as mitochondrial and energy production pathways and amino acid metabolism [\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e, \u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e]. In aerobic respiration, a sugar molecule is broken down through glycolysis and the citric acid cycle, producing the energy-carrying molecules NADH and FADH2. These molecules proceed to the electron transport chain (ETC), within the mitochondria, to donate electrons and synthesize ATP and water. Certain protein pumps in the ETC can generate reactive oxygen species (ROS) through the partial reduction of oxygen [\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e]. The accumulation of ROS in the cell leads to lipid peroxidation, making it evident that these highly reactive molecules are crucial for ferroptosis. Furthermore, cellular labile iron can catalyze the formation of ROS, resulting in oxidative stress, cellular harm, and potential ferroptosis [\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e]. 
Mitochondria work to control the levels of labile iron through the synthesis of iron-sulfur (Fe-S) clusters [\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e]. This process consumes iron molecules that could otherwise be used for ROS production [\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e]. Another cellular function that manages ferroptosis is the metabolism of amino acids. Once selenium, a micronutrient, is metabolized, it can be incorporated into the amino acid selenocysteine which can be further used in selenoproteins [\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e]. Selenoproteins, such as GPX4, are used by cells to protect them against oxidative damage, limit lipid peroxidation, and inhibit ferroptosis [\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eFerroptosis has been reported to play a role in the development of multiple diseases, including cancer, neurological disorders, and cardiovascular diseases, and has also been linked to T-cell immunity [\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e, \u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e, \u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e]. Numerous studies have been completed on ferroptosis in cancerous cells. This non-apoptotic form of cell death has been successfully induced in pancreatic, colorectal, breast, lung, renal cell, and adrenocortical cancer cells, thereby killing tumorous cells and inhibiting the growth of tumors [\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e]. A whole new avenue for antitumor therapy has been opened with this discovery, showcasing the great potential ferroptosis has for cancer treatment. 
This mechanism can still result in cancer proliferation when ferroptotic damage triggers inflammation, creating a tumor-promoting environment [\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e]. A study by Wang et al. demonstrated that immunotherapy-activated CD8\u0026thinsp;+\u0026thinsp;T cells promote lipid peroxidation in tumor cells, enhancing ferroptosis and playing a crucial role in anti-tumor immunotherapy [\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e]. The report also mentioned ferroptosis\u0026rsquo;s potential involvement in T cell immunity, stating that it is still unclear how exactly this cell death mechanism factors in [\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e]. Extensive research suggests that several neurodegenerative diseases involve the build-up of iron in localized areas of the central and peripheral nervous systems, as well as in brain tissue [\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e]. This build-up is frequently driven by the altered distribution of iron in the body, disrupting iron homeostasis, promoting ROS generation, and ultimately leading to severe oxidative damage and ferroptosis [\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e]. In short, ferroptosis, a relatively new and iron-dependent form of cell death, can contribute to the development of multiple diseases. Many questions, however, remain unanswered about this mechanism and its role in biology.\u003c/p\u003e \u003cp\u003eSeeing how a surplus of iron increases a cell\u0026rsquo;s susceptibility to ferroptosis, and the numerous biological pathways influenced by ferroptotic behaviour, we chose to use iron (Fe) as our use case to demonstrate the functionality of our software. 
We first ran the main program using the following parameters: source organism of \u003cem\u003eHomo sapiens\u003c/em\u003e, polymer entity type of protein, cofactor of Fe (II) OR Fe (III), and threshold q-value of 0.05. A total of 117 pathways were overrepresented, and they were clustered using a cutoff value of 0.25 and minimum number of pathways of 1, resulting in 21 clusters. The complete cluster set is included as an additional file (Additional file 1). The \u0026ldquo;ferroptosis\u0026rdquo; and \u0026ldquo;cellular responses\u0026rdquo; clusters were of most interest to us. The former cluster is the exact mechanism we were looking to investigate with our use case, correctly identifying that this mode of cell death is dependent on iron-containing proteins. The latter is about the cellular responses to chemical stress, such as ROS, and external stimuli, such as the metal cofactor of Fe. As mentioned previously, ferroptosis is linked with an accumulation of ROS and lipid peroxides; thus, pathways on how to respond to these stressors are expected to be linked to the Fe (II) and Fe (III) cofactors, as we see from our program. Further, we find a \u0026ldquo;renal cell carcinoma\u0026rdquo; cluster, a disease that we know is influenced by ferroptosis. The \u0026ldquo;mitochondrial iron-sulfur cluster biogenesis\u0026rdquo; cluster is also worth exploring. By consuming Fe to produce Fe-S clusters, this pathway controls the iron levels in the cell, minimizing ROS production. A surplus of iron in the cell would therefore be expected to affect this function. Noticing this, we ran the main program again for the iron/sulfur cluster cofactor with all other inputs remaining the same as above. Utilizing the same minimum number of pathways value of 1 and a cutoff of 0.45, 23 clusters were formed. The complete Fe-S cluster set is included as an additional file (Additional file 2). 
The \u0026ldquo;disease\u0026rdquo; cluster aligned closely with the different diseases ferroptosis may play a role in. The cluster included various neurodegenerative pathways (Alzheimer\u0026rsquo;s, Amyotrophic lateral sclerosis, Huntington\u0026rsquo;s, Prion, and Parkinson\u0026rsquo;s disease), diabetic cardiomyopathy, and non-alcoholic fatty liver disease. This cluster also included numerous pathways on the ETC and oxidative phosphorylation, pathways that have been found to generate ROS. Another cluster in this file contains the metabolism of amino acids and derivatives, as well as selenocysteine synthesis and selenoamino acid metabolism pathways, all of which are used in the protective function of cells against ferroptosis. Further, ferroptosis has been implicated in T cell differentiation, function, viability, and immunity [\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e, \u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e33\u003c/span\u003e]. The NFAT signaling pathway is an important link between T cell receptor engagement and gene transcription regulation [\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e]. Activation of the signaling pathway results in the accumulation and activation of NFAT transcription factors [\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e]. The transcription factors regulate several T cell genes and influence the T cell responses that rid the body of tumors and infections [\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e]. The NFAT activation pathways are clustered in the Fe (III) or Fe (II) file, possibly explaining the role of ferroptosis in T cell immunity and the inhibition of tumor growth. The clusters mentioned up till now make reasonable sense because we know how ferroptosis is linked to the pathways within them. There are, however, some clusters where this relationship is not as clear. 
For example, in the Fe-S file we see a cluster on nucleotide excision repair. It is unclear how exactly ferroptosis may affect this function, making this an area of research scientists may want to explore in the future. It is evident that our program displays accurate clusters containing related pathways and highlights potential new connections between them.\u003c/p\u003e"},{"header":"Discussion","content":"\u003cp\u003eAnalyzing overrepresented pathways and gene-sets can be challenging when a substantial amount of information is output. Tools exist that attempt to make interpreting this data easier. One software that shares some functionality with CoMAP is GeneSetCluster [\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e]. GeneSetCluster groups gene-sets by calculating a distance score based on the overlap of genes between two pathways [\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e]. Hierarchical clustering can then be applied to the gene-sets, allowing users to gain deeper insights into biological information [\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e]. This process is like the one we employ in CoMAP to group biological pathways. One difference to note is that GeneSetCluster is a code package in R [\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e]. This means that users need to not only develop code to accomplish their desired tasks but also retrieve the gene-sets they wish to analyze from other databanks. Our program provides improved functionality by only requiring a handful of inputs from the user to execute. Data retrieval, gene list creation, overrepresentation analyses, and clustering are performed automatically without requiring any code alterations from users. CoMAP is designed to be intuitive and accessible to non-bioinformaticians, featuring a well-developed code structure and a straightforward GUI. 
It is also highly efficient, typically completing tasks within minutes. Finally, this novel software surpasses the programs we know to exist in terms of functionality by supporting multiple cofactor and polymer selections and offering flexible file download options, making CoMAP valuable to a broad range of researchers.\u003c/p\u003e \u003cp\u003eWith CoMAP\u0026rsquo;s application to various areas of research, there are extensive uses for this program beyond the provided application. By clustering pathways with similar genetic components, researchers can design ways to engineer biological systems utilizing specific genes. Pairing this with patient-specific gene expressions gives the opportunity for advancement in personalized medicine. The clustering of disease and disorder pathways can highlight common genes across different conditions, not only expanding our knowledge of complex diseases but showcasing potential drug targets common among these pathways. All in all, there are countless future directions for this research to go.\u003c/p\u003e"},{"header":"Conclusion","content":"\u003cp\u003eMetalloproteins are essential in a wide variety of cellular functions. These proteins consist of metal ions, which enable protein activity or act as structural sites to support protein folding and assembly. Found in numerous organisms, these metal-binding proteins are involved in biological pathways including, but not limited to, energy production, biological polymer synthesis, motility, skeletal growth, and immune system regulation, making them crucial for sustaining life on Earth. The concentrations of these cofactors, both metal and not, must be kept in a careful balance. Dramatic shifts in the homeostasis of cofactors have been linked to neurodegenerative disorders, cardiovascular diseases, cancers, cell death, and countless other harmful effects. 
However, because these ligands are involved in hundreds to thousands of biological pathways, it is extremely difficult to pinpoint which pathways are significant and how they group together into larger cellular functions. Our program provides a method for extracting the pathways in human, mouse, and yeast biology that are overrepresented with cofactor(s) and clusters them according to similarity. The Cofactor Mapping and Analysis Program (CoMAP) is a new way to study cofactors in biology, expanding our understanding of cellular functions dependent on selected cofactors and possibly revealing new molecule-based relationships between pathways.\u003c/p\u003e \u003cp\u003eOne of the key advantages of CoMAP is its ability to efficiently map key biological processes within an organism and enable the intuitive formation of connections between them. To our knowledge, there are no existing resources that allow users to retrieve the pathways disproportionately represented with selected cofactors, organize these pathways into categories based on similarity of genes, and provide information on how closely clustered these pathways are. While other overrepresentation analysis and clustering software exist in languages such as Python and R, using these tools would require the user to develop, debug, and execute code themselves. With CoMAP, these functionalities have already been implemented in the code. The user only needs to make selections in the GUI. Furthermore, the program's ability to take in multiple polymer types and cofactor names in one run, along with the user's preferred method of querying them, allows diverse and unique searches to be constructed for use in numerous areas of research. CoMAP makes use of three large and well-established databanks: the PDB, DAVID, and CPDB. These databanks contain a vast amount of information and are continually updated. 
As such, our program makes use of the latest information in bioinformatics, providing the most complete set of data possible for users. The final advantage to consider is the choice provided in the program to download a variety of different files from different steps in the code. The ability to select the PDB entries file, the unique Ensembl gene IDs file, and the filtered overrepresented pathways prior to clustering provides the user with additional data to use as needed in their research.\u003c/p\u003e \u003cp\u003eDespite its advantages, there are certain limitations that are inherent to this program. When querying multiple polymer types, searches can only use the \u0026ldquo;AND\u0026rdquo; or the \u0026ldquo;OR\u0026rdquo; operator. This is also true for the cofactor names. This means results are limited to entries that either have all the input polymers/cofactors, or at least one of them. If users, for example, combined multiple cofactors using both operators, there would be a high level of ambiguity regarding how the results were obtained. An input of \u0026ldquo;A and B or C\u0026rdquo; could mean either \u0026ldquo;(A and B) or C\u0026rdquo; or \u0026ldquo;A and (B or C)\u0026rdquo;, making the search criteria difficult to determine from the initial input. Constraints also arise from the availability of the data being used. Since the program relies on information from other sources, if something is not included in the PDB, DAVID, CPDB, or one of the sources compiled into CPDB, it cannot and will not be used in the program\u0026rsquo;s data analysis. This limits the availability of data, as protein structures or pathways may exist in other databanks that we do not access. Expanding on this, all the entries in the PDB have experimentally determined three-dimensional structures. 
A biological molecule that does not yet have an experimentally determined structure will not be incorporated into this databank and will, therefore, not be returned in a query using our program. Furthermore, species to query are restricted to those in the CPDB. Modifications to the existing code are needed to include species beyond \u003cem\u003eHomo sapiens, Mus musculus\u003c/em\u003e, and \u003cem\u003eSaccharomyces cerevisiae\u003c/em\u003e. The final limitation to mention is that because the external sources are accessed during program execution, through APIs and automated web interactions, the execution of the program relies on these sources operating successfully. If one of these databanks is down for maintenance, has reached server capacity, or has encountered another error, CoMAP will not execute properly.\u003c/p\u003e \u003cp\u003eIn this report, we detailed one potential application of our program using Fe (II), Fe (III), and iron/sulfur clusters as the selected cofactors. We investigated the different cellular functions and specific pathways that would be affected by a surplus of iron in the cell, and the results corresponded with existing experimental data. As there are numerous cofactor options, this program can explore the proteins, nucleic acids, and other biological molecules that depend on dozens of different ligands, highlighting the exact places and pathways that are expected to be affected by a shift in homeostasis of said cofactors. This novel method for visualizing biological pathways can provide much-needed insights into diseases and illnesses, revealing previously unknown relationships between pathways. Clustering can lead to the discovery of commonalities across different conditions, unveiling new methods of treatment or drug targets. 
Furthermore, the similarities among pathways can provide information as to how to bioengineer these cellular functions, advancing treatment options for a whole host of diseases. A betterment in personalized medicine may even be seen if biological functions can be altered to meet the specific needs of the patient. Overall, this research opens the door to numerous exciting future possibilities.\u003c/p\u003e \u003cp\u003eAvailability and Requirements\u003c/p\u003e \u003cp\u003eProject name: Cofactor Mapping \u0026amp; Analysis Program (CoMAP)\u003c/p\u003e \u003cp\u003eProject home page: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://github.com/tcraddock/CoMAP\u003c/span\u003e\u003cspan address=\"https://github.com/tcraddock/CoMAP\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e \u003cp\u003eOperating system(s): Platform independent.\u003c/p\u003e \u003cp\u003eProgramming language: Python 3.10.\u003c/p\u003e \u003cp\u003eOther requirements: Run program in Google Colab environment with Python 3.10 or higher.\u003c/p\u003e \u003cp\u003eLicense: CoMAP is free to use for academics.\u003c/p\u003e \u003cp\u003eAny restrictions to use by non-academics: Commercial users should contact Dr. Travis Craddock at
[email protected]. Data used by CoMAP for academic use is available under the licence terms of each contributing databank (the Protein Data Bank, DAVID, and ConsensusPathDB).\u003c/p\u003e"},{"header":"Abbreviations","content":"\u003cp\u003eCoMAP: Cofactor Mapping \u0026amp; Analysis Program\u003c/p\u003e\n\u003cp\u003eNA-hybrid: Nucleic-acid hybrid\u003c/p\u003e\n\u003cp\u003eAPI: Application programming interface\u003c/p\u003e\n\u003cp\u003ePDB: The Protein Data Bank\u003c/p\u003e\n\u003cp\u003eCPDB: ConsensusPathDB\u003c/p\u003e\n\u003cp\u003eGUI: Graphical user interface\u003c/p\u003e\n\u003cp\u003eETC: Electron transport chain\u003c/p\u003e\n\u003cp\u003eROS: Reactive oxygen species\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eEthics approval and consent to participate\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConsent for publication\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAvailability of data and materials\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eAll data analysed during this study are included in this published article and its supplementary information files.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCompeting interests\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors declare that they have no competing interests.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFunding\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis research was undertaken thanks to funding from the Canada Research Chairs Program to Travis Craddock (CRC-2022-00204) and funding from the University of Waterloo.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthors’ contributions\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eResearch design, TJAC; program development, TJAC, IKB.; data analysis, TJAC, IKB; writing—initial manuscript, IKB; writing—review and editing, TJAC, IKB. 
All authors read and approved the final manuscript.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAcknowledgements\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors would like to thank Drs. Hadi Zadeh-Haghighi and Lea Gassab for helpful insights and advice.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eDupont CL, Yang S, Palenik B, Bourne PE: \u003cstrong\u003eModern proteomes contain putative imprints of ancient shifts in trace metal geochemistry\u003c/strong\u003e. \u003cem\u003eProceedings of the National Academy of Sciences \u003c/em\u003e2006, \u003cstrong\u003e103\u003c/strong\u003e(47):17822-17827.\u003c/li\u003e\n\u003cli\u003eLi J, He X, Gao S, Liang Y, Qi Z, Xi Q, Zuo Y, Xing Y: \u003cstrong\u003eThe metal-binding protein atlas (MbPA): an integrated database for curating metalloproteins in all aspects\u003c/strong\u003e. \u003cem\u003eJournal of molecular biology \u003c/em\u003e2023, \u003cstrong\u003e435\u003c/strong\u003e(14):168117.\u003c/li\u003e\n\u003cli\u003eShi W, Chance M: \u003cstrong\u003eMetallomics and metalloproteomics\u003c/strong\u003e. \u003cem\u003eCellular and Molecular Life Sciences \u003c/em\u003e2008, \u003cstrong\u003e65\u003c/strong\u003e:3040-3048.\u003c/li\u003e\n\u003cli\u003eCant\u0026oacute; C, Menzies KJ, Auwerx J: \u003cstrong\u003eNAD+ metabolism and the control of energy homeostasis: a balancing act between mitochondria and the nucleus\u003c/strong\u003e. \u003cem\u003eCell Metab \u003c/em\u003e2015, \u003cstrong\u003e22\u003c/strong\u003e(1):31-53.\u003c/li\u003e\n\u003cli\u003eAndreini C, Bertini I, Rosato A: \u003cstrong\u003eMetalloproteomes: a bioinformatic approach\u003c/strong\u003e. \u003cem\u003eAcc Chem Res \u003c/em\u003e2009, \u003cstrong\u003e42\u003c/strong\u003e(10):1471-1479.\u003c/li\u003e\n\u003cli\u003eDudev T, Lim C: \u003cstrong\u003eCompetition among metal ions for protein binding sites: determinants of metal ion selectivity in proteins\u003c/strong\u003e. 
\u003cem\u003eChemical reviews \u003c/em\u003e2014, \u003cstrong\u003e114\u003c/strong\u003e(1):538-556.\u003c/li\u003e\n\u003cli\u003eJomova K, Makova M, Alomar SY, Alwasel SH, Nepovimova E, Kuca K, Rhodes CJ, Valko M: \u003cstrong\u003eEssential metals in health and disease\u003c/strong\u003e. \u003cem\u003eChemico-biological interactions \u003c/em\u003e2022, \u003cstrong\u003e367\u003c/strong\u003e:110173.\u003c/li\u003e\n\u003cli\u003eZoroddu MA, Aaseth J, Crisponi G, Medici S, Peana M, Nurchi VM: \u003cstrong\u003eThe essential metals for humans: a brief overview\u003c/strong\u003e. \u003cem\u003eJ Inorg Biochem \u003c/em\u003e2019, \u003cstrong\u003e195\u003c/strong\u003e:120-129.\u003c/li\u003e\n\u003cli\u003eJahnen-Dechent W, Ketteler M: \u003cstrong\u003eMagnesium basics\u003c/strong\u003e. \u003cem\u003eClinical kidney journal \u003c/em\u003e2012, \u003cstrong\u003e5\u003c/strong\u003e(Suppl_1):i3-i14.\u003c/li\u003e\n\u003cli\u003eProhaska JR, Smith TL: \u003cstrong\u003eEffect of dietary or genetic copper deficiency on brain catecholamines, trace metals and enzymes in mice and rats\u003c/strong\u003e. \u003cem\u003eThe Journal of Nutrition \u003c/em\u003e1982, \u003cstrong\u003e112\u003c/strong\u003e(9):1706-1717.\u003c/li\u003e\n\u003cli\u003eLiu Yk, Xu H, Liu F, Tao R, Yin J: \u003cstrong\u003eEffects of serum cobalt ion concentration on the liver, kidney and heart in mice\u003c/strong\u003e. \u003cem\u003eOrthopaedic Surgery \u003c/em\u003e2010, \u003cstrong\u003e2\u003c/strong\u003e(2):134-140.\u003c/li\u003e\n\u003cli\u003ePereira M, Pereira M, Sousa J: \u003cstrong\u003eEvaluation of nickel toxicity on liver, spleen, and kidney of mice after administration of high‐dose metal ion\u003c/strong\u003e. 
\u003cem\u003eJournal of Biomedical Materials Research: An Official Journal of The Society for Biomaterials, The Japanese Society for Biomaterials, and the Australian Society for Biomaterials \u003c/em\u003e1998, \u003cstrong\u003e40\u003c/strong\u003e(1):40-47.\u003c/li\u003e\n\u003cli\u003eChen Y, Li F, Mao J, Chen Y, Nielsen J: \u003cstrong\u003eYeast optimizes metal utilization based on metabolic network and enzyme kinetics\u003c/strong\u003e. \u003cem\u003eProceedings of the National Academy of Sciences \u003c/em\u003e2021, \u003cstrong\u003e118\u003c/strong\u003e(12):e2020154118.\u003c/li\u003e\n\u003cli\u003eCyert MS, Philpott CC: \u003cstrong\u003eRegulation of cation balance in Saccharomyces cerevisiae\u003c/strong\u003e. \u003cem\u003eGenetics \u003c/em\u003e2013, \u003cstrong\u003e193\u003c/strong\u003e(3):677-713.\u003c/li\u003e\n\u003cli\u003eWang Y, Weisenhorn E, MacDiarmid CW, Andreini C, Bucci M, Taggart J, Banci L, Russell J, Coon JJ, Eide DJ: \u003cstrong\u003eThe cellular economy of the Saccharomyces cerevisiae zinc proteome\u003c/strong\u003e. \u003cem\u003eMetallomics \u003c/em\u003e2018, \u003cstrong\u003e10\u003c/strong\u003e(12):1755-1776.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eGoogle Colaboratory \u003c/strong\u003e[https://colab.research.google.com/]\u003c/li\u003e\n\u003cli\u003eBerman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: \u003cstrong\u003eThe protein data bank\u003c/strong\u003e. \u003cem\u003eNucleic acids research \u003c/em\u003e2000, \u003cstrong\u003e28\u003c/strong\u003e(1):235-242.\u003c/li\u003e\n\u003cli\u003eHuang DW, Sherman BT, Lempicki RA: \u003cstrong\u003eSystematic and integrative analysis of large gene lists using DAVID bioinformatics resources\u003c/strong\u003e. 
\u003cem\u003eNat Protoc \u003c/em\u003e2009, \u003cstrong\u003e4\u003c/strong\u003e(1):44-57.\u003c/li\u003e\n\u003cli\u003eHuang DW, Sherman BT, Lempicki RA: \u003cstrong\u003eBioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists\u003c/strong\u003e. \u003cem\u003eNucleic acids research \u003c/em\u003e2009, \u003cstrong\u003e37\u003c/strong\u003e(1):1-13.\u003c/li\u003e\n\u003cli\u003eKamburov A, Herwig R: \u003cstrong\u003eConsensusPathDB 2022: molecular interactions update as a resource for network biology\u003c/strong\u003e. \u003cem\u003eNucleic acids research \u003c/em\u003e2022, \u003cstrong\u003e50\u003c/strong\u003e(D1):D587-D595.\u003c/li\u003e\n\u003cli\u003eKamburov A, Pentchev K, Galicka H, Wierling C, Lehrach H, Herwig R: \u003cstrong\u003eConsensusPathDB: toward a more complete picture of cell biology\u003c/strong\u003e. \u003cem\u003eNucleic acids research \u003c/em\u003e2011, \u003cstrong\u003e39\u003c/strong\u003e(suppl 1):D712-D717.\u003c/li\u003e\n\u003cli\u003eKamburov A, Stelzl U, Lehrach H, Herwig R: \u003cstrong\u003eThe ConsensusPathDB interaction database: 2013 update\u003c/strong\u003e. \u003cem\u003eNucleic acids research \u003c/em\u003e2013, \u003cstrong\u003e41\u003c/strong\u003e(D1):D793-D800.\u003c/li\u003e\n\u003cli\u003eKamburov A, Wierling C, Lehrach H, Herwig R: \u003cstrong\u003eConsensusPathDB\u0026mdash;a database for integrating human functional interaction networks\u003c/strong\u003e. \u003cem\u003eNucleic acids research \u003c/em\u003e2009, \u003cstrong\u003e37\u003c/strong\u003e(suppl 1):D623-D628.\u003c/li\u003e\n\u003cli\u003eJiang X, Stockwell BR, Conrad M: \u003cstrong\u003eFerroptosis: mechanisms, biology and role in disease\u003c/strong\u003e. 
\u003cem\u003eNature reviews Molecular cell biology \u003c/em\u003e2021, \u003cstrong\u003e22\u003c/strong\u003e(4):266-282.\u003c/li\u003e\n\u003cli\u003eDixon SJ, Olzmann JA: \u003cstrong\u003eThe cell biology of ferroptosis\u003c/strong\u003e. \u003cem\u003eNature reviews Molecular cell biology \u003c/em\u003e2024, \u003cstrong\u003e25\u003c/strong\u003e(6):424-442.\u003c/li\u003e\n\u003cli\u003eLi J, Cao F, Yin H-l, Huang Z-j, Lin Z-t, Mao N, Sun B, Wang G: \u003cstrong\u003eFerroptosis: past, present and future\u003c/strong\u003e. \u003cem\u003eCell death \u0026amp; disease \u003c/em\u003e2020, \u003cstrong\u003e11\u003c/strong\u003e(2):88.\u003c/li\u003e\n\u003cli\u003eYu Y, Yan Y, Niu F, Wang Y, Chen X, Su G, Liu Y, Zhao X, Qian L, Liu P: \u003cstrong\u003eFerroptosis: a cell death connecting oxidative stress, inflammation and cardiovascular diseases\u003c/strong\u003e. \u003cem\u003eCell death discovery \u003c/em\u003e2021, \u003cstrong\u003e7\u003c/strong\u003e(1):193.\u003c/li\u003e\n\u003cli\u003eCabantchik ZI: \u003cstrong\u003eLabile iron in cells and body fluids: physiology, pathology, and pharmacology\u003c/strong\u003e. \u003cem\u003eFrontiers in pharmacology \u003c/em\u003e2014, \u003cstrong\u003e5\u003c/strong\u003e:45.\u003c/li\u003e\n\u003cli\u003eShimada BK, Swanson S, Toh P, Seale LA: \u003cstrong\u003eMetabolism of selenium, selenocysteine, and selenoproteins in ferroptosis in solid tumor cancers\u003c/strong\u003e. \u003cem\u003eBiomolecules \u003c/em\u003e2022, \u003cstrong\u003e12\u003c/strong\u003e(11):1581.\u003c/li\u003e\n\u003cli\u003eXie Y, Hou W, Song X, Yu Y, Huang J, Sun X, Kang R, Tang D: \u003cstrong\u003eFerroptosis: process and function\u003c/strong\u003e. 
\u003cem\u003eCell Death \u0026amp; Differentiation \u003c/em\u003e2016, \u003cstrong\u003e23\u003c/strong\u003e(3):369-379.\u003c/li\u003e\n\u003cli\u003eChen X, Kang R, Kroemer G, Tang D: \u003cstrong\u003eBroadening horizons: the role of ferroptosis in cancer\u003c/strong\u003e. \u003cem\u003eNature reviews Clinical oncology \u003c/em\u003e2021, \u003cstrong\u003e18\u003c/strong\u003e(5):280-296.\u003c/li\u003e\n\u003cli\u003eWang W, Green M, Choi JE, Gij\u0026oacute;n M, Kennedy PD, Johnson JK, Liao P, Lang X, Kryczek I, Sell A: \u003cstrong\u003eCD8+ T cells regulate tumour ferroptosis during cancer immunotherapy\u003c/strong\u003e. \u003cem\u003eNature \u003c/em\u003e2019, \u003cstrong\u003e569\u003c/strong\u003e(7755):270-274.\u003c/li\u003e\n\u003cli\u003eXia X, Wu H, Chen Y, Peng H, Wang S: \u003cstrong\u003eFerroptosis of T cell in inflammation and tumour immunity\u003c/strong\u003e. \u003cem\u003eClinical and Translational Medicine \u003c/em\u003e2025, \u003cstrong\u003e15\u003c/strong\u003e(3):e70253.\u003c/li\u003e\n\u003cli\u003eWither MJ, White WL, Pendyala S, Leanza PJ, Fowler DM, Kueh HY: \u003cstrong\u003eAntigen perception in T cells by long-term Erk and NFAT signaling dynamics\u003c/strong\u003e. \u003cem\u003eProceedings of the National Academy of Sciences \u003c/em\u003e2023, \u003cstrong\u003e120\u003c/strong\u003e(52):e2308366120.\u003c/li\u003e\n\u003cli\u003eEwing E, Planell-Picola N, Jagodic M, Gomez-Cabrero D: \u003cstrong\u003eGeneSetCluster: a tool for summarizing and integrating gene-set analysis results\u003c/strong\u003e. 
\u003cem\u003eBMC Bioinformatics \u003c/em\u003e2020, \u003cstrong\u003e21\u003c/strong\u003e:1-7.\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"Protein cofactor, biological pathways, overrepresentation analysis, hierarchical clustering, ferroptosis","lastPublishedDoi":"10.21203/rs.3.rs-6605908/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-6605908/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003ch2\u003eBackground:\u003c/h2\u003e \u003cp\u003eProtein cofactors, such as metal ions, are an essential part of many proteins, playing key structural, regulatory, and enzymatic roles. Without these cofactors, roughly one-third of all proteins would cease functioning properly. Deficiencies of these cofactors can have detrimental effects on health, contributing to the development of various diseases. The same is true for mice and yeast. It is evident, then, that the concentrations of cofactors must be carefully maintained so as not to damage the organism. The Cofactor Mapping \u0026amp; Analysis Program (CoMAP) allows users to determine the pathways in human, mouse, and yeast biology where any cofactor crucial to protein function is significantly involved. This tool enables the identification of biological processes and specific pathways within an organism that are influenced by changes in cofactor concentrations, providing a deeper insight into cofactor-dependent proteins and their involvement in biology.\u003c/p\u003e\u003ch2\u003eResults:\u003c/h2\u003e \u003cp\u003eTo our knowledge, no other bioinformatics tool exists with the same functionality as CoMAP. 
Via a graphical user interface, CoMAP constructs a list of the Ensembl gene IDs encoding proteins that contain the specified cofactor(s), performs an overrepresentation analysis to identify the significant pathways in the organism using these genes, and hierarchically clusters these pathways based on similarities in their gene sets. A use case investigating ferroptosis in humans was carried out using this method. The example application took iron II, iron III, and, later, iron-sulfur clusters as the selected cofactors, identifying the pathways in human biology that would be affected by ferroptosis. The program returned pathways that have been experimentally shown to be impacted by ferroptosis in addition to novel pathways. The CoMAP Python script is available at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://github.com/tcraddock/CoMAP\u003c/span\u003e\u003cspan address=\"https://github.com/tcraddock/CoMAP\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/p\u003e\u003ch2\u003eConclusions:\u003c/h2\u003e \u003cp\u003eCoMAP provides insight into the cellular functions most likely to be affected by a depletion or augmentation of the cofactor(s). CoMAP has use cases, applications, and future directions in bioengineering biological pathways, advancing personalized medicine, and elucidating new ways to treat diseases resulting from cofactor deficiency and/or imbalance.\u003c/p\u003e","manuscriptTitle":"CoMAP: A Program to Cluster Pathways Overrepresented with Specific Cofactors in Human, Mouse, and Yeast Biology","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-05-14 16:38:44","doi":"10.21203/rs.3.rs-6605908/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"19df49fc-2a43-46da-97c6-17bef52b879a","owner":[],"postedDate":"May 14th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[],"tags":[],"updatedAt":"2026-02-18T08:26:22+00:00","versionOfRecord":[],"versionCreatedAt":"2025-05-14 16:38:44","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-6605908","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-6605908","identity":"rs-6605908","version":["v1"]},"buildId":"XKTyCvWXoU3ODBz1xrDgd","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}