Prediction and discovery of protein-protein direct interactions and stable complexes based on gene co-expression and co-evolution

doi:10.1101/2024.10.05.616780


2024 · doi:10.1101/2024.10.05.616780

preprint OA: closed CC-BY-NC-4.0



Abstract

In this study we employed a data-driven approach to explore the evolutionary and genetic determinants of direct protein interactions and stable complex formation in the human proteome. We found that simple co-evolutionary and co-expression metrics are highly informative of direct interactions and stable complexes. We used this information to train supervised binary classifiers to predict interactions that are either directly involved in the formation of a complex (as annotated in IntAct) or part of stable complexes (from Complex Portal). In the former task, our model discriminated direct interactions with an AUROC of 0.813, while in the latter it discriminated interactions forming stable complexes with an AUROC of 0.964. In both cases, our approach outperformed STRING, which we employed as a baseline. Feature importance analysis revealed different contributions to the prediction of these distinct interaction types. Co-evolutionary features, in particular those relating to protein domains involved in interaction interfaces, are more important for discriminating direct interactions. On the other hand, co-expression features contributed more to the prediction of stable complexes. From these pairwise predictions we generated a proteome-wide network that we clustered to assess the recovery of known complexes from Complex Portal within network communities. We were able to recover known complexes at a higher accuracy compared to other approaches. In conclusion, we propose a new method able to discriminate direct interactions as well as interactions forming stable complexes. This method can be used to stratify molecular interaction networks, as well as to perform discovery of new functional complexes at a proteome-wide scale.

bioRxiv preprint, doi: https://doi.org/10.1101/2024.10.05.616780; this version posted October 6, 2024. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license.

Introduction

Protein-protein interaction networks orchestrate the structure and functioning of the cell and are often disrupted in disease. Advances in mass spectrometry techniques, coupled with high-throughput screening approaches such as yeast two-hybrid (Y2H) (Luck et al., 2020), affinity purification (AP-MS) (Huttlin et al., 2021), or proximity labeling (Sears et al., 2019), have enabled the discovery of interactions on a proteome-wide scale. However, these experimental approaches produce large interactomic datasets that are often affected by non-specific interactions which do not reflect physiological conditions, leading to high false-positive rates. Available resources such as IntAct (del Toro et al., 2022) classify interactions into different types, i.e. direct, physical or association, via a curation process which entails integration with other experimental sources, including analysis of 3D complex structures. However, discriminating direct interactions from indirect or spurious associations within large PPI networks remains a challenging task, often preventing a deeper mechanistic interpretation of these datasets. Computational strategies to structurally dissect molecular interaction networks have recently been proposed, sparked by the success of AI-driven structural predictions. For instance, AlphaFold-multimer (Evans et al., 2022) has been used to predict with high confidence the structures of thousands of human protein interactions (Burke et al., 2023). Variations of the RoseTTAFold method (i.e. Rosetta 2 Track (Humphreys et al., n.d.) and RoseTTAFold2-Lite (Humphreys et al., 2024)) have been employed, in combination with AF-multimer, to screen large PPI network sets and predict the structures of high-confidence complexes. Machine learning methods have also been developed to discriminate direct interactions by taking AF-multimer predictions as input (Schmid & Walter, n.d.).
These approaches rely in the first place on structural predictions of the complexes through AI models such as AF-multimer, which are computationally intensive and might be prohibitive at the proteome scale. For this reason, such methods mostly rely on modeling binary interactions and do not consider the modeling of higher-order multimers. Moreover, given the stringency of the metrics employed, which are based on AF-multimer’s confidence score, these approaches tend to focus only on a limited number of higher-confidence structures, opening the possibility of high false-negative rates. Hence, an algorithm able to process an input interactomic dataset and score the interactions that are more likely to be direct or to form stable complexes would be highly desirable and could be used to streamline efforts to structurally model proteome-wide interactomes.

Results

Gene co-evolution and co-expression inform about protein direct interactions and stable complexes

We derived a set of pairwise gene co-evolutionary and co-expression features and checked for their statistical associations with molecular interaction types (Figure 1). To obtain co-evolutionary features, we first retrieved the list of orthologs for each human gene available from a reference orthology database (i.e. OMA (Altenhoff et al., 2021); see Methods). We estimated the co-evolution between two genes as the degree of their co-presence in sequenced genomes, which we assessed through both Jaccard and Mutual Information metrics (see Methods). We also derived similar features at the domain level by considering the co-evolution of pairs of annotated domains (i.e. InterPro (Blum et al., 2021)) on protein pairs across sequenced genomes (see Methods). We considered every possible domain pair as well as those found in spatial contact in 3D structures from the Protein Data Bank (PDB) (see Methods). Next, we assessed the degree of co-expression of interacting genes in 47 healthy (GTEx) and 32 cancer (TCGA) tissues, using weighted gene co-expression network analysis (WGCNA; Langfelder & Horvath, 2008), from which we considered both expression correlation and the Topological Overlap Measure (TOM) as estimates of gene-pair co-expression (see Methods). We checked for correlation among features and, as expected, found that features of the same type (i.e. co-evolutionary or co-expression) have higher correlation values (Figure 2A). Among co-expression features, we found that correlations in healthy sub-tissues preserved the tissue of origin (Figure 2A). Some of them, such as brain or gastro-intestinal tract healthy tissues, are characterized by higher correlation values (Figure 2A).
On the other hand, the correlation values derived from cancer tissues were characterized by higher values, irrespective of the tissue of origin, which suggests that tumor tissues have lost the transcriptional program characteristic of the original healthy tissue to foster oncogenic transformation.

We then used the co-evolution and co-expression values as features (independent variables) of gene pairs, and we employed the interaction types as target variables (i.e. classification labels). We considered two protein-protein interaction (PPI) datasets. First, we used the entire human interactome from IntAct (del Toro et al., 2022; the IA set), where we stratified interactions according to the annotated interaction type, i.e. Association, Physical and Direct (in turn sub-divided into Enzymatic and Non-enzymatic) interactions. To these standard annotated types, we added a further label indicating whether the interaction is reported to participate in stable complexes from Complex Portal (Meldal et al., 2022). We also generated a second dataset by considering all the human interactions from Complex Portal, which we combined with randomly picked protein-protein pairs used as a negative, background set (the CP set; see Methods).

We found that distinct interaction classes from IntAct are associated with significantly different distributions of these features (Figure 2B). For instance, features based on co-evolution have significantly higher values for stable complexes or direct interactions (Figure 2B) compared to other interaction types (i.e. Association or Physical interaction).
In particular, all 164 features (both co-evolution and co-expression based) were significantly higher in interactions between proteins belonging to the same complex (from Complex Portal, with maximum P-value < 10^-58, Mann-Whitney test with Bonferroni correction on the IA set). We found that certain individual features were predictive of certain interaction types. In the IA set, we found 42 unique features with AUROC values greater than 0.7 for interactions forming stable complexes in Complex Portal, with co-expression from lung adenocarcinoma (“lung_adenocarcinoma_TOM”) being the single feature with the highest AUROC (0.736) (Supplementary Table 1). In general, we found higher AUROC values for co-expression features than for co-evolutionary ones in the prediction of stable complex interactions. In particular, co-expression features from cancer transcriptomics are characterized by higher AUROC values for the prediction of stable complexes (Figure 2C), suggesting a higher activity of complexes sustaining oncogenic processes.

Training a stable interaction classifier based on co-evolutionary and co-expression features

We employed 6 co-evolutionary and 158 co-expression features to train a supervised machine learning algorithm to discriminate interactions involved in stable complexes and direct interactions. We employed a state-of-the-art algorithm for supervised learning on tabular feature sets, XGBoost (Chen & Guestrin, 2016), with a Bayesian procedure for hyperparameter search, Optuna (Akiba et al., 2019). The model is able to discriminate interactions forming stable complexes in the CP dataset with an AUROC of 0.964 (Figure 3A). In the IA dataset, the model achieves an AUROC of 0.920 for the same task. The slightly worse performance on the IA dataset is likely attributable to the use of protein pairs from other interaction types (i.e. Physical and Association) as negatives instead of random pairs.
To further assess the model's capacity to identify novel complexes, we randomly selected 10% of the complexes in Complex Portal as a "held-out" set. Proteins within these complexes were excluded from the training data, and the model was trained on the remaining CP set. We then evaluated the model's ability to retrieve stable interactions within this held-out protein set, achieving an AUROC of 0.802 on this task. We additionally used the IA set to train models for predicting the five interaction types considered: association, physical association, non-enzymatic direct interaction, enzymatic direct interaction, and direct interaction (the union of non-enzymatic and enzymatic direct interactions). The AUROC values for the association and physical association predictors are approximately 0.76, while those for the three direct interaction categories range between 0.81 and 0.85 (Figure 3B).

We compared the performance of the trained models with that of the STRING networks on the entire human Complex Portal and IntAct datasets. On the CP dataset, our best performing model performed similarly to STRING and STRING Physical (Szklarczyk et al., 2023). It must be noted, however, that while STRING considers a wealth of information to predict functional interactions, including data from the literature that are also used to curate a resource like Complex Portal, our model exclusively exploited co-evolutionary and co-expression features. Indeed, when considering only the co-evolution and co-expression scores from STRING, our model achieved better performance (Figure 3C).
On the IA direct interaction set, our model outperformed all the different scoring criteria from STRING (Figure 3D).

We inspected the importance of the features in the different models. Consistent with the exploratory feature analysis (see above), we found that in the task of stable complex interaction classification the most important features are dominated by co-expression, followed by co-evolutionary features (Figure 3E). On the other hand, in the task of direct interaction prediction, the most important features are the co-evolution of interacting domains or entire genes (Figure 3F). This suggests that while cellular contextual information, expressed as gene co-expression, is critical for predicting stable complexes, molecular and structural features, such as the co-evolution of interacting domains, are more important for predicting direct interactions.

Human proteome-wide prediction and discovery of novel complexes

We employed the CP models to evaluate their capability to recover known complexes from Complex Portal. To this end, we used the models trained on Complex Portal to predict the probability of interaction of every possible protein-protein pair within the human proteome. We then used the probabilities returned by the models as weights to obtain a proteome-wide, weighted adjacency matrix. We clustered the resulting graph using the Louvain approach (Nguyen et al., 2008) to retrieve communities of interacting proteins. We then assessed the recovery of known complexes from Complex Portal within the detected modules at different clustering depths using the geometric accuracy, a cluster comparison metric specifically developed for protein-protein interaction networks (Brohée & van Helden, 2006) (Figure 4; see Methods). We found that our approach outperformed all STRING baselines, including the STRING and STRING Physical networks, in recovering known complexes on the entire CP set (Figure 4).
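As an illustration of the geometric accuracy used to score complex recovery, the following is a minimal sketch based on our reading of the definition in Brohée & van Helden (2006): the geometric mean of a size-weighted sensitivity and a positive predictive value computed from the complex-by-cluster overlap table. The function name and the representation of complexes and clusters as Python sets are assumptions made for this example, and the result may differ in detail from the CDlib implementation used in the study.

```python
from math import sqrt


def geometric_accuracy(complexes, clusters):
    """Geometric accuracy between reference complexes and predicted clusters.

    Both arguments are lists of sets of protein identifiers (hypothetical
    input format chosen for this sketch).
    """
    if not complexes or not clusters:
        return 0.0

    # Contingency table: t[i][j] = proteins shared by complex i and cluster j.
    t = [[len(cx & cl) for cl in clusters] for cx in complexes]

    # Sensitivity: best coverage of each complex, weighted by complex size.
    total_complex_size = sum(len(cx) for cx in complexes)
    best_per_complex = sum(max(row) for row in t)

    # Positive predictive value: for each cluster, the share of its
    # complex-assigned proteins that falls in its best-matching complex.
    columns = list(zip(*t))
    total_overlap = sum(sum(col) for col in columns)
    best_per_cluster = sum(max(col) for col in columns)

    if total_complex_size == 0 or total_overlap == 0:
        return 0.0

    sensitivity = best_per_complex / total_complex_size
    ppv = best_per_cluster / total_overlap
    return sqrt(sensitivity * ppv)


reference = [{"A", "B", "C"}, {"D", "E"}]
predicted = [{"A", "B"}, {"C", "D", "E"}]
score = geometric_accuracy(reference, predicted)
```

In this toy case one protein (C) is assigned to the wrong cluster, so both sensitivity and PPV equal 4/5 and the score is approximately 0.8; identical complexes and clusters give 1.0.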

Discussion

In this study we have developed machine learning models to accurately predict interactions either involved in direct associations or mediating the formation of stable complexes. We demonstrated that by leveraging co-evolution information, such as the co-presence in sequenced genomes of gene or interacting domain pairs, as well as gene co-expression information from both healthy and cancer tissues, we could achieve predictive performance competitive with state-of-the-art methods such as STRING (Szklarczyk et al., 2023). Notably, the STRING scores obtained with features similar to the ones we employed, i.e. co-presence and co-expression scores, performed much worse. Intriguingly, co-expression information, particularly from cancer tissues, is more important for the prediction of stable complexes, suggesting that contextual information is important for the definition of stable complexes. On the other hand, we found that co-evolutionary information, particularly that of domain-domain pairs known to form 3D interfaces, was more important for the prediction of direct interactions, which are expected to involve 3D structured interfaces. Both of our models showed better performance than STRING in recovering known complexes when clustering the adjacency matrix of the human proteome obtained by weighting the edges with the interaction probabilities returned by the models. Taken together, these results suggest that our model could be used to process any given input interactomic dataset to discriminate interactions more likely to be either direct or involved in stable complexes from other interaction types.
The model can also potentially be applied to the prediction of protein complex topologies as well as to the discovery of new complexes at a proteome-scale level. This method could be pipelined with structural prediction algorithms such as AF-multimer to narrow down the list of candidates to model as well as to suggest more likely higher order complex topologies.

Methods

IA Dataset

The IA set was derived from the IntAct database (May 2024 release; https://www.ebi.ac.uk/intact/download/ftp), considering only interactions involving human proteins. Interactions were classified into four categories based on IntAct's "interaction type":

● "association": included interactions labeled as "colocalization," "proximity," or "association."
● "physical association": retained the original "physical association" label.
● "non-enzymatic direct interaction": included interactions labeled as "direct interaction."
● "enzymatic direct interaction": included interactions annotated with the name of an enzymatic reaction.

In cases where multiple instances of the same protein pair existed in IntAct, these were aggregated, and the interaction was assigned the highest-priority class according to the following order: "enzymatic direct interaction" > "non-enzymatic direct interaction" > "physical association" > "association". In the end the dataset contained 421,579 PPIs: 302,139 associations, 111,368 physical associations, 5,894 non-enzymatic direct interactions and 2,178 enzymatic direct interactions. We also annotated as Complex Portal interactions the protein pairs that were included in the same complex in at least one Complex Portal entry (9,654 positives).

CP Dataset

The CP dataset was generated by taking as positive instances all protein pairs that co-occur within the same Complex Portal (Meldal et al., 2022) complex (15,881 pairs). Negative instances were randomly sampled from the human proteome, excluding positive pairs, to a total of 674,637, to reproduce the same positive/negative ratio as in the IA dataset.
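The two dataset-construction steps above, priority-based aggregation of repeated IntAct pairs and random sampling of negative pairs for the CP set, can be sketched as follows. This is an illustrative outline only: the function names, the tuple-based record format and the fixed random seed are assumptions introduced here, not details taken from the authors' pipeline.

```python
import random

# Priority order as stated in the text: a higher value overrides a lower one.
PRIORITY = {
    "association": 0,
    "physical association": 1,
    "non-enzymatic direct interaction": 2,
    "enzymatic direct interaction": 3,
}


def aggregate_pair_labels(records):
    """Collapse repeated (protein_a, protein_b, label) records to one label per pair."""
    best = {}
    for a, b, label in records:
        pair = tuple(sorted((a, b)))  # treat pairs as unordered
        if pair not in best or PRIORITY[label] > PRIORITY[best[pair]]:
            best[pair] = label
    return best


def sample_negative_pairs(proteome, positives, n, seed=0):
    """Draw n distinct random unordered protein pairs that are not positives."""
    proteins = sorted(proteome)
    known = set(proteins)
    positives = {tuple(sorted(p)) for p in positives}
    in_proteome = sum(1 for a, b in positives if a in known and b in known)
    available = len(proteins) * (len(proteins) - 1) // 2 - in_proteome
    if n > available:
        raise ValueError("not enough non-positive pairs to sample from")

    rng = random.Random(seed)
    negatives = set()
    while len(negatives) < n:
        pair = tuple(sorted(rng.sample(proteins, 2)))
        if pair not in positives:
            negatives.add(pair)
    return negatives


records = [
    ("P1", "P2", "association"),
    ("P2", "P1", "enzymatic direct interaction"),
    ("P1", "P3", "physical association"),
]
labels = aggregate_pair_labels(records)
negatives = sample_negative_pairs({"P1", "P2", "P3", "P4"}, {("P1", "P2")}, n=3)
```

Rejection sampling is adequate here because positives are a very small fraction of all possible pairs in a proteome-sized set; for a dense positive set, enumerating the candidate pairs first would be the safer choice.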
Coevolution based features

For each protein of the human proteome, we extracted the list of genomes containing an orthologous sequence of that protein from the Orthologous MAtrix (OMA) database (Altenhoff et al., 2021) with a custom Python script using the PyOMADB client (Kaleb et al., 2019) (December 13, 2023; database version July 2023). Coevolution between each protein pair was measured using two approaches:

● The Jaccard similarity coefficient between the sets of genomes containing orthologs of the two proteins.
● The mutual information regarding the presence/absence of the two proteins across genomes, where i = 1 (or 0) denotes the presence (or absence) of an ortholog of the first protein, j = 1 (or 0) denotes the presence (or absence) of an ortholog of the second protein, and 'fij' represents the frequency of observing the combined state (i, j).

Domain coevolution

We retrieved domain annotations for each protein from the InterPro database (Blum et al., 2021) (v101.0). For a given domain D within a protein P, we define the set of genomes containing an ortholog of D as those genomes where an ortholog of P exists and also possesses a domain with the same ID as D. This definition allows us to measure the coevolution between two domains in a manner analogous to protein coevolution:

● Using the Jaccard similarity coefficient between the sets of genomes containing orthologs of the two domains.
● Calculating the mutual information regarding the presence/absence of orthologs of the two domains across genomes.

For our classifier, we use the mean coevolution of all domain pairs as features, computed with both the Jaccard similarity coefficient and the mutual information.

Interacting domains

We compiled a catalog of domain pairs with at least one structurally resolved interaction interface in the PDB.
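The Jaccard and mutual-information measures defined above for protein pairs and domain pairs, which also underlie the interacting-domain features of this subsection, can be sketched as follows. The explicit mutual-information equation is not present in the text, so this example assumes the standard form, MI = Σ_{i,j} f_ij · log2(f_ij / (f_i · f_j)), with f_i and f_j the marginal frequencies; the base-2 logarithm, the function names and the set-based input format are likewise assumptions made for illustration.

```python
from math import log2


def jaccard(genomes_a, genomes_b):
    """Jaccard similarity between two sets of genome identifiers."""
    union = genomes_a | genomes_b
    if not union:
        return 0.0
    return len(genomes_a & genomes_b) / len(union)


def mutual_information(genomes_a, genomes_b, all_genomes):
    """Mutual information (in bits) of two presence/absence profiles.

    genomes_a / genomes_b: genomes containing an ortholog of each protein
    (or domain); all_genomes: the full set of genomes considered.
    """
    n = len(all_genomes)
    if n == 0:
        return 0.0
    a = genomes_a & all_genomes
    b = genomes_b & all_genomes

    # Marginal counts for state 1 (present) and 0 (absent).
    n_a = {1: len(a), 0: n - len(a)}
    n_b = {1: len(b), 0: n - len(b)}
    # Joint counts for the four combined states (i, j).
    n_ab = {
        (1, 1): len(a & b),
        (1, 0): len(a - b),
        (0, 1): len(b - a),
        (0, 0): n - len(a | b),
    }

    mi = 0.0
    for (i, j), count in n_ab.items():
        if count:  # empty states contribute nothing
            # f_ij / (f_i * f_j) simplifies to count * n / (n_a[i] * n_b[j]).
            mi += (count / n) * log2(count * n / (n_a[i] * n_b[j]))
    return mi


genomes = {"g1", "g2", "g3", "g4"}
same_profile = mutual_information({"g1", "g2"}, {"g1", "g2"}, genomes)
unrelated = mutual_information({"g1", "g2"}, {"g1", "g3"}, genomes)
```

Two identical profiles present in half of the genomes share one bit of information, whereas the second pair of profiles is statistically independent and scores zero even though its Jaccard similarity is 1/3, which is one reason the two metrics can be complementary.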
We adopted InterPro's domain definitions and considered two domains to be in structural contact if they had a minimum of 5 residue-residue contacts. A residue-residue contact was defined as a distance of 8 Å or less between the Cβ atoms (or Cα for glycine) of the two amino acids. For our classifier, we used as additional features the mean coevolution of the domain pairs that have at least one structurally resolved interchain interaction interface between domains with the same IDs. In cases where no such domain pairs existed, these features were set to 0.

Co-expression based features

Bulk RNA-seq data are obtained from UCSC Xena (Goldman et al., 2020). The cohort includes data from 3 projects: TCGA (cancer tissue), GTEx (healthy tissue) and TARGET (pediatric data). All data from TARGET are removed from subsequent analysis, and only primary tumor data are kept from TCGA. The resulting dataset consists of a total of 18,305 samples: 7,775 healthy tissue samples from GTEx and 10,530 primary tumor samples from TCGA. First, outliers are identified and removed based on broad quality control metrics. Samples with fewer than 20,000 or more than 40,000 detected genes are removed, as are samples with fewer than 10 million or more than 120 million total counts. Samples in which the top 100 most expressed genes account for more than 90% of total counts are also removed. The data are then split according to tissue and histology (e.g. healthy pancreas and pancreatic adenocarcinoma are considered different tissues) and tissue-specific preprocessing is applied.
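The cohort-wide sample filters described above can be expressed compactly as in the sketch below. It assumes that a gene counts as "detected" when it has at least one read and that each sample is available as a plain list of per-gene raw counts; both choices, as well as the function names, are assumptions made for this example rather than details of the original pipeline.

```python
def sample_qc_metrics(counts):
    """Return (genes_detected, total_counts, top100_fraction) for one sample."""
    total = sum(counts)
    detected = sum(1 for c in counts if c > 0)
    top100 = sum(sorted(counts, reverse=True)[:100])
    return detected, total, (top100 / total if total else 0.0)


def passes_global_qc(counts):
    """Apply the three cohort-wide thresholds quoted in the text."""
    detected, total, top100_fraction = sample_qc_metrics(counts)
    return (
        20_000 <= detected <= 40_000            # number of detected genes
        and 10_000_000 <= total <= 120_000_000  # sequencing depth
        and top100_fraction <= 0.90             # not dominated by a few genes
    )


# Toy samples: 25,000 genes with 1,000 counts each passes all filters,
# whereas a sample with only 15,000 detected genes is rejected.
good_sample = [1000] * 25_000
sparse_sample = [2000] * 15_000 + [0] * 10_000
kept = [s for s in (good_sample, sparse_sample) if passes_global_qc(s)]
```

The boundaries are inclusive because the text removes only samples strictly below or above each threshold.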
In particular, we consider the following quality control metrics: the logarithm of total counts, the logarithm of the number of genes with at least one count (in both cases a pseudocount is added to handle 0 values) and the percentage of counts in the top 100 most expressed genes. Only samples in which all QC metrics are within 5 median absolute deviations of the median value for the tissue are kept. Next, we remove lowly expressed genes to reduce the effects of sparsity on the correlation analysis and prevent the formation of clusters of lowly expressed genes, which could complicate interpretation of the results. To this end, genes with fewer than 15 counts in more than 75% of the samples are removed from each tissue. We then apply the normalization and variance stabilizing transformation from DESeq2 (Love et al., 2014) (PyDESeq2 implementation (Muzellec et al., 2023)). Finally, we use principal component analysis to identify additional outliers. Specifically, we compute the distance of each sample from the centroid in principal component space (considering the first 10 principal components). Samples that are more than 5 standard deviations away from the centroid are then removed.

To generate co-expression based features, we use the implementation of the WGCNA algorithm (Langfelder & Horvath, 2008) provided in the PyWGCNA Python module (Rezaie et al., 2023). WGCNA first computes the Pearson correlation coefficient of each pair of genes across all samples; the correlation is then processed to obtain a non-negative, weighted, undirected adjacency matrix (signed adjacency):

$$a_{ij} = \left(\frac{\mathrm{corr}(x_i, x_j) + 1}{2}\right)^{\beta}$$

where $\beta$ is a soft-thresholding power whose purpose is to threshold the adjacency without binarizing it. The optimal value of $\beta$ is determined according to the approximate scale-free criterion (Zhang & Horvath, 2005), selecting a target value of $R^2 = 0.85$. Finally, the TOM matrix is obtained from the adjacency matrix (Zhang & Horvath, 2005):

$$\mathrm{TOM}_{ij} = \frac{\sum_{u} a_{iu} a_{uj} + a_{ij}}{\min(k_i, k_j) + 1 - |a_{ij}|}$$

where $k_i = \sum_{j} a_{ij}$ is the degree of gene $i$ in the co-expression network.

Model training

To classify the protein pairs into the different interaction types we trained an XGBoost classifier using the official package (Chen & Guestrin, 2016). Hyperparameter optimization was performed with Optuna (100 trials), maximizing average precision. All 164 previously described features were used as input. Undefined features (those that were impossible to compute for certain pairs) were set to -2 (a value outside the range of all the features), and pairs with all features undefined were removed. Both the IA and CP datasets were split into 80% training, 10% validation (used for hyperparameter optimization), and 10% testing sets. The final models were trained on the combined training and validation sets before evaluation on the held-out test set. The stability of the models was evaluated with a 10-fold cross-validation using the scikit-learn Python package (Pedregosa et al., 2011).

Testing on held out complexes

We randomly selected 10% of the original complexes as Test Complexes and another 10% as Validation Complexes.
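Returning briefly to the "Co-expression based features" subsection, the signed adjacency and the topological overlap can be illustrated with a small dependency-free sketch. The study itself used PyWGCNA; the code below is only a didactic re-implementation of the two formulas, with hypothetical function names, and it assumes that the diagonal of the adjacency matrix is set to zero so that the sum over u effectively excludes genes i and j themselves.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def signed_adjacency(expr, beta):
    """Signed adjacency a_ij = ((corr(x_i, x_j) + 1) / 2) ** beta.

    expr: list of per-gene expression vectors (one value per sample).
    The diagonal is left at 0 (assumption, see lead-in).
    """
    g = len(expr)
    adj = [[0.0] * g for _ in range(g)]
    for i in range(g):
        for j in range(i + 1, g):
            a = ((pearson(expr[i], expr[j]) + 1) / 2) ** beta
            adj[i][j] = adj[j][i] = a
    return adj


def topological_overlap(adj):
    """TOM_ij = (sum_u a_iu a_uj + a_ij) / (min(k_i, k_j) + 1 - |a_ij|)."""
    g = len(adj)
    k = [sum(row) for row in adj]  # weighted degree of each gene
    tom = [[0.0] * g for _ in range(g)]
    for i in range(g):
        for j in range(i + 1, g):
            shared = sum(adj[i][u] * adj[u][j] for u in range(g))
            value = (shared + adj[i][j]) / (min(k[i], k[j]) + 1 - abs(adj[i][j]))
            tom[i][j] = tom[j][i] = value
    return tom


# Three toy genes: the first two are perfectly correlated, the third is
# perfectly anti-correlated with both.
expression = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1]]
adjacency = signed_adjacency(expression, beta=2)
tom = topological_overlap(adjacency)
```

In this toy example the correlated pair obtains adjacency and TOM values of 1, while pairs involving the anti-correlated gene obtain 0, which shows how the signed variant maps negative correlation to the absence of a connection.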
The training set included all pairs in the CP set that did not involve proteins from either the Test or Validation Complexes. The validation set, used for hyperparameter optimization, included all pairs in the CP set exclusively involving proteins found in the Validation Complexes but not in the Test Complexes. After hyperparameter optimization, the model was trained on all pairs in the CP set that did not involve proteins from the Test Complexes. It was then tested on all possible pairs of proteins found within the union of all the Test Complexes (including those not present in the CP set).

STRING scores

We benchmarked our models' predictions against evidence scores from STRING (v12.0), mapping UniProt protein IDs to STRING IDs using the UniProt ID mapping files (Huang et al., 2011). We considered both the 'combined score' and the physical networks from STRING. Additionally, we included cooccurrence and coexpression scores for comparison, as these reflect the types of features utilized by our predictor.

Clustering

To identify clusters of proteins likely to form complexes, we constructed a network of all proteins belonging to at least one complex in the Complex Portal. We used the predicted probability of two proteins belonging to the same complex as the edge weight between them in this network. Clusters were then identified within this weighted network using the Louvain algorithm, with resolution values ranging from 1 to 100. Clustering was performed using NetworkX (v3.2.1; https://networkx.org/) and the CDlib library (v0.4.0; https://cdlib.readthedocs.io/en/latest/).
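For the protocol in "Testing on held out complexes" above, the following sketch shows one way the four pair sets could be derived from a random choice of Test and Validation Complexes. It is an interpretation of the textual description, not the authors' code: the function name, the dictionary-based input format, the fixed seed and the rounding of the 10% fraction are all assumptions.

```python
import random
from itertools import combinations


def held_out_split(complexes, cp_pairs, frac=0.10, seed=0):
    """Split CP pairs around held-out complexes.

    complexes: dict mapping a complex ID to its set of proteins.
    cp_pairs: iterable of (protein_a, protein_b) pairs in the CP set.
    """
    rng = random.Random(seed)
    ids = sorted(complexes)
    rng.shuffle(ids)
    n = max(1, round(frac * len(ids)))
    test_ids, val_ids = ids[:n], ids[n:2 * n]

    test_proteins = set().union(*(complexes[c] for c in test_ids))
    # Validation proteins exclude anything that also occurs in a test complex.
    val_proteins = set().union(*(complexes[c] for c in val_ids)) - test_proteins

    split = {"train": [], "validation": [], "final_train": []}
    for a, b in cp_pairs:
        if a in test_proteins or b in test_proteins:
            continue  # pairs touching a test protein never enter training
        split["final_train"].append((a, b))
        if a in val_proteins and b in val_proteins:
            split["validation"].append((a, b))
        elif a not in val_proteins and b not in val_proteins:
            split["train"].append((a, b))

    # Evaluation uses every possible pair among the test proteins,
    # whether or not the pair is present in the CP set.
    split["test"] = list(combinations(sorted(test_proteins), 2))
    return split


toy_complexes = {f"C{i}": {f"P{i}a", f"P{i}b"} for i in range(10)}
toy_proteins = sorted(set().union(*toy_complexes.values()))
toy_pairs = list(combinations(toy_proteins, 2))
toy_split = held_out_split(toy_complexes, toy_pairs)
```

Under this reading, pairs that join a validation protein with a non-validation, non-test protein are used only in the final training round, and labels for the test pairs would still have to be assigned from co-membership in the Test Complexes.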
The clusters obtained at each resolution were compared to the known complexes using the cdlib.evaluation.geometric_accuracy function.

Software

Plots were generated with custom scripts in Python (v3.9.13), using matplotlib (v3.5.1) and seaborn (v0.11.2). Statistical analysis was performed using scikit-learn (v1.0.2).

Figures

Figure 1 Workflow of the procedure. Schematic workflow of the pipeline from dataset acquisition and model training through evaluation and human proteome level clustering. We used the OMA database to identify ortholog sequences, InterPro to identify domains and GTEx and TCGA to compute the tissue-specific co-expression. Two model types are trained: one to predict if a protein pair is part of a stable protein complex and one to discriminate direct interactions from weaker associations. After hyperparameter optimization and evaluation, the models are applied to the entire human proteome. The resulting interaction scores are used to construct a weighted network, which is then clustered to reveal protein complexes.

Figure 2 Analysis of individual features. A) Clustermap of the pairwise Spearman correlations of the 164 features; B) Distribution of the Jaccard coevolution across all the PPIs annotated in IntAct; C) Area under the ROC curve of the different feature categories in normal and tumor tissues for the task of identifying protein pairs involved in stable complexes. P-values refer to a Mann-Whitney test with Bonferroni correction (*p<0.05, **p<0.01, ***p<0.001, ****p<0.0001).

Figure 3 Machine learning models to predict direct and stable complex interactions. A) ROC curves of the model trained to discriminate interactions forming stable complexes on the IA set (blue), on the CP set (orange), on a set of random complexes excluded from the training set in their entirety (green); B) ROC curves of the models trained on the IA set to discriminate associations (red), physical associations (purple), direct interactions (brown), enzymatic direct interactions (pink), non-enzymatic direct interactions (gray); C) ROC curves of the model trained to discriminate interactions forming stable complexes on the CP set (orange) compared to the ones obtained using the different STRING confidence scores as predictors for the same task; D) ROC curves of the model trained to discriminate direct interactions on the IA set (pink) compared to the ones obtained using the different STRING confidence scores as predictors for the same task; E) Histogram representing the 10 most important features for the task of discriminating interactions forming stable complexes; F) Histogram representing the 10 most important features for the task of discriminating direct interactions. Feature importances are computed as the average loss change due to each feature.

Figure 4 Complex recovery at a proteome scale using the STRING and the predicted networks.
Complex recovery is assessed using Louvain clustering at different depths: A) Performance evaluation on the network of proteins belonging to at least one known complex; B) Performance evaluation on the network of prot eins belonging to the complexes that were held out from the training set of the stable complex prediction model in pink.
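The complex recovery score used to compare clusters with known complexes can be illustrated with a minimal sketch. It assumes the geometric accuracy definition of Brohée & van Helden (2006), i.e. the geometric mean of clustering-wise sensitivity and positive predictive value; whether cdlib.evaluation.geometric_accuracy computes exactly this quantity is an assumption, and the function name, arguments and protein labels below are hypothetical and not taken from the authors' code.

```python
from math import sqrt


def geometric_accuracy(complexes, clusters):
    """Geometric accuracy between reference complexes and predicted clusters.

    Illustrative re-implementation (assumed definition, Brohee & van Helden 2006):
    sqrt(sensitivity * PPV), both derived from the complex-by-cluster
    contingency table of shared proteins.
    """
    complexes = [set(c) for c in complexes]
    clusters = [set(c) for c in clusters]

    # table[i][j] = number of proteins shared by complex i and cluster j
    table = [[len(cx & cl) for cl in clusters] for cx in complexes]

    total_complex_size = sum(len(cx) for cx in complexes)
    total_overlap = sum(sum(row) for row in table)
    if total_complex_size == 0 or total_overlap == 0:
        return 0.0

    # Sensitivity: best-matching cluster coverage, summed over complexes
    sensitivity = sum(max(row) for row in table) / total_complex_size
    # PPV: best-matching complex share, summed over clusters
    ppv = sum(max(col) for col in zip(*table)) / total_overlap
    return sqrt(sensitivity * ppv)


# Toy example with hypothetical protein labels
known = [{"A", "B", "C"}, {"D", "E"}]
predicted = [{"A", "B"}, {"C", "D", "E"}]
score = geometric_accuracy(known, predicted)
```

In this toy case one protein of the first complex is assigned to the wrong cluster, so both sensitivity and PPV drop below 1; a clustering that reproduces the known complexes exactly scores 1.0.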

Acknowledgement

We gratefully acknowledge computational resources of the Center for High Performance Computing (CHPC) at SNS.

References

Akiba, T., Sano, S., Yanase, T., Ohta, T., & Koyama, M. (2019). Optuna: A Next-generation Hyperparameter Optimization Framework. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2623–2631. https://doi.org/10.1145/3292500.3330701

Altenhoff, A. M., Train, C. M., Gilbert, K. J., Mediratta, I., de Farias, T. M., Moi, D., Nevers, Y., Radoykova, H. S., Rossier, V., Vesztrocy, A. W., Glover, N. M., & Dessimoz, C. (2021). OMA orthology in 2021: website overhaul, conserved isoforms, ancestral gene order and more. Nucleic Acids Research, 49(D1), D373–D379. https://doi.org/10.1093/NAR/GKAA1007

Blum, M., Chang, H. Y., Chuguransky, S., Grego, T., Kandasaamy, S., Mitchell, A., Nuka, G., Paysan-Lafosse, T., Qureshi, M., Raj, S., Richardson, L., Salazar, G. A., Williams, L., Bork, P., Bridge, A., Gough, J., Haft, D. H., Letunic, I., Marchler-Bauer, A., … Finn, R. D. (2021). The InterPro protein families and domains database: 20 years on. Nucleic Acids Research, 49(D1), D344–D354. https://doi.org/10.1093/NAR/GKAA977

Brohée, S., & van Helden, J. (2006). Evaluation of clustering algorithms for protein-protein interaction networks. BMC Bioinformatics, 7(1), 1–19. https://doi.org/10.1186/1471-2105-7-488

Burke, D. F., Bryant, P., Barrio-Hernandez, I., Memon, D., Pozzati, G., Shenoy, A., Zhu, W., Dunham, A. S., Albanese, P., Keller, A., Scheltema, R. A., Bruce, J. E., Leitner, A., Kundrotas, P., Beltrao, P., & Elofsson, A. (2023). Towards a structurally resolved human protein interaction network. Nature Structural and Molecular Biology, 30(2), 216–225. https://doi.org/10.1038/s41594-022-00910-8

Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 13-17-August-2016, 785–794. https://doi.org/10.1145/2939672.2939785

del Toro, N., Shrivastava, A., Ragueneau, E., Meldal, B., Combe, C., Barrera, E., Perfetto, L., How, K., Ratan, P., Shirodkar, G., Lu, O., Mészáros, B., Watkins, X., Pundir, S., Licata, L., Iannuccelli, M., Pellegrini, M., Martin, M. J., Panni, S., … Hermjakob, H. (2022). The IntAct database: efficient access to fine-grained molecular interaction data. Nucleic Acids Research, 50(D1), D648–D653. https://doi.org/10.1093/NAR/GKAB1006

Evans, R., O’Neill, M., Pritzel, A., Antropova, N., Senior, A., Green, T., Žídek, A., Bates, R., Blackwell, S., Yim, J., Ronneberger, O., Bodenstein, S., Zielinski, M., Bridgland, A., Potapenko, A., Cowie, A., Tunyasuvunakool, K., Jain, R., Clancy, E., … Hassabis, D. (2022). Protein complex prediction with AlphaFold-Multimer. BioRxiv, 2021.10.04.463034. https://doi.org/10.1101/2021.10.04.463034

Goldman, M. J., Craft, B., Hastie, M., Repečka, K., McDade, F., Kamath, A., Banerjee, A., Luo, Y., Rogers, D., Brooks, A. N., Zhu, J., & Haussler, D. (2020). Visualizing and interpreting cancer genomics data via the Xena platform. Nature Biotechnology, 38(6), 675–678. https://doi.org/10.1038/S41587-020-0546-8

Huang, H., McGarvey, P. B., Suzek, B. E., Mazumder, R., Zhang, J., Chen, Y., & Wu, C. H. (2011). A comprehensive protein-centric ID mapping service for molecular data integration. Bioinformatics, 27(8), 1190–1191. https://doi.org/10.1093/BIOINFORMATICS/BTR101

Humphreys, I. R., Humphreys, I. R., Pei, J., Baek, M., Krishnakumar, A., & Anishchenko, I. (n.d.). Computed structures of core eukaryotic protein complexes. 1–17.

Humphreys, I. R., Zhang, J., Baek, M., Wang, Y., Krishnakumar, A., Pei, J., Anishchenko, I., Tower, C. A., Jackson, B. A., Warrier, T., Hung, D. T., Peterson, S. B., Mougous, J. D., Cong, Q., & Baker, D. (2024). Protein interactions in human pathogens revealed through deep learning. Nature Microbiology. https://doi.org/10.1038/s41564-024-01791-x

Huttlin, E. L., Bruckner, R. J., Navarrete-Perea, J., Cannon, J. R., Baltier, K., Gebreab, F., Gygi, M. P., Thornock, A., Zarraga, G., Tam, S., Szpyt, J., Gassaway, B. M., Panov, A., Parzen, H., Fu, S., Golbazi, A., Maenpaa, E., Stricker, K., Guha Thakurta, S., … Gygi, S. P. (2021). Dual proteome-scale networks reveal cell-specific remodeling of the human interactome. Cell, 184(11), 3022-3040.e28. https://doi.org/10.1016/J.CELL.2021.04.011

Kaleb, K., Vesztrocy, A. W., Altenhoff, A., & Dessimoz, C. (2019). Expanding the Orthologous Matrix (OMA) programmatic interfaces: REST API and the OmaDB packages for R and Python. F1000Research, 8, 42. https://doi.org/10.12688/f1000research.17548.2

Langfelder, P., & Horvath, S. (2008). WGCNA: An R package for weighted correlation network analysis. BMC Bioinformatics, 9(1), 1–13. https://doi.org/10.1186/1471-2105-9-559

Love, M. I., Huber, W., & Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12), 550. https://doi.org/10.1186/s13059-014-0550-8

Luck, K., Kim, D. K., Lambourne, L., Spirohn, K., Begg, B. E., Bian, W., Brignall, R., Cafarelli, T., Campos-Laborie, F. J., Charloteaux, B., Choi, D., Coté, A. G., Daley, M., Deimling, S., Desbuleux, A., Dricot, A., Gebbia, M., Hardy, M. F., Kishore, N., … Calderwood, M. A. (2020). A reference map of the human binary protein interactome. Nature, 580(7803), 402–408. https://doi.org/10.1038/s41586-020-2188-x

Meldal, B. H. M., Perfetto, L., Combe, C., Lubiana, T., Cavalcante, J. V. F., Bye-A-Jee, H., Waagmeester, A., del-Toro, N., Shrivastava, A., Barrera, E., Wong, E., Mlecnik, B., Bindea, G., Panneerselvam, K., Willighagen, E., Rappsilber, J., Porras, P., Hermjakob, H., & Orchard, S. (2022). Complex Portal 2022: new curation frontiers. Nucleic Acids Research, 50(D1), D578–D586. https://doi.org/10.1093/NAR/GKAB991

Muzellec, B., Teleńczuk, M., Cabeli, V., & Andreux, M. (2023). PyDESeq2: a python package for bulk RNA-seq differential expression analysis. Bioinformatics, 39(9). https://doi.org/10.1093/BIOINFORMATICS/BTAD547

Nguyen, L. Van, Laval, J.-P., Chainais, P., Iop, A., Blondel, V. D., Guillaume, J.-L., Lambiotte, R., & Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10), P10008. https://doi.org/10.1088/1742-5468/2008/10/P10008

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12(Oct), 2825–2830. http://www.jmlr.org/papers/v12/pedregosa11a.html

Rezaie, N., Reese, F., & Mortazavi, A. (2023). PyWGCNA: a Python package for weighted gene co-expression network analysis. Bioinformatics, 39(7). https://doi.org/10.1093/BIOINFORMATICS/BTAD415

Schmid, E. W., & Walter, J. C. (n.d.). Predictomes: A classifier-curated database of AlphaFold-modeled protein-protein interactions. https://doi.org/10.1101/2024.04.09.588596

Sears, R. M., May, D. G., & Roux, K. J. (2019). BioID as a Tool for Protein-Proximity Labeling in Living Cells. Methods in Molecular Biology (Clifton, N.J.), 2012, 299. https://doi.org/10.1007/978-1-4939-9546-2_15

Szklarczyk, D., Kirsch, R., Koutrouli, M., Nastou, K., Mehryary, F., Hachilif, R., Gable, A. L., Fang, T., Doncheva, N. T., Pyysalo, S., Bork, P., Jensen, L. J., & Von Mering, C. (2023). The STRING database in 2023: protein-protein association networks and functional enrichment analyses for any sequenced genome of interest. Nucleic Acids Research, 51(D1), D638–D646. https://doi.org/10.1093/NAR/GKAC1000

Zhang, B., & Horvath, S. (2005). A general framework for weighted gene co-expression network analysis. Statistical Applications in Genetics and Molecular Biology, 4(1). https://doi.org/10.2202/1544-6115.1128

License: CC-BY-NC-4.0