Methods
IA Dataset
The IA set was derived from the IntAct database (May 2024 release;
https://www.ebi.ac.uk/intact/download/ftp), cons idering only interactions involving human
proteins. Interactions were classified into four categories based on IntAct's "interaction type":
● "association": included interactions labeled as "colocalization," "proximity," or
"association."
● "physical association": retained the original "physical association" label.
● "non-enzymatic direct interaction": included interactions labeled as "direct
interaction."
● "enzymatic direct interaction": included interactions annotated with the name of an
enzymatic reaction.
In cases where multiple instances of the same protein pair existed in IntAct, these were
aggregated, and the interaction was assigned the highest priority class from the following
order: "enzymatic direct interaction" > "non- enzymatic direct interaction" > "physical
association" > "association". In the end the dataset contained 421,579 PPI: 302,139
associations, 111,368 physical associations, 5,894 non-enzymatic direct interactions and
2,178 direct interactions.
We also annotated as Complex Portal interactions, the protein pairs that were included in the
same complex in at least one of the instances of Complex Portal (9,654 positives).
CP Dataset
.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted October 6, 2024. ; https://doi.org/10.1101/2024.10.05.616780doi: bioRxiv preprint
The CP dataset was generated taking as positive instances all protein pairs that co-occur
within the same Complex Portal(Meldal et al., 2022) complex (15,881 pairs). Negative
instances were randomly sampled from the human proteome, excluding positive pairs, to a
total of 674,637, to reproduce the same positive/negative ratio of the IA dataset.
Coevolution based features
For each protein of the human proteome, we ex tracted the list of genomes containing an
orthologous sequence of that protein fr om the Orthologous MAtrix (OMA) database
(Altenhoff et al., 2021) using a custom Python script using the PyOMADB client (Kaleb et al.,
2019) (December 13, 2023, database version July 2023).
Coevolution between each protein pair was measured using two approaches:
● The Jaccard similarity coefficient between the sets of genomes containing orthologs
of the two proteins.
● The mutual information regarding the pres ence/absence of the two proteins across
genomes.
Where, i = 1 (or 0) denotes the presence (or absence) of an ortholog of the first
protein, and j = 1 (or 0) denotes the pres ence (or absence) of an ortholog of the
second protein. 'fij' represents the frequency of observing the combined state (i, j).
Domain coevolution
We retrieved domain annotations for each protein from the InterPro database (Blum et al.,
2021) (v101.0). For a given domain D within a protein P, we define the set of genomes
containing an ortholog of D as those genome s where an ortholog of P exists and also
possesses a domain with the same ID as D. This definition allows us to measure the
coevolution between two domains in a manner analogous to protein coevolution:
● Using the Jaccard similarity coefficient between the sets of genomes containing
orthologs of the two domains.
● Calculating the mutual information regarding the presence/absence of orthologs of
the two domains across genomes.
For our classifier, we utilize the mean coevolution of all domain pairs as features. This
means that coevolution is computed using both the Jaccard similarity coefficient and mutual
information approaches.
Interacting domains
We compiled a catalog of domain pairs with at least one structurally resolved interaction
interface in the PDB. We adopted InterPro's domain definitions and considered two domains
to be in structural contact if they had a minimum of 5 residue-residue contacts. A residue-
.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted October 6, 2024. ; https://doi.org/10.1101/2024.10.05.616780doi: bioRxiv preprint
residue contact was defined as having a distance of 8 Å or less between the C β atoms (or
Cα for glycine) of the two amino acids.
For our classifier, we used as additional feat ures the mean coevolution of the domain pairs
that have at least one structurally resolved in terchain interaction interface between domains
with the same IDs. In cases where no such domain pairs existed, these features were set to
0.
Co-expression based features
Bulk RNA-seq data is obtained from UCSC Xena(Goldman et al., 2020). The cohort includes
data from 3 projects: TCGA, focusing on c ancer tissue; GTEx, healthy tissue and TARGET,
pediatric data. All data from TARGET is removed from subsequent analysis, and only
primary tumor data is kept from TCGA. The re sulting dataset consists of a total of 18305
samples, 7775 healthy tissue samples from GTEx and 10530 primary tumor samples from
TCGA.
First, outliers are identified and removed based on broad quality control metrics. Samples
with less than 20000 or more than 40000 genes detected are removed, same for samples
with less than 10 millions or more than 120 millions total counts. Samples in which the top
100 most expressed genes accounted for more than 90% of total counts were also removed.
The data is then split according to tissue and histology (e.g. healthy pancreas and pancreatic
adenocarcinoma are considered different tiss ues) and tissue-specific preprocessing is
applied. In particular, we consider the following quality control metrics: logarithm of total
counts, logarithm of genes with at least one count (in both cases a pseudocount is added to
handle 0 values) and percentage of counts in the top 100 most expressed genes. Only
samples in which all QC metrics are within 5 median absolute deviations from their median
value of the tissue are kept.
Next, we remove lowly expressed genes to reduce the effects of sparsity on the correlation
analysis and prevent the formation of cluste rs of lowly expressed genes, which could
complicate interpretation of the results. To this end, genes with less than 15 counts in more
than 75% of the samples are removed from each tissue.
We then apply normalization and variance stabilizing transformation from DeSeq2 (Love et
al., 2014) (PyDeSeq2 implementation (Muzellec et al., 2023)). Finally, we use principal
component analysis to identify additional outliers. Specifically, we compute the distance of
each point from the centroid in the princi pal components space (considering the first 10
principal components). Samples that are more than 5 standard deviations away from the
centroid are then removed.
In order to generate co-expression based features, we use the implementation of the
WGCNA (Langfelder & Horvath, 2008)algorithm provided in the PyWGCNA Python module
(Rezaie et al., 2023).
WGCNA operates by first computing the Pearson correlation coefficient of each pair of
genes across all samples, the correlation is then processed to obtain a non-negative,
weighted, undirected adjacency matrix (signed adjacency):
.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted October 6, 2024. ; https://doi.org/10.1101/2024.10.05.616780doi: bioRxiv preprint
/g1853 /g3036/g3037 /g3404/g4666 /g1855/g1867/g1870/g1870/g4666/g1876 /g3036 ,/g1876 /g4667 /g3397 1
2 /g4667 /g3081
Where /g2010 represents a soft thresholding power and its purpose is thresholding the adjacency
without actually binarizing it. The optimal value of /g2010 is determined according to the
approximate scale-free criterion (Zhang & Horvat h, 2005) selecting a target value of
/g1844 /g2870 /g34040 . 8 5 .
Finally, the TOM matrix is obtained from the adjacency matrix (Zhang & Horvath, 2005):
/g1846/g1841/g1839 /g3036/g3037 /g3404
∑ /g3048 /g1853 /g3036/g3048 /g1853 /g3048/g3037 /g3397 /g1853 /g3036/g3037
/g1865/g1861/g1866/g4666/g1863 /g3036 ,/g1863 /g3037 /g4667 /g3397 1 /g3398 |/g1853 /g3036/g3037 |
Where /g1863 /g3036 /g3404 ∑ /g3037 /g1853 /g3036/g3037 is the degree of gene /g1861 in the co-expression network.
Model training
To classify the protein pairs in the different interaction types we trained an XGBoost
classifier using the official package (Chen & Guestrin, 2016). Hyperparameters optimization
was performed with Optuna (100 trials), maxi mizing average precision. All 164 previously
described features were used as input. Undefined features (that were impossible to compute
for certain pairs) were set to -2 (a value outside of the range of all the features), and pairs
with all features undefined were removed. Both the IA and CP datasets were split into 80%
training, 10% validation (used for hyperparam eters optimization), and 10% testing sets. The
final models were trained on the combined trai ning and validation sets before evaluation on
the held-out test set.
The stability of the models was evaluated with a 10-fold cross-validation using the scikit-
learn python package (Pedregosa et al., 2011).
Testing on held out complexes
We randomly selected 10% of the original co mplexes as Test Complexes and another 10%
as Validation Complexes. The training set included all pairs in the CP set that did not involve
proteins from either the Test or Validation Complexes. The validation set, used for
hyperparameter optimization, incl uded all pairs in the CP set exclusively involving proteins
found in the Validation Complexes but not in the Test Complexes. After hyperparameter
optimization, the model was trained on all pairs in the CP set that did not involve proteins
from the Test Complexes. It was then tested on all possible pairs of proteins found within the
union of all the Test Complexes (including those not present in the CP set).
STRING scores
We benchmarked our models' predictions against evidence scores from STRING (v12.0),
mapping UniProt protein IDs to STRING IDs using the UniProt ID mapping files (Huang et
al., 2011). We considered both the 'combined score' and the physical networks from
STRING. Additionally, we included cooccurrence and coexpression scores for comparison,
as these reflect the types of features utilized by our predictor.
.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted October 6, 2024. ; https://doi.org/10.1101/2024.10.05.616780doi: bioRxiv preprint
Clustering
To identify clusters of proteins likely to fo rm complexes, we constructed a network of all
proteins belonging to at least one complex in the Complex Portal. We used the predicted
probability of two proteins belonging to the same complex as the edge weight between them
in this network. Clusters were then identified within this weighted network using the Louvain
algorithm, using resolution ranging from 1 to 100. Clustering was performed using NetworkX
(v3.2.1; https://networkx.org/
) and CDlib library (v0.4.0;
https://cdlib.readthedocs.io/en/latest/). The clusters obtained at each resolution were
compared to the known complexes using the cdlib.evaluation.geometric_accuracy function.
Software
Plots were generated with customized scripts in python (v3.9.13), using matplotlib (v3.5.1)
and seaborn (v0.11.2). Statistical analysis was performed using scikit-learn (v1.0.2).
Figures
Figure 1 workflow of the procedure. Schematic workflow of the pipeline from dataset
acquisition and model training through evaluation and human proteome level clustering. We
used the OMA database to identify ortholog sequences, InterPro to identify domains and
GTEX and TCGA to compute the Tissue-specific coexpressions. Two model types are
trained: one to predict if a protein pair is part of a stable protein complex and one to
discriminate direct interactions from weaker associations. After hyperparameter optimization
and evaluation, the models are applied to the entire human proteome. The resulting
interaction scores are used to construct a weighted network, which is then clustered to
reveal protein complexes.
.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted October 6, 2024. ; https://doi.org/10.1101/2024.10.05.616780doi: bioRxiv preprint
Figure 2 Analysis of Individual Features A) Clustermap of the pairwise Spearman
correlations of the 164 features; B) Distribut ion of the Jaccard coevolution across all the
PPIs annotated in Intact; C) Area under the ROC cu rve of the different feature categories in
normal and tumor tissues for the task of identifying protein pairs involved in stable
.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted October 6, 2024. ; https://doi.org/10.1101/2024.10.05.616780doi: bioRxiv preprint
complexes. P-values refer to a Mann-Whitney test with Bonferroni correction (*p<0.05,
**p<0.01, ***p<0.001, ****p<0.0001).
.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted October 6, 2024. ; https://doi.org/10.1101/2024.10.05.616780doi: bioRxiv preprint
Figure 3 Machine learning models to predict direct and stable complex interactions. A) ROC
curves of the model trained to discriminate interactions forming stable complexes on the IA
set (blue), on the CP set (orange), on a set of random complexes excluded from the training
set in their entirety (green); B) ROC curves of the models trained on the IA set to
discriminate associations (red), physical associat ions (purple), direct interactions (brown),
enzymatic direct interactions (pink), non-enzymatic direct interactions (gray); C) ROC curves
of the model trained to discriminate interactions forming stable complexes on the CP set
(orange) compared to the ones obtained using the different STRING confidence scores as
predictors for the same task; D) ROC curves of the model trained to discriminate direct
interactions on the IA set (pink) compared to the ones obtained using the different STRING
confidence scores as predictors for the same task; E) Histogram representing the 10 most
important features for the task of discriminating interactions forming stable complexes; F)
Histogram representing the 10 most important features for the task of discriminating direct
interactions. Feature importances are co mputed as the average loss change due to each
feature.
Figure 4 Complex recovery at a proteome scale using the STRING and the predicted
networks. Complex recovery is assessed using Louvain clustering at different depths: A)
Performance evaluation on the network of proteins belonging to at least one known complex;
B) Performance evaluation on the network of prot eins belonging to the complexes that were
held out from the training set of the stable complex prediction model in pink.
References
Akiba, T., Sano, S., Yanase, T., Ohta, T., & Koyama, M. (2019). Optuna: A Next-generation
Hyperparameter Optimization Framework. Proceedings of the ACM SIGKDD
.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted October 6, 2024. ; https://doi.org/10.1101/2024.10.05.616780doi: bioRxiv preprint
International Conference on Knowledge Discovery and Data Mining, 2623–2631.
https://doi.org/10.1145/3292500.3330701
Altenhoff, A. M., Train, C. M., Gilbert, K. J., Mediratta, I., de Farias, T. M., Moi, D., Nevers,
Y., Radoykova, H. S., Rossier, V., Vesztrocy, A. W., Glover, N. M., & Dessimoz, C.
(2021). OMA orthology in 2021: website overhaul, conserved isoforms, ancestral gene
order and more. Nucleic Acids Research, 49(D1), D373–D379.
https://doi.org/10.1093/NAR/GKAA1007
Blum, M., Chang, H. Y., Chuguransky, S., Grego, T., Kandasaamy, S., Mitchell, A., Nuka,
G., Paysan-Lafosse, T., Qureshi, M., Raj, S., Richardson, L., Salazar, G. A., Williams,
L., Bork, P., Bridge, A., Gough, J., Haft, D. H., Letunic, I., Marchler-Bauer, A., … Finn,
R. D. (2021). The InterPro protein families and domains database: 20 years on. Nucleic
Acids Research, 49(D1), D344–D354. https://doi.org/10.1093/NAR/GKAA977
Brohée, S., & van Helden, J. (2006). Evaluation of clustering algorithms for protein-protein
interaction networks. BMC Bioinformatics, 7(1), 1–19. https://doi.org/10.1186/1471-
2105-7-488/FIGURES/5
Burke, D. F., Bryant, P., Barrio-Hernandez, I., Memon, D., Pozzati, G., Shenoy, A., Zhu, W.,
Dunham, A. S., Albanese, P., Keller, A., Scheltema, R. A., Bruce, J. E., Leitner, A.,
Kundrotas, P., Beltrao, P., & Elofsson, A. (2023). Towards a structurally resolved
human protein interaction network. Nature Structural and Molecular Biology, 30(2),
216–225. https://doi.org/10.1038/s41594-022-00910-8
Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. Proceedings of
the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,
13-17-August-2016, 785–794. https://doi.org/10.1145/2939672.2939785
del Toro, N., Shrivastava, A., Ragueneau, E., Meldal, B., Combe, C., Barrera, E., Perfetto,
L., How, K., Ratan, P., Shirodkar, G., Lu, O., Mészáros, B., Watkins, X., Pundir, S.,
Licata, L., Iannuccelli, M., Pellegrini, M., Martin, M. J., Panni, S., … Hermjakob, H.
(2022). The IntAct database: efficient access to fine-grained molecular interaction data.
Nucleic Acids Research, 50(D1), D648–D653. https://doi.org/10.1093/NAR/GKAB1006
Evans, R., O’Neill, M., Pritzel, A., Antropova, N., Senior, A., Green, T., Žídek, A., Bates, R.,
Blackwell, S., Yim, J., Ronneberger, O., Bodenstein, S., Zielinski, M., Bridgland, A.,
Potapenko, A., Cowie, A., Tunyasuvunakool, K., Jain, R., Clancy, E., … Hassabis, D.
(2022). Protein complex prediction with AlphaFold-Multimer. BioRxiv,
2021.10.04.463034. https://doi.org/10.1101/2021.10.04.463034
Goldman, M. J., Craft, B., Hastie, M., Repe
č ka, K., McDade, F., Kamath, A., Banerjee, A.,
Luo, Y., Rogers, D., Brooks, A. N., Zhu, J., & Haussler, D. (2020). Visualizing and
interpreting cancer genomics data via the Xena platform. Nature Biotechnology, 38(6),
675–678. https://doi.org/10.1038/S41587-020-0546-8
Huang, H., McGarvey, P. B., Suzek, B. E., Mazumder, R., Zhang, J., Chen, Y., & Wu, C. H.
(2011). A comprehensive protein-centric ID mapping service for molecular data
integration. Bioinformatics, 27(8), 1190–1191.
https://doi.org/10.1093/BIOINFORMATICS/BTR101
Humphreys, I. R., Humphreys, I. R., Pei, J., Baek, M., Krishnakumar, A., & Anishchenko, I.
(n.d.). Computed structures of core eukaryotic protein complexes. 1–17.
Humphreys, I. R., Zhang, J., Baek, M., Wang, Y., Krishnakumar, A., Pei, J., Anishchenko, I.,
Tower, C. A., Jackson, B. A., Warrier, T., Hung, D. T., Peterson, S. B., Mougous, J. D.,
Cong, Q., & Baker, D. (2024). Protein interactions in human pathogens revealed
through deep learning. Nature Microbiology. https://doi.org/10.1038/s41564-024-01791-
x
.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted October 6, 2024. ; https://doi.org/10.1101/2024.10.05.616780doi: bioRxiv preprint
Huttlin, E. L., Bruckner, R. J., Navarrete-Perea, J., Cannon, J. R., Baltier, K., Gebreab, F.,
Gygi, M. P., Thornock, A., Zarraga, G., Tam, S., Szpyt, J., Gassaway, B. M., Panov, A.,
Parzen, H., Fu, S., Golbazi, A., Maenpaa, E., Stricker, K., Guha Thakurta, S., … Gygi,
S. P. (2021). Dual proteome-scale networks reveal cell-specific remodeling of the
human interactome. Cell, 184(11), 3022-3040.e28.
https://doi.org/10.1016/J.CELL.2021.04.011
Kaleb, K., Vesztrocy, A. W., Altenhoff, A., & Dessimoz, C. (2019). Expanding the
Orthologous Matrix (OMA) programmatic interfaces: REST API and the OmaDB
packages for R and Python. F1000Research 2019 8:42, 8, 42.
https://doi.org/10.12688/f1000research.17548.2
Langfelder, P., & Horvath, S. (2008). WGCNA: An R package for weighted correlation
network analysis. BMC Bioinformatics, 9(1), 1–13. https://doi.org/10.1186/1471-2105-9-
559/FIGURES/4
Love, M. I., Huber, W., & Anders, S. (2014). Moderated estimation of fold change and
dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12), 550.
https://doi.org/10.1186/s13059-014-0550-8
Luck, K., Kim, D. K., Lambourne, L., Spirohn, K., Begg, B. E., Bian, W., Brignall, R.,
Cafarelli, T., Campos-Laborie, F. J., Charloteaux, B., Choi, D., Coté, A. G., Daley, M.,
Deimling, S., Desbuleux, A., Dricot, A., Gebbia, M., Hardy, M. F., Kishore, N., …
Calderwood, M. A. (2020). A reference map of the human binary protein interactome.
Nature 2020 580:7803, 580(7803), 402–408. https://doi.org/10.1038/s41586-020-2188-
x
Meldal, B. H. M., Perfetto, L., Combe, C., Lubiana, T., Cavalcante, J. V. F., Bye-A-Jee, H.,
Waagmeester, A., del-Toro, N., Shrivastava, A., Barrera, E., Wong, E., Mlecnik, B.,
Bindea, G., Panneerselvam, K., Willighagen, E., Rappsilber, J., Porras, P., Hermjakob,
H., & Orchard, S. (2022). Complex Portal 2022: new curation frontiers. Nucleic Acids
Research, 50(D1), D578–D586. https://doi.org/10.1093/NAR/GKAB991
Muzellec, B., Tele
ń czuk, M., Cabeli, V., & Andreux, M. (2023). PyDESeq2: a python
package for bulk RNA-seq differential expression analysis. Bioinformatics, 39(9).
https://doi.org/10.1093/BIOINFORMATICS/BTAD547
Nguyen, L. Van, Laval, J.-P., Chainais, P., Iop, A., Blondel, V. D., Guillaume, J.-L.,
Lambiotte, R., & Lefebvre, E. (2008). Fast unfolding of communities in large networks.
Journal of Statistical Mechanics: Theory and Experiment, 2008(10), P10008.
https://doi.org/10.1088/1742-5468/2008/10/P10008
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M.,
Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D.,
Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn: Machine Learning in
Python. Journal of Machine Learning Research, 12(Oct), 2825–2830.
http://www.jmlr.org/papers/v12/pedregosa11a.html
Rezaie, N., Reese, F., & Mortazavi, A. (2023). PyWGCNA: a Python package for weighted
gene co-expression network analysis. Bioinformatics, 39(7).
https://doi.org/10.1093/BIOINFORMATICS/BTAD415
Schmid, E. W., & Walter, J. C. (n.d.). Predictomes: A classifier-curated database of
AlphaFold-modeled protein-protein interactions.
https://doi.org/10.1101/2024.04.09.588596
Sears, R. M., May, D. G., & Roux, K. J. (2019). BioID as a Tool for Protein-Proximity
Labeling in Living Cells. Methods in Molecular Biology (Clifton, N.J.), 2012, 299.
https://doi.org/10.1007/978-1-4939-9546-2_15
.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted October 6, 2024. ; https://doi.org/10.1101/2024.10.05.616780doi: bioRxiv preprint
Szklarczyk, D., Kirsch, R., Koutrouli, M., Nastou, K., Mehryary, F., Hachilif, R., Gable, A. L.,
Fang, T., Doncheva, N. T., Pyysalo, S., Bork, P., Jensen, L. J., & Von Mering, C.
(2023). The STRING database in 2023: protein-protein association networks and
functional enrichment analyses for any sequenced genome of interest. Nucleic Acids
Research, 51(D1), D638–D646. https://doi.org/10.1093/NAR/GKAC1000
Zhang, B., & Horvath, S. (2005). A general framework for weighted gene co-expression
network analysis. Statistical Applications in Genetics and Molecular Biology, 4(1).
https://doi.org/10.2202/1544-6115.1128/MACHINEREADABLECITATION/RIS
.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted October 6, 2024. ; https://doi.org/10.1101/2024.10.05.616780doi: bioRxiv preprint