{"paper_id":"91f574e9-e581-446d-a8cb-d2b955f68f70","body_text":"Uncovering disease associations is of great significance in biomedical research. Similar diseases tend to exhibit analogous clinical manifestations or stem from similar molecular mechanisms. Therefore, measuring disease similarity can not only unravel disease pathological mechanisms but also foster advancements in disease diagnosis ( Xiang  et al.  2022 ), particularly in the identification of downstream disease-related genes ( Le 2020 ,  Cao  et al.  2022 ), noncoding RNAs ( Wei  et al.  2021 ,  Dai  et al.  2022 ,  Sun  et al.  2022 ,  Yan  et al.  2022 ,  Zhang and Liu 2022 ), and other biomarkers ( Wang  et al.  2023d ,  Cao  et al.  2024 ). Moreover, such research contributes to therapeutic interventions, facilitating drug discovery ( Wang  et al.  2023c ) and drug repurposing ( Bang  et al.  2023 ) by discovering analogous treatment targets across different diseases.\nAs diseases are studied in greater depth, more and more biomedical data sources have been collected. For example, semantic terminology systems including the International Classification of Diseases (ICD) ( Giles and Sander 2012 ), Disease Ontology (DO) ( Bello  et al.  2018 ), and Medical Subject Headings (MeSH) ( Dhammi and Kumar 2014 ) have been constructed, in which diseases are organized in hierarchical tree structures. In addition, a large number of diseases with phenotypic annotations are recorded in Human Phenotype Ontology (HPO) ( Gargano  et al.  2024 ), Orphanet ( Gomez-Paramio  et al.  2011 ), and PhenoDis ( Adler  et al.  2018 ). To reveal the underlying pathological mechanisms at the bio-molecular level, different types of biological databases such as DisGeNET ( Piñero  et al.  2017 ), dbSNP ( Sherry  et al.  2001 ), MNDR ( Ning  et al.  2021 ), and Reactome ( Fabregat  et al.  
2018 ) have been established, providing experimentally validated associations between diseases and genes, SNPs, ncRNAs, and pathways, respectively.\nThe accumulation of biomedical data makes it possible to measure disease similarity by computational methods. There are three main categories of methods: semantic-based, phenotype-based, and molecule-based. The semantic-based methods measure disease similarity by considering the path length or information content of different terms in the hierarchical network of diseases ( Li  et al.  2003 ,  Wang  et al.  2007 ,  Yu  et al.  2015 ,  Feng  et al.  2022 ,  Ai  et al.  2023 ), and the corresponding computational tools have been widely used to construct disease similarity networks ( Ma and Jiang 2020 ,  Wang  et al.  2023e ,  Zhu  et al.  2023 ). The phenotype-based methods mainly depend on the common phenotypic characteristics of diseases, where similar diseases may share more phenotype annotations ( Deng  et al.  2015 ,  Peng  et al.  2018 ,  Wang  et al.  2023b ). Compared to the semantic-based and phenotype-based methods, the molecule-based methods can explore potential disease relationships at the micro level and achieve superior performance, where diseases can be described by their associated molecules via graph representation learning techniques and aggregation strategies ( Cheng  et al.  2014 ,  Yang  et al.  2021 ,  Chen  et al.  2022 ,  Zeng  et al.  2022a ). Despite the significant advances in computing disease similarity, there are still two main issues that need to be addressed: (i) the occurrence and development of disease is an extremely complicated process, involving the interaction and regulation of different bio-molecules. 
Most molecule-based methods represent diseases solely at the genetic level, failing to capture sufficient semantic information; (ii) existing studies measure disease similarity in an unsupervised manner and ignore the valuable prior disease association knowledge, which may cause inadequate pattern learning and limited prediction efficiency ( Zeng  et al.  2022b ).\nThis study attempts to overcome these problems by designing a novel computational method called DiSMVC to measure disease similarity. DiSMVC is a supervised graph collaborative framework with two major modules: cross-view graph contrastive learning and association pattern joint learning. The former aims to enrich disease representation by considering the underlying molecular mechanisms from both genetic and transcriptional views, while the latter captures deep association patterns by incorporating phenotypically interpretable multimorbidities in a supervised manner. Experimental results indicated that DiSMVC can identify molecularly interpretable similar diseases, and the synergies gained from DiSMVC contributed to its superior performance in measuring disease similarity.\n\nTo enrich disease description and explore the underlying molecular mechanisms, five bio-entity networks, namely the gene interaction network, the miRNA similarity network, the gene–miRNA interaction network, and the gene–disease and miRNA–disease association networks, are constructed and integrated into the proposed graph collaborative framework.\nThe gene interaction network  G g  is downloaded from HumanNet-FN ( Kim  et al.  2022 ), which collects co-functional gene links. The miRNA similarity network  G m  is constructed by Gaussian Interaction Profile (GIP) kernel similarity measurement, where the interaction profiles are expressed as the association vectors between miRNAs and diseases. 
The network  G g m  collects the gene–miRNA interactions with Transcriptome Dysregulation Measure Score ( TDMDScore )  > 1 , derived from ENCORI ( Li  et al.  2014 ). All the gene–disease associations collected in DisGeNET v7.0 ( Piñero  et al.  2017 ) and the experimentally validated miRNA–disease associations collected in RNADisease ( Chen  et al.  2023a ) are downloaded to construct two heterogeneous networks,  G d g  and  G d m ,  respectively . In these networks, the genes and miRNAs excluded from  G g  and  G m  are filtered out. The statistics of the networks constructed above are listed in  Table 1 .\nThe statistical information of different datasets.\nG m  was constructed by Gaussian Interaction Profile kernel similarity measurement based on experimentally validated miRNA–disease associations.\nBecause we detect the disease association pattern in a supervised manner, the prior disease associations are downloaded from a previous study ( Dong  et al.  2021 ), which analyzed the multimorbidities among common diseases in the UK Biobank. After unifying disease identifiers across multiple datasets and filtering out disease associations with RR value  <  15 following that study ( Dong  et al.  2021 ), 3091 high-quality disease associations supported by phenotypic interpretability are collected to construct  S benchmark +  for the training process.\nTo increase the data coverage and comprehensively evaluate the performance of different methods, the phenotypically and genetically interpretable multimorbidities in the study ( Dong  et al.  2021 ), the verified highly similar disease pairs predicted by the study ( Mathur and Dinakarpandian 2012 ), and the validated disease pairs obtained based on the electronic health records (EHR) of the US population ( Pakhomov  et al.  2010 ) are integrated to construct the testing positive set  S independent + . 
The dataset can be represented as:\nwhere  S benchmark ∩ S independent = ∅ ,  S benchmark +  includes 3091 training positive pairs while  S independent +  includes 665 testing positive pairs.  S benchmark -  and  S independent -  are constructed by randomly selecting negative pairs, equal in number to the corresponding positive set. The datasets can be downloaded from  https://github.com/Biohang/DiSMVC/tree/main/Dataset .\nThe framework of DiSMVC contains two main modules (cross-view graph contrastive learning and association pattern joint learning). As shown in  Fig. 1 , the former module is designed to extract the features of disease-related molecules via horizontally collaborative learning across different networks, and the latter module aims to detect association patterns between diseases via vertical joint learning. These two modules are introduced in the following sections.\nThe framework of DiSMVC. There are two main steps: (i) Cross-view graph contrastive learning. The gene interaction network and the miRNA similarity network are constructed, based on which node features are extracted by considering their proximity structures via different graph representation algorithms. Graph contrastive learning is implemented to further refine the hidden features of genes and miRNAs. (ii) Association pattern joint learning. Average pooling and concatenation strategies are applied to obtain initial disease pair features based on various prior bio-entity networks. Multi-layer perceptron models are jointly learned to detect hidden association patterns and predict disease similarity scores.\nBecause the deregulation of similar genes or miRNAs tends to be closely related to the development of similar diseases, it is reasonable to detect disease association patterns from molecular mechanisms. In this study, two homogeneous bio-networks, the gene interaction network  G g  and the miRNA similarity network  G m , are constructed. 
The weight score  w i , j  between genes  g i  and  g j  in  G g  is calculated by:\nwhere  l i , j  represents the log likelihood similarity score obtained from HumanNet ( Kim  et al.  2022 ), and  l max  and  l min  are the maximum and minimum log likelihood similarity scores between genes. The network  G m  mainly collects the functionally similar miRNAs, where the similarity score  s i , j  between miRNAs  m i  and  m j  is computed via the Gaussian Interaction Profile (GIP) kernel similarity measurement:\nwhere  A d m ∈ R 30170 × 4798  is the adjacency matrix of the miRNA–disease association network  G d m .  A i d m  and  A j d m  are the  i th and  j th rows in matrix  A d m , representing the interaction profiles of miRNAs  m i  and  m j  with different diseases, respectively.  n  is the number of miRNAs, while  γ '  denotes the original kernel bandwidth and is set to 1 following previous studies ( Wei and Liu 2020 ,  Wang  et al.  2023a ). To ensure the high quality of the miRNA similarity network, only the edges with similarity score > 0.8 are retained in  G m .\nGraph representation learning is designed to encode graph-structured data into meaningful embeddings that preserve intricate structural properties, and has been successfully applied in various bioinformatics tasks, such as circRNA-disease association detection ( Wang  et al.  2020 ,  Chen  et al.  2024 ,  Niu  et al.  2024 ), drug–target binding affinity prediction ( Öztürk  et al.  2018 ), and cell–cell interaction identification ( Yang  et al.  2023 ). Inspired by the powerful ability of graph neural networks to uncover hidden semantic knowledge from bio-entity networks ( Zhou  et al.  2020 ,  Yi  et al.  2022 ), we adopt two graph encoders, Graph Convolutional Networks (GCN) ( Kipf and Welling 2017 ) and Graph Attention Networks (GAT) ( Veličković  et al.  
2018 ) for exploring the latent interaction pattern from each homogeneous bio-network.\nFor the gene interaction network  G g , a graph encoder with GCN is constructed. GCN uses convolutional operations and a message passing mechanism on the network to learn and update node embeddings by considering the local and global neighborhood structure. For the ( l + 1 )th network layer, the gene feature matrix  F g , l + 1  can be obtained by:\nwhere  A g ^ = A g + I g  denotes the adjacency matrix of the gene interaction network  G g  with inserted self-loops, and the elements in  A g ∈ R 17247 × 17247  are the weight scores [cf  Equation (2) ].  D  is a diagonal degree matrix calculated by  D i i = ∑ j = 1 n A i j g , where  n  denotes the number of genes in  G g .  W l  is the learnable weight matrix of the  l th layer,  F g , 0  denotes the initial feature matrix of genes constructed by one-hot encoding, and  σ  represents the activation function, defined as LeakyReLU.\nTo reduce noise interference in the miRNA similarity network  G m , a graph encoder with GAT is constructed for extracting miRNA features. Different from GCN, GAT extracts node features by aggregating information from their neighborhoods with different attention coefficients, and can be applied to inductive as well as transductive problems ( Yan  et al.  2023 ,  Zhang  et al.  2024 ). For the ( l + 1 )th network layer, the feature vector  f i m , l + 1  of miRNA  m i  can be obtained in three main steps. First, the attention coefficient  e i j  representing the importance of miRNA  m j  to  m i  is calculated by:\nwhere  a  is a learnable attention parameter vector and  W l  is a weight matrix. Symbols T and  ∥  denote transposition and concatenation operations, respectively.  f i m , l  and  f j m , l  denote the embedding vectors of miRNAs  m i  and  m j  obtained from the  l th layer. 
Then, the normalized attention coefficient is calculated using the SoftMax function to make different nodes comparable:\nwhere  N i  represents the set containing all the first-order neighbors of miRNA  m i . Finally, the feature vector  f i m , l + 1  of miRNA  m i  is obtained by a neighborhood aggregation strategy:\nThrough the above intra-view graph representation learning, two feature matrices  F g ∈ R 17247 × 128  and  F m ∈ R 4798 × 128  and their graph encoders can be obtained, where the feature dimensionality is set to 128 and each encoder contains 2 layers to avoid over-smoothing.\nContrastive learning is primarily designed to learn representations by maximizing the agreement between similar samples and minimizing it for dissimilar ones, and it has been widely applied for extracting informative representations from biological data, facilitating various tasks such as structure prediction ( Wang  et al.  2021 ), drug discovery ( Singh  et al.  2023 ), sequence analysis ( Li  et al.  2023 ), and image analysis ( Sanchez-Fernandez  et al.  2023 ).\nBecause genes and miRNAs are intertwined and significantly influence cellular function and disease progression, we propose a cross-view contrastive loss  L cv  for enhancing the representations of genes and miRNAs by incorporating the critical regulatory mechanism between them.  L cv  consists of two parts aiming to refine gene and miRNA representations, respectively, and can be formulated as:\nL cv _ g  and  L cv _ m  are the contrastive losses for genes and miRNAs, and  β  is a parameter controlling the importance of the loss computed with different central molecules. Inspired by the effectiveness of InfoNCE loss in contrastive learning ( Dwibedi  et al.  
2021 ), we define   L cv _ g  and  L cv _ m  as:\nwhere  rel ( · )  represents a correlation function; more specifically,  rel ( f i g , f j m ) = f i g ( f j m ) T / ( ‖ f i g ‖ ‖ f j m ‖ )  denotes the relevance degree between the  i th gene and  j th miRNA, while  f i g  and  f j m  are their feature vectors [cf  Equations (4)  and  (7) ].  Π ( ⋅ ) ∈ { 0 , 1 }  is an indicator function, with  Π ( ⋅ ) = 1  if the target gene and miRNA interact in the network  G g m , and 0 otherwise.  N g  and  N m  are the numbers of genes and miRNAs, respectively, and  τ  is a temperature parameter controlling the scale of the distribution.\nTo explore the underlying molecular mechanisms and obtain meaningful representations of diseases, we construct disease features from both genetic and transcriptional views. The associations between diseases and genes or miRNAs are collected from networks  G d g  and  G d m , and the average pooling method is used to obtain feature vectors  f i d _ g  and  f i d _ m   for disease  d i :\nwhere  G i  and  M i  are two sets containing the genes and miRNAs associated with disease  d i , while  f j g  and  f p m  are the feature vectors of gene  g j  and miRNA  m p .\nDifferent from unsupervised measurement, we attempt to detect complex association patterns between diseases by considering supervised signals. For each disease pair < d i ,  d j > in  S benchmark  and  S independent  [cf  Equation (1) ], two pair features  h i , j d _ g  and  h i , j d _ m  can be obtained by a concatenation strategy based on disease feature matrices  F d _ g ∈ R 30170 × 128  and  F d _ m ∈ R 30170 × 128 , where  h i , j d _ g = [ f i d _ g , f j d _ g ]  and  h i , j d _ m = [ f i d _ m , f j d _ m ] . 
Furthermore, two multi-layer perceptron networks, each including 256 neurons for the hidden layer and 128 neurons for the output layer, are designed to extract high-level feature matrices  H d _ g ^ ∈ R N p × 128  and  H d _ m ^ ∈ R N p × 128  for disease pairs:\nwhere  N p  represents the number of disease pairs in  S benchmark  [cf  Equation (1) ],  W 1 , g ,  W 2 , g ,  W 1 , m  and  W 2 , m  are learnable weight matrices from different fully connected layers, and  σ  is defined as the ReLU function. A normalization operation is conducted after each network layer to reduce internal covariate shift and improve stability.\nTo capture hidden association patterns between diseases, we design an attentive multi-layer perceptron network consisting of an attentive layer and a similarity computation layer. The attentive layer aims to integrate the high-level features of diseases  H d _ g ^  and  H d _ m ^  by considering the feature correlation and contribution within different disease pairs. The feature matrix  H d ¯  extracted from the attentive layer can be formulated as:\nwhere  H d ^ ∈ R N p × 256  is initially constructed by concatenating  H d _ g ^  and  H d _ m ^ , and  W cor  is a learnable feature correlation matrix.  ⊙  denotes the Hadamard product operation between matrices,  σ  is defined as the ReLU function, and  φ  is a column-wise normalization operation defined as:\nM = W cor H d ^ . The feature matrix  H d ¯ ∈ R N p × 256  is further fed into a fully connected network layer consisting of 256 neurons, after which a single neuron is used to calculate disease similarity scores. The valuable disease association information serves as the supervision signal, providing critical guidance for model optimization. 
To improve training efficiency, binary cross-entropy with logits loss is used and formulated as:\ny i  and  y i ′  denote the true label and predicted similarity score for the  i th disease pair.\nOverall, the cross-view contrastive learning and association pattern joint learning collaborate horizontally and vertically to reveal the deep correlation between diseases. The integrated loss function is  L all = L sim + L cv , and each unknown disease pair can be evaluated by the trained predictor.\nFour comprehensive indicators, Area Under the Receiver Operating Characteristics Curve (AUC), Area Under the Precision-Recall Curve (AUPR), F1-Score, and Matthews Correlation Coefficient (MCC), are adopted to evaluate the performance of various methods ( Zulfiqar  et al.  2023 ,  Ma  et al.  2024 ). F1-Score and MCC are two balanced metrics that take into account precision and recall, as well as true positive and negative rates, and false positive and negative rates, respectively. AUC illustrates the trade-off between sensitivity and specificity while AUPR describes the trade-off between precision and recall at various threshold settings ( Hasanin  et al.  2019 ,  Chen  et al.  2023b ). Besides, three conditional indicators, Accuracy (Acc), Sensitivity (Sen) and Precision (Pre), are also used to evaluate different models. Because disease pairs with higher predicted scores tend to receive more attention, two additional indicators are used. One is ROC k , describing the area under the Receiver Operating Characteristics Curve (ROC) up to the  k th false positive ( Gribskov and Robinson 1996 ), while the other measures the number of true positive pairs within the top  n  predicted pairs.\n\nThe impact of three important parameters, disease feature dimensionality, training epoch and weighting coefficient  β  [cf  Equation (8) ], is analyzed. 
The benchmark dataset  S benchmark  is randomly divided into five folds, where four folds are used for training while the remaining fold is used for validation. The parameter analysis experiments are performed by varying one parameter while fixing the others. The results are shown in  Fig. 2 , from which we can see that the performance of DiSMVC initially improves and then relatively stabilizes as the feature dimensionality and the number of epochs increase. The impact of the weighting coefficient  β  on the performance of DiSMVC is not significant. However, we can still observe that integrating the two contrastive losses for genes and miRNAs contributes to improving performance. Considering both prediction accuracy and training time, we set the values of the disease feature dimensionality, training epoch, and weighting coefficient  β  to 128, 200, and 0.3, respectively.\nParameter analysis of DiSMVC. The influence of disease feature dimensionality, training epoch and weighting coefficient  β  on the performance of DiSMVC.\nTo illustrate the effectiveness of DiSMVC, we compare DiSMVC with five state-of-the-art methods, including Wang’s method ( Wang  et al.  2007 ), Li-MaxPooling ( Yang  et al.  2021 ), Li-AvePooling ( Yang  et al.  2021 ), CoGO ( Chen  et al.  2022 ), and SynerSim ( Gao  et al.  2023 ). The first four methods measure disease similarity in an unsupervised manner. Wang’s method measures disease similarity based on their Directed Acyclic Graphs (DAGs), where diseases tend to be similar if they share disease semantic terms to a significant degree. Li-MaxPooling, Li-AvePooling, and CoGO are three molecule-based methods that extract disease features via different graph representation techniques. SynerSim uses the initial, highly sparse disease association network as self-supervised information to guide the association pattern mining. 
In contrast, DiSMVC introduces additional phenotype-annotated association supervision signals, enabling biological semantics enhancement and label leakage avoidance. Since neither the source code nor the web server of SynerSim is accessible, we conduct an unbiased performance comparison of DiSMVC with the other four unsupervised methods. The performance comparison results are shown in  Table 2  and  Fig. 3 , from which we can see the following: (i) compared to the semantic-based method, the molecule-based predictors obtain superior performance in terms of AUC, AUPR, F1-Score and MCC, owing to their powerful ability to uncover hidden molecular pathogenic mechanisms; (ii) in contrast with the unsupervised measurements, DiSMVC can capture more informative and deeper disease semantics by integrating multiple bio-entity association knowledge and supervision signals. As a result, DiSMVC achieves clearly higher performance than all the competing methods in terms of the comprehensive classification metrics; (iii)  Fig. 3c  and  d  show the number of true positive pairs in the top- n  pairs and the ROC k  scores obtained by different predictors. Benefiting from the multi-view graph collaborative learning framework, DiSMVC can also improve the quality of top-ranked results.\nPerformance comparison of different methods on  S independent . (a) and (b) show the ROC and PR curves with their corresponding AUC and AUPR scores obtained by different methods. (c) shows the numbers of true positive pairs in the top- n  pairs predicted by various methods, and (d) shows the ROC k  scores obtained by different predictors.\nPerformance comparison of different methods on  S independent .\nTo explore the reason for the performance improvement of DiSMVC in measuring disease similarity, the disease pair features and similarity scores learned by DiSMVC are analyzed and shown in  Fig. 4 . From  Fig. 
4  we can see the following: (i) the visualized features of disease pairs within the same type are similar, but the hidden features of positive pairs are clearly different from those of negative pairs, illustrating that DiSMVC can uncover the similar pattern within the same pair group, as well as the specific patterns between different pair groups; (ii) the extracted high-level features further contribute to the strong discriminative ability of DiSMVC. As a result, the similarity scores predicted by DiSMVC show significant differences between the two pair groups, where most disease pairs in the positive group are scored as similar and those in the negative group have much lower similarity scores.\nAn analysis of features learned by DiSMVC. (a) shows the visualized heatmap of hidden features of disease pairs extracted by DiSMVC. The rows denote the pair index of positive pairs and negative pairs from  S independent , and the columns represent the corresponding feature index. (b) shows the similarity score distribution of disease pairs within different types obtained by DiSMVC.\nDiSMVC consists of two crucial modules: cross-view graph contrastive learning and association pattern joint learning. To analyze their contributions to the prediction ability of DiSMVC, we construct seven competing baseline methods and conduct an ablation study.\nThe variants of DiSMVC are named ‘w/o  L cv _ g ’, ‘w/o  L cv _ m ’, ‘w/o  L cv ’, ‘w/o  L sim ’, ‘w/o  miRNA ’, ‘w/o  gene ’, and ‘w/o attention’, respectively. Their predictive performance along with that of DiSMVC is shown in  Fig. 
5 , from which we can draw the following conclusions: (i) almost all the variants of DiSMVC achieve inferior performance in terms of AUC, AUPR, and F1-Score, indicating that each module in DiSMVC is important, and that cross-view graph contrastive learning and association pattern joint learning collaboratively advance the performance; (ii) different from DiSMVC, ‘w/o  L sim ’ computes pair scores in an unsupervised manner. A considerable performance discrepancy is observed between them, demonstrating the significant contribution of the phenotypic multimorbidity-based supervision signal in measuring disease similarity; (iii) DiSMVC is superior to ‘w/o  miRNA ’ and ‘w/o  gene ’, illustrating the advantages of enriching disease representation by considering underlying molecular mechanisms from both genetic and transcriptional views. Meanwhile, the significant decrease observed in ‘w/o  gene ’ suggests that genes play a more crucial role in disease association detection than miRNAs. It is worth noting that ‘w/o attention’ uses a concatenation operation to integrate disease embeddings derived from gene and miRNA features, resulting in limited performance improvement. In contrast, DiSMVC integrates features from multiple views by considering their different importance via the attention layer, as illustrated in  Fig. 5b , enabling effective fusion and strong prediction ability.\nAn importance analysis of different modules in DiSMVC. (a) shows the performance comparison of DiSMVC with the seven baseline methods. ‘w/o  L cv _ g ’, ‘w/o  L cv _ m ’, and ‘w/o  L cv ’ are three variants of DiSMVC trained without computing contrastive loss  L cv _ g ,  L cv _ m  or  L cv . ‘w/o  L sim ’ combines cross-view graph contrastive learning and cosine correlation analysis to measure disease similarity. 
‘w/o  miRNA ’ and ‘w/o  gene ’ are two variant methods in which the hidden features of genes or miRNAs separately extracted by intra-view graph representation are used to construct disease pair features, and an attentive multi-layer perceptron is trained to predict disease similarity scores. ‘w/o attention’ integrates cross-view graph contrastive learning and association pattern joint learning without the attention mechanism. (b) shows the average feature attention weights obtained by DiSMVC for the disease pairs in  S independent .\nTo illustrate the validity of DiSMVC for measuring disease similarity, we conduct a case study on the interpretability of the top disease pairs predicted by DiSMVC. The 10th revision of the International Classification of Diseases (ICD-10) is globally recognized and widely used for categorizing and coding various diseases; therefore, we analyze the predictive results along with ICD-10 categories. All the candidate disease pairs with ICD-10 codes in  S independent  are predicted and further ranked in descending order of pair scores.\nThe network of the top 100 disease pairs predicted by DiSMVC is shown in  Fig. 6 , from which we can see that diseases belonging to the ‘ICD-10/K: Disease of digestive system’ or ‘ICD-10/I: Disease of the circulatory system’ are more likely to be in close proximity, indicating that the disease relationships predicted by DiSMVC to some extent conform to the ICD-10 classification. In addition, DiSMVC can also identify disease associations across different categories by exploring underlying molecular mechanisms. For example, autoimmune hepatitis, labeled as C4721555 and belonging to ICD-10/K, shows close correlation with four other types of diseases including C0162323, C0011884, C0014175, and C0155789.\nThe network of top 100 disease pairs predicted by DiSMVC. 
The diseases with UMLS concept unique identifiers are shown in circles, where the size illustrates the number of related diseases, and the color represents disease types classified based on ICD-10. The thickness of edges denotes the pair scores between diseases predicted by DiSMVC, where thicker edges indicate closer relations.\nTo further investigate the molecular interpretability of predicted disease pairs, the top 10 disease pairs are further analyzed and shown in  Table 3 . We can observe that seven disease pairs are supported by experimental literature in PubMed, where most diseases are accompanied by common molecular or biological process abnormalities. For example, ‘Bleeding esophageal varices’ is a circulatory system disease and ‘Fatty liver, alcoholic’ is a subtype of digestive system disease. Long-term alcoholic fatty liver is one of the common causes of bleeding esophageal varices. Although the two diseases originate from different human systems, they are mediated by two important biological processes, ‘Mitochondrial protein import’ in Reactome ( Fabregat  et al.  2018 ) and ‘Taste transduction’ in KEGG ( Kanehisa and Goto 2000 ), and are significantly associated with the PNPLA3 gene and four SNPs: rs738408, rs2294915, rs3747207, and rs738409. Besides, it is reported that cholestasis is one of the key pathogenic factors of alcoholic liver disease. The predicted disease pair of ‘Fatty liver, alcoholic’ and ‘Biliary cirrhosis’ is mediated by overlapping genes, SNPs, and pathways, as shown in  Fig. 7 .\nThe network of bio-entity associations for three representative diseases. 
Three diseases are shown in rhombuses, and their associated genes, SNPs, and pathways are shown in triangles, circles, and octagons, respectively.\nTop 10 disease pairs predicted by DiSMVC.\nOverall, DiSMVC can identify similar diseases that are molecularly interpretable and provide guidance for understanding the pathological mechanisms of diseases.\n\nIn this study, we propose a new computational predictor named DiSMVC to measure disease similarity. Compared with other competing methods, it has the following advantages: (i) the graph collaborative learning framework of DiSMVC is able to integrate multiple aspects of disease-related information including bio-entity associations at the micro level and phenotype-based multimorbidity at the macro level, showing promising performance and molecular interpretability in measuring disease similarity; (ii) the high-level molecular interaction features related to diseases are extracted and refined through cross-view contrastive learning, based on which more informative disease representations can be obtained from both genetic and transcriptional perspectives; (iii) different from other existing methods, DiSMVC measures disease similarity in a supervised manner by incorporating prior disease association knowledge, so as to capture strongly discriminative association patterns.\nThere is still potential for further improvement in the study. The occurrence and development of diseases is an extremely complex process that is influenced by various biological and environmental factors. Therefore, considering more disease-related bio-entities such as proteins, metabolites and microbes would help enhance the richness and comprehensiveness of disease representation. We use similar, comorbid conditions as positive labels, which limits the detection of biologically similar diseases that share underlying mechanisms but may not necessarily be comorbid. 
Incorporating auxiliary supervised signals from DisGeNET can help capture biological similarities between diseases. Finally, the integration of additional knowledge inevitably introduces noise. More high-quality data sources or advanced heterogeneous network learning techniques are expected to extract meaningful information while mitigating noise interference.","source_license":"CC-BY-4.0","license_restricted":false}