Section 1
Uncovering disease associations is of great significance in biomedical research. Similar diseases tend to exhibit analogous clinical manifestations or stem from similar molecular mechanisms. Therefore, measuring disease similarity can not only unravel disease pathological mechanisms but also foster advancements in disease diagnosis (Xiang et al. 2022), particularly in the identification of downstream disease-related genes (Le 2020, Cao et al. 2022), noncoding RNAs (Wei et al. 2021, Dai et al. 2022, Sun et al. 2022, Yan et al. 2022, Zhang and Liu 2022), and other biomarkers (Wang et al. 2023d, Cao et al. 2024). Moreover, such research contributes to therapeutic interventions, facilitating drug discovery (Wang et al. 2023c) and drug repurposing (Bang et al. 2023) by discovering analogous treatment targets across different diseases.
With the in-depth study of diseases, a growing number of biomedical data sources have been collected. For example, semantic terminology systems including the International Classification of Diseases (ICD) (Giles and Sander 2012), Disease Ontology (DO) (Bello et al. 2018), and Medical Subject Headings (MeSH) (Dhammi and Kumar 2014) have been constructed, in which diseases are organized in hierarchical tree structures. In addition, a large number of diseases with phenotypic annotations are recorded in the Human Phenotype Ontology (HPO) (Gargano et al. 2024), Orphanet (Gomez-Paramio et al. 2011), and PhenoDis (Adler et al. 2018). To reveal the underlying pathological mechanisms at the bio-molecule level, different types of biological databases such as DisGeNET (Piñero et al. 2017), dbSNP (Sherry et al. 2001), MNDR (Ning et al. 2021), and Reactome (Fabregat et al. 2018) have been established, providing experimentally validated associations between diseases and genes, SNPs, ncRNAs, and pathways, respectively.
The accumulation of biomedical data makes it possible to measure disease similarity by computational methods. There are three main categories of methods: semantic-based, phenotype-based, and molecule-based. The semantic-based methods measure disease similarity by considering the path length or information content of different terms in the hierarchical network of diseases (Li et al. 2003, Wang et al. 2007, Yu et al. 2015, Feng et al. 2022, Ai et al. 2023), and their corresponding computational tools have been widely used to construct disease similarity networks (Ma and Jiang 2020, Wang et al. 2023e, Zhu et al. 2023). The phenotype-based methods mainly depend on the common phenotypic characteristics of diseases, where similar diseases may share more phenotype annotations (Deng et al. 2015, Peng et al. 2018, Wang et al. 2023b). Compared to the semantic-based and phenotype-based methods, the molecule-based methods can explore potential disease relationships at the micro-level and achieve superior performance, with diseases described by their associated molecules via graph representation learning techniques and aggregation strategies (Cheng et al. 2014, Yang et al. 2021, Chen et al. 2022, Zeng et al. 2022a). Despite the significant advances in computing disease similarity, two main issues still need to be addressed: (i) the occurrence and development of disease is an extremely complicated process, involving the interaction and regulation of different bio-molecules, yet most molecule-based methods represent diseases solely at the genetic level, failing to capture sufficient semantic information; (ii) existing studies measure disease similarity in an unsupervised manner and ignore valuable prior disease association knowledge, which may cause inadequate pattern learning and limited prediction efficiency (Zeng et al. 2022b).
This study attempts to overcome these problems by designing a novel computational method called DiSMVC to measure disease similarity. DiSMVC is a supervised graph collaborative framework with two major modules: cross-view graph contrastive learning and association pattern joint learning. The former aims to enrich disease representations by considering their underlying molecular mechanisms from both genetic and transcriptional views, while the latter captures deep association patterns by incorporating phenotypically interpretable multimorbidities in a supervised manner. Experimental results indicated that DiSMVC can identify molecularly interpretable similar diseases, and that the synergies gained from DiSMVC contributed to its superior performance in measuring disease similarity.
Section 2
To enrich disease descriptions and explore their underlying molecular mechanisms, five bio-entity networks (a gene interaction network, a miRNA similarity network, a gene–miRNA interaction network, and gene–disease and miRNA–disease association networks) are constructed and integrated into the proposed graph collaborative framework.
The gene interaction network $G_g$ is downloaded from HumanNet-FN (Kim et al. 2022), which collects co-functional gene links. The miRNA similarity network $G_m$ is constructed by Gaussian Interaction Profile (GIP) kernel similarity measurement, where the interaction profiles are expressed as the association vectors between miRNAs and diseases. The network $G_{gm}$ collects the gene–miRNA interactions with Transcriptome Dysregulation Measure Score (TDMDScore) > 1, derived from ENCORI (Li et al. 2014). All the gene–disease associations collected in DisGeNET v7.0 (Piñero et al. 2017) and the experimentally validated miRNA–disease associations collected in RNADisease (Chen et al. 2023a) are downloaded to construct two heterogeneous networks, $G_{dg}$ and $G_{dm}$, respectively. In these networks, genes and miRNAs absent from $G_g$ and $G_m$ are filtered out. Summary statistics of the constructed networks are listed in Table 1.
The statistical information of different datasets.
$G_m$ was constructed by Gaussian Interaction Profile kernel similarity measurement based on experimentally validated miRNA–disease associations.
We detect the disease association pattern in a supervised manner. The prior disease associations are downloaded from a previous study (Dong et al. 2021), which analyzed the multimorbidities among common diseases in the UK Biobank. After unifying disease identifiers across multiple datasets and filtering out disease associations with RR value < 15, following Dong et al. (2021), 3091 high-quality disease associations supported by phenotypic interpretability are collected to construct $S^{+}_{benchmark}$ for the training process.
To increase the data coverage and comprehensively evaluate the performance of different methods, the phenotypically and genetically interpretable multimorbidities from Dong et al. (2021), the verified highly similar disease pairs predicted by Mathur and Dinakarpandian (2012), and the validated disease pairs obtained from the electronic health records (EHR) of the US population (Pakhomov et al. 2010) are integrated to construct the testing positive set $S^{+}_{independent}$. The dataset can be represented as:
where $S_{benchmark} \cap S_{independent} = \emptyset$, $S^{+}_{benchmark}$ includes 3091 training positive pairs, and $S^{+}_{independent}$ includes 665 testing positive pairs. $S^{-}_{benchmark}$ and $S^{-}_{independent}$ are constructed from randomly selected negative pairs, equal in number to the corresponding positive sets. The datasets can be downloaded from https://github.com/Biohang/DiSMVC/tree/main/Dataset .
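The balanced negative sets described above can be sketched as follows. This is a minimal illustration, not the authors' sampling procedure: the function name `sample_negative_pairs`, the fixed seed, and the treatment of pairs as unordered are all assumptions, and the sketch does not enforce the additional constraint that the benchmark and independent sets are disjoint.

```python
import random

def sample_negative_pairs(diseases, positive_pairs, n_negative, seed=0):
    """Draw unordered disease pairs that do not occur in the positive set.

    Hypothetical helper: the source only states that negatives are
    randomly selected and equal in number to the positives.
    """
    rng = random.Random(seed)
    disease_set = set(diseases)
    positives = {frozenset(p) for p in positive_pairs}

    # Number of candidate pairs left once known positives are excluded.
    n_all = len(diseases) * (len(diseases) - 1) // 2
    n_blocked = sum(1 for p in positives if len(p) == 2 and p <= disease_set)
    if n_negative > n_all - n_blocked:
        raise ValueError("not enough candidate pairs to sample from")

    negatives = set()
    while len(negatives) < n_negative:
        a, b = rng.sample(diseases, 2)   # two distinct diseases
        pair = frozenset((a, b))
        if pair not in positives:
            negatives.add(pair)
    return [tuple(sorted(p)) for p in negatives]

# Toy usage with made-up identifiers.
diseases = ["d1", "d2", "d3", "d4", "d5"]
positives = [("d1", "d2"), ("d2", "d3")]
negatives = sample_negative_pairs(diseases, positives, n_negative=len(positives))
```

Rejection sampling is adequate here because positive pairs are a tiny fraction of all possible disease pairs; for a dense positive set, enumerating the candidates explicitly would be safer.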
The framework of DiSMVC contains two main modules (cross-view graph contrastive learning and association pattern joint learning). As shown in Fig. 1 , the former module is designed to extract the features of disease-related molecules via horizontally collaborative learning across different networks, and the latter module aims to detect association patterns between diseases via vertical joint learning. These two modules will be introduced in the following sections.
The framework of DiSMVC. There are two main steps: (i) Cross-view graph contrastive learning. A gene interaction network and a miRNA similarity network are constructed, from which node features are extracted by considering their proximity structures via different graph representation algorithms. Graph contrastive learning is implemented to further refine the hidden features of genes and miRNAs. (ii) Association pattern joint learning. Average pooling and concatenation strategies are applied to obtain initial disease pair features based on various prior bio-entity networks. Multi-layer perceptron models are jointly learned to detect hidden association patterns and predict disease similarity scores.
Because the deregulation of similar genes or miRNAs tends to be closely related to the development of similar diseases, it is reasonable to detect disease association patterns from molecular mechanisms. In this study, two homogeneous bio-networks, the gene interaction network $G_g$ and the miRNA similarity network $G_m$, are constructed. The weight score $w_{i,j}$ between genes $g_i$ and $g_j$ in $G_g$ is calculated by:
where $l_{i,j}$ represents the log likelihood similarity score obtained from HumanNet (Kim et al. 2022), and $l_{max}$ and $l_{min}$ are the maximum and minimum log likelihood similarity scores between genes. The network $G_m$ mainly collects functionally similar miRNAs, where the similarity score $s_{i,j}$ between miRNAs $m_i$ and $m_j$ is computed via the Gaussian Interaction Profile (GIP) kernel similarity measurement:
where $A^{dm} \in \mathbb{R}^{30170 \times 4798}$ is the adjacency matrix of the miRNA–disease association network $G_{dm}$. $A^{dm}_i$ and $A^{dm}_j$ are the $i$-th and $j$-th rows of $A^{dm}$, representing the interaction profiles of miRNAs $m_i$ and $m_j$ with different diseases, respectively. $n$ is the number of miRNAs, while $\gamma'$ denotes the original kernel bandwidth and is set to 1 following previous studies (Wei and Liu 2020, Wang et al. 2023a). To ensure the high quality of the miRNA similarity network, only the edges with similarity score > 0.8 are retained in $G_m$.
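The two edge-weighting steps can be sketched in NumPy. The display equations are not reproduced in this copy of the text, so both forms below are assumptions: min-max scaling is inferred from the mention of $l_{max}$ and $l_{min}$, and the GIP kernel uses its commonly published form, with the bandwidth $\gamma'$ divided by the mean squared norm of the interaction profiles. The function names and the toy matrix are hypothetical.

```python
import numpy as np

def minmax_weights(scores):
    """Rescale log-likelihood scores to [0, 1] (assumed form of the gene weights)."""
    scores = np.asarray(scores, dtype=float)
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo)

def gip_kernel(profiles, gamma_prime=1.0):
    """Gaussian Interaction Profile kernel between the rows of `profiles`.

    profiles: (n_mirnas, n_diseases) binary matrix, one interaction
    profile per row (assumed orientation).
    """
    profiles = np.asarray(profiles, dtype=float)
    sq_norms = (profiles ** 2).sum(axis=1)
    # Bandwidth normalised by the average squared profile norm.
    gamma = gamma_prime / sq_norms.mean()
    # Pairwise squared Euclidean distances via the expansion of ||a - b||^2.
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * profiles @ profiles.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

# Toy usage: three miRNAs, three diseases.
profiles = np.array([[1, 0, 1],
                     [1, 0, 1],
                     [0, 1, 0]])
similarity = gip_kernel(profiles, gamma_prime=1.0)
# Keep only confident edges, as in the text (score > 0.8), ignoring self-pairs.
edges = (similarity > 0.8) & ~np.eye(len(profiles), dtype=bool)
```

In the toy example the first two miRNAs share identical profiles, so they receive similarity 1 and are the only pair that survives the 0.8 threshold.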
Graph representation learning is designed to encode graph-structured data into meaningful embeddings that preserve intricate structural properties, and has been successfully applied in various bioinformatics tasks, such as circRNA-disease association detection (Wang et al. 2020, Chen et al. 2024, Niu et al. 2024), drug–target binding affinity prediction (Öztürk et al. 2018), and cell–cell interaction identification (Yang et al. 2023). Inspired by the powerful ability of graph neural networks to uncover hidden semantic knowledge from bio-entity networks (Zhou et al. 2020, Yi et al. 2022), we adopt two graph encoders, Graph Convolutional Networks (GCN) (Kipf and Welling 2017) and Graph Attention Networks (GAT) (Veličković et al. 2018), to explore the latent interaction patterns in each homogeneous bio-network.
For the gene interaction network $G_g$, a graph encoder with GCN is constructed. GCN applies convolutional operations and a message passing mechanism on the network to learn and update node embeddings by considering the local and global neighborhood structure. For the $(l+1)$-th network layer, the gene feature matrix $F^{g,l+1}$ can be obtained by:
where $\hat{A}^g = A^g + I^g$ denotes the adjacency matrix of the gene interaction network $G_g$ with inserted self-loops, and the elements of $A^g \in \mathbb{R}^{17247 \times 17247}$ are the weight scores [cf Equation (2)]. $D$ is a diagonal degree matrix calculated by $D_{ii} = \sum_{j=1}^{n} A^g_{ij}$, where $n$ denotes the number of genes in $G_g$. $W^l$ is the learnable weight matrix of the $l$-th layer, $F^{g,0}$ denotes the initial feature matrix of genes constructed by one-hot encoding, and $\sigma$ represents the activation function, defined as LeakyReLU.
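A single propagation step of this kind can be sketched in NumPy. The sketch follows the symmetric normalisation of Kipf and Welling (2017), in which the degrees are computed from the self-loop-augmented matrix; the text above defines $D$ from $A^g$ itself, so this choice is an assumption made to avoid division by zero for isolated nodes. The function names and the LeakyReLU slope are likewise illustrative.

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    """Element-wise LeakyReLU; the slope value is an assumption."""
    return np.where(x > 0, x, slope * x)

def gcn_layer(adj, features, weight):
    """One GCN propagation step: sigma(D^-1/2 (A + I) D^-1/2 F W)."""
    a_hat = adj + np.eye(adj.shape[0])           # insert self-loops
    deg = a_hat.sum(axis=1)                      # degrees of the augmented graph
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return leaky_relu(a_norm @ features @ weight)

# Toy usage: two connected genes with one-hot initial features.
adj = np.array([[0.0, 1.0],
                [1.0, 0.0]])
features = np.eye(2)       # one-hot encoding, as in the text
weight = np.eye(2)         # stands in for a learned weight matrix
hidden = gcn_layer(adj, features, weight)
```

Stacking two such layers, as the paper does, lets each gene aggregate information from its two-hop neighborhood while limiting over-smoothing.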
To reduce noise interference in the miRNA similarity network $G_m$, a graph encoder with GAT is constructed for extracting miRNA features. Different from GCN, GAT extracts node features by aggregating their neighborhoods’ information with different attention coefficients, and can be applied to inductive as well as transductive problems (Yan et al. 2023, Zhang et al. 2024). For the $(l+1)$-th network layer, the feature vector $f_i^{m,l+1}$ of miRNA $m_i$ can be obtained in three main steps. First, the attention coefficient $e_{ij}$ representing the importance of miRNA $m_j$ to $m_i$ is calculated by:
where $a$ is a learnable attention parameter vector and $W^l$ is a weight matrix. Symbols $T$ and $\parallel$ denote transposition and concatenation operations, respectively. $f_i^{m,l}$ and $f_j^{m,l}$ denote the embedding vectors of miRNAs $m_i$ and $m_j$ obtained from the $l$-th layer. Then, the normalized attention coefficient is calculated using the SoftMax function to make different nodes comparable:
where $N_i$ represents the set containing all the first-order neighbors of miRNA $m_i$. Finally, the feature vector $f_i^{m,l+1}$ of miRNA $m_i$ is obtained by a neighborhood aggregation strategy:
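The three steps (raw attention scores, SoftMax normalisation over neighbors, weighted aggregation) can be sketched for a single attention head in NumPy. This follows the standard GAT formulation of Veličković et al. (2018) and is not taken from the authors' code; the LeakyReLU slope of 0.2, the inclusion of self-loops in the neighborhood, and the omission of a final activation are assumptions.

```python
import numpy as np

def gat_layer(adj, features, weight, attn, slope=0.2):
    """Single-head graph attention step.

    adj:      (n, n) matrix, non-zero where j is a neighbor of i; every
              row needs at least one non-zero entry (e.g. a self-loop).
    features: (n, d_in) node embeddings from the previous layer.
    weight:   (d_in, d_out) weight matrix W.
    attn:     (2 * d_out,) attention vector a.
    Returns the attention matrix and the aggregated features.
    """
    h = features @ weight
    d_out = h.shape[1]
    # a^T [h_i || h_j] decomposes into a source term plus a neighbor term.
    src = h @ attn[:d_out]
    dst = h @ attn[d_out:]
    e = src[:, None] + dst[None, :]
    e = np.where(e > 0, e, slope * e)            # LeakyReLU on raw scores
    e = np.where(adj > 0, e, -np.inf)            # restrict to neighbors
    e = e - e.max(axis=1, keepdims=True)         # numerical stability
    alpha = np.exp(e)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)   # SoftMax per node
    return alpha, alpha @ h

# Toy usage: a 3-node path graph with self-loops.
adj = np.array([[1, 1, 0],
                [1, 1, 1],
                [0, 1, 1]])
features = np.arange(6.0).reshape(3, 2)
alpha, out = gat_layer(adj, features, weight=np.eye(2), attn=np.zeros(4))
```

With a zero attention vector every neighbor receives the same weight, so the layer reduces to neighborhood averaging; a trained attention vector is what lets the encoder down-weight noisy similarity edges.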
Through the above intra-view graph representation learning, two feature matrices $F^g \in \mathbb{R}^{17247 \times 128}$ and $F^m \in \mathbb{R}^{4798 \times 128}$ and the corresponding graph encoders are obtained, where the feature dimensionality is set to 128 and each encoder contains 2 layers to avoid over-smoothing.
Contrastive learning is primarily designed to learn representations by maximizing the agreement between similar samples and minimizing it for dissimilar ones. It has been widely applied for extracting informative representations from biological data, facilitating various tasks such as structure prediction (Wang et al. 2021), drug discovery (Singh et al. 2023), sequence analysis (Li et al. 2023), and image analysis (Sanchez-Fernandez et al. 2023).
Because genes and miRNAs are intertwined and significantly influence cellular function and disease progression, we propose a cross-view contrastive loss $L_{cv}$ for enhancing the representations of genes and miRNAs by incorporating the critical regulatory mechanism between them. $L_{cv}$ consists of two parts aiming to refine gene and miRNA representations, respectively, and can be formulated as:
where $L_{cv\_g}$ and $L_{cv\_m}$ are the contrastive losses for genes and miRNAs, and $\beta$ is a parameter controlling the importance of the loss computed with different central molecules. Inspired by the effectiveness of the InfoNCE loss in contrastive learning (Dwibedi et al. 2021), we define $L_{cv\_g}$ and $L_{cv\_m}$ as:
where $rel(\cdot)$ represents a correlation function. More specifically, $rel(f_i^g, f_j^m) = \frac{f_i^g (f_j^m)^T}{\lVert f_i^g \rVert \lVert f_j^m \rVert}$ denotes the relevance degree between the $i$-th gene and the $j$-th miRNA, while $f_i^g$ and $f_j^m$ are their feature vectors [cf Equations (4) and (7)]. $\Pi(\cdot) \in \{0, 1\}$ is an indicator function: $\Pi(\cdot) = 1$ if the target gene and miRNA interact in the network $G_{gm}$, and 0 otherwise. $N_g$ and $N_m$ are the numbers of genes and miRNAs, respectively, and $\tau$ is a temperature parameter controlling the scale of the distribution.
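An InfoNCE-style loss of this kind can be sketched in NumPy. The exact equations are not reproduced in this copy of the text, so several choices below are assumptions rather than the authors' definitions: the two parts are combined as $\beta L_{cv\_g} + (1 - \beta) L_{cv\_m}$, all interacting partners of an anchor are pooled in the numerator, anchors without any interaction partner are skipped, and $\tau = 0.5$ is arbitrary. Only the cosine relevance function, the indicator, and the selected value $\beta = 0.3$ come from the text.

```python
import numpy as np

def cosine_matrix(f_g, f_m):
    """rel(f_i^g, f_j^m): cosine similarity between every gene and miRNA."""
    g = f_g / np.linalg.norm(f_g, axis=1, keepdims=True)
    m = f_m / np.linalg.norm(f_m, axis=1, keepdims=True)
    return g @ m.T

def one_sided_loss(sim, indicator, tau):
    """InfoNCE-style loss with the rows of `sim` acting as anchors."""
    logits = sim / tau
    logits = logits - logits.max(axis=1, keepdims=True)   # stability shift
    exp = np.exp(logits)
    pos = (exp * indicator).sum(axis=1)      # mass on interacting partners
    total = exp.sum(axis=1)                  # mass on all candidates
    has_pos = indicator.sum(axis=1) > 0      # skip anchors with no partner
    return -np.log(pos[has_pos] / total[has_pos]).mean()

def cross_view_loss(f_g, f_m, indicator, beta=0.3, tau=0.5):
    """Weighted sum of the gene-centred and miRNA-centred losses (assumed form)."""
    sim = cosine_matrix(f_g, f_m)
    loss_g = one_sided_loss(sim, indicator, tau)
    loss_m = one_sided_loss(sim.T, indicator.T, tau)
    return beta * loss_g + (1.0 - beta) * loss_m

# Toy usage: embeddings that agree with the interaction matrix score a
# lower loss than embeddings that contradict it.
f_g = np.eye(2)
f_m = np.eye(2)
aligned = cross_view_loss(f_g, f_m, indicator=np.eye(2))
swapped = cross_view_loss(f_g, f_m, indicator=np.eye(2)[::-1])
```

Minimising the loss pulls interacting gene and miRNA embeddings together and pushes non-interacting ones apart, which is how the regulatory network $G_{gm}$ refines features learned separately in each view.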
To explore the underlying molecular mechanisms and obtain meaningful representations of diseases, we construct disease features from both genetic and transcriptional views. The associations of diseases with genes and miRNAs are collected from the networks $G_{dg}$ and $G_{dm}$, and the average pooling method is used to obtain the feature vectors $f_i^{d\_g}$ and $f_i^{d\_m}$ for disease $d_i$:
where $G_i$ and $M_i$ are two sets containing the genes and miRNAs associated with disease $d_i$, while $f_j^g$ and $f_p^m$ are the feature vectors of gene $g_j$ and miRNA $m_p$.
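Average pooling over associated molecules amounts to a row-normalised matrix product, as the NumPy sketch below shows. The function name and the handling of diseases without any associated molecule (a zero vector) are assumptions; the source does not state how such diseases are treated.

```python
import numpy as np

def pool_disease_features(assoc, entity_features):
    """Mean of the feature vectors of the entities associated with each disease.

    assoc:           (n_diseases, n_entities) binary association matrix.
    entity_features: (n_entities, dim) gene or miRNA embeddings.
    Diseases with no associated entity receive a zero vector (assumption).
    """
    assoc = np.asarray(assoc, dtype=float)
    counts = assoc.sum(axis=1, keepdims=True)
    summed = assoc @ np.asarray(entity_features, dtype=float)
    return np.divide(summed, counts, out=np.zeros_like(summed), where=counts > 0)

# Toy usage: disease 0 is linked to genes 0 and 1, disease 1 to none.
assoc = np.array([[1, 1, 0],
                  [0, 0, 0]])
gene_features = np.array([[1.0, 2.0],
                          [3.0, 4.0],
                          [5.0, 6.0]])
disease_features = pool_disease_features(assoc, gene_features)
```

The same function covers both views: passing the gene–disease matrix with gene embeddings yields $f^{d\_g}$, and the miRNA–disease matrix with miRNA embeddings yields $f^{d\_m}$.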
Different from unsupervised measurement, we attempt to detect complex association patterns between diseases by considering supervised signals. For each disease pair in $S_{benchmark}$ and $S_{independent}$ [cf Equation (1)], two pair features $h_{i,j}^{d\_g}$ and $h_{i,j}^{d\_m}$ can be obtained by a concatenation strategy based on the disease feature matrices $F^{d\_g} \in \mathbb{R}^{30170 \times 128}$ and $F^{d\_m} \in \mathbb{R}^{30170 \times 128}$, where $h_{i,j}^{d\_g} = [f_i^{d\_g}, f_j^{d\_g}]$ and $h_{i,j}^{d\_m} = [f_i^{d\_m}, f_j^{d\_m}]$. Furthermore, two multi-layer perceptron networks, each including 256 neurons for the hidden layer and 128 neurons for the output layer, are designed to extract high-level feature matrices $\hat{H}^{d\_g} \in \mathbb{R}^{N_p \times 128}$ and $\hat{H}^{d\_m} \in \mathbb{R}^{N_p \times 128}$ for disease pairs:
where $N_p$ represents the number of disease pairs in $S_{benchmark}$ [cf Equation (1)], $W^{1,g}$, $W^{2,g}$, $W^{1,m}$, and $W^{2,m}$ are learnable weight matrices from different fully connected layers, and $\sigma$ is defined as the ReLU function. A normalization operation is conducted after each network layer to reduce internal covariate shift and improve stability.
To capture hidden association patterns between diseases, we design an attentive multi-layer perceptron network consisting of an attentive layer and a similarity computation layer. The attentive layer aims to integrate the high-level disease features $\hat{H}^{d\_g}$ and $\hat{H}^{d\_m}$ by considering the feature correlation and contribution within different disease pairs. The feature matrix $\bar{H}^{d}$ extracted from the attentive layer can be formulated as:
where $\hat{H}^{d} \in \mathbb{R}^{N_p \times 256}$ is initially constructed by concatenating $\hat{H}^{d\_g}$ and $\hat{H}^{d\_m}$, and $W_{cor}$ is a learnable feature correlation matrix. $\odot$ denotes the Hadamard product between matrices, $\sigma$ is defined as the ReLU function, and $\varphi$ is a column-wise normalization operation defined as:
where $M = W_{cor}\hat{H}^{d}$. The feature matrix $\bar{H}^{d} \in \mathbb{R}^{N_p \times 256}$ is further fed into a fully connected network layer consisting of 256 neurons, after which a single neuron is used to calculate disease similarity scores. The valuable disease association information serves as the supervision signal, providing critical guidance for model optimization. To improve training efficiency, the binary cross-entropy with logits loss is used and formulated as:
where $y_i$ and $y_i'$ denote the true label and predicted similarity score for the $i$-th disease pair.
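Binary cross-entropy with logits fuses the sigmoid and the log-loss into one numerically stable expression, which the NumPy sketch below illustrates. It assumes that the network output is a raw, pre-sigmoid score and that the loss is averaged over disease pairs; neither detail is confirmed by this copy of the text, and the function name is hypothetical.

```python
import numpy as np

def bce_with_logits(logits, labels):
    """Mean binary cross-entropy computed directly from raw scores.

    Uses the identity
        loss = max(z, 0) - z * y + log(1 + exp(-|z|)),
    which equals -[y log s(z) + (1 - y) log(1 - s(z))] for the sigmoid s,
    but never exponentiates a large positive number.
    """
    z = np.asarray(logits, dtype=float)
    y = np.asarray(labels, dtype=float)
    per_pair = np.maximum(z, 0.0) - z * y + np.log1p(np.exp(-np.abs(z)))
    return per_pair.mean()

# Toy usage: two confident, correct predictions and one undecided one.
logits = np.array([4.0, -4.0, 0.0])
labels = np.array([1.0, 0.0, 1.0])
loss = bce_with_logits(logits, labels)
```

Working on logits avoids the overflow and log-of-zero problems that arise when a sigmoid is applied first and its output is then passed to a separate log-loss.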
Overall, the cross-view contrastive learning and association pattern joint learning collaborate horizontally and vertically to reveal the deep correlation between diseases. The integrated loss function is $L_{all} = L_{sim} + L_{cv}$, and each unknown disease pair can be evaluated by the trained predictor.
Four comprehensive indicators, the Area Under the Receiver Operating Characteristics Curve (AUC), the Area Under the Precision-Recall Curve (AUPR), F1-Score, and the Matthews Correlation Coefficient (MCC), are adopted to evaluate the performance of various methods (Zulfiqar et al. 2023, Ma et al. 2024). F1-Score and MCC are two balanced metrics: the former takes into account precision and recall, while the latter takes into account the true positive and negative rates and the false positive and negative rates. AUC illustrates the trade-off between sensitivity and specificity, while AUPR describes the trade-off between precision and recall at various threshold settings (Hasanin et al. 2019, Chen et al. 2023b). In addition, three conditional indicators, Accuracy (Acc), Sensitivity (Sen), and Precision (Pre), are used to evaluate different models. Because the disease pairs with higher predicted scores usually attract the most attention, two additional indicators are used. One is $ROC_k$, the area under the Receiver Operating Characteristics (ROC) curve up to the $k$-th false positive (Gribskov and Robinson 1996), while the other is the number of true positive pairs within the top $n$ predicted pairs.
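The two ranking-oriented indicators can be sketched in plain NumPy. The normalisation of $ROC_k$ by $k$ times the number of positives follows the usual reading of Gribskov and Robinson (1996) and is an assumption here, as are the function names and the handling of tied scores, which are simply kept in input order.

```python
import numpy as np

def roc_k(scores, labels, k):
    """Area under the ROC curve up to the k-th false positive, scaled to [0, 1]."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    ranked = np.asarray(labels)[order]
    total_pos = int(ranked.sum())
    if total_pos == 0 or len(ranked) - total_pos < k:
        raise ValueError("need at least one positive and k negatives")
    tp_seen, fp_seen, area = 0, 0, 0
    for y in ranked:
        if y == 1:
            tp_seen += 1
        else:
            fp_seen += 1
            area += tp_seen      # positives ranked above this false positive
            if fp_seen == k:
                break
    return area / (k * total_pos)

def top_n_hits(scores, labels, n):
    """Number of true positive pairs among the n highest-scoring pairs."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    return int(np.asarray(labels)[order][:n].sum())

# Toy usage: five scored pairs, three of them true positives.
scores = [0.9, 0.8, 0.7, 0.6, 0.5]
labels = [1, 0, 1, 1, 0]
roc_1 = roc_k(scores, labels, k=1)
roc_2 = roc_k(scores, labels, k=2)
hits = top_n_hits(scores, labels, n=3)
```

Both measures look only at the head of the ranking, so they reward a predictor that places true pairs before its first few mistakes even if its overall AUC is similar to that of a competitor.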
Section 3
The impact of three important parameters, namely the disease feature dimensionality, the number of training epochs, and the weighting coefficient $\beta$ [cf Equation (8)], is analyzed. The benchmark dataset $S_{benchmark}$ is randomly divided into five folds, where four folds are used for training and the remaining fold is used for validation. The parameter analysis experiments are performed by varying one parameter while fixing the others. The results are shown in Fig. 2, from which we can see that the performance of DiSMVC initially improves and then becomes relatively stable as the feature dimensionality and the number of epochs increase. The impact of the weighting coefficient $\beta$ on the performance of DiSMVC is not significant. However, we can still observe that integrating the two contrastive losses for genes and miRNAs contributes to improving performance. Considering both prediction accuracy and training time, we set the disease feature dimensionality, the number of training epochs, and the weighting coefficient $\beta$ to 128, 200, and 0.3, respectively.
Parameter analysis of DiSMVC. The influence of disease feature dimensionality, training epoch and weighting coefficient β on the performance of DiSMVC.
To illustrate the effectiveness of DiSMVC, we compare it with five state-of-the-art methods: Wang’s method (Wang et al. 2007), Li-MaxPooling (Yang et al. 2021), Li-AvePooling (Yang et al. 2021), CoGO (Chen et al. 2022), and SynerSim (Gao et al. 2023). The first four methods measure disease similarity in an unsupervised manner. Wang’s method measures disease similarity based on Directed Acyclic Graphs (DAGs), where diseases tend to be similar if they share a significant number of disease semantic terms. Li-MaxPooling, Li-AvePooling, and CoGO are three molecule-based methods that extract disease features via different graph representation techniques. SynerSim utilizes the initial, highly sparse disease association network as self-supervised information to guide association pattern mining. In contrast, DiSMVC introduces additional phenotype-annotated association supervision signals, enhancing biological semantics and avoiding label leakage. Since neither the source code nor the web server of SynerSim is accessible, we conduct an unbiased performance comparison of DiSMVC with the other four unsupervised methods. The performance comparison results are shown in Table 2 and Fig. 3, from which we can see the following: (i) compared to the semantic-based method, the molecule-based predictors obtain superior performance in terms of AUC, AUPR, F1-Score, and MCC, owing to their powerful ability to uncover hidden molecular pathogenic mechanisms; (ii) in contrast with the unsupervised measurements, DiSMVC can capture more informative and deeper disease semantics by integrating multiple types of bio-entity association knowledge and supervision signals. As a result, DiSMVC achieves clearly higher performance than all the competing methods in terms of the comprehensive classification metrics; (iii) Fig. 3c and d show the number of true positive pairs in the top-n pairs and the ROC k scores obtained by different predictors. Benefiting from the multi-view graph collaborative learning framework, DiSMVC can also improve the quality of top-ranked results.
Performance comparison of different methods on S independent . (a) and (b) show the ROC and PR curves with their corresponding AUC and AUPR scores obtained by different methods. (c) shows the numbers of true positive pairs in the top-n pairs predicted by various methods, and (d) shows the ROC k scores obtained by different predictors.
Performance comparison of different methods on S independent
To explore the reason for the performance improvement of DiSMVC in measuring disease similarity, the disease pair features and similarity scores learned by DiSMVC are analyzed and shown in Fig. 4. From Fig. 4 we can see the following: (i) the visualized features of disease pairs within the same type are similar, but the hidden features of positive pairs are clearly different from those of negative pairs, illustrating that DiSMVC can uncover the shared pattern within the same pair group as well as the specific patterns that distinguish different pair groups; (ii) the extracted high-level features further contribute to the strong discriminative ability of DiSMVC. As a result, the similarity scores predicted by DiSMVC differ significantly between the two pair groups, where most disease pairs in the positive group are scored as similar and those in the negative group have much lower similarity scores.
An analysis of features learned by DiSMVC. (a) shows the visualized heatmap of hidden features of disease pairs extracted by DiSMVC. The rows denote the pair index of positive pairs and negative pairs from S independent , and the columns represent the corresponding feature index. (b) shows the similarity score distribution of disease pairs within different types obtained by DiSMVC.
DiSMVC consists of two crucial modules: cross-view graph contrastive learning and association pattern joint learning. To analyze their contributions to the prediction ability of DiSMVC, we construct seven competing baseline methods and conduct an ablation study.
The variants of DiSMVC are named ‘w/o L cv _ g ’, ‘w/o L cv _ m ’, ‘w/o L cv ’, ‘w/o L sim ’, ‘w/o miRNA ’, ‘w/o gene ’, and ‘w/o attention’, respectively. Their predictive performance along with that of DiSMVC is shown in Fig. 5, from which we can draw the following conclusions: (i) almost all the variants of DiSMVC achieve inferior performance in terms of AUC, AUPR, and F1-Score, indicating that each module in DiSMVC is important, and that cross-view graph contrastive learning and association pattern joint learning collaboratively advance the performance; (ii) different from DiSMVC, ‘w/o L sim ’ computes pair scores in an unsupervised manner. A considerable performance discrepancy is observed between them, demonstrating the significant contribution of the phenotypic multimorbidity-based supervision signal in measuring disease similarity; (iii) DiSMVC is superior to ‘w/o miRNA ’ and ‘w/o gene ’, illustrating the advantages of enriching disease representation by considering underlying molecular mechanisms from both genetic and transcriptional views. Meanwhile, the significant decrease observed in ‘w/o gene ’ suggests that genes play a more crucial role in disease association detection than miRNAs. It is worth noting that ‘w/o attention’ uses a concatenation operation to integrate disease embeddings derived from gene and miRNA features, resulting in limited performance improvement. In contrast, DiSMVC integrates features from multiple views by considering their different importance via the attention layer, as illustrated in Fig. 5b, enabling effective fusion and strong prediction ability.
An importance analysis of different modules in DiSMVC. (a) shows the performance comparison of DiSMVC with the seven baseline methods. ‘w/o L cv _ g ’, ‘w/o L cv _ m ’, and ‘w/o L cv ’ are three variants of DiSMVC trained without computing the contrastive loss L cv _ g , L cv _ m , or L cv , respectively. ‘w/o L sim ’ combines cross-view graph contrastive learning and cosine correlation analysis to measure disease similarity. ‘w/o miRNA ’ and ‘w/o gene ’ are two variants in which the hidden features of genes or miRNAs, separately extracted by intra-view graph representation, are used to construct disease pair features, and an attentive multi-layer perceptron is trained to predict disease similarity scores. ‘w/o attention’ integrates cross-view graph contrastive learning and association pattern joint learning without the attention mechanism. (b) shows the average feature attention weights obtained by DiSMVC for the disease pairs in S independent .
To illustrate the validity of DiSMVC for measuring disease similarity, we conduct a case study on the interpretability of the top disease pairs predicted by DiSMVC. The 10th revision of the International Classification of Diseases (ICD-10) is globally recognized and widely used for categorizing and coding various diseases; therefore, we analyze the predictive results along with the ICD-10 categories. All the candidate disease pairs with ICD-10 codes in S independent are predicted and ranked in descending order of pair scores.
The network of the top 100 disease pairs predicted by DiSMVC is shown in Fig. 6, from which we can see that diseases belonging to ‘ICD-10/K: Disease of digestive system’ or ‘ICD-10/I: Disease of the circulatory system’ are more likely to be in close proximity, indicating that the disease relationships predicted by DiSMVC conform to the ICD-10 classification to some extent. In addition, DiSMVC can also identify disease associations across different categories by exploring underlying molecular mechanisms. For example, autoimmune hepatitis, labeled as C4721555 and belonging to ICD-10/K, shows close correlation with four other types of diseases, including C0162323, C0011884, C0014175, and C0155789.
The network of top 100 disease pairs predicted by DiSMVC. The diseases with UMLS concept unique identifier are shown in circles, where the size illustrates the number of related diseases, and the color represents disease types classified based on ICD-10 version. The thickness of edges denotes the different pair scores between diseases predicted by DiSMVC, where thicker edges indicate closer relations.
To further investigate the molecular interpretability of the predicted disease pairs, the top 10 disease pairs are analyzed and shown in Table 3. We can observe that seven disease pairs are supported by experimental literature in PubMed, where most diseases are accompanied by common molecular or biological process abnormalities. For example, ‘Bleeding esophageal varices’ is a circulatory system disease and ‘Fatty liver, alcoholic’ is a subtype of digestive system disease. Long-term alcoholic fatty liver is one of the common causes of bleeding esophageal varices. Although the two diseases originate from different human systems, they are mediated by two important biological processes, ‘Mitochondrial protein import’ in Reactome (Fabregat et al. 2018) and ‘Taste transduction’ in KEGG (Kanehisa and Goto 2000), and are significantly associated with the PNPLA3 gene and four SNPs: rs738408, rs2294915, rs3747207, and rs738409. In addition, cholestasis has been reported to be one of the key pathogenic factors of alcoholic liver disease. The predicted disease pair of ‘Fatty liver, alcoholic’ and ‘Biliary cirrhosis’ is mediated by overlapping genes, SNPs, and pathways, as shown in Fig. 7.
The network of bio-entity associations for three representative diseases. Three diseases are shown in rhombuses, and their associated genes, SNPs, and pathways are shown in triangles, circles, and octagons, respectively.
Top 10 disease pairs predicted by DiSMVC.
Overall, DiSMVC can identify similar diseases that are molecularly interpretable and provide guidance for understanding the pathological mechanisms of diseases.
Section 4
In this study, we propose a new computational predictor named DiSMVC to measure disease similarity. Compared with other competing methods, it has the following advantages: (i) the graph collaborative learning framework of DiSMVC is able to integrate multiple aspects of disease-related information, including bio-entity associations at the micro-level and phenotype-based multimorbidity at the macro-level, showing promising performance and molecular interpretability in measuring disease similarity; (ii) the high-level molecular interaction features related to diseases are extracted and refined through cross-view contrastive learning, based on which more informative disease representations can be obtained from both genetic and transcriptional perspectives; (iii) different from other existing methods, DiSMVC measures disease similarity in a supervised manner by incorporating prior disease association knowledge, so as to capture strongly discriminative association patterns.
There is still potential for further improvement in this study. The occurrence and development of diseases is an extremely complex process that is influenced by various biological and environmental factors. Therefore, considering more disease-related bio-entities such as proteins, metabolites, and microbes would enhance the richness and comprehensiveness of disease representation. We use similar, comorbid conditions as positive labels, which limits the detection of biologically similar diseases that share underlying mechanisms but are not necessarily comorbid. Incorporating auxiliary supervised signals from DisGeNET could help capture biological similarities between diseases. Finally, the integration of additional knowledge inevitably introduces noise. More high-quality data sources or advanced heterogeneous network learning techniques are expected to extract meaningful information while mitigating noise interference.