A Generalized Higher-order Correlation Analysis Framework for Multi-Omics Network Inference

preprint OA: closed CC-BY-NC-ND-4.0
📄 Open PDF Full text JSON View at publisher
Full text 97,767 characters · extracted from oa-pdf · 14 sections · click to expand

Abstract

Multiple -omics (genomics, proteomics, etc.) profiles are commonly generated to gain insight into a disease or physiological system. Constructing multi-omics networks with respect to the trait(s) of interest provides an opportunity to understand relationships between molecular features but integration is challenging due to multiple data sets with high dimensionality. One approach is to use canonical correlation to integrate one or two omics types and a single trait of interest. However, these types of methods may be limited due to (1) not accounting for higher-order correlations existing among features, (2) computational inefficiency when extending to more than two omics data when using a penalty term-based sparsity method, and (3) lack of flexibility for focusing on specific correlations (e.g., omics-to-phenotype correlation versus omics-to-omics correlations). In this work, we have developed a novel multi-omics network analysis pipeline called Sparse Generalized Tensor Canonical Correlation Analysis Network Inference (SGTCCA-Net) that can effectively overcome these limitations. We also introduce an implementation to improve the summarization of networks for downstream analyses. Simulation and real-data experiments demonstrate the effectiveness of our novel method for inferring omics networks and features of interest. Author summary Multi-omics network inference is crucial for identifying disease-specific molecular interactions across various molecular profiles, which helps understand the biological processes related to disease etiology. Traditional multi-omics integration methods focus mainly on pairwise interactions by only considering two molecular profiles at a time. This approach overlooks the complex, higher-order correlations often present in multi-omics data, especially when analyzing more than two types of -omics data and phenotypes. Higher-order correlation, by definition, refers to the simultaneous relationships among more than two types of -omics data and phenotype, providing a January 18, 2024 1/29 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint more complex and complete understanding of the interactions in biological systems. Our research introduces Sparse Generalized Tensor Canonical Correlation Network Analysis (SGTCCA-Net), a novel framework that effectively utilizes both higher-order and lower-order correlations for multi-omics network inference. SGTCCA-Net is adaptable for exploring diverse correlation structures within multi-omics data and is able to construct complex multi-omics networks in a two-dimensional space. This

Method

offers a comprehensive view of molecular feature interactions with respect to complex diseases. Our simulation studies and real data experiments validate SGTCCA-Net as a potent tool for biomarker identification and uncovering biological mechanisms associated with targeted diseases.

Introduction

1 0.1 Multi-Omics Data Integration 2 The integration of multiple datasets with the same set of subjects has been actively 3 explored in the field of machine learning, referred to as multi-view machine learning [1].4 This approach has a broad range of applications, including clustering subjects based on5 the consensus of different views [2] and performing image annotation [3]. Furthermore,6 multi-view machine learning has been used extensively in the biomedical domain [4,5].7 Recent advances in biomedical technologies have enabled the generation of 8 high-throughput data at the molecular level, such as the genome, transcriptome, 9 proteome, and metabolome. Collectively, these datasets are referred to as multi-omics10 data [6]. Traditionally, each omics dataset has been analyzed separately [7,8], which can11

Result

in the loss of important information about the connections between omics 12 datasets. Multi-omics integration is an alternative approach that identifies shared and13 complimentary information across different molecular profiles [9]. Integrating 14 multi-omics data can help uncover biological mechanisms and interactions at the 15 molecular level, and has been used for various purposes such as disease 16 subtyping [10,11], variable selection [12,13], network analysis [14,15] and prediction of17 biomarkers [6,16,17]. 18 0.2 Multi-View Dimension Reduction 19 Dimension reduction is one of the more common goals of high dimensional data analysis.20 For example, non-negative matrix factorization is implemented to cluster multiple data21 sets [18], and multi-view co-reduction is developed to preserve the locality within the22 view and the consistency between the views in the lower dimensional embedding of each23 view [19]. Other dimension reduction methods for multiple data sets are based on 24 Canonical Correlation Analysis (CCA) [20], which seeks to find the linear combination25 (canonical weight) that maximizes the correlation between two sets of data. Usually, 26 there is more than one solution, which may be referred to as a canonical weight matrix,27 and thus it is commonly used for dimension reduction by projecting the original data28 into a shared lower dimensional space between two views (or omics data) with canonical29 weight matrix. Given two data setsX1 and X2, the optimization function is: 30 w1, w2 = arg max w1,w2 ˜wT 1 X T 1 X2 ˜w2p ˜wT 1 X T 1 X1 ˜w1 ˜wT 2 X T 2 X2 ˜w2 (1) However, this formulation only considers two sets of data and is not generalized to 31 multiple data. A multiple canonical correlation method was developed to maximize the32 January 18, 2024 2/29 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint sum of pairwise canonical correlations [21]. Suppose there arej = 1, . . . , Kviews, then 33 the problem can be formulated as follows: 34 max w1,...,wk KX i<j wT i X T i Xjwj s.t.wT j X T j Xjwj = 1 ∀j. (2) A penalty term can be added to the formulation above to achieve sparsity, which is 35 called Sparse Multiple Canonical Correlation Analysis (SmCCA). 36 Network analysis for molecular profiles is often used with dimension reduction 37

Methods

like SmCCA to identify and visualize connections between features [14,15] for38 the purpose of inferring underlying biological interactions. In addition to canonical 39 correlation analysis, regression is another approach that is implemented for multi-omics40 network inference, particularly for gene regulatory networks [22,23]. However, most 41

Methods

do not incorporate an outcome or phenotype in the form of a quantitative trait.42 Canonical correlation analysis-based methods can accomplish this goal by expanding 43 CCA to incorporate a phenotype (Y). SmCCNet was proposed to partition omics 44 features into different network modules while considering the phenotype(s) of 45 interest [24] and using a scaled version of SmCCA with the following optimization: 46 (w1, w2, ..., wk) = arg max ˜w1, ˜w2,..., ˜wK   X i<j i,j=1,2,...,K ai,j ˜wT i X T i Xj ˜wj + KX i=1 bi ˜wT i X T i Y   , subject to ∥ ˜wj∥2 = 1, P j( ˜wj) ≤ cj, j = 1, 2, ..., K, (3) where ai,j and bi are scaling factors that place greater importance on pairwise 47 combinations of views. For example, one may want to increasebi to increase the 48 influence on the phenotype to determine the canonical weights, andP (·) is the lasso 49 penalty parameter for sparsity [25], but other types of penalties can also be used. The50 optimal penalty parameter is selected through k-fold cross-validation. After that, the 51 canonical weights can be extracted based on the objective function above to construct52 an adjacency matrix. Finally, to construct a network, hierarchical clustering can be 53 implemented afterward to extract multiple multi-omics network modules. This method54 has been applied in various contexts including identifying phenotype-specific 55 miRNA-mRNA and proteomics-metabolomics correlations and networks [26]. However, 56 the existing SmCCNet method can only be adapted to two molecular profiles plus one57 single phenotype. When extending it to 3 or more omics data types, it can become 58 computationally expensive due to the cross-validation step to find the optimal penalty59 parameter for each omics data set. 60 Another multi-omics analysis and network inference method Data Integration 61 Analysis for Biomarker discovery using Latent variable approaches for Omics studies 62 (DIABLO) [27] has a similar formulation to the sparse multiple canonical correlation 63 analysis problems (Eq 3). Although the formulations of SmCCNet and DIABLO are 64 similar, they have some differences: (1) In DIABLO, there is no lasso penalty, but the65 user can choose how many features to include as nonsparse for each data view instead;66 (2) DIABLO focuses more on the prediction of phenotype and biomarker discovery, 67 while SmCCNet focuses more on graph learning of the interaction between biomarkers68 with respect to phenotype; (3) the adjacency matrix of SmCCNet is aggregated through69 canonical weights obtained from subsampling the data multiple times for a more robust70 solution, while the adjacency matrix of DIABLO is the subset of correlation matrix with71 only molecular features selected by the canonical correlation analysis. 72 January 18, 2024 3/29 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint 0.3 Tensor Canonical Correlation Analysis 73 As reviewed, the dimension reduction methods based on canonical correlation analysis74 only consider the summation of all pairwise correlation, including methods like 75 SmCCNet and DIABLO. However, for multi-omics data with more than two molecular76 profiles, the correlation structure can be higher-order rather than pairwise, where 77 higher-order correlation is defined as the simultaneous correlation among 3 or more 78 features. This type of higher-order correlation can be captured by pairwise correlation if79 and only if all pairwise correlations are strong. However, in multi-omics data, it may80 not be the case that higher-order correlations also have strong pairwise correlations. 81 A particular variant of the multiple canonical correlation analysis extends the 82 pairwise relationship to a higher-order relationship by maximizing the tensor canonical83 correlation [28], which captures higher-order correlations among multiple data sets and84 projects them into a shared lower dimensional embedding. A further extension to this85 tensor canonical correlation analysis (TCCA) is to combine deep learning with tensor86 canonical correlation analysis to maximize the higher-order correlation between views in87 the non-linear lower-dimensional space for dimensional reduction [29] to allow for 88 non-linearity. 89 Fig 1. Comparison between SmCCA, TCCA, and GTCCA Visualization of the comparison between Sparse multiple Canonical Correlation Analysis (SmCCA), Tensor Canonical Correlation Analysis (TCCA), and Generalized Tensor Canonical Correlation Analysis (GTCCA). SmCCA only captures pairwise correlations (e.g., genes and proteins) and TCCA only captures the highest order correlations (e.g., gene, protein and metabolites),. GTCCA considers the combination of highest order correlations ((TCCA) and lower-order correlations (SmCCA).January 18, 2024 4/29 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint 0.4 Contributions 90 As reviewed, SmCCA captures pairwise correlation, while TCCA captures higher-order91 correlation. However, TCCA only captures the highest order correlation, but not all 92 types of lower-order correlation structures of interest. In addition, sparsity is not 93 considered in TCCA, which prevents the model from focusing only on the most relevant94 information in the data. Furthermore, TCCA has a high memory and computation cost,95 which is inefficient for high-dimensional omics data. Therefore, in this work, we present96 a new method for identifying multi-omics phenotype specifics network that is based on97 TCCA but with significant extensions by including correlations of higher and lower 98 order simultaneously (Figure 1) and by incorporating sparsity. This new method is 99 called Sparse Generalized Tensor Canonical Correlation Analysis for Network Analysis100 (SGTCCA-Net). In particular, we avoid repeatedly applying TCCA to each correlation101 structure, but instead use a simultaneous algorithm that accounts for all the correlation102 components. 103 In this novel pipeline, we introduce multiple major contributions: New definition of104 higher-order correlation compared to the original TCCA, which better measures 105 higher-order correlation without interpreting the directionality and avoids effect 106 cancellation issues; The development of Sparse Generalized Tensor Canonical 107 Correlation Analysis (SGTCCA) with consideration of higher-order correlation, flexible108 design of correlation structures, and biased subsampling algorithm that ensures sparsity109 and computational efficiency; Implementation of network pruning and summarization 110 algorithms to keep only the most important molecular features. Our pipeline first 111 performs feature selection with SGTCCA, which extracts molecular features that are 112 involved in one or more specified higher/lower-order correlation structures. Then an 113 adjacency matrix between selected features is constructed to identify the inter-molecule114 relationships and collapse all higher/lower-order correlations into the two-dimensional115 space. Afterward, a network pruning algorithm is implemented to the adjacency matrix116 with the PageRank algorithm [30] NetSHy summarization score [31] to reduce the size117 of the network and include only the most relevant molecular features, yielding the final118 multi-omics network. Details of the pipeline are described below 119

Materials and methods

120 0.5 Summary of Pipeline Workflow 121 The end-to-end pipeline for SGTCCA-Net (Figure 2 and Algorithm 1) inputs the 122 multi-omics data X1, X2, ..., Xk ∈ RN ×dj , j = 1, 2, ..., k, each with dimensiondj, and 123 outputs subnetwork adjacency matrixM(sub) ∈ R(Pk j=1 d(sub) j )×(Pk j=1 d(sub) j ), where 124 d(sub) j ≤ dj for all j = 1, 2, ..., k is the number of features belong to thejth molecular 125 profile. There are two major contributions in this paper: (1) the development of Sparse126 Generalized Tensor Canonical Correlation Analysis that combines higher-order and 127 lower-order correlation of interests and guarantees high computational efficiency, and (2)128 a novel network pruning algorithm based on the network summarization score. 129 January 18, 2024 5/29 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint Fig 2. SGTCCA-Net workflow Workflow of SGTCCA-Net pipeline for multi-omics network inference. It consists of three core steps: higher-order correlation extraction, affinity matrix construction, and network trimming. Biased subsampling involves calculating the covariance tensor/matrix density with respect to certain molecular profiles, which is also the importance/probability vector when randomly sampling features from each molecular profile. Algorithm 1: SGTCCA-Net End-to-End Pipeline Algorithm. Input: Multi-omics and phenotype dataXj ∈ RN ×dj , j = 1, 2, ..., k, preferred number of subsampless ; (1) Run the SGTCCA algorithm and obtain canonical weight matrices Hj ∈ Rdp×s, j = 1, 2, ..., k (Algorithm 2 and Algorithm 1 in S3 T ext); (2) Construct global adjacency matricesM based on Hj, j = 1, 2, ..., k (Algorithm 2 in S5 T ext); (3) Prune each network module with PageRank algorithm and NetSHy network summarization score, and obtain sub-networksM (sub)(Algorithm 2 in S5 T ext) ; (4) Use Pearson’s correlation to filter out weaker edges from each subnetwork (Algorithm 2 in S5 T ext); Output:Multi-omics network module with respect to phenotype(s) 130 0.6 Generalized Tensor Canonical Correlation Analysis 131 0.6.1 New T ensor Canonical Correlation Analysis 132 The original TCCA algorithm has two major issues: (1) the directionality 133 (positive/negative) of higher-order correlation is not interpretable, especially when there134 are more than three datasets, (2) the correlation effect will be cancelled for odd number135 of datasets (see S1 Text for detail). Therefore, we addressed these limitations by 136 proposing a new tensor canonical correlation analysis formulation. Let 137 z1, z2, ..., zk ∈ Rn×1 be k n × 1 dimensional vectors that are centered and scaled, and 138 1 ∈ Rn×1 is the all-onen × 1 vector, then the higher-order correlation between these 139 January 18, 2024 6/29 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint vectors can be defined as: 140 ρ(z1, z2, ..., zk) = | 1 n(z1z2...zk)T 1|, (4) where z1z2...zk is the element-wise multiplication betweenk vectors (assuming k is 141 even). The higher-order correlation calculation method is slightly different for the odd142 number of views to avoid effect cancellation (see S1 Text for detail). Eq 4 calculates the143 higher-order correlation between vectors, but if the higher-order correlation between 144 matrices (e.g. each matrix is an omics dataset) needs to be calculated, we can avoid145 processing multiple loops to calculate the higher-order correlation for all combinations of146 features through a different equation. Suppose that there arek data in total, denoted by 147 Zj ∈ RN ×dj , j = 1, 2, .., k, which are centered and scaled. Letzji ∈ R1×dj be the vector 148 for subject i in the jth view. For each view, there areN observations and dj features, 149 then the covariance tensorC1,2,...,k ∈ Rd1×d2×...×dk for k views can be denoted by: 150 C1,2,...,k = 1 N | NX i=1 z1i ◦ z2i ◦ ... ◦ zki|, (5) where ◦ denotes the outer product. It can be shown that each entry ofC1,2,...,k, denoted by C1,2,...,k(j1, j2, ..., jk), can also be calculated through Eq 4 (see S1 Text for more details). If the canonical weights for each view arehj ∈ Rdj ×1, j = 1, 2, .., k, then our new tensor canonical correlation analysis is formulated as follows: arg max hj ρ =C1,2,...,k ×1 hT 1 ×2 hT 2 × ... ×k hT k s.t. hT j hj = 1, j = 1, 2, ..., k, (6) where the i-mode product ofC1,2,...,k and hT i , denoted byC1,2,...,k ×i hi is defined as 151 follow: 152 (C1,2,...,k ×i hT i )(j1, j2, ..., jp−1, 1, jp+1, ..., jk) = C1,2,...,k(j1, j2, ..., jk) · hT i . (7) It has been shown that the optimization problem above is equivalent to the following form [32]: min ρ,h1,h2,...,hk ||C1,2,...,k − ˆC1,2,...,k||2 F s.t. ˆC1,2,...,k = ρ · h1 ◦ h2 ◦ ... ◦ hk. (8) In this optimization form, the canonical weightshj are solved using either a 153 gradient-based method or alternating least squares [33]. 154 0.6.2 Generalized T ensor Canonical Correlation Analysis Combining 155 Higher and Lower-Order Correlations 156 Algorithm 2: Generalized Tensor Canonical Correlation Analysis Algorithm. Input: Multi-omics and phenotype dataXj ∈ RN ×dj , j = 1, 2, ..., K, preferred number of network moduless ; (1) Calculate covariance tensors/matricesC1, C2, ..., Cl of interest using Eq 5; (2) Formulate multiple tensors/matrices factorization problem using Eq 11 ; (3) Use the first-order gradient-based algorithm to find optimal canonical weight matrices Hj ∈ Rdj ×s, j = 1, 2, ..., K; Output: Canonical weight matricesHj ∈ Rdj ×s, j = 1, 2, ..., k 157 January 18, 2024 7/29 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint The section above illustrates TCCA assuming that only a single full covariance 158 tensor is used and is not capable of adding more covariance structures. For example,159 with the presence of transcriptomic, proteomics, metabolomics, and phenotype data, we160 may also be interested in the 3-way or pairwise correlation with respect to phenotype in161 addition to the full 4-way correlation structure. Therefore, to incorporate multiple 162 correlation structures of interest, we extended our new TCCA method and developed163 the Generalized Tensor Canonical Correlation Analysis (GTCCA). 164 Let Sm = {(m1, ..., mm) : mi ∈ {1, ..., k}, m1 ̸= m2 ̸= ... ̸= mm} be all possible combinations of k choose m, let Sm(i) be the ith element in the set. For instance, with k = 3 and m = 2, it would be the set of{S2(1) = (1, 2), S2(2) = (1, 3), S2(3) = (2, 3)}, then the GTCCA can be defined as: arg max h1,h2,...,hk ak,1ρk(1)2 + ( k k−1)X j=1 ak−1,jρk−1(j)2 + ... + ( k 3)X j=1 a3,jρ3(j)2 + ( k 2)X j=1 a2,jρ2(j)2 s.t. hT j hj = 1, j = 1, 2, ..., k, (9) where ρm(j) = CSm(i) ×1 hm1 × ... ×m hmm, where (m1, m2, ..., mm) = Sm(i). 165 Compared to TCCA, this design allows flexibility with respect to a specific 166 experimental design of interest by allowing a portion ofai,j to be 0. For example, if the 167 investigator is not interested in a 3-way higher-order correlation between omics data 1,2,168 and 3, then the scaling factor associated withρ1,2,3 can be set to 0. In addition, the 169 scaling factor ai,j can be set to values other than 1 to prioritize certain correlation 170 structures. Eq 9 cannot be directly optimized with the existing gradient-based method,171 and thus we found the problem equivalency as follows (proof in S2 Text): 172 Theorem 1 Let CSm(j) be the covariance tensor of view with (m1, ..., mm) = Sm(j) such that CSm(j) ∈ Rdm1 ×dm2 ×...×dmm , If the optimization goal is formulated as follows: arg max h1,h2,...,hk ak,1ρk(1)2 + ( k k−1)X j=1 ak−1,jρk−1(j)2 + ... + ( k 3)X j=1 a3,jρ3(j)2 + ( k 2)X j=1 a2,jρ2(j)2 s.t. hT j hj = 1, j = 1, 2, ..., k, (10) where ρm(i) = CSm(i) ×1 hm1 × ... ×m hmm for all m = 1, 2, ..., k, then the optimization January 18, 2024 8/29 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint problem above is equivalent to the following: arg min h1,h2,...,hk ak,1||CSk(1) − ˆCSk(1)||2 F + ( k k−1)X j=1 ak−1,j||CSk−1(j) − ˆCSk−1(j)||2 F + ... + ( k 3)X j=1 a3,j||CS3(j) − ˆCS3(j)||2 F + ( k 2)X j=1 a2,j||CS2(j) − ˆCS2(j)||2 F s.t. hT j hj = 1, j = 1, 2, ..., k, (11) where ˆCSm(i) = ρm(i)hm1 ◦ hm2 ◦ ... ◦ hmm is the rank-1 approximation of CSm(i). 173 Since the original problem is shown to be equivalent to the multiple tensor factorization174 problem, it can be solved iteratively using any first-order gradient-based algorithm. We175 choose the non-linear conjugate gradient algorithm with the Dai/Yuan update and 176 restart [34,35]. 177 0.7 Multi-Omics Data Example 178 In this section, we demonstrate how the first-order gradient of Eq 11 can be calculated179 with a multi-omics data example. Suppose there are three types of molecular profiles:180 transcriptomics (tr), proteomics (pr), and metabolomics (me), and phenotype (ph) data.181 Define these data asXtr, Xpr, Xme and Yph. Using GTCCA to find the 182 phenotype-related correlation structure, the optimization problem is given by: 183 arg max htr,hpr,hme,hph ρ2 tr,pr,me,ph + ρ2 tr,pr,ph + ρ2 tr,me,ph + ρ2 pr,me,ph + ρ2 tr,ph + ρ2 pr,ph + ρ2 me,ph s.t. hT j hj = 1, i = g, pr, m, ph, (12) where ρtr,pr,me,ph = Ctr,pr,me,ph ×1 htr ×2 hpr ×3 hme ×4 hph, and the other correlation components are also in the same form. By Theorem (1), optimizing this objective function is equivalent to: arg min htr,hpr,hme,hph ||Ctr,pr,me,ph − ˆCtr,pr,me,ph||2 F + ||Ctr,pr,ph − ˆCtr,pr,ph||2 F + ||Ctr,me,ph − ˆCtr,pr,ph||2 F + ||Cpr,me,ph − ˆCpr,me,ph||2 F + ||Ctr,ph − ˆCtr,ph||2 F + ||Cpr,ph − ˆCpr,ph||2 F + ||Cme,ph − ˆCme,ph||2 F s.t. hT j hj = 1, j = g, pr, m, ph, (13) where ˆCtr,pr,me,ph = ρtr,pr,me,ph · htr ◦ hpr ◦ hem ◦ hph represents the rank-1 approximation of the covariance tensor, and other components are similar. Below is the January 18, 2024 9/29 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint example gradient calculation for the transcriptomehtr, ∂f ∂htr = [ ˆCtr,pr,me,ph(1) − Ctr,pr,me,ph(1)] · (ρtr,pr,me,ph · htr ⊙ hpr ⊙ hme) + [ ˆCtr,pr,ph(1) − Ctr,pr,ph(1)](ρtr,pr,ph · hph ⊙ hpr) + [ ˆCtr,me,ph(1) − Ctr,me,ph(1)](ρtr,me,ph · hph ⊙ hme) + (htrρ1hT ph − C1,y)hphρtr,ph + λ(htr − ¯hph) (14) ∂f ∂ρtr,pr,me,ph = ( ˆCtr,pr,me,ph − Ctr,pr,me,ph) ×1 htr ×2 hpr ×3 hme ×4 hph, (15) where ⊙ is the Khatri-Rao product, given two matricesA ∈ Rm1×n and B ∈ Rm1×n, 184 and let ⊗ denote the Kronecker product between two vectors, the Khatri-Rao product is185 given by: 186 A ⊙ B = [a1 ⊗ b1, a2 ⊗ b2, ..., an ⊗ bn], (16) Ctr,pr,me,ph(i) is the mode-imatricization of tensorCtr,pr,me,ph, and "mode" stands 187 for the index of dimension in tensor data, and "mode-i" means theith dimension of the 188 tensor data. This is a way to matricize the tensor by mapping the elements of the 189 tensor into a matrix by arranging the mode-i fibers (think of it as the higher-order rows190 and columns) so that they become the columns of[Ctr,pr,me,ph](i). The gradient for 191 other hs and ρs can be calculated in a similar way. After taking the gradient, the next192 step is to concatenate all of the gradients into a long vector. All first-order 193 gradient-based methods can be used for optimization, and the nonlinear conjugate 194 gradient method is chosen to solve for the optimal canonical weights [36]. 195 0.8 Sparse Generalized Tensor Canonical Correlation Analysis 196 Common methods for ensuring sparsity in statistical models include using penalty-based197 techniques such as lasso [25] and elastic net [37]. However, applying these methods to198 GTCCA in practice can be computationally intensive, as they require tensor 199 computation on the original covariance tensors, which can potentially cause memory 200 issues. To address this problem, adapting the idea from Turbo-SMT [38], we use the201 biased subsampling method to guarantee sparsity (the details of the algorithm setup are202 shown in Algorithm 1 in S3 Text.). This method is both accurate and efficient, 203 reducing the tensor factorization problem by 1000 times. The core concept of this 204

Method

is to subsample features based on the covariance density value calculated for205 each feature. If a feature is more likely to be connected to other features or phenotypes,206 then it is more likely to be selected in the biased subsampling phase. Details of the 207 calculation of the covariance density are given in S4 Text. 208 In practice, multi-omics data are often high-dimensional, which poses challenges in209 terms of memory and computation when computing covariance tensors. Our preliminary210 findings indicate that in GTCCA optimization, even a covariance tensor of size 211 1000 × 1000 × 1000 requires more than 16 GB of RAM, leading to memory explosion.212 Although SGTCCA reduces the GTCCA memory requirement by feature subsampling,213 it still requires the calculation of covariance density vectors, which also involves the 214 computation of the full covariance tensor. Therefore, to further improve the algorithm215 efficiency, when calculating the covariance density vectors, we used the unbiased feature216 subsampling approach to approximate each covariance tensor. In each subsampling 217 iteration of our method, a small, randomly selected subset of features from each 218 January 18, 2024 10/29 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint molecular profile is used. For these subsets, we calculate the covariance density based219 on the provided correlation structures. This process is repeated across multiple 220 iterations. The final estimated covariance density vector is obtained by averaging the221 covariance densities from all subsamples. As a general rule for setting parameters, a 222 lower subsampling percentage (e.g.,10%) requires a greater number of iterations to 223 achieve reliable results, while a higher subsampling percentage (e.g.,70%) typically 224 requires fewer iterations. 225 To ensure the consistency of canonical weight vectors from each subsample, a 226 portion of features from each dataset are preselected and shared across different 227 subsamples. For instance, for each subsample, we can have8% of the features shared 228 across all subsamples and2% of the distinct features. A general recommendation is to229 set the total percentage of featured subsampled to be between10% and 20% to ensure 230 both computational efficiency. 231 0.9 Network Construction and Pruning 232 After obtaining the canonical weight through the SGTCCA, the next step is to 233 construct an adjacency matrix. The adjacency matrix construction process is the same234 as SmCCNet, and the detailed algorithm is illustrated in Algorithm 2 of S5 Text. The235 general idea is to take the outer product between the concatenated canonical weight and236 itself. However, even though the adjacency matrix is sparse, it may still contain 237 features/nodes that are less associated with other features/phenotypes. Therefore, we238 prune the global network with the PageRank algorithm [39] and the NetSHy network239 summarization score [31]. The original PageRank algorithm is widely used to rank web240 pages according to their importance. Its application to networks (adjacency matrix) is to241 count the number and strength of the edges for each node to determine the importance242 of each node. The NetSHy summarization score is the principal component score that243 takes into account the graph/network topology. We chose the NetSHy summarization244 score rather than the regular principal component analysis to prioritize the contribution245 of nodes with high network connectivity. The network pruning algorithm ensures both a246 high correlation of the summarization score with the phenotype and high network 247 connectivity. Step-by-step details of the algorithm can be found in S5 Text. 248 The algorithm above provides node-wise network pruning, but the pruned 249 sub-networks will be highly densely connected, and it requires edge pruning. Therefore,250 we calculate the between-node Pearson’s correlation and use it to filter out edges that251 connect to two weakly or non-correlated nodes, which improves the final network 252 visualization. 253 0.10 Computational Complexity 254 Suppose there arek omics data, each withd1, d2, ..., dk features, the common 255 subsampling proportion ispc and the distinct subsampling proportion ispd (see S3 Text 256 for details). Define p = pc + pd ∈ [0, 1], then the new tensor canonical correlation 257 analysis defined in this paper has the space complexity ofO(d1d2...dk), while SGTCCA 258 has the space complexity ofO(d1d2...dkpk). The computational cost per iteration for 259 new tensor canonical correlation analysis isO(d1d2...dks), while for SGTCCA it is 260 O(d1d2...dkpks). Since 0 < p < 1, SGTCCA has a more efficient space complexity and261 computational cost than the new TCCA. 262 January 18, 2024 11/29 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint Experiments 263 0.11 Simulation Study 264 F eature IndexData Type 1:20 21:40 41:60 61:80 81:100 101:160 161:220 221:280 281:340 341:1000 Omics 1 Laten t 1* Laten t 2* Laten t 3* Noise Laten t 5* Laten t 8 Laten t 9 Laten t 10 Noise Noise Omics 2 Laten t 1* Laten t 2* Noise Laten t 4* Laten t 6* Laten t 8 Laten t 9 Noise Laten t 11 Noise Omics 3 Laten t 1* Noise Laten t 3* Laten t 4* Laten t 7* Laten t 8 Noise Laten t 10 Laten t 11 Noise Phenot ype Data Latent F actor Composition Phenot ype Laten t 1*; Latent 2*; Latent 3*; Latent 4*; Latent 5*; Latent 6*; Latent 7* F eature IndexData Type 1:20 21:40 41:60 61:80 81:100 101:160 161:220 221:280 281:340 341:1000 Omics 1 Laten t 1* Laten t 2* Laten t 3* Noise Laten t 5* Laten t 8 Laten t 9 Laten t 10 Noise Noise Omics 2 Laten t 1* Laten t 2* Noise Laten t 4* Laten t 6* Laten t 8 Laten t 9 Noise Laten t 11 Noise Omics 3 Laten t 1* Noise Laten t 3* Laten t 4* Laten t 7* Laten t 8 Noise Laten t 10 Laten t 11 Noise Phenot ype Data Latent F actor Composition Phenot ype Laten t 1* F eature IndexData Type 1:20 21:40 41:60 61:80 81:100 101:160 161:220 221:280 281:340 341:1000 Omics 1 Laten t 1* Laten t 2* Laten t 3* Noise Laten t 5* Laten t 8 Laten t 9 Laten t 10 Noise Noise Omics 2 Laten t 1* Laten t 2* Noise Laten t 4* Laten t 6* Laten t 8 Laten t 9 Noise Laten t 11 Noise Omics 3 Laten t 1* Noise Laten t 3* Laten t 4* Laten t 7* Laten t 8 Noise Laten t 10 Laten t 11 Noise Phenot ype Data Latent F actor Composition Phenot ype Laten t 2*; Latent 3*; Latent 4*; Latent 5*; Latent 6*; Latent 7* T able 1. Simulated multi-omics data correlation structure for cases 1-3. Red and "*" mean that features that are simulated with this latent factor are considered signal features. In addition to the existing latent factors and random noise, as shown in the table, additional random noise will be added to all simulated molecular features and phenotype data. The first table is for simulation case 1, where all types of phenotype-specific correlation structures are simulated and considered signal; the second table is for simulation case 2, where only 4-way phenotype-specific correlation structures are simulated and considered signal; the third table is for simulation case 3, where all 3-way, pairwise phenotype-specific correlation structure is simulated and considered signal. We simulated multi-omics data using independent latent factors to represent various265 correlation structures [40] (Table 1). We aimed to compare the performance of 266 SGTCCA-Net with other multi-omics network inference pipelines. Since the comparison267 relies on the adjacency matrix, it is necessary that all methods have an available 268 adjacency matrix. We generated three datasets for molecular profiles, termed omics 1,269 omics 2, and omics 3, along with a quantitative phenotype. To mimic the true 270 multi-omics correlation structure in most scenarios (the strong connection between 271 omics but the weak connection between omics and phenotype), we imposed weaker noise272 on the omics data and stronger noise on the simulated phenotype data, and all the 273 random noise are generated from Gaussian distribution. This simulation study is 274 designed to evaluate the accuracy of different methods in identifying omics features 275 associated with both higher-order and lower-order phenotype-specific correlation 276 structures, and to assess the robustness of the model in handling noisy or highly skewed277 omics data. To this end, we introduced three latent factors simulation strategies: 278 multivariate normal latent factors, highly right-skewed latent factors (Fleishman power279 transformation [41]), and noisy latent factors (multivariate normal distribution with 280 noise). Under each simulation strategy, we proposed three distinct correlation structure281 setups: one with all possible phenotype-specific higher-order/lower-order correlations, 282 one with only phenotype-specific four-way correlations, and one with all 283 phenotype-specific correlation structures except for the 4-way correlations. These signal284 correlation structures are simulated by latent factors (red in Table 1). Furthermore, we285 also simulated nonphenotype-specific correlation structures to interfere with signal 286 feature identification and test the robustness of each method (black in Table 1). 287 The positive "signal" features, intended for model identification, are defined as 288 January 18, 2024 12/29 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint four-way correlations involving all omics and phenotype, multiple three-way correlations289 among different combinations of omics with phenotype, and pairwise correlations 290 between each omics and phenotype. Additionally, the nonphenotype-specific correlations291 are negative "random noise" features to challenge signal feature detection. These 292 include a 3-way correlation among all omics and pairwise correlations among the omics,293 and these correlation structures serve as noise features, which models should ideally 294 overlook. The performance is evaluated at the node level with Area Under the Receiver295 Operating Characteristic Curve (AUC) score. A node is predicted positive (signal) if its296 maximal connection to other nodes in the adjacency matrix passes a certain threshold,297 which is consistent with the SmCCNet evaluation [24]. The AUC score is then 298 calculated checking prediction results across a series of the threshold value. 299 We compare the performance of SGTCCA-Net to other multi-omics network analysis300

Methods

with 25 replications: (1) STCCA-Net with only the higher order 4-way 301 correlation structure;(2) Best SmCCNet AUC score from various combinations of 302 scaling factors and sparsity levels; (3) Best DIABLO AUC score with combinations of303 different levels of sparsity and scaling factors. The parameter setup for SmCCNet and304 DIABLO resulted in a total of 9 models for each method. To ensure a fair comparison305 between methods, we extracted only one set of canonical weights from each method.306 More details on parameter settings can be found in S6 Text. 307 0.12 Multi-Omics Data Analysis 308 In addition to the simulations, we compared the performance of SGTCCA and SmCCA309 on two data sets. To maintain a fair comparison, we kept all subsequent network 310 analysis steps constant, and therefore did not use DIABLO as the network inference for311 that method implements a different pipeline. Therefore, we opt for a modified version of312 SmCCNet (SmCCNet 2.0) [42], ensuring that the network analysis steps align with 313 those of SGTCCA-Net. The best scaling factors and penalty terms selection for 314 SmCCNet are based on 5-fold cross-validation. To be specific, we let the penalty term315 for each molecular profile vary between 0.25, 0.5, and 0.75, and the scaling factor 316 associated with each phenotype-specific pairwise correlation component specific to the317 phenotype varies between 1, 5, and 10, and we evaluate the scaled prediction error for318 each particular combination. In an example of two omics data, if(a1,2, b1, b2) are the 319 scaling factors whereb1 and b2 represent the phenotype-specific scaling factors anda1,2 320 stands for the scaling factor for between-omics correlation, we will scale the scaling 321 factor by δ to ensure a1,2 + b1 + b2 = 1, then the best scaling factors and their 322 associated penalty terms are determined by minimizing: 323 (a1,2, b1, b2)=arg max ˜a1,2,˜b1,˜c2 |trainCC − testCC | |testCC | , (17) where trainCC and testCC stand for total training canonical correlation and total 324 testing canonical correlation. All other steps in the pipeline, including network 325 construction and pruning, were kept unchanged. Specifically, we fix steps (2) through326 (4) of Algorithm 1 and vary the setup for step (1) between SGTCCA and SmCCNet.327 The study aimed to compare the performance of SGTCCA to SmCCA in extracting the328 network associated with the phenotype of interest. 329 0.12.1 TCGA Breast Cancer Network Analysis 330 The Cancer Genome Atlas Program (TCGA) breast invasive carcinoma project includes331 RNA sequencing data with normalized counts obtained through the Illumina HiSeq 332 platform, microRNA (miRNA) expression data of tumor samples obtained through the333 January 18, 2024 13/29 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint Illumina HiSeq platform at the miRgene-level, and log-ratio normalized reverse phase334 protein arrays (RPPA) expression data from tumor samples at the gene level. In this335 experiment, we use tumor purity as the phenotype, defined as the percentage of cancer336 cells in a sample of tumor tissue [43]. After matching subjects with all available 337 molecular profiles and phenotype data, we obtained a cohort of 105 subjects. Because of338 their relatively large number, we filtered genes based on the standard deviation to 339 eliminate genes that exhibit lower variability and only include the top 25% of the genes, 340 resulting in a total of 5039 genes, in addition to the 823 miRNAs, and 175 RPPAs. In341 the data preprocessing step, we regress out age, race, and whether patients received 342 radiation therapy for each of the molecular features to adjust for covariates. 343 Our SGTCCA-Net pipeline assumes the correlation structure of 344 gene-miRNA-RPPA-phenotype, gene-miRNA-phenotype, gene-RPPA-phenotype, 345 miRNA-RPPA-phenotype, and all pairwise molecular profiles with phenotype. Same as346 the simulation study, we set the percentage of common subsampled molecular features347 at 8% and the percentage of distinct molecular features to2% with a total of 10 348 iterations when bias subsampling is performed to guarantee sparsity. We use the 349 network pruning algorithm to prune the multi-omics network and set the 350 minimally/maximally final subnetwork size to 30/300. 351 To evaluate the efficacy of each method, we obtained the NetSHy summarization 352 score for every subnetwork, and correlate the scores with tumor purity. In addition, we353 calculated the correlation of individual molecular features with tumor purity, 354 contrasting the number of strongly and weakly correlated features between both 355 methods. Since one of the advantages of SGTCCA-Net is its ability to extract 356 higher-order correlation among molecular features, we rank all potential molecular 357 feature triplets and pairs in relation to tumor purity based on their respective 358 higher-order correlation values. This is done to investigate how different molecular 359 features simultaneously interact with respect to tumor purity. 360 An enrichment analysis was also performed on the aggregate set of genes and 361 proteins through Metascape (v. 3.5.20230501) [44], with the reference set of all genes362 and protein target genes fed into the SGTCCA-Net pipeline and recognized by 363 Cytoscape (4540). While the aforementioned approach focused solely on genes and 364 proteins, we also extended the enrichment analysis to miRNAs. Using MultiMiR [45],365 we first identified validated target genes of miRNAs in the subnetwork, then treated366 target genes that have a non-zero canonical weight in the global network as the enriched367 set. Enrichment analysis is also performed in Metascape with the same background set368 as described above. 369 0.12.2 COPDGene Network Analysis 370 Phase II COPDGene [46] includes transcriptomics (RNA-Seq), proteomics (SomaLogic371 1.3k), and metabolomics (Metabolon) data [47] for studying chronic obstructive 372 pulmonary disease (COPD). For demonstration, we use a smaller cohort to compare 373 different methods. We construct multi-omics networks with respect to forced expiratory374 volume at one second (FEV1). To ensure that we have roughly 1000 genes to run the375 model, we use standard deviation to filter out genes with less variability (sd < 0.435).376 After filtering subjects with available data for all molecular profiles and the phenotype,377 there were 461 subjects with available data for all molecular profiles, with 972 genes,378 1305 proteins, and 995 metabolites. To adjust for covariates, we regressed out effects379 from sex, age, and clinical center for each molecular feature. 380 Our SGTCCA-Net pipeline assumes the correlation structure of 381 gene-protein-metabolite-phenotype, gene-protein-phenotype, gene-metabolite-phenotype, 382 protein-metabolite-phenotype and all pairwise molecular profiles with phenotype. As in383 the simulation study and the TCGA breast cancer study, we set the percentage of 384 January 18, 2024 14/29 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint common subsampled molecular features at8% and the percentage of distinct molecular 385 features to 2% with a total of 10 iterations when biased subsampling was performed to386 guarantee sparsity. 387 To evaluate the efficacy of each method, we obtained the NetSHy summarization 388 scores for every subnetwork, correlating them with FEV1. In addition, we calculated 389 the correlation of individual molecular features with FEV1, contrasting the number of 390 strongly and weakly correlated features between both methods. Since one of the 391 advantages of SGTCCA-Net is its ability to extract higher-order correlation between 392 molecular features, we rank all potential molecular feature triplets and pairs in relation393 to FEV1 based on their respective higher-order correlation values. This is done to 394 investigate how different molecular features simultaneously interact with respect to 395 FEV1. 396 Enrichment analysis was also performed on the union set of genes and proteins in397 the subnetwork through Metascape (v. 3.5.20230501) [44], with the background set to398 be all genes and protein target genes fed into the SGTCCA-Net pipeline and recognized399 by Cytoscape (1750). While the aforementioned approach focused solely on genes and400 proteins, we also extended the analysis to metabolites by running metabolite enrichment401 analysis on IMPaLa [48] based on the HMDB names of the metabolites in the 402 subnetwork, with the background set to be all metabolites with available HMDB name403 (227). 404 FEV11 is a measurement in liters for each subject, which did not take into account405 body habitus and race. Therefore, we also ran SGTCCA-Net on the FEV1 percent 406 predicted (FEV11PP) to check if the top molecular features identified fromFEV11 are 407 also identified in the FEV11PP network, and if the correlation changes. Additionally, 408 we also ran an enrichment analysis in Metascape to check if the pathways overlapped409 between the two networks. 410

Results

411 0.13 Simulation 412 SGTCCA-Net significantly outperforms the other three methods (SmCCNet, 413 STCCA-Net, and DIABLO) in all simulation study settings and simulated correlation414 structure designs (Table 2), even when SmCCNet and DIABLO use the parameter setup415 that performs best in each iteration. In case 1, where all types of phenotype-specific 416 correlation structures are presented and in case 3, when the 4-way correlation structure417 is removed, the performance of SGTCCA-Net improves by around10% compared to the 418 best-performing results from SmCCNet and DIABLO, and the level of improvement 419 increases as the number of subjects increases. SGTCCA-Net achieves high signal feature420 identification accuracy even with only 100 subjects in the presence and absence of 421 different phenotype-specific correlation structures and provides nearly-perfect prediction422 when the number of subjects doubles, outperforming the other methods by around15% 423 to 20% in AUC. In case 2, when there is only one type of phenotype-specific correlation424 structure (4-way correlation), SGTCCA-Net, STCCA-Net, and best-performing 425 DIABLO perform equally well with perfect or near-perfect AUC scores. 426 When the underlying latent factors are highly right-skewed and omics data are noisy,427 the performance of all models are impacted, but SGTCCA-Net still outperforms the 428 other methods. This indicates that even when the signal is masked by noise to some429 extent, or normality is not guaranteed, SGTCCA-Net still performs well compared to430 other methods, and its performance improves as the sample size increases. 431 January 18, 2024 15/29 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint Normal Latent F actor: Median AUC (Interquartile Range) Metho d n SGTCCA-Net STCCA-Net SmCCNet (Best) DIABLO (Best) Case 1 n = 100 0.885 (0.859, 0.906) 0.656 (0.604, 0.680) 0.706 (0.689, 0.736) 0.745 (0.718, 0.780) n = 200 0.978 (0.966, 0.986) 0.694 (0.614, 0.738) 0.742 (0.720, 0.818) 0.825 (0.754, 0.880) Case 2 n = 100 1.000 (1.000, 1.000) 0.995 (0.955, 1.000) 0.825 (0.779, 0.853) 0.931 (0.920, 0.954) n = 200 1.000 (1.000, 1.000) 1.000 (1.000, 1.000) 0.823 (0.811, 0.837) 0.939 (0.931, 1.000) Case 3 n = 100 0.890 (0.864, 0.942) 0.622 (0.583, 0.701) 0.710 (0.679, 0.740) 0.701 (0.670, 0.757) n = 200 0.977 (0.965, 0.983) 0.679 (0.635, 0.711) 0.766 (0.726, 0.816) 0.814 (0.786, 0.891) Sk ewed Latent F actor: Median AUC (Interquartile Range) Metho d n SGTCCA-Net STCCA-Net SmCCNet (Best) DIABLO (Best) Case 1 n = 100 0.845 (0.814, 0.884) 0.686 (0.603, 0.730) 0.705 (0.664, 0.722) 0.713 (0.675, 0.768) n = 200 0.935 (0.921, 0.954) 0.729 (0.675, 0.788) 0.756 (0.726, 0.804) 0.847 (0.813, 0.887) Case 2 n = 100 1.000 (1.000, 1.000) 1.000 (0.980, 1.000) 0.824 (0.780, 0.846) 0.939 (0.932, 1.000) n = 200 1.000 (1.000, 1.000) 1.000 (1.000, 1.000) 0.817 (0.770, 0.855) 0.949 (0.933, 1.000) Case 3 n = 100 0.855 (0.799, 0.891) 0.677 (0.614, 0.719) 0.714 (0.681, 0.748) 0.725 (0.651, 0.777) n = 200 0.936 (0.925, 0.949) 0.755 (0.678, 0.795) 0.712 (0.687, 0.735) 0.803 (0.738, 0.839) Noisy Omics Data: Median AUC (Interquartile Range) Metho d n SGTCCA-Net STCCA-Net SmCCNet (Best) DIABLO (Best) Case 1 n = 100 0.776 (0.761, 0.805) 0.578 (0.554, 0.595) 0.702 (0.677, 0.724) 0.703 (0.667, 0.780) n = 200 0.909 (0.892, 0.919) 0.603 (0.579, 0.638) 0.785 (0.748, 0.801) 0.813 (0.766, 0.920) Case 2 n = 100 0.998 (0.996, 1.000) 0.777 (0.666, 0.880) 0.839 (0.789, 0.853) 1.000 (0.918, 1.000) n = 200 1.000 (1.000, 1.000) 0.921 (0.822, 0.959) 0.832 (0.815, 0.846) 1.000 (0.936, 1.000) Case 3 n = 100 0.781 (0.746, 0.809) 0.556 (0.538, 0.614) 0.710 (0.678, 0.729) 0.701 (0.670, 0.757) n = 200 0.916 (0.888, 0.936) 0.583 (0.565, 0.609) 0.782 (0.745, 0.798) 0.799 (0.778, 0.876) T able 2. Simulation results. Performance is evaluated through the AUC of the precision-recall curve generated by applying different thresholds to the maximal connection of molecular features to each other. For this simulation, 20 replications are the AUC median and interquartile range in parenthesis of is reported. "Best" AUC for SmCCNet and DIABLO denotes that in each replication, 9 SmCCNet/DIABLO models are run and only the highest AUC score is recorded. The first table is the simulation study for setting 1, which uses latent factors simulated from multivariate normal distribution; the second panel is the simulation study for setting 2, where latent factors are simulated with a highly right-skewed distribution; the third panel is the simulation study for setting 3, where latent factors are simulated with a multivariate normal distribution (same as case 1), but strong random noise is enforced on omics data. Each simulation study contains 3 cases: Case 1 means that the signal molecular features are defined as all 4-way, 3-way, and pairwise phenotype-specific correlation structure; case 2 removes the phenotype-specific 4-way correlation structure; case 3 removes the phenotype-specific 3-way and pairwise correlation structure. In each case, the data is simulated with 100 or 200 subjects. 0.14 TCGA Breast Cancer Network Analysis 432 After running a network analysis pipeline with SGTCCA and SmCCNet for the 433 phenotype of tumor purity, the final network size for both SGTCCA-Net and SmCCNet434 is 300. Specifically, there are 128 genes, 125 miRNAs, and 47 proteins in the SGTCCA435 network, while there are 128 genes, 59 miRNAs, and 113 proteins in the SmCCNet 436 network. The NetSHy network summarization score correlation with respect to tumor437 purity is 0.689 and 0.500 for SGTCCA-Net and SmCCNet respectively. In the 438 SGTCCA-Net network, there are 55 molecular features with weak correlation with 439 respect to tumor purity (FDR-adjusted > 0.05), while for SmCCNet there are 105 such440 molecular features. To check whether each method includes these nodes incorrectly or441 whether they contribute to the network, we ran the NetSHy network summarization 442 score on these weak nodes only, extracted the first 10 PCs, and calculated the maximum443 correlation between the PC score and tumor purity. For SGTCCA-Net, the maximum444 correlation is -0.232 (p = 0.017), while for SmCCNet, the maximum correlation is only445 -0.176 (p = 0.072). This implies that weak nodes within the SGTCCA-Net network are446 more significantly associated with tumor purity compared to those in the SmCCNet 447 Jan uary 18, 2024 16/29 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint Type Name Correlation FDR Gene (128) TMC8 -0.712 <0.001 ARHGAP9 -0.752 <0.001 SPIB -0.638 <0.001 HCST -0.712 <0.001 IRF8 -0.748 <0.001 miRNA (125) hsa-mir-142 -0.564 <0.001 hsa-mir-146a -0.560 <0.001 hsa-mir-150 -0.593 <0.001 hsa-mir-155 -0.534 <0.001 hsa-let-7i -0.572 <0.001 Protein (47) PCNA -0.564 <0.001 CCNB1 -0.409 <0.001 NRG1 -0.466 <0.001 KAT2A 0.324 0.001 ACVRL1 0.297 0.003 T able 3. Top 5 molecular features from each molecular profile and their individual correlation with respect to tumor purity for TCGA breast cancer data (with p-value). network. Furthermore, a total of 139 molecular features are shared between the 448 SGTCCA-Net and SmCCNet methods, with each method having 161 unique molecular449 features. Of the 161 unique molecular features in each network, 125 and 66 are 450 significantly associated with tumor purity (FDR < 0.05) for SGTCCA-Net and 451 SmCCNet, respectively. 452 We calculate the PageRank score for the SGTCCA-Net subnetwork and present the453 top 5 nodes for each molecular profile in Table 3. We observed that all of the molecular454 features in the table are significantly correlated with tumor purity.ACVRL1 and 455 KAT2Aare the only two nodes with low correlation to tumor purity (FDR > 0.001). To456 check whether these nodes contribute to the network, we examine the correlation 457 between these two nodes and other molecular features of the subnetwork. We observed458 that ACVRL1 is highly correlated withPCNA (ρbetween−omics = 0.474), which is highly 459 correlated with tumor purity (ρomics−pheno = −0.564) and KAT2A 460 (ρbetween−omics = 0.476), which is highly correlated with tumor purity 461 (ρomics−pheno = −0.564). 462 The top 5 ontologies from the enrichment analysis for the SGTCCA-Net final 463 subnetwork (Figure 3) includes lymphocyte activation (49/186 genes enriched), positive464 regulation of immune response (45/204 genes enriched), adaptive immune response 465 (38/145 genes enriched), regulation of leukocyte activation (46/227 genes enriched), and466 Adaptive Immune System (37/178 genes enriched). Compared to enrichment analysis467

Result

from SmCCNet, we observe that the ontology results from both networks fall into468 similar categories. However, when only looking at the immune response-related 469 ontologies, a stronger enrichment analysis result is identified in SGTCCA-Net network470 (S1 Fig). Additionally, using MCODE, we observed 6 modules of the protein-protein 471 interaction network as shown in Fig 3b. For example, in the first network module (red),472 one of the proteins isCD3D, which has prognostic potential for breast cancer and 473 associated with lymphocyte infiltration and immune checkpoints [49]. 474 We ran MultiMiR on the miRNAs from each subnetwork. For the SGTCCA-Net 475 network, among 1288 global network genes with non-zero canonical weight, 179/697 of476 them are the validated/predicted target genes of miRNAs in the subnetwork; while for477 the SmCCNet network, among 1024 global network genes with non-zero canonical 478 January 18, 2024 17/29 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint 0 5 10 15 20 25 30 -log10(P) hsa04650: Natural killer cell mediated cytotoxicity hsa01521: EGFR tyrosine kinase inhibitor resistance GO:0006974: DNA damage response GO:0002683: negative regulation of immune system process hsa05169: Epstein-Barr virus infection M141: PID PI3KCI PATHWAY WP4255: Non-small cell lung cancer R-HSA-5663202: Diseases of signal transduction by growth factor receptors and second messengers hsa05340: Primary immunodeficiency R-HSA-1280215: Cytokine Signaling in Immune system GO:0045058: T cell selection GO:0050854: regulation of antigen receptor-mediated signaling pathway GO:0042113: B cell activation M34: PID TCR PATHWAY GO:0001819: positive regulation of cytokine production R-HSA-1280218: Adaptive Immune System GO:0002694: regulation of leukocyte activation GO:0002250: adaptive immune response GO:0050778: positive regulation of immune response GO:0046649: lymphocyte activation (a) (b) Fig 3. Enrichment analysis results for TCGA breast cancer data with respect to tumor purity . (a) The top pathways that are associated with the final network. (b) Protein-protein interaction (PPI) network for the multi-omics network from SGTCCA-Net with respect to tumor purity colored by clusters. Clusters are generated based on the Molecular Complex Detection (MCODE) algorithm. weight, only 102/212 of them are the validated/predicted target genes of miRNAs in the479 subnetwork. This result suggests that more connections between genes and miRNAs are480 found in the SGTCCA-Net network compared to the SmCCNet network. The 481 enrichment analysis based on the 179 validated targets of SGTCCA-Net subnetworks (S2482 Fig) shows that the ontologies fall into similar categories compared to the gene/protein483 enrichment analysis, including lymphocyte differentiation and adaptive immune system.484 We examined the top combinations (4/3-way correlation) and found that the top 485 4-way correlation implies the top 3-way correlation in some cases (S2 Table). For 486 example, it was observed thathsa-mir-3133 and KDR dominate the 4-way relationship 487 between gene, miRNA, protein, and tumor purity, which also implies the 3-way 488 correlation between hsa-mir-3133, KDR, and tumor purity. Interestingly,BEND4, one 489 of the genes present in the top 4-way correlation, is shown to be the predicted target of490 hsa-mir-3133 with a predicted score of 0.728, andKDR is also the predicted target of 491 January 18, 2024 18/29 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint hsa-mir-3133 with a predicted score of 0.572. Through examining the 3-way correlation492 we can find that even though the top 3-way correlation could imply the 4-way 493 correlation, the signal is generally weaker. For instance, it was found thathsa-mir-142 494 dominates the gene-miRNA-tumor purity correlation,PCNA dominates both the 495 gene-protein-tumor purity correlation and the miRNA-protein-tumor purity correlation,496 and CCNB1 dominates the miRNA-protein-tumor purity correlation (details in S2 497 Table). However, it was observed that the highest 4-way correlation involved in 498 hsa-mir-142 is 2.405 (rank 94/631,680 across all 4-way correlation), inPCNA is 2.615 499 (rank 59/631,680 across all 4-way correlation), inCCNB1 is 2.093 (rank 191/631,680 500 across all 4-way correlation). A particular example is the 3-way correlation that involves501 CCNB1 and hsa-mir-150, which is observed in the top 10 3-way correlation between 502 miRNAs, proteins and tumor purity, but the highest 4-way correlation involves these503 two molecular features is only 1.914 (rank 398/631,680). This suggests that some 3-way504 correlation combinations may be ignored by the 4-way correlation, and indicates the 505 importance of incorporating the 3-way correlation into the model. 506 Lastly, the PageRank score for each molecular feature in the final subnetwork was507 calculated, and the top 10 molecular features from each molecular profile were selected508 for network visualization in Figure 4. It was noticed that higher connectivity occurs in509 proteins PCNA and CCNB1 (both appear in Table 3 and S2 Table), and both of them510 are observed in the top 3-way miRNA-protein-tumor purity correlation and found as511 prognostic biomarkers for breast cancer [50,51]. 512 Fig 4. SGTCCA-Net network with top 10 molecular features from each molecular profile. Multi-omics network module for TCGA breast cancer data with respect to tumor purity. Nodes are genes (red), miRNAs (blue) and RPPAs (green). The edge color denotes positive correlation (red) or negative correlation (blue) between molecular features with the width denoting the strength of the connection. Edges are filtered based on Pearson correlation with a threshold of 0.2. January 18, 2024 19/29 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint Type Name Correlation FDR Gene (68) HK3 -0.177 0.004 MGAM2 -0.123 0.019 S100A9 -0.138 0.013 BCL6 -0.121 0.029 IRAK3 -0.134 0.013 Protein (43) Matrix metalloproteinase-9 -0.149 0.012 calgranulin B -0.122 0.029 Complement component C9 -0.163 0.007 Kallistatin 0.158 0.009 Neutrophil gelatinase-associated lipocalin -0.095 0.060 Metabolite (59) 1-oleoylglycerol (18:1) 0.114 0.028 1-stearoyl-2-oleoyl-GPI (18:0/18:1)* 0.094 0.062 1-stearoyl-2-linoleoyl-GPI (18:0/18:2) 0.143 0.013 1-linoleoyl-GPA (18:2)* 0.103 0.042 retinol (Vitamin A) 0.093 0.064 T able 4. Top 5 molecular features from each molecular profile for SGTCCA-Net based on the PageRank and their individual correlation with respect to FEV1 for COPDGene data (with p-value). 0.15 COPDGene Network Analysis 513 After running a network analysis pipeline with SGTCCA-Net and SmCCNet for the 514 FEV1 phenotype, the final network size for SGTCCA-Net is 170, while SmCCNet has a515 final network size of 117. The NetSHy summarization score for SGTCCA-Net is 516 correlated with FEV1 with a correlation of 0.211, while for SmCCNet it is only 0.162.517 In the SGTCCA-Net network, there are 58 molecular features with weak correlation 518 with respect to FEV1 (FDR > 0.05), while for SmCCNet, there are 82 such molecular519 features. To check whether each method includes these nodes incorrectly or whether 520 they contribute to the network, we extracted the NetSHy summarization score on these521 weak nodes only, extracted the first 3 PCs, and calculated the maximum correlation522 between PC score and FEV1. For SGTCCA-Net, the maximum correlation is -0.159 523 (p < 10−3), while for SmCCNet, the maximum correlation is only 0.090 (p = 0.053). 524 This implies that the weak nodes included in SGTCCA-Net are shown to be more 525 related to the phenotype than the weak nodes included in SmCCNet. Furthermore, a526 total of 29 molecular features are shared between the SGTCCA-Net and SmCCNet 527 methods, with each method having 141 and 88 distinct molecular features respectively.528 Of the 141 unique molecular features in the SGTCCA-Net network, 89 of them are 529 significantly associated withFEV1 (FDR < 0.05), while of 88 unique molecular features530 in the SmCCNet network, only 14 of them are significantly associated with FEV1 (FDR531 < 0.05). In particular,Troponin T (ρ = −0.278, p < 10−3) and C-reactive protein 532 (ρ = −0.190, p < 10−3) have a strong correlation with respect to FEV1 and are not 533 identified by SmCCNet, and studies have shown that elevatedTroponin T levels during 534 exacerbation are associated with death in patients with COPD [52,53], and more 535 validation will need to be conducted to evaluate whether it is related to FEV1 at the 536 stable state. 537 We calculate the PageRank score for the SGTCCA-Net subnetwork and present the538 top 5 nodes for each molecular profile in Table 4. We observed that most of the 539 molecular features in the table are significantly correlated with FEV1 (FDR < 0.05),540 with exceptions such asNeutrophil gelatinase-associated lipocalin and 541 1-stearoyl-2-oleoyl-GPI (18:0/18:1)*. To check whether these nodes contribute to the 542 network, we examined the correlation between these two nodes and other molecular 543 January 18, 2024 20/29 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint features of the subnetwork and observed thatNeutrophil gelatinase-associated lipocalin 544 is highly correlated withN-acetylneuraminate (ρbetween−omics = 0.382), which is highly 545 correlated to FEV1 (ρomics−pheno = −0.134) and 1-stearoyl-2-oleoyl-GPI (18:0/18:1)* 546 is highly correlated with s1-(1-enyl-stearoyl)-GPE (P-18:0)* (ρbetween−omics = 0.509), 547 which is highly correlated to FEV1 (ρomics−pheno = 0.151). 548 Pathways such as regulation of TLR by endogenous ligand (7/13 genes enriched) and549 neutrophil degranulation (23/102 genes enriched) are strongly associated with the 550 severity of COPD from the enrichment analysis (Figure 5a) . Toll-like receptor (TLR) 2551 is elevated in monocytes from individuals with COPD [54], and increased degranulation552 in COPD is found to increase on the surface of unstimulated neutrophils in patients553 with COPD and can cause additional airway damage in patients with COPD [55]. In554 particular, when checking the correlation between the network and cell type, we found555 that NetSHy PC1 is highly correlated with the percentage of neutrophil cells 556 (ρ = −0.767), which may explain the significance of the neutrophil degranulation 557 pathway. However, we did not regress out cell type composition in advance as they can558 reflect disease state. 559 We ran the enrichment analysis on the metabolites of the SGTCCA-Net subnetwork560 and found that most of the top pathways are associated with 561 phospholipid/glyerophospholipid and bile acid (S4 Table). Compared to non-smokers, 562 current smokers have decreased phospholipid levels regardless of COPD status [56]. 563 Figure 5b shows that there are three groups of protein-protein interaction based on564 MCODE, and the largest group is centered on the group with SAA1 and MMP9 (red)565 with the enrichment score of 1.5. SAA1 is related to the acute exacerbation of chronic566 obstructive pulmonary disease [57], and higher levels of MMP-9 correspond to a higher567 influx of neutrophils and lymphocytes, signaling an exacerbation of COPD in which a568 higher burden of MMP-9 is observed in the airways [58]. 569 We examined the top combinations (4/3-way correlation) and found that the top 570 4-way correlation implies the top 3-way correlation in some cases (S3 Table). For 571 instance, it was observed thatMTCO1P12, Tyrosine-protein kinase Yes, and 572 C-glycosyltryptophan dominate the 4-way relationship between gene, protein, metabolite,573 and FEV1, which also implies the 3-way correlation betweenTyrosine-protein kinase 574 Yes (the protein product ofYES1), C-glycosyltryptophan and FEV1. Through 575 examining the 3-way correlation we can find that even though the top 3-way correlation576 could imply the 4-way correlation but the signal is generally weaker. For instance, it577 was found thatMatrix metalloproteinase-9 dominates the gene-protein-fev1 correlation, 578 1-stearoyl-2-oleoyl-GPI (18:0/18:1)* dominates the gene-metabolite-fev1 correlation, 579 and C-glycosyltryptophan and iminodiacetate (IDA) dominate the 580 protein-metabolite-fev1 correlation (details in S3 Table). However, it was observed that581 the highest 4-way correlation involved inMatrix metalloproteinase-9 is 0.431 (rank 582 233/172,516 across all 4-way correlation), iniminodiacetate (IDA) is 0.580 (rank 583 30/172,516 across all 4-way correlation). A particular example is the 3-way correlation584 that involvesMatrix metalloproteinase-9 and other genes, we found that all top 10 585 3-way correlation between gene, protein, and FEV1 involve this protein, but only 2 of 586 the genes are shown in the top 1000 4-way correlation along withMatrix 587 metalloproteinase-9. This suggests that some 3-way correlation combinations may be 588 ignored by the 4-way correlation, and indicates the importance of incorporating the 589 3-way correlation into the model. 590 Furthermore, the PageRank score for each molecular feature in the final subnetwork591 was calculated, and the top 10 molecular features from each molecular profile were 592 selected for visualization of the network in Fig. 6. In particular, the C-reactive protein593 has been shown to be associated with poor lung function [59]; the expression level of594 S100A9 is elevated in patients with COPD, which triggers neutrophil degranulation and595 January 18, 2024 21/29 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint 0 1 2 3 4 5 -log10(P) GO:0030574: collagen catabolic process hsa00500: Starch and sucrose metabolism GO:0006957: complement activation, alternative pathway R-HSA-6803157: Antimicrobial peptides GO:1901570: fatty acid derivative biosynthetic process GO:0002523: leukocyte migration involved in inflammatory response GO:0045087: innate immune response GO:0006953: acute-phase response R-HSA-1222556: ROS and RNS production in phagocytes GO:0001818: negative regulation of cytokine production R-HSA-6798695: Neutrophil degranulation R-HSA-5686938: Regulation of TLR by endogenous ligand (a) (b) Fig 5. Enrichment analysis results for COPDGene data with respect to FEV1. (a) The top pathways that are associated with the final network. (b) Protein-protein interaction (PPI) network for the multi-omics network from SGTCCA-Net with respect to FEV1 colored by clusters. Clusters are generated based on the Molecular Complex Detection (MCODE) algorithm. releases inflammatory and proteolytic enzymes [60]. 596 We also repeated the analysis on FEV1 percent predicted (FEV11PP), which takes 597 into account body habitus and race. After applying the SGTCCA-Net toFEV11PP, we 598 identified a final subnetwork with 43 features. In particular, 16 of these features exhibit599 a strong correlation withFEV11PP (with |ρ| > 0.15), and are all present in theFEV11 600 final subnetwork. This significant overlap between the FEV11 and FEV11PP 601 subnetworks suggests a consistent molecular pattern that influences both metrics. 602 Furthermore, Metascape enrichment analysis highlighted similar patterns in both 603 FEV11 and FEV11PP networks, with shared pathways including acute-phase response,604 inflammatory response, and neutrophil degranulation (S3 Fig). Interestingly, these 605 molecular features generally show a higher correlation withFEV11PP than withFEV11 606 (detailed in S5 Table). This observation demonstrates that the associations between 607 top-correlated molecular features and pulmonary function remain robust, unaffected by608 variations in body habitus and race. 609 January 18, 2024 22/29 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint Fig 6. SGTCCA-Net network with top 10 molecular features from each molecular profile. Multi-omics network module for COPDGene data with respect to FEV 1. The red nodes stand for genes, the purple nodes stand for proteins, and the black nodes stand for metabolites. The width and color depth of the edge stands for the strength of the connection between two molecular features and the type of color stands for whether two nodes are positively correlated (red) or negatively correlated (blue). Edges are filtered based on Pearson correlation with a threshold of 0.2.

Discussion

610 In this work, we introduce a new multi-omics network analysis pipeline, Sparse 611 Generalized Tensor Canonical Correlation Analysis Network Inference (SGTCCA-Net),612 designed for biomarker identification through both higher-order and lower-order 613 correlations, thereby creating molecular interaction networks pertinent to specific 614 phenotypes. We propose a new approach, Sparse Generalized Tensor Canonical 615 Correlation Analysis (SGTCCA), which extends Tensor Canonical Correlation Analysis616 to accommodate multiple correlation structures at once, ensuring sparsity via the biased617 subsampling technique. Additionally, we bring forth a novel network analysis and 618 network pruning algorithm to derive multi-omics subnetwork modules associated with619 the phenotype of interest. 620 Contrary to existing CCA-based multi-omics network inference approaches like 621 SmCCNet and DIABLO, which only examine pairwise correlations, SGTCCA-Net 622 encapsulates both higher and lower-order correlations. Our simulation analyses 623 demonstrate that SGTCCA-Net is adept at pinpointing features with diverse 624 phenotype-specific correlation structures, thereby surpassing the other two CCA-based625

Methods

across varied simulation settings. 626 Our real data analysis highlights the ability of SGTCCA-Net to identify molecular627 features for diseases such as cancer and chronic obstructive pulmonary disease (COPD).628 Interestingly, the biomarkers identified are not limited to those with strong phenotype629 January 18, 2024 23/29 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint correlations; we also identified molecular features with weak phenotype correlations that630 are highly connected to other nodes and contribute to the network. Furthermore, we631 validated our findings using various techniques, including enrichment analysis of 632 biological pathways and miRNA-target relationships. These validations confirm the 633 enhanced performance of SGTCCA-Net over SmCCNet and the ability to find relevant634 pathways. Furthermore, we also demonstrate the top higher-order correlation associated635 with the final subnetwork. This study showcases the potential for the integration of 636 multi-omics data to improve our understanding of complex diseases, such as COPD and637 breast cancer, and highlights the utility of SGTCCA-Net as a powerful tool for 638 analyzing such data. 639 Beyond its primary use in multi-omics network inference, SGTCCA has broader 640 applications, and can be useful for multi-view dimensional reduction tasks. It can be641 used to project omics data into a shared lower-dimensional space, suitable for both 642 supervised tasks, such as disease classification, and unsupervised tasks, such as 643 subtyping. 644 Supporting information 645 S1 Fig. Enrichment analysis ontology results based on the joint set of 646 gene/protein for TCGA breast cancer data and COPDGene data for 647 SmCCNet subnetwork. The top pathways that are associated with the final network648 of SmCCNet based on Metascape. a: Top ontology pathways for TCGA breast cancer649 data with respect to tumor purity; b: Top ontology pathways for COPDGene data with650 respect to FEV1. 651 S2 Fig. Results of the enrichment analysis ontology based on the validated 652 miRNA target genes for the TCGA breast cancer data for the 653 SGTCCA-Net subnetwork. Top ontology pathways based on validated miRNA 654 target genes for the breast cancer data from TCGA with respect to tumor purity. 655 S3 Fig. Enrichment Analysis Results of SGTCCA-Net for FEV1 Percent 656 Predicted: This figure focused on pathways enriched in the joint gene/protein set 657 derived from the COPDGene data, specifically within the SGTCCA-Net subnetwork 658 associated withFEV1 Percent Predicted. The identified pathways provide insights into659 the molecular mechanisms linked to pulmonary function as measured byFEV1 Percent 660 Predicted. 661 S1 T ext. Higher-order Covariance T ensor for both Odd and Even Number 662 of Views 663 S2 T ext. Proof of Theorem 1 664 S3 T ext. Sparse Generalized T ensor Canonical Correlation Analysis 665 (SGTCCA) Algorithm 666 S4 T ext. Biased Subsampling Covariance Density 667 S5 T ext. Network Analysis and Network Pruning Algorithm 668 S6 T ext. Additional Detail of Simulation Study Setup 669 January 18, 2024 24/29 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint S1 T able. Simulation Setup T able. Summary table of all simulation scenarios. 670 Starting with 3 omics data simulation settings: (1) normal latent factors, (2) highly 671 right-skewed latent factors, and (3) noisy omics data. Each of these settings has 3 cases:672 case 1 with all phenotype-specific correlation structures; case 2 with only 4-way 673 phenotype-specific correlation structure; and (3) case 3 with 3-way and pairwise 674 phenotype-specific correlation structure. All these cases will be evaluated based on 675 sample sizes of 100 and 200.Σ represents a diagonal matrix with diagonal values to be676 1. 677 S2 T able. T op Network Higher-order Correlation for TCGA Breast Cancer 678 Network. Top 5 network higher-order correlation of breast cancer data of different 679 correlation structures. 680 S3 T able. T op network higher-order correlation for the COPDGene 681 network. Top 5 network higher-order correlation of COPDGene data of different 682 correlation structures. 683 S4 T able. T op Metabolite Enrichment Pathways. Top 10 pathways of 684 metabolite enrichment analysis result from IMPaLa for SGTCCA-Net subnetwork. 685 S5 T able. Overlap Molecular F eatures between FEV1 and FEV1 Percent 686 Predicted Network Molecular features overlap between FEV1 and FEV1 percent 687 predicted network generated from SGTCCA-Net (where absolute correlation with 688 phenotype >0.15). 689 Implementation 690 The model, simulations, and real data experiment were implemented using R version691 4.2.2. All code and scripts used in this study are publicly available via GitHub at 692 https://github.com/liux4283/SparseGTCCANet. 693 A vailability 694 The TCGA breast cancer data used in the real data experiment section is available at:695 http://linkedomics.org/data_download/TCGA-BRCA/. Clinical data and 696 SOMAScan data are available through COPDGene 697 (https://www.ncbi.nlm.nih.gov/gap/, ID: phs000179.v6.p2). RNA-Seq data are 698 available through dbGaP (https://www.ncbi.nlm.nih.gov/gap/, ID: phs000765.v3.p2). 699 Metabolon data is available at Metabolomics Workbench 700 (https://www.metabolomicsworkbench.org/ ID: PR000907). The source code for 701 SGTCCA-Net is available at https://github.com/liux4283/SparseGTCCANet. 702 Acknowledgments 703 We thank all members of the NetCO team for all the insightful feedback on this novel704 pipeline. This work is also supported in part by funds from the National Heart, Lung,705 and Blood Institute, National Institutes of Health (R01 HL152735 and TransOmics for706 Precision Medicines Fellowship). 707 This work was also supported by NHLBI grants U01 HL089897 and U01 HL089856708 and by NIH contract 75N92023D00011. The COPDGene study (NCT00608764) has also709 January 18, 2024 25/29 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint been supported by the COPD Foundation through contributions made to an Industry710 Advisory Committee that has included AstraZeneca, Bayer Pharmaceuticals, 711 Boehringer-Ingelheim, Genentech, GlaxoSmithKline, Novartis, Pfizer, and Sunovion. 712

References

1. Xu C, Tao D, Xu C. A survey on multi-view learning. arXiv preprint arXiv:13045634. 2013;. 2. Bickel S, Scheffer T. Multi-view clustering. In: ICDM. vol. 4; 2004. p. 19–26. 3. Kalayeh MM, Idrees H, Shah M. NMF-KNN: Image annotation using weighted multi-view non-negative matrix factorization. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2014. p. 184–191. 4. Rappoport N, Shamir R. Multi-omic and multi-view clustering algorithms: review and cancer benchmark. Nucleic acids research. 2018;46(20):10546–10562. 5. Nicora G, Vitali F, Dagliati A, Geifman N, Bellazzi R. Integrated multi-omics analyses in oncology: a review of machine learning methods and tools. Frontiers in oncology. 2020;10:1030. 6. Subramanian I, Verma S, Kumar S, Jere A, Anamika K. Multi-omics data integration, interpretation, and its application. Bioinformatics and biology insights. 2020;14:1177932219899051. 7. Gilchrist A, Au CE, Hiding J, Bell AW, Fernandez-Rodriguez J, Lesimple S, et al. Quantitative proteomics analysis of the secretory pathway. Cell. 2006;127(6):1265–1281. 8. Monteiro M, Carvalho M, Bastos M, Guedes de Pinho P. Metabolomics analysis for biomarker discovery: advances and challenges. Current medicinal chemistry. 2013;20(2):257–271. 9. Picard M, Scott-Boyer MP, Bodein A, Périn O, Droit A. Integration strategies of multi-omics data for machine learning analysis. Computational and Structural Biotechnology Journal. 2021;19:3735–3746. 10. Menyhárt O, Győrffy B. Multi-omics approaches in cancer research with applications in tumor subtyping, prognosis, and diagnosis. Computational and structural biotechnology journal. 2021;19:949–960. 11. Duan R, Gao L, Gao Y, Hu Y, Xu H, Huang M, et al. Evaluation and comparison of multi-omics data integration methods for cancer subtyping. PLoS computational biology. 2021;17(8):e1009224. 12. Lin E, Lane HY. Machine learning and systems genomics approaches for multi-omics data. Biomarker research. 2017;5(1):1–6. 13. Pierre-Jean M, Deleuze JF, Le Floch E, Mauger F. Clustering and variable selection evaluation of 13 unsupervised methods for multi-omics data integration. Briefings in bioinformatics. 2020;21(6):2011–2030. 14. Jiang D, Armour CR, Hu C, Mei M, Tian C, Sharpton TJ, et al. Microbiome multi-omics network analysis: statistical considerations, limitations, and opportunities. Frontiers in genetics. 2019;10:995. January 18, 2024 26/29 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint 15. Zhou G, Li S, Xia J. Network-based approaches for multi-omics integration. Computational Methods and Data Analysis for Metabolomics. 2020; p. 469–487. 16. Wu C, Zhou F, Ren J, Li X, Jiang Y, Ma S. A selective review of multi-level omics data integration using variable selection. High-throughput. 2019;8(1):4. 17. Yan J, Risacher SL, Shen L, Saykin AJ. Network approaches to systems biology analysis of complex disease: integrative methods for multi-omics data. Briefings in bioinformatics. 2018;19(6):1370–1381. 18. Liu J, Wang C, Gao J, Han J. Multi-view clustering via joint nonnegative matrix factorization. In: Proceedings of the 2013 SIAM International Conference on Data Mining. SIAM; 2013. p. 252–260. 19. Zhang C, Fu H, Hu Q, Zhu P, Cao X. Flexible multi-view dimensionality co-reduction. IEEE Transactions on Image Processing. 2016;26(2):648–659. 20. Hotelling H. Relations between two sets of variates. In: Breakthroughs in statistics. Springer; 1992. p. 162–190. 21. Witten DM, Tibshirani RJ. Extensions of sparse canonical correlation analysis with applications to genomic data. Statistical applications in genetics and molecular biology. 2009;8(1). 22. Haury AC, Mordelet F, Vera-Licona P, Vert JP. TIGRESS: trustful inference of gene regulation using stability selection. BMC systems biology. 2012;6(1):1–17. 23. Huynh-Thu VA, Irrthum A, Wehenkel L, Geurts P. Inferring regulatory networks from expression data using tree-based methods. PloS one. 2010;5(9):e12776. 24. Shi WJ, Zhuang Y, Russell PH, Hobbs BD, Parker MM, Castaldi PJ, et al. Unsupervised discovery of phenotype-specific multi-omics networks. Bioinformatics. 2019;35(21):4336–4343. 25. Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological). 1996;58(1):267–288. 26. Mastej E, Gillenwater L, Zhuang Y, Pratte KA, Bowler RP, Kechris K. Identifying Protein–metabolite Networks Associated with COPD Phenotypes. Metabolites. 2020;10(4):124. 27. Singh A, Shannon CP, Gautier B, Rohart F, Vacher M, Tebbutt SJ, et al. DIABLO: an integrative approach for identifying key molecular drivers from multi-omics assays. Bioinformatics. 2019;35(17):3055–3062. 28. Luo Y, Tao D, Ramamohanarao K, Xu C, Wen Y. Tensor canonical correlation analysis for multi-view dimension reduction. IEEE transactions on Knowledge and Data Engineering. 2015;27(11):3111–3124. 29. Wong HS, Wang L, Chan R, Zeng T. Deep Tensor CCA for Multi-view Learning. arXiv preprint arXiv:200511914. 2020;. 30. Page L, Brin S, Motwani R, Winograd T. The PageRank citation ranking: Bringing order to the web. Stanford InfoLab; 1999. 31. Vu T, Litkowski EM, Liu W, Pratte KA, Lange L, Bowler RP, et al. NetSHy: network summarization via a hybrid approach leveraging topological properties. Bioinformatics. 2023;39(1):btac818. January 18, 2024 27/29 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint 32. De Lathauwer L, De Moor B, Vandewalle J. On the best rank-1 and rank-(r 1, r 2,..., rn) approximation of higher-order tensors. SIAM journal on Matrix Analysis and Applications. 2000;21(4):1324–1342. 33. Kolda TG, Bader BW. Tensor decompositions and applications. SIAM review. 2009;51(3):455–500. 34. Dai Yh, Yuan Y. An efficient hybrid conjugate gradient method for unconstrained optimization. Annals of Operations Research. 2001;103:33–47. 35. Nash JC. Compact numerical methods for computers: linear algebra and function minimisation. CRC press; 1990. 36. Acar E, Papalexakis EE, Gürdeniz G, Rasmussen MA, Lawaetz AJ, Nilsson M, et al. Structure-revealing data fusion. BMC bioinformatics. 2014;15(1):1–17. 37. Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the royal statistical society: series B (statistical methodology). 2005;67(2):301–320. 38. Papalexakis EE, Faloutsos C, Mitchell TM, Talukdar PP, Sidiropoulos ND, Murphy B. Turbo-smt: Accelerating coupled sparse matrix-tensor factorizations by 200x. In: Proceedings of the 2014 SIAM International Conference on Data Mining. SIAM; 2014. p. 118–126. 39. Langville AN, Meyer CD. Google’s PageRank and beyond. In: Google’s PageRank and Beyond. Princeton university press; 2011. 40. Hu W, Lin D, Cao S, Liu J, Chen J, Calhoun VD, et al. Adaptive sparse multiple canonical correlation analysis with application to imaging (epi) genomics study of schizophrenia. IEEE Transactions on Biomedical Engineering. 2017;65(2):390–399. 41. Foldnes N, Olsson UH. A simple simulation technique for nonnormal data with prespecified skewness, kurtosis, and covariance matrix. Multivariate behavioral research. 2016;51(2-3):207–219. 42. Liu W, Vu T, Konigsberg IR, Pratte KA, Zhuang Y, Kechris KJ. SmCCNet 2.0: an Upgraded R package for Multi-omics Network Inference. bioRxiv. 2023; p. 2023–11. 43. Li Y, Umbach DM, Bingham A, Li QJ, Zhuang Y, Li L. Putative biomarkers for predicting tumor sample purity based on gene expression data. BMC genomics. 2019;20(1):1–12. 44. Zhou Y, Zhou B, Pache L, Chang M, Khodabakhshi AH, Tanaseichuk O, et al. Metascape provides a biologist-oriented resource for the analysis of systems-level datasets. Nature communications. 2019;10(1):1523. 45. Ru Y, Kechris KJ, Tabakoff B, Hoffman P, Radcliffe RA, Bowler R, et al. The multiMiR R package and database: integration of microRNA–target interactions along with their disease and drug associations. Nucleic acids research. 2014;42(17):e133–e133. 46. Regan EA, Hokanson JE, Murphy JR, Make B, Lynch DA, Beaty TH, et al. Genetic epidemiology of COPD (COPDGene) study design. COPD: Journal of Chronic Obstructive Pulmonary Disease. 2011;7(1):32–43. January 18, 2024 28/29 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint 47. Gillenwater LA, Helmi S, Stene E, Pratte KA, Zhuang Y, Schuyler RP, et al. Multi-omics subtyping pipeline for chronic obstructive pulmonary disease. PloS one. 2021;16(8):e0255337. 48. Kamburov A, Cavill R, Ebbels TM, Herwig R, Keun HC. Integrated pathway-level analysis of transcriptomics and metabolomics data with IMPaLA. Bioinformatics. 2011;27(20):2917–2918. 49. Zhu Z, Ye W, Wu X, Lin S, Xu J, Li L, et al. Comprehensive analysis reveals a prognostic and therapeutic biomarker CD3D in the breast carcinoma microenvironment. Bioscience reports. 2021;41(1):BSR20202898. 50. Ding K, Li W, Zou Z, Zou X, Wang C. CCNB1 is a prognostic biomarker for ER+ breast cancer. Medical hypotheses. 2014;83(3):359–364. 51. Juríková M, Danihel L, Polák Š, Varga I. Ki67, PCNA, and MCM proteins: Markers of proliferation in the diagnosis of breast cancer. Acta histochemica. 2016;118(5):544–552. 52. Blaschko MB, Lampert CH. Correlational spectral clustering. In: 2008 IEEE Conference on Computer Vision and Pattern Recognition. IEEE; 2008. p. 1–8. 53. Elmenawi KA, Anil V, Gosal H, Kaur H, Ngassa HC, Mohammed L. The Importance of Measuring Troponin in Chronic Obstructive Pulmonary Disease Exacerbations: A Systematic Review. Cureus. 2021;13(8). 54. Pons J, Sauleda J, Regueiro V, Santos C, López M, Ferrer J, et al. Expression of Toll-like receptor 2 is up-regulated in monocytes from patients with chronic obstructive pulmonary disease. Respiratory research. 2006;7(1):1–9. 55. Jasper AE, McIver WJ, Sapey E, Walton GM. Understanding the role of neutrophils in chronic inflammatory airway disease. F1000Research. 2019;8. 56. Moré JM, Voelker DR, Silveira LJ, Edwards MG, Chan ED, Bowler RP. Smoking reduces surfactant protein D and phospholipids in patients with and without chronic obstructive pulmonary disease. BMC pulmonary medicine. 2010;10(1):1–8. 57. Vietri L, Fui A, Bergantini L, d’Alessandro M, Cameli P, Sestini P, et al. Serum amyloid A: A potential biomarker of lung disorders. Respiratory Investigation. 2020;58(1):21–27. 58. Mercer P, Shute J, Bhowmik A, Donaldson G, Wedzicha J, Warner J. MMP-9, TIMP-1 and inflammatory cells in sputum from COPD patients during exacerbation. Respiratory research. 2005;6:1–9. 59. Aksu F, Capan N, Aksu K, Ofluoğlu R, Canbakan S, Yavuz B, et al. C-reactive protein levels are raised in stable Chronic obstructive pulmonary disease patients independent of smoking behavior and biomass exposure. Journal of thoracic disease. 2013;5(4):414. 60. Railwah C, Lora A, Zahid K, Goldenberg H, Campos M, Wyman A, et al. Cigarette smoke induction of S100A9 contributes to chronic obstructive pulmonary disease. American Journal of Physiology-Lung Cellular and Molecular Physiology. 2020;319(6):L1021–L1035. January 18, 2024 29/29 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint

Text is read by the "Ask this paper" AI Q&A widget below. Extraction quality varies by source — PMC NXML preserves structure cleanly, OA-HTML may include some navigation residue, and OA-PDF can have broken hyphenation. The publisher copy (via DOI) is the canonical version.

My notes (saved in your browser only)

Ask this paper AI returns verbatim quotes from the full text · source: oa-pdf

Answers must be backed by verbatim quotes from this paper's full text. Hallucinated quotes are dropped automatically; if no verbatim passage answers the question, we say so. How this works

Citation neighborhood (no data yet)

We don't have any in-corpus citations linked to this paper yet. This is a recent paper (2024) — citers typically take a year or two to land, and the OpenAlex reference graph may still be filling in.

Source provenance

europepmc
last seen: 2026-05-20T01:45:00.602351+00:00
unpaywall
last seen: 2026-05-24T02:00:01.246996+00:00
License: CC-BY-NC-ND-4.0