Abstract
Multiple -omics (genomics, proteomics, etc.) profiles are commonly generated to gain
insight into a disease or physiological system. Constructing multi-omics networks with
respect to the trait(s) of interest provides an opportunity to understand relationships
between molecular features but integration is challenging due to multiple data sets with
high dimensionality. One approach is to use canonical correlation to integrate one or
two omics types and a single trait of interest. However, these types of methods may be
limited due to (1) not accounting for higher-order correlations existing among features,
(2) computational inefficiency when extending to more than two omics data when using
a penalty term-based sparsity method, and (3) lack of flexibility for focusing on specific
correlations (e.g., omics-to-phenotype correlation versus omics-to-omics correlations). In
this work, we have developed a novel multi-omics network analysis pipeline called
Sparse Generalized Tensor Canonical Correlation Analysis Network Inference
(SGTCCA-Net) that can effectively overcome these limitations. We also introduce an
implementation to improve the summarization of networks for downstream analyses.
Simulation and real-data experiments demonstrate the effectiveness of our novel method
for inferring omics networks and features of interest.
Author summary
Multi-omics network inference is crucial for identifying disease-specific molecular
interactions across various molecular profiles, which helps understand the biological
processes related to disease etiology. Traditional multi-omics integration methods focus
mainly on pairwise interactions by only considering two molecular profiles at a time.
This approach overlooks the complex, higher-order correlations often present in
multi-omics data, especially when analyzing more than two types of -omics data and
phenotypes. Higher-order correlation, by definition, refers to the simultaneous
relationships among more than two types of -omics data and phenotype, providing a
January 18, 2024 1/29
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint
more complex and complete understanding of the interactions in biological systems.
Our research introduces Sparse Generalized Tensor Canonical Correlation Network
Analysis (SGTCCA-Net), a novel framework that effectively utilizes both higher-order
and lower-order correlations for multi-omics network inference. SGTCCA-Net is
adaptable for exploring diverse correlation structures within multi-omics data and is
able to construct complex multi-omics networks in a two-dimensional space. This
Method
offers a comprehensive view of molecular feature interactions with respect to
complex diseases. Our simulation studies and real data experiments validate
SGTCCA-Net as a potent tool for biomarker identification and uncovering biological
mechanisms associated with targeted diseases.
Introduction
1
0.1 Multi-Omics Data Integration 2
The integration of multiple datasets with the same set of subjects has been actively 3
explored in the field of machine learning, referred to as multi-view machine learning [1].4
This approach has a broad range of applications, including clustering subjects based on5
the consensus of different views [2] and performing image annotation [3]. Furthermore,6
multi-view machine learning has been used extensively in the biomedical domain [4,5].7
Recent advances in biomedical technologies have enabled the generation of 8
high-throughput data at the molecular level, such as the genome, transcriptome, 9
proteome, and metabolome. Collectively, these datasets are referred to as multi-omics10
data [6]. Traditionally, each omics dataset has been analyzed separately [7,8], which can11
Result
in the loss of important information about the connections between omics 12
datasets. Multi-omics integration is an alternative approach that identifies shared and13
complimentary information across different molecular profiles [9]. Integrating 14
multi-omics data can help uncover biological mechanisms and interactions at the 15
molecular level, and has been used for various purposes such as disease 16
subtyping [10,11], variable selection [12,13], network analysis [14,15] and prediction of17
biomarkers [6,16,17]. 18
0.2 Multi-View Dimension Reduction 19
Dimension reduction is one of the more common goals of high dimensional data analysis.20
For example, non-negative matrix factorization is implemented to cluster multiple data21
sets [18], and multi-view co-reduction is developed to preserve the locality within the22
view and the consistency between the views in the lower dimensional embedding of each23
view [19]. Other dimension reduction methods for multiple data sets are based on 24
Canonical Correlation Analysis (CCA) [20], which seeks to find the linear combination25
(canonical weight) that maximizes the correlation between two sets of data. Usually, 26
there is more than one solution, which may be referred to as a canonical weight matrix,27
and thus it is commonly used for dimension reduction by projecting the original data28
into a shared lower dimensional space between two views (or omics data) with canonical29
weight matrix. Given two data setsX1 and X2, the optimization function is: 30
w1, w2 = arg max
w1,w2
˜wT
1 X T
1 X2 ˜w2p
˜wT
1 X T
1 X1 ˜w1 ˜wT
2 X T
2 X2 ˜w2
(1)
However, this formulation only considers two sets of data and is not generalized to 31
multiple data. A multiple canonical correlation method was developed to maximize the32
January 18, 2024 2/29
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint
sum of pairwise canonical correlations [21]. Suppose there arej = 1, . . . , Kviews, then 33
the problem can be formulated as follows: 34
max
w1,...,wk
KX
i<j
wT
i X T
i Xjwj s.t.wT
j X T
j Xjwj = 1 ∀j. (2)
A penalty term can be added to the formulation above to achieve sparsity, which is 35
called Sparse Multiple Canonical Correlation Analysis (SmCCA). 36
Network analysis for molecular profiles is often used with dimension reduction 37
Methods
like SmCCA to identify and visualize connections between features [14,15] for38
the purpose of inferring underlying biological interactions. In addition to canonical 39
correlation analysis, regression is another approach that is implemented for multi-omics40
network inference, particularly for gene regulatory networks [22,23]. However, most 41
Methods
do not incorporate an outcome or phenotype in the form of a quantitative trait.42
Canonical correlation analysis-based methods can accomplish this goal by expanding 43
CCA to incorporate a phenotype (Y). SmCCNet was proposed to partition omics 44
features into different network modules while considering the phenotype(s) of 45
interest [24] and using a scaled version of SmCCA with the following optimization: 46
(w1, w2, ..., wk) = arg max
˜w1, ˜w2,..., ˜wK
X
i<j
i,j=1,2,...,K
ai,j ˜wT
i X T
i Xj ˜wj +
KX
i=1
bi ˜wT
i X T
i Y
,
subject to ∥ ˜wj∥2 = 1, P j( ˜wj) ≤ cj, j = 1, 2, ..., K,
(3)
where ai,j and bi are scaling factors that place greater importance on pairwise 47
combinations of views. For example, one may want to increasebi to increase the 48
influence on the phenotype to determine the canonical weights, andP (·) is the lasso 49
penalty parameter for sparsity [25], but other types of penalties can also be used. The50
optimal penalty parameter is selected through k-fold cross-validation. After that, the 51
canonical weights can be extracted based on the objective function above to construct52
an adjacency matrix. Finally, to construct a network, hierarchical clustering can be 53
implemented afterward to extract multiple multi-omics network modules. This method54
has been applied in various contexts including identifying phenotype-specific 55
miRNA-mRNA and proteomics-metabolomics correlations and networks [26]. However, 56
the existing SmCCNet method can only be adapted to two molecular profiles plus one57
single phenotype. When extending it to 3 or more omics data types, it can become 58
computationally expensive due to the cross-validation step to find the optimal penalty59
parameter for each omics data set. 60
Another multi-omics analysis and network inference method Data Integration 61
Analysis for Biomarker discovery using Latent variable approaches for Omics studies 62
(DIABLO) [27] has a similar formulation to the sparse multiple canonical correlation 63
analysis problems (Eq 3). Although the formulations of SmCCNet and DIABLO are 64
similar, they have some differences: (1) In DIABLO, there is no lasso penalty, but the65
user can choose how many features to include as nonsparse for each data view instead;66
(2) DIABLO focuses more on the prediction of phenotype and biomarker discovery, 67
while SmCCNet focuses more on graph learning of the interaction between biomarkers68
with respect to phenotype; (3) the adjacency matrix of SmCCNet is aggregated through69
canonical weights obtained from subsampling the data multiple times for a more robust70
solution, while the adjacency matrix of DIABLO is the subset of correlation matrix with71
only molecular features selected by the canonical correlation analysis. 72
January 18, 2024 3/29
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint
0.3 Tensor Canonical Correlation Analysis 73
As reviewed, the dimension reduction methods based on canonical correlation analysis74
only consider the summation of all pairwise correlation, including methods like 75
SmCCNet and DIABLO. However, for multi-omics data with more than two molecular76
profiles, the correlation structure can be higher-order rather than pairwise, where 77
higher-order correlation is defined as the simultaneous correlation among 3 or more 78
features. This type of higher-order correlation can be captured by pairwise correlation if79
and only if all pairwise correlations are strong. However, in multi-omics data, it may80
not be the case that higher-order correlations also have strong pairwise correlations. 81
A particular variant of the multiple canonical correlation analysis extends the 82
pairwise relationship to a higher-order relationship by maximizing the tensor canonical83
correlation [28], which captures higher-order correlations among multiple data sets and84
projects them into a shared lower dimensional embedding. A further extension to this85
tensor canonical correlation analysis (TCCA) is to combine deep learning with tensor86
canonical correlation analysis to maximize the higher-order correlation between views in87
the non-linear lower-dimensional space for dimensional reduction [29] to allow for 88
non-linearity. 89
Fig 1. Comparison between SmCCA, TCCA, and GTCCA Visualization of the
comparison between Sparse multiple Canonical Correlation Analysis (SmCCA), Tensor
Canonical Correlation Analysis (TCCA), and Generalized Tensor Canonical Correlation
Analysis (GTCCA). SmCCA only captures pairwise correlations (e.g., genes and proteins) and
TCCA only captures the highest order correlations (e.g., gene, protein and metabolites),.
GTCCA considers the combination of highest order correlations ((TCCA) and lower-order
correlations (SmCCA).January 18, 2024 4/29
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint
0.4 Contributions 90
As reviewed, SmCCA captures pairwise correlation, while TCCA captures higher-order91
correlation. However, TCCA only captures the highest order correlation, but not all 92
types of lower-order correlation structures of interest. In addition, sparsity is not 93
considered in TCCA, which prevents the model from focusing only on the most relevant94
information in the data. Furthermore, TCCA has a high memory and computation cost,95
which is inefficient for high-dimensional omics data. Therefore, in this work, we present96
a new method for identifying multi-omics phenotype specifics network that is based on97
TCCA but with significant extensions by including correlations of higher and lower 98
order simultaneously (Figure 1) and by incorporating sparsity. This new method is 99
called Sparse Generalized Tensor Canonical Correlation Analysis for Network Analysis100
(SGTCCA-Net). In particular, we avoid repeatedly applying TCCA to each correlation101
structure, but instead use a simultaneous algorithm that accounts for all the correlation102
components. 103
In this novel pipeline, we introduce multiple major contributions: New definition of104
higher-order correlation compared to the original TCCA, which better measures 105
higher-order correlation without interpreting the directionality and avoids effect 106
cancellation issues; The development of Sparse Generalized Tensor Canonical 107
Correlation Analysis (SGTCCA) with consideration of higher-order correlation, flexible108
design of correlation structures, and biased subsampling algorithm that ensures sparsity109
and computational efficiency; Implementation of network pruning and summarization 110
algorithms to keep only the most important molecular features. Our pipeline first 111
performs feature selection with SGTCCA, which extracts molecular features that are 112
involved in one or more specified higher/lower-order correlation structures. Then an 113
adjacency matrix between selected features is constructed to identify the inter-molecule114
relationships and collapse all higher/lower-order correlations into the two-dimensional115
space. Afterward, a network pruning algorithm is implemented to the adjacency matrix116
with the PageRank algorithm [30] NetSHy summarization score [31] to reduce the size117
of the network and include only the most relevant molecular features, yielding the final118
multi-omics network. Details of the pipeline are described below 119
Materials and methods
120
0.5 Summary of Pipeline Workflow 121
The end-to-end pipeline for SGTCCA-Net (Figure 2 and Algorithm 1) inputs the 122
multi-omics data X1, X2, ..., Xk ∈ RN ×dj , j = 1, 2, ..., k, each with dimensiondj, and 123
outputs subnetwork adjacency matrixM(sub) ∈ R(Pk
j=1 d(sub)
j )×(Pk
j=1 d(sub)
j ), where 124
d(sub)
j ≤ dj for all j = 1, 2, ..., k is the number of features belong to thejth molecular 125
profile. There are two major contributions in this paper: (1) the development of Sparse126
Generalized Tensor Canonical Correlation Analysis that combines higher-order and 127
lower-order correlation of interests and guarantees high computational efficiency, and (2)128
a novel network pruning algorithm based on the network summarization score. 129
January 18, 2024 5/29
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint
Fig 2. SGTCCA-Net workflow Workflow of SGTCCA-Net pipeline for multi-omics
network inference. It consists of three core steps: higher-order correlation extraction, affinity
matrix construction, and network trimming. Biased subsampling involves calculating the
covariance tensor/matrix density with respect to certain molecular profiles, which is also the
importance/probability vector when randomly sampling features from each molecular profile.
Algorithm 1: SGTCCA-Net End-to-End Pipeline Algorithm.
Input: Multi-omics and phenotype dataXj ∈ RN ×dj , j = 1, 2, ..., k, preferred
number of subsampless ;
(1) Run the SGTCCA algorithm and obtain canonical weight matrices
Hj ∈ Rdp×s, j = 1, 2, ..., k (Algorithm 2 and Algorithm 1 in S3 T ext);
(2) Construct global adjacency matricesM based on Hj, j = 1, 2, ..., k
(Algorithm 2 in S5 T ext);
(3) Prune each network module with PageRank algorithm and NetSHy
network summarization score, and obtain sub-networksM (sub)(Algorithm
2 in S5 T ext) ;
(4) Use Pearson’s correlation to filter out weaker edges from each subnetwork
(Algorithm 2 in S5 T ext);
Output:Multi-omics network module with respect to phenotype(s)
130
0.6 Generalized Tensor Canonical Correlation Analysis 131
0.6.1 New T ensor Canonical Correlation Analysis 132
The original TCCA algorithm has two major issues: (1) the directionality 133
(positive/negative) of higher-order correlation is not interpretable, especially when there134
are more than three datasets, (2) the correlation effect will be cancelled for odd number135
of datasets (see S1 Text for detail). Therefore, we addressed these limitations by 136
proposing a new tensor canonical correlation analysis formulation. Let 137
z1, z2, ..., zk ∈ Rn×1 be k n × 1 dimensional vectors that are centered and scaled, and 138
1 ∈ Rn×1 is the all-onen × 1 vector, then the higher-order correlation between these 139
January 18, 2024 6/29
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint
vectors can be defined as: 140
ρ(z1, z2, ..., zk) = | 1
n(z1z2...zk)T 1|, (4)
where z1z2...zk is the element-wise multiplication betweenk vectors (assuming k is 141
even). The higher-order correlation calculation method is slightly different for the odd142
number of views to avoid effect cancellation (see S1 Text for detail). Eq 4 calculates the143
higher-order correlation between vectors, but if the higher-order correlation between 144
matrices (e.g. each matrix is an omics dataset) needs to be calculated, we can avoid145
processing multiple loops to calculate the higher-order correlation for all combinations of146
features through a different equation. Suppose that there arek data in total, denoted by 147
Zj ∈ RN ×dj , j = 1, 2, .., k, which are centered and scaled. Letzji ∈ R1×dj be the vector 148
for subject i in the jth view. For each view, there areN observations and dj features, 149
then the covariance tensorC1,2,...,k ∈ Rd1×d2×...×dk for k views can be denoted by: 150
C1,2,...,k = 1
N |
NX
i=1
z1i ◦ z2i ◦ ... ◦ zki|, (5)
where ◦ denotes the outer product. It can be shown that each entry ofC1,2,...,k, denoted
by C1,2,...,k(j1, j2, ..., jk), can also be calculated through Eq 4 (see S1 Text for more
details). If the canonical weights for each view arehj ∈ Rdj ×1, j = 1, 2, .., k, then our
new tensor canonical correlation analysis is formulated as follows:
arg max
hj
ρ =C1,2,...,k ×1 hT
1 ×2 hT
2 × ... ×k hT
k
s.t. hT
j hj = 1, j = 1, 2, ..., k, (6)
where the i-mode product ofC1,2,...,k and hT
i , denoted byC1,2,...,k ×i hi is defined as 151
follow: 152
(C1,2,...,k ×i hT
i )(j1, j2, ..., jp−1, 1, jp+1, ..., jk) = C1,2,...,k(j1, j2, ..., jk) · hT
i . (7)
It has been shown that the optimization problem above is equivalent to the following
form [32]:
min
ρ,h1,h2,...,hk
||C1,2,...,k − ˆC1,2,...,k||2
F
s.t. ˆC1,2,...,k = ρ · h1 ◦ h2 ◦ ... ◦ hk. (8)
In this optimization form, the canonical weightshj are solved using either a 153
gradient-based method or alternating least squares [33]. 154
0.6.2 Generalized T ensor Canonical Correlation Analysis Combining 155
Higher and Lower-Order Correlations 156
Algorithm 2: Generalized Tensor Canonical Correlation Analysis Algorithm.
Input: Multi-omics and phenotype dataXj ∈ RN ×dj , j = 1, 2, ..., K, preferred
number of network moduless ;
(1) Calculate covariance tensors/matricesC1, C2, ..., Cl of interest using Eq 5;
(2) Formulate multiple tensors/matrices factorization problem using Eq 11 ;
(3) Use the first-order gradient-based algorithm to find optimal canonical
weight matrices Hj ∈ Rdj ×s, j = 1, 2, ..., K;
Output: Canonical weight matricesHj ∈ Rdj ×s, j = 1, 2, ..., k
157
January 18, 2024 7/29
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint
The section above illustrates TCCA assuming that only a single full covariance 158
tensor is used and is not capable of adding more covariance structures. For example,159
with the presence of transcriptomic, proteomics, metabolomics, and phenotype data, we160
may also be interested in the 3-way or pairwise correlation with respect to phenotype in161
addition to the full 4-way correlation structure. Therefore, to incorporate multiple 162
correlation structures of interest, we extended our new TCCA method and developed163
the Generalized Tensor Canonical Correlation Analysis (GTCCA). 164
Let Sm = {(m1, ..., mm) : mi ∈ {1, ..., k}, m1 ̸= m2 ̸= ... ̸= mm} be all possible
combinations of k choose m, let Sm(i) be the ith element in the set. For instance, with
k = 3 and m = 2, it would be the set of{S2(1) = (1, 2), S2(2) = (1, 3), S2(3) = (2, 3)},
then the GTCCA can be defined as:
arg max
h1,h2,...,hk
ak,1ρk(1)2 +
(
k
k−1)X
j=1
ak−1,jρk−1(j)2 + ...
+
(
k
3)X
j=1
a3,jρ3(j)2 +
(
k
2)X
j=1
a2,jρ2(j)2
s.t. hT
j hj = 1, j = 1, 2, ..., k, (9)
where ρm(j) = CSm(i) ×1 hm1 × ... ×m hmm, where (m1, m2, ..., mm) = Sm(i). 165
Compared to TCCA, this design allows flexibility with respect to a specific 166
experimental design of interest by allowing a portion ofai,j to be 0. For example, if the 167
investigator is not interested in a 3-way higher-order correlation between omics data 1,2,168
and 3, then the scaling factor associated withρ1,2,3 can be set to 0. In addition, the 169
scaling factor ai,j can be set to values other than 1 to prioritize certain correlation 170
structures. Eq 9 cannot be directly optimized with the existing gradient-based method,171
and thus we found the problem equivalency as follows (proof in S2 Text): 172
Theorem 1 Let CSm(j) be the covariance tensor of view with (m1, ..., mm) = Sm(j)
such that CSm(j) ∈ Rdm1 ×dm2 ×...×dmm , If the optimization goal is formulated as follows:
arg max
h1,h2,...,hk
ak,1ρk(1)2 +
(
k
k−1)X
j=1
ak−1,jρk−1(j)2 + ...
+
(
k
3)X
j=1
a3,jρ3(j)2 +
(
k
2)X
j=1
a2,jρ2(j)2
s.t. hT
j hj = 1, j = 1, 2, ..., k, (10)
where ρm(i) = CSm(i) ×1 hm1 × ... ×m hmm for all m = 1, 2, ..., k, then the optimization
January 18, 2024 8/29
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint
problem above is equivalent to the following:
arg min
h1,h2,...,hk
ak,1||CSk(1) − ˆCSk(1)||2
F
+
(
k
k−1)X
j=1
ak−1,j||CSk−1(j) − ˆCSk−1(j)||2
F
+ ... +
(
k
3)X
j=1
a3,j||CS3(j) − ˆCS3(j)||2
F
+
(
k
2)X
j=1
a2,j||CS2(j) − ˆCS2(j)||2
F
s.t. hT
j hj = 1, j = 1, 2, ..., k, (11)
where ˆCSm(i) = ρm(i)hm1 ◦ hm2 ◦ ... ◦ hmm is the rank-1 approximation of CSm(i). 173
Since the original problem is shown to be equivalent to the multiple tensor factorization174
problem, it can be solved iteratively using any first-order gradient-based algorithm. We175
choose the non-linear conjugate gradient algorithm with the Dai/Yuan update and 176
restart [34,35]. 177
0.7 Multi-Omics Data Example 178
In this section, we demonstrate how the first-order gradient of Eq 11 can be calculated179
with a multi-omics data example. Suppose there are three types of molecular profiles:180
transcriptomics (tr), proteomics (pr), and metabolomics (me), and phenotype (ph) data.181
Define these data asXtr, Xpr, Xme and Yph. Using GTCCA to find the 182
phenotype-related correlation structure, the optimization problem is given by: 183
arg max
htr,hpr,hme,hph
ρ2
tr,pr,me,ph + ρ2
tr,pr,ph + ρ2
tr,me,ph + ρ2
pr,me,ph
+ ρ2
tr,ph + ρ2
pr,ph + ρ2
me,ph
s.t. hT
j hj = 1, i = g, pr, m, ph, (12)
where ρtr,pr,me,ph = Ctr,pr,me,ph ×1 htr ×2 hpr ×3 hme ×4 hph, and the other correlation
components are also in the same form. By Theorem (1), optimizing this objective
function is equivalent to:
arg min
htr,hpr,hme,hph
||Ctr,pr,me,ph − ˆCtr,pr,me,ph||2
F
+ ||Ctr,pr,ph − ˆCtr,pr,ph||2
F
+ ||Ctr,me,ph − ˆCtr,pr,ph||2
F
+ ||Cpr,me,ph − ˆCpr,me,ph||2
F
+ ||Ctr,ph − ˆCtr,ph||2
F + ||Cpr,ph − ˆCpr,ph||2
F
+ ||Cme,ph − ˆCme,ph||2
F
s.t. hT
j hj = 1, j = g, pr, m, ph, (13)
where ˆCtr,pr,me,ph = ρtr,pr,me,ph · htr ◦ hpr ◦ hem ◦ hph represents the rank-1
approximation of the covariance tensor, and other components are similar. Below is the
January 18, 2024 9/29
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint
example gradient calculation for the transcriptomehtr,
∂f
∂htr
= [ ˆCtr,pr,me,ph(1) − Ctr,pr,me,ph(1)]
· (ρtr,pr,me,ph · htr ⊙ hpr ⊙ hme)
+ [ ˆCtr,pr,ph(1) − Ctr,pr,ph(1)](ρtr,pr,ph · hph ⊙ hpr)
+ [ ˆCtr,me,ph(1) − Ctr,me,ph(1)](ρtr,me,ph · hph ⊙ hme)
+ (htrρ1hT
ph − C1,y)hphρtr,ph + λ(htr − ¯hph) (14)
∂f
∂ρtr,pr,me,ph
= ( ˆCtr,pr,me,ph − Ctr,pr,me,ph)
×1 htr ×2 hpr ×3 hme ×4 hph, (15)
where ⊙ is the Khatri-Rao product, given two matricesA ∈ Rm1×n and B ∈ Rm1×n, 184
and let ⊗ denote the Kronecker product between two vectors, the Khatri-Rao product is185
given by: 186
A ⊙ B = [a1 ⊗ b1, a2 ⊗ b2, ..., an ⊗ bn], (16)
Ctr,pr,me,ph(i) is the mode-imatricization of tensorCtr,pr,me,ph, and "mode" stands 187
for the index of dimension in tensor data, and "mode-i" means theith dimension of the 188
tensor data. This is a way to matricize the tensor by mapping the elements of the 189
tensor into a matrix by arranging the mode-i fibers (think of it as the higher-order rows190
and columns) so that they become the columns of[Ctr,pr,me,ph](i). The gradient for 191
other hs and ρs can be calculated in a similar way. After taking the gradient, the next192
step is to concatenate all of the gradients into a long vector. All first-order 193
gradient-based methods can be used for optimization, and the nonlinear conjugate 194
gradient method is chosen to solve for the optimal canonical weights [36]. 195
0.8 Sparse Generalized Tensor Canonical Correlation Analysis 196
Common methods for ensuring sparsity in statistical models include using penalty-based197
techniques such as lasso [25] and elastic net [37]. However, applying these methods to198
GTCCA in practice can be computationally intensive, as they require tensor 199
computation on the original covariance tensors, which can potentially cause memory 200
issues. To address this problem, adapting the idea from Turbo-SMT [38], we use the201
biased subsampling method to guarantee sparsity (the details of the algorithm setup are202
shown in Algorithm 1 in S3 Text.). This method is both accurate and efficient, 203
reducing the tensor factorization problem by 1000 times. The core concept of this 204
Method
is to subsample features based on the covariance density value calculated for205
each feature. If a feature is more likely to be connected to other features or phenotypes,206
then it is more likely to be selected in the biased subsampling phase. Details of the 207
calculation of the covariance density are given in S4 Text. 208
In practice, multi-omics data are often high-dimensional, which poses challenges in209
terms of memory and computation when computing covariance tensors. Our preliminary210
findings indicate that in GTCCA optimization, even a covariance tensor of size 211
1000 × 1000 × 1000 requires more than 16 GB of RAM, leading to memory explosion.212
Although SGTCCA reduces the GTCCA memory requirement by feature subsampling,213
it still requires the calculation of covariance density vectors, which also involves the 214
computation of the full covariance tensor. Therefore, to further improve the algorithm215
efficiency, when calculating the covariance density vectors, we used the unbiased feature216
subsampling approach to approximate each covariance tensor. In each subsampling 217
iteration of our method, a small, randomly selected subset of features from each 218
January 18, 2024 10/29
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint
molecular profile is used. For these subsets, we calculate the covariance density based219
on the provided correlation structures. This process is repeated across multiple 220
iterations. The final estimated covariance density vector is obtained by averaging the221
covariance densities from all subsamples. As a general rule for setting parameters, a 222
lower subsampling percentage (e.g.,10%) requires a greater number of iterations to 223
achieve reliable results, while a higher subsampling percentage (e.g.,70%) typically 224
requires fewer iterations. 225
To ensure the consistency of canonical weight vectors from each subsample, a 226
portion of features from each dataset are preselected and shared across different 227
subsamples. For instance, for each subsample, we can have8% of the features shared 228
across all subsamples and2% of the distinct features. A general recommendation is to229
set the total percentage of featured subsampled to be between10% and 20% to ensure 230
both computational efficiency. 231
0.9 Network Construction and Pruning 232
After obtaining the canonical weight through the SGTCCA, the next step is to 233
construct an adjacency matrix. The adjacency matrix construction process is the same234
as SmCCNet, and the detailed algorithm is illustrated in Algorithm 2 of S5 Text. The235
general idea is to take the outer product between the concatenated canonical weight and236
itself. However, even though the adjacency matrix is sparse, it may still contain 237
features/nodes that are less associated with other features/phenotypes. Therefore, we238
prune the global network with the PageRank algorithm [39] and the NetSHy network239
summarization score [31]. The original PageRank algorithm is widely used to rank web240
pages according to their importance. Its application to networks (adjacency matrix) is to241
count the number and strength of the edges for each node to determine the importance242
of each node. The NetSHy summarization score is the principal component score that243
takes into account the graph/network topology. We chose the NetSHy summarization244
score rather than the regular principal component analysis to prioritize the contribution245
of nodes with high network connectivity. The network pruning algorithm ensures both a246
high correlation of the summarization score with the phenotype and high network 247
connectivity. Step-by-step details of the algorithm can be found in S5 Text. 248
The algorithm above provides node-wise network pruning, but the pruned 249
sub-networks will be highly densely connected, and it requires edge pruning. Therefore,250
we calculate the between-node Pearson’s correlation and use it to filter out edges that251
connect to two weakly or non-correlated nodes, which improves the final network 252
visualization. 253
0.10 Computational Complexity 254
Suppose there arek omics data, each withd1, d2, ..., dk features, the common 255
subsampling proportion ispc and the distinct subsampling proportion ispd (see S3 Text 256
for details). Define p = pc + pd ∈ [0, 1], then the new tensor canonical correlation 257
analysis defined in this paper has the space complexity ofO(d1d2...dk), while SGTCCA 258
has the space complexity ofO(d1d2...dkpk). The computational cost per iteration for 259
new tensor canonical correlation analysis isO(d1d2...dks), while for SGTCCA it is 260
O(d1d2...dkpks). Since 0 < p < 1, SGTCCA has a more efficient space complexity and261
computational cost than the new TCCA. 262
January 18, 2024 11/29
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint
Experiments 263
0.11 Simulation Study 264
F
eature IndexData
Type 1:20 21:40 41:60 61:80 81:100 101:160 161:220 221:280 281:340 341:1000
Omics
1 Laten
t 1* Laten
t 2* Laten
t 3* Noise Laten
t 5* Laten
t 8 Laten
t 9 Laten
t 10 Noise Noise
Omics
2 Laten
t 1* Laten
t 2* Noise Laten
t 4* Laten
t 6* Laten
t 8 Laten
t 9 Noise Laten
t 11 Noise
Omics
3 Laten
t 1* Noise Laten
t 3* Laten
t 4* Laten
t 7* Laten
t 8 Noise Laten
t 10 Laten
t 11 Noise
Phenot
ype Data Latent F actor Composition
Phenot
ype Laten
t 1*; Latent 2*; Latent 3*; Latent 4*; Latent 5*; Latent 6*; Latent 7*
F
eature IndexData
Type 1:20 21:40 41:60 61:80 81:100 101:160 161:220 221:280 281:340 341:1000
Omics
1 Laten
t 1* Laten
t 2* Laten
t 3* Noise Laten
t 5* Laten
t 8 Laten
t 9 Laten
t 10 Noise Noise
Omics
2 Laten
t 1* Laten
t 2* Noise Laten
t 4* Laten
t 6* Laten
t 8 Laten
t 9 Noise Laten
t 11 Noise
Omics
3 Laten
t 1* Noise Laten
t 3* Laten
t 4* Laten
t 7* Laten
t 8 Noise Laten
t 10 Laten
t 11 Noise
Phenot
ype Data Latent F actor Composition
Phenot
ype Laten
t 1*
F
eature IndexData
Type 1:20 21:40 41:60 61:80 81:100 101:160 161:220 221:280 281:340 341:1000
Omics
1 Laten
t 1* Laten
t 2* Laten
t 3* Noise Laten
t 5* Laten
t 8 Laten
t 9 Laten
t 10 Noise Noise
Omics
2 Laten
t 1* Laten
t 2* Noise Laten
t 4* Laten
t 6* Laten
t 8 Laten
t 9 Noise Laten
t 11 Noise
Omics
3 Laten
t 1* Noise Laten
t 3* Laten
t 4* Laten
t 7* Laten
t 8 Noise Laten
t 10 Laten
t 11 Noise
Phenot
ype Data Latent F actor Composition
Phenot
ype Laten
t 2*; Latent 3*; Latent 4*; Latent 5*; Latent 6*; Latent 7*
T able 1. Simulated multi-omics data correlation structure for cases 1-3. Red and "*" mean
that features that are simulated with this latent factor are considered signal features. In
addition to the existing latent factors and random noise, as shown in the table, additional
random noise will be added to all simulated molecular features and phenotype data. The first
table is for simulation case 1, where all types of phenotype-specific correlation structures are
simulated and considered signal; the second table is for simulation case 2, where only 4-way
phenotype-specific correlation structures are simulated and considered signal; the third table is
for simulation case 3, where all 3-way, pairwise phenotype-specific correlation structure is
simulated and considered signal.
We simulated multi-omics data using independent latent factors to represent various265
correlation structures [40] (Table 1). We aimed to compare the performance of 266
SGTCCA-Net with other multi-omics network inference pipelines. Since the comparison267
relies on the adjacency matrix, it is necessary that all methods have an available 268
adjacency matrix. We generated three datasets for molecular profiles, termed omics 1,269
omics 2, and omics 3, along with a quantitative phenotype. To mimic the true 270
multi-omics correlation structure in most scenarios (the strong connection between 271
omics but the weak connection between omics and phenotype), we imposed weaker noise272
on the omics data and stronger noise on the simulated phenotype data, and all the 273
random noise are generated from Gaussian distribution. This simulation study is 274
designed to evaluate the accuracy of different methods in identifying omics features 275
associated with both higher-order and lower-order phenotype-specific correlation 276
structures, and to assess the robustness of the model in handling noisy or highly skewed277
omics data. To this end, we introduced three latent factors simulation strategies: 278
multivariate normal latent factors, highly right-skewed latent factors (Fleishman power279
transformation [41]), and noisy latent factors (multivariate normal distribution with 280
noise). Under each simulation strategy, we proposed three distinct correlation structure281
setups: one with all possible phenotype-specific higher-order/lower-order correlations, 282
one with only phenotype-specific four-way correlations, and one with all 283
phenotype-specific correlation structures except for the 4-way correlations. These signal284
correlation structures are simulated by latent factors (red in Table 1). Furthermore, we285
also simulated nonphenotype-specific correlation structures to interfere with signal 286
feature identification and test the robustness of each method (black in Table 1). 287
The positive "signal" features, intended for model identification, are defined as 288
January 18, 2024 12/29
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint
four-way correlations involving all omics and phenotype, multiple three-way correlations289
among different combinations of omics with phenotype, and pairwise correlations 290
between each omics and phenotype. Additionally, the nonphenotype-specific correlations291
are negative "random noise" features to challenge signal feature detection. These 292
include a 3-way correlation among all omics and pairwise correlations among the omics,293
and these correlation structures serve as noise features, which models should ideally 294
overlook. The performance is evaluated at the node level with Area Under the Receiver295
Operating Characteristic Curve (AUC) score. A node is predicted positive (signal) if its296
maximal connection to other nodes in the adjacency matrix passes a certain threshold,297
which is consistent with the SmCCNet evaluation [24]. The AUC score is then 298
calculated checking prediction results across a series of the threshold value. 299
We compare the performance of SGTCCA-Net to other multi-omics network analysis300
Methods
with 25 replications: (1) STCCA-Net with only the higher order 4-way 301
correlation structure;(2) Best SmCCNet AUC score from various combinations of 302
scaling factors and sparsity levels; (3) Best DIABLO AUC score with combinations of303
different levels of sparsity and scaling factors. The parameter setup for SmCCNet and304
DIABLO resulted in a total of 9 models for each method. To ensure a fair comparison305
between methods, we extracted only one set of canonical weights from each method.306
More details on parameter settings can be found in S6 Text. 307
0.12 Multi-Omics Data Analysis 308
In addition to the simulations, we compared the performance of SGTCCA and SmCCA309
on two data sets. To maintain a fair comparison, we kept all subsequent network 310
analysis steps constant, and therefore did not use DIABLO as the network inference for311
that method implements a different pipeline. Therefore, we opt for a modified version of312
SmCCNet (SmCCNet 2.0) [42], ensuring that the network analysis steps align with 313
those of SGTCCA-Net. The best scaling factors and penalty terms selection for 314
SmCCNet are based on 5-fold cross-validation. To be specific, we let the penalty term315
for each molecular profile vary between 0.25, 0.5, and 0.75, and the scaling factor 316
associated with each phenotype-specific pairwise correlation component specific to the317
phenotype varies between 1, 5, and 10, and we evaluate the scaled prediction error for318
each particular combination. In an example of two omics data, if(a1,2, b1, b2) are the 319
scaling factors whereb1 and b2 represent the phenotype-specific scaling factors anda1,2 320
stands for the scaling factor for between-omics correlation, we will scale the scaling 321
factor by δ to ensure a1,2 + b1 + b2 = 1, then the best scaling factors and their 322
associated penalty terms are determined by minimizing: 323
(a1,2, b1, b2)=arg max
˜a1,2,˜b1,˜c2
|trainCC − testCC |
|testCC | , (17)
where trainCC and testCC stand for total training canonical correlation and total 324
testing canonical correlation. All other steps in the pipeline, including network 325
construction and pruning, were kept unchanged. Specifically, we fix steps (2) through326
(4) of Algorithm 1 and vary the setup for step (1) between SGTCCA and SmCCNet.327
The study aimed to compare the performance of SGTCCA to SmCCA in extracting the328
network associated with the phenotype of interest. 329
0.12.1 TCGA Breast Cancer Network Analysis 330
The Cancer Genome Atlas Program (TCGA) breast invasive carcinoma project includes331
RNA sequencing data with normalized counts obtained through the Illumina HiSeq 332
platform, microRNA (miRNA) expression data of tumor samples obtained through the333
January 18, 2024 13/29
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint
Illumina HiSeq platform at the miRgene-level, and log-ratio normalized reverse phase334
protein arrays (RPPA) expression data from tumor samples at the gene level. In this335
experiment, we use tumor purity as the phenotype, defined as the percentage of cancer336
cells in a sample of tumor tissue [43]. After matching subjects with all available 337
molecular profiles and phenotype data, we obtained a cohort of 105 subjects. Because of338
their relatively large number, we filtered genes based on the standard deviation to 339
eliminate genes that exhibit lower variability and only include the top 25% of the genes, 340
resulting in a total of 5039 genes, in addition to the 823 miRNAs, and 175 RPPAs. In341
the data preprocessing step, we regress out age, race, and whether patients received 342
radiation therapy for each of the molecular features to adjust for covariates. 343
Our SGTCCA-Net pipeline assumes the correlation structure of 344
gene-miRNA-RPPA-phenotype, gene-miRNA-phenotype, gene-RPPA-phenotype, 345
miRNA-RPPA-phenotype, and all pairwise molecular profiles with phenotype. Same as346
the simulation study, we set the percentage of common subsampled molecular features347
at 8% and the percentage of distinct molecular features to2% with a total of 10 348
iterations when bias subsampling is performed to guarantee sparsity. We use the 349
network pruning algorithm to prune the multi-omics network and set the 350
minimally/maximally final subnetwork size to 30/300. 351
To evaluate the efficacy of each method, we obtained the NetSHy summarization 352
score for every subnetwork, and correlate the scores with tumor purity. In addition, we353
calculated the correlation of individual molecular features with tumor purity, 354
contrasting the number of strongly and weakly correlated features between both 355
methods. Since one of the advantages of SGTCCA-Net is its ability to extract 356
higher-order correlation among molecular features, we rank all potential molecular 357
feature triplets and pairs in relation to tumor purity based on their respective 358
higher-order correlation values. This is done to investigate how different molecular 359
features simultaneously interact with respect to tumor purity. 360
An enrichment analysis was also performed on the aggregate set of genes and 361
proteins through Metascape (v. 3.5.20230501) [44], with the reference set of all genes362
and protein target genes fed into the SGTCCA-Net pipeline and recognized by 363
Cytoscape (4540). While the aforementioned approach focused solely on genes and 364
proteins, we also extended the enrichment analysis to miRNAs. Using MultiMiR [45],365
we first identified validated target genes of miRNAs in the subnetwork, then treated366
target genes that have a non-zero canonical weight in the global network as the enriched367
set. Enrichment analysis is also performed in Metascape with the same background set368
as described above. 369
0.12.2 COPDGene Network Analysis 370
Phase II COPDGene [46] includes transcriptomics (RNA-Seq), proteomics (SomaLogic371
1.3k), and metabolomics (Metabolon) data [47] for studying chronic obstructive 372
pulmonary disease (COPD). For demonstration, we use a smaller cohort to compare 373
different methods. We construct multi-omics networks with respect to forced expiratory374
volume at one second (FEV1). To ensure that we have roughly 1000 genes to run the375
model, we use standard deviation to filter out genes with less variability (sd < 0.435).376
After filtering subjects with available data for all molecular profiles and the phenotype,377
there were 461 subjects with available data for all molecular profiles, with 972 genes,378
1305 proteins, and 995 metabolites. To adjust for covariates, we regressed out effects379
from sex, age, and clinical center for each molecular feature. 380
Our SGTCCA-Net pipeline assumes the correlation structure of 381
gene-protein-metabolite-phenotype, gene-protein-phenotype, gene-metabolite-phenotype, 382
protein-metabolite-phenotype and all pairwise molecular profiles with phenotype. As in383
the simulation study and the TCGA breast cancer study, we set the percentage of 384
January 18, 2024 14/29
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint
common subsampled molecular features at8% and the percentage of distinct molecular 385
features to 2% with a total of 10 iterations when biased subsampling was performed to386
guarantee sparsity. 387
To evaluate the efficacy of each method, we obtained the NetSHy summarization 388
scores for every subnetwork, correlating them with FEV1. In addition, we calculated 389
the correlation of individual molecular features with FEV1, contrasting the number of 390
strongly and weakly correlated features between both methods. Since one of the 391
advantages of SGTCCA-Net is its ability to extract higher-order correlation between 392
molecular features, we rank all potential molecular feature triplets and pairs in relation393
to FEV1 based on their respective higher-order correlation values. This is done to 394
investigate how different molecular features simultaneously interact with respect to 395
FEV1. 396
Enrichment analysis was also performed on the union set of genes and proteins in397
the subnetwork through Metascape (v. 3.5.20230501) [44], with the background set to398
be all genes and protein target genes fed into the SGTCCA-Net pipeline and recognized399
by Cytoscape (1750). While the aforementioned approach focused solely on genes and400
proteins, we also extended the analysis to metabolites by running metabolite enrichment401
analysis on IMPaLa [48] based on the HMDB names of the metabolites in the 402
subnetwork, with the background set to be all metabolites with available HMDB name403
(227). 404
FEV11 is a measurement in liters for each subject, which did not take into account405
body habitus and race. Therefore, we also ran SGTCCA-Net on the FEV1 percent 406
predicted (FEV11PP) to check if the top molecular features identified fromFEV11 are 407
also identified in the FEV11PP network, and if the correlation changes. Additionally, 408
we also ran an enrichment analysis in Metascape to check if the pathways overlapped409
between the two networks. 410
Results
411
0.13 Simulation 412
SGTCCA-Net significantly outperforms the other three methods (SmCCNet, 413
STCCA-Net, and DIABLO) in all simulation study settings and simulated correlation414
structure designs (Table 2), even when SmCCNet and DIABLO use the parameter setup415
that performs best in each iteration. In case 1, where all types of phenotype-specific 416
correlation structures are presented and in case 3, when the 4-way correlation structure417
is removed, the performance of SGTCCA-Net improves by around10% compared to the 418
best-performing results from SmCCNet and DIABLO, and the level of improvement 419
increases as the number of subjects increases. SGTCCA-Net achieves high signal feature420
identification accuracy even with only 100 subjects in the presence and absence of 421
different phenotype-specific correlation structures and provides nearly-perfect prediction422
when the number of subjects doubles, outperforming the other methods by around15% 423
to 20% in AUC. In case 2, when there is only one type of phenotype-specific correlation424
structure (4-way correlation), SGTCCA-Net, STCCA-Net, and best-performing 425
DIABLO perform equally well with perfect or near-perfect AUC scores. 426
When the underlying latent factors are highly right-skewed and omics data are noisy,427
the performance of all models are impacted, but SGTCCA-Net still outperforms the 428
other methods. This indicates that even when the signal is masked by noise to some429
extent, or normality is not guaranteed, SGTCCA-Net still performs well compared to430
other methods, and its performance improves as the sample size increases. 431
January 18, 2024 15/29
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint
Normal
Latent F actor: Median AUC (Interquartile Range)
Metho
d n SGTCCA-Net STCCA-Net SmCCNet
(Best) DIABLO
(Best)
Case
1 n
= 100 0.885
(0.859, 0.906) 0.656
(0.604, 0.680) 0.706
(0.689, 0.736) 0.745
(0.718, 0.780)
n
= 200 0.978
(0.966, 0.986) 0.694
(0.614, 0.738) 0.742
(0.720, 0.818) 0.825
(0.754, 0.880)
Case
2 n
= 100 1.000
(1.000, 1.000) 0.995
(0.955, 1.000) 0.825
(0.779, 0.853) 0.931
(0.920, 0.954)
n
= 200 1.000
(1.000, 1.000) 1.000
(1.000, 1.000) 0.823
(0.811, 0.837) 0.939
(0.931, 1.000)
Case
3 n
= 100 0.890
(0.864, 0.942) 0.622
(0.583, 0.701) 0.710
(0.679, 0.740) 0.701
(0.670, 0.757)
n
= 200 0.977
(0.965, 0.983) 0.679
(0.635, 0.711) 0.766
(0.726, 0.816) 0.814
(0.786, 0.891)
Sk
ewed Latent F actor: Median AUC (Interquartile Range)
Metho
d n SGTCCA-Net STCCA-Net SmCCNet
(Best) DIABLO
(Best)
Case
1 n
= 100 0.845
(0.814, 0.884) 0.686
(0.603, 0.730) 0.705
(0.664, 0.722) 0.713
(0.675, 0.768)
n
= 200 0.935
(0.921, 0.954) 0.729
(0.675, 0.788) 0.756
(0.726, 0.804) 0.847
(0.813, 0.887)
Case
2 n
= 100 1.000
(1.000, 1.000) 1.000
(0.980, 1.000) 0.824
(0.780, 0.846) 0.939
(0.932, 1.000)
n
= 200 1.000
(1.000, 1.000) 1.000
(1.000, 1.000) 0.817
(0.770, 0.855) 0.949
(0.933, 1.000)
Case
3 n
= 100 0.855
(0.799, 0.891) 0.677
(0.614, 0.719) 0.714
(0.681, 0.748) 0.725
(0.651, 0.777)
n
= 200 0.936
(0.925, 0.949) 0.755
(0.678, 0.795) 0.712
(0.687, 0.735) 0.803
(0.738, 0.839)
Noisy
Omics Data: Median AUC (Interquartile Range)
Metho
d n SGTCCA-Net STCCA-Net SmCCNet
(Best) DIABLO
(Best)
Case
1 n
= 100 0.776
(0.761, 0.805) 0.578
(0.554, 0.595) 0.702
(0.677, 0.724) 0.703
(0.667, 0.780)
n
= 200 0.909
(0.892, 0.919) 0.603
(0.579, 0.638) 0.785
(0.748, 0.801) 0.813
(0.766, 0.920)
Case
2 n
= 100 0.998
(0.996, 1.000) 0.777
(0.666, 0.880) 0.839
(0.789, 0.853) 1.000
(0.918, 1.000)
n
= 200 1.000
(1.000, 1.000) 0.921
(0.822, 0.959) 0.832
(0.815, 0.846) 1.000
(0.936, 1.000)
Case
3 n
= 100 0.781
(0.746, 0.809) 0.556
(0.538, 0.614) 0.710
(0.678, 0.729) 0.701
(0.670, 0.757)
n
= 200 0.916
(0.888, 0.936) 0.583
(0.565, 0.609) 0.782
(0.745, 0.798) 0.799
(0.778, 0.876)
T able 2. Simulation results. Performance is evaluated through the AUC of the precision-recall
curve generated by applying different thresholds to the maximal connection of molecular
features to each other. For this simulation, 20 replications are the AUC median and
interquartile range in parenthesis of is reported. "Best" AUC for SmCCNet and DIABLO
denotes that in each replication, 9 SmCCNet/DIABLO models are run and only the highest
AUC score is recorded. The first table is the simulation study for setting 1, which uses latent
factors simulated from multivariate normal distribution; the second panel is the simulation
study for setting 2, where latent factors are simulated with a highly right-skewed distribution;
the third panel is the simulation study for setting 3, where latent factors are simulated with a
multivariate normal distribution (same as case 1), but strong random noise is enforced on
omics data. Each simulation study contains 3 cases: Case 1 means that the signal molecular
features are defined as all 4-way, 3-way, and pairwise phenotype-specific correlation structure;
case 2 removes the phenotype-specific 4-way correlation structure; case 3 removes the
phenotype-specific 3-way and pairwise correlation structure. In each case, the data is simulated
with 100 or 200 subjects.
0.14 TCGA Breast Cancer Network Analysis 432
After running a network analysis pipeline with SGTCCA and SmCCNet for the 433
phenotype of tumor purity, the final network size for both SGTCCA-Net and SmCCNet434
is 300. Specifically, there are 128 genes, 125 miRNAs, and 47 proteins in the SGTCCA435
network, while there are 128 genes, 59 miRNAs, and 113 proteins in the SmCCNet 436
network. The NetSHy network summarization score correlation with respect to tumor437
purity is 0.689 and 0.500 for SGTCCA-Net and SmCCNet respectively. In the 438
SGTCCA-Net network, there are 55 molecular features with weak correlation with 439
respect to tumor purity (FDR-adjusted > 0.05), while for SmCCNet there are 105 such440
molecular features. To check whether each method includes these nodes incorrectly or441
whether they contribute to the network, we ran the NetSHy network summarization 442
score on these weak nodes only, extracted the first 10 PCs, and calculated the maximum443
correlation between the PC score and tumor purity. For SGTCCA-Net, the maximum444
correlation is -0.232 (p = 0.017), while for SmCCNet, the maximum correlation is only445
-0.176 (p = 0.072). This implies that weak nodes within the SGTCCA-Net network are446
more significantly associated with tumor purity compared to those in the SmCCNet 447
Jan
uary 18, 2024 16/29
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint
Type Name Correlation FDR
Gene (128)
TMC8 -0.712 <0.001
ARHGAP9 -0.752 <0.001
SPIB -0.638 <0.001
HCST -0.712 <0.001
IRF8 -0.748 <0.001
miRNA (125)
hsa-mir-142 -0.564 <0.001
hsa-mir-146a -0.560 <0.001
hsa-mir-150 -0.593 <0.001
hsa-mir-155 -0.534 <0.001
hsa-let-7i -0.572 <0.001
Protein (47)
PCNA -0.564 <0.001
CCNB1 -0.409 <0.001
NRG1 -0.466 <0.001
KAT2A 0.324 0.001
ACVRL1 0.297 0.003
T able 3. Top 5 molecular features from each molecular profile and their individual correlation
with respect to tumor purity for TCGA breast cancer data (with p-value).
network. Furthermore, a total of 139 molecular features are shared between the 448
SGTCCA-Net and SmCCNet methods, with each method having 161 unique molecular449
features. Of the 161 unique molecular features in each network, 125 and 66 are 450
significantly associated with tumor purity (FDR < 0.05) for SGTCCA-Net and 451
SmCCNet, respectively. 452
We calculate the PageRank score for the SGTCCA-Net subnetwork and present the453
top 5 nodes for each molecular profile in Table 3. We observed that all of the molecular454
features in the table are significantly correlated with tumor purity.ACVRL1 and 455
KAT2Aare the only two nodes with low correlation to tumor purity (FDR > 0.001). To456
check whether these nodes contribute to the network, we examine the correlation 457
between these two nodes and other molecular features of the subnetwork. We observed458
that ACVRL1 is highly correlated withPCNA (ρbetween−omics = 0.474), which is highly 459
correlated with tumor purity (ρomics−pheno = −0.564) and KAT2A 460
(ρbetween−omics = 0.476), which is highly correlated with tumor purity 461
(ρomics−pheno = −0.564). 462
The top 5 ontologies from the enrichment analysis for the SGTCCA-Net final 463
subnetwork (Figure 3) includes lymphocyte activation (49/186 genes enriched), positive464
regulation of immune response (45/204 genes enriched), adaptive immune response 465
(38/145 genes enriched), regulation of leukocyte activation (46/227 genes enriched), and466
Adaptive Immune System (37/178 genes enriched). Compared to enrichment analysis467
Result
from SmCCNet, we observe that the ontology results from both networks fall into468
similar categories. However, when only looking at the immune response-related 469
ontologies, a stronger enrichment analysis result is identified in SGTCCA-Net network470
(S1 Fig). Additionally, using MCODE, we observed 6 modules of the protein-protein 471
interaction network as shown in Fig 3b. For example, in the first network module (red),472
one of the proteins isCD3D, which has prognostic potential for breast cancer and 473
associated with lymphocyte infiltration and immune checkpoints [49]. 474
We ran MultiMiR on the miRNAs from each subnetwork. For the SGTCCA-Net 475
network, among 1288 global network genes with non-zero canonical weight, 179/697 of476
them are the validated/predicted target genes of miRNAs in the subnetwork; while for477
the SmCCNet network, among 1024 global network genes with non-zero canonical 478
January 18, 2024 17/29
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint
0 5 10 15 20 25 30
-log10(P)
hsa04650: Natural killer cell mediated cytotoxicity
hsa01521: EGFR tyrosine kinase inhibitor resistance
GO:0006974: DNA damage response
GO:0002683: negative regulation of immune system process
hsa05169: Epstein-Barr virus infection
M141: PID PI3KCI PATHWAY
WP4255: Non-small cell lung cancer
R-HSA-5663202: Diseases of signal transduction by growth factor receptors and second messengers
hsa05340: Primary immunodeficiency
R-HSA-1280215: Cytokine Signaling in Immune system
GO:0045058: T cell selection
GO:0050854: regulation of antigen receptor-mediated signaling pathway
GO:0042113: B cell activation
M34: PID TCR PATHWAY
GO:0001819: positive regulation of cytokine production
R-HSA-1280218: Adaptive Immune System
GO:0002694: regulation of leukocyte activation
GO:0002250: adaptive immune response
GO:0050778: positive regulation of immune response
GO:0046649: lymphocyte activation
(a)
(b)
Fig 3. Enrichment analysis results for TCGA breast cancer data with respect to
tumor purity . (a) The top pathways that are associated with the final network. (b)
Protein-protein interaction (PPI) network for the multi-omics network from SGTCCA-Net with
respect to tumor purity colored by clusters. Clusters are generated based on the Molecular
Complex Detection (MCODE) algorithm.
weight, only 102/212 of them are the validated/predicted target genes of miRNAs in the479
subnetwork. This result suggests that more connections between genes and miRNAs are480
found in the SGTCCA-Net network compared to the SmCCNet network. The 481
enrichment analysis based on the 179 validated targets of SGTCCA-Net subnetworks (S2482
Fig) shows that the ontologies fall into similar categories compared to the gene/protein483
enrichment analysis, including lymphocyte differentiation and adaptive immune system.484
We examined the top combinations (4/3-way correlation) and found that the top 485
4-way correlation implies the top 3-way correlation in some cases (S2 Table). For 486
example, it was observed thathsa-mir-3133 and KDR dominate the 4-way relationship 487
between gene, miRNA, protein, and tumor purity, which also implies the 3-way 488
correlation between hsa-mir-3133, KDR, and tumor purity. Interestingly,BEND4, one 489
of the genes present in the top 4-way correlation, is shown to be the predicted target of490
hsa-mir-3133 with a predicted score of 0.728, andKDR is also the predicted target of 491
January 18, 2024 18/29
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint
hsa-mir-3133 with a predicted score of 0.572. Through examining the 3-way correlation492
we can find that even though the top 3-way correlation could imply the 4-way 493
correlation, the signal is generally weaker. For instance, it was found thathsa-mir-142 494
dominates the gene-miRNA-tumor purity correlation,PCNA dominates both the 495
gene-protein-tumor purity correlation and the miRNA-protein-tumor purity correlation,496
and CCNB1 dominates the miRNA-protein-tumor purity correlation (details in S2 497
Table). However, it was observed that the highest 4-way correlation involved in 498
hsa-mir-142 is 2.405 (rank 94/631,680 across all 4-way correlation), inPCNA is 2.615 499
(rank 59/631,680 across all 4-way correlation), inCCNB1 is 2.093 (rank 191/631,680 500
across all 4-way correlation). A particular example is the 3-way correlation that involves501
CCNB1 and hsa-mir-150, which is observed in the top 10 3-way correlation between 502
miRNAs, proteins and tumor purity, but the highest 4-way correlation involves these503
two molecular features is only 1.914 (rank 398/631,680). This suggests that some 3-way504
correlation combinations may be ignored by the 4-way correlation, and indicates the 505
importance of incorporating the 3-way correlation into the model. 506
Lastly, the PageRank score for each molecular feature in the final subnetwork was507
calculated, and the top 10 molecular features from each molecular profile were selected508
for network visualization in Figure 4. It was noticed that higher connectivity occurs in509
proteins PCNA and CCNB1 (both appear in Table 3 and S2 Table), and both of them510
are observed in the top 3-way miRNA-protein-tumor purity correlation and found as511
prognostic biomarkers for breast cancer [50,51]. 512
Fig 4. SGTCCA-Net network with top 10 molecular features from each molecular
profile. Multi-omics network module for TCGA breast cancer data with respect to tumor
purity. Nodes are genes (red), miRNAs (blue) and RPPAs (green). The edge color denotes
positive correlation (red) or negative correlation (blue) between molecular features with the
width denoting the strength of the connection. Edges are filtered based on Pearson correlation
with a threshold of 0.2.
January 18, 2024 19/29
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint
Type Name Correlation FDR
Gene (68)
HK3 -0.177 0.004
MGAM2 -0.123 0.019
S100A9 -0.138 0.013
BCL6 -0.121 0.029
IRAK3 -0.134 0.013
Protein (43)
Matrix metalloproteinase-9 -0.149 0.012
calgranulin B -0.122 0.029
Complement component C9 -0.163 0.007
Kallistatin 0.158 0.009
Neutrophil gelatinase-associated lipocalin -0.095 0.060
Metabolite (59)
1-oleoylglycerol (18:1) 0.114 0.028
1-stearoyl-2-oleoyl-GPI (18:0/18:1)* 0.094 0.062
1-stearoyl-2-linoleoyl-GPI (18:0/18:2) 0.143 0.013
1-linoleoyl-GPA (18:2)* 0.103 0.042
retinol (Vitamin A) 0.093 0.064
T able 4. Top 5 molecular features from each molecular profile for SGTCCA-Net based on the
PageRank and their individual correlation with respect to FEV1 for COPDGene data (with
p-value).
0.15 COPDGene Network Analysis 513
After running a network analysis pipeline with SGTCCA-Net and SmCCNet for the 514
FEV1 phenotype, the final network size for SGTCCA-Net is 170, while SmCCNet has a515
final network size of 117. The NetSHy summarization score for SGTCCA-Net is 516
correlated with FEV1 with a correlation of 0.211, while for SmCCNet it is only 0.162.517
In the SGTCCA-Net network, there are 58 molecular features with weak correlation 518
with respect to FEV1 (FDR > 0.05), while for SmCCNet, there are 82 such molecular519
features. To check whether each method includes these nodes incorrectly or whether 520
they contribute to the network, we extracted the NetSHy summarization score on these521
weak nodes only, extracted the first 3 PCs, and calculated the maximum correlation522
between PC score and FEV1. For SGTCCA-Net, the maximum correlation is -0.159 523
(p < 10−3), while for SmCCNet, the maximum correlation is only 0.090 (p = 0.053). 524
This implies that the weak nodes included in SGTCCA-Net are shown to be more 525
related to the phenotype than the weak nodes included in SmCCNet. Furthermore, a526
total of 29 molecular features are shared between the SGTCCA-Net and SmCCNet 527
methods, with each method having 141 and 88 distinct molecular features respectively.528
Of the 141 unique molecular features in the SGTCCA-Net network, 89 of them are 529
significantly associated withFEV1 (FDR < 0.05), while of 88 unique molecular features530
in the SmCCNet network, only 14 of them are significantly associated with FEV1 (FDR531
< 0.05). In particular,Troponin T (ρ = −0.278, p < 10−3) and C-reactive protein 532
(ρ = −0.190, p < 10−3) have a strong correlation with respect to FEV1 and are not 533
identified by SmCCNet, and studies have shown that elevatedTroponin T levels during 534
exacerbation are associated with death in patients with COPD [52,53], and more 535
validation will need to be conducted to evaluate whether it is related to FEV1 at the 536
stable state. 537
We calculate the PageRank score for the SGTCCA-Net subnetwork and present the538
top 5 nodes for each molecular profile in Table 4. We observed that most of the 539
molecular features in the table are significantly correlated with FEV1 (FDR < 0.05),540
with exceptions such asNeutrophil gelatinase-associated lipocalin and 541
1-stearoyl-2-oleoyl-GPI (18:0/18:1)*. To check whether these nodes contribute to the 542
network, we examined the correlation between these two nodes and other molecular 543
January 18, 2024 20/29
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint
features of the subnetwork and observed thatNeutrophil gelatinase-associated lipocalin 544
is highly correlated withN-acetylneuraminate (ρbetween−omics = 0.382), which is highly 545
correlated to FEV1 (ρomics−pheno = −0.134) and 1-stearoyl-2-oleoyl-GPI (18:0/18:1)* 546
is highly correlated with s1-(1-enyl-stearoyl)-GPE (P-18:0)* (ρbetween−omics = 0.509), 547
which is highly correlated to FEV1 (ρomics−pheno = 0.151). 548
Pathways such as regulation of TLR by endogenous ligand (7/13 genes enriched) and549
neutrophil degranulation (23/102 genes enriched) are strongly associated with the 550
severity of COPD from the enrichment analysis (Figure 5a) . Toll-like receptor (TLR) 2551
is elevated in monocytes from individuals with COPD [54], and increased degranulation552
in COPD is found to increase on the surface of unstimulated neutrophils in patients553
with COPD and can cause additional airway damage in patients with COPD [55]. In554
particular, when checking the correlation between the network and cell type, we found555
that NetSHy PC1 is highly correlated with the percentage of neutrophil cells 556
(ρ = −0.767), which may explain the significance of the neutrophil degranulation 557
pathway. However, we did not regress out cell type composition in advance as they can558
reflect disease state. 559
We ran the enrichment analysis on the metabolites of the SGTCCA-Net subnetwork560
and found that most of the top pathways are associated with 561
phospholipid/glyerophospholipid and bile acid (S4 Table). Compared to non-smokers, 562
current smokers have decreased phospholipid levels regardless of COPD status [56]. 563
Figure 5b shows that there are three groups of protein-protein interaction based on564
MCODE, and the largest group is centered on the group with SAA1 and MMP9 (red)565
with the enrichment score of 1.5. SAA1 is related to the acute exacerbation of chronic566
obstructive pulmonary disease [57], and higher levels of MMP-9 correspond to a higher567
influx of neutrophils and lymphocytes, signaling an exacerbation of COPD in which a568
higher burden of MMP-9 is observed in the airways [58]. 569
We examined the top combinations (4/3-way correlation) and found that the top 570
4-way correlation implies the top 3-way correlation in some cases (S3 Table). For 571
instance, it was observed thatMTCO1P12, Tyrosine-protein kinase Yes, and 572
C-glycosyltryptophan dominate the 4-way relationship between gene, protein, metabolite,573
and FEV1, which also implies the 3-way correlation betweenTyrosine-protein kinase 574
Yes (the protein product ofYES1), C-glycosyltryptophan and FEV1. Through 575
examining the 3-way correlation we can find that even though the top 3-way correlation576
could imply the 4-way correlation but the signal is generally weaker. For instance, it577
was found thatMatrix metalloproteinase-9 dominates the gene-protein-fev1 correlation, 578
1-stearoyl-2-oleoyl-GPI (18:0/18:1)* dominates the gene-metabolite-fev1 correlation, 579
and C-glycosyltryptophan and iminodiacetate (IDA) dominate the 580
protein-metabolite-fev1 correlation (details in S3 Table). However, it was observed that581
the highest 4-way correlation involved inMatrix metalloproteinase-9 is 0.431 (rank 582
233/172,516 across all 4-way correlation), iniminodiacetate (IDA) is 0.580 (rank 583
30/172,516 across all 4-way correlation). A particular example is the 3-way correlation584
that involvesMatrix metalloproteinase-9 and other genes, we found that all top 10 585
3-way correlation between gene, protein, and FEV1 involve this protein, but only 2 of 586
the genes are shown in the top 1000 4-way correlation along withMatrix 587
metalloproteinase-9. This suggests that some 3-way correlation combinations may be 588
ignored by the 4-way correlation, and indicates the importance of incorporating the 589
3-way correlation into the model. 590
Furthermore, the PageRank score for each molecular feature in the final subnetwork591
was calculated, and the top 10 molecular features from each molecular profile were 592
selected for visualization of the network in Fig. 6. In particular, the C-reactive protein593
has been shown to be associated with poor lung function [59]; the expression level of594
S100A9 is elevated in patients with COPD, which triggers neutrophil degranulation and595
January 18, 2024 21/29
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint
0 1 2 3 4 5
-log10(P)
GO:0030574: collagen catabolic process
hsa00500: Starch and sucrose metabolism
GO:0006957: complement activation, alternative pathway
R-HSA-6803157: Antimicrobial peptides
GO:1901570: fatty acid derivative biosynthetic process
GO:0002523: leukocyte migration involved in inflammatory response
GO:0045087: innate immune response
GO:0006953: acute-phase response
R-HSA-1222556: ROS and RNS production in phagocytes
GO:0001818: negative regulation of cytokine production
R-HSA-6798695: Neutrophil degranulation
R-HSA-5686938: Regulation of TLR by endogenous ligand
(a)
(b)
Fig 5. Enrichment analysis results for COPDGene data with respect to FEV1. (a)
The top pathways that are associated with the final network. (b) Protein-protein interaction
(PPI) network for the multi-omics network from SGTCCA-Net with respect to FEV1 colored
by clusters. Clusters are generated based on the Molecular Complex Detection (MCODE)
algorithm.
releases inflammatory and proteolytic enzymes [60]. 596
We also repeated the analysis on FEV1 percent predicted (FEV11PP), which takes 597
into account body habitus and race. After applying the SGTCCA-Net toFEV11PP, we 598
identified a final subnetwork with 43 features. In particular, 16 of these features exhibit599
a strong correlation withFEV11PP (with |ρ| > 0.15), and are all present in theFEV11 600
final subnetwork. This significant overlap between the FEV11 and FEV11PP 601
subnetworks suggests a consistent molecular pattern that influences both metrics. 602
Furthermore, Metascape enrichment analysis highlighted similar patterns in both 603
FEV11 and FEV11PP networks, with shared pathways including acute-phase response,604
inflammatory response, and neutrophil degranulation (S3 Fig). Interestingly, these 605
molecular features generally show a higher correlation withFEV11PP than withFEV11 606
(detailed in S5 Table). This observation demonstrates that the associations between 607
top-correlated molecular features and pulmonary function remain robust, unaffected by608
variations in body habitus and race. 609
January 18, 2024 22/29
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint
Fig 6. SGTCCA-Net network with top 10 molecular features from each molecular
profile. Multi-omics network module for COPDGene data with respect to FEV 1. The red
nodes stand for genes, the purple nodes stand for proteins, and the black nodes stand for
metabolites. The width and color depth of the edge stands for the strength of the connection
between two molecular features and the type of color stands for whether two nodes are
positively correlated (red) or negatively correlated (blue). Edges are filtered based on Pearson
correlation with a threshold of 0.2.
Discussion
610
In this work, we introduce a new multi-omics network analysis pipeline, Sparse 611
Generalized Tensor Canonical Correlation Analysis Network Inference (SGTCCA-Net),612
designed for biomarker identification through both higher-order and lower-order 613
correlations, thereby creating molecular interaction networks pertinent to specific 614
phenotypes. We propose a new approach, Sparse Generalized Tensor Canonical 615
Correlation Analysis (SGTCCA), which extends Tensor Canonical Correlation Analysis616
to accommodate multiple correlation structures at once, ensuring sparsity via the biased617
subsampling technique. Additionally, we bring forth a novel network analysis and 618
network pruning algorithm to derive multi-omics subnetwork modules associated with619
the phenotype of interest. 620
Contrary to existing CCA-based multi-omics network inference approaches like 621
SmCCNet and DIABLO, which only examine pairwise correlations, SGTCCA-Net 622
encapsulates both higher and lower-order correlations. Our simulation analyses 623
demonstrate that SGTCCA-Net is adept at pinpointing features with diverse 624
phenotype-specific correlation structures, thereby surpassing the other two CCA-based625
Methods
across varied simulation settings. 626
Our real data analysis highlights the ability of SGTCCA-Net to identify molecular627
features for diseases such as cancer and chronic obstructive pulmonary disease (COPD).628
Interestingly, the biomarkers identified are not limited to those with strong phenotype629
January 18, 2024 23/29
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint
correlations; we also identified molecular features with weak phenotype correlations that630
are highly connected to other nodes and contribute to the network. Furthermore, we631
validated our findings using various techniques, including enrichment analysis of 632
biological pathways and miRNA-target relationships. These validations confirm the 633
enhanced performance of SGTCCA-Net over SmCCNet and the ability to find relevant634
pathways. Furthermore, we also demonstrate the top higher-order correlation associated635
with the final subnetwork. This study showcases the potential for the integration of 636
multi-omics data to improve our understanding of complex diseases, such as COPD and637
breast cancer, and highlights the utility of SGTCCA-Net as a powerful tool for 638
analyzing such data. 639
Beyond its primary use in multi-omics network inference, SGTCCA has broader 640
applications, and can be useful for multi-view dimensional reduction tasks. It can be641
used to project omics data into a shared lower-dimensional space, suitable for both 642
supervised tasks, such as disease classification, and unsupervised tasks, such as 643
subtyping. 644
Supporting information 645
S1 Fig. Enrichment analysis ontology results based on the joint set of 646
gene/protein for TCGA breast cancer data and COPDGene data for 647
SmCCNet subnetwork. The top pathways that are associated with the final network648
of SmCCNet based on Metascape. a: Top ontology pathways for TCGA breast cancer649
data with respect to tumor purity; b: Top ontology pathways for COPDGene data with650
respect to FEV1. 651
S2 Fig. Results of the enrichment analysis ontology based on the validated 652
miRNA target genes for the TCGA breast cancer data for the 653
SGTCCA-Net subnetwork. Top ontology pathways based on validated miRNA 654
target genes for the breast cancer data from TCGA with respect to tumor purity. 655
S3 Fig. Enrichment Analysis Results of SGTCCA-Net for FEV1 Percent 656
Predicted: This figure focused on pathways enriched in the joint gene/protein set 657
derived from the COPDGene data, specifically within the SGTCCA-Net subnetwork 658
associated withFEV1 Percent Predicted. The identified pathways provide insights into659
the molecular mechanisms linked to pulmonary function as measured byFEV1 Percent 660
Predicted. 661
S1 T ext. Higher-order Covariance T ensor for both Odd and Even Number 662
of Views 663
S2 T ext. Proof of Theorem 1 664
S3 T ext. Sparse Generalized T ensor Canonical Correlation Analysis 665
(SGTCCA) Algorithm 666
S4 T ext. Biased Subsampling Covariance Density 667
S5 T ext. Network Analysis and Network Pruning Algorithm 668
S6 T ext. Additional Detail of Simulation Study Setup 669
January 18, 2024 24/29
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint
S1 T able. Simulation Setup T able. Summary table of all simulation scenarios. 670
Starting with 3 omics data simulation settings: (1) normal latent factors, (2) highly 671
right-skewed latent factors, and (3) noisy omics data. Each of these settings has 3 cases:672
case 1 with all phenotype-specific correlation structures; case 2 with only 4-way 673
phenotype-specific correlation structure; and (3) case 3 with 3-way and pairwise 674
phenotype-specific correlation structure. All these cases will be evaluated based on 675
sample sizes of 100 and 200.Σ represents a diagonal matrix with diagonal values to be676
1. 677
S2 T able. T op Network Higher-order Correlation for TCGA Breast Cancer 678
Network. Top 5 network higher-order correlation of breast cancer data of different 679
correlation structures. 680
S3 T able. T op network higher-order correlation for the COPDGene 681
network. Top 5 network higher-order correlation of COPDGene data of different 682
correlation structures. 683
S4 T able. T op Metabolite Enrichment Pathways. Top 10 pathways of 684
metabolite enrichment analysis result from IMPaLa for SGTCCA-Net subnetwork. 685
S5 T able. Overlap Molecular F eatures between FEV1 and FEV1 Percent 686
Predicted Network Molecular features overlap between FEV1 and FEV1 percent 687
predicted network generated from SGTCCA-Net (where absolute correlation with 688
phenotype >0.15). 689
Implementation 690
The model, simulations, and real data experiment were implemented using R version691
4.2.2. All code and scripts used in this study are publicly available via GitHub at 692
https://github.com/liux4283/SparseGTCCANet. 693
A vailability 694
The TCGA breast cancer data used in the real data experiment section is available at:695
http://linkedomics.org/data_download/TCGA-BRCA/. Clinical data and 696
SOMAScan data are available through COPDGene 697
(https://www.ncbi.nlm.nih.gov/gap/, ID: phs000179.v6.p2). RNA-Seq data are 698
available through dbGaP (https://www.ncbi.nlm.nih.gov/gap/, ID: phs000765.v3.p2). 699
Metabolon data is available at Metabolomics Workbench 700
(https://www.metabolomicsworkbench.org/ ID: PR000907). The source code for 701
SGTCCA-Net is available at https://github.com/liux4283/SparseGTCCANet. 702
Acknowledgments 703
We thank all members of the NetCO team for all the insightful feedback on this novel704
pipeline. This work is also supported in part by funds from the National Heart, Lung,705
and Blood Institute, National Institutes of Health (R01 HL152735 and TransOmics for706
Precision Medicines Fellowship). 707
This work was also supported by NHLBI grants U01 HL089897 and U01 HL089856708
and by NIH contract 75N92023D00011. The COPDGene study (NCT00608764) has also709
January 18, 2024 25/29
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint
been supported by the COPD Foundation through contributions made to an Industry710
Advisory Committee that has included AstraZeneca, Bayer Pharmaceuticals, 711
Boehringer-Ingelheim, Genentech, GlaxoSmithKline, Novartis, Pfizer, and Sunovion. 712
References
1. Xu C, Tao D, Xu C. A survey on multi-view learning. arXiv preprint
arXiv:13045634. 2013;.
2. Bickel S, Scheffer T. Multi-view clustering. In: ICDM. vol. 4; 2004. p. 19–26.
3. Kalayeh MM, Idrees H, Shah M. NMF-KNN: Image annotation using weighted
multi-view non-negative matrix factorization. In: Proceedings of the IEEE
conference on computer vision and pattern recognition; 2014. p. 184–191.
4. Rappoport N, Shamir R. Multi-omic and multi-view clustering algorithms:
review and cancer benchmark. Nucleic acids research. 2018;46(20):10546–10562.
5. Nicora G, Vitali F, Dagliati A, Geifman N, Bellazzi R. Integrated multi-omics
analyses in oncology: a review of machine learning methods and tools. Frontiers
in oncology. 2020;10:1030.
6. Subramanian I, Verma S, Kumar S, Jere A, Anamika K. Multi-omics data
integration, interpretation, and its application. Bioinformatics and biology
insights. 2020;14:1177932219899051.
7.
Gilchrist A, Au CE, Hiding J, Bell AW, Fernandez-Rodriguez J, Lesimple S, et al.
Quantitative proteomics analysis of the secretory pathway. Cell.
2006;127(6):1265–1281.
8. Monteiro M, Carvalho M, Bastos M, Guedes de Pinho P. Metabolomics analysis
for biomarker discovery: advances and challenges. Current medicinal chemistry.
2013;20(2):257–271.
9. Picard M, Scott-Boyer MP, Bodein A, Périn O, Droit A. Integration strategies of
multi-omics data for machine learning analysis. Computational and Structural
Biotechnology Journal. 2021;19:3735–3746.
10. Menyhárt O, Győrffy B. Multi-omics approaches in cancer research with
applications in tumor subtyping, prognosis, and diagnosis. Computational and
structural biotechnology journal. 2021;19:949–960.
11. Duan R, Gao L, Gao Y, Hu Y, Xu H, Huang M, et al. Evaluation and
comparison of multi-omics data integration methods for cancer subtyping. PLoS
computational biology. 2021;17(8):e1009224.
12. Lin E, Lane HY. Machine learning and systems genomics approaches for
multi-omics data. Biomarker research. 2017;5(1):1–6.
13. Pierre-Jean M, Deleuze JF, Le Floch E, Mauger F. Clustering and variable
selection evaluation of 13 unsupervised methods for multi-omics data integration.
Briefings in bioinformatics. 2020;21(6):2011–2030.
14. Jiang D, Armour CR, Hu C, Mei M, Tian C, Sharpton TJ, et al. Microbiome
multi-omics network analysis: statistical considerations, limitations, and
opportunities. Frontiers in genetics. 2019;10:995.
January 18, 2024 26/29
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint
15. Zhou G, Li S, Xia J. Network-based approaches for multi-omics integration.
Computational Methods and Data Analysis for Metabolomics. 2020; p. 469–487.
16. Wu C, Zhou F, Ren J, Li X, Jiang Y, Ma S. A selective review of multi-level
omics data integration using variable selection. High-throughput. 2019;8(1):4.
17. Yan J, Risacher SL, Shen L, Saykin AJ. Network approaches to systems biology
analysis of complex disease: integrative methods for multi-omics data. Briefings
in bioinformatics. 2018;19(6):1370–1381.
18. Liu J, Wang C, Gao J, Han J. Multi-view clustering via joint nonnegative matrix
factorization. In: Proceedings of the 2013 SIAM International Conference on
Data Mining. SIAM; 2013. p. 252–260.
19. Zhang C, Fu H, Hu Q, Zhu P, Cao X. Flexible multi-view dimensionality
co-reduction. IEEE Transactions on Image Processing. 2016;26(2):648–659.
20. Hotelling H. Relations between two sets of variates. In: Breakthroughs in
statistics. Springer; 1992. p. 162–190.
21. Witten DM, Tibshirani RJ. Extensions of sparse canonical correlation analysis
with applications to genomic data. Statistical applications in genetics and
molecular biology. 2009;8(1).
22. Haury AC, Mordelet F, Vera-Licona P, Vert JP. TIGRESS: trustful inference of
gene regulation using stability selection. BMC systems biology. 2012;6(1):1–17.
23.
Huynh-Thu VA, Irrthum A, Wehenkel L, Geurts P. Inferring regulatory networks
from expression data using tree-based methods. PloS one. 2010;5(9):e12776.
24. Shi WJ, Zhuang Y, Russell PH, Hobbs BD, Parker MM, Castaldi PJ, et al.
Unsupervised discovery of phenotype-specific multi-omics networks.
Bioinformatics. 2019;35(21):4336–4343.
25. Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the
Royal Statistical Society: Series B (Methodological). 1996;58(1):267–288.
26. Mastej E, Gillenwater L, Zhuang Y, Pratte KA, Bowler RP, Kechris K.
Identifying Protein–metabolite Networks Associated with COPD Phenotypes.
Metabolites. 2020;10(4):124.
27. Singh A, Shannon CP, Gautier B, Rohart F, Vacher M, Tebbutt SJ, et al.
DIABLO: an integrative approach for identifying key molecular drivers from
multi-omics assays. Bioinformatics. 2019;35(17):3055–3062.
28. Luo Y, Tao D, Ramamohanarao K, Xu C, Wen Y. Tensor canonical correlation
analysis for multi-view dimension reduction. IEEE transactions on Knowledge
and Data Engineering. 2015;27(11):3111–3124.
29.
Wong HS, Wang L, Chan R, Zeng T. Deep Tensor CCA for Multi-view Learning.
arXiv preprint arXiv:200511914. 2020;.
30. Page L, Brin S, Motwani R, Winograd T. The PageRank citation ranking:
Bringing order to the web. Stanford InfoLab; 1999.
31. Vu T, Litkowski EM, Liu W, Pratte KA, Lange L, Bowler RP, et al. NetSHy:
network summarization via a hybrid approach leveraging topological properties.
Bioinformatics. 2023;39(1):btac818.
January 18, 2024 27/29
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint
32. De Lathauwer L, De Moor B, Vandewalle J. On the best rank-1 and rank-(r 1, r
2,..., rn) approximation of higher-order tensors. SIAM journal on Matrix Analysis
and Applications. 2000;21(4):1324–1342.
33. Kolda TG, Bader BW. Tensor decompositions and applications. SIAM review.
2009;51(3):455–500.
34. Dai Yh, Yuan Y. An efficient hybrid conjugate gradient method for
unconstrained optimization. Annals of Operations Research. 2001;103:33–47.
35. Nash JC. Compact numerical methods for computers: linear algebra and function
minimisation. CRC press; 1990.
36. Acar E, Papalexakis EE, Gürdeniz G, Rasmussen MA, Lawaetz AJ, Nilsson M,
et al. Structure-revealing data fusion. BMC bioinformatics. 2014;15(1):1–17.
37. Zou H, Hastie T. Regularization and variable selection via the elastic net.
Journal of the royal statistical society: series B (statistical methodology).
2005;67(2):301–320.
38. Papalexakis EE, Faloutsos C, Mitchell TM, Talukdar PP, Sidiropoulos ND,
Murphy B. Turbo-smt: Accelerating coupled sparse matrix-tensor factorizations
by 200x. In: Proceedings of the 2014 SIAM International Conference on Data
Mining. SIAM; 2014. p. 118–126.
39. Langville AN, Meyer CD. Google’s PageRank and beyond. In: Google’s
PageRank and Beyond. Princeton university press; 2011.
40. Hu W, Lin D, Cao S, Liu J, Chen J, Calhoun VD, et al. Adaptive sparse multiple
canonical correlation analysis with application to imaging (epi) genomics study of
schizophrenia. IEEE Transactions on Biomedical Engineering.
2017;65(2):390–399.
41. Foldnes N, Olsson UH. A simple simulation technique for nonnormal data with
prespecified skewness, kurtosis, and covariance matrix. Multivariate behavioral
research. 2016;51(2-3):207–219.
42. Liu W, Vu T, Konigsberg IR, Pratte KA, Zhuang Y, Kechris KJ. SmCCNet 2.0:
an Upgraded R package for Multi-omics Network Inference. bioRxiv. 2023; p.
2023–11.
43. Li Y, Umbach DM, Bingham A, Li QJ, Zhuang Y, Li L. Putative biomarkers for
predicting tumor sample purity based on gene expression data. BMC genomics.
2019;20(1):1–12.
44. Zhou Y, Zhou B, Pache L, Chang M, Khodabakhshi AH, Tanaseichuk O, et al.
Metascape provides a biologist-oriented resource for the analysis of systems-level
datasets. Nature communications. 2019;10(1):1523.
45. Ru Y, Kechris KJ, Tabakoff B, Hoffman P, Radcliffe RA, Bowler R, et al. The
multiMiR R package and database: integration of microRNA–target interactions
along with their disease and drug associations. Nucleic acids research.
2014;42(17):e133–e133.
46. Regan EA, Hokanson JE, Murphy JR, Make B, Lynch DA, Beaty TH, et al.
Genetic epidemiology of COPD (COPDGene) study design. COPD: Journal of
Chronic Obstructive Pulmonary Disease. 2011;7(1):32–43.
January 18, 2024 28/29
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint
47. Gillenwater LA, Helmi S, Stene E, Pratte KA, Zhuang Y, Schuyler RP, et al.
Multi-omics subtyping pipeline for chronic obstructive pulmonary disease. PloS
one. 2021;16(8):e0255337.
48. Kamburov A, Cavill R, Ebbels TM, Herwig R, Keun HC. Integrated
pathway-level analysis of transcriptomics and metabolomics data with IMPaLA.
Bioinformatics. 2011;27(20):2917–2918.
49. Zhu Z, Ye W, Wu X, Lin S, Xu J, Li L, et al. Comprehensive analysis reveals a
prognostic and therapeutic biomarker CD3D in the breast carcinoma
microenvironment. Bioscience reports. 2021;41(1):BSR20202898.
50. Ding K, Li W, Zou Z, Zou X, Wang C. CCNB1 is a prognostic biomarker for
ER+ breast cancer. Medical hypotheses. 2014;83(3):359–364.
51. Juríková M, Danihel L, Polák Š, Varga I. Ki67, PCNA, and MCM proteins:
Markers of proliferation in the diagnosis of breast cancer. Acta histochemica.
2016;118(5):544–552.
52. Blaschko MB, Lampert CH. Correlational spectral clustering. In: 2008 IEEE
Conference on Computer Vision and Pattern Recognition. IEEE; 2008. p. 1–8.
53. Elmenawi KA, Anil V, Gosal H, Kaur H, Ngassa HC, Mohammed L. The
Importance of Measuring Troponin in Chronic Obstructive Pulmonary Disease
Exacerbations: A Systematic Review. Cureus. 2021;13(8).
54.
Pons J, Sauleda J, Regueiro V, Santos C, López M, Ferrer J, et al. Expression of
Toll-like receptor 2 is up-regulated in monocytes from patients with chronic
obstructive pulmonary disease. Respiratory research. 2006;7(1):1–9.
55. Jasper AE, McIver WJ, Sapey E, Walton GM. Understanding the role of
neutrophils in chronic inflammatory airway disease. F1000Research. 2019;8.
56. Moré JM, Voelker DR, Silveira LJ, Edwards MG, Chan ED, Bowler RP. Smoking
reduces surfactant protein D and phospholipids in patients with and without
chronic obstructive pulmonary disease. BMC pulmonary medicine.
2010;10(1):1–8.
57.
Vietri L, Fui A, Bergantini L, d’Alessandro M, Cameli P, Sestini P, et al. Serum
amyloid A: A potential biomarker of lung disorders. Respiratory Investigation.
2020;58(1):21–27.
58. Mercer P, Shute J, Bhowmik A, Donaldson G, Wedzicha J, Warner J. MMP-9,
TIMP-1 and inflammatory cells in sputum from COPD patients during
exacerbation. Respiratory research. 2005;6:1–9.
59. Aksu F, Capan N, Aksu K, Ofluoğlu R, Canbakan S, Yavuz B, et al. C-reactive
protein levels are raised in stable Chronic obstructive pulmonary disease patients
independent of smoking behavior and biomass exposure. Journal of thoracic
disease. 2013;5(4):414.
60. Railwah C, Lora A, Zahid K, Goldenberg H, Campos M, Wyman A, et al.
Cigarette smoke induction of S100A9 contributes to chronic obstructive
pulmonary disease. American Journal of Physiology-Lung Cellular and Molecular
Physiology. 2020;319(6):L1021–L1035.
January 18, 2024 29/29
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 25, 2024. ; https://doi.org/10.1101/2024.01.22.576667doi: bioRxiv preprint
Text is read by the "Ask this paper" AI Q&A widget below.
Extraction quality varies by source — PMC NXML preserves structure
cleanly, OA-HTML may include some navigation residue, and OA-PDF can
have broken hyphenation. The publisher copy
(via DOI)
is the canonical version.