Keywords
microbiome community-level association studies; integrative analysis; distance-25
based association analysis; similarity matrix regression26
2
(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint
1 Introduction27
Microbiome research has rapidly expanded in recent years, revealing complex relationships28
between microbial communities and human health. However, individual studies that explore the29
relationship between the microbiome and disease frequently report inconsistent results. For exam-30
ple, studies on the gut microbiome and HIV infections have shown conflicting results. While most31
studies found a decreased microbial alpha diversity in HIV-infected individuals compared to unin-32
fected individuals [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], some studies detected no significant difference33
[12, 13, 14, 15] and even an increase of microbial diversity [16]. Research on structural differences34
in microbial communities (e.g., beta diversity) between HIV-positive and HIV-negative individuals35
has also yielded mixed results. Some studies report significant differences in the microbiome beta36
diversity of HIV-infected versus uninfected individuals [2, 14], while others suggest that these dif-37
ferences may be attributed more to variations in sexual behavior, especially MSM (men who have38
sex with men) status, rather than HIV infection itself [3]. Similar inconsistencies in microbial di-39
versity and community structure have also been observed in other diseases, highlighting a broader40
challenge in microbiome research. These inconsistencies can arise from small sample sizes, pop-41
ulation heterogeneity, varied study designs, and differences in analytical techniques, which may42
lead to bias and limit the generalizability of findings.43
To address these issues and enhance reproducibility, integrative analysis of multiple micro-44
biome datasets is essential to aggregate information across multiple studies. In this paper, we45
focus on the association test between the overall microbiome community profile, measured by46
beta-diversities, and outcomes of interest while incorporating multiple datasets. For such tasks47
within a single study, there are two popular approaches: distance-based analysis [17, 18] and the48
microbiome regression kernel association test (MiRKAT) [19]. The distance-based approach in-49
volves computing a beta-diversity distance between individuals’ microbiome profiles and testing50
for association with the outcome of interest via permutation. MiRKAT utilizes a semi-parametric51
kernel machine regression in which a kernel similarity matrix, derived from the beta-diversity ma-52
trix, is assessed for association with the outcome. Here, the kernel similarity matrix is transformed53
3
(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint
from of a beta-diversity matrix. Both methods require an assessment of similarity or dissimilar-54
ity(distance) between all pairs of samples.55
When extending to multiple studies, neither distance-based methods nor MiRKAT are directly56
applicable due to the lack of a complete pairwise distance matrix across all subjects and studies.57
In cases where data are obtained from public resources without access to original sequencing files,58
differences in preprocessing steps across studies make it impossible to construct consistent beta-59
diversity matrices—distinct OTU (operational taxonomic unit) assignments and phylogenetic trees60
are common. Even when re-processing data across studies is feasible, differential biases intro-61
duced through various sequencing protocols can invalidate beta diversity measurements between62
samples from different studies. Thus, in most cases, only within-study distances are available,63
while between-study distances are missing.64
Herein, we propose a novel statistical method, SMRmix, to test the association between micro-65
biome communities and outcomes of interests by aggregating information across multiple studies.66
Our method is built upon similarity matrix regression (SMR) and only requires within-study dis-67
tances. Instead of modeling the direct relationship between traits and microbiome similarity, SMR68
regresses pairwise trait similarity on pairwise microbiome similarity for association testing. Trait69
similarity is calculated as the outer product of residuals from a null regression model, and micro-70
biome similarity is calculated from the within-study beta diversity matrix. Associations are then71
detected using a score test. We extend SMR by incorporating mixed effects, vectorizing similar-72
ity matrices, and deriving the corresponding score test statistic through the likelihood. The new73
method, SMRmix, enables us to account for study-level random effects and avoid reliance on inac-74
cessible between-study distances. Our method also enables development of an “optimal” omnibus75
test, utilizing multiple distances within a single study to enhance power.76
Through extensive simulations, we demonstrate that the proposed SMRmix has well-controlled77
type I error rates, and higher or comparable power than potential competitors, such as CSKAT [20]78
and GLMM-MiRKAT [21], both of which require complete distance matrices across all studies,79
which are often inaccessible. We apply SMRmix to two datasets: (1) the HIV-reanalysis consor-80
4
(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint
tium dataset, comprising 17 studies on gut dysbiosis in HIV , where we observed strong associations81
between gut microbiome composition and both HIV infection and MSM (men who have sex with82
men) status, and (2) a dataset of 11 studies on gut microbiome in colorectal cancer (CRC), con-83
firming significant dysbiosis in affected individuals compared to healthy controls. These analyses84
illustrate the capability of SMRmix to detect consistent patterns of microbiome composition com-85
position across different conditions.86
2 Materials and methods87
SMRmix evaluates the association between a binary or continuous outcome and microbiome88
communities by integrating data from multiple microbiome studies. It extends similarity matrix89
regression to incorporate multiple studies, utilizing only intra-study kernel matrices, thereby elim-90
inating the need for between-study similarity assessments. In what follows, we first present the91
microbiome regression-based kernel association test for a single study and its link to Similarity92
Matrix Regression (SMR), followed by its extension to incorporate multiple studies.93
2.1 Microbiome regression-based kernel association test for a single study94
and its connection with similarity matrix regression95
When analyzing data from a single study, one popular approach is the microbiome regression-96
based kernel association test (MiRKAT) [19]. In this method, data are summarized as independent97
observations of the form (Yi, Xi, Mi) for i = 1, · · ·, n. Here, Yi represents the outcome of interest,98
which can be either continuous or binary. The covariates we want to adjust for, such as age, gender,99
and MSM status, are included in Xi = (1, xi1, · · ·, xip)T , where p is the number of covariates. The100
abundances of all taxa/OTUs in the microbial community are represented by Mi = (mi1, · · ·, miq),101
where q is the total number of taxa/OTUs.102
MiRKAT compares pairwise similarity in the outcome variable to pairwise similarity in the103
microbiome profiles, with a high correspondence suggesting an association. It employs a kernel104
5
(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint
machine regression framework [22, 23, 24] to relate covariates and microbiome profiles to out-105
comes. Specifically, for a continuous outcome, we assume a linear kernel machine regression106
model,107
Yi = Xiβ + f (Mi) +εi , (1)108
and for a binary outcome, we use the logistic kernel machine regression model,109
logit(P(Yi = 1)) = Xiβ + f (Mi) , (2)110
where εi is the random error with mean 0 and a common variance, f (·) captures the association111
between microbial community and the outcome, and additional covariates Xi are adjusted for112
through a linear function. Under the null hypothesis of no microbiome effect, f (M ) = 0.113
In MiRKAT, f (Mi) is assumed to belong to a reproducing kernel Hilbert space generated by114
a positive definite kernel function K(·, ·), such that f (Mi) = ∑n
i′=1 αi′K(Mi, Mi′). The (i, i′)-th115
element of the kernel matrix K, defined as K(Mi, Mi′), measures the pairwise similarity between116
the microbiome profiles of the i-th and the i′-th samples. To utilize the phylogenetic/taxonomic in-117
formation contained in the microbiome sequences, the kernel similarity matrix can be constructed118
by exploiting its correspondence with the well-defined beta-diversities, which measure dissimi-119
larities between subjects through: K = − 1
2 (I − 11′
n )D2(I − 11′
n ), where D represents pairwise120
dissimilarities/distances between all subjects, calculated using beta-diversity metrics such as the121
Bray-Curtis or UniFrac distance. A variance component score test is then calculated through the122
connection between the kernel machine regression model and linear mixed models [19]: Q =123
1
2φ (Y − ˆµ0)′K(Y − ˆµ0), where ˆµ0 are the fitted values of Y under the null, and φ is the disper-124
sion parameter.125
An alternative approach to investigate the association between the outcome variable and a set of126
“-omics” markers is similarity matrix regression (SMR) [25]. In microbiome association studies,127
6
(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint
the SMR model can be written as128
Zi j = b × Ki j + ei j, i < j, (3)129
where Zi j, the trait similarity between subject i and j, is defined as the outer product of trait resid-130
uals, Zi j = (Yi − ˆµ0
i )(Yj − ˆµ0
j ), and ˆµ0
i is the fitted values of Yi under the null hypothesis. Ki j is the131
i, j-th element of microbial kernel similarity matrix. Intuitively, if the outcome is associated with132
microbial profiles, subjects with higher outcome similarity should also have higher microbiome133
similarity.134
Under the null, assuming that all Yi’s are independent, the score statistic for testing in SMR is135
Ub = ∑i< j Ki j(v0
i )−1(v0
j)−1Zi j = ∑i< j Ki j(v0
i )−1(v0
j)−1(Yi − ˆµ0
i )(Yj − ˆµ0
j ), where v0
i = Var(Yi|Xi, Mi)136
under the null. This statistic can be written in quadratic form as:Ub = 1
2 (Y − ˆµ0)′V −1K0V −1(Y −137
ˆµ0), where K0 is a matrix with diagonal elements equal to 0 and (i, j)-th off-diagonal elements138
equal to Ki j. Under the null, Ub approximately follows a weighted chi-squared distribution, which139
can be used to assess the statistical significance.140
Formulating SMR in vector form, we have141
Zvec = b × Kvec + evec, (4)142
where Zvec and Kvec are stacked vectors of Zi j, Ki j with all available pairwise distances (i, j).143
There is a close connection between MiRKAT and SMR [26]. Specifically, the square of the em-144
pirical sample correlation between outcome similarity and microbiome similarity in SMR, denoted145
by R2, is proportional to the score statistic in MiRKAT. Note that theR2 statistic represents the pro-146
portion of variation explained in SMR.147
Although SMR is statistically equivalent to MiRKAT, the vector form of SMR allows for in-148
tegration of multiple studies when between-study similarity matrices are not available. This is149
because we can exclude pairs with missing distances from the similarity vectors Zvec and Kvec.150
7
(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint
2.2 SMRmix for integrating multiple microbiome datasets151
Suppose we have a total of K microbiome studies. The data for the i-th subject in the k-th152
study is denoted as (Yik, Xik, Mik), where i = 1, · · ·, nk, k = 1, · · ·, K. Here, nk represents the153
number of subjects in k−th study, and N = ∑K
k=1 nk denotes the total number of subjects across154
K studies. We assume all studies are independent and do not share any subjects, and that all155
samples are independent within each study. A common violation of the second assumption occurs156
when multiple samples are longitudinal measurements from the same participant or are clustered157
within families in a single study. Thus, we assume that samples are clustered by studies, without158
additional clustering structures, although our method can be extended to allow for within-study159
correlations.160
When microbiome data from different studies are processed through distinct protocols, subjects161
from these studies are not directly comparable, preventing calculation of between-study beta diver-162
sities. As a result, only intra-study microbial dissimilarities are available for association testing.163
Suppose the similarity matrix for all subjects across all studies (assuming it exists) is represented164
as follows:165
Kfull =
K1 K12 · · · K1K
K21 K2 · · · K2K
..
. .
.
. ... .
.
.
KK1 KK2 · · · KK
166
Here, the diagonal blocks Kk represent the kernel similarity matrices for samples within the same167
study k, and the off-diagonals represent the microbiome similarity for pairs of samples from dif-168
ferent studies. For existing kernel-based/distance-based methods, all elements in Kfull must be169
calculated using the same distance; otherwise, the absolute values of pairwise kernels using differ-170
ent metrics are not on the same scale and thus not comparable.171
In practice, we often only have access to the diagonal blocks, represented as follows:172
8
(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint
K =
K1
K2
...
KK
173
In this matrix, only diagonal blocks of K are obtained, while the cells outside the diagonal blocks174
are missing. Having only the elements in diagonal blocks gives us the flexibility to choose different175
kernels for each study. For example, K1 may be a UniFrac kernel while K2 is a Bray-Curtis176
kernel. While it is advantageous to interpret the results when all kernels involved are of the same177
type, using different kernels for each study remains valid. This flexibility is particularly helpful178
when kernels of the same type are not available for some studies. For example, we may be able179
to construct a UniFrac kernel for study 1, but not for study 2 if the necessary phylogenetic tree180
information is unavailable, such as when using publicly available data. In this paper, we will181
focus on the situations where all individual Kk are of the same type for simplicity. The vectorized182
SMR model enables us to test the association based on these blocks without the need of the off183
diagonal-blocks, by defining the similarity regression model only for existing Ki j values.184
To motivate the SMR model when integrating multiple microbiome datasets, we first evalu-185
ate its counterpart model within the kernel machine regression framework, assuming the overall186
microbiome kernel similarity matrix Kfull is available. For continuous outcomes, the correlated187
sequence kernel association test (CSKAT) [20], an extension of MiRKAT, can be used to test the188
association between microbiome profiles and outcomes when data from the same study are clus-189
tered. CSKAT assumes a linear mixed-effects model with study-specific random effects to link the190
outcome and microbial composition:191
Yik = X T
ikβ + f (Mik) +hk + εik , (5)192
where hk represents the study-specific random effect, which captures within-study correlation, and193
hk ∼ N(0, σ 2). The subject-level random effect is εik ∼ N(0, δ 2
k ).194
9
(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint
For dichotomous outcomes, another extension of MiRKAT, GLMM-MiRKAT [21], incorpo-195
rates a study-specific random effect:196
logit(P(Yik = 1)) = X T
ikβ + f (Mik) +hk . (6)197
Putting samples from all studies together in vector format, we have h ∼ N(0, σ 2Σh), Σh =198
diag{1n11T
n1, · · ·,1nK1T
nK }, ϵ ∼ N(0,Σϵ), Σϵ = diag{δ 2
1 In1, · · ·, δ 2
S InK }.199
If a full kernel matrix across all studies is available, the kernel machine regression model in200
CSKAT and GLMM-MiRKAT is equivalent to the (generalized) linear mixed model with f ∼201
N(0, τKfull). Hypothesis testing for H0 : τ = 0 can be conducted as described in the literature.202
However, because only the diagonal blocks Kk of the overall kernel matrix are available, CSKAT203
and GLMM-MiRKAT cannot be applied. However, we can still calculate the trait similarity Zk204
and the variance-covariance matrix Vk = Var(Zk|Xk, Mk) (please refer to section A of the Sup-205
plementary Materials).206
When only K is available, we propose the following mixed effect similarity matrix regression207
model (SMRmix):208
Zi j,k = a + bKi j,k + uk + fi + ei j,k, i < j, (7)209
where, for subjects i, j in study k, the outcome similarity Zi j,k is linked to microbiome similarity210
Ki j,k. Here, ei j,k is a random error that is allowed to differ across studies, uk is a study-specific211
random effect, fi is a random effect that captures the additional correlation between two pairs of212
trait similarities if the two pairs share a common sample. For example, fi captures the additional213
correlation between Zi j,k and Zi j′,k compared to Zi j,k and Zi′ j′,k, with i ̸= i′ and j ̸= j′. Compared214
to the original SMR model, which is equivalent to a kernel machine regression model, our model215
is based on a mixed effect kernel machine regression model, thus the name SMRmix. All random216
effects have mean 0 and constant variance. The mathematical relationship between the random217
effects in SMRmix and the random effects in CSKAT and GLMM-MiRKAT is specified in Section218
B of Supplementary Materials.219
10
(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint
To test whether there is a microbiome effect, we test H0 : b = 0. We propose the following220
score statistic:221
U =
K
∑
k=1
(Kvec
k )T (Vk)−1(Zvec
k − E(Zvec
k |Xk, Mk)), (8)222
where Kvec
k = ( K(1,2),k, · · ·, K(1,nk),k, K(2,3),k, · · ·, K(2,nk),k, · · ·, K(nk−1,nk),k)T is a vector obtained223
by stacking all the available within-study pairwise kernels of the k-th study. Similarly, Zvec
k =224
(Z(1,2),k, · · ·, Z(1,nk),k, Z(2,3),k, · · ·, Z(2,nk),k, · · ·, Z(nk−1,nk),k)T , Vk = Var(Zvec
k ) represents the corre-225
sponding outcome similarities and variances calculated under the null.226
This test aggregates the score statistics from different studies, requiring only microbial similar-227
ities within studies, whereas CSKAT requires microbial pairwise similarities of all samples.228
2.3 Perturbation approach for obtaining p-value229
In the original SMR approach, the test statistic can be expressed in a quadratic form and230
approximated by a weighted chi-square distribution. However, for SMRmix, due to the com-231
plexity of the variance matrix V vec
k and the vectorization involved, the asymptotic distribution232
of U cannot be easily approximated. Instead, we propose a perturbation approach for obtain-233
ing the null distribution of U and thereby calculating the p-value. We note that, each element in234
Zk = (Yk − µ0
k)(Yk − µ0
k)T is obtained from the outer product ofYk − µ0
k, which, after some trans-235
formations, asymptotically follows a multivariate normal distribution with mean 0 and a covariance236
matrix that can be estimated under the null hypothesis. The idea behind the perturbation approach237
is to generate a large number of ˜Zvec
k values that asymptotically have the same distribution asZvec
k .238
By doing so, we generate a large number ofU ∗ that have the same distribution asU under the null.239
Therefore, we can compare the observed test statistic with U ∗ to calculate the p-value.240
Perturbation procedures, sometimes referred to as resampling or Monte Carlo approaches, have241
been used by many researchers in various contexts [27, 28, 29, 30]. Our approach is related to these242
methods, but differs substantially in that we need to account for the additional correlations that arise243
from the clustered nature of our data. Specifically, let the estimated covariance matrix of Yk − µ0
k244
11
(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint
from the mixed-effect model be Σk. For continuous outcomes, Σk is a symmetric matrix with245
diagonal elements equal to ˆσ 2 + ˆδk
2
and off-diagonal elements equal to ˆσ 2. For binary outcomes,246
i−th diagonal elements is ˆµ0
ik(1− ˆµ0
ik), and element(i, j), i ̸= j, in Σk is ˆµ0
ik(1− ˆµ0
ik) ˆµ0
jk(1− ˆµ0
ik) ˆσ 2.247
Let Pk = Σk − Xk(X T
k Σ−1Xk)−1X T
k , then we get P
1
2
k = Σ
1
2
k − Xk(X T
k Σ−1
k Xk)−1X T
k Σ
− 1
2
k .248
It can be shown that, under the null hypothesis,249
P
− 1
2
k (Yk − µ0
k) ∼ MV N(0, IN), (9)250
where IN is the identity matrix of size N.251
To generate the null distribution of the test statistic, we first generate random samples ˜Rk252
from a k-dimensional normal distribution with mean 0 and identity covariance matrix (i.e., the253
distribution of P
− 1
2
k (Yk − µ0
k)). Since P
1
2
k ˜Rk has the same distribution as Yk − µ0
k, we can replace254
Zvec
k in equation (8) with a vectorized version of the outer product of P
1
2
k ˜Rk, denoted as ˜Zvec
k .255
This process is repeated multiple times to construct the null distribution of the test statistic. The256
resulting distribution is then used to conduct hypothesis testing and calculate p-values.257
2.4 Omnibus test258
Previous research has shown that the power of statistical methods depends on whether the259
selected kernel effectively captures the true association signal [19]. For example, the weighted260
UniFrac and Bray-Curtis kernels utilize relative abundance information to construct pairwise sim-261
ilarities, giving higher weight to common taxa. Consequently, these kernels are more powerful262
when the true association involves common taxa. Conversely, the unweighted UniFrac and Jaccard263
kernels only use information on whether a taxon is present in a sample, giving higher weight to264
rare taxa, and are therefore more powerful when the true association involves rare taxa. In practice,265
the optimal kernel choice is unknown prior to analysis.266
To address this uncertainty, we propose an omnibus test that extends SMRmix to simulta-267
neously consider multiple kernels for each study. The omnibus-SMRmix is expected to be robust268
12
(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint
across various association patterns, losing little power compared to the best individual kernel while269
achieving substantially greater power than a poorly chosen kernel.270
Let Kl, l = (1, · · ·, L) represent the diagonal block kernels, where the kernel for each study is271
of the same type, such as weighted UniFrac. Suppose there are a total of L such groups of kernels.272
For each Kl, we conduct SMRmix and apply the perturbation approach as previously described273
to obtain a p-value. These p-values are then combined using the harmonic mean combination274
(HMP) [31]. In our simulations, the HMP combination demonstrated superior type I error control275
compared to other combination methods, such as the Cauchy combination [32], Simes combination276
[33] and the Tippett combination [34]. Note that even though we have L sets of SMRmix models,277
we only need to generate a single set of B multivariate normal distributions for perturbation across278
all sets of kernels. This drastically reduces computational time.279
2.5 Simulation Studies280
To assess the performance of SMRmix, we conducted extensive simulations focusing on both281
type I error and statistical power, comparing SMRmix against CSKAT and GLMM-MiRKAT. For282
power evaluation, we additionally compared SMRmix to MiRKAT-meta, which employs the HMP283
combination method to integrate p-values from individual MiRKAT analyses into a single p-value.284
We performed two sets of simulations.285
2.5.1 Simulation A: Microbiome data processed via the same pipeline with a common phy-286
logenetic tree287
In Simulation A, we generated microbiome data using a Dirichlet-Multinomial distribution,288
with hyperparameters estimated from real data. This simulation framework has been used in previ-289
ous studies [19, 20, 35]. To account for differences in data collection across studies, we introduced290
a multiplicative and differential bias to capture inter-study heterogeneity. This approach ensured291
that the microbiome data from all studies shared the same tree, allowing the computation of a com-292
mon kernel matrix that methods like CSKAT or GLMM-MiRKAT can be applied. However, with293
13
(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint
differential bias across studies, the kernel components across different studies were not directly294
comparable, potentially impacting the validity of results from CSKAT or GLMM-MiRKAT. This295
setup resembles scenarios where raw sequencing data are available and can be processed through296
a unified bioinformatic pipeline.297
Panel A in Figure 1 illustrates the general workflow of this simulation setup. Specifically, we298
first generated a large OTU table via a Dirichlet-Multinomial distribution with hyperparameters299
estimated from a real throat microbiome dataset [36], which contained 60 samples with 856 OTUs.300
These generated microbiome data were treated as the “true” microbiome abundance, from which301
we simulated the outcome variable. The association effect was set to either the same or different302
across studies depending on the different simulation scenarios. The total number of counts for each303
sample was set to 106, representing the total number of 16s rRNA molecules in the community. We304
denote this “true” OTU table as M ∗, which is unobserved. From this community, we conducted305
sub-sampling to generate the observed microbiome OTU table (denoted as M), which was used306
for association testing.307
After simulating the true OTU table, we partitioned the 856 taxa into 20 clusters based on308
cophenetic distances calculated from the phylogenetic tree. Using the simulated microbial com-309
munity, for a continuous trait, we generated the outcomes via the following model:310
Yik = 0.5 · (1 + Xik,1 + Xik,2) + fk · scale( ∑
l∈A
M∗
ik,l) +hk + εik , (10)311
where, i = 1, · · ·, nk, k = 1, · · ·, K. The variables X·,1 and X·,2 are covariates, withX·,1 ∼ Bernoulli( 0.5)312
and X·,2 ∼ N(0,1). Here, M∗
ik,l is the unobserved “true” count of l-th taxon for individual i in k-th313
study, l ∈ A indicates whether taxon l belongs to a specific taxon cluster A . The function scale(·)314
standardizes vector ∑l∈A {M∗
ik,l}i to have zero mean and unit variance. Since the total number of315
counts is set at 10 6 for all samples, this is analogous to using the true relative abundances. The316
parameter fk represents the effect size for k-th study. Between-study variation and within-study317
variation are described by hk ∼ N(0, σ 2), εik ∼ N(0, δ 2
k ), respectively. We set σ 2 = 1 and the318
14
(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint
Figure 1: Workflow for simulation studies. Panel A: simulation scenario A, where microbiome data
are processed from the same pipeline with a common tree structure. Panel B: simulation scenario
B, where microbiome data are processed from different pipelines with different tree structures.
same δ 2
k = 1 for all k, or different combinations of δ 2
k using δ 2
k = 1, for k = 1,2,3, δ 2
k = 3, for319
k = 4,5,6,7, δ 2
k = 5, for k = 8,9,10 to evaluate type I error. We set nk = 50, K = 10 so that we320
have a total of 10 studies with each study having 50 study subjects.321
For binary outcome, we generated outcomes by322
logit[P(Yik = 1)] = 0.5 · (1 + Xik,1 + Xik,2) + fk · scale( ∑
l∈A
M∗
ik,l) +hk, (11)323
in which X·,1, X·,2, M∗
ik,l and hk are defined the same way as before.324
For type I error evaluation, we simulated data with fk = 0 for all k, indicating no taxon was325
15
(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint
associated with the outcome in any study. For power evaluation, we considered two scenarios:326
Scenario A1 the microbiome effect is the same across all studies.327
Scenario A2 the microbiome effects differs across studies.328
For scenario A1, we set fk = c for all k and A was the 16th cluster, which was a common cluster329
representing 19.6% of the total true OTU reads. The constant c was varied to assess the power. In330
scenario A2 , we set fk = c for k = 1, · · ·,5 and fk = −c for k = 6, · · ·,10. Other clusters had no331
association with the outcome Y .332
This outcome generation process linked the response variable to the “true” microbiome com-333
munity, which is unobserved in real settings. We then generated the observed OTU table (M ) by334
sub-sampling from the “true” OTU table, simulating the sequencing process as a random sampling335
(with or without bias) from the 16S rRNA molecules (a total of 10 6 per sample). Two methods336
of random sub-sampling were considered. The first method is unbiased sampling. In this method,337
each molecule had an equal probability of being sampled, representing an unbiased sequencing338
process. We fixed the library size for each sample at 1,000 reads and generated observed OTU339
counts from a multinomial distribution with probabilities M∗
ik,l × 10−6 and total read counts of340
1,000. In this case, the observed OTU counts were a faithful representation of the underlying341
community with only random sampling errors. The second method is biased sampling, where we342
introduced bias into the sequencing process. Some taxa were observed with higher or lower relative343
abundances compared to the true community. First, we generated the observed OTU table as in the344
unbiased sampling method. Then, for the first five studies, we inflated the read counts of the most345
abundant taxa by 1.1-fold. These “common” taxa were defined as the top 10% with the highest rel-346
ative abundances in the true community. Consequently, the common taxa were over-represented,347
while other taxa had lower relative abundances, introducing bias into the observed data. Due to348
this bias, the beta-diversities between samples from different studies were not reliable.349
All hypothesis testings and model evaluations were conducted using these “observed” datasets,350
as they reflect real-world data scenarios. We applied SMRmix (including both single-kernel and351
omnibus versions), CKSAT (for continuous outcomes), and GLMM-MiRKAT (for binary out-352
16
(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint
comes). MiRKAT-meta was also included in power evaluation. Each evaluation involved 10,000353
replicates for type I error assessment and 5,000 replicates for power evaluation. We tested the null354
hypothesis of no association between the microbiome profiles and outcomes. For the tests with a355
single kernel, we used three kernels: weighted UniFrac( Kw), unweighted UniFrac(Ku) and Bray-356
Curtis (Kbc). The UniFrac kernels utilize phylogenetic information while the Bray-Curtis does357
not. For SMRmix-Omnibus test, we synthesized test results from five kernelsm including weighted358
UniFrac, unweighted UniFrac, Bray-Curtis, and generalized UniFrac with weight parameter of 0359
and 0.5.360
2.5.2 Simulation B: Microbiome data processed through different pipelines with different361
tree structures362
In simulations (B), we reprocessed the raw sequencing (FASTQ) files from a real microbiome363
study through different bioinformatic pipelines, resulting in multiple OTU tables with distinct tree364
structures. For each dataset processed through a distinct bioinformatic pipelines, we estimated365
hyperparameters of the Dirichlet-Multinomial distribution and subsequently simulated additional366
microbiome data with its corresponding tree structure. This set of simulations reflects scenar-367
ios where the collected microbiome datasets are pre-processed through different bioinformatic368
pipelines, but access to the raw sequencing data is unavailable, a common situation when working369
with public datasets. In such scenarios, generating a single kernel matrix is not feasible, and ap-370
proaches that require a common kernel matrix, such as CSKAT and GLMM-MiRKAT, cannot be371
applied.372
Panel B in Figure 1 summarizes the general workflow of this simulation. The raw sequencing373
files were from a real gut microbiome study [10], which contains stool microbiome data from374
82 HIV-positive and 40 HIV-negative individuals. The samples underwent PCR amplification of375
the V4 region and were sequenced on the Illumina MiSeq platform. We downloaded the paired-376
end sequencing data in FASTQ format from Synapse (ID: syn22273201), and reprocess it through377
different pipelines to generate a total of nine dataset, each with distinct phylogenetic tree structures378
17
(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint
and OTU definitions.379
Among the nine datasets, six of them were processed through the QIIME2 pipeline [37]. Within380
QIIME2, paired-end sequencing data were joined and filtered, and two denoising techniques-381
DADA2 [38] and Deblur [39]-were used along with three phylogeny construction approaches-382
FastTree [40], RAxML [41] and IQ-TREE [42]. This combination resulted in six (2 × 3) distinct383
datasets. The remaining three datasets were generated using the original QIIME pipeline [43], em-384
ploying three different OTU-picking strategies: closed-reference OTU picking using SILV A (Re-385
lease 132) [44, 45], closed-reference OTU picking using Greengenes (Version 13 8) [46], and De386
novo OTU picking. Due to differences in the phylogenetic trees and taxa definitions across these387
processing steps, a common kernel matrix was not available in this simulation. This represents388
a scenario in which multiple public datasets are integrated without access to the raw sequencing389
data.390
After generating the nine datasets, we partitioned the taxa in each dataset into 20 clusters. We391
then simulated new count tables based on these datasets using Dirichlet-Multinomial distribution392
and the hyperparameters estimated from the nine count tables. The outcomes were generated using393
equation (10) for continuous traits and equation (11) for binary traits. Similar to scenario A, we394
evaluate type I error and power using σ 2 = 1 and either the same δ 2
k = 1 for all k, or different395
combinations of δ 2
k where δ 2
k = 1 for k = 1,2,3, δ 2
k = 3 for k = 4,5,6,7, δ 2
k = 5 for k = 8,9,10. To396
assess power, we assumed that the microbiome effect was consistent across the nine tables, because397
they originated from the same microbiome community. We specified fk = c for k ∈ {1, · · ·,9} and398
selected the 16th cluster in each dataset as A .399
Unlike Simulation A, we did not introduce a process to simulate bias in this scenario, as there400
was no “true” community available for comparison. We applied the proposed SMRmix and meta-401
analysis procedures to the simulated data, noting that CSKAT and GLMM-MiRKAT could not be402
used due to the lack of a common kernel matrix.403
18
(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint
3 Results404
3.1 Simulation results405
Table 1 summarizes the type I error rates for continuous and binary outcomes across various406
scenarios in simulation setups A and B. SMRmix controls type I error rates effectively across all407
scenarios. In simulation setup A, CSKAT and GLMM-MiRKAT also showed valid type I error408
rates, regardless of sequencing bias. For binary outcomes, SMRmix was slightly conservative.409
Table 1: Empirical type I error rates comparing CSKAT, GLMM-MiRKAT, and SMRmix at differ-
ent α cutoffs (0.05 and 0.01) in simulations. For continuous outcome with the sameδ 2
k , δ 2
k = 1, ∀k;
for continuous outcome with different δ 2
k , δ 2
k = 1 for k = 1,2,3, δ 2
k = 3 for k = 4,5,6,7, δ 2
k = 5
for k = 8,9,10. For binary scenarios, σ 2 = 1.
α = 0.05 α = 0.01
Methods
Kw Ku Kbc Omnib
us Kw Ku Kbc Omnibus
Simulation
A, Continuous Y , No Bias
Same δ 2
k
CSKAT 0.0528 0.0488 0.0502 / 0.0106 0.0081 0.0087 /
SMRmix 0.0508 0.0522 0.0509 0.0494 0.0110 0.0107 0.0110 0.0111
Different δ 2
k
CSKAT 0.0475 0.0481 0.0482 / 0.0100 0.0098 0.0092 /
SMRmix 0.0459 0.0525 0.0492 0.0481 0.0101 0.0094 0.0098 0.0105
Simulation
A, Continuous Y , With Bias
Same δ 2
k
CSKAT 0.0494 0.0477 0.0498 / 0.0082 0.0094 0.0087 /
SMRmix 0.0526 0.0512 0.0504 0.0522 0.0106 0.0099 0.0113 0.0111
Different δ 2
k
CSKAT 0.0552 0.0493 0.0528 / 0.0112 0.0094 0.0098 /
SMRmix 0.0513 0.0506 0.0514 0.0513 0.0104 0.0105 0.0107 0.0112
Simulation
A, Binary Y , No Bias
GLMM-MiRKAT 0.0513 0.0522 0.0500 / 0.0097 0.0105 0.0096 /
SMRmix 0.0403 0.0444 0.0443 0.0377 0.0083 0.0077 0.0071 0.0057
Simulation
A, Binary Y , With Bias
GLMM-MiRKAT 0.0474 0.0511 0.0485 / 0.0075 0.0107 0.0091 /
SMRmix 0.0403 0.0387 0.0397 0.0388 0.0047 0.0056 0.0047 0.0051
Simulation
B, Continuous Y
SMRmix 0.0469 0.0583 0.0504 0.0493 0.0083 0.0106 0.0101 0.0107
Simulation B, Binary Y
SMRmix 0.0471 0.0438 0.0393 0.0452 0.0084 0.0064 0.0060 0.0072
*Note: CSKAT is not applicable for studies with binary outcomes, and only SMRmix is applicable for
Simulation B.
Figure 2 presents the power results for simulation setups A and B without sequencing bias. In410
simulation A1 with both continuous and binary outcomes, (left panels in Figure 2) when all stud-411
ies share the same microbiome effect and there is no sequencing bias, CSKAT/GLMM-MiRKAT412
achieves higher power than SMRmix when comparing methods using the same type of kernel413
19
(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint
(e.g., Bray-Curtis kernel). This is expected, as in this scenario, microbiome data across studies are414
comparable and CSKAT/GLMM-MiRKAT leverage the similarities between samples from differ-415
ent studies, offering more efficient use of the data. In contrast, the meta-analysis approach that416
combines p-values from multiple MiRKAT models exhibits substantial power loss.417
When comparing different kernels within the same method, both CSKAT and SMRmix demon-418
strate that the weighted UniFrac is the most powerful, followed by Bray-Curtis, and the unweighted419
UniFrac is the least powerful. This is because the microbiome effect in this scenario targets a420
phylogenetically-informed cluster of common taxa. A similar trend is observed even when se-421
quencing bias is introduced (left panels in Figure S1 in Supplementary Materials), likely due to the422
limited magnitude of bias introduced in this simulation.423
In scenarios A2, where studies have distinct microbiome effects (middle panels in Figure 2)424
and no sequencing bias, CSKAT/GLMM-MiRKAT, regardless of the kernel choice, show minimal425
power. This is because CSKAT/GLMM-MiRKAT implicitly assume that the microbiome-outcome426
relationship is consistent across studies via the same kernel function, which is violated in this sce-427
nario. In contrast, SMRmix demonstrates higher power. While the power of meta-analysis is428
substantially lower than SMRmix, it still performs better than CSKAT/GLMM-MiRKAT in this429
scenario. When comparing different kernels, the results are similar to those observed in Scenario430
A1. Notably, SMRmix maintains its power effectively between Simulations A1 and A2, demon-431
strating its robustness in scenarios with differing effect sizes across studies. A similar conclusion432
is drawn when sequencing bias is introduced (right panels of Figure S1 in the Supplementary Ma-433
terials).434
In simulation setup B, SMRmix exhibits much higher power than the meta-analysis procedure435
(right panels in Figure 2). CSKAT and GLMM-MiRKAT are not applicable in this simulation.436
The power results for continuous outcomes with varying variances of study-specific random437
effects (δ2
k ) are displayed in Figure S2 of the Supplementary Materials. The conclusions of perfor-438
mance under scenarios A1 and A2 remain consistent with previous findings. Notably, in simulation439
scenario A2, the meta analysis suffers a substantial power loss, while SMRmix stands out as the440
20
(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint
Figure 2: Statistical power for simulations A and B when there is no bias in sequencing sampling.
For continuous scenarios (upper three panels), δ 2
k = 1, ∀k, and σ 2 = 1.
only method capable of effectively detecting associations.441
3.2 Application to HIV studies442
We utilized SMRmix to analyze data from the HIV Microbiome Re-analysis Consortium [47],443
which re-analyzed α−diversity using raw 16S rRNA sequencing data from 17 studies. These444
studies aimed to characterize the gut dysbiosis in HIV+ individuals. The microbiome amplicon445
sequencing data and metadata of these studies are available on Synapse (ID: syn18406854) and446
NCBI. The 17 studies differ in their study design, institution of participants recruitment, study447
population, type of samples and various factors in the sequencing protocol including the 16s rRNA448
variable regions that were sequenced and the sequencing platform. The consortium reprocessed449
the raw sequencing data using standardized protocols and conducted taxonomic assignments in450
species level through Resphera Insight.451
Our goal was to evaluate the association between microbiome composition and HIV status,452
21
(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint
MSM status by integrating all these studies and draw a consistent conclusion. Out of the 17453
datasets, 14 studies(763 samples and 666 taxa) have subject-level information on MSM status,454
age, and HIV status, which we used for further analysis. Because a common phylogenetic tree is455
not available from Synapse webpage, we created a phylogentic tree using the “tax tree” function456
in R package “LTN” based on the taxonmy relationship [48]. The Principal Coordinates Anal-457
ysis (PCoA) plots, which visualize the distance matrix between study subjects, are presented in458
Supplementary Figure S3.459
We conducted the association tests utilizing single-kernel SMRmix based on kernels including
weighted UniFrac(K w), unweighted UniFrac(K u), Jaccard(K jac), Bray-Curtis(K bc) and omnibus
SMRmix using the above kernels. Suggested by previous studies [3], we adjusted HIV status
when the outcome variable is MSM, and adjusted MSM when the outcome variable is HIV status.
The statistical model was established as
logit(P(HIVik = 1)) = β1ageik + β2MSMik + f (Mik) +hk,
logit(P(MSMik = 1)) = β1ageik + β2HIVik + f (Mik) +hk.
We also conducted an analysis in which each individual sample was analyzed separately, and HMP460
combination was used to obtain the final p-values, similar to the approach used in the simulation461
studies. The p-values are reported in Table 2. MiRKAT-meta yielded a p-value of 0.0623 for the462
outcome of HIV , while SMRmix reported a p-value< 0.0001, in line with our simulation results.463
Table 2: P-values of using SMRmix to test associations between microbiome composition and
outcome variables HIV Status and MSM adjusting for age.
Outcome SMRmix MiRKAT-meta
Kw Ku K jac Kbc Omnibus Omnibus
HIV data HIV < 0.0001 < 0.0001 < 0.0001 < 0.0001 < 0.0001 0.0623
MSM < 0.0001 < 0.0001 < 0.0001 < 0.0001 < 0.0001 < 0.0001
CRC data CRC < 0.0001 < 0.0001 < 0.0001 < 0.0001 < 0.0001 < 0.0001
For the outcome HIV status, SMRmix with weighted UniFrac, UniFrac, Jaccard, and Bray-464
22
(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint
Curtis kernels all discovered significant signals (p-values< 0.0001). Synthesizing the results based465
on above kernels, omnibus SMRmix found HIV status and microbiome composition associated (p-466
value < 0.0001). For the outcome MSM, similar conclusions were obtained that SMRmix with all467
kernels show significant connection between MSM and microbiome composition.468
3.3 Application to colorectal cancer data469
We applied the proposed method to eleven metagenomic datasets containing species-level OTU470
data for fecal microbiomes of individuals, consisting of 354 colorectal cancer patients and 393471
healthy controls. These datasets were obtained from the curatedMetagenomicData package in472
Bioconductor, which provides standardized microbial profiling using MetaPhlAn [49]. These stud-473
ies exhibit great heterogeneity in sequencing platforms (Illumina HiSeq, Illumina NextSeq), DNA474
extraction kits (Gnome, MoBio, Qiagen), and other technical factors. The sample sizes varied from475
9 to 279 across the studies (Supplementary Table S1). PCoA plots are provided in Supplementary476
Figure S4. Samples from the same study tend to cluster together, with noticeable shifts across477
clusters, suggesting study-specific variation.478
The objective was to evaluate differences in microbiome composition between colorectal can-479
cer patients and healthy controls. Patient demographics are summarized in Supplementary Table480
S1. To examine variations in patient demographics within studies, we fit a generalized linear mixed481
model (GLMM), using disease status as the outcome and age, gender, and BMI as control variables.482
Significant variations were found in patient demographics within studies, including age (GLMM483
p-value < 0.0001), gender (GLMM p-value = 0.0087), and BMI (GLMM p-value = 0.0297). To484
account for these variations, we adjusted for these factors when applying SMRmix to integrate485
information from these studies. Using SMRmix, we found a significant difference in microbiome486
composition between colorectal cancer patients and healthy controls based on weighted UniFrac,487
unweighted UniFrac, Jaccard, and Bray-Curtis kernels (See Table 2), and MiRKAT-meta yielded488
similar results. These findings are consistent with previous research, which also suggests a distinct489
microbiome profile associated with colorectal cancer [50].490
23
(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint
4 Discussion491
In this paper, we present a novel approach, SMRmix for testing the association between mi-492
crobiome composition and outcomes of interest, by combining the concepts of similarity matrix493
regression and kernel machine regression. This method is particularly beneficial in the context of494
integrative analysis of microbiome data from multiple studies, cohorts or protocols, where inter-495
study pairwise distances are either absent or unreliable. SMRmix effectively addresses the limi-496
tations of existing kernel regression tests by leveraging only within-study pairwise distances for497
testing. In simulations, SMRmix showed well-controlled Type I error rates. We compared SMR-498
mix to two other methods, CSKAT and GLMM-MiRKAT (which require the usually non-available499
between-study pairwise distance), for microbiome compositional analysis. Our results showed500
that if distances of all pairs of samples are available and when the microbiome effect is homoge-501
neous, SMRmix has relatively lower power. However, SMRmix exhibits superior performance in502
scenarios where the effects of taxa are a mixture of positive and negative effects, and when differ-503
ent effect sizes result in a significant reduction in power for CSKAT and GLMM-MiRKAT. This504
highlights the robustness and effectiveness of SMRmix, particularly when inter-study pairwise dis-505
tances may not be reliable or available. We also compared SMRmix with a meta-analysis procedure506
(which, like SMRmix, doesn’t require the between-study pairwise distance), and showed higher507
power of SMRmix across all simulation setups. In summary, our proposed SMRmix method rep-508
resents a novel and effective approach for microbiome compositional analysis, with the potential509
to contribute to ongoing efforts to better understand the complex interactions between microbial510
communities and host health.511
Kernel selection is a critical challenge for any kernel-based approach, as choosing an unsuitable512
kernel can severely reduce the statistical power despite controlling the type I error. In microbiome513
studies, each kernel function assumes a distinct microbiome effect in the community, and improper514
choice of kernel can compromise the analysis [51]. To address this issue, SMRmix provides an515
omnibus test that integrates multiple kernel matrices to enhance the robustness of the analysis.516
SMRmix also allows for the use of different kernels for each study, such as Bray-Curtis distances517
24
(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint
in one study and UniFrac distances in another, thereby offering greater flexibility depending on the518
data availability. Our simulations demonstrate that single-kernel tests can be sensitive to kernel519
selection. In contrast, our omnibus SMRmix approach synthesizes results from multiple kernel-520
based tests by combining p-values and only incurs a small loss of power relative to using the best521
kernel. By contrast, the omnibus test improves the power substantially compared to a poor kernel522
choice. Given that the best kernel is often unknown in practice, the omnibus test is a practical and523
useful alternative.524
In applying SMRmix to investigate the relationship between human gut microbiome com-525
position and HIV infection using data from 14 studies, we aimed to address conflicting find-526
ings in the literature. Although many studies have reported gut microbiome dysbiosis associ-527
ated with HIV infection and highlighted the potential role of “leaky gut” in HIV pathogenesis528
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], others have suggested that MSM status may confound the HIV-529
microbiome relationship [3]. Our analysis adjusted for age and MSM status to assess the associ-530
ation between microbiome composition and HIV status, and similarly adjusted for age and HIV531
infection to examine the association between microbiome composition and MSM status. By inte-532
grating information from multiple studies on this topic, we found significant associations between533
microbiome composition and HIV (adjusting for age and MSM) and between microbiome and534
MSM (adjusting for age and HIV). These findings provide compelling evidence of HIV-related535
dysbiosis that is independent of MSM status, challenging the notion that MSM is only a confound-536
ing factor in the HIV-microbiome relationship.537
Supplementary Information538
Supplementary Files539
Supplemental files can be obtained through online version of the paper at Microbiome.540
25
(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint
Authors’ contributions541
MH contributed to the development of the method, performed simulation studies and compar-542
isons, and wrote the manuscript. NZ conceived the study, primarily developed the method, and543
wrote the manuscript. All authors read and approved the final manuscript.544
Funding545
This work was funded by by NIH grants R21AI154236, R01GM147162 and U24OD023382.546
Availability of data and materials547
The implementation of SMRmix in R codes are available in the GitHub repository https:548
//github.com/mengyu-he/SMRmix.549
Ethics approval and consent to participate550
Not applicable.551
Consent for publication552
Not applicable.553
Competing interests554
The authors declare that they have no competing interests.555
Acknowledgements556
Not applicable.557
26
(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint
Author affiliations558
Department of Biostatistics and Bioinformatics, Emory University, Atlanta, 30322, GA, USA559
(Mengyu He)560
Department of Biostatistics, Johns Hopkins University, Baltimore, MD 21205, USA (Ni Zhao)561
References562
[1] Dubourg, G. et al. Gut microbiota associated with HIV infection is significantly enriched in563
bacteria tolerant to oxygen. BMJ Open Gastroenterology 3, e000080 (2016). URL https:564
//bmjopengastro.bmj.com/lookup/doi/10.1136/bmjgast-2016-000080.565
[2] Mutlu, E. A. et al. A Compositional Look at the Human Gastrointestinal Microbiome and566
Immune Activation Parameters in HIV Infected Subjects. PLoS Pathogens 10, e1003829567
(2014). URL https://dx.plos.org/10.1371/journal.ppat.1003829.568
[3] Noguera-Julian, M. et al. Gut Microbiota Linked to Sexual Preference and HIV Infec-569
tion. EBioMedicine 5, 135–146 (2016). URL https://linkinghub.elsevier.com/570
retrieve/pii/S2352396416300287.571
[4] Nowak, P. et al. Gut microbiota diversity predicts immune status in HIV-572
1 infection. AIDS 29, 2409–2418 (2015). URL https://journals.lww.com/573
00002030-201511280-00004.574
[5] Pinto-Cardoso, S. et al. Fecal Bacterial Communities in treated HIV infected individuals on575
two antiretroviral regimens. Scientific Reports 7, 43741 (2017). URL http://www.nature.576
com/articles/srep43741.577
[6] Villanueva-Mill ´an, M. J., P ´erez-Matute, P., Recio-Fern ´andez, E., Lezana Rosales, J. M.578
& Oteo, J. A. Differential effects of antiretrovirals on microbial translocation and579
gut microbiota composition of HIV-infected patients: Villanueva-Mill ´an MJ et al.580
27
(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint
Journal of the International AIDS Society 20, 21526 (2017). URL http://doi.wiley.581
com/10.7448/IAS.20.1.21526.582
[7] Yu, G., Fadrosh, D., Ma, B., Ravel, J. & Goedert, J. J. Anal microbiota profiles in HIV-583
positive and HIV-negative MSM. AIDS 28, 753–760 (2014). URL https://journals.584
lww.com/00002030-201403130-00013.585
[8] Sun, Y .et al. Fecal bacterial microbiome diversity in chronic HIV-infected patients in China.586
Emerging Microbes & Infections 5, 1–7 (2016). URL https://www.tandfonline.com/587
doi/full/10.1038/emi.2016.25.588
[9] V ´azquez-Castellanos, J. F. et al. Altered metabolism of gut microbiota contributes to chronic589
immune activation in HIV-infected individuals. Mucosal Immunology 8, 760–772 (2015).590
URL http://www.nature.com/articles/mi2014107.591
[10] Monaco, C. et al. Altered Virome and Bacterial Microbiome in Human Immunode-592
ficiency Virus-Associated Acquired Immunodeficiency Syndrome. Cell Host & Microbe593
19, 311–322 (2016). URL https://linkinghub.elsevier.com/retrieve/pii/594
S193131281630052X.595
[11] Vesterbacka, J. et al. Richer gut microbiota with distinct metabolic profile in HIV in-596
fected Elite Controllers. Scientific Reports 7, 6269 (2017). URL http://www.nature.597
com/articles/s41598-017-06675-1 .598
[12] V olpe, G. E.et al. Associations of Cocaine Use and HIV Infection With the Intestinal Micro-599
biota, Microbial Translocation, and Inflammation. Journal of Studies on Alcohol and Drugs600
75, 347–357 (2014). URL http://www.jsad.com/doi/10.15288/jsad.2014.75.347.601
[13] Ling, Z. et al. Alterations in the Fecal Microbiota of Patients with HIV-1 Infection: An602
Observational Study in A Chinese Population. Scientific Reports 6, 30673 (2016). URL603
http://www.nature.com/articles/srep30673.604
28
(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint
[14] Dinh, D. M. et al. Intestinal Microbiota, Microbial Translocation, and Systemic Inflammation605
in Chronic HIV Infection. Journal of Infectious Diseases 211, 19–27 (2015). URL https:606
//academic.oup.com/jid/article-lookup/doi/10.1093/infdis/jiu409.607
[15] Nowak, R. G. et al. Rectal microbiota among HIV-uninfected, untreated HIV, and treated608
HIV-infected in Nigeria. AIDS 31, 857–862 (2017). URL https://journals.lww.com/609
00002030-201703270-00014.610
[16] Lozupone, C. et al. Alterations in the Gut Microbiota Associated with HIV-1 Infection.611
Cell Host & Microbe 14, 329–339 (2013). URL https://linkinghub.elsevier.com/612
retrieve/pii/S1931312813002886.613
[17] McArdle, B. H. & Anderson, M. J. Fitting multivariate models to community data: a com-614
ment on distance-based redundancy analysis. Ecology 82, 290–297 (2001).615
[18] MetaHIT Consortium (additional members) et al. Enterotypes of the human gut microbiome.616
Nature 473, 174–180 (2011). URL http://www.nature.com/articles/nature09944.617
[19] Zhao, N. et al. Testing in microbiome-profiling studies with MiRKAT, the microbiome618
regression-based kernel association test. The American Journal of Human Genetics 96, 797–619
807 (2015). URL https://doi.org/10.1016/j.ajhg.2015.04.003.620
[20] Zhan, X. et al. A small-sample kernel association test for correlated data with application to621
microbiome association studies. Genetic Epidemiology 42, 772–782 (2018). URL https:622
//onlinelibrary.wiley.com/doi/abs/10.1002/gepi.22160.623
[21] Koh, H., Li, Y ., Zhan, X., Chen, J. & Zhao, N. A Distance-Based Kernel Association624
Test Based on the Generalized Linear Mixed Model for Correlated Microbiome Studies.625
Frontiers in Genetics 10, 458 (2019). URL https://www.frontiersin.org/article/626
10.3389/fgene.2019.00458/full.627
29
(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint
[22] Kwee, L. C., Liu, D., Lin, X., Ghosh, D. & Epstein, M. P. A Powerful and Flexible Mul-628
tilocus Association Test for Quantitative Traits. The American Journal of Human Genetics629
82, 386–397 (2008). URL https://linkinghub.elsevier.com/retrieve/pii/630
S0002929708000888.631
[23] Wu, M. C. et al. Powerful SNP-Set Analysis for Case-Control Genome-wide Association632
Studies. The American Journal of Human Genetics 86, 929–942 (2010). URL https://633
linkinghub.elsevier.com/retrieve/pii/S000292971000248X.634
[24] Wu, M. et al. Rare-Variant Association Testing for Sequencing Data with the Sequence635
Kernel Association Test. The American Journal of Human Genetics 89, 82–93 (2011). URL636
https://linkinghub.elsevier.com/retrieve/pii/S0002929711002229.637
[25] Tzeng, J.-Y ., Zhang, D., Chang, S.-M., Thomas, D. C. & Davidian, M. Gene-Trait Similarity638
Regression for Multimarker-Based Association Analysis. Biometrics 65, 822–832 (2009).639
URL http://doi.wiley.com/10.1111/j.1541-0420.2008.01176.x.640
[26] Zhan, X. Relationship Between MiRKAT and Coefficient of Determination in Similarity641
Matrix Regression. Processes 7, 79 (2019). URL http://www.mdpi.com/2227-9717/7/642
2/79.643
[27] Lin, D. Y . An efficient Monte Carlo approach to assessing statistical significance in ge-644
nomic studies. Bioinformatics 21, 781–787 (2005). URL https://academic.oup.com/645
bioinformatics/article-lookup/doi/10.1093/bioinformatics/bti053.646
[28] Conneely, K. N. & Boehnke, M. So Many Correlated Tests, So Little Time! Rapid Adjust-647
ment of P Values for Multiple Correlated Tests. The American Journal of Human Genetics648
81, 1158–1168 (2007). URL https://linkinghub.elsevier.com/retrieve/pii/649
S0002929707637665.650
[29] Chapman, J. & Whittaker, J. Analysis of multiple SNPs in a candidate gene or re-651
30
(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint
gion. Genetic Epidemiology 32, 560–566 (2008). URL http://doi.wiley.com/10.1002/652
gepi.20330.653
[30] Pan, W., Han, F. & Shen, X. Test Selection with Application to Detecting Disease Association654
with Multiple SNPs. Human Heredity 69, 120–130 (2010). URL https://www.karger.655
com/Article/FullText/264449.656
[31] Wilson, D. J. The harmonic mean p -value for combining dependent tests.657
Proceedings of the National Academy of Sciences 116, 1195–1200 (2019). URL http:658
//www.pnas.org/lookup/doi/10.1073/pnas.1814092116.659
[32] Liu, Y . & Xie, J. Cauchy combination test: A powerful test with analytic p-value calculation660
under arbitrary dependency structures. Journal of the American Statistical Association 115661
(2020).662
[33] Simes, R. J. An improved bonferroni procedure for multiple tests of significance. Biometrika663
73 (1986).664
[34] Tippett, L. H. C. The methods of statistics V ol. 2d ed., re (Williams665
Norgate ltd, London, 1931).666
[35] Chen, J. & Li, H. Kernel methods for regression analysis of microbiome compositional data,667
V ol. 55 (2013).668
[36] Charlson, E. S. et al. Disordered Microbial Communities in the Upper Respiratory Tract of669
Cigarette Smokers. PLoS ONE 5, e15216 (2010). URL https://dx.plos.org/10.1371/670
journal.pone.0015216.671
[37] Bolyen, E. et al. Reproducible, interactive, scalable and extensible microbiome data science672
using QIIME 2. Nature Biotechnology 37, 852–857 (2019). URL http://www.nature.673
com/articles/s41587-019-0209-9.674
31
(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint
[38] Callahan, B. J. et al. Dada2: High-resolution sample inference from illumina amplicon data.675
Nature Methods 13 (2016).676
[39] Amir, A. et al. Deblur rapidly resolves single-nucleotide community sequence patterns.677
mSystems 2 (2017). URL http://dx.doi.org/10.1128/mSystems.00191-16.678
[40] Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree: Computing Large Minimum Evolution679
Trees with Profiles instead of a Distance Matrix.Molecular Biology and Evolution 26, 1641–680
1650 (2009). URL https://academic.oup.com/mbe/article-lookup/doi/10.1093/681
molbev/msp077.682
[41] Stamatakis, A. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses683
with thousands of taxa and mixed models. Bioinformatics 22, 2688–2690 (2006).684
URL https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/685
bioinformatics/btl446.686
[42] Nguyen, L.-T., Schmidt, H. A., von Haeseler, A. & Minh, B. Q. IQ-TREE: A Fast687
and Effective Stochastic Algorithm for Estimating Maximum-Likelihood Phylogenies.688
Molecular Biology and Evolution 32, 268–274 (2015). URLhttps://academic.oup.com/689
mbe/article-lookup/doi/10.1093/molbev/msu300.690
[43] Caporaso, J. G. et al. QIIME allows analysis of high-throughput community sequencing data.691
Nature Methods 7, 335–336 (2010). URL http://www.nature.com/articles/nmeth.f.692
303.693
[44] Quast, C. et al. The SILV A ribosomal RNA gene database project: im-694
proved data processing and web-based tools. Nucleic Acids Research 41, D590–695
D596 (2012). URL http://academic.oup.com/nar/article/41/D1/D590/1069277/696
The-SILVA-ribosomal-RNA-gene-database-project .697
[45] Yilmaz, P. et al. The SILV A and “all-species living tree project (LTP)” taxonomic frame-698
32
(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint
works. Nucleic Acids Research 42, D643–D648 (2013). URLhttps://doi.org/10.1093/699
nar/gkt1209.700
[46] McDonald, D. et al. An improved Greengenes taxonomy with explicit ranks for ecological701
and evolutionary analyses of bacteria and archaea. The ISME Journal 6, 610–618 (2012).702
URL http://www.nature.com/articles/ismej2011139.703
[47] Tuddenham, S. A. et al. The impact of human immunodeficiency virus infection on gut704
microbiota α-diversity: An individual-level meta-analysis. Clinical Infectious Diseases 70705
(2020).706
[48] Wang, Z., Mao, J. & Ma, L. Microbiome compositional analysis with logistic-tree normal707
models. arXiv preprint arXiv:2106.15051 (2021).708
[49] Pasolli, E. et al. Accessible, curated metagenomic data through ExperimentHub.709
Nat. Methods 14, 1023–1024 (2017).710
[50] Ahn, J. et al. Human gut microbiome and risk for colorectal cancer.711
JNCI: Journal of the National Cancer Institute 105, 1907–1911 (2013). URL712
http://dx.doi.org/10.1093/jnci/djt300.713
[51] Plantinga, A. et al. Mirkat-s: a community-level test of association between the microbiota714
and survival times. Microbiome 5, 17 (2017).715
33
(which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint