A Mixed Effect Similarity Matrix Regression Model (SMRmix) for Integrating Multiple Microbiome Datasets at Community Level

doi:10.1101/2024.03.10.584315

A Mixed Effect Similarity Matrix Regression Model (SMRmix) for Integrating Multiple Microbiome Datasets at Community Level

2024 · doi:10.1101/2024.03.10.584315

preprint OA: closed

📄 Open PDF Full text JSON View at publisher

Full text 72,486 characters · extracted from oa-pdf · 6 sections · click to expand

Background

Recent studies have highlighted the importance of human microbiota in our2 health and diseases. However, in many areas of research, individual microbiome studies often3 offer inconsistent results due to the limited sample sizes and the heterogeneity in study pop-4 ulations and experimental procedures. This inconsistency underscores the necessity for inte-5 grative analysis of multiple microbiome datasets. Despite the critical need, statistical methods6 that incorporate multiple microbiome datasets and account for the study heterogeneity are not7 available in the literature.8

Methods

In this paper, we develop a mixed effect similarity matrix regression (SMRmix)9 approach for identifying community level microbiome shifts between outcomes. SMRmix10 has a close connection with the microbiome kernel association test, one of the most popular11 approaches for such a task but is only applicable when we have a single study. SMRmix12 enables researchers to consolidate findings from diverse microbiome studies.13

Results

Via extensive simulations, we show that SMRmix has well-controlled type I error14 and higher power than some potential competitors. We applied the SMRmix to two real-world15 datasets. The first, from the HIV-reanalysis consortium, integrated data from 17 studies on16 gut dysbiosis in HIV . Our analysis confirmed consistent associations between the gut micro-17 biome and HIV infection as well as MSM (men who have sex with men) status, demonstrating18 greater power than competing methods. The second dataset involved 11 studies on the gut19 microbiome in colorectal cancer; analysis with SMRmix confirmed significant dysbiosis in20 affected individuals compared to healthy controls.21

Conclusion

The development of SMRmix enables the integration of multiple studies and22 effectively managing study heterogeneity, and provides a powerful tool for uncovering consis-23 tent associations between diseases and community-level microbiome data.24

Keywords

microbiome community-level association studies; integrative analysis; distance-25 based association analysis; similarity matrix regression26 2 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint 1 Introduction27 Microbiome research has rapidly expanded in recent years, revealing complex relationships28 between microbial communities and human health. However, individual studies that explore the29 relationship between the microbiome and disease frequently report inconsistent results. For exam-30 ple, studies on the gut microbiome and HIV infections have shown conflicting results. While most31 studies found a decreased microbial alpha diversity in HIV-infected individuals compared to unin-32 fected individuals [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], some studies detected no significant difference33 [12, 13, 14, 15] and even an increase of microbial diversity [16]. Research on structural differences34 in microbial communities (e.g., beta diversity) between HIV-positive and HIV-negative individuals35 has also yielded mixed results. Some studies report significant differences in the microbiome beta36 diversity of HIV-infected versus uninfected individuals [2, 14], while others suggest that these dif-37 ferences may be attributed more to variations in sexual behavior, especially MSM (men who have38 sex with men) status, rather than HIV infection itself [3]. Similar inconsistencies in microbial di-39 versity and community structure have also been observed in other diseases, highlighting a broader40 challenge in microbiome research. These inconsistencies can arise from small sample sizes, pop-41 ulation heterogeneity, varied study designs, and differences in analytical techniques, which may42 lead to bias and limit the generalizability of findings.43 To address these issues and enhance reproducibility, integrative analysis of multiple micro-44 biome datasets is essential to aggregate information across multiple studies. In this paper, we45 focus on the association test between the overall microbiome community profile, measured by46 beta-diversities, and outcomes of interest while incorporating multiple datasets. For such tasks47 within a single study, there are two popular approaches: distance-based analysis [17, 18] and the48 microbiome regression kernel association test (MiRKAT) [19]. The distance-based approach in-49 volves computing a beta-diversity distance between individuals’ microbiome profiles and testing50 for association with the outcome of interest via permutation. MiRKAT utilizes a semi-parametric51 kernel machine regression in which a kernel similarity matrix, derived from the beta-diversity ma-52 trix, is assessed for association with the outcome. Here, the kernel similarity matrix is transformed53 3 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint from of a beta-diversity matrix. Both methods require an assessment of similarity or dissimilar-54 ity(distance) between all pairs of samples.55 When extending to multiple studies, neither distance-based methods nor MiRKAT are directly56 applicable due to the lack of a complete pairwise distance matrix across all subjects and studies.57 In cases where data are obtained from public resources without access to original sequencing files,58 differences in preprocessing steps across studies make it impossible to construct consistent beta-59 diversity matrices—distinct OTU (operational taxonomic unit) assignments and phylogenetic trees60 are common. Even when re-processing data across studies is feasible, differential biases intro-61 duced through various sequencing protocols can invalidate beta diversity measurements between62 samples from different studies. Thus, in most cases, only within-study distances are available,63 while between-study distances are missing.64 Herein, we propose a novel statistical method, SMRmix, to test the association between micro-65 biome communities and outcomes of interests by aggregating information across multiple studies.66 Our method is built upon similarity matrix regression (SMR) and only requires within-study dis-67 tances. Instead of modeling the direct relationship between traits and microbiome similarity, SMR68 regresses pairwise trait similarity on pairwise microbiome similarity for association testing. Trait69 similarity is calculated as the outer product of residuals from a null regression model, and micro-70 biome similarity is calculated from the within-study beta diversity matrix. Associations are then71 detected using a score test. We extend SMR by incorporating mixed effects, vectorizing similar-72 ity matrices, and deriving the corresponding score test statistic through the likelihood. The new73 method, SMRmix, enables us to account for study-level random effects and avoid reliance on inac-74 cessible between-study distances. Our method also enables development of an “optimal” omnibus75 test, utilizing multiple distances within a single study to enhance power.76 Through extensive simulations, we demonstrate that the proposed SMRmix has well-controlled77 type I error rates, and higher or comparable power than potential competitors, such as CSKAT [20]78 and GLMM-MiRKAT [21], both of which require complete distance matrices across all studies,79 which are often inaccessible. We apply SMRmix to two datasets: (1) the HIV-reanalysis consor-80 4 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint tium dataset, comprising 17 studies on gut dysbiosis in HIV , where we observed strong associations81 between gut microbiome composition and both HIV infection and MSM (men who have sex with82 men) status, and (2) a dataset of 11 studies on gut microbiome in colorectal cancer (CRC), con-83 firming significant dysbiosis in affected individuals compared to healthy controls. These analyses84 illustrate the capability of SMRmix to detect consistent patterns of microbiome composition com-85 position across different conditions.86 2 Materials and methods87 SMRmix evaluates the association between a binary or continuous outcome and microbiome88 communities by integrating data from multiple microbiome studies. It extends similarity matrix89 regression to incorporate multiple studies, utilizing only intra-study kernel matrices, thereby elim-90 inating the need for between-study similarity assessments. In what follows, we first present the91 microbiome regression-based kernel association test for a single study and its link to Similarity92 Matrix Regression (SMR), followed by its extension to incorporate multiple studies.93 2.1 Microbiome regression-based kernel association test for a single study94 and its connection with similarity matrix regression95 When analyzing data from a single study, one popular approach is the microbiome regression-96 based kernel association test (MiRKAT) [19]. In this method, data are summarized as independent97 observations of the form (Yi, Xi, Mi) for i = 1, · · ·, n. Here, Yi represents the outcome of interest,98 which can be either continuous or binary. The covariates we want to adjust for, such as age, gender,99 and MSM status, are included in Xi = (1, xi1, · · ·, xip)T , where p is the number of covariates. The100 abundances of all taxa/OTUs in the microbial community are represented by Mi = (mi1, · · ·, miq),101 where q is the total number of taxa/OTUs.102 MiRKAT compares pairwise similarity in the outcome variable to pairwise similarity in the103 microbiome profiles, with a high correspondence suggesting an association. It employs a kernel104 5 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint machine regression framework [22, 23, 24] to relate covariates and microbiome profiles to out-105 comes. Specifically, for a continuous outcome, we assume a linear kernel machine regression106 model,107 Yi = Xiβ + f (Mi) +εi , (1)108 and for a binary outcome, we use the logistic kernel machine regression model,109 logit(P(Yi = 1)) = Xiβ + f (Mi) , (2)110 where εi is the random error with mean 0 and a common variance, f (·) captures the association111 between microbial community and the outcome, and additional covariates Xi are adjusted for112 through a linear function. Under the null hypothesis of no microbiome effect, f (M ) = 0.113 In MiRKAT, f (Mi) is assumed to belong to a reproducing kernel Hilbert space generated by114 a positive definite kernel function K(·, ·), such that f (Mi) = ∑n i′=1 αi′K(Mi, Mi′). The (i, i′)-th115 element of the kernel matrix K, defined as K(Mi, Mi′), measures the pairwise similarity between116 the microbiome profiles of the i-th and the i′-th samples. To utilize the phylogenetic/taxonomic in-117 formation contained in the microbiome sequences, the kernel similarity matrix can be constructed118 by exploiting its correspondence with the well-defined beta-diversities, which measure dissimi-119 larities between subjects through: K = − 1 2 (I − 11′ n )D2(I − 11′ n ), where D represents pairwise120 dissimilarities/distances between all subjects, calculated using beta-diversity metrics such as the121 Bray-Curtis or UniFrac distance. A variance component score test is then calculated through the122 connection between the kernel machine regression model and linear mixed models [19]: Q =123 1 2φ (Y − ˆµ0)′K(Y − ˆµ0), where ˆµ0 are the fitted values of Y under the null, and φ is the disper-124 sion parameter.125 An alternative approach to investigate the association between the outcome variable and a set of126 “-omics” markers is similarity matrix regression (SMR) [25]. In microbiome association studies,127 6 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint the SMR model can be written as128 Zi j = b × Ki j + ei j, i < j, (3)129 where Zi j, the trait similarity between subject i and j, is defined as the outer product of trait resid-130 uals, Zi j = (Yi − ˆµ0 i )(Yj − ˆµ0 j ), and ˆµ0 i is the fitted values of Yi under the null hypothesis. Ki j is the131 i, j-th element of microbial kernel similarity matrix. Intuitively, if the outcome is associated with132 microbial profiles, subjects with higher outcome similarity should also have higher microbiome133 similarity.134 Under the null, assuming that all Yi’s are independent, the score statistic for testing in SMR is135 Ub = ∑i< j Ki j(v0 i )−1(v0 j)−1Zi j = ∑i< j Ki j(v0 i )−1(v0 j)−1(Yi − ˆµ0 i )(Yj − ˆµ0 j ), where v0 i = Var(Yi|Xi, Mi)136 under the null. This statistic can be written in quadratic form as:Ub = 1 2 (Y − ˆµ0)′V −1K0V −1(Y −137 ˆµ0), where K0 is a matrix with diagonal elements equal to 0 and (i, j)-th off-diagonal elements138 equal to Ki j. Under the null, Ub approximately follows a weighted chi-squared distribution, which139 can be used to assess the statistical significance.140 Formulating SMR in vector form, we have141 Zvec = b × Kvec + evec, (4)142 where Zvec and Kvec are stacked vectors of Zi j, Ki j with all available pairwise distances (i, j).143 There is a close connection between MiRKAT and SMR [26]. Specifically, the square of the em-144 pirical sample correlation between outcome similarity and microbiome similarity in SMR, denoted145 by R2, is proportional to the score statistic in MiRKAT. Note that theR2 statistic represents the pro-146 portion of variation explained in SMR.147 Although SMR is statistically equivalent to MiRKAT, the vector form of SMR allows for in-148 tegration of multiple studies when between-study similarity matrices are not available. This is149 because we can exclude pairs with missing distances from the similarity vectors Zvec and Kvec.150 7 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint 2.2 SMRmix for integrating multiple microbiome datasets151 Suppose we have a total of K microbiome studies. The data for the i-th subject in the k-th152 study is denoted as (Yik, Xik, Mik), where i = 1, · · ·, nk, k = 1, · · ·, K. Here, nk represents the153 number of subjects in k−th study, and N = ∑K k=1 nk denotes the total number of subjects across154 K studies. We assume all studies are independent and do not share any subjects, and that all155 samples are independent within each study. A common violation of the second assumption occurs156 when multiple samples are longitudinal measurements from the same participant or are clustered157 within families in a single study. Thus, we assume that samples are clustered by studies, without158 additional clustering structures, although our method can be extended to allow for within-study159 correlations.160 When microbiome data from different studies are processed through distinct protocols, subjects161 from these studies are not directly comparable, preventing calculation of between-study beta diver-162 sities. As a result, only intra-study microbial dissimilarities are available for association testing.163 Suppose the similarity matrix for all subjects across all studies (assuming it exists) is represented164 as follows:165 Kfull =      K1 K12 · · · K1K K21 K2 · · · K2K .. . . . . ... . . . KK1 KK2 · · · KK        166 Here, the diagonal blocks Kk represent the kernel similarity matrices for samples within the same167 study k, and the off-diagonals represent the microbiome similarity for pairs of samples from dif-168 ferent studies. For existing kernel-based/distance-based methods, all elements in Kfull must be169 calculated using the same distance; otherwise, the absolute values of pairwise kernels using differ-170 ent metrics are not on the same scale and thus not comparable.171 In practice, we often only have access to the diagonal blocks, represented as follows:172 8 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint K =        K1 K2 ... KK        173 In this matrix, only diagonal blocks of K are obtained, while the cells outside the diagonal blocks174 are missing. Having only the elements in diagonal blocks gives us the flexibility to choose different175 kernels for each study. For example, K1 may be a UniFrac kernel while K2 is a Bray-Curtis176 kernel. While it is advantageous to interpret the results when all kernels involved are of the same177 type, using different kernels for each study remains valid. This flexibility is particularly helpful178 when kernels of the same type are not available for some studies. For example, we may be able179 to construct a UniFrac kernel for study 1, but not for study 2 if the necessary phylogenetic tree180 information is unavailable, such as when using publicly available data. In this paper, we will181 focus on the situations where all individual Kk are of the same type for simplicity. The vectorized182 SMR model enables us to test the association based on these blocks without the need of the off183 diagonal-blocks, by defining the similarity regression model only for existing Ki j values.184 To motivate the SMR model when integrating multiple microbiome datasets, we first evalu-185 ate its counterpart model within the kernel machine regression framework, assuming the overall186 microbiome kernel similarity matrix Kfull is available. For continuous outcomes, the correlated187 sequence kernel association test (CSKAT) [20], an extension of MiRKAT, can be used to test the188 association between microbiome profiles and outcomes when data from the same study are clus-189 tered. CSKAT assumes a linear mixed-effects model with study-specific random effects to link the190 outcome and microbial composition:191 Yik = X T ikβ + f (Mik) +hk + εik , (5)192 where hk represents the study-specific random effect, which captures within-study correlation, and193 hk ∼ N(0, σ 2). The subject-level random effect is εik ∼ N(0, δ 2 k ).194 9 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint For dichotomous outcomes, another extension of MiRKAT, GLMM-MiRKAT [21], incorpo-195 rates a study-specific random effect:196 logit(P(Yik = 1)) = X T ikβ + f (Mik) +hk . (6)197 Putting samples from all studies together in vector format, we have h ∼ N(0, σ 2Σh), Σh =198 diag{1n11T n1, · · ·,1nK1T nK }, ϵ ∼ N(0,Σϵ), Σϵ = diag{δ 2 1 In1, · · ·, δ 2 S InK }.199 If a full kernel matrix across all studies is available, the kernel machine regression model in200 CSKAT and GLMM-MiRKAT is equivalent to the (generalized) linear mixed model with f ∼201 N(0, τKfull). Hypothesis testing for H0 : τ = 0 can be conducted as described in the literature.202 However, because only the diagonal blocks Kk of the overall kernel matrix are available, CSKAT203 and GLMM-MiRKAT cannot be applied. However, we can still calculate the trait similarity Zk204 and the variance-covariance matrix Vk = Var(Zk|Xk, Mk) (please refer to section A of the Sup-205 plementary Materials).206 When only K is available, we propose the following mixed effect similarity matrix regression207 model (SMRmix):208 Zi j,k = a + bKi j,k + uk + fi + ei j,k, i < j, (7)209 where, for subjects i, j in study k, the outcome similarity Zi j,k is linked to microbiome similarity210 Ki j,k. Here, ei j,k is a random error that is allowed to differ across studies, uk is a study-specific211 random effect, fi is a random effect that captures the additional correlation between two pairs of212 trait similarities if the two pairs share a common sample. For example, fi captures the additional213 correlation between Zi j,k and Zi j′,k compared to Zi j,k and Zi′ j′,k, with i ̸= i′ and j ̸= j′. Compared214 to the original SMR model, which is equivalent to a kernel machine regression model, our model215 is based on a mixed effect kernel machine regression model, thus the name SMRmix. All random216 effects have mean 0 and constant variance. The mathematical relationship between the random217 effects in SMRmix and the random effects in CSKAT and GLMM-MiRKAT is specified in Section218 B of Supplementary Materials.219 10 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint To test whether there is a microbiome effect, we test H0 : b = 0. We propose the following220 score statistic:221 U = K ∑ k=1 (Kvec k )T (Vk)−1(Zvec k − E(Zvec k |Xk, Mk)), (8)222 where Kvec k = ( K(1,2),k, · · ·, K(1,nk),k, K(2,3),k, · · ·, K(2,nk),k, · · ·, K(nk−1,nk),k)T is a vector obtained223 by stacking all the available within-study pairwise kernels of the k-th study. Similarly, Zvec k =224 (Z(1,2),k, · · ·, Z(1,nk),k, Z(2,3),k, · · ·, Z(2,nk),k, · · ·, Z(nk−1,nk),k)T , Vk = Var(Zvec k ) represents the corre-225 sponding outcome similarities and variances calculated under the null.226 This test aggregates the score statistics from different studies, requiring only microbial similar-227 ities within studies, whereas CSKAT requires microbial pairwise similarities of all samples.228 2.3 Perturbation approach for obtaining p-value229 In the original SMR approach, the test statistic can be expressed in a quadratic form and230 approximated by a weighted chi-square distribution. However, for SMRmix, due to the com-231 plexity of the variance matrix V vec k and the vectorization involved, the asymptotic distribution232 of U cannot be easily approximated. Instead, we propose a perturbation approach for obtain-233 ing the null distribution of U and thereby calculating the p-value. We note that, each element in234 Zk = (Yk − µ0 k)(Yk − µ0 k)T is obtained from the outer product ofYk − µ0 k, which, after some trans-235 formations, asymptotically follows a multivariate normal distribution with mean 0 and a covariance236 matrix that can be estimated under the null hypothesis. The idea behind the perturbation approach237 is to generate a large number of ˜Zvec k values that asymptotically have the same distribution asZvec k .238 By doing so, we generate a large number ofU ∗ that have the same distribution asU under the null.239 Therefore, we can compare the observed test statistic with U ∗ to calculate the p-value.240 Perturbation procedures, sometimes referred to as resampling or Monte Carlo approaches, have241 been used by many researchers in various contexts [27, 28, 29, 30]. Our approach is related to these242 methods, but differs substantially in that we need to account for the additional correlations that arise243 from the clustered nature of our data. Specifically, let the estimated covariance matrix of Yk − µ0 k244 11 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint from the mixed-effect model be Σk. For continuous outcomes, Σk is a symmetric matrix with245 diagonal elements equal to ˆσ 2 + ˆδk 2 and off-diagonal elements equal to ˆσ 2. For binary outcomes,246 i−th diagonal elements is ˆµ0 ik(1− ˆµ0 ik), and element(i, j), i ̸= j, in Σk is ˆµ0 ik(1− ˆµ0 ik) ˆµ0 jk(1− ˆµ0 ik) ˆσ 2.247 Let Pk = Σk − Xk(X T k Σ−1Xk)−1X T k , then we get P 1 2 k = Σ 1 2 k − Xk(X T k Σ−1 k Xk)−1X T k Σ − 1 2 k .248 It can be shown that, under the null hypothesis,249 P − 1 2 k (Yk − µ0 k) ∼ MV N(0, IN), (9)250 where IN is the identity matrix of size N.251 To generate the null distribution of the test statistic, we first generate random samples ˜Rk252 from a k-dimensional normal distribution with mean 0 and identity covariance matrix (i.e., the253 distribution of P − 1 2 k (Yk − µ0 k)). Since P 1 2 k ˜Rk has the same distribution as Yk − µ0 k, we can replace254 Zvec k in equation (8) with a vectorized version of the outer product of P 1 2 k ˜Rk, denoted as ˜Zvec k .255 This process is repeated multiple times to construct the null distribution of the test statistic. The256 resulting distribution is then used to conduct hypothesis testing and calculate p-values.257 2.4 Omnibus test258 Previous research has shown that the power of statistical methods depends on whether the259 selected kernel effectively captures the true association signal [19]. For example, the weighted260 UniFrac and Bray-Curtis kernels utilize relative abundance information to construct pairwise sim-261 ilarities, giving higher weight to common taxa. Consequently, these kernels are more powerful262 when the true association involves common taxa. Conversely, the unweighted UniFrac and Jaccard263 kernels only use information on whether a taxon is present in a sample, giving higher weight to264 rare taxa, and are therefore more powerful when the true association involves rare taxa. In practice,265 the optimal kernel choice is unknown prior to analysis.266 To address this uncertainty, we propose an omnibus test that extends SMRmix to simulta-267 neously consider multiple kernels for each study. The omnibus-SMRmix is expected to be robust268 12 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint across various association patterns, losing little power compared to the best individual kernel while269 achieving substantially greater power than a poorly chosen kernel.270 Let Kl, l = (1, · · ·, L) represent the diagonal block kernels, where the kernel for each study is271 of the same type, such as weighted UniFrac. Suppose there are a total of L such groups of kernels.272 For each Kl, we conduct SMRmix and apply the perturbation approach as previously described273 to obtain a p-value. These p-values are then combined using the harmonic mean combination274 (HMP) [31]. In our simulations, the HMP combination demonstrated superior type I error control275 compared to other combination methods, such as the Cauchy combination [32], Simes combination276 [33] and the Tippett combination [34]. Note that even though we have L sets of SMRmix models,277 we only need to generate a single set of B multivariate normal distributions for perturbation across278 all sets of kernels. This drastically reduces computational time.279 2.5 Simulation Studies280 To assess the performance of SMRmix, we conducted extensive simulations focusing on both281 type I error and statistical power, comparing SMRmix against CSKAT and GLMM-MiRKAT. For282 power evaluation, we additionally compared SMRmix to MiRKAT-meta, which employs the HMP283 combination method to integrate p-values from individual MiRKAT analyses into a single p-value.284 We performed two sets of simulations.285 2.5.1 Simulation A: Microbiome data processed via the same pipeline with a common phy-286 logenetic tree287 In Simulation A, we generated microbiome data using a Dirichlet-Multinomial distribution,288 with hyperparameters estimated from real data. This simulation framework has been used in previ-289 ous studies [19, 20, 35]. To account for differences in data collection across studies, we introduced290 a multiplicative and differential bias to capture inter-study heterogeneity. This approach ensured291 that the microbiome data from all studies shared the same tree, allowing the computation of a com-292 mon kernel matrix that methods like CSKAT or GLMM-MiRKAT can be applied. However, with293 13 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint differential bias across studies, the kernel components across different studies were not directly294 comparable, potentially impacting the validity of results from CSKAT or GLMM-MiRKAT. This295 setup resembles scenarios where raw sequencing data are available and can be processed through296 a unified bioinformatic pipeline.297 Panel A in Figure 1 illustrates the general workflow of this simulation setup. Specifically, we298 first generated a large OTU table via a Dirichlet-Multinomial distribution with hyperparameters299 estimated from a real throat microbiome dataset [36], which contained 60 samples with 856 OTUs.300 These generated microbiome data were treated as the “true” microbiome abundance, from which301 we simulated the outcome variable. The association effect was set to either the same or different302 across studies depending on the different simulation scenarios. The total number of counts for each303 sample was set to 106, representing the total number of 16s rRNA molecules in the community. We304 denote this “true” OTU table as M ∗, which is unobserved. From this community, we conducted305 sub-sampling to generate the observed microbiome OTU table (denoted as M), which was used306 for association testing.307 After simulating the true OTU table, we partitioned the 856 taxa into 20 clusters based on308 cophenetic distances calculated from the phylogenetic tree. Using the simulated microbial com-309 munity, for a continuous trait, we generated the outcomes via the following model:310 Yik = 0.5 · (1 + Xik,1 + Xik,2) + fk · scale( ∑ l∈A M∗ ik,l) +hk + εik , (10)311 where, i = 1, · · ·, nk, k = 1, · · ·, K. The variables X·,1 and X·,2 are covariates, withX·,1 ∼ Bernoulli( 0.5)312 and X·,2 ∼ N(0,1). Here, M∗ ik,l is the unobserved “true” count of l-th taxon for individual i in k-th313 study, l ∈ A indicates whether taxon l belongs to a specific taxon cluster A . The function scale(·)314 standardizes vector ∑l∈A {M∗ ik,l}i to have zero mean and unit variance. Since the total number of315 counts is set at 10 6 for all samples, this is analogous to using the true relative abundances. The316 parameter fk represents the effect size for k-th study. Between-study variation and within-study317 variation are described by hk ∼ N(0, σ 2), εik ∼ N(0, δ 2 k ), respectively. We set σ 2 = 1 and the318 14 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint Figure 1: Workflow for simulation studies. Panel A: simulation scenario A, where microbiome data are processed from the same pipeline with a common tree structure. Panel B: simulation scenario B, where microbiome data are processed from different pipelines with different tree structures. same δ 2 k = 1 for all k, or different combinations of δ 2 k using δ 2 k = 1, for k = 1,2,3, δ 2 k = 3, for319 k = 4,5,6,7, δ 2 k = 5, for k = 8,9,10 to evaluate type I error. We set nk = 50, K = 10 so that we320 have a total of 10 studies with each study having 50 study subjects.321 For binary outcome, we generated outcomes by322 logit[P(Yik = 1)] = 0.5 · (1 + Xik,1 + Xik,2) + fk · scale( ∑ l∈A M∗ ik,l) +hk, (11)323 in which X·,1, X·,2, M∗ ik,l and hk are defined the same way as before.324 For type I error evaluation, we simulated data with fk = 0 for all k, indicating no taxon was325 15 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint associated with the outcome in any study. For power evaluation, we considered two scenarios:326 Scenario A1 the microbiome effect is the same across all studies.327 Scenario A2 the microbiome effects differs across studies.328 For scenario A1, we set fk = c for all k and A was the 16th cluster, which was a common cluster329 representing 19.6% of the total true OTU reads. The constant c was varied to assess the power. In330 scenario A2 , we set fk = c for k = 1, · · ·,5 and fk = −c for k = 6, · · ·,10. Other clusters had no331 association with the outcome Y .332 This outcome generation process linked the response variable to the “true” microbiome com-333 munity, which is unobserved in real settings. We then generated the observed OTU table (M ) by334 sub-sampling from the “true” OTU table, simulating the sequencing process as a random sampling335 (with or without bias) from the 16S rRNA molecules (a total of 10 6 per sample). Two methods336 of random sub-sampling were considered. The first method is unbiased sampling. In this method,337 each molecule had an equal probability of being sampled, representing an unbiased sequencing338 process. We fixed the library size for each sample at 1,000 reads and generated observed OTU339 counts from a multinomial distribution with probabilities M∗ ik,l × 10−6 and total read counts of340 1,000. In this case, the observed OTU counts were a faithful representation of the underlying341 community with only random sampling errors. The second method is biased sampling, where we342 introduced bias into the sequencing process. Some taxa were observed with higher or lower relative343 abundances compared to the true community. First, we generated the observed OTU table as in the344 unbiased sampling method. Then, for the first five studies, we inflated the read counts of the most345 abundant taxa by 1.1-fold. These “common” taxa were defined as the top 10% with the highest rel-346 ative abundances in the true community. Consequently, the common taxa were over-represented,347 while other taxa had lower relative abundances, introducing bias into the observed data. Due to348 this bias, the beta-diversities between samples from different studies were not reliable.349 All hypothesis testings and model evaluations were conducted using these “observed” datasets,350 as they reflect real-world data scenarios. We applied SMRmix (including both single-kernel and351 omnibus versions), CKSAT (for continuous outcomes), and GLMM-MiRKAT (for binary out-352 16 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint comes). MiRKAT-meta was also included in power evaluation. Each evaluation involved 10,000353 replicates for type I error assessment and 5,000 replicates for power evaluation. We tested the null354 hypothesis of no association between the microbiome profiles and outcomes. For the tests with a355 single kernel, we used three kernels: weighted UniFrac( Kw), unweighted UniFrac(Ku) and Bray-356 Curtis (Kbc). The UniFrac kernels utilize phylogenetic information while the Bray-Curtis does357 not. For SMRmix-Omnibus test, we synthesized test results from five kernelsm including weighted358 UniFrac, unweighted UniFrac, Bray-Curtis, and generalized UniFrac with weight parameter of 0359 and 0.5.360 2.5.2 Simulation B: Microbiome data processed through different pipelines with different361 tree structures362 In simulations (B), we reprocessed the raw sequencing (FASTQ) files from a real microbiome363 study through different bioinformatic pipelines, resulting in multiple OTU tables with distinct tree364 structures. For each dataset processed through a distinct bioinformatic pipelines, we estimated365 hyperparameters of the Dirichlet-Multinomial distribution and subsequently simulated additional366 microbiome data with its corresponding tree structure. This set of simulations reflects scenar-367 ios where the collected microbiome datasets are pre-processed through different bioinformatic368 pipelines, but access to the raw sequencing data is unavailable, a common situation when working369 with public datasets. In such scenarios, generating a single kernel matrix is not feasible, and ap-370 proaches that require a common kernel matrix, such as CSKAT and GLMM-MiRKAT, cannot be371 applied.372 Panel B in Figure 1 summarizes the general workflow of this simulation. The raw sequencing373 files were from a real gut microbiome study [10], which contains stool microbiome data from374 82 HIV-positive and 40 HIV-negative individuals. The samples underwent PCR amplification of375 the V4 region and were sequenced on the Illumina MiSeq platform. We downloaded the paired-376 end sequencing data in FASTQ format from Synapse (ID: syn22273201), and reprocess it through377 different pipelines to generate a total of nine dataset, each with distinct phylogenetic tree structures378 17 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint and OTU definitions.379 Among the nine datasets, six of them were processed through the QIIME2 pipeline [37]. Within380 QIIME2, paired-end sequencing data were joined and filtered, and two denoising techniques-381 DADA2 [38] and Deblur [39]-were used along with three phylogeny construction approaches-382 FastTree [40], RAxML [41] and IQ-TREE [42]. This combination resulted in six (2 × 3) distinct383 datasets. The remaining three datasets were generated using the original QIIME pipeline [43], em-384 ploying three different OTU-picking strategies: closed-reference OTU picking using SILV A (Re-385 lease 132) [44, 45], closed-reference OTU picking using Greengenes (Version 13 8) [46], and De386 novo OTU picking. Due to differences in the phylogenetic trees and taxa definitions across these387 processing steps, a common kernel matrix was not available in this simulation. This represents388 a scenario in which multiple public datasets are integrated without access to the raw sequencing389 data.390 After generating the nine datasets, we partitioned the taxa in each dataset into 20 clusters. We391 then simulated new count tables based on these datasets using Dirichlet-Multinomial distribution392 and the hyperparameters estimated from the nine count tables. The outcomes were generated using393 equation (10) for continuous traits and equation (11) for binary traits. Similar to scenario A, we394 evaluate type I error and power using σ 2 = 1 and either the same δ 2 k = 1 for all k, or different395 combinations of δ 2 k where δ 2 k = 1 for k = 1,2,3, δ 2 k = 3 for k = 4,5,6,7, δ 2 k = 5 for k = 8,9,10. To396 assess power, we assumed that the microbiome effect was consistent across the nine tables, because397 they originated from the same microbiome community. We specified fk = c for k ∈ {1, · · ·,9} and398 selected the 16th cluster in each dataset as A .399 Unlike Simulation A, we did not introduce a process to simulate bias in this scenario, as there400 was no “true” community available for comparison. We applied the proposed SMRmix and meta-401 analysis procedures to the simulated data, noting that CSKAT and GLMM-MiRKAT could not be402 used due to the lack of a common kernel matrix.403 18 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint 3 Results404 3.1 Simulation results405 Table 1 summarizes the type I error rates for continuous and binary outcomes across various406 scenarios in simulation setups A and B. SMRmix controls type I error rates effectively across all407 scenarios. In simulation setup A, CSKAT and GLMM-MiRKAT also showed valid type I error408 rates, regardless of sequencing bias. For binary outcomes, SMRmix was slightly conservative.409 Table 1: Empirical type I error rates comparing CSKAT, GLMM-MiRKAT, and SMRmix at differ- ent α cutoffs (0.05 and 0.01) in simulations. For continuous outcome with the sameδ 2 k , δ 2 k = 1, ∀k; for continuous outcome with different δ 2 k , δ 2 k = 1 for k = 1,2,3, δ 2 k = 3 for k = 4,5,6,7, δ 2 k = 5 for k = 8,9,10. For binary scenarios, σ 2 = 1. α = 0.05 α = 0.01

Methods

Kw Ku Kbc Omnib us Kw Ku Kbc Omnibus Simulation A, Continuous Y , No Bias Same δ 2 k CSKAT 0.0528 0.0488 0.0502 / 0.0106 0.0081 0.0087 / SMRmix 0.0508 0.0522 0.0509 0.0494 0.0110 0.0107 0.0110 0.0111 Different δ 2 k CSKAT 0.0475 0.0481 0.0482 / 0.0100 0.0098 0.0092 / SMRmix 0.0459 0.0525 0.0492 0.0481 0.0101 0.0094 0.0098 0.0105 Simulation A, Continuous Y , With Bias Same δ 2 k CSKAT 0.0494 0.0477 0.0498 / 0.0082 0.0094 0.0087 / SMRmix 0.0526 0.0512 0.0504 0.0522 0.0106 0.0099 0.0113 0.0111 Different δ 2 k CSKAT 0.0552 0.0493 0.0528 / 0.0112 0.0094 0.0098 / SMRmix 0.0513 0.0506 0.0514 0.0513 0.0104 0.0105 0.0107 0.0112 Simulation A, Binary Y , No Bias GLMM-MiRKAT 0.0513 0.0522 0.0500 / 0.0097 0.0105 0.0096 / SMRmix 0.0403 0.0444 0.0443 0.0377 0.0083 0.0077 0.0071 0.0057 Simulation A, Binary Y , With Bias GLMM-MiRKAT 0.0474 0.0511 0.0485 / 0.0075 0.0107 0.0091 / SMRmix 0.0403 0.0387 0.0397 0.0388 0.0047 0.0056 0.0047 0.0051 Simulation B, Continuous Y SMRmix 0.0469 0.0583 0.0504 0.0493 0.0083 0.0106 0.0101 0.0107 Simulation B, Binary Y SMRmix 0.0471 0.0438 0.0393 0.0452 0.0084 0.0064 0.0060 0.0072 *Note: CSKAT is not applicable for studies with binary outcomes, and only SMRmix is applicable for Simulation B. Figure 2 presents the power results for simulation setups A and B without sequencing bias. In410 simulation A1 with both continuous and binary outcomes, (left panels in Figure 2) when all stud-411 ies share the same microbiome effect and there is no sequencing bias, CSKAT/GLMM-MiRKAT412 achieves higher power than SMRmix when comparing methods using the same type of kernel413 19 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint (e.g., Bray-Curtis kernel). This is expected, as in this scenario, microbiome data across studies are414 comparable and CSKAT/GLMM-MiRKAT leverage the similarities between samples from differ-415 ent studies, offering more efficient use of the data. In contrast, the meta-analysis approach that416 combines p-values from multiple MiRKAT models exhibits substantial power loss.417 When comparing different kernels within the same method, both CSKAT and SMRmix demon-418 strate that the weighted UniFrac is the most powerful, followed by Bray-Curtis, and the unweighted419 UniFrac is the least powerful. This is because the microbiome effect in this scenario targets a420 phylogenetically-informed cluster of common taxa. A similar trend is observed even when se-421 quencing bias is introduced (left panels in Figure S1 in Supplementary Materials), likely due to the422 limited magnitude of bias introduced in this simulation.423 In scenarios A2, where studies have distinct microbiome effects (middle panels in Figure 2)424 and no sequencing bias, CSKAT/GLMM-MiRKAT, regardless of the kernel choice, show minimal425 power. This is because CSKAT/GLMM-MiRKAT implicitly assume that the microbiome-outcome426 relationship is consistent across studies via the same kernel function, which is violated in this sce-427 nario. In contrast, SMRmix demonstrates higher power. While the power of meta-analysis is428 substantially lower than SMRmix, it still performs better than CSKAT/GLMM-MiRKAT in this429 scenario. When comparing different kernels, the results are similar to those observed in Scenario430 A1. Notably, SMRmix maintains its power effectively between Simulations A1 and A2, demon-431 strating its robustness in scenarios with differing effect sizes across studies. A similar conclusion432 is drawn when sequencing bias is introduced (right panels of Figure S1 in the Supplementary Ma-433 terials).434 In simulation setup B, SMRmix exhibits much higher power than the meta-analysis procedure435 (right panels in Figure 2). CSKAT and GLMM-MiRKAT are not applicable in this simulation.436 The power results for continuous outcomes with varying variances of study-specific random437 effects (δ2 k ) are displayed in Figure S2 of the Supplementary Materials. The conclusions of perfor-438 mance under scenarios A1 and A2 remain consistent with previous findings. Notably, in simulation439 scenario A2, the meta analysis suffers a substantial power loss, while SMRmix stands out as the440 20 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint Figure 2: Statistical power for simulations A and B when there is no bias in sequencing sampling. For continuous scenarios (upper three panels), δ 2 k = 1, ∀k, and σ 2 = 1. only method capable of effectively detecting associations.441 3.2 Application to HIV studies442 We utilized SMRmix to analyze data from the HIV Microbiome Re-analysis Consortium [47],443 which re-analyzed α−diversity using raw 16S rRNA sequencing data from 17 studies. These444 studies aimed to characterize the gut dysbiosis in HIV+ individuals. The microbiome amplicon445 sequencing data and metadata of these studies are available on Synapse (ID: syn18406854) and446 NCBI. The 17 studies differ in their study design, institution of participants recruitment, study447 population, type of samples and various factors in the sequencing protocol including the 16s rRNA448 variable regions that were sequenced and the sequencing platform. The consortium reprocessed449 the raw sequencing data using standardized protocols and conducted taxonomic assignments in450 species level through Resphera Insight.451 Our goal was to evaluate the association between microbiome composition and HIV status,452 21 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint MSM status by integrating all these studies and draw a consistent conclusion. Out of the 17453 datasets, 14 studies(763 samples and 666 taxa) have subject-level information on MSM status,454 age, and HIV status, which we used for further analysis. Because a common phylogenetic tree is455 not available from Synapse webpage, we created a phylogentic tree using the “tax tree” function456 in R package “LTN” based on the taxonmy relationship [48]. The Principal Coordinates Anal-457 ysis (PCoA) plots, which visualize the distance matrix between study subjects, are presented in458 Supplementary Figure S3.459 We conducted the association tests utilizing single-kernel SMRmix based on kernels including weighted UniFrac(K w), unweighted UniFrac(K u), Jaccard(K jac), Bray-Curtis(K bc) and omnibus SMRmix using the above kernels. Suggested by previous studies [3], we adjusted HIV status when the outcome variable is MSM, and adjusted MSM when the outcome variable is HIV status. The statistical model was established as logit(P(HIVik = 1)) = β1ageik + β2MSMik + f (Mik) +hk, logit(P(MSMik = 1)) = β1ageik + β2HIVik + f (Mik) +hk. We also conducted an analysis in which each individual sample was analyzed separately, and HMP460 combination was used to obtain the final p-values, similar to the approach used in the simulation461 studies. The p-values are reported in Table 2. MiRKAT-meta yielded a p-value of 0.0623 for the462 outcome of HIV , while SMRmix reported a p-value< 0.0001, in line with our simulation results.463 Table 2: P-values of using SMRmix to test associations between microbiome composition and outcome variables HIV Status and MSM adjusting for age. Outcome SMRmix MiRKAT-meta Kw Ku K jac Kbc Omnibus Omnibus HIV data HIV < 0.0001 < 0.0001 < 0.0001 < 0.0001 < 0.0001 0.0623 MSM < 0.0001 < 0.0001 < 0.0001 < 0.0001 < 0.0001 < 0.0001 CRC data CRC < 0.0001 < 0.0001 < 0.0001 < 0.0001 < 0.0001 < 0.0001 For the outcome HIV status, SMRmix with weighted UniFrac, UniFrac, Jaccard, and Bray-464 22 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint Curtis kernels all discovered significant signals (p-values< 0.0001). Synthesizing the results based465 on above kernels, omnibus SMRmix found HIV status and microbiome composition associated (p-466 value < 0.0001). For the outcome MSM, similar conclusions were obtained that SMRmix with all467 kernels show significant connection between MSM and microbiome composition.468 3.3 Application to colorectal cancer data469 We applied the proposed method to eleven metagenomic datasets containing species-level OTU470 data for fecal microbiomes of individuals, consisting of 354 colorectal cancer patients and 393471 healthy controls. These datasets were obtained from the curatedMetagenomicData package in472 Bioconductor, which provides standardized microbial profiling using MetaPhlAn [49]. These stud-473 ies exhibit great heterogeneity in sequencing platforms (Illumina HiSeq, Illumina NextSeq), DNA474 extraction kits (Gnome, MoBio, Qiagen), and other technical factors. The sample sizes varied from475 9 to 279 across the studies (Supplementary Table S1). PCoA plots are provided in Supplementary476 Figure S4. Samples from the same study tend to cluster together, with noticeable shifts across477 clusters, suggesting study-specific variation.478 The objective was to evaluate differences in microbiome composition between colorectal can-479 cer patients and healthy controls. Patient demographics are summarized in Supplementary Table480 S1. To examine variations in patient demographics within studies, we fit a generalized linear mixed481 model (GLMM), using disease status as the outcome and age, gender, and BMI as control variables.482 Significant variations were found in patient demographics within studies, including age (GLMM483 p-value < 0.0001), gender (GLMM p-value = 0.0087), and BMI (GLMM p-value = 0.0297). To484 account for these variations, we adjusted for these factors when applying SMRmix to integrate485 information from these studies. Using SMRmix, we found a significant difference in microbiome486 composition between colorectal cancer patients and healthy controls based on weighted UniFrac,487 unweighted UniFrac, Jaccard, and Bray-Curtis kernels (See Table 2), and MiRKAT-meta yielded488 similar results. These findings are consistent with previous research, which also suggests a distinct489 microbiome profile associated with colorectal cancer [50].490 23 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint 4 Discussion491 In this paper, we present a novel approach, SMRmix for testing the association between mi-492 crobiome composition and outcomes of interest, by combining the concepts of similarity matrix493 regression and kernel machine regression. This method is particularly beneficial in the context of494 integrative analysis of microbiome data from multiple studies, cohorts or protocols, where inter-495 study pairwise distances are either absent or unreliable. SMRmix effectively addresses the limi-496 tations of existing kernel regression tests by leveraging only within-study pairwise distances for497 testing. In simulations, SMRmix showed well-controlled Type I error rates. We compared SMR-498 mix to two other methods, CSKAT and GLMM-MiRKAT (which require the usually non-available499 between-study pairwise distance), for microbiome compositional analysis. Our results showed500 that if distances of all pairs of samples are available and when the microbiome effect is homoge-501 neous, SMRmix has relatively lower power. However, SMRmix exhibits superior performance in502 scenarios where the effects of taxa are a mixture of positive and negative effects, and when differ-503 ent effect sizes result in a significant reduction in power for CSKAT and GLMM-MiRKAT. This504 highlights the robustness and effectiveness of SMRmix, particularly when inter-study pairwise dis-505 tances may not be reliable or available. We also compared SMRmix with a meta-analysis procedure506 (which, like SMRmix, doesn’t require the between-study pairwise distance), and showed higher507 power of SMRmix across all simulation setups. In summary, our proposed SMRmix method rep-508 resents a novel and effective approach for microbiome compositional analysis, with the potential509 to contribute to ongoing efforts to better understand the complex interactions between microbial510 communities and host health.511 Kernel selection is a critical challenge for any kernel-based approach, as choosing an unsuitable512 kernel can severely reduce the statistical power despite controlling the type I error. In microbiome513 studies, each kernel function assumes a distinct microbiome effect in the community, and improper514 choice of kernel can compromise the analysis [51]. To address this issue, SMRmix provides an515 omnibus test that integrates multiple kernel matrices to enhance the robustness of the analysis.516 SMRmix also allows for the use of different kernels for each study, such as Bray-Curtis distances517 24 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint in one study and UniFrac distances in another, thereby offering greater flexibility depending on the518 data availability. Our simulations demonstrate that single-kernel tests can be sensitive to kernel519 selection. In contrast, our omnibus SMRmix approach synthesizes results from multiple kernel-520 based tests by combining p-values and only incurs a small loss of power relative to using the best521 kernel. By contrast, the omnibus test improves the power substantially compared to a poor kernel522 choice. Given that the best kernel is often unknown in practice, the omnibus test is a practical and523 useful alternative.524 In applying SMRmix to investigate the relationship between human gut microbiome com-525 position and HIV infection using data from 14 studies, we aimed to address conflicting find-526 ings in the literature. Although many studies have reported gut microbiome dysbiosis associ-527 ated with HIV infection and highlighted the potential role of “leaky gut” in HIV pathogenesis528 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], others have suggested that MSM status may confound the HIV-529 microbiome relationship [3]. Our analysis adjusted for age and MSM status to assess the associ-530 ation between microbiome composition and HIV status, and similarly adjusted for age and HIV531 infection to examine the association between microbiome composition and MSM status. By inte-532 grating information from multiple studies on this topic, we found significant associations between533 microbiome composition and HIV (adjusting for age and MSM) and between microbiome and534 MSM (adjusting for age and HIV). These findings provide compelling evidence of HIV-related535 dysbiosis that is independent of MSM status, challenging the notion that MSM is only a confound-536 ing factor in the HIV-microbiome relationship.537 Supplementary Information538 Supplementary Files539 Supplemental files can be obtained through online version of the paper at Microbiome.540 25 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint Authors’ contributions541 MH contributed to the development of the method, performed simulation studies and compar-542 isons, and wrote the manuscript. NZ conceived the study, primarily developed the method, and543 wrote the manuscript. All authors read and approved the final manuscript.544 Funding545 This work was funded by by NIH grants R21AI154236, R01GM147162 and U24OD023382.546 Availability of data and materials547 The implementation of SMRmix in R codes are available in the GitHub repository https:548 //github.com/mengyu-he/SMRmix.549 Ethics approval and consent to participate550 Not applicable.551 Consent for publication552 Not applicable.553 Competing interests554 The authors declare that they have no competing interests.555 Acknowledgements556 Not applicable.557 26 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint Author affiliations558 Department of Biostatistics and Bioinformatics, Emory University, Atlanta, 30322, GA, USA559 (Mengyu He)560 Department of Biostatistics, Johns Hopkins University, Baltimore, MD 21205, USA (Ni Zhao)561 References562 [1] Dubourg, G. et al. Gut microbiota associated with HIV infection is significantly enriched in563 bacteria tolerant to oxygen. BMJ Open Gastroenterology 3, e000080 (2016). URL https:564 //bmjopengastro.bmj.com/lookup/doi/10.1136/bmjgast-2016-000080.565 [2] Mutlu, E. A. et al. A Compositional Look at the Human Gastrointestinal Microbiome and566 Immune Activation Parameters in HIV Infected Subjects. PLoS Pathogens 10, e1003829567 (2014). URL https://dx.plos.org/10.1371/journal.ppat.1003829.568 [3] Noguera-Julian, M. et al. Gut Microbiota Linked to Sexual Preference and HIV Infec-569 tion. EBioMedicine 5, 135–146 (2016). URL https://linkinghub.elsevier.com/570 retrieve/pii/S2352396416300287.571 [4] Nowak, P. et al. Gut microbiota diversity predicts immune status in HIV-572 1 infection. AIDS 29, 2409–2418 (2015). URL https://journals.lww.com/573 00002030-201511280-00004.574 [5] Pinto-Cardoso, S. et al. Fecal Bacterial Communities in treated HIV infected individuals on575 two antiretroviral regimens. Scientific Reports 7, 43741 (2017). URL http://www.nature.576 com/articles/srep43741.577 [6] Villanueva-Mill ´an, M. J., P ´erez-Matute, P., Recio-Fern ´andez, E., Lezana Rosales, J. M.578 & Oteo, J. A. Differential effects of antiretrovirals on microbial translocation and579 gut microbiota composition of HIV-infected patients: Villanueva-Mill ´an MJ et al.580 27 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint Journal of the International AIDS Society 20, 21526 (2017). URL http://doi.wiley.581 com/10.7448/IAS.20.1.21526.582 [7] Yu, G., Fadrosh, D., Ma, B., Ravel, J. & Goedert, J. J. Anal microbiota profiles in HIV-583 positive and HIV-negative MSM. AIDS 28, 753–760 (2014). URL https://journals.584 lww.com/00002030-201403130-00013.585 [8] Sun, Y .et al. Fecal bacterial microbiome diversity in chronic HIV-infected patients in China.586 Emerging Microbes & Infections 5, 1–7 (2016). URL https://www.tandfonline.com/587 doi/full/10.1038/emi.2016.25.588 [9] V ´azquez-Castellanos, J. F. et al. Altered metabolism of gut microbiota contributes to chronic589 immune activation in HIV-infected individuals. Mucosal Immunology 8, 760–772 (2015).590 URL http://www.nature.com/articles/mi2014107.591 [10] Monaco, C. et al. Altered Virome and Bacterial Microbiome in Human Immunode-592 ficiency Virus-Associated Acquired Immunodeficiency Syndrome. Cell Host & Microbe593 19, 311–322 (2016). URL https://linkinghub.elsevier.com/retrieve/pii/594 S193131281630052X.595 [11] Vesterbacka, J. et al. Richer gut microbiota with distinct metabolic profile in HIV in-596 fected Elite Controllers. Scientific Reports 7, 6269 (2017). URL http://www.nature.597 com/articles/s41598-017-06675-1 .598 [12] V olpe, G. E.et al. Associations of Cocaine Use and HIV Infection With the Intestinal Micro-599 biota, Microbial Translocation, and Inflammation. Journal of Studies on Alcohol and Drugs600 75, 347–357 (2014). URL http://www.jsad.com/doi/10.15288/jsad.2014.75.347.601 [13] Ling, Z. et al. Alterations in the Fecal Microbiota of Patients with HIV-1 Infection: An602 Observational Study in A Chinese Population. Scientific Reports 6, 30673 (2016). URL603 http://www.nature.com/articles/srep30673.604 28 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint [14] Dinh, D. M. et al. Intestinal Microbiota, Microbial Translocation, and Systemic Inflammation605 in Chronic HIV Infection. Journal of Infectious Diseases 211, 19–27 (2015). URL https:606 //academic.oup.com/jid/article-lookup/doi/10.1093/infdis/jiu409.607 [15] Nowak, R. G. et al. Rectal microbiota among HIV-uninfected, untreated HIV, and treated608 HIV-infected in Nigeria. AIDS 31, 857–862 (2017). URL https://journals.lww.com/609 00002030-201703270-00014.610 [16] Lozupone, C. et al. Alterations in the Gut Microbiota Associated with HIV-1 Infection.611 Cell Host & Microbe 14, 329–339 (2013). URL https://linkinghub.elsevier.com/612 retrieve/pii/S1931312813002886.613 [17] McArdle, B. H. & Anderson, M. J. Fitting multivariate models to community data: a com-614 ment on distance-based redundancy analysis. Ecology 82, 290–297 (2001).615 [18] MetaHIT Consortium (additional members) et al. Enterotypes of the human gut microbiome.616 Nature 473, 174–180 (2011). URL http://www.nature.com/articles/nature09944.617 [19] Zhao, N. et al. Testing in microbiome-profiling studies with MiRKAT, the microbiome618 regression-based kernel association test. The American Journal of Human Genetics 96, 797–619 807 (2015). URL https://doi.org/10.1016/j.ajhg.2015.04.003.620 [20] Zhan, X. et al. A small-sample kernel association test for correlated data with application to621 microbiome association studies. Genetic Epidemiology 42, 772–782 (2018). URL https:622 //onlinelibrary.wiley.com/doi/abs/10.1002/gepi.22160.623 [21] Koh, H., Li, Y ., Zhan, X., Chen, J. & Zhao, N. A Distance-Based Kernel Association624 Test Based on the Generalized Linear Mixed Model for Correlated Microbiome Studies.625 Frontiers in Genetics 10, 458 (2019). URL https://www.frontiersin.org/article/626 10.3389/fgene.2019.00458/full.627 29 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint [22] Kwee, L. C., Liu, D., Lin, X., Ghosh, D. & Epstein, M. P. A Powerful and Flexible Mul-628 tilocus Association Test for Quantitative Traits. The American Journal of Human Genetics629 82, 386–397 (2008). URL https://linkinghub.elsevier.com/retrieve/pii/630 S0002929708000888.631 [23] Wu, M. C. et al. Powerful SNP-Set Analysis for Case-Control Genome-wide Association632 Studies. The American Journal of Human Genetics 86, 929–942 (2010). URL https://633 linkinghub.elsevier.com/retrieve/pii/S000292971000248X.634 [24] Wu, M. et al. Rare-Variant Association Testing for Sequencing Data with the Sequence635 Kernel Association Test. The American Journal of Human Genetics 89, 82–93 (2011). URL636 https://linkinghub.elsevier.com/retrieve/pii/S0002929711002229.637 [25] Tzeng, J.-Y ., Zhang, D., Chang, S.-M., Thomas, D. C. & Davidian, M. Gene-Trait Similarity638 Regression for Multimarker-Based Association Analysis. Biometrics 65, 822–832 (2009).639 URL http://doi.wiley.com/10.1111/j.1541-0420.2008.01176.x.640 [26] Zhan, X. Relationship Between MiRKAT and Coefficient of Determination in Similarity641 Matrix Regression. Processes 7, 79 (2019). URL http://www.mdpi.com/2227-9717/7/642 2/79.643 [27] Lin, D. Y . An efficient Monte Carlo approach to assessing statistical significance in ge-644 nomic studies. Bioinformatics 21, 781–787 (2005). URL https://academic.oup.com/645 bioinformatics/article-lookup/doi/10.1093/bioinformatics/bti053.646 [28] Conneely, K. N. & Boehnke, M. So Many Correlated Tests, So Little Time! Rapid Adjust-647 ment of P Values for Multiple Correlated Tests. The American Journal of Human Genetics648 81, 1158–1168 (2007). URL https://linkinghub.elsevier.com/retrieve/pii/649 S0002929707637665.650 [29] Chapman, J. & Whittaker, J. Analysis of multiple SNPs in a candidate gene or re-651 30 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint gion. Genetic Epidemiology 32, 560–566 (2008). URL http://doi.wiley.com/10.1002/652 gepi.20330.653 [30] Pan, W., Han, F. & Shen, X. Test Selection with Application to Detecting Disease Association654 with Multiple SNPs. Human Heredity 69, 120–130 (2010). URL https://www.karger.655 com/Article/FullText/264449.656 [31] Wilson, D. J. The harmonic mean p -value for combining dependent tests.657 Proceedings of the National Academy of Sciences 116, 1195–1200 (2019). URL http:658 //www.pnas.org/lookup/doi/10.1073/pnas.1814092116.659 [32] Liu, Y . & Xie, J. Cauchy combination test: A powerful test with analytic p-value calculation660 under arbitrary dependency structures. Journal of the American Statistical Association 115661 (2020).662 [33] Simes, R. J. An improved bonferroni procedure for multiple tests of significance. Biometrika663 73 (1986).664 [34] Tippett, L. H. C. The methods of statistics V ol. 2d ed., re (Williams665 Norgate ltd, London, 1931).666 [35] Chen, J. & Li, H. Kernel methods for regression analysis of microbiome compositional data,667 V ol. 55 (2013).668 [36] Charlson, E. S. et al. Disordered Microbial Communities in the Upper Respiratory Tract of669 Cigarette Smokers. PLoS ONE 5, e15216 (2010). URL https://dx.plos.org/10.1371/670 journal.pone.0015216.671 [37] Bolyen, E. et al. Reproducible, interactive, scalable and extensible microbiome data science672 using QIIME 2. Nature Biotechnology 37, 852–857 (2019). URL http://www.nature.673 com/articles/s41587-019-0209-9.674 31 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint [38] Callahan, B. J. et al. Dada2: High-resolution sample inference from illumina amplicon data.675 Nature Methods 13 (2016).676 [39] Amir, A. et al. Deblur rapidly resolves single-nucleotide community sequence patterns.677 mSystems 2 (2017). URL http://dx.doi.org/10.1128/mSystems.00191-16.678 [40] Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree: Computing Large Minimum Evolution679 Trees with Profiles instead of a Distance Matrix.Molecular Biology and Evolution 26, 1641–680 1650 (2009). URL https://academic.oup.com/mbe/article-lookup/doi/10.1093/681 molbev/msp077.682 [41] Stamatakis, A. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses683 with thousands of taxa and mixed models. Bioinformatics 22, 2688–2690 (2006).684 URL https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/685 bioinformatics/btl446.686 [42] Nguyen, L.-T., Schmidt, H. A., von Haeseler, A. & Minh, B. Q. IQ-TREE: A Fast687 and Effective Stochastic Algorithm for Estimating Maximum-Likelihood Phylogenies.688 Molecular Biology and Evolution 32, 268–274 (2015). URLhttps://academic.oup.com/689 mbe/article-lookup/doi/10.1093/molbev/msu300.690 [43] Caporaso, J. G. et al. QIIME allows analysis of high-throughput community sequencing data.691 Nature Methods 7, 335–336 (2010). URL http://www.nature.com/articles/nmeth.f.692 303.693 [44] Quast, C. et al. The SILV A ribosomal RNA gene database project: im-694 proved data processing and web-based tools. Nucleic Acids Research 41, D590–695 D596 (2012). URL http://academic.oup.com/nar/article/41/D1/D590/1069277/696 The-SILVA-ribosomal-RNA-gene-database-project .697 [45] Yilmaz, P. et al. The SILV A and “all-species living tree project (LTP)” taxonomic frame-698 32 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint works. Nucleic Acids Research 42, D643–D648 (2013). URLhttps://doi.org/10.1093/699 nar/gkt1209.700 [46] McDonald, D. et al. An improved Greengenes taxonomy with explicit ranks for ecological701 and evolutionary analyses of bacteria and archaea. The ISME Journal 6, 610–618 (2012).702 URL http://www.nature.com/articles/ismej2011139.703 [47] Tuddenham, S. A. et al. The impact of human immunodeficiency virus infection on gut704 microbiota α-diversity: An individual-level meta-analysis. Clinical Infectious Diseases 70705 (2020).706 [48] Wang, Z., Mao, J. & Ma, L. Microbiome compositional analysis with logistic-tree normal707 models. arXiv preprint arXiv:2106.15051 (2021).708 [49] Pasolli, E. et al. Accessible, curated metagenomic data through ExperimentHub.709 Nat. Methods 14, 1023–1024 (2017).710 [50] Ahn, J. et al. Human gut microbiome and risk for colorectal cancer.711 JNCI: Journal of the National Cancer Institute 105, 1907–1911 (2013). URL712 http://dx.doi.org/10.1093/jnci/djt300.713 [51] Plantinga, A. et al. Mirkat-s: a community-level test of association between the microbiota714 and survival times. Microbiome 5, 17 (2017).715 33 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted November 27, 2024. ; https://doi.org/10.1101/2024.03.10.584315doi: bioRxiv preprint

Text is read by the "Ask this paper" AI Q&A widget below. Extraction quality varies by source — PMC NXML preserves structure cleanly, OA-HTML may include some navigation residue, and OA-PDF can have broken hyphenation. The publisher copy (via DOI) is the canonical version.

My notes (saved in your browser only)

⚙ Ask this paper AI returns verbatim quotes from the full text · source: oa-pdf ⓘ

Answers must be backed by verbatim quotes from this paper's full text. Hallucinated quotes are dropped automatically; if no verbatim passage answers the question, we say so. How this works

Citation neighborhood (no data yet)

We don't have any in-corpus citations linked to this paper yet. This is a recent paper (2024) — citers typically take a year or two to land, and the OpenAlex reference graph may still be filling in.

Source provenance

europepmc: last seen: 2026-05-20T01:45:00.602351+00:00
unpaywall: last seen: 2026-06-15T06:18:04.506796+00:00