A conceptual framework for revealing minor bacterial signals in microbiome data through guided data transformation

2025 · doi:10.1101/2025.05.31.656121

preprint OA: closed CC-BY-NC-4.0

Methods section and Figure 5. For each condition, we assessed (i) the clustering performance, by evaluating the association between the clusterings of transformed data and host health, and (ii) the predictive performance of machine learning models trained to predict host health status from microbiome features.

Before applying the proposed methodology, we have to check that the simulated data have a structure similar to the one observed in the reference dataset. In particular, the simulated data should present three distinct enterotypes, each driven by species predominantly from the genus Prevotella or Bacteroides. As anticipated from our simulation design, samples assigned to the enterotype dominated by Prevotella species were those most frequently associated with Inflammatory Bowel Disease (IBD). This is confirmed by a statistical test in which the null hypothesis ($H_0$), "the composition of the enterotype is independent of the simulated host health status", is rejected with a p-value < 0.001 (see Fig. 6).

bioRxiv preprint, this version posted May 31, 2025; doi: https://doi.org/10.1101/2025.05.31.656121. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license.

2.7 Data transformations facilitate the formation of clusters associated with host health by focusing on minor bacterial signals

For each subset of experimental conditions and type of data, we assessed clustering performance. A confusion matrix is constructed between the host health status and the clusters. A $\chi^2$ test is then applied to this matrix to evaluate the degree of association. The null hypothesis ($H_0$) states independence between clustering and host health.
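The association test described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `clustering_association`, the toy labels, and the choice of the natural logarithm for the log-transformed p-value are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def clustering_association(clusters, health):
    """Chi-square test of independence between cluster labels and host health."""
    # Confusion (contingency) matrix: rows are clusters, columns are health status.
    table = pd.crosstab(pd.Series(clusters, name="cluster"),
                        pd.Series(health, name="health"))
    chi2, p_value, dof, expected = chi2_contingency(table)
    # Assumption: natural logarithm; the floor avoids log(0) if p underflows.
    log_p = float(np.log(max(p_value, np.finfo(float).tiny)))
    return table, p_value, log_p


# Hypothetical toy labels: three clusters of 50 samples, the first enriched in cases.
clusters = np.repeat([0, 1, 2], 50)
health = np.array([1] * 40 + [0] * 10 + [1] * 10 + [0] * 40 + [1] * 10 + [0] * 40)
table, p_value, log_p = clustering_association(clusters, health)
print(table)
print(f"p = {p_value:.3g}, log p = {log_p:.1f}")
```

A more negative `log_p` corresponds to a stronger departure from independence between the clustering and host health.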
Here, we show that the clustering performance derived from transformed data remains effective in detecting host health in every experimental scenario, especially when the influence of dominant bacterial signals on $Y_{sim}$ is high (Figure 2.A). Furthermore, our results demonstrate that clustering applied to transformed data is more influenced by minor bacterial signals than clustering applied to relative abundance data (Figure 2.B). Data transformations appear essential to capture minor bacterial signals in microbiome data. Consequently, the unsupervised analysis of the various transformed data takes the presence of minor bacterial signals into account, leading to a more accurate insight into the simulated host health ($Y_{sim}$). We validate that, when the dominant signal has a strong effect on host health, the unsupervised analysis of transformed data remains efficient by focusing on minor bacterial signals. However, the guided data transformation does not lead to a higher clustering performance than other data transformations.

2.8 The guided data transformation improves predictive performance in an $n \ll p$ problem by selecting minor bacterial signals

The scenario and the methods used in this section are the same as in the previous section. For each subset of experimental conditions, a Random Forest (RF) model was trained on a training set of varying size ($n_{train} \in \{75, 150, 300\}$) and evaluated on a fixed test set ($n_{test} = 100$). Predictive performance was assessed using the area under the receiver operating characteristic curve (AUC-ROC). Additionally, we examined the distribution of the 20 most important bacterial species contributing to the prediction of the host health variable. Our results demonstrate that training RF models on guided transformed data and presence/absence data improves predictive accuracy in settings where the curse of dimensionality is more severe ($n_{train} = 75$, Figure 2.C).
Notably, only RF models trained on guided transformed data consistently outperform those trained on relative abundance data when dominant bacterial signals exert a strong influence on host health ($n_{train} = 75$, Figure 2.C). These findings support the hypothesis that reducing the total information improves predictive performance in high-dimensional, small-sample datasets. In addition, models trained on guided transformed data are primarily influenced by minor bacterial signals (Figure 2.D). RF models trained on logarithmically transformed data show predictive performance comparable to that of models trained on other transformed datasets, but do not outperform those using relative abundance data (Figure 2.C). In addition, these models rely predominantly on dominant bacterial signals, even though host health was simulated from minor bacterial signals (Figure 2.D). Consequently, the interpretability of RF models trained on log-transformed data appears limited. Importantly, the benefit of data transformation diminishes as the training sample size increases (Figure 2.C).
[Figure 2 panels: (A) Clustering performance; (B) Species driving the clustering; (C) Predictive performance; (D) Species driving the prediction. Axes: p value (log scale), mean (log scale) of the signal driving the clustering, and AUC-ROC, each shown for the Low, Medium and High scenarios and for n_sim = 75, 150, 300. Legend: Rel. Ab., Abs./Pres., CLR, Unifrac, Guided Transf.]

Figure 2: Guided transformation of microbiome data enhances clustering performance and prediction accuracy in high-dimensional, low-sample-size settings ($n \ll p$). Three simulation scenarios are considered, where the influence of dominant bacterial signals on the simulated host health outcome ($Y_{sim}$) is low, medium, or high. These scenarios are evaluated across different sample sizes: $n_{sim} \in \{75, 150, 300\}$. (A) Clustering was performed for each scenario. Performance was evaluated using a $\chi^2$ test on the confusion matrix between cluster assignments and $Y_{sim}$. Distributions of log-transformed p-values across 50 simulated datasets are shown. The dashed line indicates the log(0.05) threshold, below which the clustering is significantly associated with host health. (B) The bacterial species driving the clustering were identified using Random Forest models. The distribution of the mean abundances (log-transformed) of the top 20 most important species is shown across simulations. Statistical comparisons are omitted, as the patterns are visually distinct. (C) Predictive performance was assessed using the AUC-ROC on test sets. Letters denote statistically significant differences (p < 0.05) between the relative abundance data and: the centered log-ratio (CLR) transformation (b), the absence/presence transformation (c), and the guided transformation (d). A prime symbol (′) indicates additional significant differences between CLR and other transformations. (D) The distributions of the top 20 most important species for predicting host health are shown. As in (B), statistical marks are omitted because the trends are visually discernible. The guided transformation has been applied to the absence/presence matrix.

In larger datasets, RF models trained on relative abundance data increasingly leverage low-abundance species, indicating that minor bacterial signals are essential for accurate prediction of host health. In summary, our results show that data transformations (particularly the guided transformation) improve predictive performance. These transformations achieve this by focusing the analysis on biologically relevant minor bacterial signals. Building on these findings, we now evaluate the guided transformation on real datasets. The first application addresses a high-dimensional ($n \ll p$) predictive problem, while the second illustrates how the guided transformation can reveal enterotypes as potential confounding factors.
2.9 The decrease of the total amount of information enhances predictive performance of fat oxidation in humans

In this first application of our guided data transformation, we aim to develop a predictive model that classifies fat oxidation (FO) rates in humans, i.e., determines whether individuals exceed or fall below the threshold of 0.4 g/min, using gut microbiota composition as a predictor (25). The 0.4 g/min threshold for FO was selected because it has previously been identified as an effective cut-off to discriminate between populations with poor versus normal metabolic flexibility (26, 27). A detailed description of the dataset is provided in the Materials section (28).

An initial clustering analysis based on relative abundance data revealed two distinct enterotypes: one associated with the genus Bacteroides and a second with species belonging to the Prevotella genus, represented respectively by pink and yellow circles in Figure 3.A. A $\chi^2$ test, whose null hypothesis states that FO is independent of the clustering, yielded a p-value of approximately 0.006 (Figure 3.B). This indicates a significant relationship between dominant microbial signals (i.e., enterotype-defining species) and fat oxidation. We confirm that the most influential species in the clustering are those with the highest abundance and variance (Figure 3.C), including Prevotella copri and Bacteroides uniformis.

To investigate this further, we compared the predictive performance of Random Forest (RF) models trained on relative abundance data with those trained on various transformed datasets, including guided transformed data. The guided transformation was applied to each type of data as defined in Equation (1). RF models were trained using 80% of the samples, and predictive performance was evaluated on the remaining 20%. This procedure was repeated twenty times to estimate the distribution of AUC-ROC values. RF models trained on relative abundance and presence/absence data yielded mean AUC-ROC values of approximately 0.65 and 0.66, respectively. Applying the guided transformation to either of these data types significantly improved the AUC-ROC by an average of 0.05 points (p < 0.05, Figure 3.D). These results support our hypothesis that reducing the overall information can enhance the predictive accuracy of supervised algorithms, especially in small-sample, high-dimensional contexts.

Furthermore, we assessed the interpretability of the model by computing the coefficient of determination ($R^2$) between species abundance and their importance in RF models. We found that models trained on guided transformed data relied more on minor bacterial signals than models trained on other data (Figure 3.E). These findings suggest that incorporating minor bacterial species into the models could lead to improved predictive performance. In conclusion, these findings provide evidence that, in $n \ll p$ scenarios, dominant bacterial signals can obscure more informative minor bacterial signals. Guided transformation effectively mitigates this issue, leading to both improved predictive performance and interpretability in predicting human fat oxidation based on gut microbiota composition.

[Figure 3 panels: (A) to (E). Legend: type of data (original data, guided transformed).]

Figure 3: (A) Principal Coordinates Analysis (PCoA) with 80% confidence ellipses for each cluster (colored circles). (B) Distribution of host health status across clusters. The p-value of the $\chi^2$ test assessing the association between clustering and host health is reported. A bar plot showing the microbiome composition at the genus level for each cluster is also presented. (C) Abundance-variance relationship at the species level, illustrating the influence of each species on the clustering. Circle size reflects the species' contribution to the clustering. The 10 most influential species (i.e., the largest circles) are highlighted along with their corresponding mean decrease in Gini index. (D) Predictive performance of Random Forest models across different transformation strategies, with and without guided transformation. (E) Strength of the linear relationship (expressed as $R^2$) between species abundance and feature importance in each Random Forest model. The significance marker indicates statistically significant differences between the original data and its guided transformation. The guided transformation has been applied to all types of data.

2.10 The dominant bacterial signals appear to act as a confounding factor in predictive models of Ulcerative Colitis

In this second application of our guided data transformation, we aim to develop a predictive model that classifies host health status, specifically whether individuals have ulcerative colitis (UC) or not, using gut microbiota composition as a predictor. The dataset combines 12 public datasets. All gut microbiomes were curated in a previous study (29), resulting in a unified dataset of 1328 samples including 925 bacterial species.

We replicated the findings of Wu et al. (2024) (29), confirming the presence of three distinct enterotypes within the dataset (Figure 4.A-B). The Bacteroides enterotype (ET-B), depicted in red, comprises a comparable number of healthy and UC samples. In contrast, the Clostridium enterotype (ET-C), represented in yellow, consists almost exclusively of UC samples. The third enterotype is characterized by a diverse assemblage of genera, predominantly Blautia and Faecalibacterium, and is primarily associated with healthy individuals.
Our results confirm that clustering based on the Bray-Curtis dissimilarity is primarily driven by species with high abundance and variance (Figure 4.C).

We next compared the predictive performance of Random Forest (RF) models trained on relative abundance data and on the guided-transformed data derived from it. Each RF model was trained on 80% of the dataset, and predictions were generated on the remaining 20%. This process was repeated 20 times to assess performance stability through the distribution of AUC-ROC scores. RF models trained on relative abundance data achieved a mean AUC-ROC of approximately 0.98. Notably, models trained on guided transformed data achieved the same performance (Figures 4.D-E), indicating that removing dominant bacterial signals did not affect predictive power. However, a comparison of the feature importance profiles revealed that the guided transformation alters the interpretability of the model. Indeed, when trained on guided transformed data, RF models depend more on minor bacterial signals (Figure 4.F). This observation is supported by lower values of $R^2$ between species abundance and feature importance, indicating a shift away from high-abundance species (Figure 4.F).

The approach implemented here is closely related to out-of-sample deconfounding, providing empirical evidence that enterotype-related information may not be essential for predicting host health and can act as a confounding factor (30). The guided transformation instead allows the model to focus on minor bacterial signals that may carry greater relevance to the phenotype of interest. Interestingly, while Wu et al. reported that Ruminococcus gnavus is more abundant in subjects with UC associated with ET-B and ET-C, our analysis suggests the presence of two distinct microbial networks that are not directly related to the pathology of UC but may play protective roles (Figure 4.G).
[Figure 4 panels: (A) PCoA, axes PCo1 (2%) and PCo2 (2%), samples labeled Healthy or UC; reported log p values: −393 (***) and −4 (*). Healthy / UC counts per enterotype: 294 / 278, 2 / 260, 411 / 83. (B) Gut microbiota composition (0.0 to 1.0) for the genera Alistipes, Bacteroides, Bifidobacterium, Blautia, Clostridium, Faecalibacterium, Faecalicatena, Fusobacterium, Phocaeicola, Pseudomonas, Roseburia, Ruminococcus, Streptococcus, and Others. (C) Species influences on enterotype determination (Low, Medium, High), axes: species means (log scale) and species variance (log scale). (G) Network of the 20 most important features derived from RF; labeled nodes: UC, Blautia faecis, Colidextribacter massiliensis, Coprococcus catus, Oscillibacter valericigenes, Blautia obeum, Lawsonibacter asaccharolyticus, Papillibacter cinnamivorans, Aminipila butyrica, Anaerobutyricum hallii, Ruminococcus callidus, Gemmiger formicilis, Anaerobacterium chartisolvens, Adlercreutzia equolifaciens, Roseburia faecis, Coprococcus comes, Eubacterium xylanophilum, Dorea longicatena.]

Figure 4: (A) Principal Coordinates Analysis (PCoA) with 80% confidence ellipses (CEs) for each cluster, represented by colored circles. (B) Bar plot of gut microbiota composition at the genus level, stratified by cluster. Only the 13 most abundant genera are shown. The yellow cluster is associated with the Clostridium enterotype, while the red cluster corresponds to the Bacteroides enterotype. (C) Abundance-variance relationship at the species level, illustrating the influence of each species on the clustering. Circle size reflects the species' contribution to the clustering. (D) Predictive performance (with 95% confidence intervals) compared between relative abundance and guided-transformed datasets. (E) Distribution of predictive performance across 20 cross-validation runs. (F) Distribution of $R^2$ values across 20 cross-validation runs, indicating the strength of the relationship between species importance in the RF model and species abundance. (G) Co-occurrence network between the 20 most important bacterial species (as determined by the RF model trained on guided-transformed data) and host health status. Co-occurrences were evaluated using the chi-square ($\chi^2$) test; edge width is proportional to significance (lower p-values), with blue edges indicating positive associations and red edges indicating negative associations. The guided transformation has been applied to the relative abundance data.

The first group, located upstream of the UC node, includes Ruminococcus callidus, a species more abundant in healthy individuals. Although phylogenetically close to R. gnavus, R. callidus exhibits a negative correlation in abundance with R. gnavus, suggesting competitive exclusion and a possible protective effect. Notably, R. callidus has been suggested as a key bacterium that may protect against inflammatory bowel disease (31). In addition, as a fiber-degrading species, it belongs to the same group of beneficial microbes (including Faecalibacterium prausnitzii) that are reduced in children with pre-diabetes (32). The second group, located downstream of the UC node, includes species consistently selected by RF models trained on relative abundance data.
All samples expressing this group of species belong to healthy individuals, with no UC cases observed, further supporting a potential protective role (data not shown).

In summary, our results reinforce the notion that dominant bacterial signals can obscure minor but biologically relevant species. The application of guided data transformation not only preserves predictive performance but also enhances interpretability, offering novel insights into the complex interplay between microbial composition and host health.

3 METHODS

3.1 Study Populations

The present study used three distinct gut microbiome datasets to ensure a comprehensive analysis. The first dataset served as the reference dataset used in the demonstration and simulation sections (2.1 to 2.8). The remaining two datasets were used in the application sections (2.9 and 2.10) to illustrate the impact, effectiveness, and robustness of the proposed method.

3.1.1 Inflammatory Bowel Disease (IBD) [reference dataset]

The reference dataset used to observe the problems is registered under BioProject PRJEB1220 (18). It consists of 396 samples, each characterized by the abundances of 606 microbial species and an associated binary host health variable, where 0 corresponds to a healthy sample (HLT) and 1 corresponds to a sample presenting an inflammatory bowel disease (IBD).

3.1.2 Fat oxidation (FO) and insulin sensitivity [dataset n°2]

The second dataset is a small cohort (n = 50 and p = 248) in which the binary host health variable is derived from the maximal fat oxidation (FO) (25). FO refers to the ability to oxidize fat during submaximal exercise in a fasted state, measured in g/min. The 0.4 g/min threshold for FO was selected because it has previously been identified as an effective cut-off to discriminate between populations with poor versus normal metabolic flexibility (26, 27).
In this case, 0 corresponds to a healthy sample with an FO higher than 0.4 g/min, and 1 corresponds to a sample with an FO lower than 0.4 g/min.

3.1.3 Ulcerative colitis (UC) [dataset n°3]

The third dataset combines 12 publicly available datasets. All gut microbiomes were curated in a previous study and are available on GitHub: https://github.com/WXG920713/Gut-microbes (29). The authors obtained gut microbiome metagenomic data from three sources (GMrepo, the European Bioinformatics Institute (EMBL-EBI), and Google Scholar) using specific search criteria related to ulcerative colitis (UC) and 16S rRNA gene sequencing. The resulting dataset includes 1328 samples covering 925 species. In this dataset, each sample is labeled with a binary host health variable, where 0 indicates a healthy individual (HLT) and 1 corresponds to a case of ulcerative colitis (UC).

3.2 Data transformation

3.2.1 Relative abundance data.

The non-transformed gut microbiome data are compositional data. Compositional data are restricted to a simplex ($S^D$), where each observation contains $D$ parts of nonnegative numbers whose sum is 1. In this case, for each observation, the variables are limited to the interval [0, 1] (33).

$$S^D = \left\{ x = [x_1, x_2, \ldots, x_D] \;\middle|\; x_i \ge 0,\ i = 1, 2, \ldots, D;\ \sum_{i=1}^{D} x_i = 1 \right\}$$

3.2.2 Absence/presence data.

The gut microbiome data are transformed into an absence/presence matrix, where 0 refers to the absence of a species in the sample and 1 refers to its presence.

3.2.3 Log transformed data.
The log-ratio transformations eliminate the non-negativity constraint of compositional data and establish a one-to-one mapping onto real space, allowing researchers to use standard multivariate methods (34). In this paper, we focus specifically on the centered log-ratio (CLR) transformation due to its relevance in subsequent analyses. The CLR transformation uses the logarithm of the ratio of each component over the geometric mean of all components (33):

$$\mathrm{CLR}(x) = \left( \ln \frac{x_1}{g(x)}, \ldots, \ln \frac{x_D}{g(x)} \right)$$

where $g(x)$ is the geometric mean of the vector $x$, defined as:

$$g(x) = \left( \prod_{i=1}^{D} x_i \right)^{1/D}$$

This transformation preserves the relative structure of the data while enabling robust statistical analysis.

3.3 Clustering

The clustering involves the following steps:
• The pairwise (dis)similarity or distance matrix, denoted $D$, is computed.
• Hierarchical Agglomerative Clustering (HAC) is applied to $D$.
• The clustering is then obtained by applying the cutree function in R to the dendrogram derived from the HAC.
• The optimal number of classes is determined by selecting the number that maximizes the silhouette coefficient.

3.3.1 Assessment of clustering performance.

To assess clustering performance, a confusion matrix is constructed between the host health status and the clusters. A $\chi^2$ test is then applied to this matrix to evaluate the degree of association. The null hypothesis ($H_0$) states independence between clustering and host health. The resulting p-values are log-transformed for interpretability.
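As an illustration of Sections 3.2.3 and 3.3, the following Python sketch implements a CLR transformation and a silhouette-guided cut of a hierarchical clustering. It is not the authors' R implementation. The pseudocount used to handle zeros, the average linkage, the upper bound `k_max`, and the toy Dirichlet data are all assumptions; SciPy's `fcluster` with `criterion="maxclust"` stands in for R's cutree.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics import silhouette_score


def clr(X, pseudocount=1e-6):
    """Centered log-ratio: ln(x_i / g(x)) = ln(x_i) - mean_k ln(x_k), row by row."""
    # Assumption: a small pseudocount replaces zeros, which have no logarithm.
    log_x = np.log(np.asarray(X, dtype=float) + pseudocount)
    return log_x - log_x.mean(axis=1, keepdims=True)


def hac_partition(X, metric="braycurtis", method="average", k_max=6):
    """HAC on a pairwise distance matrix; keep the cut with the best silhouette."""
    condensed = pdist(X, metric=metric)       # pairwise dissimilarities
    tree = linkage(condensed, method=method)  # agglomerative clustering
    D = squareform(condensed)                 # square distance matrix
    best_k, best_labels, best_score = None, None, -np.inf
    for k in range(2, k_max + 1):
        # Cut the dendrogram into at most k clusters.
        labels = fcluster(tree, t=k, criterion="maxclust")
        if np.unique(labels).size < 2:
            continue
        score = silhouette_score(D, labels, metric="precomputed")
        if score > best_score:
            best_k, best_labels, best_score = k, labels, score
    return best_k, best_labels, best_score


# Hypothetical toy compositions: two groups dominated by different species.
rng = np.random.default_rng(0)
X = np.vstack([rng.dirichlet([50, 1, 1, 1, 1], size=30),
               rng.dirichlet([1, 50, 1, 1, 1], size=30)])
k_rel, labels_rel, _ = hac_partition(X)                           # relative abundance
k_clr, labels_clr, _ = hac_partition(clr(X), metric="euclidean")  # CLR data
print(k_rel, k_clr)
```

On relative abundance data the Bray-Curtis clustering is dominated by the two abundant species, whereas the Euclidean distance on CLR data gives more weight to low-abundance components, so the two partitions need not agree.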
More negative log-transformed p-values indicate stronger statistical dependence between clusters and host health, thus reflecting improved clustering performance.

3.3.2 Assessment of the species driving the clustering.

To assess the contribution of minor versus dominant bacterial signals to the clustering, a Random Forest (RF) model is trained to predict cluster assignments. The mean abundances of the top 20 most important species, as determined by the RF model, are log-transformed. Lower values indicate that the clustering is predominantly driven by low-abundance bacterial species. Additionally, the relationship between species abundance and their importance in the RF model is quantified by calculating the coefficient of determination ($R^2$). A high $R^2$ value suggests that species with high abundance and variance exert a dominant influence on the clustering outcome, whereas a low $R^2$ indicates a greater role for minor bacterial signals.

3.4 Predictive classification

3.4.1 Assessment of predictive performance

The datasets under study are randomly partitioned into training and test sets. Machine learning models are fitted on the training data, and their predictive performance is evaluated on the corresponding test sets. Performance is assessed using the Area Under the Receiver Operating Characteristic Curve (AUC-ROC), averaged over 20 cross-validation iterations. Higher AUC-ROC values indicate superior classification performance in distinguishing host health status.

3.4.2 Assessment of the species driving the classification

To evaluate the contribution of minor bacterial signals to the prediction, the mean abundances of the 20 most important species (identified via feature importance from the Random Forest (RF) model) are log-transformed. Lower log-transformed mean values indicate that the model's predictions are primarily based on low-abundance (i.e., minor) bacterial species.
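The evaluation loop described in Sections 3.3.2 to 3.4.2 can be sketched in Python with scikit-learn. This is a minimal illustration, not the authors' pipeline: the function name `evaluate_rf`, the number of trees, the stratified splits, the use of raw (not log) mean abundance in the $R^2$ computation, and the toy data are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def evaluate_rf(X, y, n_repeats=20, test_size=0.2, top=20, seed=0):
    """Repeated train/test splits: AUC-ROC, abundance of top species, and R^2."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    mean_abund = X.mean(axis=0)
    aucs, top_log_means, r2s = [], [], []
    for r in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, stratify=y, random_state=seed + r)
        rf = RandomForestClassifier(n_estimators=200, random_state=seed + r)
        rf.fit(X_tr, y_tr)
        aucs.append(roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1]))
        importance = rf.feature_importances_
        top_idx = np.argsort(importance)[::-1][:top]
        # Log-transformed mean abundances of the top species, summarized
        # here by their average (an assumption made for brevity).
        top_log_means.append(np.log(mean_abund[top_idx] + 1e-12).mean())
        # R^2 of a simple linear fit equals the squared Pearson correlation.
        r2s.append(np.corrcoef(mean_abund, importance)[0, 1] ** 2)
    return np.array(aucs), np.array(top_log_means), np.array(r2s)


# Hypothetical toy data: host health depends on one low-abundance species.
rng = np.random.default_rng(1)
alpha = np.array([40.0, 25.0] + [1.0] * 28)
X = rng.dirichlet(alpha, size=150)
y = (X[:, 10] > np.median(X[:, 10])).astype(int)
aucs, top_log_means, r2s = evaluate_rf(X, y, n_repeats=5, top=5)
print(aucs.mean(), top_log_means.mean(), r2s.mean())
```

In this toy setting the outcome is driven by a minor species, so a low $R^2$ between abundance and importance is the expected pattern.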
In addition, the relationship between species abundance and their predictive importance is quantified by computing the coefficient of determination ($R^2$). A high $R^2$ value suggests that species with high abundance and variance predominantly influence the model's predictions, whereas a lower $R^2$ indicates a greater contribution from minor signals.

3.5 Numerical experiments and simulations

The Inflammatory Bowel Disease dataset is used to simulate the experimental data. The simulated host health ($Y_{sim}$) is computed from both dominant and minor bacterial signals. We propose three scenarios in which the dependence of $Y_{sim}$ on dominant bacterial signals is categorized as low, medium, or high. These scenarios are tested in different dimensions, where the number of samples varies ($n_{sim} \in \{75, 150, 300\}$). Fifty datasets were simulated for each condition.

3.5.1 Simulation of the microbiome data

We propose a new method to simulate microbiome data with statistical characteristics similar to those of a reference dataset (especially the (dis)similarity between species, which can be related to their interaction). Our method, described below, is inspired by the MIDASim algorithm (35). MIDASim is a non-parametric approach that generally outperforms model-based simulation algorithms. However, in MIDASim, the size of the simulated sample is constrained to be the same as the size of the reference dataset. In our approach, we relax this constraint. Our algorithm has been implemented in the function Simulator of the Python module BiomeSampler, available at: https://github.com/pierrehouedry/BiomeSampler.

Let us denote by $X = (X_{ij}) \in \mathbb{R}^{n \times p}$ our given microbiome dataset and by $X^{sim}$ the simulated dataset. To simulate new data, we start by generating a binary sample that represents the absence or presence of the species in our dataset. This is done by first defining the absence-presence matrix $P(X) = (P_{ij}) \in \mathbb{R}^{n \times p}$, where each entry $P_{ij}$ is given by:

$$P_{ij} = \begin{cases} 1 & \text{if } X_{ij} \neq 0, \\ 0 & \text{otherwise.} \end{cases}$$

Once we have this binary matrix, we estimate its correlation matrix $\hat{\Sigma} \in \mathbb{R}^{p \times p}$. Given the numerical errors that may arise in the calculation, we need to ensure that $\hat{\Sigma}$ is positive semi-definite to make it suitable for sampling from a multivariate normal distribution. To achieve this, we use Higham's algorithm, which finds the closest positive semi-definite matrix (in the Frobenius norm sense), $H(\hat{\Sigma})$. Next, we simulate the latent absence-presence data $Z$ by sampling from the multivariate normal distribution:

$$Z = (Z_i)_{1 \le i \le n'} \quad \text{with} \quad Z_i \sim \mathcal{N}\left(0, H(\hat{\Sigma})\right),$$

where $n'$ is the desired number of samples.
[Figure 5. Panel titles: Simulation, Clustering performance, Predictive performance. Clustering uses the Bray-Curtis metric.]

Figure 5: Design of numerical experiments. (A) The reference dataset ($X$) is used to simulate an experimental dataset ($X_{\mathrm{sim}}$). A predictive model is trained on minor bacterial species to estimate host health ($Y$) in the reference dataset. The probability of belonging to the IBD group ($p_2$) is then computed based on this model. Clustering is performed on $X_{\mathrm{sim}}$ to infer enterotypes, denoted $\hat{C}$. We consider three scenarios reflecting low, medium, and high dependence between $\hat{C}$ and $Y_{\mathrm{sim}}$, effectively introducing varying levels of noise into the host health signal. Noise was introduced through the following procedure: (i) a Random Forest classifier was trained on the dominant bacterial signals; (ii) this model was then used to estimate the probability of belonging to the Prevotella enterotype, denoted $p_1$; and (iii) a conditional probability was subsequently computed from $p_1$ and $p_2$, as detailed in the Materials and Methods section. These scenarios are evaluated across different sample sizes ($n_{\mathrm{sim}} \in \{75, 150, 300\}$), with 50 datasets simulated for each condition. (B) Clustering performance is assessed under each experimental condition. Alternative clusterings ($\hat{C}'$) are obtained by applying HAC to various data transformations. A $\chi^2$ test is applied to the contingency table between $\hat{C}'$ and $Y_{\mathrm{sim}}$ to quantify their association. This evaluation step is illustrated by a prominent red arrow in the figure. (C) Predictive performance is evaluated for each experimental condition. The dataset $X_{\mathrm{sim}}$ is split into training and testing subsets. Machine learning models are trained on the training data and evaluated on the test data. Predictive accuracy is assessed using the Area Under the Receiver Operating Characteristic Curve (AUC-ROC), based on the comparison between predicted and true host health outcomes ($Y_{\mathrm{pred}}$ vs. $Y_{\mathrm{test}}$). This evaluation step is also represented by a prominent red arrow in the figure.

To maintain a proportion of zeros similar to the original data, we define the simulated absence-presence matrix $\bar{P} = (\bar{P}_{ij}) \in \mathbb{R}^{n' \times p}$, where:

\[ \bar{P}_{ij} = \begin{cases} 1 & \text{if } Z_{ij} > \Phi^{-1}(\pi_j), \\ 0 & \text{otherwise,} \end{cases} \]

where $Z_{ij}$ is the $j$-th coordinate of the $i$-th multivariate normal draw, $\Phi$ denotes the cumulative distribution function of the standard normal distribution $\mathcal{N}(0, 1)$, and the estimated proportion of zeros for each species is calculated as follows:

\[ (\pi_j)_{1 \le j \le p} = \left( \frac{1}{n} \sum_{i,\; X_{ij} = 0} 1 \right)_{1 \le j \le p}. \]

Once the absence-presence matrix is simulated, we move on to simulate the abundance values for the non-zero entries.
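The thresholding step that preserves the per-species proportion of zeros can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the helper names `zero_proportions`, `presence_thresholds`, and `threshold_presence` are hypothetical, and the Gaussian draws are assumed to have standard normal marginals (unit diagonal in the covariance).

```python
from statistics import NormalDist

import numpy as np


def zero_proportions(X):
    """Fraction of samples (rows) in which each species (column) is absent."""
    return (np.asarray(X) == 0).mean(axis=0)


def presence_thresholds(pi):
    """Standard normal quantile of each zero proportion.

    A species that is never absent gets -inf (always present after
    thresholding); a species that is always absent gets +inf.
    """
    std_normal = NormalDist()
    thresholds = []
    for q in pi:
        if q <= 0.0:
            thresholds.append(-np.inf)
        elif q >= 1.0:
            thresholds.append(np.inf)
        else:
            thresholds.append(std_normal.inv_cdf(float(q)))
    return np.array(thresholds)


def threshold_presence(Z, pi):
    """Binary matrix: 1 where the Gaussian draw exceeds the species threshold."""
    return (np.asarray(Z) > presence_thresholds(pi)).astype(int)
```

With standard normal marginals, a draw exceeds the quantile of order $\pi_j$ with probability $1 - \pi_j$, so each simulated species is absent in a fraction of samples close to its zero proportion in the reference data.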
The aim is to simulate according to the empirical law. For each feature $1 \le j \le p$, let $X_j^+$ be the vector containing only the strictly positive values of $(X_{ij})_{1 \le i \le n}$. To perform density estimation, we use a bandwidth chosen adaptively for each feature using Silverman's rule of thumb:

\[ h_j = \frac{0.9 \, \min\left( \sigma_{X_j^+}, \mathrm{IQR}(X_j^+) \right)}{\left( \# X_j^+ \right)^{1/5}}, \]

where $\sigma_{X_j^+}$ and $\mathrm{IQR}(X_j^+)$ represent the standard deviation and interquartile range of $X_j^+$, respectively, and $\# X_j^+$ is its number of elements. We simulate the dataset $\bar{X} = (\bar{X}_{ij}) \in \mathbb{R}^{n' \times p}$ by setting:

\[ \bar{X}_{ij} = \begin{cases} u + h_j z & \text{if } \bar{P}_{ij} \neq 0, \\ 0 & \text{otherwise,} \end{cases} \]

where $\bar{P}_{ij}$ is the corresponding entry of the simulated absence-presence matrix. Here, $z \sim \mathcal{U}([0, 1])$, and $u$ is uniformly chosen from the elements of $X_j^+$. We summarize the method in Algorithm 2 of the Supplementary Materials. The comparison of our method with other existing methods (35, 36, 37) is presented in Figure S3. Moreover, an example of a simulated dataset is presented in Figure 6.

3.5.2 Simulation of the host health

The simulated host health outcome, denoted $Y_{\mathrm{sim}}$, was modeled as a function of both dominant and minor bacterial signals. To investigate the influence of dominant bacterial signals, we constructed three scenarios in which these signals exerted low, medium, and high influence on $Y_{\mathrm{sim}}$. In these scenarios, the contribution of minor bacterial signals decreased proportionally while remaining relevant to the simulated phenotype. Specifically, $Y_{\mathrm{sim}}$ was generated by combining two probabilities: (1) the probability that a sample exhibits a gut microbiota composition characterized by a high ratio of species from the Prevotella genus relative to the Bacteroides genus ($p_1$), and (2) the probability that a sample presents with Inflammatory Bowel Disease (IBD) ($p_2$). Each probability is estimated using a machine learning model trained on the reference dataset: $p_1$ is derived from a model fitted to highly abundant species to predict the first level of clustering (i.e., enterotype), while $p_2$ is obtained from a separate model trained on low-abundance species to predict host health status. Both outputs are expressed as probabilities. Figure 5 and Algorithm 3 give details of the simulation of $Y_{\mathrm{sim}}$. The simulated host health was then defined as:

\[ Y_{\mathrm{sim}} \equiv \left( p_1 w_1 + p_2 w_2 > 0.5 \right), \]

where $w_1 + w_2 = 1$ and $w_1 \in \{0, 0.08, 0.15\}$. A higher value of $w_1$ indicates a greater contribution of the dominant signal ($p_1$) to $Y_{\mathrm{sim}}$.
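The combination of the two probabilities into a simulated health label can be illustrated with a short sketch. It assumes that the definition above is a deterministic threshold at 0.5 on the weighted sum; the function name `simulate_host_health` and the probability values are hypothetical, and in the actual workflow both probabilities would come from the two fitted models.

```python
import numpy as np


def simulate_host_health(p1, p2, w1, threshold=0.5):
    """Label a sample 1 (IBD) when w1 * p1 + (1 - w1) * p2 exceeds the threshold.

    p1: probability of the Prevotella-dominated enterotype (dominant signal).
    p2: probability of IBD predicted from minor species (minor signal).
    w1: weight of the dominant signal; the minor signal gets 1 - w1.
    """
    if not 0.0 <= w1 <= 1.0:
        raise ValueError("w1 must lie in [0, 1]")
    score = w1 * np.asarray(p1, dtype=float) + (1.0 - w1) * np.asarray(p2, dtype=float)
    return (score > threshold).astype(int)


# Hypothetical probabilities for three samples.
p1 = [0.9, 0.2, 0.7]
p2 = [0.4, 0.8, 0.49]

labels_minor_only = simulate_host_health(p1, p2, w1=0.0)    # dominant signal ignored
labels_high_weight = simulate_host_health(p1, p2, w1=0.15)  # strongest dominant weight
```

In this toy example the third sample is labelled healthy when only the minor signal is used and IBD when the dominant signal receives a weight of 0.15, which shows how a larger weight ties the label more closely to the enterotype.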
For example, when the weight of the dominant signal is $w_1 = 0.15$, IBD samples are more likely to be associated with clusters characterized by a high abundance of Prevotella species. Figure 6 confirms that the simulated data follow this design.

[Figure 6 panels: (A) PCoA on relative abundances of the simulated data (axes PCo1 and PCo2), colored by simulated host health (HLT, IBD); enterotypes B-ET, Mix-ET, and P-ET with HLT / IBD counts 45 / 24, 167 / 94, and 27 / 43; p-value = 0.0003. (B) Gut microbiota composition (0 to 100) by genus: Bacteroides, Prevotella, Alistipes, Parabacteroides, Eubacterium, Anaerostipes, Butyrivibrio, Coprococcus, Roseburia, Faecalibacterium, Ruminococcus, Klebsiella, Others. (C) Importance of each species on clustering; axes: species' abundances (log scale) and species' variability (log scale).]

Figure 6: Example of data simulated from the reference dataset with $n_{\mathrm{sim}} = 300$ and $w_1 = 0.15$. (A) Principal Coordinates Analysis (PCoA) with 80% confidence ellipses (CEs) for each cluster, represented by colored circles. (B) Bar plot of gut microbiota composition at the genus level, stratified by cluster. Only the 13 most abundant genera are shown. (C) Abundance-variance relationship at the species level, illustrating the influence of each species on the clustering. Circle size reflects the species' contribution to the clustering. We confirm that the simulation produced a configuration characterized by three distinct enterotypes, each driven by species predominantly from either the Prevotella or Bacteroides genus. As anticipated based on our simulation design, samples assigned to the enterotype dominated by Prevotella species were those most frequently associated with Inflammatory Bowel Disease (IBD). This association yielded a statistically significant result (p-value < 0.001), rejecting the null hypothesis ($H_0$) that enterotype composition is independent of the simulated host health status.
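The association test reported with Figure 6 can be illustrated with a small sketch of a chi-square test of independence on the enterotype by host health table. The counts are those shown in the figure, under the assumption that they are listed in the same order as the enterotype labels (B-ET, Mix-ET, P-ET). The helper names are hypothetical, no continuity correction is applied, and the closed-form p-value used here is valid only for 2 degrees of freedom; the exact test settings used by the authors are not specified in this section.

```python
import math

import numpy as np


def chi2_independence(table):
    """Pearson chi-square statistic and degrees of freedom for a two-way table."""
    observed = np.asarray(table, dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    expected = row_totals @ col_totals / observed.sum()  # counts under independence
    statistic = float(((observed - expected) ** 2 / expected).sum())
    dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
    return statistic, dof


def chi2_sf_two_dof(statistic):
    """Upper tail probability of the chi-square law with exactly 2 degrees of freedom."""
    return math.exp(-statistic / 2.0)


# Rows: B-ET, Mix-ET, P-ET (assumed order); columns: HLT, IBD.
counts = [[45, 24], [167, 94], [27, 43]]
statistic, dof = chi2_independence(counts)
p_value = chi2_sf_two_dof(statistic)  # valid because dof == 2 for a 3 x 2 table
```

For tables of other sizes, a general routine such as `scipy.stats.chi2_contingency` would be the usual choice.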

References and Notes

1. K. Hou, et al., Microbiota in health and diseases. Signal Transduction and Targeted Therapy 7, 1–28 (2022).
2. V. Odintsova, A. Tyakht, D. Alexeev, Guidelines to Statistical Analysis of Microbial Composition Data Inferred from Metagenomic Sequencing. Current Issues in Molecular Biology 24, 17–36 (2017).
3. D. C. Fonseca, et al., Evaluation of gut microbiota predictive potential associated with phenotypic characteristics to identify multifactorial diseases. Gut Microbes 16, 2297815 (2024).
4. C. B. Peterson, S. Saha, K.-A. Do, Analysis of Microbiome Data. Annual Review of Statistics and Its Application 11, 483–504 (2024).
5. M. I. Keller, et al., Refined Enterotyping Reveals Dysbiosis in Global Fecal Metagenomes (2024).
6. Q. Xie, et al., Two-year follow-up of gut microbiota alterations in patients after COVID-19: from the perspective of gut enterotype. Microbiology Spectrum 13, e02774–24 (2025).
7. X. Zhu, et al., A specific enterotype derived from gut microbiome of older individuals enables favorable responses to immune checkpoint blockade therapy. Cell Host & Microbe 32, 489–505.e5 (2024).
8. M. Arumugam, et al., Enterotypes of the human gut microbiome. Nature 473, 174–180 (2011).
9. P. I. Costea, et al., Enterotypes in the landscape of gut microbial community composition. Nature microbiology 3, 8–16 (2018).
10. M. Cheng, K. Ning, Stereotypes About Enterotype: The Old and New Ideas. Genomics, Proteomics & Bioinformatics 17, 4–12 (2019).
11. D. Knights, et al., Rethinking “Enterotypes”. Cell Host & Microbe 16, 433–437 (2014).
12. I. Bulygin, et al., Absence of enterotypes in the human gut microbiomes reanalyzed with non-linear dimensionality reduction methods. PeerJ 11, e15838 (2023).
13. D. I. Warton, S. T. Wright, Y. Wang, Distance-based multivariate analyses confound location and dispersion effects. Methods in Ecology and Evolution 3, 89–101 (2012).
14. F. Wang, et al., Detecting Microbial Dysbiosis Associated with Pediatric Crohn Disease Despite the High Variability of the Gut Microbiota. Cell Reports 14, 945–955 (2016).
15. D. I. Warton, F. K. C. Hui, The central role of mean-variance relationships in the analysis of multivariate abundance data: a response to Roberts (2017). Methods in Ecology and Evolution 8, 1408–1414 (2017).
16. G. Papoutsoglou, et al., Machine learning approaches in microbiome research: challenges and best practices. Frontiers in Microbiology 14 (2023).
17. B. D. Topçuoğlu, N. A. Lesniak, M. T. Ruffin, J. Wiens, P. D. Schloss, A Framework for Effective Application of Machine Learning to Microbiome-Based Classification Problems. mBio 11, 10.1128/mbio.00434–20 (2020).
18. H. B. Nielsen, et al., Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes. Nature Biotechnology 32, 822–828 (2014).
19. C. Lozupone, M. E. Lladser, D. Knights, J. Stombaugh, R. Knight, UniFrac: an effective distance metric for microbial community comparison. The ISME Journal 5, 169–172 (2011).
20. M. Kucera, B. A. Malmgren, Logratio transformation of compositional data: a resolution of the constant sum constraint. Marine Micropaleontology 34, 117–120 (1998).
21. T. P. Quinn, et al., A field guide for the compositional analysis of any-omics data. GigaScience 8, giz107 (2019).
22. S. Aryal, A. Alimadadi, I. Manandhar, B. Joe, X. Cheng, Machine learning strategy for gut microbiome-based diagnostic screening of cardiovascular disease. Hypertension (Dallas, Tex.: 1979) 76, 1555–1562 (2020).
23. D. Fernández-Edreira, J. Liñares-Blanco, C. Fernandez-Lozano, Machine Learning analysis of the human infant gut microbiome identifies influential species in type 1 diabetes. Expert Systems with Applications 185, 115648 (2021).
24. I. Manandhar, et al., Gut microbiome-based supervised machine learning for clinical diagnosis of inflammatory bowel diseases. American Journal of Physiology-Gastrointestinal and Liver Physiology 320, G328–G337 (2021).
25. D. Martin, et al., Atypical gut microbial ecosystem from athletes with very high exercise capacity improves insulin sensitivity and muscle glycogen store in mice. Cell Reports 44 (2025).
26. E. Maunder, D. J. Plews, A. E. Kilding, Contextualising Maximal Fat Oxidation During Exercise: Determinants and Normative Values. Frontiers in Physiology 9, 599 (2018).
27. I. San-Millán, G. A. Brooks, Assessment of Metabolic Flexibility by Means of Measuring Blood Lactate, Fat, and Carbohydrate Oxidation Responses to Exercise in Professional Endurance Athletes and Less-Fit Individuals. Sports Medicine (Auckland, N.Z.) 48, 467–479 (2018).
28. Materials and methods.
29. X. Wu, T. Zhang, T. Zhang, S. Park, The impact of gut microbiome enterotypes on ulcerative colitis: identifying key bacterial species and revealing species co-occurrence networks using machine learning. Gut Microbes 16, 2292254 (2024).
30. D. Chyzhyk, G. Varoquaux, M. Milham, B. Thirion, How to remove or control confounds in predictive models, with applications to brain biomarkers. GigaScience 11, giac014 (2022).
31. S. Kang, et al., Dysbiosis of fecal microbiota in Crohn's disease patients as revealed by a custom phylogenetic microarray. Inflammatory Bowel Diseases 16, 2034–2042 (2010).
32. M. Wijdeveld, et al., Identifying Gut Microbiota associated with Gastrointestinal Symptoms upon Roux-en-Y Gastric Bypass. Obesity Surgery 33, 1635–1645 (2023).
33. M. C. Jones, The Statistical Analysis of Compositional Data. Royal Statistical Society. Journal. Series A: General 150, 396 (1987).
34. V. Pawlowsky-Glahn, J. J. Egozcue, R. Tolosana-Delgado, Modelling and Analysis of Compositional Data (John Wiley & Sons, Ltd) (2015).
35. M. He, N. Zhao, G. A. Satten, MIDASim: a fast and simple simulator for realistic microbiome data. Microbiome 12, 135 (2024).
36. Z. D. Kurtz, et al., Sparse and Compositionally Robust Inference of Microbial Ecological Networks. PLOS Computational Biology 11, e1004226 (2015).
37. J. Chiquet, S. Robin, M. Mariadassou, Variational Inference for sparse network reconstruction from count data, in Proceedings of the 36th International Conference on Machine Learning (PMLR) (2019), pp. 1162–1171.

Acknowledgments

The authors thank the anonymous reviewers for their valuable suggestions.

Funding: This work is supported in part by funds from the University of Rennes and the French National Research Agency within the framework of the PIA France 2030 program for EUR DIGISPORT (ANR-18-EURE-0022) projects.

Author contributions: DM and VM developed the conceptual framework, DM conceived the experiment(s), DM and PH conducted the experiment(s), PH conceived the simulation microbiome algorithm, DM, PH, and VM analyzed the results. DM, PH, FD and VM wrote and reviewed the manuscript.

Competing interests: There are no competing interests to declare.

Data and materials availability: The R scripts and the Python packages used for the simulation and the numerical experiments are available at: https://github.com/pierrehouedry/BiomeSampler. Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.
License: CC-BY-NC-4.0