Methods section and Figure 5.
For each condition, we assessed (i) the clustering performance by evaluating the association between
transformed data clusterings and host health, and (ii) the predictive performance of machine learning models
trained to predict host health status from microbiome features.
Before applying the proposed methodology, we first check that the simulated data have a structure similar
to that observed in the reference dataset. In particular, the simulated data should present three
distinct enterotypes, each driven by species predominantly from the genus Prevotella or Bacteroides. As
anticipated from our simulation design, samples assigned to the enterotype dominated by Prevotella
species were those most frequently associated with Inflammatory Bowel Disease (IBD). This is confirmed by
a statistical test in which the null hypothesis ($H_0$: "the composition of the enterotype is independent of the
simulated host health status") is rejected with a p-value < 0.001 (see Fig. 6).
bioRxiv preprint doi: https://doi.org/10.1101/2025.05.31.656121; this version posted May 31, 2025. The copyright
holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a
license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license.
2.7 Data transformations facilitate the formation of clusters associated with host health by
focusing on minor bacterial signals
For each subset of experimental conditions and type of data, we assessed clustering performance. A confusion
matrix is constructed between the host health status and the clusters. A $\chi^2$ test is then applied to this
matrix to evaluate the degree of association. The null hypothesis ($H_0$) states independence between clustering
and host health.
Here, we show that clustering on transformed data remains effective in detecting host health, whatever the
experimental scenario, especially when the influence of dominant bacterial signals on $Y_{\mathrm{sim}}$ is
high (Figure 2.A). Furthermore, our results demonstrate that clustering applied to transformed data is more
influenced by minor bacterial signals than clustering applied to relative abundance data (Figure 2.B). Data
transformations therefore appear essential to capture minor bacterial signals in microbiome data.
Consequently, the unsupervised analysis of the various transformed data takes minor bacterial signals into
account, leading to a more accurate insight into the simulated host health ($Y_{\mathrm{sim}}$).
We confirm that, when the dominant signal has a strong effect on host health, the unsupervised
analysis of transformed data remains efficient by focusing on minor bacterial signals. However, the
guided data transformation does not lead to a higher clustering performance than other data transformations.
2.8 The guided data transformation improves predictive performance in an $n \ll p$ problem
by selecting minor bacterial signals
The scenario and methods used in this section are the same as in the previous section. For each subset
of experimental conditions, a Random Forest (RF) model was trained on training sets of varying sizes
($n_{\mathrm{train}} \in \{75, 150, 300\}$) and evaluated on a fixed test set ($n_{\mathrm{test}} = 100$).
Predictive performance was assessed using the area under the receiver operating characteristic curve (AUC-ROC).
Additionally, we examined the distribution of the 20 most important bacterial species contributing to the
prediction of the host health variable.
Our results demonstrate that training RF models on guided transformed data and presence/absence data
improves predictive accuracy in settings where the curse of dimensionality is more severe
($n_{\mathrm{train}} = 75$, Figure 2.C). Notably, only RF models trained on guided transformed data
consistently outperform those trained on relative abundance data when dominant bacterial signals exert a
strong influence on host health ($n_{\mathrm{train}} = 75$, Figure 2.C). These findings support the hypothesis
that reducing the total information improves predictive performance in high-dimensional small-sample datasets.
In addition, models trained on guided transformed data are primarily influenced by minor bacterial signals
(Figure 2.D).
RF algorithms trained on log-transformed data show predictive performance comparable to
those trained on other transformed datasets, but do not outperform those using relative abundance data
(Figure 2.C). In addition, these models rely predominantly on dominant bacterial signals, even though
host health was simulated based on minor bacterial signals (Figure 2.D). Consequently, the interpretability
of RF algorithms trained on log-transformed data appears limited.
Importantly, the benefit of data transformation diminishes as training sample size increases (Figure 2.C).
[Figure 2 panels: (A) Clustering performance, y-axis: p value (log scale); (B) Species driving the clustering,
y-axis: mean (log scale) of the signal driving the clustering; (C) Predictive performance, y-axis: AUC-ROC;
(D) Species driving the prediction. x-axis in all panels: Low, Medium, High. Rows: n_sim = 300, 150, 75.
Legend: Rel. Ab., Abs./Pres., CLR, Unifrac, Guided Transf.]
Figure 2: Guided transformation of microbiome data enhances clustering performance and prediction
accuracy in high-dimensional, low-sample-size settings ($n \ll p$). Three simulation scenarios are
considered, where the influence of dominant bacterial signals on the simulated host health outcome
($Y_{\mathrm{sim}}$) is low, medium, or high. These scenarios are evaluated across different sample sizes:
$n_{\mathrm{sim}} \in \{75, 150, 300\}$.
(A) Clustering was performed for each scenario. Performance was evaluated using a $\chi^2$ test on the
confusion matrix between cluster assignments and $Y_{\mathrm{sim}}$. Distributions of log-transformed p-values
across 50 simulated datasets are shown. The dashed line indicates the log(0.05) threshold, below which the
clustering is significantly associated with host health. (B) The bacterial species driving the clustering were
identified using Random Forest models. The distribution of the mean abundances (log-transformed) of the top 20
most important species is shown across simulations. Statistical comparisons are omitted, as the patterns are
visually distinct. (C) Predictive performance was assessed using the AUC-ROC on test sets. Letters denote
statistically significant differences ($p < 0.05$) between the relative abundance data and: centered log-ratio
(CLR) transformation (b), absence/presence transformation (c), and guided transformation (d). A prime
symbol (′) indicates additional significant differences between CLR and other transformations. (D) The
distributions of the top 20 most important species for predicting host health are shown. As in (B), statistical
marks are omitted due to visually discernible trends. The guided transformation has been applied to the
Absence/Presence matrix.
In larger datasets, RF algorithms trained on relative abundance data increasingly leverage low-abundance
species, indicating that minor bacterial signals are essential for accurate prediction of host health.
In summary, our results show that data transformations (particularly guided transformation) improve
predictive performance. These transformations achieve this by focusing the analysis on biologically relevant
minor bacterial signals. Building on these findings, we now evaluate the guided transformation on real
datasets. The first application addresses a high-dimensional ($n \ll p$) predictive problem, while the second
illustrates how guided transformation can reveal enterotypes as potential confounding factors.
2.9 The decrease of the total amount of information enhances predictive performance of fat
oxidation in humans
In this first application of our guided data transformation, we aim to develop a predictive model to classify
fat oxidation (FO) rates in humans, that is, to determine whether individuals exceed or fall below the threshold
of 0.4 g/min, using gut microbiota composition as a predictor (25). The 0.4 g/min threshold for FO was selected
because it has previously been identified as an effective cut-off to discriminate between populations with
poor versus normal metabolic flexibility (26, 27). A detailed description of the dataset is provided in the
Materials section (28).
An initial clustering analysis based on relative abundance data revealed two distinct enterotypes: one
associated with the genus Bacteroides and a second with species belonging to the Prevotella genus,
represented respectively by pink and yellow circles in Figure 3.A. A $\chi^2$ test of the null hypothesis that
FO is independent of the clustering yielded a p-value of approximately 0.006 (Figure 3.B). This indicates a
significant relationship between dominant microbial signals (i.e., enterotype-defining species) and fat
oxidation. We confirm that the most influential species in the clustering are those with the highest abundance
and variance (Figure 3.C), including Prevotella copri and Bacteroides uniformis.
To investigate this further, we compared the predictive performance of Random Forest (RF) models
trained on relative abundance data with those trained on various transformed datasets, including guided
transformed data. The guided transformation was applied to each type of data as defined in Equation (1). RF
models were trained using 80% of the samples, and predictive performance was evaluated on the remaining
20%. This procedure was repeated twenty times to estimate the distribution of AUC-ROC values.
RF models trained on both relative abundance and presence/absence data yielded mean AUC-ROC
values of approximately 0.65 and 0.66, respectively. Applying the guided transformation to either of these
datasets significantly improved the AUC-ROC by an average of 0.05 points ($p < 0.05$, Figure 3.D). These
results support our hypothesis that reducing the overall information can enhance the predictive accuracy of
supervised algorithms, especially in small-sample, high-dimensional contexts.
Furthermore, we assessed the interpretability of the model by computing the coefficient of determination
($R^2$) between species abundance and their importance in RF models. We found that models trained on
guided transformed data relied more on minor bacterial signals compared to other data (Figure 3.E). These
findings suggest that incorporating minor bacterial species into the models could lead to improved predictive
performance.
In conclusion, these findings provide evidence that, in $n \ll p$ scenarios, dominant bacterial signals can
[Figure 3 panels: (A), (B), (C), (D), (E). Legend, type of data: original data; guided transformed.]
Figure 3: (A) Principal Coordinates Analysis (PCoA) with 80% confidence ellipses for each cluster (colored
circles). (B) Distribution of host health status across clusters. The p-value of the $\chi^2$ test assessing the
association between clustering and host health is reported. A bar plot showing the microbiome composition
at the genus level for each cluster is also presented. (C) Abundance-variance relationship at the species
level, illustrating the influence of each species on the clustering. Circle size reflects the species'
contribution to the clustering. The 10 most influential species (i.e., the largest circles) are highlighted along
with their corresponding mean decrease in Gini index. (D) Predictive performance of Random Forest models
across different transformation strategies, with and without guided transformation. (E) Strength of the linear
relationship (expressed as $R^2$) between species abundance and feature importance in each Random Forest
model. Marked comparisons indicate statistically significant differences between the original data and its
guided transformation. The guided transformation has been applied to all types of data.
obscure more informative minor bacterial signals. Guided transformation effectively mitigates this issue,
leading to both improved predictive performance and interpretability in predicting human fat oxidation based
on gut microbiota composition.
2.10 The dominant bacterial signals appear to act as a confounding factor in predictive
models of Ulcerative Colitis
In this second application of our guided data transformation, we aim to develop a predictive model to classify
host health status, specifically whether an individual has ulcerative colitis (UC) or not, using gut microbiota
composition as a predictor. The dataset combines 12 public datasets. All gut microbiomes were curated in a
previous study (29), resulting in a unified dataset of 1328 samples including 925 bacterial species.
We replicated the findings of Wu et al. (2024) (29), confirming the presence of three distinct enterotypes
within the dataset (Figure 4.A-B). The Bacteroides enterotype (ET-B), depicted in red, comprises a
comparable number of healthy and UC samples. In contrast, the Clostridium enterotype (ET-C), represented
in yellow, consists almost exclusively of UC samples. The third enterotype is characterized by a diverse
assemblage of genera, predominantly Blautia and Faecalibacterium, and is primarily associated with healthy
individuals. Our results confirm that clustering based on the Bray-Curtis dissimilarity metric is primarily
driven by species with high abundance and variance (Figure 4.C).
We next compared the predictive performance of Random Forest (RF) models trained on relative abundance
data and on the guided-transformed data derived from it. Each RF model was trained on 80% of the dataset,
and predictions were generated on the remaining 20%. This process was repeated 20 times to assess
performance stability through the distribution of AUC-ROC scores.
RF models trained on relative abundance data achieved a mean AUC-ROC of approximately 0.98.
Notably, models trained on guided transformed data achieved the same performance (Figures 4.D-E),
indicating that removing dominant bacterial signals did not affect predictive power. However, a comparison
of the feature importance profiles revealed that guided transformation alters the interpretability of the
model. Indeed, when trained on guided transformed data, RF models are more dependent on minor bacterial
signals (Figure 4.F). This observation is supported by lower values of $R^2$ between species abundance and
feature importance, indicating a shift away from high-abundance species (Figure 4.F).
The approach implemented here is closely related to out-of-sample deconfounding, providing empirical
evidence that enterotype-related information may not be essential for predicting host health and can act as
a confounding factor ( 30). The guided transformation instead allows the model to focus on minor bacterial
signals that may carry greater relevance to the phenotype of interest.
Interestingly, while Wu et al. reported that Ruminococcus gnavus is more abundant in subjects with UC
associated with ET-B and ET-C, our analysis suggests the presence of two distinct microbial networks that are
not directly related to the pathology of UC, but can play protective roles (Figure 4.G). The first group, located
upstream of the UC node, includes Ruminococcus callidus, a species more abundant in healthy individuals.
Although phylogenetically close to R. gnavus, R. callidus exhibits a negative correlation in abundance with
R. gnavus, suggesting competitive exclusion and a possible protective effect. Notably, R. callidus has been
suggested as a key bacterium that may protect against inflammatory bowel disease (31). In addition, as a
[Figure 4 panels: (A) PCoA, axes PCo1 (2%) and PCo2 (2%), groups Healthy and UC; (B) Gut microbiota composition
by enterotype, with Healthy / UC counts of 294 / 278, 2 / 260, and 411 / 83; (C) Species influences on enterotype
determination (Low, Medium, High), axes: species means (log scale) and species variance (log scale);
(D), (E), (F); (G) Networks of the 20 most important features derived from RF, centered on the UC node.
Annotated log p values: −393 (***) and −4 (*).]
Figure 4: (A) Principal Coordinates Analysis (PCoA) with 80% confidence ellipses (CEs) for each cluster,
represented by colored circles. (B) Bar plot of gut microbiota composition at the genus level, stratified by
cluster. Only the 13 most abundant genera are shown. The yellow cluster is associated with the Clostridium
enterotype, while the red cluster corresponds to the Bacteroides enterotype. (C) Abundance-variance
relationship at the species level, illustrating the influence of each species on the clustering. Circle size
reflects the species' contribution to the clustering. (D) Predictive performance (with 95% confidence intervals)
is compared between relative abundance and guided-transformed datasets. (E) Distribution of predictive
performance across 20 cross-validation runs. (F) Distribution of $R^2$ values across 20 cross-validations,
indicating the strength of the relationship between species importance in the RF model and their abundances.
(G) Co-occurrence network between the 20 most important bacterial species (as determined by the RF model
trained on guided-transformed data) and host health status. Co-occurrences were evaluated using the Chi-square
test ($\chi^2$); edge width is proportional to significance (lower p-values), with blue edges indicating positive
associations and red edges indicating negative associations. The guided transformation has been applied to
the relative abundance data.
fiber-degrading species, it belongs to the same group of beneficial microbes (including Faecalibacterium
prausnitzii) that are reduced in children with pre-diabetes ( 32).
The second group, located downstream of the UC node, includes species consistently selected by RF
models trained on relative abundance data. All samples expressing this group of species belong to healthy
individuals, with no UC cases observed, further supporting a potential protective role (data not shown).
In summary, our results reinforce the notion that dominant bacterial signals can obscure minor but
biologically relevant species. The application of guided data transformation not only preserves predictive
performance but also enhances interpretability, offering novel insights into the complex interplay between
microbial composition and host health.
3 METHODS
3.1 Study Populations
The present study used three distinct gut microbiome datasets to ensure a comprehensive analysis. The
first dataset served as the reference dataset used in the demonstration and simulation sections (from
2.1 to 2.8). The remaining two datasets were used in the application sections (2.9 and 2.10) to illustrate the
impact, effectiveness, and robustness of the proposed method.
3.1.1 Inflammatory Bowel Disease (IBD) [reference dataset]
The reference dataset, used to illustrate the problems addressed here, is registered under BioProject
n°PRJEB1220 (18). It consists of 396 samples, each characterized by the abundances of 606 microbial species and
an associated binary host health variable, where 0 corresponds to a healthy sample (HLT) and 1 corresponds to a
sample presenting an inflammatory bowel disease (IBD).
3.1.2 Fat oxidation (FO) and insulin sensitivity [dataset n°2]
The second dataset is a small cohort ($n = 50$ and $p = 248$) in which the binary host health variable is
derived from the maximal fat oxidation (FO) (25). FO refers to the ability to oxidize fat during submaximal
exercise in a fasted state, measured in g/min. The 0.4 g/min threshold for FO was selected because it has
previously been identified as an effective cut-off to discriminate between populations with poor versus normal
metabolic flexibility (26, 27). In this case, 0 corresponds to a healthy sample presenting an FO higher than
0.4 g/min, and 1 corresponds to a sample presenting an FO lower than 0.4 g/min.
3.1.3 Ulcerative colitis (UC) [dataset n°3]
The third dataset combines 12 publicly available datasets. All gut microbiomes were curated in a
previous study (29) and are provided in the authors' GitHub repository: https://github.com/WXG920713/Gut-microbes.
The authors obtained gut microbiome metagenomic data from three sources: GMrepo, the European Bioinformatics
Institute (EMBL-EBI), and Google Scholar, using specific search criteria related to ulcerative colitis (UC)
and 16S rRNA gene sequencing. The resulting dataset includes 1328 samples and 925 species. In
this dataset, each sample is labeled with a binary host health variable, where 0 indicates a healthy individual
(HLT) and 1 corresponds to a case of ulcerative colitis (UC).
3.2 Data transformation
3.2.1 Relative abundance data.
The non-transformed gut microbiome data are compositional data. Compositional data are
restricted to a simplex ($S^D$), in which each observation consists of nonnegative parts whose sum is 1. In this
case, for each observation, the variables are limited to the interval [0, 1] (33):
$$S^D = \left\{ \mathbf{x} = [x_1, x_2, \ldots, x_D] \;\middle|\; x_i \geq 0,\ i = 1, 2, \ldots, D;\ \sum_{i=1}^{D} x_i = 1 \right\}$$
3.2.2 Absence/presence data.
The gut microbiome data is transformed into the absence/presence matrix, where 0 refers to the absence of
a species in the sample, while 1 refers to the presence of the bacteria.
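As a minimal sketch of the two representations described above, assuming a hypothetical count matrix with samples in rows and species in columns (the function names and the toy data are illustrative and not taken from the study):

```python
import numpy as np

def to_relative_abundance(counts):
    """Close each sample (row) so that its parts sum to 1."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    if np.any(totals == 0):
        raise ValueError("every sample needs at least one non-zero count")
    return counts / totals

def to_presence_absence(data):
    """Return 1 where a species is present (non-zero) and 0 otherwise."""
    return (np.asarray(data) > 0).astype(int)

counts = np.array([[10, 0, 30], [5, 5, 0]])  # toy data, not from the study
rel_ab = to_relative_abundance(counts)
pres_abs = to_presence_absence(counts)
```

Each row of `rel_ab` lies on the simplex, while `pres_abs` keeps only the information about which species were detected.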
3.2.3 Log transformed data.
The log-ratio transformations eliminate the non-negativity constraint of compositional data and establish a
one-to-one mapping onto real space, allowing researchers to use standard multivariate methods (34). In this
paper, we focus specifically on the centered log-ratio (CLR) transformation due to its relevance in subsequent
analyses. The CLR transformation uses the logarithm of the ratio of each component over the geometric
mean of all components (33).
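$$\mathrm{CLR}(\mathbf{x}) = \left( \ln \frac{x_1}{g(\mathbf{x})}, \ldots, \ln \frac{x_D}{g(\mathbf{x})} \right)$$
where $g(\mathbf{x})$ is the geometric mean of the vector $\mathbf{x}$, defined as:
$$g(\mathbf{x}) = \left( \prod_{i=1}^{D} x_i \right)^{1/D}$$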
This transformation preserves the relative structure of the data while enabling robust statistical analysis.
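The CLR transformation can be sketched as follows. The text does not state how zeros are handled, so the pseudocount used here to make the logarithm defined is an assumption of this example, as are the function name and the toy data:

```python
import numpy as np

def clr(x, pseudocount=1e-6):
    """Centered log-ratio transform of each row of a composition matrix.

    The pseudocount is an assumption of this sketch: it replaces exact
    zeros so that the logarithm is defined.
    """
    x = np.asarray(x, dtype=float)
    x = np.where(x == 0, pseudocount, x)
    x = x / x.sum(axis=1, keepdims=True)  # re-close after zero replacement
    log_x = np.log(x)
    # Subtracting the row mean of the logs equals dividing by the geometric mean.
    return log_x - log_x.mean(axis=1, keepdims=True)

composition = np.array([[0.2, 0.3, 0.5], [0.7, 0.0, 0.3]])  # toy data
clr_values = clr(composition)
```

By construction, the CLR coordinates of each sample sum to zero, which is a quick sanity check on any implementation.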
3.3 Clustering
The clustering involves the following steps:
• The pairwise (dis)similarity or distance matrix is computed.
• Hierarchical Agglomerative Clustering (HAC) is applied to this matrix.
• The clustering is then obtained by applying the cutree function in R on the dendrogram derived from
the HAC.
• The optimal number of classes is determined by selecting the number that maximizes the silhouette
coefficient.
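The steps above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the study uses R (`cutree`), whereas the sketch substitutes SciPy's `fcluster`; the average linkage, the `max_k` search bound, and the function name `cluster_samples` are choices made for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics import silhouette_score

def cluster_samples(data, metric="braycurtis", method="average", max_k=6):
    """HAC on a pairwise distance matrix; k chosen by the silhouette coefficient."""
    condensed = pdist(data, metric=metric)    # pairwise dissimilarities
    tree = linkage(condensed, method=method)  # agglomerative clustering
    square = squareform(condensed)
    best_k, best_score, best_labels = None, -np.inf, None
    for k in range(2, max_k + 1):
        # fcluster with criterion="maxclust" plays the role of R's cutree here.
        labels = fcluster(tree, t=k, criterion="maxclust")
        if len(np.unique(labels)) < 2:
            continue
        score = silhouette_score(square, labels, metric="precomputed")
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels

rng = np.random.default_rng(0)
# Toy data: two well-separated groups of compositions (not from the study).
group_a = rng.dirichlet([20, 1, 1, 1], size=15)
group_b = rng.dirichlet([1, 1, 1, 20], size=15)
k, labels = cluster_samples(np.vstack([group_a, group_b]))
```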
3.3.1 Assessment of clustering performance.
To assess clustering performance, a confusion matrix is constructed between the host health status and the
clusters. A $\chi^2$ test is then applied to this matrix to evaluate the degree of association. The null
hypothesis ($H_0$) states independence between clustering and host health. The resulting p-values are
log-transformed for interpretability. More negative log-transformed p-values indicate stronger statistical
dependence between clusters and host health, thus reflecting improved clustering performance.
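A minimal sketch of this assessment, assuming cluster labels and a binary health vector of equal length, is given below. The natural logarithm is an assumption (the text does not state the base), and the function name and toy labels are hypothetical:

```python
import numpy as np
from scipy.stats import chi2_contingency

def clustering_log_pvalue(clusters, health):
    """Log p-value of a chi-square independence test between clusters and health."""
    clusters = np.asarray(clusters)
    health = np.asarray(health)
    cluster_ids = np.unique(clusters)
    health_ids = np.unique(health)
    # Confusion matrix: rows are clusters, columns are health states.
    table = np.array([[np.sum((clusters == c) & (health == h)) for h in health_ids]
                      for c in cluster_ids])
    _, p_value, _, _ = chi2_contingency(table)
    return table, np.log(p_value)

clusters = [1] * 20 + [2] * 20                     # toy cluster labels
health = [0] * 18 + [1] * 2 + [0] * 3 + [1] * 17   # toy health status
table, log_p = clustering_log_pvalue(clusters, health)
```

A value of `log_p` below log(0.05) corresponds to rejecting independence at the 5% level.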
3.3.2 Assessment of the species driving the clustering.
To assess the contribution of minor versus dominant bacterial signals to the clustering, a Random Forest
(RF) model is trained to predict cluster assignments. The mean abundances of the top 20 most important
species, as determined by the RF model, are log-transformed. Lower values indicate that the clustering is
predominantly driven by low-abundance bacterial species.
Additionally, the relationship between species abundance and their importance in the RF model is
quantified by calculating the coefficient of determination ($R^2$). A high $R^2$ value suggests that species with
high abundance and variance exert a dominant influence on the clustering outcome, whereas a low $R^2$
indicates a greater role for minor bacterial signals.
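The two summaries described above can be sketched as follows. Several details are assumptions of this example: the impurity-based importance from scikit-learn, the use of mean abundance as the abundance measure, $R^2$ computed as the squared Pearson correlation of a simple linear fit, and the hypothetical name `signal_profile`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def signal_profile(abundance, labels, top=20, seed=0):
    """Summarise which species drive a labelling (clusters or health status)."""
    abundance = np.asarray(abundance, dtype=float)
    forest = RandomForestClassifier(n_estimators=200, random_state=seed)
    forest.fit(abundance, labels)
    importance = forest.feature_importances_  # mean decrease in impurity
    mean_abundance = abundance.mean(axis=0)
    top_idx = np.argsort(importance)[::-1][:top]
    # Log of the mean abundance of the most important species; lower values
    # point to low-abundance (minor) species.
    log_mean_top = np.log(mean_abundance[top_idx])
    # R^2 of a simple linear fit of importance on mean abundance, computed
    # as the squared Pearson correlation.
    r_squared = np.corrcoef(mean_abundance, importance)[0, 1] ** 2
    return log_mean_top, r_squared

rng = np.random.default_rng(1)
abundance = rng.dirichlet(np.ones(30), size=60)  # toy compositions
labels = (abundance[:, 0] > np.median(abundance[:, 0])).astype(int)
log_mean_top, r_squared = signal_profile(abundance, labels, top=5)
```

The same function applies unchanged when `labels` holds host health status instead of cluster assignments.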
3.4 Predictive classification
3.4.1 Assessment of predictive performance
The datasets under study are randomly partitioned into training and test sets. Machine learning models
are fitted using the training data, and their predictive performance is evaluated on the corresponding test
sets. Performance is assessed using the Area Under the Receiver Operating Characteristic Curve (AUC-
ROC), averaged over 20 cross-validation iterations. Higher AUC-ROC values indicate superior classification
performance in distinguishing host health status.
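The evaluation loop can be sketched as below, assuming a feature matrix and a binary target. The stratified split, the number of trees, and the names `repeated_auc`, `features`, and `target` are assumptions of this example rather than details given in the text:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def repeated_auc(features, target, n_repeats=20, test_size=0.2, seed=0):
    """AUC-ROC of a Random Forest over repeated random train/test splits."""
    scores = []
    for repeat in range(n_repeats):
        x_train, x_test, y_train, y_test = train_test_split(
            features, target, test_size=test_size,
            stratify=target, random_state=seed + repeat)
        forest = RandomForestClassifier(n_estimators=100, random_state=seed + repeat)
        forest.fit(x_train, y_train)
        # Probability of the positive class (column 1 for labels {0, 1}).
        prob = forest.predict_proba(x_test)[:, 1]
        scores.append(roc_auc_score(y_test, prob))
    return np.array(scores)

rng = np.random.default_rng(2)
features = rng.normal(size=(80, 10))  # toy data, not from the study
target = (features[:, 0] + 0.5 * rng.normal(size=80) > 0).astype(int)
auc_scores = repeated_auc(features, target, n_repeats=5)
```

Averaging `auc_scores` gives the summary value, and the array itself gives the distribution across repetitions.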
3.4.2 Assessment of the species driving the classification
To evaluate the contribution of minor bacterial signals to the prediction, the mean abundances of the 20
most important species (identified via feature importance from the Random Forest (RF) model) are log-
transformed. Lower log-transformed mean values indicate that the model’s predictions are primarily based
on low-abundance (i.e., minor) bacterial species.
In addition, the relationship between species abundance and their predictive importance is quantified by
computing the coefficient of determination ($R^2$). A high $R^2$ value suggests that species with high abundance
and variance predominantly influence the model's predictions, whereas a lower $R^2$ indicates a greater
contribution from minor signals.
3.5 Numerical experiments and simulations
The Inflammatory Bowel Disease dataset is used to simulate the experimental data. The simulated
host health ($Y_{\mathrm{sim}}$) is computed from both dominant and minor bacterial signals. We propose three
scenarios in which the dependence of $Y_{\mathrm{sim}}$ on dominant bacterial signals is categorized as low,
medium, or high. These scenarios are tested in different dimensions where the number of samples varies
($n_{\mathrm{sim}} \in \{75, 150, 300\}$). Fifty datasets were simulated for each condition.
3.5.1 Simulation of the microbiome data
We propose a new method to simulate microbiome data with statistical characteristics similar to those of a
reference dataset (especially the (dis)similarity between species, which can be related to their interaction).
Our method, described below, is inspired by the MIDASim algorithm (35). MIDASim is a non-parametric
approach that generally outperforms model-based simulation algorithms. However, in MIDASim, the size of
the simulated sample is constrained to be the same as the size of the reference dataset. In our approach, we
relax this constraint. Our novel algorithm has been implemented in the function Simulator in the Python
module named BiomeSampler available at: https://github.com/pierrehouedry/BiomeSampler.
Let us denote by $X = (x_{ij}) \in \mathbb{R}^{n \times p}$ our given microbiome dataset and by
$X_{\mathrm{sim}}$ the simulated dataset. To simulate new data, we start by generating a binary sample that
represents the absence-presence of the species in our dataset. This is done by first defining the
absence-presence matrix $\mathcal{P}(X) = (b_{ij}) \in \mathbb{R}^{n \times p}$, where each entry $b_{ij}$ is
given by:
$$b_{ij} = \begin{cases} 1 & \text{if } x_{ij} \neq 0, \\ 0 & \text{otherwise.} \end{cases}$$
Once we have this binary matrix, we estimate its correlation matrix $\hat{\Sigma} \in \mathbb{R}^{p \times p}$.
Given the numerical errors that may arise in the calculation, we need to ensure that $\hat{\Sigma}$ is positive
semi-definite to make it suitable for sampling from a multivariate normal distribution. To achieve this, we use
Higham's algorithm, which finds the closest positive semi-definite matrix (in the Frobenius norm sense),
denoted $H(\hat{\Sigma})$.
Next, we simulate the absence-presence data $Z$ by sampling from the multivariate normal distribution:
$$Z = (z_i)_{1 \le i \le n'} \quad \text{with} \quad z_i \sim \mathcal{N}\left(0, H(\hat{\Sigma})\right),$$
where $n'$ is the desired number of samples.
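The sampling steps described so far can be sketched with NumPy as below. This is not the BiomeSampler code: eigenvalue clipping is used here as a stand-in that yields the nearest symmetric positive semi-definite matrix in the Frobenius norm (without restoring a unit diagonal), and whether it matches the authors' implementation of Higham's algorithm is an assumption. The names `nearest_psd` and `sample_latent` are hypothetical.

```python
import numpy as np

def nearest_psd(matrix):
    """Nearest symmetric PSD matrix in Frobenius norm, via eigenvalue clipping."""
    sym = (matrix + matrix.T) / 2
    eigval, eigvec = np.linalg.eigh(sym)
    return (eigvec * np.clip(eigval, 0, None)) @ eigvec.T

def sample_latent(data, n_new, seed=0):
    """Draw n_new Gaussian vectors from the presence/absence correlation."""
    presence = (np.asarray(data) != 0).astype(float)
    # Correlation between species (columns); assumes no constant column.
    corr = np.corrcoef(presence, rowvar=False)
    cov = nearest_psd(corr)
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(np.zeros(cov.shape[0]), cov, size=n_new)

rng = np.random.default_rng(3)
toy = rng.random((40, 5)) * (rng.random((40, 5)) > 0.3)  # toy data with zeros
latent = sample_latent(toy, n_new=100)
```

Because `n_new` is a free parameter, the number of simulated samples is not tied to the size of the reference dataset.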
To maintain a proportion of zeros similar to the original data, we define the simulated absence-presence
matrix $\bar{Z} = (\bar{b}_{ij}) \in \mathbb{R}^{n' \times p}$, where:
$$\bar{b}_{ij} = \begin{cases} 1 & \text{if } Z_{ij} > \Phi^{-1}(r_j), \\ 0 & \text{otherwise,} \end{cases}$$
where $\Phi$ denotes the cumulative distribution function of the standard normal distribution
$\mathcal{N}(0, 1)$, and the
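[Figure 5 diagram: (A) Simulation: X and Y are used to produce Xsim and Ysim, via 1) clustering (using Bray-Curtis
metrics), 2) a model fitted on dominant signals (P1), and a model fitted on minor bacterial species (P2).
(B) Clustering performance: clusterings of Xsim under various transformations are compared with Ysim.
(C) Predictive performance: Xsim is split into train and test sets; models fitted on various transformations
are trained on Ytrain and their predictions Ypred are compared with Ytest.]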
Figure 5: Design of numerical experiments. (A) The reference dataset ($X$) is used to simulate an experimental
dataset ($X_{\mathrm{sim}}$). A predictive model is trained on minor bacterial species to estimate host health ($Y$)
in the reference dataset. The probability of belonging to the IBD group ($p_2$) is then computed based on
this model. Clustering is performed on $X_{\mathrm{sim}}$ to infer enterotypes, denoted as $\hat{C}$. We consider three scenarios
reflecting low, medium, and high dependence between $\hat{C}$ and $Y_{\mathrm{sim}}$, effectively introducing varying levels of
noise into the host health signal. Noise was introduced through the following procedure: (i) a Random Forest
classifier was trained on the dominant bacterial signals; (ii) this model was then used to estimate the
probability of belonging to the Prevotella enterotype, denoted as $p_1$; and (iii) a conditional probability was
subsequently computed from $p_1$ and $p_2$, as detailed in the Materials and Methods section. These scenarios
are evaluated across different sample sizes ($n_{\mathrm{sim}} \in \{75, 150, 300\}$), with 50 datasets simulated for each
condition. (B) Clustering performance is assessed under each experimental condition. Alternative clusterings
($\hat{C}'$) are obtained by applying HAC on various data transformations. A $\chi^2$ test is applied to the contingency
table between $\hat{C}'$ and $Y_{\mathrm{sim}}$ to quantify their association. This evaluation step is illustrated by a prominent
red arrow in the figure. (C) Predictive performance is evaluated for each experimental condition. The dataset
$X_{\mathrm{sim}}$ is split into training and testing subsets. Machine learning models are trained on the training data and
evaluated on the test data. Predictive accuracy is assessed using the Area Under the Receiver Operating
Characteristic Curve (AUC-ROC), based on the comparison between predicted and true host health outcomes
($Y_{\mathrm{pred}}$ vs. $Y_{\mathrm{test}}$). This evaluation step is also represented by a prominent red arrow in the figure.
estimated proportion of zeros for each species is calculated as follows:
$$(\pi_j)_{1 \le j \le p} = \left( \frac{1}{n} \sum_{i,\, X_{ij} = 0} 1 \right)_{1 \le j \le p}.$$
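To make the thresholding step concrete, the following sketch draws latent Gaussian vectors and binarizes them so that each species keeps approximately its reference proportion of zeros. The helper names `zero_proportions` and `simulate_presence`, the use of `statistics.NormalDist` for the inverse normal CDF, and the toy inputs are assumptions of this example; the correlation matrix is taken as given, already positive semi-definite with unit diagonal.

```python
import numpy as np
from statistics import NormalDist


def zero_proportions(X):
    """Proportion of zero entries in each column (species) of X."""
    return (np.asarray(X) == 0).mean(axis=0)


def simulate_presence(corr, zero_prop, n_new, rng):
    """Draw n_new binary presence vectors from a thresholded Gaussian."""
    p = len(zero_prop)
    # Latent sample: one row z_i ~ N(0, corr) per simulated individual.
    Z = rng.multivariate_normal(np.zeros(p), corr, size=n_new)
    # Threshold at the inverse normal CDF of the zero proportion;
    # the degenerate proportions 0 and 1 are handled explicitly.
    std_normal = NormalDist()
    thresholds = np.array([
        -np.inf if q <= 0 else np.inf if q >= 1 else std_normal.inv_cdf(float(q))
        for q in zero_prop
    ])
    return (Z > thresholds).astype(int)


# Hypothetical reference data (5 samples, 2 species) and correlation matrix.
X_ref = np.array([[0.0, 1.0], [0.0, 0.5], [0.2, 0.0], [0.0, 2.0], [1.5, 0.3]])
corr = np.array([[1.0, 0.4], [0.4, 1.0]])
pi = zero_proportions(X_ref)                      # zero proportion per species
rng = np.random.default_rng(0)
P_bar = simulate_presence(corr, pi, n_new=20000, rng=rng)
simulated_zero_prop = 1.0 - P_bar.mean(axis=0)    # close to pi for large n_new
```

Because each latent coordinate is standard normal, an entry exceeds its threshold with probability one minus the zero proportion, which is why the simulated sparsity tracks the reference sparsity species by species.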
Once the absence-presence matrix is simulated, we move on to simulate the abundance values for the
non-zero entries. The aim is to simulate according to the empirical law. For each feature $1 \le j \le p$, let $X_j^+$
be the vector containing only the strictly positive values of $(X_{ij})_{1 \le i \le n}$. To perform density estimation, we
use a bandwidth, which is chosen adaptively for each feature using Silverman's rule of thumb:
$$h_j = \frac{0.9 \min\left( \sigma_{X_j^+}, \mathrm{IQR}(X_j^+) \right)}{\left( \# X_j^+ \right)^{1/5}},$$
where $\sigma_{X_j^+}$ and $\mathrm{IQR}(X_j^+)$ represent the standard deviation and interquartile range of $X_j^+$, respectively.
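The bandwidth rule can be sketched in a few lines. The code follows the formula as written in the text; note that the widely used textbook form of Silverman's rule divides the interquartile range by 1.34, which is not done here. The function name `silverman_bandwidth`, the choice of the sample standard deviation (`ddof=1`), and the fallback for fewer than two values are assumptions of this example.

```python
import numpy as np


def silverman_bandwidth(x_pos):
    """Bandwidth 0.9 * min(sd, IQR) * n^(-1/5) for one species."""
    x_pos = np.asarray(x_pos, dtype=float)
    n = x_pos.size
    if n < 2:
        return 0.0                       # assumption: no smoothing with < 2 values
    sd = x_pos.std(ddof=1)               # assumption: sample standard deviation
    q75, q25 = np.percentile(x_pos, [75, 25])
    return 0.9 * min(sd, q75 - q25) * n ** (-0.2)


# Strictly positive abundances of one hypothetical species.
h = silverman_bandwidth([1.0, 2.0, 3.0, 4.0, 5.0])
```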
We simulate the dataset $\bar{X} = (\bar{X}_{ij}) \in \mathbb{R}^{n' \times p}$ by setting:
$$\bar{X}_{ij} = \begin{cases} x + h_j u & \text{if } \bar{p}_{ij} \neq 0, \\ 0 & \text{otherwise.} \end{cases}$$
Here, $u \sim \mathcal{U}([0, 1])$, and $x$ is uniformly chosen from the elements of $X_j^+$. We summarize the method
in Algorithm 2 of the Supplementary Materials. The comparison of our method with other existing
methods (35, 36, 37) is presented in Figure S3. Moreover, an example of a simulated dataset is presented in
Figure 6.
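The abundance step described above can be sketched as follows: each simulated cell marked as present receives a value drawn uniformly from the observed positive abundances of that species, shifted by the bandwidth times a uniform draw on [0, 1]. This is a minimal sketch under stated assumptions, not the BiomeSampler code: the function name `simulate_abundances`, the handling of species without positive reference values, and the toy inputs (including the bandwidths) are hypothetical.

```python
import numpy as np


def simulate_abundances(X, P_bar, bandwidths, rng):
    """Fill the non-zero cells of P_bar with resampled, jittered abundances."""
    X = np.asarray(X, dtype=float)
    P_bar = np.asarray(P_bar)
    X_bar = np.zeros(P_bar.shape)
    for j in range(X.shape[1]):
        x_pos = X[X[:, j] > 0, j]              # strictly positive reference values
        rows = np.flatnonzero(P_bar[:, j])     # simulated samples where species j is present
        if x_pos.size == 0 or rows.size == 0:
            continue                           # assumption: leave zeros if nothing to resample
        x = rng.choice(x_pos, size=rows.size)  # uniform draw among observed values
        u = rng.uniform(0.0, 1.0, size=rows.size)
        X_bar[rows, j] = x + bandwidths[j] * u
    return X_bar


# Hypothetical reference data (4 samples, 2 species) and simulated presence matrix.
X = np.array([[0.0, 1.0], [0.5, 2.0], [1.5, 0.0], [0.0, 4.0]])
P_bar = np.array([[1, 0], [0, 1], [1, 1]])
bandwidths = np.array([0.1, 0.2])
X_bar = simulate_abundances(X, P_bar, bandwidths, np.random.default_rng(1))
```

Since the uniform draw is non-negative, every simulated positive value is at least as large as the smallest observed positive abundance of the species, and the zero pattern of the simulated absence-presence matrix is preserved exactly.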
3.5.2 Simulation of the host health
The simulated host health outcome, denoted $Y_{\mathrm{sim}}$, was modeled as a function of both dominant and minor
bacterial signals. To investigate the influence of dominant bacterial signals, we constructed three scenarios
in which these signals exerted low, medium, and high influence on $Y_{\mathrm{sim}}$. In these scenarios, the contribution
of minor bacterial signals decreased proportionally. However, they remained relevant to the simulated
phenotype.
Specifically, $Y_{\mathrm{sim}}$ was generated by combining two probabilities: (1) the probability that a sample exhibits
a gut microbiota composition characterized by a high ratio of species from the Prevotella genus relative
to the Bacteroides genus ($p_1$), and (2) the probability that a sample presents with Inflammatory Bowel Disease
(IBD) ($p_2$). Each probability is estimated using a machine learning model trained on a reference dataset.
Specifically, $p_1$ is derived from a model fitted to highly abundant species to predict the first level of clustering
(i.e., enterotype), while $p_2$ is obtained from a separate model trained on low-abundance species to predict
host health status. Both outputs are expressed as probabilities. Figure 5 and Algorithm 3 give details
of the simulation of $Y_{\mathrm{sim}}$. In this context, the simulated host health was then defined as:
$$Y_{\mathrm{sim}} \equiv \left( p_1 w_1 + p_2 w_2 > 0.5 \right),$$
where $w_1 + w_2 = 1$ and $w_1 \in \{0, 0.08, 0.15\}$. A higher value of $w_1$ indicates a greater contribution of the
dominant signal ($p_1$) to $Y_{\mathrm{sim}}$. For example, when $w_1 = 0.15$, IBD samples are more likely to be associated
with clusters characterized by a high abundance of Prevotella species. Figure 6 confirms that the simulated
data follow this design.
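A small worked example may help to see how the weight on the dominant signal changes the simulated labels. The sketch below thresholds the weighted combination of the two probabilities at 0.5 for the three weights used in the text. The function name `simulate_host_health`, the coding of the outcome (1 when the threshold is exceeded, read here as IBD), and the probability values are assumptions made for illustration only.

```python
import numpy as np


def simulate_host_health(p1, p2, w1):
    """Return 1 where w1 * p1 + (1 - w1) * p2 exceeds 0.5, else 0."""
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    score = w1 * p1 + (1.0 - w1) * p2    # the two weights sum to one
    return (score > 0.5).astype(int)


p1 = np.array([0.9, 0.9, 0.1, 0.1])        # hypothetical enterotype probabilities
p2 = np.array([0.55, 0.45, 0.55, 0.45])    # hypothetical IBD probabilities
labels = {w1: simulate_host_health(p1, p2, w1) for w1 in (0.0, 0.08, 0.15)}
```

With these toy values, a weight of 0 or 0.08 on the dominant signal yields labels that follow the minor-signal probability alone, whereas a weight of 0.15 is enough for borderline samples to follow the enterotype probability instead.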
[Figure 6. Panel (A): PCoA of relative abundances of the simulated data (axes PCo1 and PCo2); legends: simulated host health (HLT, IBD) and enterotypes (B-ET, Mix-ET, P-ET); HLT / IBD counts per cluster: 45 / 24, 167 / 94, 27 / 43; p-value = 0.0003. Panel (B): gut microbiota composition (0 to 100) by genus: Bacteroides, Prevotella, Alistipes, Parabacteroides, Eubacterium, Anaerostipes, Butyrivibrio, Coprococcus, Roseburia, Faecalibacterium, Ruminococcus, Klebsiella, Others. Panel (C): importance of each species on clustering; species' abundances (log scale) against species' variability (log scale).]
Figure 6: Example of data simulated from the reference dataset with $n_{\mathrm{sim}} = 300$ and $w_1 = 0.15$. (A)
Principal Coordinates Analysis (PCoA) with 80% confidence ellipses (CEs) for each cluster, represented
by colored circles. (B) Bar plot of gut microbiota composition at the genus level, stratified by cluster.
Only the 13 most abundant genera are shown. (C) Abundance-variance relationship at the species level,
illustrating the influence of each species on the clustering. Circle size reflects the species' contribution
to the clustering. We confirm that the simulation produced a configuration characterized by three distinct
enterotypes, each driven by species predominantly from either the Prevotella or Bacteroides genus. As
anticipated based on our simulation design, samples assigned to the enterotype dominated by Prevotella
species were those most frequently associated with Inflammatory Bowel Disease (IBD). This association
was statistically significant (p-value < 0.001), rejecting the null hypothesis ($H_0$) that enterotype
composition is independent of the simulated host health status.
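The independence test mentioned in the caption can be illustrated with a Pearson chi-square computation on the HLT / IBD counts shown in the figure panel. This is a sketch under assumptions: the counts are taken in the order in which they appear in the panel, without asserting which pair belongs to which enterotype, the function name `chi2_independence` is hypothetical, and the exact test used by the authors may differ. The closed-form p-value relies on the fact that a 3 x 2 table has 2 degrees of freedom.

```python
import math
import numpy as np


def chi2_independence(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    obs = np.asarray(table, dtype=float)
    expected = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / obs.sum()
    stat = ((obs - expected) ** 2 / expected).sum()
    dof = (obs.shape[0] - 1) * (obs.shape[1] - 1)
    return stat, dof


# HLT / IBD counts per cluster, in the order they appear in the figure panel.
counts = [[45, 24], [167, 94], [27, 43]]
stat, dof = chi2_independence(counts)
# With 2 degrees of freedom the chi-square survival function is exp(-x / 2).
p_value = math.exp(-stat / 2.0) if dof == 2 else float("nan")
```

On these counts the p-value falls below 0.001, in line with the rejection of independence reported in the caption.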
References and Notes
1. K. Hou, et al., Microbiota in health and diseases. Signal Transduction and Targeted Therapy 7, 1–28
(2022).
2. V. Odintsova, A. Tyakht, D. Alexeev, Guidelines to Statistical Analysis of Microbial Composition Data
Inferred from Metagenomic Sequencing. Current Issues in Molecular Biology 24, 17–36 (2017).
3. D. C. Fonseca, et al., Evaluation of gut microbiota predictive potential associated with phenotypic
characteristics to identify multifactorial diseases. Gut Microbes 16, 2297815 (2024).
4. C. B. Peterson, S. Saha, K.-A. Do, Analysis of Microbiome Data. Annual Review of Statistics and Its
Application 11, 483–504 (2024).
5. M. I. Keller, et al., Refined Enterotyping Reveals Dysbiosis in Global Fecal Metagenomes (2024).
6. Q. Xie, et al., Two-year follow-up of gut microbiota alterations in patients after COVID-19: from the
perspective of gut enterotype. Microbiology Spectrum 13, e02774–24 (2025).
7. X. Zhu, et al., A specific enterotype derived from gut microbiome of older individuals enables favorable
responses to immune checkpoint blockade therapy. Cell Host & Microbe 32, 489–505.e5 (2024).
8. M. Arumugam, et al., Enterotypes of the human gut microbiome. Nature 473, 174–180 (2011).
9. P. I. Costea, et al., Enterotypes in the landscape of gut microbial community composition. Nature
Microbiology 3, 8–16 (2018).
10. M. Cheng, K. Ning, Stereotypes About Enterotype: The Old and New Ideas. Genomics, Proteomics &
Bioinformatics 17, 4–12 (2019).
11. D. Knights, et al., Rethinking "Enterotypes". Cell Host & Microbe 16, 433–437 (2014).
12. I. Bulygin, et al., Absence of enterotypes in the human gut microbiomes reanalyzed with non-linear
dimensionality reduction methods. PeerJ 11, e15838 (2023).
13. D. I. Warton, S. T. Wright, Y. Wang, Distance-based multivariate analyses confound location and
dispersion effects. Methods in Ecology and Evolution 3, 89–101 (2012).
14. F. Wang, et al., Detecting Microbial Dysbiosis Associated with Pediatric Crohn Disease Despite the
High Variability of the Gut Microbiota. Cell Reports 14, 945–955 (2016).
15. D. I. Warton, F. K. C. Hui, The central role of mean-variance relationships in the analysis of multivariate
abundance data: a response to Roberts (2017). Methods in Ecology and Evolution 8, 1408–1414 (2017).
16. G. Papoutsoglou, et al., Machine learning approaches in microbiome research: challenges and best
practices. Frontiers in Microbiology 14 (2023).
17. B. D. Topçuoğlu, N. A. Lesniak, M. T. Ruffin, J. Wiens, P. D. Schloss, A Framework for Effective
Application of Machine Learning to Microbiome-Based Classification Problems. mBio 11,
10.1128/mbio.00434–20 (2020).
18. H. B. Nielsen, et al., Identification and assembly of genomes and genetic elements in complex
metagenomic samples without using reference genomes. Nature Biotechnology 32, 822–828 (2014).
19. C. Lozupone, M. E. Lladser, D. Knights, J. Stombaugh, R. Knight, UniFrac: an effective distance metric
for microbial community comparison. The ISME Journal 5, 169–172 (2011).
20. M. Kucera, B. A. Malmgren, Logratio transformation of compositional data: a resolution of the constant
sum constraint. Marine Micropaleontology 34, 117–120 (1998).
21. T. P. Quinn, et al., A field guide for the compositional analysis of any-omics data. GigaScience 8, giz107
(2019).
22. S. Aryal, A. Alimadadi, I. Manandhar, B. Joe, X. Cheng, Machine learning strategy for gut microbiome-
based diagnostic screening of cardiovascular disease. Hypertension (Dallas, Tex.: 1979) 76, 1555–1562
(2020).
23. D. Fernández-Edreira, J. Liñares-Blanco, C. Fernandez-Lozano, Machine Learning analysis of the human
infant gut microbiome identifies influential species in type 1 diabetes. Expert Systems with Applications
185, 115648 (2021).
24. I. Manandhar, et al., Gut microbiome-based supervised machine learning for clinical diagnosis of
inflammatory bowel diseases. American Journal of Physiology-Gastrointestinal and Liver Physiology
320, G328–G337 (2021).
25. D. Martin, et al., Atypical gut microbial ecosystem from athletes with very high exercise capacity
improves insulin sensitivity and muscle glycogen store in mice. Cell Reports 44 (2025).
26. E. Maunder, D. J. Plews, A. E. Kilding, Contextualising Maximal Fat Oxidation During Exercise:
Determinants and Normative Values. Frontiers in Physiology 9, 599 (2018).
27. I. San-Millán, G. A. Brooks, Assessment of Metabolic Flexibility by Means of Measuring Blood Lactate,
Fat, and Carbohydrate Oxidation Responses to Exercise in Professional Endurance Athletes and Less-Fit
Individuals. Sports Medicine (Auckland, N.Z.) 48, 467–479 (2018).
28. Materials and methods.
29. X. Wu, T. Zhang, T. Zhang, S. Park, The impact of gut microbiome enterotypes on ulcerative colitis:
identifying key bacterial species and revealing species co-occurrence networks using machine learning.
Gut Microbes 16, 2292254 (2024).
30. D. Chyzhyk, G. Varoquaux, M. Milham, B. Thirion, How to remove or control confounds in predictive
models, with applications to brain biomarkers. GigaScience 11, giac014 (2022).
31. S. Kang, et al., Dysbiosis of fecal microbiota in Crohn's disease patients as revealed by a custom
phylogenetic microarray. Inflammatory Bowel Diseases 16, 2034–2042 (2010).
32. M. Wijdeveld, et al., Identifying Gut Microbiota associated with Gastrointestinal Symptoms upon
Roux-en-Y Gastric Bypass. Obesity Surgery 33, 1635–1645 (2023).
33. M. C. Jones, The Statistical Analysis of Compositional Data. Royal Statistical Society. Journal. Series
A: General 150, 396 (1987).
34. V. Pawlowsky-Glahn, J. J. Egozcue, R. Tolosana-Delgado, Modelling and Analysis of Compositional
Data (John Wiley & Sons, Ltd) (2015).
35. M. He, N. Zhao, G. A. Satten, MIDASim: a fast and simple simulator for realistic microbiome data.
Microbiome 12, 135 (2024).
36. Z. D. Kurtz, et al., Sparse and Compositionally Robust Inference of Microbial Ecological Networks.
PLOS Computational Biology 11, e1004226 (2015).
37. J. Chiquet, S. Robin, M. Mariadassou, Variational Inference for sparse network reconstruction from
count data, in Proceedings of the 36th International Conference on Machine Learning (PMLR) (2019),
pp. 1162–1171.
Acknowledgments
The authors thank the anonymous reviewers for their valuable suggestions.
Funding: This work is supported in part by funds from the University of Rennes and the French National
Research Agency within the framework of the PIA France 2030 program for EUR DIGISPORT
(ANR-18-EURE-0022) projects.
Author contributions: DM and VM developed the conceptual framework, DM conceived the experiment(s),
DM and PH conducted the experiment(s), PH conceived the simulation microbiome algorithm,
DM, PH, and VM analyzed the results. DM, PH, FD and VM wrote and reviewed the manuscript.
Competing interests: There are no competing interests to declare.
Data and materials availability: The R scripts and the Python packages used for the simulation and
the numerical experiments are available at: https://github.com/pierrehouedry/BiomeSampler. Any
additional information required to reanalyze the data reported in this paper is available from the lead contact
upon request.