Abstract
Microbial communities significantly impact medicine, biotechnology, and agriculture.
Advanced sequencing technologies have generated extensive microbiome data,
enabling the discovery of substantial evolutionary and ecological patterns. However,
traditional supervised learning methods struggle to capture universal patterns in
microbial community data, largely due to the large data heterogeneity and profound
batch effects among samples, rendering it difficult to classify samples as well as
detect biomarkers from millions of samples, not to say the intricate but important
dynamic patterns from a variety of contextualized sceneries. In this study, we
introduce the Microbial General Model (MGM), the first microbiome community
foundation model pre-trained on a dataset of 263,302 microbiome samples using
language modeling techniques. MGM demonstrated significant improvements in
microbial community classification compared to traditional machine learning methods.
Additionally, MGM has enabled contextualized classification, effectively overcomes
cross-regional limitations, showing enhanced performance on intercontinental datasets
through transfer learning. Furthermore, fine-tuning MGM on a longitudinal infant
dataset revealed distinct keystone genera during development, with Bacteroides and
Bifidobacterium exhibiting higher attention weights in vaginal deliveries, and
Haemophilus in cesarean deliveries. Finally, through in silico modeling, the model
also uncovered novel microbial dynamic patterns in a Crohn’s disease cohort
following antibiotic treatment. In conclusion, by leveraging self-attention and
autoregressive pre-training, MGM serves as a versatile model for various downstream
microbiome tasks and holds significant potential for achieving contextualized aims.
Key points
/circle6 The Microbial General Model (MGM) is a foundation model with millions of
parameters pre-trained on sub-million microbial community data.
/circle6 MGM outperforms traditional methods in various microbiome classification and
prediction tasks, such as microbial community classification.
/circle6 MGM effectively captures the spatial and temporal dynamics of microbial
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 1, 2025. ; https://doi.org/10.1101/2024.12.30.630825doi: bioRxiv preprint
communities.
/circle6 MGM could detect the effects of perturbation on microbial community through in
silico experiments.
Introduction
Microbial communities, ubiquitous across diverse environments play crucial roles in
shaping ecological niches have significant implications for health [1, 2], synthetic
biology [3, 4], and environmental science [5, 6]. The advent of sequencing
technologies has enabled researchers to amass vast microbiome datasets, significantly
expanding our understanding of these complex systems [7]. Currently, hundreds of
thousands of microbiome samples and their sequencing data have been accumulated
and deposited in public databases [8]. For example, as of 2023, EBI MGnify, a
leading platform for microbiome data analysis and archiving, has cataloged 343,695
distinct samples from 4,601 studies across various biomes, including environmental,
engineered, and host-associated microbiomes [8]. While these extensive collections of
microbial community samples represent a valuable resource, they also present
challenges in integrating large-scale microbiome data and extracting complex,
multifaceted patterns within microbial communities, which is essential for advancing
our understanding of their subtle evolutionary and ecological dynamics [9]. However,
data heterogeneity, including insufficient data standards and a lack of interoperability
across datasets, as well as profound batch effects among studies, limits the ability of
traditional meta-analytical methods in capturing shared insights across studies
[10-12].
A promising approach for overcoming these limitations is the use of foundation
models. These models are pre-trained on large-scale datasets, enabling them to
generate a broad range of outputs. Recently, several studies have focused on
developing foundation model to improve the understanding of microbial sequence
data [13-15]. These methods derive foundation models by training on vast, diverse
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 1, 2025. ; https://doi.org/10.1101/2024.12.30.630825doi: bioRxiv preprint
datasets, capturing broad patterns that represent generalized biological relationships.
For example, gLM, a genomic language model pre-trained on millions of
metagenomic scaffolds, demonstrated impressive zero-shot performance in
downstream tasks [15].
As the foundation model learned the shared knowledge embedded in large-scale
dataset, it can be transferred to specific downstream tasks with the specific
characteristics and nuances by transfer learning [16, 17]. This approach is often more
effective than training models from scratch, as it leverages previously acquired
knowledge, leading to improved performance and reduced training time [18-20]. By
adjusting the learned representations to fit the context of the new dataset, transfer
learning has shown potential for addressing the complexities of microbiome data
integration and analysis [21-23].
However, these methods face inherent limitations due to their reliance on supervised
learning strategies, which can introduce label bias during pre-training. Consequently,
different downstream tasks often necessitate the adoption of distinct pre-trained
models. Inappropriate pre-trained model selection can lead to either negligible
performance gains or even performance degradation in the contextualized model [22].
Moreover, inaccurately annotated microbiome samples, such as those labeled “Mixed
biome” in MGnify, can distort the training process, leading to misinterpretations and
ultimately hindering model performance.
To address these limitations, self-supervised learning offers a compelling alternative.
Unlike supervised methods, self-supervised learning does not rely on labeled data and
instead allows models to uncover underlying patterns in large datasets [24]. This
approach circumvents the need for high quality labeled datasets and has shown
significant promise in natural language processing (NLP), where large language
models (LLMs) have evolved from simple pattern recognition to tackling more
sophisticated tasks such as reasoning and content generation [25]. A key driver of
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 1, 2025. ; https://doi.org/10.1101/2024.12.30.630825doi: bioRxiv preprint
these capabilities is the self-attention mechanism [26], a core component of LLMs,
which enables models to focus on relevant parts of high-dimensional input data,
capturing contextual relationships within text. By fine-tuning these models on specific
tasks, the knowledge gained during extensive pre-training on large corpora can be
effectively transferred to new applications. This approach has led to significant
performance improvements across a wide range of NLP tasks [27].
Recent advancements in the application of LLMs have shown promises in the analysis
of biological tabular data [28-31]. Building on their success in processing textual data,
these models have been adapted to handle the structured, high-dimensional nature of
biological datasets. For example, scBERT employs a transformer-based architecture to
analyze gene expression data, enhancing tasks such as cell type classification and
gene expression imputation [28]. Geneformer leverages pre-trained LLMs to identify
gene network regulatory elements within biological data [29]. Similarly, scGPT,
inspired by generative pre-training, provides insights into cellular heterogeneity and
simulating biological states under various conditions [30]. Lastly, scFoundation
integrates foundation LLM techniques to create a versatile framework for diverse
biological tasks, including cell type annotation, trajectory inference, and differential
expression analysis [31]. These models underscore the potential of LLMs to
revolutionize the analysis of biological tabular data, offering more accurate and
comprehensive insights into complex biological systems. Building on these
advancements in genomics analysis, the application of LLMs to the study of microbial
communities offers a promising new frontier for understanding complex microbial
ecosystems.
In this study, we introduce the Microbial General Model (MGM), a context-aware,
attention-based foundation model specifically designed for microbiome analysis.
MGM employs multi-layer transformer blocks and is pre-trained on nearly one
million microbiome samples from diverse biomes to capture generalizable microbial
composition patterns. Through transfer learning, MGM replaces its language
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 1, 2025. ; https://doi.org/10.1101/2024.12.30.630825doi: bioRxiv preprint
modeling head with task-specific heads, enabling fine-tuning on limited data for
various applications. In benchmark evaluations, MGM outperformed traditional
machine learning approaches across multiple tasks. For example, in a cross-regional
disease diagnosis task, MGM generalized effectively across diverse cohorts,
overcoming intercontinental variations in microbial community structures. In a
longitudinal infant cohort, MGM distinguished between delivery modes, identifying
developmental distinctions and keystone species. In tumor microbiome analyses,
MGM uncovered potential therapeutic targets through in silico perturbation
experiments. Additionally, in a Crohn’s disease cohort, MGM revealed microbial
dynamics influenced by antibiotic treatment, identifying consistent and novel
microbial signals over time.
Results
Microcorpus-260K and MGM architecture
MGM is a foundation model pre-trained on large-scale microbiome community
samples from various biomes. To facilitate this pre-training, we assembled
Microcorpus-260K, a comprehensive dataset containing microbiome samples from
the MGnify database up to June 2023. After filtering out low-quality or incomplete
data, 263,302 samples were retained for pre-training (Methods). From these samples,
we generated a vocabulary comprising 9,665 distinct genera. The genera were
normalized and ranked based on their relative abundance in each sample, and then
transformed into discrete input representations. Given the fixed input length required
by the transformer model, we selected 512 as the input length. This choice was made
to ensure that 99.99% of the samples were adequately covered without truncation (Fig.
1a). Following the preparation of the dataset, we developed MGM using a multi-layer
transformer architecture (Fig. 1c ), designed to effectively capture the patterns and
structures present within the large-scale microbial community data.
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 1, 2025. ; https://doi.org/10.1101/2024.12.30.630825doi: bioRxiv preprint
Figure 1. MGM architecture and transfer learning strategy. a. Construction
process of Microcorpus-260K. b. MGM pre-trained using causal language modeling
approach. c. Model details in multi-layer transformer blocks. d. Adaptation of MGM
to different downstream tasks using transfer learning methods. e. Examples of
downstream tasks: I. Batch integration based on contextualized sample embedding. II.
Keystone species discovery based on contextualized attention weights. III. Accurate
predictions based on transfer learning. IV . In silico perturbation analysis.
Language modeling enables generalizable patterns capture
We pre-trained MGM using a causal language modeling approach on the
Microcorpus-260K dataset. During this process, the model progressively learned to
predict the next genera based on existing microbial composition within the sample
(Fig. 1b ). By leveraging the self-attention mechanism, MGM captures
high-dimensional global representations of microbial community information (Fig.
1c). For downstream tasks, MGM employs transfer learning strategy, where the
language modeling head is replaced by task-specific heads (e.g., a sequence
classification head) and the model is fine-tuned on limited data ( Fig. 1d ). This
capacity makes it a versatile tool for a range of microbiome analyses, including batch
integration, keystone species discovery and microbial community classification.
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 1, 2025. ; https://doi.org/10.1101/2024.12.30.630825doi: bioRxiv preprint
Additionally, MGM can predict the effects of perturbation by modifying the rank of
certain microorganisms (Fig. 1e).
To optimize model performance, we conducted a grid search on hyperparameters and
selected a structure comprising 8 layers and 8 attention heads ( Fig. 2a,
Supplementary Fig. 1a ). To evaluate the effectiveness of pre-training process, we
randomly selected 1,000 samples from the validation set. For each sample, the
pre-trained MGM model predicted microbial compositions (referred to as “sentences”)
based on a proportion of genera (referred to as “tokens”). We then compared the
cosine similarity between the embeddings of the predicted and original compositions.
Remarkably, even when only 20% of the original genera were provided, the cosine
similarity between the predicted and original embeddings exceeded 0.9 ( Fig. 2b ).
These results highlighted MGM’s strong ability to capture and generalize microbiome
patterns, positioning it as a powerful tool for various downstream applications.
We further explored whether the pre-training process captured taxonomic differences
between genera, despite the absence of explicit phylogenetic information in our
encoding strategy. We extracted embeddings for the 9,665 genera from the word
embedding layer and found that genera from Bacteria and Eukaryota formed two
diffuse clusters in the embedding space (Fig. 2c) . Additionally, we identified an
outlier cluster comprising 157 genera, 43.3% of which belonged to Arthropoda (68 of
157, Fig. 2d). Given that Arthropoda is not typically considered a core component in
microbiome studies, this finding suggested that MGM is capable of detecting
taxonomic distinctions, even without explicit phylogenetic encoding.
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 1, 2025. ; https://doi.org/10.1101/2024.12.30.630825doi: bioRxiv preprint
Figure 2. Hyperparameter search and pre-training evaluation. a. Grid search
Result
of layers and heads of MGM. b. Similarity between generated sentences and the
original sentences. c. UMAP visualization of the 9,665 genus embeddings extracted
from MGM’s word embedding layer. d. Phylum-level distribution of the 157 genera
identified as outliers in the UMAP plot. Only the top 3 phylum have the most genera
are labeled.
Microbial community classification and batch integration
We evaluated MGM by a comprehensive benchmark using a microbial community
classification task on our Microcorpus-260K dataset. This task is a critical aspect of
microbiome analysis, with applications including microbial source tracking [32] and
noninvasive diagnostics [33]. We fine-tuned MGM for microbial community
classification with cross-entropy on biome name and lineage from MGnify, followed
by a 5-fold cross-validation on each biome layer. We compared these results with both
traditional source tracking methods, such as FEAST [34], and other machine learning
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 1, 2025. ; https://doi.org/10.1101/2024.12.30.630825doi: bioRxiv preprint
techniques, including K-Nearest Neighbor (KNN), Logistic Regression (LR), Random
Forest (RF). To assess the value of self-supervised pre-training, we also evaluated an
un-pre-trained MGM model. Our results demonstrated that the fine-tuned MGM
significantly enhanced the ability to distinguish the source of samples ( Fig. 2a ,
average ROC-AUC = 0.99), outperforming traditional methods ( Fig. 2a, FEAST,
average ROC-AUC = 0.68; KNN, average ROC-AUC = 0.94; LR, average
ROC-AUC = 0.95; RF, average ROC-AUC = 0.97) and the un-pre-trained MGM (Fig.
3a, average ROC-AUC = 0.97). Notably, the EM-based FEAST method exhibited
inefficiently, leading us to evaluate it only at the first layer. To further tested model
generalization, we applied MGM on 43,528 new samples introduced to MGnify after
March 2023 without additional fine-tuning. While the RF model performed slightly
better in the shallower layers (layer 1 and 2), MGM excelled in the deeper, more
complex layers (layer 3, 4, and 5) ( Fig. 3b ). Furthermore, both fine-tuned and
un-pre-trained MGM embeddings yielded superior microbial community
classification compared to community abundance profiles based on clustering
performance (Fig. 3c). These findings underscored the value of the general patterns
captured during self-supervised pre-training, which provided a foundational
understanding of microbial community structures and enhanced downstream
classification tasks.
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 1, 2025. ; https://doi.org/10.1101/2024.12.30.630825doi: bioRxiv preprint
Figure 3. MGM Enhanced Source Tracking of Microbial Communities. a.
Boxplot evaluation of the six methods using 5-fold cross-validation on each layer of
biome lineage, with blue representing ROC-AUC, orange representing F1, and dashed
lines indicating the average of all experiments. b. ROC-AUC performance of each
Method
on newly introduced data in MGnify. We excluded FEAST for its poor
computational inefficiency and performance at the shallow layer . c. UMAP
dimensionality reduction of for 3,000 random samples from microbial relative
abundance, pre-trained MGM embeddings and fine-tuned MGM embeddings colored
by the biome of layer 1. KNN: K-nearest neighbors. LR: Logistic regression. RF:
Random Forest. TL: transfer learning. SS: Silhouette score.
Overcoming cross-regional limitation
Cross-regional diagnosis poses a significant challenge due to the variability in
microbiome compositions across geographic regions [35]. Factors like diet,
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 1, 2025. ; https://doi.org/10.1101/2024.12.30.630825doi: bioRxiv preprint
environment, and genetic background create distinct microbial profiles, complicating
the development of diagnostic models that perform consistently across diverse
populations. Traditional methods often struggle with overfitting to specific training
data, which hinders their ability to generalize, especially when datasets are small or
lack diversity. This variability highlights the need for a foundation model that can
capture broad microbial patterns while being flexible enough to accommodate
regional differences. MGM is pre-trained on large-scaled dataset, enabling it to
recognize generalizable microbial features that are less susceptible to regional biases.
When applied to cross-regional diagnosis, MGM can be fine-tuned with
region-specific data, enhancing its ability to adapt to local microbial variations
without losing the robustness of its general foundation.
We evaluated the robustness of MGM against regional limitations in clinical diagnosis.
In such tasks, region-specific and disease-specific microbes often play a crucial role.
Three models were evaluated for disease diagnosis performance across two
intercontinental gut microbiome cohorts: an un-pre-trained model, a pre-trained model,
and a Cross-country model fine-tuned with data from the target country.
For the inflammatory bowel disease (IBD) cohort [35] consisting of samples from
Ireland and Canada, we observed that Lachnoclostridium ranks lower in the Crohn's
disease group compared to the healthy group. Similarly, it also ranks lower in samples
from Canada compared to those from Ireland (Supplementary Fig. 2 ). The
pre-trained MGM model outperformed the un-pre-trained model in both
cross-regional and local region diagnostics. After partial fine-tuning with data from
the target region, the cross-regional diagnostic performance reached satisfactory
levels (Ireland predicting Canada, ROC-AUC: 0.844; Canada predicting Ireland,
ROC-AUC: 0.829, Fig. 4a, 4b and 4c). For the irritable bowel syndrome (IBS) cohort
[36] consisting of samples from Australia, the United Kingdom, and the United States,
the model's initial performance was poor in Australia due to small sample sizes (n=21,
ROC-AUC: 0.500). However, after training on a larger dataset from the United
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 1, 2025. ; https://doi.org/10.1101/2024.12.30.630825doi: bioRxiv preprint
Kingdom (n=171) or United States (n=498) and fine-tuning with data from the target
countries, the model's diagnostic performance in Australia and the United Kingdom
improved significantly (Australia from United Kingdom ROC-AUC: 0.833; United
Kingdom from United State ROC-AUC: 0.800, Fig. 4d, 4e and 4f). Our results
aligned with previous studies based on fully connected neural network [37], suggested
building a foundation diagnostic model on a large, diverse dataset and then
fine-tuning it with smaller regional datasets proves to be an effective strategy for
mitigating regional effects.
Figure 4. Result of disease diagnosis across intercontinental regions. a. Heatmap
of ROC-AUC of un-pre-trained model on IBD cohort. b. Heatmap of ROC-AUC of
pre-trained model on IBD cohort. c. Heatmap of ROC-AUC of cross-country model
on IBD cohort. d. Heatmap of ROC-AUC of un-pre-trained model on IBS cohort. e.
Heatmap of ROC-AUC of pre-trained model on IBS cohort. f. Heatmap of ROC-AUC
of cross-country model on IBS cohort. Row represents countries for training the
model. Columns represents countries for evaluating the model.
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 1, 2025. ; https://doi.org/10.1101/2024.12.30.630825doi: bioRxiv preprint
Infant development monitoring and keystone genus
discovery
Next, we fine-tuned our model on a longitudinal infant dataset from Roswall et al.
[38], to distinguish the developmental stages of infants from different delivery modes
(Fig. 5a ). During infancy, the gut microbiome undergoes dynamic changes and
typically stabilizes in childhood. Meanwhile, the mode of delivery significantly
impacts the microbiome in neonates, which can be linked to health outcomes later in
life [39]. However, relatively few methods exist to uncover the dynamics progression
of gut microbiota from infancy toward a stable, adult-like microbiota.
Our model could accurately predict the developmental stage and delivery mode of
these infants, outperforming both MGM trained from scratch and traditional methods.
In comparison, EXPERT, a supervised pre-training method based on neural network
for microbial community classification [22], performed poorly in this task, indicating
the superiority of self-supervised pre-training methods and attention mechanisms
(EXPERT, average ROC-AUC = 0.54; EXPERT with transfer learning, average
ROC-AUC = 0.54; MGM, average ROC-AUC = 0.90; MGM with transfer learning,
average ROC-AUC = 0.91, Fig. 5a, Supplementary Fig. 3a). Besides, we visualized
sample embeddings to capture the dynamic changes in microbial communities at
different stages ( Supplementary Fig. 4 ). The high similarity between samples of
cesarean delivery and vaginal delivery samples at the same development stage suggest
that infant gut microbes are stage-specific. As the infant ages, the sample embeddings
become increasingly similar to their mother’s, indicating that infant gut microbial
communities evolve toward those of adults.
We then examined the attention weights from 64 attention across developmental
stages. Several genera, including Dorea, Faecalibacterium and Ruminococcus (KS
test, P = 0.13, 0.19 and 0.22), demonstrated similar attention trends across delivery
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 1, 2025. ; https://doi.org/10.1101/2024.12.30.630825doi: bioRxiv preprint
modes, consistent with their known associations with infant age [40]. However, most
genera demonstrated distinct attention patterns. Notably, Bacteroides (KS test, P =
5.87E-5), a keystone taxon of the human gut microbiota [41] and Bifidobacterium (KS
test, P = 1.53E-12), a common probiotic [42], received higher attention weights in
vaginal deliveries during early stages. In contrast, Haemophilus (KS test, P =
4.88E-90), a well-known human pathogen, had consistently higher attention weights
in cesarean deliveries throughout the entire developmental process (Fig. 5b).
We further employed a leave-one-genus-out deletion approach to identify genera
whose removal would have a deleterious effect in this context. In this analysis, one
genus was removed from the rank value encoding at a time, and the impact on the
embeddings of the remaining genera was quantified by similarity. The deleterious
pattern of two delivery modes is consistent with the results of the attention weights
analysis that probiotic has higher deleterious effects on infants delivered vaginally.
Keystone taxa such as Bacteroides, Roseburia, and Faecalibacterium exhibited the
higher deleterious effects ( Fig. 5c and Supplementary Fig. 3b ). Based on these
findings, we hypothesize that genera with high attention weights and deleterious
effects possess strong keystone attributes.
We quantified the keystone attributes of genera using DKI framework [43], a
deep-learning model designed to assess community-specific keystoneness by
conducting thought experiment on species removal. We found that genera with high
attention weight had a large overlap with genera with high keystoneness, and these
genera were present in a large proportion of samples. Interestingly, infants delivered
vaginally exhibited higher overall keystoneness compared to those delivered by
cesarean section, with the mother's keystoneness was intermediate between the two
delivery modes. (Fig. 5d and Supplementary Fig. 3c). In summary, several bacterial
taxa identified as keystone in gut microbiome exhibited high attention weights in our
model. This demonstrated that our attention-based model can effectively capture
keystones taxa and the dynamic developmental trajectory of microbial community,
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 1, 2025. ; https://doi.org/10.1101/2024.12.30.630825doi: bioRxiv preprint
paving the way for comprehensive analysis of diverse microbiome from both spatial
and temporal perspectives.
Figure 5. MGM Enhanced age prediction and keystone genus discovery. a.
Boxplot evaluation of MGM model using 5-fold cross-validation, with blue
representing ROC-AUC, orange representing F1, and dashed lines indicating the
average of all experiments. b. Line plot showing the attention weights of the top 20
genera across layers and heads. Asterisks representing the significance of
Kolmogorov–Smirnov (KS) test: ***: P < 0.005, **: P < 0.01, *: P < 0.05, ns: not
significant. c. Top 5 genera with deleterious effects at birth from cesarean deliveries
and birth from vaginal deliveries. d. Keystoneness of genera appearing in at least 30%
of samples in each group (pink line plot). Overlap of the top 20 highest attention
weight genera and the top 20 highest keystoneness genera (green bar plot). B: birth,
4M: 4 months, 12M: 12 months, 3Y: 3 years, 5Y: 5 years, C: cesarean, V: vaginal, M:
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 1, 2025. ; https://doi.org/10.1101/2024.12.30.630825doi: bioRxiv preprint
mother, TL: transfer learning.
Potential cancer treatment target identification
We next sought to explore the potential applications of MGM model in distinguishing
different types of cancers and identifying tumor-specific biomarkers. Recent studies
have increasingly highlighted the presence of microbial signals within tumor tissues,
suggesting that the tumor microbiome could be a valuable target for cancer research
and treatment [44-46]. However, the research on biomarkers across different tumor
tissues may not yet be sufficiently comprehensive [47]. Additionally, the ability to
accurately diagnose and identify multiple types of tumor tissues remains limited in
precision. This underscores the need for more robust models that can effectively
differentiate cancers and identify specific microbial biomarkers, which could be
crucial for developing targeted therapies.
To this end, we fine-tuned our model using five types of gastrointestinal tumors
obtained from The Cancer Microbiome Atlas (TCMA) database [48]. Firstly, our
model with a macro-ROC value reaching 0.97 ( Fig. 6a). Specifically, the ROC
values for COAD, ESCA, HNSC, READ, and STAD were 0.99, 0.97, 0.98, 0.98, and
0.97, respectively, indicating a robust performance across different cancer types.
Secondly, to further explore the potential therapeutic targets, we employed the
'leave-one-genus-out' approach to assess the impact of each genus's absence on the
sample embeddings. We extracted the top 50 biomarkers ranked by MGM-calculated
attention scores, and the heatmap (Fig. 6b) illustrated their distribution across the five
cancer tissues. Our findings suggested that some genera exhibited significant
abundance differences in particular types of cancer. For instance, Escherichia and
Enterobacter had a significant detrimental impact on COAD and READ samples.
These genera have been previously associated with gastrointestinal tumors [48-50]. In
COAD samples, the abundance of Escherichia was observed to be 7.25 times higher
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 1, 2025. ; https://doi.org/10.1101/2024.12.30.630825doi: bioRxiv preprint
than in non-COAD samples. Conversely, genera like Acinetobacter may act as key
biomarkers distinguishing COAD and READ tissues. Streptobacillus, another
significant genus, displayed an elevated presence in HNSC, with a relative abundance
increase of 5.26 times compared to other types, indicating its importance in
distinguishing HNSC. Boxplots ( Fig. 6c-d ) showed the cosine similarity of the
embedding vectors before and after the removal of specific genus. Notably, when
Acinetobacter was removed, the cosine similarity in the READ samples significantly
decreased, underscoring the critical role this genus plays in shaping the microbial
composition of these samples. These findings not only validated the effectiveness of
MGM model in identifying cancer-associated microbial biomarkers, but also offered
new possibilities for targeted cancer diagnosis and treatment.
Collectively, our model demonstrated high diagnostic accuracy and robust
performance across multiple cancer types, underscoring the crucial role of the tumor
microbiome in cancer development. These advancements revealed MGM’s potential
for more precise and targeted cancer therapies, highlighting the potential of
integrating tumor microbiome analysis into clinical practice.
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 1, 2025. ; https://doi.org/10.1101/2024.12.30.630825doi: bioRxiv preprint
Figure 6. MGM Enhanced Cancer Diagnosis and Biomarker Identification. a.
ROC performance of MGM in distinguishing between five types of cancer. The x-axis
represents the False Positive Rate (FPR), while the y-axis indicates the True Positive
Rate (TPR). The macro-average ROC value reaches 0.97, with individual ROC values
as follows: COAD (0.99), ESCA (0.97), HNSC (0.98), READ (0.98), and STAD
(0.97). b. Distribution of the top 50 species identified by attention across the five
types of cancer. The x-axis represents different cancer types (COAD, ESCA, HNSC,
READ, STAD), and the y-axis lists the microbial genera. The color intensity in each
cell corresponds to the abundance level, with the scale ranging from -2 to 12, where
positive values indicate a higher relative abundance. c-d. Cosine similarity of
embedding vectors before and after the removal of a specific genus in COAD and
READ. The x-axis shows the microbial genera removed, and the y-axis represents the
cosine similarity score, ranging from 0 to 1.
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 1, 2025. ; https://doi.org/10.1101/2024.12.30.630825doi: bioRxiv preprint
In silico perturbation analysis and validation on Crohn’s
disease
Microbial perturbation studies are crucial for understanding the impact of treatments,
such as antibiotics, on microbiome composition in disease contexts. To verify MGM's
capability to capture microbial perturbations, we fine-tuned the model on a dataset
containing intestinal mucosa microbiome samples from Crohn's disease (CD) patients
before and after antibiotic treatment [51].
In our in silico perturbation analysis, we identified microbes whose enrichment or
reduction in CD patients' intestines could shift sample embeddings towards those of
healthy controls. To simulate microbial enrichment, we repositioned a genus token to
the next position after the ‘bos’ token, while microbial reduction was simulated by
repositioning the genus token to the position before the ‘eos’ token. Our results
aligned with the original study, particularly in CD patients who had not received
antibiotic treatment ( Fig. 7a-d). In both Terminal ileum and Rectum samples, the in
silico reduction of Enterobacteriaceae, Pasteurellaceae, V eillonellaceae,
Fusobacteriaceae, and Neisseriaceae resulted in a higher similarity to healthy
controls compared to in silico enrichment, mirroring findings from the original study.
However, Gemellaceae, another family reported as increased in the original study,
showed no difference between in silico enrichment and reduction.
We further evaluated the top six families showing the largest differences in our in
silico analysis against the six families reported as increased in the original study. We
identified several novel families that were more prevalent in CD patients, including
Alcaligenaceae and [Odoribacteraceae], whose in silico reduction significantly
increased similarity to healthy controls ( Fig. 7e-h). Previous studies have shown that
Alcaligenaceae is enriched in mesenteric adipose tissue in CD patients [52], and
[Odoribacteraceae] is more abundant in CD patients with the Type II Paneth cell
phenotype [53]. Moreover, the in silico reduction of Enterococcaceae and
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 1, 2025. ; https://doi.org/10.1101/2024.12.30.630825doi: bioRxiv preprint
Helicobacteraceae in Terminal ileum samples from CD patients treated with
antibiotics exhibited a significantly greater impact, suggesting that these families may
show resistance to antibiotic treatment (Fig. 7g).
These findings highlighted MGM's effectiveness in detecting both known and novel
microbial perturbations. By capturing microbial dynamics that align with
experimental data and identifying new perturbations, MGM proved its utility for
microbiome research, particularly for exploring therapeutic impacts in disease
contexts.
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 1, 2025. ; https://doi.org/10.1101/2024.12.30.630825doi: bioRxiv preprint
Figure 7. In silico enrichment and reduction analysis on CD dataset. a. In silico
analysis of six families reported as increased in the Terminal ileum of CD patients. b.
In silico analysis of six families reported as increased in the Rectum of CD patients. c.
In silico analysis of six families reported as increased in the Terminal ileum of CD
patients treated with antibiotics. d. In silico analysis of six families reported as
increased in the Rectum of CD patients treated with antibiotics. e. In silico analysis of
novel families identified as increased in the Terminal ileum of CD patients. f. In silico
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 1, 2025. ; https://doi.org/10.1101/2024.12.30.630825doi: bioRxiv preprint
analysis of novel families identified as increased in the Rectum of CD patients. g. In
silico analysis of novel families identified as increased in the Terminal ileum of CD
patients treated with antibiotics. h. In silico analysis of novel families identified as
increased in the Rectum of CD patients treated with antibiotics. ‘bos’ indicates
enrichment simulation by shifting a genus token to the next position after the ‘bos’
token, while ‘eos’ indicates reduction simulation by shifting a genus token to the
position before the ‘eos’ token. The Y-axis represents the cosine similarity of the
sample embedding to that of healthy controls. Venn diagrams in the center show the
overlap between families reported as increased in the original study and the top six
families showing the largest differences in our in silico analysis, as well as novel
families identified through this analysis.
Discussion
In this study, we proposed MGM, the first foundation model designed for microbial
community analysis, leveraging pre-trained transformers on a diverse corpus of
microbiome data. By employing large-scale self-supervised pre-training, MGM
develops a foundational understanding of microbial interactions within communities,
free from task-specific biases. This general representation captures broad patterns and
relationships across varied microbiome datasets, establishing MGM as a versatile tool
in microbiome research.
Benchmark evaluations underscore MGM's superior performance in microbial
community classification tasks. In cross-validation on the Microcorpus-260K dataset,
the fine-tuned MGM achieved an average ROC-AUC of 0.99, significantly
outperforming traditional methods, including source tracking techniques and machine
learning models. Its application to 43,528 additional samples from MGnify revealed
exceptional performance in deeper, more complex analyses. Furthermore, MGM
embeddings enabled seamless integration of microbiome data across different sources
and batches, highlighting its utility in distinguishing microbial samples for tasks such
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 1, 2025. ; https://doi.org/10.1101/2024.12.30.630825doi: bioRxiv preprint
as microbial source tracking.
To tailor these general insights to specific microbiome-related tasks, MGM employs a
contextualization approach. By fine-tuning the foundation model on task-specific
datasets, MGM adapts its learned representations to align with the unique
characteristics and nuances of the target task. This process enhances the model’s
performance on various microbiome-related tasks by leveraging the robust, broad
knowledge acquired during pre-training. Through systematically analyzing the
performance of MGM, we gain valuable insights into how pre-trained language model
can be applied to microbiome data analysis, showing MGM’s potential for enhancing
performance on various microbiome-related tasks.
Beyond classification, MGM captures the spatial and temporal dynamics of microbial
communities. In cross-regional intestinal disease datasets, MGM overcame regional
limitations, achieving accurate diagnoses across intercontinental regions. When
applied to a longitudinal infant dataset, the model effectively traced the development
and maturation of the infant gut microbiome. Attention-weight analyses across
developmental stages and delivery modes identified key genera, such as Bacteroides
and Bifidobacterium, which were more prominent in vaginal deliveries, while
Haemophilus showed higher weights in cesarean deliveries.
In silico perturbation experiments further highlighted MGM's clinical potential.
Fine-tuning on the TCMA database revealed genera with significant deleterious
effects across various tumor types, suggesting its utility in identifying microbial
targets for cancer therapy. When applied to a Crohn’s disease dataset involving
antibiotic treatment, MGM detected microbial perturbations in the intestinal mucosa
consistent with the original study, along with novel findings, such as shifts involving
Alcaligenaceae and Odoribacteraceae , later corroborated by independent research.
These findings illustrate MGM’s sensitivity to microbial community changes and its
potential for therapeutic applications.
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 1, 2025. ; https://doi.org/10.1101/2024.12.30.630825doi: bioRxiv preprint
While MGM model shows promise, several limitations need to be addressed. The
primary limitation lies in its rank value encoding method. While this method
effectively mitigates the impact of extreme values and converts tabular data into
sequential data, it fails to preserve the original relative abundance information. This
shortcoming complicates the reconstruction of samples into their original abundance
tables, limiting the model’s generative capabilities. Future work should focus on
refining the encoding process to better retain relative abundance information. Another
area for improvement involves the expansion of the model's training dataset. While
MGM has demonstrated strong performance, its generalizability could be further
enhanced by incorporating a broader range of microbiome samples from different
biomes and populations. Given the model's ease of fine-tuning, updating it with
additional datasets is a practical way to boost its adaptability across various
microbiome-related tasks. Additionally, while the model shows great promise in
identifying treatment targets and keystone genera, incorporating wet-lab experimental
validation would strengthen the robustness and comprehensiveness of our findings.
In conclusion, MGM represents a profound advancement in microbiome research,
offering a robust and adaptable tool for analyzing microbial communities. As a
foundation model, MGM exceled in large-scale microbial classification, leveraging
vast pre-trained datasets to capture fundamental patterns that span diverse microbial
ecosystems. In its contextualized form, MGM could be fine-tuned for specific
downstream tasks, such as identifying clinically relevant microbial perturbations and
uncovering nuanced microbial interactions. This dual capacity to model both general
and task-specific patterns underscored its broad applicability in microbiome science,
including therapeutic interventions and diagnostic innovations. As a powerful
foundation model, MGM paves the way for future innovations in microbiome analysis,
contributing to a deeper understanding of microbial ecosystems and their roles in
human health.
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 1, 2025. ; https://doi.org/10.1101/2024.12.30.630825doi: bioRxiv preprint
Methods
Data Preprocessing
We assembled a comprehensive dataset, Microcorpus-260K, which includes all
samples from MGnify up to June 2023. Initial processing involved retaining
genus-level relative abundances and filtering out genera with relative abundances less
than 0.01%. Samples were further filtered to retain only those with at least 10 genera
with non-negligible abundance, resulting in a final dataset of 263,302 samples and a
vocabulary of 9,665 genera.
For each sample, we standardized the relative abundances of each genus using their
mean and standard deviation across all samples. These standardized values were then
rank encoded to prepare the data for model input. The means and standard deviations
calculated during this step were saved for future standardization of downstream data.
Model Architecture
We constructed MGM model using eight layers of transformer blocks, with each
block consisting of a self-attention layer and a feed forward neural network layer.
Given the fixed input length requirement of transformer models, we set the input
length to 512 tokens, covering 99.99% of the samples. This length ensured that most
samples could be processed without truncation, preserving the integrity of the data.
Additional key hyperparameters were as follows: activation function, Gaussian Error
Linear Unit (GELU); attention heads per layer, eight; feed forward size, 1024. The
modeling framework was implemented in PyTorch, leveraging the Huggingface
Transformers library for model configuration and training [54].
The self-attention mechanism employed in each transformer layer follows the scaled
dot-product attention formula:
Attention /g4666 /g1843 ,/g1837 ,/g1848 /g4667 /g3404s o f t m a x /g4678 /g1843 /g1837 /g3021
/g3493 /g1856 /g3038
/g4679/g1848
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 1, 2025. ; https://doi.org/10.1101/2024.12.30.630825doi: bioRxiv preprint
Where /g1843 (queries), /g1837 (keys), and /g1848 (values) are the linear projections of the
input, and /g1856 /g3038 is the dimensionality of the keys. This formulation allows the model
to capture contextual relationships between different tokens in the sequence,
enabling more effective representation learning for downstream tasks.
Pre-training Procedure
Pre-training was conducted using the causal language modeling approach with
self-attention mechanism to capture co-occurrence pattens among tokens. For each
sample, we appended a ‘bos’ token at the beginning and an ‘eos’ token at the end to
denote the start and end of a sequence, respectively. Different from transformer
encoder models like BERT, which randomly masked some tokens and predicted them
in the output, our autoregressive model was trained to predict the next possible token
in the sequence based on known input tokens, facilitating the learning of contextual
relationships among genera within a sample. Specifically, the order of predicted
tokens also implied genera relative abundance distribution in the sample.
Training was executed using Huggingface's Trainer API. Key hyperparameters
included: learning rate, 1e-3; batch size, 50; warmup steps, 1000; weight decay, 0.001;
validation split, 10% of the data. Validation loss is calculated per 500 training steps,
Early stopping based on validation loss with 3 patience.
Model interpretability
We conducted an interpretability analysis by leveraging the attention weights
extracted from the multi-head, multi-layer transformer. These attention weights were
modified by replacing /g1848 with /g1848 /g2868, where /g1848 /g2868 represents one-hot indicators for each
position index. To consolidate the attention information across the model, we
integrated the attention matrices by calculating an element-wise average across all
layers and attention heads. To identify the genera with the highest attention weights in
a microbial community, we summed the attention weights across each column,
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 1, 2025. ; https://doi.org/10.1101/2024.12.30.630825doi: bioRxiv preprint
obtaining the total attention weight from a single genus to all other genera in the
community.
Sample Representation
Each sample is analogous to a ‘sentence’ composed of genera, and its representation is
obtained by aggregating the learned genus-level representations. In this study, we
opted element-wise mean pooling to get the sample representation from our
pre-trained model. For fine-tuned model, as the last token (‘eos’ in this study) was
used to do the sequence classification, we used its embeddings as the sample
representation.
Downstream Fine-tuning
For downstream tasks, the pre-trained MGM model is fine-tuned by replacing the
language modeling head with a task-specific head. All downstream tasks in this study
focused on microbial community classification. Fine-tuning employed a sequence
classification head, which utilized the final token (‘eos’ in this study) for classification.
Fine-tuning was executed using Huggingface's Trainer API. Key hyperparameters
included, learning rate, 1e-3, batch size, 50, warmup steps, 1000, weight decay, 0.001,
validation split, 10% of the data. For microbial classification task on
MicroCorpus-260K, For the microbial classification task on the MicroCorpus-260K
dataset, validation loss was calculated every 500 training steps, while for other
downstream tasks, it was calculated per training epoch. Early stopping based on
validation loss with 3 patience.
The evaluation of both the microbial classification task and the infant age prediction
task was performed using a 5-fold cross-validation strategy. Training was conducted
on 80% of the samples, with performance tested on the 20% held-out samples, and
this process was repeated across five folds. For the cross-regional disease diagnosis
task, 50% of the samples from each region were split as the test set using stratified
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 1, 2025. ; https://doi.org/10.1101/2024.12.30.630825doi: bioRxiv preprint
sampling, while the remaining 50% were used either to train the diagnostic model or
fine-tune a model trained on another region. Notably, the fine-tuning applications
were trained on classification objectives distinct from the causal language modeling
objective, making the inclusion of task-specific data in the pre-training corpus
irrelevant to classification predictions.
Comparison methods
FEAST: FEAST is an expectation-maximization-based method that estimates the
proportion of the sink community contributed by various source environments [34].
For benchmarking, we employed the R package implementation of FEAST.
EXPERT: EXPERT is an ontology-aware neural network method that leverages
transfer learning for microbial community classification [22]. We benchmarked
EXPERT using its Python package.
DKI: DKI is a deep learning-based approach designed to identify keystone species in
microbial communities [43]. We utilized the scripts provided in DKI’s GitHub
repository to validate the keystone microbes identified by MGM.
Other Machine Learning Methods: Additional machine learning models used for
benchmarking, including K-Nearest Neighbor, Logistic Regression, and Random
Forest, were implemented using the scikit-learn library [55].
Code Availability
The code for MGM model is available at
https://github.com/HUST-NingKang-Lab/MGM.
Acknowledgments
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 1, 2025. ; https://doi.org/10.1101/2024.12.30.630825doi: bioRxiv preprint
This work was partially supported by the National Key R&D Program of China
(Grant No. 2023YFA1800900 and 2018YFC0910502), the National Natural Science
Foundation of China (Grant Nos. 32071465, 31871334, 81827901). Numerical
computations were performed on the Hefei Advanced Computing Center.
Author contributions
HZ and KN conceived of and proposed the idea, designed and developed the
framework. HZ and YZ performed the experiments and analyzed the data. HZ, YZ
and ZK visualized the data. HZ, YZ, ZK, LS and KN contributed to editing and
proof-reading the manuscript. All authors read and approved the final manuscript.
Competing interest
The authors declare that they have no competing interests.
Ethics approval and consent to participate
Not applicable.
References
1 . Fl i n t, H. J. , et al . , The r o le o f the gut microb iota in nutriti on an d health. N a t R e v
Ga s troe nter o l H e pa to l, 2012 . 9 (10) : p. 57 7-89 .
2. Trem aro l i, V. a nd F. Backhed, F unctio na l in te rac tions betw een the g ut microb i o ta and
hos t me tabolism. N a tu re , 201 2 . 489 ( 7415): p. 242 -9.
3. Mc C ar ty, N .S. and R . Le des ma-A maro , Synthe tic B iology T ools to Eng i nee r M ic r obia l
Comm unities fo r B iotec hno logy. Trends Biote ch no l, 20 19 . 37(2 ): p . 181 -197.
4 . K e , J . , B . W a n g , a n d Y . Y o s h i k u n i , Microb i om e E ngineer in g : Synthe tic Biology o f
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 1, 2025. ; https://doi.org/10.1101/2024.12.30.630825doi: bioRxiv preprint
Plant-Ass o ciated M icr ob io mes in Sustainable Ag ric ultu re. Trend s Bio tec h nol, 2 021 .
39(3) : p. 244- 261.
5. Baker, B .J . and J .F . Banf ie l d, Mic robial co m mu n ities i n acid m ine drai nag e. F E M S
Mic rob io l E col, 2 003. 44(2) : p. 139- 52.
6. Anderse n , R . , S.J . Chap ma n , and R .R .E . Ar tz, M ic r o bia l c o m mun i ties i n na tura l and
disturbed pe a t land s : A re vie w . So il B io logy and B i o che mis t ry , 2013 . 57 : p. 979- 9 94 .
7. Han, D . , et a l ., O rganelle 16S rRN A a m plicon sequencing enables p ro filing o f activ e
gut m icrobiota i n mu rin e mode l. App l Mic rob i ol B i ot ech no l, 2022 . 106 ( 1 7) : p.
5715 - 5 728 .
8. Richar d son , L. , e t a l. , M Gn i fy: the microb io me sequence da ta analys i s re sou rc e in
2023 . N uc l eic Ac ids R e s , 202 3 . 51 (D 1): p . D753 -D759 .
9. David, M. M ., e t a l ., R ev ea ling Ge ne ra l Patte rns o f M ic robiom es T hat Tran sc end
Systems : Po te ntia l an d Cha llenges o f D eep Tran s fe r Lea rn in g . mSystem s, 2022 . 7 (1 ):
p. e0105821.
10. Kyrpid es, N .C. , E .A. E loe-F adrosh, and N . N . I vano v a, Microb iome Data Sc i en ce:
Unders tan ding Ou r M ic ro bia l P l ane t. Tre n ds Microb io l , 2016 . 24(6 ): p . 42 5 -427 .
11. Duvallet , C ., et a l . , Me ta-analy s is of gu t microb io me studies identif ie s d i sease-sp ec ific
and s ha red respons es . Na t Com mun, 20 17. 8 ( 1): p. 1784.
12. Wir bel , J ., et al. , Mic rob i ome me ta -ana ly s is an d cross -diseas e co mpariso n enab l ed by
the S IAMCA T machine learn ing to o lbox. Genome Bio l , 2021 . 22(1 ): p . 93 .
13. Hoarf ros t, A. , e t a l ., Deep le a rning o f a ba cte ria l a nd a rchaeal un iv ersal languag e o f
life ena b les trans fe r lea rni ng a nd illu minates m ic rob ia l da rk matter . N at Com mun,
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 1, 2025. ; https://doi.org/10.1101/2024.12.30.630825doi: bioRxiv preprint
2022 . 13 (1 ): p . 2 6 06.
1 4 . L i n , Z. , et a l. , Ev o lutio na ry -sc a le pred iction of a tom ic- l e vel p rotein struc tu re w ith a
lan guage mode l. Science, 2 023. 379 (663 7): p . 1123 -113 0 .
1 5 . Hwa n g , Y., e t al ., Gen o mic lang u age m o d el predic ts p rote in co -regul a tion and fun c tion.
Nat C om mun, 2 024 . 15 ( 1) : p . 288 0.
16. Yosin s ki, J ., e t a l., How trans ferable are feature s in dee p neural ne tw o rk s? , in NIPS .
2014 .
17. Tan, C. , e t a l . , A S urv ey on Deep T ran s fer Learn in g , in Art i f ic ial N eu ral Ne tw or ks and
Mac h ine L ear ning - - ICAN N 2 01 8 . 2018 . p. 27 0- -279 .
18. He, K . , e t a l . , D eep Residua l Lea r ning for Im age Reco gn itio n , i n 2 016 IEEE
Conferen c e on Co mpu t er V isio n and Pa ttern Re c ogniti on (C VPR) . 2 016. p. 770 -77 8.
19. Gu r u r a nga n, S. , e t al. Don’t Stop Pretra inin g: Adapt Language M od els to D oma ins
and Ta s ks . 2020 . On li ne: Ass o ciat i on fo r Compu t at iona l L inguis t i cs .
2 0 . He , K., e t al . Mas ke d Au toenco de rs A re Scalable Visi o n Learn e rs . in 202 2 IEE E /CVF
Conferen c e on Co mpu t er V isio n and Pa ttern Re c ogniti on (C VPR) . 2 022.
21. Ito , M. , Y . Glas e r, an d P . Sadows ki, Evo lu tion - I n for med N eura l N e t w orks f or
Mic rob io me D ata Ana l ys i s , in 2021 IEEE Inter natio na l C o nfe rence o n Bioin for matics
and Biome dicine (B IB M) . 2021. p. 338 6 -3 391.
22. Chong, H ., et al ., EXP ERT: tr an sfer lea rnin g-en able d c on tex t - a w a re m ic ro bia l
commun it y c l ass ification . B rie f Bioin for m , 2022 .
23. Gu o , S . , e t a l . , A neu ral ne twork -ba se d framewo rk to unders tand the type 2
dia be tes- re l ated a ltera tion of the human gut m icr ob io me . I meta , 20 22 . 1 (2) : p. e20 .
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 1, 2025. ; https://doi.org/10.1101/2024.12.30.630825doi: bioRxiv preprint
24. Erhan, D. , et a l . , Wh y Doe s Un su pervise d Pre -tra ining H elp Deep Learn in g? J . M ach .
Learn. R es., 2010. 11 : p . 62 5 –660.
25. Radford , A. , e t al . L an guage Models a re Unsupervised Multi ta sk L earne rs . 20 19.
26. Vaswani , A. , e t a l . , At ten tion i s a ll y ou n eed, in P r oc eed i ngs of the 31st In tern a tional
Conferen c e on Neu ral In fo rma ti on P rocessing S y s tems . 2 0 1 7, Cu r ra n As s o c i at e s In c . :
Long B e ach , Cal iforn ia , USA. p . 6 000 –60 10 .
27. Devli n, J ., et a l . B ERT : Pre-training o f D ee p B i d irec ti ona l Tran s fo rmers fo r Langu age
Unders tan ding . i n P roc eed in gs o f the 2 019 C onfe rence o f the North American
Chapter o f the As s oc ia tion fo r C o mputatio na l Lingu is tic s : Hu man L angua ge
Tech nolo gi e s, Volume 1 (Long and Sho rt Pap ers) . 20 19 .
28. Yang, F ., et a l. , s cBERT as a l arge -scale pretr ained d e ep la nguage mode l for cell type
anno ta tio n o f sing le - cell RN A-seq d a ta . Nature Machine Inte llig en ce , 2 022. 4 (10) : p .
852-86 6 .
29. Theod or is , C .V. , e t al . , T ransfer lea r ning e nab les p redictions in n etw ork biology .
Nature , 202 3. 618(796 5 ): p. 616- 6 24 .
30. Cui, H . , e t al. , s cGPT : tow ard b u ild i ng a founda tion m odel fo r sing l e-ce ll multi-o mics
usi ng gen e rative AI. N at Method s , 2024 .
31. Hao, M. , e t a l., L arge -sc a le f ound at i on mod e l o n s ingle-c ell tran s cr ip to mic s . N a t
Met h ods , 20 2 4.
32. Simp s on, J . M., J. W. Santo Do mingo, an d D.J . R e a sone r, M i c rob ial s ou rce tracking :
sta t e o f the sc ie nce . Env i ron Sc i Te chnol, 2002. 36(24 ): p. 5 279 -88 .
33. Shang, J. , e t a l . , Gut Mic rob i o me Ana ly s is C an B e Used as a Non in vas i ve Diagn ostic
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 1, 2025. ; https://doi.org/10.1101/2024.12.30.630825doi: bioRxiv preprint
Tool a nd P lays an Essentia l Ro le in the Onse t of Me mbran o us Nephr opathy. A d v S c i
(We inh ), 2 022 . 9 (28) : p. e2201581.
34. Shenha v, L ., e t al. , FEAST : f a st ex pecta ti on - maxim iz a t ion fo r m i c rob ial s ou rce
trac k in g. Nat Method s , 2019 . 16 (7 ): p . 62 7 - 6 32.
35. Clooney, A. G ., et al., R a nk ing microbio m e v ariance in in fl am ma to r y bowe l d is eas e: a
large long i tud inal in ter c on tinenta l s tud y. Gut , 20 21 . 70(3) : p . 499 -510.
36. Mc D onald, D. , et al., Amer ica n Gut: an O p en Pla t for m fo r C i t izen S cienc e M icrob io me
Researc h . mSy s te ms , 2018. 3 (3 ).
37. Wang, N. , M . C h e ng, and K. Ning , Overcom i ng reg ion a l limitation s : transfer l e arning
for cr os s -reg ion a l m icrobial- b ased d i ag nos is o f d i sea s es. Gut, 2023 . 72(10) : p.
2004 - 2 006 .
38. Roswall, J. , e t a l. , D e v elopmen ta l tra je c tor y of the hea lthy hu man gut m ic rob iota
during the f i r st 5 years o f life . Ce ll H ost & Microb e , 2021. 29(5) : p. 7 65 -+ .
39. Back hed, F ., e t a l., D y na mic s and S tab iliza tion of the Hu ma n G ut Mi c rob iome d uring
the F irst Yea r o f Li fe. Ce ll Host M icrobe, 2015 . 17 (5 ): p . 6 9 0-703 .
40. Hughes , R .L. , e t a l. , In fan t gu t mic rob iota c ha rac ter i s tic s ge ne ra lly do no t mo dif y
effe c ts o f li p id -bas ed nutrient s uppl e men ta tion on g row th or in fl a m ma ti on : se c ond ary
anal ys i s o f a r andom iz e d cont rolle d t r ial in M al aw i . Sc i Rep , 20 20 . 10(1) : p . 14861 .
41. Shin, J. H ., et al ., Bactero ides and re late d s pec ie s : The key s tone tax a of the hu man
gut micr obi o ta . Ana erob e , 2024. 85: p . 1 0 2819.
42. Hudault , S. , et a l. , Rela t ions h ip b e tw e e n in te st in a l c olo ni z at io n of B i f idobacte r ium
bifi du m in infa nts and the pre sen c e of ex o genous and endog enou s grow th -pro m oting
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 1, 2025. ; https://doi.org/10.1101/2024.12.30.630825doi: bioRxiv preprint
fac tors in the ir stools. P e d iat r Re s, 19 94 . 35(6) : p . 696- 700.
43. Wang, X. W ., et a l . , Iden tify ing keystone s pecies in m i c robial co mmuniti es u s in g d eep
lea r ning . Na t Eco l E v o l , 2 0 24. 8 (1 ) : p. 2 2- 31.
4 4 . M c A l l i s t e r , F . , e t a l . , The Tu mor Mic ro bi o me i n Pan creatic C an c er: Bac teri a and
Beyon d. C a nce r Ce l l, 2019 . 36 ( 6): p. 577 -57 9.
45. Nejma n , D., et al., Th e hu man tumo r m icrobio me is co m pose d of tum or ty p e-s pecific
intrac e ll ula r bacter ia. Sc i ence , 2020 . 368( 6494): p . 9 73-980 .
46. Mat s on, V ., e t al ., The c o m me nsa l m ic ro b io me i s assoc i ated w ith a n ti-PD -1 effica c y in
me ta s t ati c mel a n oma p a ti e nt s . Sc ience, 20 18. 359 (6 3 71) : p . 104 - 1 08.
47. Hanahan, D ., Hallmark s o f C an ce r: New Dimensio n s. Cance r Dis c ove ry , 2022 . 12 (1 ):
p. 31- 4 6 .
48. Dohlm a n, A . B ., et a l. , The cancer m icrobiom e a t las: a p an- canc e r co mp a ra tiv e
anal ys i s to d isti ngu is h tissu e- r e s ident mic rob i o ta from c onta minan ts . Ce ll Host
Mic robe , 20 21 . 29(2) : p . 281 -298 e5.
49. Belibasakis , G .N. , e t al ., V iru lenc e an d Patho gen icity P roper ties of Aggre ga tiba c ter
actino my ce temco mitans . Pa th ogen s, 201 9. 8 (4) .
5 0 . Ni e mi n e n, M.T., et al ., T repone ma dentic ola chymo try ps in-like p rote in as e may
con tribu te to or odig e stive c a rc inog enes is t h rough im mu no modula tion . B r J Canc e r,
2018 . 118 (3 ): p . 428-434 .
5 1 . Ge v er s, D., et a l. , The treatment- naiv e microbiom e in new -onset C roh n's d isea s e. C e l l
Host M i cr obe, 2 014 . 15 ( 3) : p . 382 -392.
52. He, Z., et a l., M i cro biota in me s en teric adipos e tiss ue fro m Cr o hn's dise as e prom ote
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 1, 2025. ; https://doi.org/10.1101/2024.12.30.630825doi: bioRxiv preprint
colit is in mice. M i cr o bi ome , 2 0 2 1. 9 (1 ): p . 228.
5 3 . L i u , T. C., et al. , P a ne th cell d efe cts in C ro hn' s d is e ase patie n ts pro mo te dysbios is. J CI
Insight , 20 16. 1 (8) : p . e86907 .
54. Wol f, T. , Hugg i ngfa c e's tr ansfor mer s: S ta te -o f- the-a rt natu ra l la ng uage p rocessing.
arXiv prepr in t a rXi v :19 10.03771, 20 19 .
55. Pedregos a , F. , et a l . , S c ik it- lear n: Ma ch in e Lea rnin g in Py thon . J. Mac h. Lear n. Res . ,
2011 . 12 (nu ll) : p. 2825 –2 830 .
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 1, 2025. ; https://doi.org/10.1101/2024.12.30.630825doi: bioRxiv preprint
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 1, 2025. ; https://doi.org/10.1101/2024.12.30.630825doi: bioRxiv preprint
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 1, 2025. ; https://doi.org/10.1101/2024.12.30.630825doi: bioRxiv preprint
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 1, 2025. ; https://doi.org/10.1101/2024.12.30.630825doi: bioRxiv preprint
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 1, 2025. ; https://doi.org/10.1101/2024.12.30.630825doi: bioRxiv preprint
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 1, 2025. ; https://doi.org/10.1101/2024.12.30.630825doi: bioRxiv preprint
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 1, 2025. ; https://doi.org/10.1101/2024.12.30.630825doi: bioRxiv preprint
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted January 1, 2025. ; https://doi.org/10.1101/2024.12.30.630825doi: bioRxiv preprint
Text is read by the "Ask this paper" AI Q&A widget below.
Extraction quality varies by source — PMC NXML preserves structure
cleanly, OA-HTML may include some navigation residue, and OA-PDF can
have broken hyphenation. The publisher copy
(via DOI)
is the canonical version.