BioGeoFormer: A deep learning approach to classify unknown genes associated with critical biogeochemical cycles

preprint OA: closed
Abstract
Remote functional annotation continues to impede progress in microbial ecology, as alignment-based approaches still leave over one-third of microbial sequences functionally unresolved. In contrast, pre-trained natural language processing approaches have shown strong potential for inferring functions from diverse biological sequences. Here we introduce a protein language modeling approach that classifies sequences into 37 defined key pathway categories involved in 4 major biogeochemical cycles (methane, sulfur, nitrogen, and phosphorus). To do so, we fine-tuned ESM2-8m using databases curated for biogeochemical cycling pathways. The resulting BioGeochemical cycling transFormer (BioGeoFormer, or BGF) performed well on validation and test sets, producing embeddings that can infer protein function at the metabolic pathway level. To demonstrate its utility relative to current informatics approaches, we applied BGF to a dataset of metagenome-assembled genomes (MAGs) constructed from methane-fueled, deep-sea "cold seep" environments. We employed multiple gene assignment approaches to identify gene function within these MAGs. A total of 1.05M genes were assigned biogeochemical functions; BGF alone suggested putative ecosystem roles for 0.49M (46%) of these at a confidence of 85% or greater, and these genes were classified as unknown by the other approaches. Across the pathways of interest, BGF identified, on average, 6 times as many genes as hidden Markov models (HMMs) and alignment-based approaches. BGF provides a novel tool capable of informing process-based hypotheses in diverse systems, highlighting cryptic proteins most notably linked to methane, nitrogen, and phosphorus cycling while uncovering the mysteries within microbial dark matter.
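The abstract reports assignments "at a confidence of 85% or greater" but does not say how that confidence is computed. The sketch below is one plausible reading, not the authors' implementation: it assumes the classifier emits one logit per pathway category, takes the maximum softmax probability as the confidence, and labels anything below the threshold as unknown. The function names and the three pathway labels are hypothetical placeholders (the paper defines 37 categories).

```python
import math

# Threshold taken from the abstract ("a confidence of 85% or greater").
CONFIDENCE_THRESHOLD = 0.85


def softmax(logits):
    """Convert raw classifier logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def assign_pathway(logits, labels, threshold=CONFIDENCE_THRESHOLD):
    """Return (label, confidence); label is 'unknown' below the threshold.

    Assumption: confidence is the maximum softmax probability. The abstract
    does not specify the authors' actual confidence measure.
    """
    if len(logits) != len(labels):
        raise ValueError("need exactly one logit per label")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] >= threshold:
        return labels[best], probs[best]
    return "unknown", probs[best]


# Hypothetical labels for illustration only.
labels = ["methane_oxidation", "nitrogen_fixation", "sulfate_reduction"]

confident = assign_pathway([6.0, 0.5, 0.2], labels)  # one logit dominates
ambiguous = assign_pathway([1.0, 0.9, 0.8], labels)  # no clear winner
```

With these inputs the first call clears the threshold and returns the methane label, while the second falls back to "unknown" because no single category dominates.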
Author summary
When investigating the function of microbes in the environment, scientists are often left with vast numbers of genes or proteins whose function is entirely unknown. This represents a huge amount of information left to be discovered in many fields of biology. One recent approach that has shown significant potential for understanding unknown proteins is protein language modeling, a family of deep-learning methods that leverage large datasets to learn the 'language' of proteins. We aimed to apply protein language modeling to better understand the function of proteins as they relate to large-scale environmental transformations of nutrients and carbon. Specifically, we designed our approach to characterize the metabolism of microbes that affect methane, nitrogen, phosphorus, and sulfur, all of which strongly influence the planet's function and health. We applied our new approach to a deep-sea microbiology dataset and showed its utility for understanding the function of proteins and their impact on the environment. Overall, we found that our method is an important new addition to the toolset of environmental scientists working to better understand the function of microbes and their proteins.

Competing Interest Statement
The authors have declared no competing interest.


Publication year: 2025

Source provenance

europepmc
last seen: 2026-05-20T01:45:00.602351+00:00