Resolving Genome-to-Phenotype Links in Bacteria: Machine-Learned Inference from Downsampled k-mer Representations

doi:10.64898/2026.02.18.705352

2026 · doi:10.64898/2026.02.18.705352

preprint OA: closed

Abstract

Standard approaches to bacterial phenotyping often treat the entire genome as the fundamental unit of information, resulting in high-dimensional inputs that may contain significant redundancy. Consequently, current bacterial phenotyping techniques typically rely on the assumption that entire sequences are required for accurate predictions. While downsampling based on min-hashing or prefix filtering has been used for clustering, its utility as a direct input for predictive machine learning remains underexplored. Here, we show that a novel prefix-based downsampling algorithm can reduce the size of genomes while maintaining relatively high predictive accuracy on phenotype prediction tasks. By combining a prefix reduction strategy with the specificity of short k-mers, we developed a method to downsample entire genomes into k-mer frequency matrices and k-mer-on-a-string representations. We found that ensemble models, such as Random Forest and Gradient Boosting, trained on k-mer frequency matrices from downsampled genome representations outperformed more complex deep learning architectures trained on the same downsampled representation, particularly on datasets with limited data or highly similar genomes. We were able to demonstrate explainability by tracing the k-mers with the most impact on the models back to genes coding for the specific phenotype. Our results demonstrate that downsampling genomic data can yield models with good predictive power, thus establishing an alternative when using full genomes is infeasible. We present an approach that offers relatively high performance on bacterial phenotyping tasks and demonstrates a path toward lightweight Genome Language Models that will enable analysis of entire genomes.

Competing Interest Statement

The authors have declared no competing interest.
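The abstract does not spell out the downsampling algorithm, so the following is only an illustrative sketch of the general idea of prefix filtering feeding a k-mer frequency matrix. It assumes the filter keeps just those k-mers that begin with a fixed prefix; the function names, the choice of `k=4`, and the prefix `"AC"` are hypothetical and not taken from the paper.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def prefix_filtered_kmers(sequence, k, prefix):
    """Yield every k-mer in `sequence` that starts with `prefix`.

    Assumption: this is one simple form of prefix filtering, not
    necessarily the algorithm used in the paper.
    """
    sequence = sequence.upper()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        # Skip k-mers containing ambiguous bases such as N.
        if kmer.startswith(prefix) and set(kmer) <= set(ALPHABET):
            yield kmer


def kmer_vocabulary(k, prefix):
    """List all possible k-mers that start with `prefix` (fixed column order)."""
    tails = product(ALPHABET, repeat=k - len(prefix))
    return [prefix + "".join(tail) for tail in tails]


def frequency_matrix(genomes, k=4, prefix="AC"):
    """Build one row of relative k-mer frequencies per genome."""
    vocab = kmer_vocabulary(k, prefix)
    rows = []
    for seq in genomes:
        counts = Counter(prefix_filtered_kmers(seq, k, prefix))
        total = sum(counts.values())
        rows.append([counts[km] / total if total else 0.0 for km in vocab])
    return vocab, rows


if __name__ == "__main__":
    toy_genomes = ["ACGTACGTAC", "TTTTTTTT"]  # toy strings, not real data
    vocab, rows = frequency_matrix(toy_genomes)
    print(len(vocab), "feature columns")
    print(rows[0][vocab.index("ACGT")])
```

Under this assumption, a prefix of length p keeps roughly one k-mer in 4^p on a sequence with uniform base composition, and the column count drops from 4^k to 4^(k-p). The resulting rows could then be passed to a tree-ensemble classifier of the kind the abstract mentions.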

Source provenance

europepmc: last seen: 2026-05-20T01:45:00.602351+00:00