{"paper_id":"1ef403b2-2b2a-4b67-940f-0f8c7278a002","body_text":"Abstract\nBacteria have evolved a vast diversity of functions and behaviours, which are currently incompletely understood and poorly predicted from DNA sequence alone. To understand the syntax of bacterial evolution and discover genome-to-phenotype relationships, we curated over 1.3 million genomes spanning bacterial phylogenetic space and represented each as an ordered sequence of proteins. Collectively, these sequences were used to train a transformer-based, contextualised protein language model, Bacformer. By pretraining the model to learn genome-wide evolutionary patterns, Bacformer captures the compositional and positional relationships of proteins and can accurately predict protein-protein interactions, operon structure (which we validated experimentally), and protein function; infer phenotypic traits and identify likely causal genes; and design template synthetic genomes with desired properties. Thus, Bacformer represents a new foundation model for bacterial genomics that provides biological insights and a framework for prediction, inference, and generative tasks.\nCompeting Interest Statement\nThe authors have declared no competing interest.\nFootnotes\nCorrected the model architecture visualization in Figure 1 and incorporated prior work citations.\nhttps://huggingface.co/collections/macwiatrak/bacformer-681a17d6a77a928a1531def2","source_license":"CC-BY-4.0","license_restricted":false}