Machine Learning Predicts Antimicrobial Resistance from Genomic Data across ESKAPEE Pathogens

doi:10.21203/rs.3.rs-7190203/v1

2025 · doi:10.21203/rs.3.rs-7190203/v1

preprint OA: closed


Biological Sciences - Article

Anargyros Skoulakis, Konstantinos Daniilidis, Stefanos Digenis, and 3 more

This is a preprint; it has not been peer reviewed by a journal.

https://doi.org/10.21203/rs.3.rs-7190203/v1

This work is licensed under a CC BY 4.0 License

Status: Posted (Version 1)

Abstract

Antimicrobial resistance (AMR) is a mounting global crisis, fueled by the rapid emergence of multidrug-resistant bacteria. Among the most concerning culprits are the ESKAPEE bacteria (Enterococcus faecium, Staphylococcus aureus, Klebsiella pneumoniae, Acinetobacter baumannii, Pseudomonas aeruginosa, Enterobacter spp., and Escherichia coli), which are leading causes of hospital-acquired infections worldwide.
In this study, we developed and validated machine learning models for predicting antimicrobial resistance phenotypes directly from genomic data. We assembled a robust dataset of 18,916 ESKAPEE genome assemblies, each paired with its corresponding antibiogram, covering susceptibility results for 40 different antibiotics. Using these data, we trained Random Forest and Extreme Gradient Boosting (XGBoost) models for each antibiotic separately, which consistently demonstrated excellent predictive performance, achieving over 90% recall and F1 score for almost all pathogen–antibiotic combinations. To maximize the utility and accessibility of our findings, we developed an interactive web platform (https://dianalab.e-ce.uth.gr/amrpredictor/) that allows users to explore prediction outcomes and identify the most informative genomic features driving resistance using SHAP values. Furthermore, we rigorously validated our approach in a clinical setting. We applied our prediction pipeline to metagenomic sequencing data obtained from 36 blood culture-positive ESKAPEE samples. This real-world evaluation revealed a strong concordance between our predicted resistance profiles and conventional phenotypic results. Importantly, this metagenomic dataset also serves as a valuable, independent benchmark for future research in developing and evaluating AMR prediction models across ESKAPEE pathogens. Our work underscores the transformative potential of integrating genomics and machine learning to provide accurate, interpretable, and clinically actionable predictions for combating antimicrobial resistance.
Keywords: Machine Learning, Genomic Data, Antimicrobial Resistance, Bacteria, Antibiotics, Clinical Diagnostics

Introduction

The widespread and often indiscriminate use of antibiotics in human medicine, agriculture, and animal husbandry has accelerated the emergence and spread of antimicrobial resistance (AMR). This resistance reduces the effectiveness of available treatments, resulting in prolonged illness, increased mortality, and greater healthcare costs. The World Health Organization (WHO) has identified AMR as one of the top ten global public health threats, warning that without urgent action, common infections could once again become deadly [1]. Particularly concerning are multidrug-resistant (MDR) pathogens that are resistant to multiple classes (>3) of antibiotics, thereby severely limiting treatment options. Among these, the ESKAPEE pathogens (Enterococcus faecium, Staphylococcus aureus, Klebsiella pneumoniae, Acinetobacter baumannii, Pseudomonas aeruginosa, Enterobacter spp., and Escherichia coli) stand out due to their clinical relevance and high rates of resistance. These organisms are frequently implicated in healthcare-associated infections such as bloodstream infections, pneumonia, and urinary tract infections [2]. Their capacity to rapidly acquire and disseminate resistance genes makes them a priority for global surveillance and therapeutic innovation [3].

Traditional antimicrobial susceptibility testing (AST) methods remain fundamental in clinical microbiology laboratories and rely on culture-based techniques.
In practice, this involves isolating bacteria from patient samples, growing them on selective media, and testing their response to antibiotics in vitro. However, these methods face inherent limitations, as they are applicable only to cultivable bacteria and require specialized microbiology facilities and trained personnel [4]. They are also time-consuming, often requiring 48–72 hours to produce results, which delays the initiation of targeted therapy and may lead to the empirical use of broad-spectrum antibiotics. Additionally, fastidious microorganisms, such as Mycobacterium tuberculosis or Clostridioides difficile, require even longer incubation times, leading to critical delays in establishing appropriate treatment. Faster and more informative alternatives are urgently needed to guide precision therapy and help curb the spread of antimicrobial resistance [5].

In recent years, computational methods have emerged as transformative tools in the field of AMR prediction and strain characterization. Approaches that detect AMR genes or identify mutations in core genes associated with resistance have been widely used to predict AMR phenotypes. Building on these methods, machine learning (ML) algorithms have increasingly been applied to analyze high-dimensional, heterogeneous datasets, uncovering complex associations between genomic features and phenotypic resistance profiles, patterns that are often not discernible using conventional statistical techniques [5,6]. By integrating genomic, clinical, and environmental data, ML models can deliver accurate predictions of antimicrobial susceptibility, supporting personalized therapeutic decisions and enhancing antimicrobial stewardship efforts [7].

In this study, we address the challenge of AMR by leveraging genomic information, given its central role in determining bacterial resistance phenotypes.
We manually curated and analyzed a dataset of 18,916 genome assemblies from ESKAPEE pathogens, each linked to matched phenotypic AST data. The ML models were trained using high-dimensional feature vectors derived from genomic inputs, specifically k-mer frequency profiles (k = 3, 4, 5) generated from AMR gene protein sequences, upstream promoter DNA sequences, and ribosomal RNA (rRNA) gene sequences for each assembly. Notably, a separate model was developed for each antibiotic, allowing the algorithms to learn resistance patterns specific to each antimicrobial agent. Using these data, we trained Random Forest and XGBoost models to predict resistance to 40 antibiotics, achieving high predictive performance across most pathogen–antibiotic combinations.

To evaluate clinical applicability and benchmark our method against standard susceptibility testing protocols (as performed at the University Hospital of Larissa, Greece), we developed and validated an integrated experimental and bioinformatic pipeline designed to process positive blood cultures from individual patients. Starting directly from clinical samples, this pipeline performs metagenomic sequencing, taxonomic identification, genome assembly, genomic annotation, and resistance prediction using the trained models within 24–48 hours. Results were then compared with conventional AST obtained from blood cultures. The core workflow of our approach is illustrated in Figure 1.

For the first time, our study demonstrates that combining machine learning with genomic data yields a robust, scalable, and accurate framework for AMR prediction that serves both research and clinical needs. As resistance continues to compromise the efficacy of existing antibiotics, such integrative strategies are crucial for enhancing diagnostic precision, applying targeted therapy, and guiding the development of novel therapeutics.
Our findings contribute to ongoing efforts to modernize AMR surveillance and improve patient outcomes in the context of this global health crisis.

Results

Collection and Annotation of ESKAPEE assemblies

Data acquisition for creating machine learning models to predict antimicrobial resistance (AMR) involved retrieving genome assemblies and corresponding phenotypic resistance data (based on MICs) from three major public repositories (NCBI NDARO, BV-BRC [8], and CDC NARMS), focusing on the clinically significant ESKAPEE genera. Integration of these datasets yielded a comprehensive resistance table comprising 19,525 unique assemblies collected globally (Figure 2a). The distribution of resistant, susceptible, and intermediate phenotypes for each antibiotic is shown in Figure 2b.

To ensure data quality, we applied stringent filtering criteria. Assemblies were evaluated based on N50 values, using thresholds of 50,000 bp for Klebsiella, Acinetobacter, Pseudomonas, Enterococcus, and Escherichia, and 20,000 bp for Enterobacter and Staphylococcus, resulting in a high-quality assembly set. MIC values from various testing methods (CLSI, EUCAST) were included; however, final interpretations into susceptible or resistant categories were standardized using CLSI breakpoints (34th edition, 2024). To ensure consistency across datasets, we retained only assemblies with clear MIC-supported phenotype calls under CLSI definitions. Assemblies with MICs falling into the intermediate category were excluded from analyses for the affected antibiotics, as they may introduce ambiguity or reflect historical testing differences. Only assemblies with MICs confidently matching susceptible or resistant phenotypes were kept for downstream analysis. Furthermore, genus–antibiotic pairs with limited representation (fewer than 10 assemblies in either class, resistant or susceptible, or one class accounting for less than 0.1% of the total assemblies for that genus) were excluded to maintain statistical robustness.
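The filtering rules stated so far can be illustrated with a minimal sketch. The function names, the inclusive comparisons, and the use of the genus-wide assembly count as the denominator for the 0.1% rule are assumptions made for illustration; only the numeric thresholds come from the text.

```python
# N50 thresholds in bp per genus, as stated in the text.
N50_MIN = {
    "Klebsiella": 50_000, "Acinetobacter": 50_000, "Pseudomonas": 50_000,
    "Enterococcus": 50_000, "Escherichia": 50_000,
    "Enterobacter": 20_000, "Staphylococcus": 20_000,
}


def passes_n50(genus, n50):
    """True if the assembly meets the genus threshold (inclusive: an assumption)."""
    return n50 >= N50_MIN[genus]


def pair_is_eligible(n_resistant, n_susceptible, genus_total,
                     min_per_class=10, min_fraction=0.001):
    """Keep a genus-antibiotic pair only if the smaller class has at least
    10 assemblies and makes up at least 0.1% of the genus's assemblies."""
    smallest = min(n_resistant, n_susceptible)
    return smallest >= min_per_class and smallest / genus_total >= min_fraction


print(passes_n50("Staphylococcus", 25_000))     # meets the 20,000 bp threshold
print(pair_is_eligible(12, 5_000, 20_000))      # 12 / 20,000 is below 0.1%
```

Expressing both class-size rules through the smaller class keeps the check symmetric between resistant and susceptible labels.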
Antibiotics with fewer than 50 strains with eligible antibiograms were also removed. This multi-step quality control pipeline resulted in a highly curated dataset of 18,916 assemblies with manually verified susceptibility profiles for 40 antibiotics. The distribution of assemblies across genera is shown in Figure 2c and Supplementary File 1. The final quality-controlled assemblies with their respective antibiograms, for each antibiotic, are publicly available on Zenodo (https://zenodo.org/records/16213507), providing a valuable resource for the AMR research community.

After quality control, for each genome, we identified known AMR genes, core genes (genes with known mutations linked to antimicrobial phenotypes), their upstream promoter regions (300 nucleotides upstream of the gene start site), and ribosomal RNA (rRNA) genes. To compile a comprehensive set of AMR-related genes (AMR genes and core genes), we integrated data from three widely used databases: the Comprehensive Antibiotic Resistance Database (CARD) [9], the Reference Gene Catalog [10], and ResFinder DB [11], initially obtaining 14,183 AMR protein sequences. Instead of using nucleotide sequences, we focused on protein sequences, as they better capture functionally relevant variation. To reduce redundancy and keep the feature space manageable, we clustered homologous proteins using CD-HIT [12] at stringent thresholds of 90% sequence identity and 90% alignment coverage, reducing the total to 2,580 representative AMR and core proteins (Figure 2d). For each genome, we then used BLASTx [13] to identify AMR gene and core gene hits, applying a 70% identity and 70% coverage cutoff to retain high-confidence matches. Additionally, we extracted the 300 nucleotides upstream of each AMR gene to capture potential promoter regions, accounting for strand orientation. rRNA genes (5S, 16S, and 23S), central to the mechanism of many antibiotics, were identified using BLASTn.
All detected rRNA gene copies were retained, and for each genome, we constructed a pseudo full-length rRNA operon by concatenating the copies in the correct order. To ensure a consistent feature matrix across all genomes, any missing AMR or core genes were represented by a placeholder sequence (‘XXXX’), maintaining a fixed set of 2,580 AMR/core genes, 2,580 promoters, and 1 pseudo rRNA feature per genome. This comprehensive feature set, combining resistance genes, regulatory elements, and conserved structural components, served as the foundation for predictive modeling.

Machine Learning Models predicting antimicrobial resistance using promoters, AMR genes, core genes and rRNA genes

The development of machine learning models for predicting antimicrobial resistance involved a multi-stage pipeline encompassing data encoding, model training, performance evaluation, and interpretability analysis. Separate models were trained for each antibiotic, resulting in 40 distinct models. Stratified sampling ensured balanced representation across resistance classes and bacterial genera, enhancing generalizability to unseen data.

To represent the genomic content of each bacterial assembly, we applied k-mer frequency encoding using k-mer sizes of 3, 4, and 5. Separate k-mer profiles were generated not only for each genomic feature type (promoters, rRNA genes, AMR and core genes) but also for each individual element within those types, meaning that, for example, gene 1, gene 2, promoter 1, and each rRNA gene each had its own dedicated k-mer profile. For AMR and core genes, k-mers were derived from the protein sequences, while for promoters and rRNA genes, they were based on the DNA sequences. These feature vectors were concatenated to create a comprehensive input for two supervised learning algorithms: Extreme Gradient Boosting (XGBoost) and Random Forests.
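The per-element k-mer encoding described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the text does not say whether raw counts or relative frequencies are used, how the ‘XXXX’ placeholder is treated downstream, or how the features are stored, so relative frequencies, a sparse dictionary, and the `name|k|kmer` key format are all choices made here for clarity.

```python
from collections import Counter

PLACEHOLDER = "XXXX"  # sequence used in the text for a missing gene or promoter


def kmer_frequencies(seq, k):
    """Relative frequency of each overlapping k-mer in seq.

    Returns an empty dict when the sequence is shorter than k.
    """
    n = len(seq) - k + 1
    if n <= 0:
        return {}
    counts = Counter(seq[i:i + k] for i in range(n))
    return {kmer: c / n for kmer, c in counts.items()}


def encode_element(name, seq, ks=(3, 4, 5)):
    """Sparse k-mer profile for one element (one gene, promoter, or rRNA operon).

    Keys have the hypothetical form 'name|k|kmer' so that every element keeps
    its own dedicated profile.
    """
    features = {}
    for k in ks:
        for kmer, freq in kmer_frequencies(seq, k).items():
            features[f"{name}|{k}|{kmer}"] = freq
    return features


def encode_genome(elements):
    """Concatenate the per-element profiles of one assembly into one feature dict."""
    features = {}
    for name, seq in elements.items():
        features.update(encode_element(name, seq))
    return features


# Toy assembly: a protein sequence, a DNA promoter, and a missing gene.
genome = {"gene_1": "MKTAYIAK", "promoter_1": "ACGTACGT", "gene_2": PLACEHOLDER}
vector = encode_genome(genome)
print(len(vector), vector["promoter_1|3|ACG"])
```

A sparse representation is used because the number of possible protein k-mers grows quickly with k (20^5 for k = 5 over the standard amino acid alphabet); a dense matrix over all elements would need a fixed, shared column order instead.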
The dataset of 18,916 assemblies was randomly split into training (80%) and testing (20%) sets using a fixed random seed to ensure reproducibility. To avoid data leakage, models were trained exclusively on the training set, with the test set reserved for unbiased evaluation. Hyperparameters were optimized using grid search and cross-validation. For XGBoost, we tuned the learning rate, maximum tree depth, and number of estimators to balance performance and generalization. For Random Forests, we adjusted the number of trees and their depth to maximize accuracy while minimizing overfitting.

The trained models were evaluated using standard classification metrics (accuracy, precision, recall, and F1-score), calculated separately for each antibiotic and bacterial genus. Among these, recall is particularly critical in clinical settings, as misclassifying a resistant strain as susceptible (a very major error) can lead to inappropriate treatment and patient harm; in addition, ineffective antibiotic use promotes the development of resistance among bacteria of the normal flora. As shown in Figure 3a and Supplementary File 2, the models achieved strong predictive performance across most antibiotics, with recall and F1-score above 0.90 in the majority of cases. Detailed performance metrics, both overall and genus-specific, are available through our interactive RShiny application (https://dianalab.e-ce.uth.gr/amrpredictor/). Notably, antibiotics such as moxifloxacin, cefpodoxime, chloramphenicol, ceftriaxone, and oxacillin reached near-perfect scores (recall and F1 ≥0.98), underscoring the robustness of the approach. Carbapenems like doripenem (recall and F1 ≥0.97), ertapenem (recall and F1 ≥0.96), and meropenem (recall and F1 ≥0.95) also demonstrated excellent predictability, reflecting their well-characterized resistance determinants. However, a few antibiotics showed reduced performance.
Linezolid (recall 0.60, F1 0.66), tigecycline (recall 0.62, F1 0.69), colistin (recall 0.78, F1 0.79), cefuroxime (recall 0.77, F1 0.85), and minocycline (recall 0.77, F1 0.93) had notably lower scores, possibly due to underrepresented or poorly annotated resistance mechanisms in the genomic feature set. For instance, colistin resistance often involves complex regulatory pathways or plasmid-mediated mcr genes, which may not be fully captured in current databases. Similarly, tigecycline and linezolid resistance mechanisms are less common and more variable across species, reducing model generalizability.

To maintain the reliability of genus-specific predictions, we applied a conservative threshold: if any key performance metric for a given antibiotic–genus pair or across all genera fell below 0.8, we considered that combination unreliable (Supplementary File 3). We filtered predictions to include only antibiotics with consistently high performance for each species. As shown in Figure 3b, this selection highlights which antibiotic–species pairs can be robustly predicted by our models. Six antibiotics (nitrofurantoin, tigecycline, linezolid, cefuroxime, minocycline, and colistin) were excluded entirely due to low predictive performance across genera, likely reflecting underlying resistance mechanisms not yet fully elucidated or integrated into AMR databases. For nitrofurantoin, the overall recall and F1 score were above 0.8, but performance within each individual genus was below 0.8, so it was also removed.

Compared to prior AMR prediction studies, our machine learning models demonstrate superior or highly competitive predictive performance. For example, Moradigaravand et al. reported an average recall of ~0.83 and precision of ~0.92 for predicting resistance in E. coli using pan-genome data, focusing mainly on gene presence/absence and population structure across a limited antibiotic set [14].
In contrast, our models deliver recall and F1-scores exceeding 0.90 for the majority of 40 antibiotics across all major ESKAPEE pathogens, including critical agents like carbapenems (ertapenem, imipenem, meropenem with recall ~0.95–0.97), demonstrating broader taxonomic and antibiotic coverage with improved sensitivity. Similarly, Nguyen et al. and Ren et al. reported strong performance but typically on narrower antibiotic panels (<20 drugs) or fewer species, with model recalls ranging from ~0.85 to 0.95 [15,16]. Our results match or surpass previous efforts while providing phenotypic predictions at a much larger scale. Overall, the combination of high recall and F1-scores across 40 antibiotics, extensive species representation, and a unified, interpretable feature space positions our study as one of the most comprehensive and clinically relevant AMR prediction frameworks to date, advancing the field in both research and translational diagnostics, as emphasized by recent calls for innovation in AMR surveillance [17,18].

Rapid In Silico Antibiogram from Blood Metagenomes: A Clinical Proof of Concept

To assess the clinical performance of our machine learning models and establish a practical pipeline for their application, we conducted a pilot study using blood cultures from patients with clinically confirmed bloodstream infections. Blood cultures were collected from 40 different patients and incubated for 6 hours to enhance microbial yield. Total DNA was then extracted, with host-derived DNA selectively depleted to enrich for microbial content. The resulting microbial DNA was prepared for sequencing according to standard metagenomic protocols and the manufacturer’s instructions for the MGI platform. Sequencing was performed on the MGI G99 instrument, generating ~4 million paired-end reads per sample. We used the MGI Pathogen Fast Identification (PFI) tool to identify the dominant bacterial species in each sample.
In 4 out of the 40 cases, the microbial composition was mixed, with no single species exceeding 90% abundance. These were likely polymicrobial infections or samples with ambiguous taxonomic resolution and were excluded to avoid confounding the evaluation (Figure 4a). The remaining 36 samples, each dominated by one bacterial species, were analyzed further. We assembled the genomes using a hybrid approach with SPAdes [19], combining de novo assembly and reference-guided scaffolding based on the closest matching reference genome from PFI. We then extracted k-mer profiles (sizes 3, 4, and 5), matching the scheme used in model training, and applied our ML classifiers to predict antimicrobial susceptibility profiles (in silico antibiograms). The predicted resistance profiles were then systematically compared to the corresponding conventional susceptibility test results obtained in the clinical microbiology laboratory of the University Hospital of Larissa, Greece.

As shown in Figure 4b, our machine learning models demonstrated strong predictive performance across most antibiotics, underscoring the feasibility of this rapid, genome-based approach to antimicrobial susceptibility testing (AST). Notably, this workflow delivers results within 24–48 hours from blood draw, compared to the 2–3 days typically required for culture-based methods, offering a clinically meaningful time advantage that can enable earlier, targeted treatment decisions. High-performing antibiotics included amikacin, cefepime, ceftriaxone, imipenem, meropenem, and levofloxacin, all showing recall and F1-scores ≥0.93, often with zero false positives or false negatives (e.g., amikacin: TP=27, TN=1, FP=0, FN=0). Ciprofloxacin and levofloxacin also performed well, with recall around 0.93–0.94 and F1-scores ~0.97. Some antibiotics showed moderate performance. Ampicillin had a recall of 0.67 and F1-score of 0.80 due to one false negative out of three positive cases.
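The ampicillin and amikacin figures can be reproduced from confusion-matrix counts with the standard definitions of precision, recall, and F1. In this sketch the positive class is taken to be "resistant", and the ampicillin false-positive count of zero is an assumption inferred from the reported F1 of 0.80 (it is not stated in the text); the function name is illustrative.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard binary metrics from confusion-matrix counts.

    Each metric is returned as 0.0 when its denominator is zero.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Ampicillin: three positive cases, one missed; FP = 0 is an assumption.
_, amp_recall, amp_f1 = precision_recall_f1(tp=2, fp=0, fn=1)
print(round(amp_recall, 2), round(amp_f1, 2))  # 0.67 0.8

# Amikacin: counts as reported (TP=27, FP=0, FN=0).
print(precision_recall_f1(tp=27, fp=0, fn=0))  # (1.0, 1.0, 1.0)
```

With only three positive cases, a single misclassification moves recall by a third, so metrics for sparsely represented antibiotics in a 36-sample cohort carry wide uncertainty.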
Cefotaxime reached perfect recall (1.00) but a lower F1-score (0.86) because of one false positive. Gentamicin had perfect recall but a lower F1-score (0.84), driven by seven false positives, likely reflecting its complex resistance mechanisms, such as aminoglycoside-modifying enzymes and membrane permeability changes. When compared to rule-based tools like ResFinder (Figure 4c, Supplementary File 4), our ML models consistently showed superior performance, underscoring the added value of combining machine learning with curated genomic features beyond just known resistance genes.

To explore whether underperforming antibiotics could be improved, we applied a distributionally robust optimization (DRO)-inspired training strategy (see Methods), focusing on ampicillin, aztreonam, cefotaxime, and gentamicin. This approach reweighted uncertain samples and used label smoothing to reduce overfitting. While DRO significantly improved ampicillin’s performance, suggesting that initial misclassifications were driven by label noise or sample imbalance, it did not substantially improve aztreonam, cefotaxime, or gentamicin. This suggests that for these antibiotics, the limitations are likely biological rather than computational, reflecting incomplete representation of resistance mechanisms in current genomic databases (Figure 4d).

Interpreting Model Predictions Using SHAP for Feature-Level Insights

To gain biological insights into model predictions and feature relevance, we applied SHAP (SHapley Additive exPlanations) values [20] to estimate the contribution of individual genomic features to resistance outcomes. SHAP analyses were performed both on the full training dataset and separately for each antibiotic–genus pair, enabling fine-grained interpretation of potential resistance determinants. These results, including SHAP-derived feature importance scores, are available through our interactive RShiny application (https://dianalab.e-ce.uth.gr/amrpredictor/).
While SHAP values offer powerful interpretability and can highlight genomic regions potentially linked to AMR, their insights are inherently limited to the features present in the training data. The web application includes SHAP values for all antibiotic–genus combinations, even those below the performance threshold of 0.8, allowing users to inspect models that performed poorly and explore why. For each antibiotic, users can view the most important features identified either across all genera or within specific genera. SHAP values should be read as indicative, as they show the relative share of importance attributed to each feature. In our interface, the displayed importance refers to how much a given k-mer feature influenced the model's decision, regardless of direction.

The SHAP analysis revealed several biologically meaningful patterns in the model predictions. For cefotaxime, the model consistently ranked k-mers from the blaCTX-M gene as highly influential, with positive contributions to resistance prediction, an expected and well-established marker for β-lactam resistance. Similarly, for amikacin, multiple k-mers derived from the aph(3')-VI gene were identified as top features, confirming its known role in aminoglycoside resistance. These examples highlight cases where the model captured relevant genomic signals, supporting its interpretability and potential clinical utility. For ciprofloxacin, genes such as gyrA and parC appeared among the most important features, aligning with known resistance mechanisms, though the directionality and distribution of SHAP values were more diffuse. In contrast, for ampicillin–sulbactam, top-ranked features included promoter-region k-mers not clearly linked to known resistance determinants, suggesting either indirect associations or gaps in the current feature set.
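One common way to obtain a direction-agnostic importance of the kind described for the interface is to average the absolute SHAP value of each feature over samples and normalize so the scores sum to one. The sketch below illustrates that convention only; the text does not state how the web application aggregates SHAP values, and the function name, feature names, and toy numbers are hypothetical.

```python
def relative_importance(shap_rows, feature_names):
    """Rank features by mean absolute SHAP value, normalized to sum to 1.

    shap_rows: one list of per-feature SHAP values per sample.
    Returns (feature_name, share) pairs sorted from most to least important.
    """
    n_samples = len(shap_rows)
    mean_abs = [
        sum(abs(row[j]) for row in shap_rows) / n_samples
        for j in range(len(feature_names))
    ]
    total = sum(mean_abs)
    shares = [m / total if total else 0.0 for m in mean_abs]
    return sorted(zip(feature_names, shares), key=lambda t: t[1], reverse=True)


# Toy SHAP matrix: 2 samples x 3 hypothetical k-mer features.
names = ["gene_A|3|MKK", "promoter_A|4|ACGT", "rRNA|5|GGTTA"]
rows = [[0.4, -0.1, 0.0],
        [-0.2, 0.1, 0.0]]
ranking = relative_importance(rows, names)
print(ranking[0][0])
```

Taking absolute values before averaging is what discards direction: a feature that pushes some samples toward resistance and others toward susceptibility still ranks highly, which is why the sign has to be inspected separately when interpreting a top feature.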
Overall, while some antibiotic–genus combinations yielded interpretable biological insights, others highlighted the need for further refinement of features or inclusion of additional genomic signals. These results should be interpreted as hypothesis-generating: SHAP values indicate which features influenced model decisions, but do not prove causality. They must be complemented with domain knowledge, genomic context, and functional validation. Nevertheless, the ability to pinpoint consistently important regions makes this approach a valuable starting point for researchers investigating the molecular basis of resistance in specific antibiotic–species combinations.

Discussion

Antimicrobial resistance (AMR) is one of the most urgent global health threats of the 21st century. Without effective interventions, it is expected to escalate, undermining the efficacy of antimicrobial treatments and placing an increasing burden on healthcare systems worldwide. Our study shows that machine learning (ML) models trained on genomic features (including resistance genes, upstream promoter regions, mutations in core genes, and ribosomal RNA genes) can accurately predict resistance profiles for the majority of antibiotics tested. These results were validated both on held-out public datasets and in a real-world setting using metagenomic data from 36 ESKAPEE-positive blood cultures obtained from individual patients. Importantly, several antibiotics showed excellent predictive performance, including amikacin, cefepime, ceftriaxone, imipenem, meropenem, and levofloxacin, with recall and F1-scores exceeding 0.93 and near-perfect agreement with phenotypic susceptibility results. This underscores the robustness of the models for key frontline antibiotics and their potential utility in clinical decision-making. While overall performance was strong, accuracy varied across antibiotics.
For example, gentamicin exhibited high recall but lower precision in clinical samples, largely due to an increased number of false positives—suggesting the model may overpredict resistance, potentially due to weaker discriminatory features or metagenomic noise. More broadly, antibiotics that consistently showed poor model performance (e.g., tigecycline, linezolid, cefuroxime) may point to the presence of resistance mechanisms not well captured by current genomic databases or feature sets. In this way, our framework not only offers a powerful predictive tool but also highlights antibiotic–genera combinations that merit further biological investigation. Our current approach, while effective across many antibiotics and pathogens, has several limitations. The feature set is focused on known resistance genes, upstream promoter regions, clinically significant mutations in core genes, and rRNA elements, which may overlook other genomic contributors to resistance—such as mobile genetic elements, non-coding RNAs, structural rearrangements, or epigenetic factors. Incorporating these additional layers could improve model interpretability and expand coverage. Furthermore, our models are trained to predict binary susceptibility outcomes based on CLSI breakpoints, without estimating MIC values or accounting for intermediate categories, which may reduce precision near clinical decision thresholds. In some genus–antibiotic combinations, model performance was impacted by imbalanced or sparse training data, leading to the exclusion of unreliable predictions. These observations underscore the importance of curating balanced, well-annotated datasets and expanding feature diversity to further enhance the accuracy and generalizability of genomic AMR prediction models. This study provides a compelling proof of concept that bacterial genomic content can be effectively leveraged to predict antimicrobial susceptibility profiles with high accuracy. 
Similar findings have been reported in recent studies applying ML to AMR prediction. For example, Nguyen et al. demonstrated that genomic features, including k-mers and known resistance genes, can be used to predict minimum inhibitory concentrations (MICs) for nontyphoidal Salmonella , often exceeding 90% accuracy. In another large-scale study, Moradigaravand et al. used pan-genome data and ML models to predict resistance phenotypes in Escherichia coli , achieving high performance and identifying genomic elements linked to resistance across a broad population set. Their work emphasized the utility of genome-wide variation in resistance prediction and biomarker discovery. Compared to these studies, our work expands the scope by systematically evaluating resistance prediction across all major ESKAPEE pathogens and 40 antibiotics using a unified feature set. Furthermore, our application of these models to clinical metagenomic samples represents a critical step toward diagnostic translation. To our knowledge, this is the most comprehensive and best-performing ML-based AMR prediction framework reported to date, both in terms of taxonomic breadth and validation in a clinical context. Importantly, the metagenomic dataset generated from 36 bloodstream infection samples provides an independent, real-world benchmark for the AMR research community. Derived from a tertiary hospital in Greece, this dataset includes a notably high proportion of resistant strains compared to publicly available databases, offering an invaluable resource for testing predictive models under clinically realistic conditions. As an unseen, clinically relevant dataset, it enables robust evaluation of future antimicrobial resistance (AMR) prediction models, moving beyond internal cross-validation or simulation-based benchmarks. We encourage its use as a standardized reference set to promote reproducibility, model comparison, and methodological advancement in the AMR prediction field. 
While previous studies have shown the feasibility of genomic AMR prediction, our framework is among the first to integrate it into a fully clinical, culture-independent pipeline, enabling resistance prediction directly from metagenomic sequencing data. This approach is especially valuable when conventional methods are infeasible, for example in patients previously treated with antibiotics, whose cultures may return false-negative results, or when working with fastidious or unculturable pathogens such as Mycobacterium tuberculosis, Legionella pneumophila, or Helicobacter pylori. Although further validation and workflow optimization are needed before routine adoption, this method offers a promising alternative in urgent or complex diagnostic settings. To ensure widespread adoption and long-term clinical utility, future machine learning models must be trained on even larger, geographically diverse genome–antibiogram datasets and be continuously updated to capture emerging resistance mechanisms. In this context, initiatives like COMPARE at ENA represent an important effort toward building such global resources 21. Overall, our results demonstrate the potential of integrating metagenomic sequencing with ML-based prediction to accelerate antimicrobial susceptibility testing. Unlike traditional culture-based methods, which typically require 48–72 hours, this approach can deliver accurate resistance profiles within 24–48 hours, potentially enabling earlier, more targeted therapy in clinical practice. Beyond predictive accuracy, our study places equal emphasis on model interpretability. By applying SHAP (SHapley Additive exPlanations) values to our trained models, we provide feature-level insights into the genomic elements most influential in each prediction. These results are openly accessible through our interactive web application, which allows researchers to explore which k-mers and gene regions drive model decisions across antibiotics and bacterial genera.
Notably, features that are consistently ranked as important may warrant further biological investigation, as they could highlight previously unexplored mechanisms underlying the resistance phenotype. In well-performing models, the top-ranked features often correspond to well-known resistance genes, providing an additional layer of biological validation. In contrast, lower-performing models—such as those for gentamicin or tigecycline—show less consistent or less interpretable feature patterns, possibly pointing to the presence of as-yet-undiscovered or under-characterized resistance mechanisms. As such, our platform not only delivers accurate predictions but also serves as a hypothesis-generating tool to guide future research into novel genomic determinants of AMR, ultimately advancing both basic science and clinical applications. In conclusion, our findings highlight the transformative potential of machine learning for antimicrobial resistance (AMR) diagnostics, enabling rapid, culture-independent prediction of resistance profiles directly from genomic data. The successful application of this framework to clinical bloodstream infection samples demonstrates its translational promise and paves the way for real-world implementation. As genomic resources expand and computational methods advance, genome-based AMR prediction may define the next generation of microbiological diagnostics—particularly in cases where conventional methods are slow, limited, or infeasible. Looking ahead, one can envision a clinical paradigm where blood is drawn from the patient and analyzed in near real-time using integrated next-generation sequencing and machine learning pipelines, delivering reliable resistance predictions within hours. Given that over 90% of microorganisms cannot be routinely isolated in cultures, such approaches could revolutionize infectious disease management by supporting more precise, timely, and effective treatments. 
Realizing this vision will require broader adoption of sequencing technologies, cost reductions in instrumentation and reagents, expansion of curated genome–antibiogram datasets, regular updates of interpretive breakpoints (e.g., CLSI, EUCAST), and deep integration of bioinformatics and machine learning expertise into clinical workflows.

Methods

Public Data Collection

Data acquisition for antimicrobial resistance (AMR) prediction involved collecting bacterial genome assemblies and corresponding antibiogram data from three major public repositories: NCBI NDARO (https://www.ncbi.nlm.nih.gov/pathogens/isolates; downloaded December 2023), BV-BRC (https://www.bv-brc.org; downloaded December 2023), and CDC NARMS (https://wwwn.cdc.gov/NARMSNow; downloaded January 2024). The study focused on the clinically significant ESKAPEE genera (Enterococcus, Staphylococcus, Klebsiella, Acinetobacter, Pseudomonas, Enterobacter, and Escherichia). From NDARO, 13,827 isolates were retrieved, yielding 13,159 unique assemblies after the removal of redundant entries. The NARMS dataset contributed 870 E. coli O157 assemblies with associated antibiograms. The BV-BRC database initially contained 234,123 assemblies from ESKAPEE genera, from which 6,249 assemblies with matched experimental antibiograms were retained after filtering. Across datasets, extensive preprocessing was conducted, including de-duplication, harmonization of antibiotic names, normalization of minimum inhibitory concentration (MIC) values, and integration of antibiogram records. The final dataset comprised resistance profiles for 19,525 unique assemblies, which were then subjected to quality control.

Data Quality Control

Quality control of both genome assemblies and antibiograms was essential to ensure the reliability of the data used for antimicrobial resistance prediction. For genome assemblies, the N50 statistic was used as a primary measure of contiguity and assembly quality.
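As an illustration of the N50 statistic named above, the following is a minimal Python sketch, not the code used in the study. It assumes the common definition of N50 (the length of the contig at which the running total, taken over contigs sorted from longest to shortest, first reaches half of the assembly size). The helper names `contig_lengths` and `n50` and the toy FASTA string are hypothetical.

```python
def contig_lengths(fasta_text):
    """Return the length of every sequence in a FASTA-formatted string."""
    lengths, current = [], None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Contig length at which the longest-first running total reaches half the assembly size."""
    total = sum(lengths)
    if total == 0:
        raise ValueError("assembly contains no sequence")
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy assembly (hypothetical): three contigs of 10, 8, and 3 bases.
example_fasta = ">c1\nACGTACGTAC\n>c2\nACGTAC\nGT\n>c3\nACG\n"
example_lengths = contig_lengths(example_fasta)
example_n50 = n50(example_lengths)
```

In practice the resulting value would be compared against a minimum threshold chosen per genus, and assemblies falling below it would be dropped.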
Assemblies with low N50 values were excluded to maintain high standards of completeness. While N50 values were directly available for NDARO entries, those from CDC NARMS and BV-BRC had to be computed. We calculated these values by extracting contig lengths from FASTA files and applying the Biostrings::N50() function in R. Assemblies were filtered using a minimum N50 threshold of 50,000 bp for most genera (Enterococcus, Klebsiella, Acinetobacter, Pseudomonas, and Escherichia) and 20,000 bp for Enterobacter and Staphylococcus. After filtering, 18,916 high-quality assemblies were retained. Antibiogram quality control retained only entries with quantitative susceptibility data reported as minimum inhibitory concentrations (MICs), obtained using standardized methods (Clinical and Laboratory Standards Institute [CLSI] or European Committee on Antimicrobial Susceptibility Testing [EUCAST]). To ensure consistency across datasets, susceptibility and resistance classifications were determined according to MIC breakpoints defined in the CLSI 34th edition (https://clsi.org/resources/breakpoint-implementation-toolkit/). Strains with MICs falling into the intermediate category were excluded from analysis for the corresponding antibiotic, as such cases could reflect potential testing, reporting, or reagent discrepancies. Additionally, any record showing a mismatch between the reported phenotype and MIC-derived classification was discarded. The Zenodo repository folder (18916_assemblies_antibiograms) contains the fully quality-controlled assemblies and antibiograms used in this study.

Annotation Pipeline for Analysis of WGS Data

To enable predictive modeling of antimicrobial resistance (AMR), we developed a comprehensive annotation pipeline designed to extract relevant genomic features from bacterial assemblies.
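The antibiogram quality-control rules described above (MIC-based classification, exclusion of intermediate results, and removal of records whose reported phenotype disagrees with the MIC-derived call) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the record layout, the function names, and above all the breakpoints for the placeholder antibiotic `drug_x` are hypothetical and are not CLSI values.

```python
def classify_mic(mic, s_max, r_min):
    """Map a MIC to S/I/R given a susceptible upper bound and a resistant lower bound."""
    if mic <= s_max:
        return "S"
    if mic >= r_min:
        return "R"
    return "I"


def quality_control(records, breakpoints):
    """Keep only records with an unambiguous, self-consistent S or R call."""
    kept = []
    for rec in records:
        bounds = breakpoints.get(rec["antibiotic"])
        if bounds is None:
            continue  # no breakpoint available for this antibiotic
        call = classify_mic(rec["mic"], *bounds)
        if call == "I":
            continue  # intermediate results are excluded
        reported = rec.get("reported")
        if reported is not None and reported != call:
            continue  # reported phenotype contradicts the MIC-derived call
        kept.append({**rec, "label": 1 if call == "R" else 0})
    return kept


# Placeholder breakpoints (NOT CLSI values): susceptible <= 2, resistant >= 8.
HYPOTHETICAL_BREAKPOINTS = {"drug_x": (2.0, 8.0)}

example_records = [
    {"assembly": "asm1", "antibiotic": "drug_x", "mic": 1.0, "reported": "S"},
    {"assembly": "asm2", "antibiotic": "drug_x", "mic": 4.0, "reported": "I"},
    {"assembly": "asm3", "antibiotic": "drug_x", "mic": 16.0, "reported": "S"},
    {"assembly": "asm4", "antibiotic": "drug_x", "mic": 32.0, "reported": None},
]
example_kept = quality_control(example_records, HYPOTHETICAL_BREAKPOINTS)
```

Here `asm2` is dropped as intermediate and `asm3` is dropped because its reported phenotype contradicts its MIC, leaving one susceptible and one resistant record with binary labels.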
The annotation focused on three biologically informative categories: (i) known antimicrobial resistance genes, including both horizontally acquired resistance determinants (e.g., β-lactamases, efflux pumps) and core genes whose mutations are associated with resistance phenotypes; (ii) the upstream regulatory regions (300 bp) of these genes, which may influence expression; and (iii) ribosomal RNA (rRNA) genes (5S, 16S, 23S), which are functionally linked to the mechanism of action of several antibiotics. To construct a high-confidence database of AMR genes, we integrated protein-coding sequences from three major repositories: the Comprehensive Antibiotic Resistance Database (CARD, version 3.2.9), the Reference Gene Catalog (version 3.12), and ResFinderDB (accessed June 2024). From CARD, we extracted 5,078 resistance genes by matching protein accession numbers in the Antibiotic Resistance Ontology (ARO). The Reference Gene Catalog provided 8,157 gene entries retrieved from the NCBI FTP server. ResFinder contributed 3,150 resistance genes. All genes were translated into protein sequences using a custom Python script. Combined, this resulted in a unified dataset of 14,183 unique AMR protein sequences. To reduce redundancy further and facilitate homology-based searches, protein sequences were clustered using CD-HIT (version 4.8.1) with stringent parameters (-c 0.9, -aL 0.9) to ensure that highly similar but functionally distinct genes (e.g., tetM and tetO) remained in separate clusters. This threshold was chosen based on the findings of Zilhao et al. 22, which showed that co-occurring tetracycline resistance genes can be mistakenly merged under less stringent cutoffs. Clustering resulted in 2,580 non-redundant protein families. Representative sequences from each cluster were used to build a custom BLAST database.
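The translation of gene sequences into proteins mentioned above was done with a custom script that is not shown in the text. A minimal sketch of such a step could look like the following. It assumes translation in reading frame 1 with the standard codon assignments, stops at the first stop codon by default, and writes `X` for codons containing ambiguous bases; these choices, and the names `CODON_TABLE` and `translate`, are assumptions for illustration.

```python
from itertools import product

# Standard codon assignments, enumerated in TCAG order.
_BASES = "TCAG"
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(_BASES, repeat=3), _AMINO_ACIDS)
}


def translate(nucleotides, to_stop=True):
    """Translate a coding sequence in frame 1; trailing partial codons are ignored."""
    seq = nucleotides.upper().replace("U", "T")
    protein = []
    for i in range(0, len(seq) - len(seq) % 3, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")  # X marks ambiguous codons
        if aa == "*" and to_stop:
            break
        protein.append(aa)
    return "".join(protein)


example_protein = translate("ATGGCCTAA")
```

The resulting protein strings would then be written to a FASTA file for clustering and database construction.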
For each genome assembly, AMR gene detection was performed using blastx (NCBI BLAST+ version 2.15.0+) with an E-value threshold of 1e-50 and output formatted to include standard fields as well as aligned query/subject sequences and subject lengths. Hits were filtered based on ≥70% identity and global coverage using a Python script. For each valid alignment, the 300 nucleotides upstream of the alignment start site were extracted using the Biostrings R package (version 2.66.0) 23, with reverse strand orientation handled appropriately. To complement AMR gene annotations, rRNA genes were identified via BLASTn by aligning 5S, 16S, and 23S reference sequences to each assembly. Separate BLAST databases were created for each rRNA gene type. Given the presence of multiple rRNA operons in bacterial genomes, multiple matches per genome were expected and retained. BLASTn results were parsed using a bash script and analyzed with Python to record the best-scoring non-overlapping hits. For downstream analysis, we constructed a pseudo-rRNA operon for each genome by concatenating all detected rRNA gene copies in the canonical order (5S, 16S, 23S), mimicking their natural genomic arrangement based on a reference genome. Unmatched regions or gaps were padded with the character ‘X’ to ensure sequence uniformity across samples. This pipeline generated a consistent, high-resolution annotation of each genome, capturing AMR-related protein features, upstream regulatory sequences, and rRNA elements for downstream machine learning applications.

Training Machine Learning Models for Predicting Antimicrobial Resistance

To build robust models for antimicrobial resistance (AMR) prediction, we developed a comprehensive pipeline covering data preprocessing, k-mer–based feature engineering, model training, and performance evaluation. The input included AMR genes (protein level), promoter regions, and rRNA genes (nucleotide level), each processed separately to retain biological relevance.
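The strand-aware extraction of the 300-nucleotide upstream region described in the annotation pipeline was performed in R with Biostrings. The Python sketch below only illustrates the coordinate logic and is not the authors' implementation. It assumes 1-based inclusive alignment coordinates on the forward strand of the contig with `start <= end`, and it truncates the region at contig edges; the function names and the toy contig are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def upstream_region(contig, start, end, strand, length=300):
    """Sequence of up to `length` bases immediately upstream of an alignment.

    `start` and `end` are 1-based inclusive forward-strand coordinates.
    For a reverse-strand hit, upstream lies after `end` on the forward
    strand and is returned reverse-complemented, so the result always
    reads 5'->3' toward the gene.
    """
    if strand == "+":
        return contig[max(0, start - 1 - length):start - 1]
    if strand == "-":
        return reverse_complement(contig[end:end + length])
    raise ValueError("strand must be '+' or '-'")


# Toy contig (hypothetical): positions 6-10 and 11-15 stand in for two hits.
example_contig = "ACGTACCCCCGGGGGTTGCA"
forward_upstream = upstream_region(example_contig, 6, 10, "+", length=3)
reverse_upstream = upstream_region(example_contig, 11, 15, "-", length=3)
```

When a hit lies closer than `length` bases to a contig edge, this sketch returns a shorter sequence rather than failing; how the original pipeline handles that case is not stated in the text.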
Overlapping k-mers of sizes 3, 4, and 5 were extracted per gene or protein, and only k-mers present in the training set were kept to maintain consistency and reduce noise. We computed k-mer frequencies using the CountVectorizer class of Scikit-learn (v1.0.2) 24, producing high-dimensional, sparse feature vectors capturing local sequence patterns tied to resistance mechanisms. All analyses were conducted in Python, using pandas (v2.0.3), numpy (v1.21.5), matplotlib (v3.7.2), seaborn (v0.13.2), scipy (v1.7.3), and argparse (v1.1). For supervised learning, we used Extreme Gradient Boosting (XGBoost, v2.0.3) 25 and Random Forest classifiers implemented via Scikit-learn. The dataset was split into training and test sets with a stratified 80:20 split to preserve class proportions. Hyperparameters were tuned via grid search: for XGBoost, we explored max_depth [7, 9, 11, 13, 15, 17, 19, 21, 23], ran 300 boosting rounds, and applied early stopping (patience 10) using 5-fold cross-validation log loss; for Random Forests, we tuned n_estimators [100, 200, 400], max_depth [10, 20, 30, ..., 100], and both bootstrap settings. Evaluation metrics included accuracy, precision, recall, and F1-score. Performance was compared across k-mer sizes to determine the most predictive setup.

Feature Importance and Model Interpretability

To interpret the contributions of individual genomic features to antimicrobial resistance predictions, we employed both model-specific importance metrics and SHAP (SHapley Additive exPlanations) values. For Random Forest and XGBoost models, raw importance scores were extracted using feature_importances_ and get_score() (importance type: 'weight'), respectively. These values were normalized to reflect the relative contribution of each feature as a percentage. Features contributing less than 0.001% were filtered out to reduce noise and enhance interpretability.
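The normalization and filtering of raw importance scores described above can be sketched as follows. This is a minimal illustration, assuming the raw scores have already been collected into a dictionary keyed by feature name; the function name `normalize_importances`, the feature names, and the choice not to re-normalize after filtering are assumptions, since the text does not specify them.

```python
def normalize_importances(raw_scores, min_percent=0.001):
    """Convert raw importance scores to percentages and drop negligible features.

    Returns a dict ordered from most to least important. Percentages are
    computed over all features and are not re-normalized after filtering.
    """
    total = sum(raw_scores.values())
    if total <= 0:
        return {}
    percentages = {name: 100.0 * score / total for name, score in raw_scores.items()}
    ranked = sorted(percentages.items(), key=lambda item: item[1], reverse=True)
    return {name: pct for name, pct in ranked if pct >= min_percent}


# Hypothetical raw scores, e.g. split counts from a tree ensemble.
example_raw = {"feature_a": 25.0, "feature_b": 75.0, "feature_c": 0.0}
example_importances = normalize_importances(example_raw)
```

With the default cutoff, a feature must account for at least 0.001% of the total importance to be retained.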
To provide a model-agnostic and more nuanced explanation of feature influence, SHAP analysis was conducted using the SHAP Python package (v0.46.0). SHAP values were calculated on the held-out test set using TreeExplainer, capturing both the magnitude and direction of each feature’s effect on the prediction. Specifically, we extracted SHAP values for the predicted class (class 1, resistant) and computed the mean absolute SHAP value per feature to assess importance, along with the average SHAP value to determine the direction of influence (positive or negative). To link model-derived features with biological context, we annotated each feature with metadata including the gene or region name, associated sequence, and sequence type (protein, promoter, or rRNA). Feature identifiers were constructed by concatenating these attributes, and SHAP-derived importance scores were merged with this metadata to create an integrated, interpretable view. This multi-level approach enabled biologically meaningful insights into which genomic regions drove resistance predictions and highlighted candidate markers for further experimental validation.

Interactive R/Shiny Application

To enhance accessibility and promote transparency, we developed an interactive R/Shiny web application that enables users to explore the dataset and machine learning results related to antimicrobial resistance prediction. The platform provides descriptive summaries of the data, such as the distribution of resistance phenotypes across genera and antibiotics, as well as detailed model performance metrics including accuracy, precision, recall, F1-score, and AUC. SHAP-based feature importance scores are also available, supporting interpretability of model predictions. The application is built with R (v4.4.2) and Shiny (v1.10.0), and is hosted on a Linux-based server with 2 CPU cores, 4 GB RAM, and 19 GB of SSD storage. The web app is publicly accessible at https://dianalab.e-ce.uth.gr/amrpredictor/.
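The per-feature SHAP summary described under Feature Importance and Model Interpretability (mean absolute SHAP value for importance, mean signed SHAP value for direction) can be sketched as below. The sketch assumes the SHAP values for the resistant class are already available as a samples-by-features matrix; computing them with TreeExplainer is not shown. The function name, the feature identifiers, and the toy matrix are hypothetical.

```python
def summarize_shap(shap_matrix, feature_names):
    """Rank features by mean |SHAP| and report the sign of the mean SHAP value.

    `shap_matrix` is a list of rows (one per test sample), each holding one
    SHAP value per feature for the resistant class.
    """
    n_samples = len(shap_matrix)
    summary = []
    for j, name in enumerate(feature_names):
        column = [row[j] for row in shap_matrix]
        mean_abs = sum(abs(v) for v in column) / n_samples
        mean_signed = sum(column) / n_samples
        if mean_signed > 0:
            direction = "positive"  # pushes predictions toward resistance
        elif mean_signed < 0:
            direction = "negative"  # pushes predictions toward susceptibility
        else:
            direction = "neutral"
        summary.append(
            {"feature": name, "mean_abs_shap": mean_abs,
             "mean_shap": mean_signed, "direction": direction}
        )
    summary.sort(key=lambda row: row["mean_abs_shap"], reverse=True)
    return summary


# Toy values (hypothetical): two test samples, two features.
example_features = ["geneA|protein|MKT", "geneB|promoter|ACGT"]
example_shap = [[0.5, -0.25], [-0.25, -0.75]]
example_summary = summarize_shap(example_shap, example_features)
```

Each summary row could then be joined with the feature metadata (gene or region name, sequence, sequence type) on the feature identifier.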
DRO-Inspired Machine Learning Models

To improve the prediction of antimicrobial resistance under uncertain or noisy conditions, we implemented a machine learning pipeline inspired by distributionally robust optimization (DRO). First, we trained an initial XGBoost classifier to estimate prediction uncertainty, calculated as the proximity of predicted probabilities to the decision boundary (|0.5 − probability|), with smaller values indicating greater uncertainty. A tunable α-quantile (set at 30%) was applied to identify the most uncertain samples, which were then upweighted (weight = 5) to emphasize difficult cases during subsequent training. Additionally, label smoothing was applied to reduce overfitting by slightly shifting the true labels toward 0.5 (using a smoothing factor of 0.05). Finally, we trained an XGBoost regressor with a logistic objective on the smoothed labels and adjusted sample weights. The hyperparameters were selected based on prior performance across the full training dataset, and the classification threshold was fixed at 0.5. Performance was evaluated using standard classification metrics, including precision, recall, F1-score, and the confusion matrix. This strategy aimed to enhance model robustness by accounting for uncertainty and prioritizing high-risk misclassifications.

Metagenomic Sequencing and AST Prediction: real time for bloodstream infections

A total of 40 blood cultures, each obtained from an individual patient hospitalized at the 600-bed University Hospital of Larissa, Greece, with a clinically confirmed bloodstream infection, were included to evaluate the performance of our model. 10 mL of whole blood from each patient was inoculated into aerobic blood culture bottles (BACTEC, BD) and incubated in the BD BACTEC™ FX automated blood culture system (Becton, Dickinson). A positive growth signal was used as an indicator of microbial presence. DNA extraction was performed in three steps: red blood cell (RBC) lysis, host DNA depletion, and microbial DNA isolation.
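The sample reweighting and label smoothing described under the DRO-inspired models can be sketched in Python as follows. This is an illustrative sketch, not the authors' code. It assumes a simple rank-based reading of the α-quantile (the fraction α of samples closest to the 0.5 boundary is upweighted) and the usual binary smoothing formula y·(1 − ε) + 0.5·ε; neither detail is stated in the text, and the function names are hypothetical.

```python
def uncertainty_weights(probabilities, alpha=0.30, upweight=5.0):
    """Upweight the fraction `alpha` of samples closest to the 0.5 boundary.

    Proximity is |0.5 - p|; smaller means more uncertain. Ties at the cutoff
    are all upweighted, so slightly more than `alpha` may be selected.
    """
    proximity = [abs(0.5 - p) for p in probabilities]
    n_uncertain = max(1, int(round(alpha * len(proximity))))
    cutoff = sorted(proximity)[n_uncertain - 1]
    return [upweight if d <= cutoff else 1.0 for d in proximity]


def smooth_labels(labels, epsilon=0.05):
    """Shift hard 0/1 labels toward 0.5 by a factor `epsilon` (assumed formula)."""
    return [y * (1.0 - epsilon) + 0.5 * epsilon for y in labels]


# Hypothetical first-stage probabilities for ten training samples.
example_probs = [0.05, 0.45, 0.55, 0.95, 0.30, 0.70, 0.10, 0.90, 0.52, 0.99]
example_weights = uncertainty_weights(example_probs)
example_targets = smooth_labels([0, 1])
```

The resulting weights and smoothed targets would then be supplied to the second-stage model, for example through the `sample_weight` argument of an XGBoost estimator's `fit` method; that step is omitted here to keep the sketch free of third-party dependencies.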
Briefly, 200 μL of the culture-positive blood bottle was mixed with 600 μL of RBC Lysis Buffer (Zymo) and incubated for 5 minutes at room temperature. Samples were centrifuged at 2000×g for 10 minutes, and the supernatant was carefully removed. The pellet was resuspended in 200 μL of PBS. Host DNA was depleted using the HostZERO Microbial DNA Kit (Zymo), following the manufacturer’s protocol. The sample was then placed in a 2 mL tube holder and subjected to mechanical lysis using a bead beater at maximum speed for two 10-minute cycles, with a 2-minute interval. DNA was extracted using the PSS magLEAD 12Gc automated system (BioServices, Stockholm). DNA concentration and purity were assessed using a NanoDrop One spectrophotometer and a Qubit fluorometer (Thermo Fisher Scientific). Metagenomic sequencing was performed on the DNBSEQ-G99 platform (MGI) using the MGIEasy Fast FS Library Prep Set, MGIEasy Dual Barcode Circularization Kit, and the High-throughput Sequencing Set (G99 SM FCL PE150), aiming for 4 million reads per sample. Before assembly of the microbial genome of each sample, the MGI Pathogen Fast Identification (PFI) tool was used for the molecular identification of microorganisms. Four samples were excluded as mixed cultures because PFI assigned less than 90% abundance to the most abundant bacterium. Raw reads were then assembled using SPAdes (version 4.0.0), guided by the closest reference genome corresponding to the dominant species identified by PFI. Assembled genomes were processed through our annotation pipeline to detect AMR and core genes, their associated upstream promoter regions, and ribosomal RNA (rRNA) elements. K-mer encodings (k = 3, 4, 5) were generated for these features. For each assembly, resistance prediction was performed using our pre-trained machine learning models, restricted to the antibiotics with demonstrated high model performance for the corresponding genus.
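Encoding a new assembly for a pre-trained model requires that its k-mer counts use exactly the vocabulary seen during training. The study reports using Scikit-learn's CountVectorizer for k-mer counting; the dependency-free Python sketch below only illustrates the idea and is not the authors' code. The function names and the toy sequences are hypothetical.

```python
from collections import Counter


def kmers(sequence, k):
    """All overlapping substrings of length k."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def fit_vocabulary(training_sequences, k):
    """Sorted list of every k-mer observed in the training sequences."""
    seen = set()
    for seq in training_sequences:
        seen.update(kmers(seq, k))
    return sorted(seen)


def encode(sequence, k, vocabulary):
    """Count vector over the training vocabulary; unseen k-mers are ignored."""
    counts = Counter(kmers(sequence, k))
    return [counts.get(kmer, 0) for kmer in vocabulary]


# Toy protein fragments (hypothetical).
example_vocabulary = fit_vocabulary(["MKTMKT"], k=3)
training_vector = encode("MKTMKT", 3, example_vocabulary)
clinical_vector = encode("MKTAAA", 3, example_vocabulary)
```

Because k-mers absent from the training set are dropped, the feature vector of a clinical sample always has the same length and column order as the matrix the model was trained on.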
Predicted resistance profiles were compared with conventional antimicrobial susceptibility testing (AST) results. Briefly, 10 μL of each positive blood culture bottle was inoculated onto blood agar and MacConkey agar, and both plates were incubated at 37 °C for 24 h or more. Colonies were then identified and tested for susceptibility using the VITEK 2 automated system (bioMérieux), according to the manufacturer's guidelines. The interpretation of the susceptibility results obtained from the VITEK 2 was based on CLSI breakpoints, 34th edition. For comparison, ResFinder (v4.7.2) 26,27 predictions were generated using the tool’s preset thresholds, and the predicted phenotypes were obtained from the output pheno_table.

System configuration

The analyses described in this paper were conducted on a computational platform equipped with an Intel® Xeon® Gold 6226R CPU @ 2.90GHz with 376 GB of RAM. The operating system used was Ubuntu Linux version 20.04.5. The primary programming languages and their versions used for this research were Python (v3.12.7) and R (v4.2.2).

Declarations

DATA AVAILABILITY

The source code used for model training, prediction, and analysis is available on GitHub at https://github.com/dianalabgr/amrprediction under the MIT License. The curated genome assemblies and manually checked antibiograms used for model development are available on Zenodo at https://zenodo.org/uploads/16213507.
This Zenodo repository includes: the machine learning models with their performance metrics and SHAP values (models_shap_values.zip), the encoded k-mer datasets used for training (kmer3_data.zip, kmer4_data.zip, kmer5_data.zip, kmer_data_extra.zip), the assemblies and antibiograms of the 36 metagenomic blood culture samples, which can be used as a real clinical benchmark dataset (bloodcultures_assemblies_antibiograms.zip), the k-mer profiles used to predict their resistance phenotypes (kmers_blood_cultures.zip), and the filtered high-quality assemblies and antibiograms used for model training (filtered_assemblies.zip, qced_antibiograms.zip). The raw metagenomic sequencing data (FASTQ files) from blood cultures of 40 patients with bloodstream infections have been deposited in the NCBI Sequence Read Archive (SRA) under BioProject accession PRJNA1290626. This includes 52 Gbases of multispecies data across 40 SRA experiments and 40 BioSamples, collected at the University Hospital of Larissa (registration date: 12-Jul-2025).

ACKNOWLEDGEMENTS

The authors acknowledge the members of DIANA-lab for their very useful comments and ideas.

FUNDING

This research has been co‐financed by the European Regional Development Fund of the European Union and Greek national funds through the Operational Program Competitiveness, Entrepreneurship and Innovation, and Greece 2.0, under the call “RESEARCH – CREATE – INNOVATE” (ID 16971), with project id: TAEDK-06179. Additionally, this research has been supported by “ELIXIR-GR: The Greek Research Infrastructure for Data Management and Analysis in Life Sciences” [MIS-5002780], implemented under the Action “Reinforcement of the Research and Innovation Infrastructure,” funded by the Operational Programme “Competitiveness, Entrepreneurship and Innovation” [NSRF 2014–2020] and co-financed by Greece and the European Union (European Regional Development Fund).
Also, specifically for Anargyros Skoulakis, the research work was supported by the Hellenic Foundation for Research and Innovation (HFRI) under the 5th Call for HFRI PhD Fellowships (Fellowship Number: 20480).

CONFLICT OF INTEREST

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

1. World Health Organization (2022). Antimicrobial Resistance Surveillance in Europe 2022–2020 Data. World Health Organization, Regional Office for Europe. URL: https://iris.who.int/bitstream/handle/10665/351141/9789289056687-eng.pdf
2. Kritsotakis, E. I., Lagoutari, D., Michailellis, E., Georgakakis, I., & Gikas, A. (2022). Burden of multidrug and extensively drug-resistant ESKAPEE pathogens in a secondary hospital care setting in Greece. Epidemiology & Infection, 150, e170.
3. Ma, Y. X., Wang, C. Y., Li, Y. Y., Li, J., Wan, Q. Q., Chen, J. H., ... & Niu, L. N. (2020). Considerations and caveats in combating ESKAPE pathogens against nosocomial infections. Advanced Science, 7(1), 1901872.
4. Boolchandani, M., D’Souza, A. W., & Dantas, G. (2019). Sequencing-based methods and resources to study antimicrobial resistance. Nature Reviews Genetics, 20(6), 356-370.
5. Nguyen, M., Brettin, T., Long, S. W., Musser, J. M., Olsen, R. J., Olson, R., ... & Davis, J. J. (2018). Developing an in silico minimum inhibitory concentration panel test for Klebsiella pneumoniae. Scientific Reports, 8(1), 421.
6. Kim, J. I., Maguire, F., Tsang, K. K., Gouliouris, T., Peacock, S. J., McAllister, T. A., ... & Beiko, R. G. (2022). Machine learning for antimicrobial resistance prediction: current practice, limitations, and clinical perspective. Clinical Microbiology Reviews, 35(3), e00179-21.
7. Macesic, N., Polubriaginof, F., & Tatonetti, N. P. (2017). Machine learning: novel bioinformatics approaches for combating antimicrobial resistance. Current Opinion in Infectious Diseases, 30(6), 511-517.
8. Olson, R. D., Assaf, R., Brettin, T., Conrad, N., Cucinell, C., Davis, J. J., ... & Stevens, R. L. (2023). Introducing the bacterial and viral bioinformatics resource center (BV-BRC): a resource combining PATRIC, IRD and ViPR. Nucleic Acids Research, 51(D1), D678-D689.
9. Alcock, B. P., Huynh, W., Chalil, R., Smith, K. W., Raphenya, A. R., Wlodarski, M. A., ... & McArthur, A. G. (2023). CARD 2023: expanded curation, support for machine learning, and resistome prediction at the Comprehensive Antibiotic Resistance Database. Nucleic Acids Research, 51(D1), D690-D699.
10. Feldgarden, M., Brover, V., Gonzalez-Escalona, N., Frye, J. G., Haendiges, J., Haft, D. H., ... & Klimke, W. (2021). AMRFinderPlus and the Reference Gene Catalog facilitate examination of the genomic links among antimicrobial resistance, stress response, and virulence. Scientific Reports, 11(1), 12728.
11. Zankari, E., Hasman, H., Cosentino, S., Vestergaard, M., Rasmussen, S., Lund, O., ... & Larsen, M. V. (2012). Identification of acquired antimicrobial resistance genes. Journal of Antimicrobial Chemotherapy, 67(11), 2640-2644.
12. Li, W., & Godzik, A. (2006). Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22(13), 1658-1659.
13. Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K., & Madden, T. L. (2009). BLAST+: architecture and applications. BMC Bioinformatics, 10(1), 421.
14. Moradigaravand, D., Palm, M., Farewell, A., Mustonen, V., Warringer, J., & Parts, L. (2018). Prediction of antibiotic resistance in Escherichia coli from large-scale pan-genome data. PLoS Computational Biology, 14(12), e1006258.
15. Nguyen, M., Long, S. W., McDermott, P. F., Olsen, R. J., Olson, R., Stevens, R. L., ... & Davis, J. J. (2019). Using machine learning to predict antimicrobial MICs and associated genomic features for nontyphoidal Salmonella. Journal of Clinical Microbiology, 57(2), 10-1128.
16. Ren, Y., Chakraborty, T., Doijad, S., Falgenhauer, L., Falgenhauer, J., Goesmann, A., ... & Heider, D. (2022). Prediction of antimicrobial resistance based on whole-genome sequencing and machine learning. Bioinformatics, 38(2), 325-334.
17. Ardila, C. M., González-Arroyave, D., & Tobón, S. (2025). Machine learning for predicting antimicrobial resistance in critical and high-priority pathogens: A systematic review considering antimicrobial susceptibility tests in real-world healthcare settings. PLoS ONE, 20(2), e0319460.
18. Sakagianni, A., Koufopoulou, C., Feretzakis, G., Kalles, D., Verykios, V. S., & Myrianthefs, P. (2023). Using machine learning to predict antimicrobial resistance―a literature review. Antibiotics, 12(3), 452.
19. Prjibelski, A., Antipov, D., Meleshko, D., Lapidus, A., & Korobeynikov, A. (2020). Using SPAdes de novo assembler. Current Protocols in Bioinformatics, 70(1), e102.
20. Lundberg, S. M., & Lee, S. I. (2017). A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems, 30.
21. Matamoros, S., Hendriksen, R. S., Pataki, B. Á., Pakseresht, N., Rossello, M., Silvester, N., ... & Schultsz, C. (2020). Accelerating surveillance and research of antimicrobial resistance–an online repository for sharing of antimicrobial susceptibility data associated with whole-genome sequences. Microbial Genomics, 6(5), e000342.
22. Schwarz, S., & Noble, W. C. (1994). Tetracycline resistance genes in staphylococci from the skin of pigs. Journal of Applied Microbiology, 76(4), 320-326.
23. Pagès, H., Aboyoun, P., Gentleman, R., & DebRoy, S. (2025). Biostrings: Efficient manipulation of biological strings. doi:10.18129/B9.bioc.Biostrings, R package version 2.76.0, https://bioconductor.org/packages/Biostrings.
24. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... & Duchesnay, É. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825-2830.
25. Chen, T., & Guestrin, C. (2016, August). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785-794).
26. Bortolaia, V., Kaas, R. S., Ruppe, E., Roberts, M. C., Schwarz, S., Cattoir, V., ... & Aarestrup, F. M. (2020). ResFinder 4.0 for predictions of phenotypes from genotypes. Journal of Antimicrobial Chemotherapy, 75(12), 3491-3500.
27. Clausen, P. T., Aarestrup, F. M., & Lund, O. (2018). Rapid and precise alignment of raw reads against redundant databases with KMA. BMC Bioinformatics, 19(1), 307.

Additional Declarations

There is NO Competing Interest.

Supplementary Files

SupplementaryFile1.xlsx (Dataset 1), SupplementaryFile2.xlsx (Dataset 2), SupplementaryFile3.xlsx (Dataset 3), SupplementaryFile4.zip (Dataset 4)
Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-7190203","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Biological Sciences - Article","associatedPublications":[],"authors":[{"id":493429369,"identity":"f9d44726-9ddb-4a19-92ad-7b82149b9fef","order_by":0,"name":"Anargyros Skoulakis","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAA1UlEQVRIie2PMQrCQBBFx35lOxlR2CuMBMRCyFU2eAFLK/EAK7YRj2ATCaReEawSbFdstFdImcJC18LSrJ3gvu7Df8wfAI/nF2HQ0KARuA1y7KbAS2nPrELuCgBpm1wUoXLSVT4Ig8M6Ks8Egrf0Z4UKRVtlMMrMLcXnsN5yJWsUzkizEmXfFIlVJJ1qFLFgtL2XGAZxkVZOCjyH7ZjBRsLnmdsVyvfjXTfHKDbNbCAJ638RarS5XPfTkC+K9FhNhoJ36oa9wVcTXesWrr9pezwezz/xAGpASYN+hEIAAAAAAElFTkSuQmCC","orcid":"","institution":"University of Thessaly","correspondingAuthor":true,"prefix":"","firstName":"Anargyros","middleName":"","lastName":"Skoulakis","suffix":""},{"id":493429370,"identity":"857cbc11-966a-4f23-a209-7fdec2a440c0","order_by":1,"name":"Konstantinos Daniilidis","email":"","orcid":"https://orcid.org/0000-0002-6813-545X","institution":"University of Thessaly","correspondingAuthor":false,"prefix":"","firstName":"Konstantinos","middleName":"","lastName":"Daniilidis","suffix":""},{"id":493429371,"identity":"8841af4f-0226-4374-882f-3d741f64013e","order_by":2,"name":"Stefanos Digenis","email":"","orcid":"","institution":"University of Thessaly","correspondingAuthor":false,"prefix":"","firstName":"Stefanos","middleName":"","lastName":"Digenis","suffix":""},{"id":493429372,"identity":"670f10b3-2f0c-4257-8981-7ff4180842ed","order_by":3,"name":"Christos-Georgios 
Gkountinoudis","email":"","orcid":"","institution":"University of Thessaly","correspondingAuthor":false,"prefix":"","firstName":"Christos-Georgios","middleName":"","lastName":"Gkountinoudis","suffix":""},{"id":493429373,"identity":"1b5bd227-406b-4fde-bce9-872c3a4aa365","order_by":4,"name":"Efthimia Petinaki","email":"","orcid":"","institution":"Department of Microbiology, University Hospital of Larissa","correspondingAuthor":false,"prefix":"","firstName":"Efthimia","middleName":"","lastName":"Petinaki","suffix":""},{"id":493429374,"identity":"987ee34a-9176-4a25-8021-c9338ced0f67","order_by":5,"name":"Artemis G. Hatzigeorgiou","email":"","orcid":"","institution":"University of Thessaly","correspondingAuthor":false,"prefix":"","firstName":"Artemis","middleName":"G.","lastName":"Hatzigeorgiou","suffix":""}],"badges":[],"createdAt":"2025-07-22 20:15:14","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-7190203/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-7190203/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":88078569,"identity":"50f15840-d0da-43f1-95af-1ce561e34333","added_by":"auto","created_at":"2025-08-01 07:51:25","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":729997,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eOverview of the AMR prediction workflow.\u003c/strong\u003e The figure summarizes the three main stages of the study. (1) Dataset creation and feature annotation: Genomic assemblies from ESKAPEE pathogens and their matched antibiograms were collected from public databases (NDARO, BV-BRC, CDC NARMS), followed by quality control filtering. AMR genes, core genes, upstream promoter regions, and rRNA genes were annotated to build a comprehensive feature set. 
(2) K-mer encoding and machine learning: For each genomic feature, separate k-mer profiles (sizes 3, 4, 5) were generated and combined into feature matrices for model training. Random Forest and XGBoost classifiers were developed for 40 antibiotics, achieving high predictive performance across most genus–antibiotic pairs. (3) Metagenomic pipeline validation: The trained models were applied to metagenomic sequencing data from 36 culture-positive blood samples, demonstrating rapid and accurate in silico prediction of resistance profiles within 24–48 hours, compared with conventional AST, which needs 48–72 hours. Figure was created with Canva Pro.\u003c/p\u003e","description":"","filename":"1.png","url":"https://assets-eu.researchsquare.com/files/rs-7190203/v1/a61e304d1c5212dd1acf85b5.png"},{"id":88078572,"identity":"280329b2-960d-41b7-92e9-dd63c6db9174","added_by":"auto","created_at":"2025-08-01 07:51:25","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":1020875,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eCompilation and preprocessing of the training dataset for machine learning–based AMR prediction. (a)\u003c/strong\u003e Global distribution of 19,525 genome assemblies obtained from three major sources—NCBI NDARO, BV-BRC, and CDC NARMS—covering seven ESKAPEE genera. Circle size indicates the number of isolates per location; colors represent bacterial genus. \u003cstrong\u003e(b)\u003c/strong\u003e Distribution of susceptibility phenotypes (resistant, intermediate, susceptible, not stated) across the 40 antibiotics included in the final dataset. Phenotypes were determined based on MIC values and interpreted according to CLSI 34th edition. Only antibiotics with ≥50 eligible samples were retained. 
Definitions: Susceptible — the microorganism is inhibited by the antibiotic at standard (achievable) concentrations; Intermediate — the microorganism may be inhibited, but higher doses or favorable pharmacokinetics are needed for clinical effectiveness; Resistant — the microorganism is not inhibited, even at high concentrations; Not stated — no MIC value was provided. \u003cstrong\u003e(c)\u003c/strong\u003e Proportion of assemblies per bacterial genus after quality control, highlighting the predominance of Escherichia and Klebsiella. \u003cstrong\u003e(d)\u003c/strong\u003e Venn diagram showing overlap among AMR protein sequences from CARD, AMRFinderPlus, and ResFinder databases. After merging and redundancy filtering using CD-HIT (90% identity and coverage), 2,580 representative AMR protein sequences were retained for downstream analyses.\u003c/p\u003e","description":"","filename":"2.png","url":"https://assets-eu.researchsquare.com/files/rs-7190203/v1/7fd579206fab26f18f21b74f.png"},{"id":88078574,"identity":"16d47c93-b192-4a5a-a34e-bbb776773069","added_by":"auto","created_at":"2025-08-01 07:51:25","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":1155196,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003ePerformance and interpretability of machine learning models for antimicrobial resistance prediction.\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e(a) \u003c/strong\u003eAntibiotic-wise predictive performance of the best ML models across all genera in the test dataset, evaluated using recall and F1-score. Hyperparameters were optimized using 5-fold cross-validation on the training set, and final metrics were computed on a held-out 20% test set. Antibiotic–genus combinations were retained only if they exceeded performance thresholds (accuracy, precision, recall, F1-score \u0026gt; 0.8). 
\u003cstrong\u003e(b)\u003c/strong\u003e The dot-line plot visualizes which antibiotics can be reliably predicted by our machine learning models for each ESKAPEE genus. Antibiotics are shown on the x-axis and genera on the y-axis; dots represent genus–antibiotic combinations included in the final prediction set, with lines connecting antibiotics across genera for visual clarity. On the left, a horizontal bar chart summarizes the total number of antibiotics predicted per genus, with the exact count displayed at the end of each bar. Only antibiotics and genera passing performance thresholds (recall and F1-score \u0026gt;0.8) were retained, highlighting the robustness and scope of the models across clinically relevant species.\u003c/p\u003e","description":"","filename":"3.png","url":"https://assets-eu.researchsquare.com/files/rs-7190203/v1/a8c52283cdd99b3f4e45a43e.png"},{"id":88078571,"identity":"8cf9ca1c-c4d6-466c-9bbf-8156250fb127","added_by":"auto","created_at":"2025-08-01 07:51:25","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":1697953,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eEvaluation of machine learning–based antimicrobial resistance prediction in clinical metagenomic samples. \u003c/strong\u003e(a) Taxonomic concordance between culture-based identification and metagenomic species classification (MGI Pathogen Fast Identification, PFI) for 40 blood culture samples. Samples with ambiguous or mixed-species profiles (highlighted) were excluded from downstream analysis. (b) Performance metrics of ML-based resistance prediction across antibiotics using genome assemblies derived from 36 metagenomic samples. The models show high accuracy, precision, and recall across most antibiotics. (c) Comparative performance of ResFinder on the same metagenomic samples, using the same antibiotic set. ML models consistently outperform ResFinder, particularly in recall and F1-score. 
(d) Results of models retrained using distributionally robust optimization (DRO) for antibiotics with initially lower performance (ampicillin, aztreonam, cefotaxime, gentamicin). Improvements were observed for ampicillin, while other antibiotics showed limited gains, suggesting biological limitations in feature capture.\u003c/p\u003e","description":"","filename":"4.png","url":"https://assets-eu.researchsquare.com/files/rs-7190203/v1/8fa72232ca427471b41c1a95.png"},{"id":89892126,"identity":"0008544c-1490-4755-a808-d12ff3a746dd","added_by":"auto","created_at":"2025-08-26 07:48:39","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":6782741,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-7190203/v1/6557a8bd-0bf1-4e0b-a0ca-5b8f66eb3d91.pdf"},{"id":88078567,"identity":"a9217a52-968c-42b2-9735-94687b2092ce","added_by":"auto","created_at":"2025-08-01 07:51:25","extension":"xlsx","order_by":1,"title":"","display":"","copyAsset":false,"role":"supplement","size":53990,"visible":true,"origin":"","legend":"Dataset 1","description":"","filename":"SupplementaryFile1.xlsx","url":"https://assets-eu.researchsquare.com/files/rs-7190203/v1/da1b390669760645e62b6009.xlsx"},{"id":88078570,"identity":"07201c30-db28-4419-bea8-bbc2f7c148a9","added_by":"auto","created_at":"2025-08-01 07:51:25","extension":"xlsx","order_by":2,"title":"","display":"","copyAsset":false,"role":"supplement","size":10225,"visible":true,"origin":"","legend":"Dataset 2","description":"","filename":"SupplementaryFile2.xlsx","url":"https://assets-eu.researchsquare.com/files/rs-7190203/v1/49a06ad43c0740dec2fec387.xlsx"},{"id":88078568,"identity":"26480103-d369-4bee-9a11-b69610af0b6c","added_by":"auto","created_at":"2025-08-01 
07:51:25","extension":"xlsx","order_by":3,"title":"","display":"","copyAsset":false,"role":"supplement","size":10204,"visible":true,"origin":"","legend":"Dataset 3","description":"","filename":"SupplementaryFile3.xlsx","url":"https://assets-eu.researchsquare.com/files/rs-7190203/v1/efe3c7daa0f6fbb8f6da31de.xlsx"},{"id":88078573,"identity":"c6360a7c-6b41-4ec4-a6c5-3cf0e9bde549","added_by":"auto","created_at":"2025-08-01 07:51:25","extension":"zip","order_by":4,"title":"","display":"","copyAsset":false,"role":"supplement","size":75106,"visible":true,"origin":"","legend":"Dataset 4","description":"","filename":"SupplementaryFile4.zip","url":"https://assets-eu.researchsquare.com/files/rs-7190203/v1/7af29a73e689b5c4e5d16d0e.zip"}],"financialInterests":"There is \u003cb\u003eNO\u003c/b\u003e Competing Interest.","formattedTitle":"Machine Learning Predicts Antimicrobial Resistance from Genomic Data across ESKAPEE Pathogens","fulltext":[{"header":"Introduction","content":"\u003cp\u003eThe widespread and often indiscriminate use of antibiotics in human medicine, agriculture, and animal husbandry has accelerated the emergence and spread of antimicrobial resistance (AMR). This resistance reduces the effectiveness of available treatments, resulting in prolonged illness, increased mortality, and greater healthcare costs. The World Health Organization (WHO) has identified AMR as one of the top ten global public health threats, warning that without urgent action, common infections could once again become deadly\u003csup\u003e1\u003c/sup\u003e. Particularly concerning are multidrug-resistant (MDR) pathogens that are resistant to multiple classes (\u0026gt;3) of antibiotics, thereby severely limiting treatment options. 
Among these, the ESKAPEE pathogens\u0026mdash;\u003cem\u003eEnterococcus faecium, Staphylococcus aureus, Klebsiella pneumoniae, Acinetobacter baumannii, Pseudomonas aeruginosa, Enterobacter spp.,\u0026nbsp;\u003c/em\u003eand\u003cem\u003e\u0026nbsp;Escherichia coli\u003c/em\u003e\u0026mdash;stand out due to their clinical relevance and high rates of resistance. These organisms are frequently implicated in healthcare-associated infections such as bloodstream infections, pneumonia, and urinary tract infections\u003csup\u003e2\u003c/sup\u003e. Their capacity to rapidly acquire and disseminate resistance genes makes them a priority for global surveillance and therapeutic innovation\u003csup\u003e3\u003c/sup\u003e.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eTraditional antimicrobial susceptibility testing (AST) methods remain fundamental in clinical microbiology laboratories, relying on culture-based techniques for implementation. In practice, this involves isolating bacteria from patient samples, growing them on selective media, and testing their response to antibiotics in vitro. However, these methods face inherent limitations, as they are applicable only to cultivable bacteria and require specialized microbiology facilities and trained personnel\u003csup\u003e4\u003c/sup\u003e. They are also time-consuming, often requiring 48\u0026ndash;72 hours to produce results, which delays the initiation of targeted therapy and may lead to the empirical use of broad-spectrum antibiotics. Additionally, fastidious microorganisms, such as Mycobacterium tuberculosis or Clostridioides difficile, require even longer incubation times, leading to critical delays in establishing appropriate treatment. 
Faster and more informative alternatives are urgently needed to guide precision therapy and help curb the spread of antimicrobial resistance\u003csup\u003e5\u003c/sup\u003e.\u003c/p\u003e\n\u003cp\u003eIn recent years, computational methods have emerged as transformative tools in the field of antimicrobial resistance (AMR) prediction and strain characterization. Approaches that detect AMR genes or identify mutations in core genes associated with resistance have been widely used to predict AMR phenotypes. Building on these methods, machine learning (ML) algorithms have increasingly been applied to analyze high-dimensional, heterogeneous datasets, uncovering complex associations between genomic features and phenotypic resistance profiles\u0026mdash;patterns that are often not discernible using conventional statistical techniques\u003csup\u003e5,6\u003c/sup\u003e. By integrating genomic, clinical, and environmental data, ML models can deliver accurate predictions of antimicrobial susceptibility, supporting personalized therapeutic decisions and enhancing antimicrobial stewardship efforts\u003csup\u003e7\u003c/sup\u003e.\u003c/p\u003e\n\u003cp\u003eIn this study, we address the challenge of antimicrobial resistance (AMR) by leveraging genomic information, given its central role in determining bacterial resistance phenotypes. We manually curated and analyzed a dataset of 18,916 genome assemblies from ESKAPEE pathogens, each linked to matched phenotypic antimicrobial susceptibility testing (AST) data. The machine learning (ML) models were trained using high-dimensional feature vectors derived from genomic inputs, specifically: k-mer frequency profiles (k = 3, 4, 5) generated from AMR gene protein sequences, upstream promoter DNA sequences, and ribosomal RNA (rRNA) gene sequences for each assembly. Notably, a separate model was developed for each antibiotic, allowing the algorithms to learn resistance patterns specific to each antimicrobial agent. 
Using these data, we trained Random Forest and XGBoost models to predict resistance to 40 antibiotics, achieving high predictive performance across most pathogen\u0026ndash;antibiotic combinations. To evaluate clinical applicability and benchmark our method against standard susceptibility testing protocols (as performed at the University Hospital of Larissa, Greece), we developed and validated an integrated experimental and bioinformatic pipeline designed to process positive blood cultures from individual patients. Directly from clinical samples, this pipeline performs metagenomic sequencing, taxonomic identification, genome assembly, genomic annotation, and resistance prediction using the trained models within 24\u0026ndash;48 hours. Results were then compared with conventional AST obtained from blood cultures. The core workflow of our approach is illustrated in \u003cstrong\u003eFigure 1\u003c/strong\u003e. For the first time, our study demonstrates that combining machine learning with genomic data yields a robust, scalable, and accurate framework for AMR prediction that serves both research and clinical needs. As resistance continues to compromise the efficacy of existing antibiotics, such integrative strategies are crucial for enhancing diagnostic precision, applying targeted therapy, and guiding the development of novel therapeutics. 
Our findings contribute to ongoing efforts to modernize AMR surveillance and improve patient outcomes in the context of this global health crisis.\u003c/p\u003e"},{"header":"Results","content":"\u003cp\u003e\u003cstrong\u003eCollection and Annotation of ESKAPEE assemblies\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eData acquisition for creating machine learning models to predict antimicrobial resistance (AMR) involved retrieving genome assemblies and corresponding phenotypic resistance data (based on MICs) from three major public repositories\u0026mdash;NCBI NDARO, BV-BRC\u003csup\u003e8\u003c/sup\u003e, and CDC NARMS\u0026mdash;focusing on the clinically significant ESKAPEE genera. Integration of these datasets yielded a comprehensive resistance table comprising 19,525 unique assemblies collected globally (\u003cstrong\u003eFigure 2a\u003c/strong\u003e). The distribution of resistant, susceptible, and intermediate phenotypes for each antibiotic is shown in \u003cstrong\u003eFigure 2b\u003c/strong\u003e. To ensure data quality, we applied stringent filtering criteria. Assemblies were evaluated based on N50 values, using thresholds of 50,000 for \u003cem\u003eKlebsiella, Acinetobacter, Pseudomonas, Enterococcus,\u0026nbsp;\u003c/em\u003eand\u003cem\u003e\u0026nbsp;Escherichia\u003c/em\u003e, and 20,000 for \u003cem\u003eEnterobacter\u0026nbsp;\u003c/em\u003eand \u003cem\u003eStaphylococcus,\u003c/em\u003e resulting in a high-quality assembly set.\u003c/p\u003e\n\u003cp\u003eMIC values from various testing methods (CLSI, EUCAST) were included; however, final interpretations into susceptible or resistant categories were standardized using CLSI breakpoints (34th edition, 2024). To ensure consistency across datasets, we retained only assemblies with clear MIC-supported phenotype calls under CLSI definitions. 
Assemblies with MICs falling into the intermediate category were excluded from analyses for the affected antibiotics, as they may introduce ambiguity or reflect historical testing differences. Only assemblies with MICs confidently matching susceptible or resistant phenotypes were kept for downstream analysis.\u003c/p\u003e\n\u003cp\u003eFurthermore, genus\u0026ndash;antibiotic pairs with limited representation (fewer than 10 assemblies in either class, resistant or susceptible, or one class accounting for less than 0.1% of the total assemblies for that genus) were excluded to maintain statistical robustness. Antibiotics with fewer than 50 strains with eligible antibiograms were also removed. This multi-step quality control pipeline resulted in a highly curated dataset of 18,916 assemblies with manually verified susceptibility profiles for 40 antibiotics. The distribution of assemblies across genera is shown in \u003cstrong\u003eFigure 2c\u003c/strong\u003e and\u003cstrong\u003e\u0026nbsp;Supplementary File 1\u003c/strong\u003e. The final quality-controlled assemblies with their respective antibiograms, for each antibiotic, are publicly available on Zenodo (https://zenodo.org/records/16213507), providing a valuable resource for the AMR research community.\u003c/p\u003e\n\u003cp\u003eAfter quality control, for each genome, we identified known antimicrobial resistance (AMR) genes, core genes (genes with known mutations linked to antimicrobial phenotypes), their upstream promoter regions (300 nucleotides upstream of the gene start site), and ribosomal RNA (rRNA) genes. To compile a comprehensive set of AMR-related genes (AMR genes and core genes), we integrated data from three widely used databases: the Comprehensive Antibiotic Resistance Database (CARD)\u003csup\u003e9\u003c/sup\u003e, Reference Gene Catalog\u003csup\u003e10\u003c/sup\u003e, and ResFinder DB\u003csup\u003e11\u003c/sup\u003e, initially obtaining 14,183 AMR protein sequences. 
Instead of using nucleotide sequences, we focused on protein sequences, as they better capture functionally relevant variation. To reduce redundancy and keep the feature space manageable, we clustered homologous proteins using CD-HIT\u003csup\u003e12\u003c/sup\u003e at stringent thresholds of 90% sequence identity and 90% alignment coverage, reducing the total to 2,580 representative AMR and core proteins (\u003cstrong\u003eFigure 2d\u003c/strong\u003e). For each genome, we then used BLASTx\u003csup\u003e13\u003c/sup\u003e to identify AMR gene and core gene hits, applying a 70% identity and 70% coverage cutoff to retain high-confidence matches. Additionally, we extracted the 300 nucleotides upstream of each AMR gene to capture potential promoter regions, accounting for strand orientation. rRNA genes (5S, 16S, and 23S), central to the mechanism of many antibiotics, were identified using BLASTn. All detected rRNA gene copies were retained, and for each genome, we constructed a pseudo full-length rRNA operon by concatenating the copies in the correct order. To ensure a consistent feature matrix across all genomes, any missing AMR or core genes were represented by a placeholder sequence (\u0026lsquo;XXXX\u0026rsquo;), maintaining a fixed set of 2,580 AMR/core genes, 2,580 promoters, and 1 pseudo rRNA feature per genome. This comprehensive feature set, combining resistance genes, regulatory elements, and conserved structural components, served as the foundation for predictive modeling.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eMachine Learning Models predicting antimicrobial resistance using promoters, AMR genes, core genes and rRNA genes\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe development of machine learning models for predicting antimicrobial resistance involved a multi-stage pipeline encompassing data encoding, model training, performance evaluation, and interpretability analysis. 
Separate models were trained for each antibiotic, resulting in 40 distinct models. Stratified sampling ensured balanced representation across resistance classes and bacterial genera, enhancing generalizability to unseen data.\u003c/p\u003e\n\u003cp\u003eTo represent the genomic content of each bacterial assembly, we applied k-mer frequency encoding using k-mer sizes of 3, 4, and 5. Separate k-mer profiles were generated not only for each genomic feature type (promoters, rRNA genes, AMR and core genes) but also for each individual element within those types\u0026mdash;meaning that, for example, gene 1, gene 2, promoter 1, and each rRNA gene each had its own dedicated k-mer profile. For AMR and core genes, k-mers were derived from the protein sequences, while for promoters and rRNA genes, they were based on the DNA sequences. These feature vectors were concatenated to create a comprehensive input for two supervised learning algorithms: Extreme Gradient Boosting (XGBoost) and Random Forests. The dataset of 18,916 assemblies was randomly split into training (80%) and testing (20%) sets using a fixed random seed to ensure reproducibility. To avoid data leakage, models were trained exclusively on the training set, with the test set reserved for unbiased evaluation. Hyperparameters were optimized using grid search and cross-validation. For XGBoost, we tuned the learning rate, maximum tree depth, and number of estimators to balance performance and generalization. For Random Forests, we adjusted the number of trees and their depth to maximize accuracy while minimizing overfitting.\u003c/p\u003e\n\u003cp\u003eThe trained models were evaluated using standard classification metrics\u0026mdash;accuracy, precision, recall, and F1-score\u0026mdash;calculated separately for each antibiotic and bacterial genus. 
Among these, recall is particularly critical in clinical settings, as misclassifying a resistant strain as susceptible can lead to inappropriate treatment and patient harm (very major error); in parallel, ineffective antibiotic use promotes the development of resistance among bacteria of the normal flora. As shown in \u003cstrong\u003eFigure 3a\u0026nbsp;\u003c/strong\u003eand \u003cstrong\u003eSupplementary File 2\u003c/strong\u003e, the models achieved strong predictive performance across most antibiotics, with recall and F1-score above 0.90 in the majority of cases. Detailed performance metrics, both overall and genus-specific, are available through our interactive RShiny application (https://dianalab.e-ce.uth.gr/amrpredictor/).\u003c/p\u003e\n\u003cp\u003eNotably, antibiotics such as moxifloxacin, cefpodoxime, chloramphenicol, ceftriaxone, and oxacillin reached near-perfect scores (recall and F1 \u0026ge;0.98), underscoring the robustness of the approach. Carbapenems like doripenem (recall and F1 \u0026ge;0.97), ertapenem (recall and F1 \u0026ge;0.96), and meropenem (recall and F1 \u0026ge;0.95) also demonstrated excellent predictability, reflecting their well-characterized resistance determinants. However, a few antibiotics showed reduced performance. Linezolid (recall 0.60, F1 0.66), tigecycline (recall 0.62, F1 0.69), colistin (recall 0.78, F1 0.79), cefuroxime (recall 0.77, F1 0.85), and minocycline (recall 0.77, F1 0.93) had notably lower scores, possibly due to underrepresented or poorly annotated resistance mechanisms in the genomic feature set. For instance, colistin resistance often involves complex regulatory pathways or plasmid-mediated\u0026nbsp;\u003cem\u003emcr\u003c/em\u003e genes, which may not be fully captured in current databases. 
Similarly, tigecycline and linezolid resistance mechanisms are less common and more variable across species, reducing model generalizability.\u003c/p\u003e\n\u003cp\u003eTo maintain the reliability of genus-specific predictions, we applied a conservative threshold: if any key performance metric for a given antibiotic\u0026ndash;genus pair or across all genera fell below 0.8, we considered that combination unreliable (\u003cstrong\u003eSupplementary File 3\u003c/strong\u003e). We filtered predictions to include only antibiotics with consistently high performance for each species. As shown in \u003cstrong\u003eFigure 3b\u003c/strong\u003e, this selection highlights which antibiotic\u0026ndash;species pairs can be robustly predicted by our models. Six antibiotics (nitrofurantoin, tigecycline, linezolid, cefuroxime, minocycline, and colistin) were excluded entirely due to low predictive performance across genera, likely reflecting underlying resistance mechanisms not yet fully elucidated or integrated into AMR databases. Nitrofurantoin was removed because, although its overall recall and F1-score are above 0.8, performance within each individual genus is suboptimal (\u0026lt;0.8).\u003c/p\u003e\n\u003cp\u003eCompared to prior AMR prediction studies, our machine learning models demonstrate superior or highly competitive predictive performance. For example, Moradigaravand et al. reported an average recall of ~0.83 and precision of ~0.92 for predicting resistance in \u003cem\u003eE. coli\u003c/em\u003e using pan-genome data, focusing mainly on gene presence/absence and population structure across a limited antibiotic set\u003csup\u003e14\u003c/sup\u003e. In contrast, our models deliver recall and F1-scores exceeding 0.90 for the majority of 40 antibiotics across all major ESKAPEE pathogens, including critical agents like carbapenems (ertapenem, imipenem, meropenem with recall ~0.95\u0026ndash;0.97), demonstrating broader taxonomic and antibiotic coverage with improved sensitivity. 
Similarly, Nguyen et al. and Ren et al. reported strong performance but typically on narrower antibiotic panels (\u0026lt;20 drugs) or fewer species, with model recalls ranging from ~0.85 to 0.95\u003csup\u003e15, 16\u003c/sup\u003e. Our results match or surpass previous efforts while providing phenotypic predictions at a much larger scale.\u003c/p\u003e\n\u003cp\u003eOverall, the combination of high recall and F1-scores across 40 antibiotics, extensive species representation, and a unified, interpretable feature space positions our study as one of the most comprehensive and clinically relevant AMR prediction frameworks to date, advancing the field in both research and translational diagnostics\u0026mdash;as emphasized by recent calls for innovation in AMR surveillance\u003csup\u003e17,18\u003c/sup\u003e.\u003c/p\u003e\n\u003cp\u003e\u0026nbsp;\u003cstrong\u003eRapid In Silico Antibiogram from Blood Metagenomes: A Clinical Proof of Concept\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eTo assess the clinical performance of our machine learning models and establish a practical pipeline for their application, we conducted a pilot study using blood cultures from patients with clinically confirmed bloodstream infections. Blood cultures were collected from 40 different patients and incubated for 6 hours to enhance microbial yield. Total DNA was then extracted, with host-derived DNA selectively depleted to enrich for microbial content. The resulting microbial DNA was prepared for sequencing according to standard metagenomic protocols and the manufacturer\u0026rsquo;s instructions for the MGI platform.\u003c/p\u003e\n\u003cp\u003eSequencing was performed on the MGI G99 instrument, generating ~4 million paired-end reads per sample. We used the MGI Pathogen Fast Identification (PFI) tool to identify the dominant bacterial species in each sample. In 4 out of the 40 cases, the microbial composition was mixed, with no single species exceeding 90% abundance. 
These were likely polymicrobial infections or samples with ambiguous taxonomic resolution and were excluded to avoid confounding the evaluation (\u003cstrong\u003eFigure 4a\u003c/strong\u003e).\u003c/p\u003e\n\u003cp\u003eThe remaining 36 samples, each dominated by one bacterial species, were analyzed further. We assembled the genomes using a hybrid approach with SPAdes\u003csup\u003e19\u003c/sup\u003e\u0026mdash;combining de novo assembly and reference-guided scaffolding based on the closest matching reference genome from PFI. We then extracted k-mer profiles (sizes 3, 4, and 5), matching the same scheme used in model training, and applied our ML classifiers to predict antimicrobial susceptibility profiles (in silico antibiograms).\u003c/p\u003e\n\u003cp\u003eThe predicted resistance profiles were then systematically compared to the corresponding conventional susceptibility test results obtained in the clinical microbiology laboratory of University Hospital of Larissa, Greece. As shown in \u003cstrong\u003eFigure 4b\u003c/strong\u003e, our machine learning models demonstrated strong predictive performance across most antibiotics, underscoring the feasibility of this rapid, genome-based approach to antimicrobial susceptibility testing (AST). Notably, this workflow delivers results within 24\u0026ndash;48 hours from blood draw, compared to 2\u0026ndash;3 days typically required for culture-based methods, offering a clinically meaningful time advantage that can enable earlier, targeted treatment decisions.\u003c/p\u003e\n\u003cp\u003eHigh-performing antibiotics included amikacin, cefepime, ceftriaxone, imipenem, meropenem, and levofloxacin, all showing recall and F1-scores \u0026ge;0.93, often with zero false positives or false negatives (e.g., amikacin: TP=27, TN=1, FP=0, FN=0). Ciprofloxacin and levofloxacin also performed well, with recall around 0.93\u0026ndash;0.94 and F1-scores ~0.97.\u003c/p\u003e\n\u003cp\u003eSome antibiotics showed moderate performance. 
Ampicillin had a recall of 0.67 and F1-score of 0.80 due to one false negative out of three positive cases. Cefotaxime reached perfect recall (1.00) but a lower F1-score (0.86) because of one false positive. Gentamicin had perfect recall but a lower F1-score (0.84), driven by seven false positives\u0026mdash;likely reflecting its complex resistance mechanisms, such as aminoglycoside-modifying enzymes and membrane permeability changes.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eWhen compared to rule-based tools like ResFinder (\u003cstrong\u003eFigure 4c, Supplementary File 4\u003c/strong\u003e), our ML models consistently showed superior performance, underscoring the added value of combining machine learning with curated genomic features beyond just known resistance genes.\u003c/p\u003e\n\u003cp\u003eTo explore whether underperforming antibiotics could be improved, we applied a distributionally robust optimization (DRO)-inspired training strategy (see Methods), focusing on ampicillin, aztreonam, cefotaxime, and gentamicin. This approach reweighted uncertain samples and used label smoothing to reduce overfitting. While DRO significantly improved ampicillin\u0026rsquo;s performance\u0026mdash;suggesting that initial misclassifications were driven by label noise or sample imbalance\u0026mdash;it did not substantially improve aztreonam, cefotaxime, or gentamicin. 
This suggests that for these antibiotics, the limitations are likely biological rather than computational, reflecting incomplete representation of resistance mechanisms in current genomic databases (\u003cstrong\u003eFigure 4d\u003c/strong\u003e).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eInterpreting Model Predictions Using SHAP for Feature-Level Insights\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eTo gain biological insights into model predictions and feature relevance, we applied SHAP (SHapley Additive exPlanations) values\u003csup\u003e20\u003c/sup\u003e to estimate the contribution of individual genomic features to resistance outcomes. SHAP analyses were performed both on the full training dataset and separately for each antibiotic\u0026ndash;genus pair, enabling fine-grained interpretation of potential resistance determinants. These results, including SHAP-derived feature importance scores, are available through our interactive R/Shiny application (https://dianalab.e-ce.uth.gr/amrpredictor/). While SHAP values offer powerful interpretability and can highlight genomic regions potentially linked to AMR, their insights are inherently limited to the features present in the training data.\u003c/p\u003e\n\u003cp\u003eThe web application includes SHAP values for all antibiotic\u0026ndash;genus combinations, even those below the performance threshold of 0.8, allowing users to inspect models that performed poorly and explore why. For each antibiotic, users can view the most important features identified either across all genera or within specific genera. SHAP values should be read as indicative, since they express each feature\u0026apos;s relative share of importance rather than an absolute effect. In our interface, the displayed importance refers to how much a given k-mer feature influenced the model\u0026apos;s decision, regardless of direction.\u003c/p\u003e\n\u003cp\u003eThe SHAP analysis revealed several biologically meaningful patterns in the model predictions. 
For cefotaxime, the model consistently ranked k-mers from the \u003cem\u003ebla\u003c/em\u003eCTX-M\u0026nbsp;gene as highly influential, with positive contributions to resistance prediction\u0026mdash;an expected and well-established marker for \u0026beta;-lactam resistance. Similarly, for amikacin, multiple k-mers derived from the aph(3\u0026apos;)-VI gene were identified as top features, confirming its known role in aminoglycoside resistance. These examples highlight cases where the model captured relevant genomic signals, supporting its interpretability and potential clinical utility. For ciprofloxacin, genes such as \u003cem\u003egyrA\u0026nbsp;\u003c/em\u003eand\u003cem\u003e\u0026nbsp;parC\u003c/em\u003e appeared among the most important features, aligning with known resistance mechanisms, though the directionality and distribution of SHAP values were more diffuse. In contrast, for ampicillin\u0026ndash;sulbactam, top-ranked features included promoter-region k-mers not clearly linked to known resistance determinants, suggesting either indirect associations or gaps in the current feature set. Overall, while some antibiotic\u0026ndash;genus combinations yielded interpretable biological insights, others highlighted the need for further refinement of features or inclusion of additional genomic signals. These results should be interpreted as hypothesis-generating: SHAP values indicate which features influenced model decisions, but do not prove causality. They must be complemented with domain knowledge, genomic context, and functional validation. Nevertheless, the ability to pinpoint consistently important regions makes this approach a valuable starting point for researchers investigating the molecular basis of resistance in specific antibiotic\u0026ndash;species combinations.\u003c/p\u003e\n"},{"header":"Discussion","content":"\u003cp\u003eAntimicrobial resistance (AMR) is one of the most urgent global health threats of the 21st century. 
Without effective interventions, it is expected to escalate, undermining the efficacy of antimicrobial treatments and placing an increasing burden on healthcare systems worldwide. Our study shows that machine learning (ML) models trained on genomic features\u0026mdash;including resistance genes, upstream promoter regions, mutations in core genes, and ribosomal RNA genes\u0026mdash;can accurately predict resistance profiles for the majority of antibiotics tested. These results were validated both on held-out public datasets and in a real-world setting using metagenomic data from 36 ESKAPEE-positive blood cultures obtained from individual patients. Importantly, several antibiotics showed excellent predictive performance, including amikacin, cefepime, ceftriaxone, imipenem, meropenem, and levofloxacin, with recall and F1-scores exceeding 0.93 and near-perfect agreement with phenotypic susceptibility results. This underscores the robustness of the models for key frontline antibiotics and their potential utility in clinical decision-making. While overall performance was strong, accuracy varied across antibiotics. For example, gentamicin exhibited high recall but lower precision in clinical samples, largely due to an increased number of false positives\u0026mdash;suggesting the model may overpredict resistance, potentially due to weaker discriminatory features or metagenomic noise. More broadly, antibiotics that consistently showed poor model performance (e.g., tigecycline, linezolid, cefuroxime) may point to the presence of resistance mechanisms not well captured by current genomic databases or feature sets. In this way, our framework not only offers a powerful predictive tool but also highlights antibiotic\u0026ndash;genera combinations that merit further biological investigation.\u003c/p\u003e\n\u003cp\u003eOur current approach, while effective across many antibiotics and pathogens, has several limitations. 
The feature set is focused on known resistance genes, upstream promoter regions, clinically significant mutations in core genes, \u0026nbsp;and rRNA elements, which may overlook other genomic contributors to resistance\u0026mdash;such as mobile genetic elements, non-coding RNAs, structural rearrangements, or epigenetic factors. Incorporating these additional layers could improve model interpretability and expand coverage. Furthermore, our models are trained to predict binary susceptibility outcomes based on CLSI breakpoints, without estimating MIC values or accounting for intermediate categories, which may reduce precision near clinical decision thresholds. In some genus\u0026ndash;antibiotic combinations, model performance was impacted by imbalanced or sparse training data, leading to the exclusion of unreliable predictions. These observations underscore the importance of curating balanced, well-annotated datasets and expanding feature diversity to further enhance the accuracy and generalizability of genomic AMR prediction models.\u003c/p\u003e\n\u003cp\u003eThis study provides a compelling proof of concept that bacterial genomic content can be effectively leveraged to predict antimicrobial susceptibility profiles with high accuracy. Similar findings have been reported in recent studies applying ML to AMR prediction. For example, Nguyen et al. demonstrated that genomic features, including k-mers and known resistance genes, can be used to predict minimum inhibitory concentrations (MICs) for nontyphoidal \u003cem\u003eSalmonella\u003c/em\u003e, often exceeding 90% accuracy. In another large-scale study, Moradigaravand et al. used pan-genome data and ML models to predict resistance phenotypes in \u003cem\u003eEscherichia coli\u003c/em\u003e, achieving high performance and identifying genomic elements linked to resistance across a broad population set. Their work emphasized the utility of genome-wide variation in resistance prediction and biomarker discovery. 
Compared to these studies, our work expands the scope by systematically evaluating resistance prediction across all major ESKAPEE pathogens and 40 antibiotics using a unified feature set. Furthermore, our application of these models to clinical metagenomic samples represents a critical step toward diagnostic translation. To our knowledge, this is the most comprehensive and best-performing ML-based AMR prediction framework reported to date, both in terms of taxonomic breadth and validation in a clinical context.\u003c/p\u003e\n\u003cp\u003eImportantly, the metagenomic dataset generated from 36 bloodstream infection samples provides an independent, real-world benchmark for the AMR research community. Derived from a tertiary hospital in Greece, this dataset includes a notably high proportion of resistant strains compared to publicly available databases, offering a valuable resource for testing predictive models under clinically realistic conditions. As an unseen, clinically relevant dataset, it enables robust evaluation of future AMR prediction models, moving beyond internal cross-validation or simulation-based benchmarks. We encourage its use as a standardized reference set to promote reproducibility, model comparison, and methodological advancement in the AMR prediction field.\u003c/p\u003e\n\u003cp\u003eWhile previous studies have shown the feasibility of genomic AMR prediction, our framework is among the first to integrate it into a fully clinical, culture-independent pipeline, enabling resistance prediction directly from metagenomic sequencing data. This approach is especially valuable when conventional methods are infeasible, for example in patients previously treated with antibiotics, whose cultures may return false-negative results, or when working with fastidious or unculturable pathogens such as \u003cem\u003eMycobacterium tuberculosis\u003c/em\u003e, \u003cem\u003eLegionella pneumophila\u003c/em\u003e, or \u003cem\u003eHelicobacter pylori\u003c/em\u003e. 
Although further validation and workflow optimization are needed before routine adoption, this method offers a promising alternative in urgent or complex diagnostic settings. To ensure widespread adoption and long-term clinical utility, future machine learning models must be trained on even larger, geographically diverse genome\u0026ndash;antibiogram datasets and be continuously updated to capture emerging resistance mechanisms. In this context, initiatives like COMPARE at ENA represent an important effort toward building such global resources\u003csup\u003e21\u003c/sup\u003e. Overall, our results demonstrate the potential of integrating metagenomic sequencing with ML-based prediction to accelerate antimicrobial susceptibility testing. Unlike traditional culture-based methods, which typically require 48\u0026ndash;72 hours, this approach can deliver accurate resistance profiles within 24\u0026ndash;48 hours \u0026mdash; potentially enabling earlier, more targeted therapy in clinical practice.\u003c/p\u003e\n\u003cp\u003eBeyond predictive accuracy, our study places equal emphasis on model interpretability. By applying SHAP (SHapley Additive exPlanations) values to our trained models, we provide feature-level insights into the genomic elements most influential in each prediction. These results are openly accessible through our interactive web application, which allows researchers to explore which k-mers and gene regions drive model decisions across antibiotics and bacterial genera. Notably, features that are consistently ranked as important may warrant further biological investigation, as they could highlight previously unexplored mechanisms underlying the resistance phenotype. In well-performing models, the top-ranked features often correspond to well-known resistance genes, providing an additional layer of biological validation. 
In contrast, lower-performing models\u0026mdash;such as those for gentamicin or tigecycline\u0026mdash;show less consistent or less interpretable feature patterns, possibly pointing to the presence of as-yet-undiscovered or under-characterized resistance mechanisms. As such, our platform not only delivers accurate predictions but also serves as a hypothesis-generating tool to guide future research into novel genomic determinants of AMR, ultimately advancing both basic science and clinical applications.\u003c/p\u003e\n\u003cp\u003eIn conclusion, our findings highlight the transformative potential of machine learning for antimicrobial resistance (AMR) diagnostics, enabling rapid, culture-independent prediction of resistance profiles directly from genomic data. The successful application of this framework to clinical bloodstream infection samples demonstrates its translational promise and paves the way for real-world implementation. As genomic resources expand and computational methods advance, genome-based AMR prediction may define the next generation of microbiological diagnostics\u0026mdash;particularly in cases where conventional methods are slow, limited, or infeasible. Looking ahead, one can envision a clinical paradigm where blood is drawn from the patient and analyzed in near real-time using integrated next-generation sequencing and machine learning pipelines, delivering reliable resistance predictions within hours. Given that over 90% of microorganisms cannot be routinely isolated in cultures, such approaches could revolutionize infectious disease management by supporting more precise, timely, and effective treatments. 
Realizing this vision will require broader adoption of sequencing technologies, cost reductions in instrumentation and reagents, expansion of curated genome\u0026ndash;antibiogram datasets, regular updates of interpretive breakpoints (e.g., CLSI, EUCAST), and deep integration of bioinformatics and machine learning expertise into clinical workflows.\u003c/p\u003e"},{"header":"Methods","content":"\u003cp\u003e\u003cstrong\u003ePublic Data Collection\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eData acquisition for antimicrobial resistance (AMR) prediction involved collecting bacterial genome assemblies and corresponding antibiogram data from three major public repositories: NCBI NDARO (https://www.ncbi.nlm.nih.gov/pathogens/isolates; downloaded December 2023), BV-BRC (https://www.bv-brc.org; downloaded December 2023), and CDC NARMS (https://wwwn.cdc.gov/NARMSNow; downloaded January 2024). The study focused on the clinically significant ESKAPEE genera (\u003cem\u003eEnterococcus, Staphylococcus, Klebsiella, Acinetobacter, Pseudomonas, Enterobacter,\u003c/em\u003e and \u003cem\u003eEscherichia\u003c/em\u003e). From NDARO, 13,827 isolates were retrieved, yielding 13,159 unique assemblies after the removal of redundant entries. The NARMS dataset contributed 870 \u003cem\u003eE. coli\u003c/em\u003e O157 assemblies with associated antibiograms. The BV-BRC database initially contained 234,123 assemblies from ESKAPEE genera, from which 6,249 assemblies with matched experimental antibiograms were retained after filtering. Across datasets, extensive preprocessing was conducted, including de-duplication, harmonization of antibiotic names, normalization of minimum inhibitory concentration (MIC) values, and integration of antibiogram records. 
The final dataset comprised resistance profiles for 19,525 unique assemblies, which then underwent quality control.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eData Quality Control\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eQuality control of both genome assemblies and antibiograms was essential to ensure the reliability of the data used for antimicrobial resistance prediction. For genome assemblies, the N50 statistic was used as the primary measure of contiguity and assembly quality. Assemblies with low N50 values were excluded to maintain a high standard of assembly contiguity. While N50 values were directly available for NDARO entries, those from CDC NARMS and BV-BRC had to be computed. We calculated these values by extracting contig lengths from FASTA files and applying the Biostrings::N50() function in R. Assemblies were filtered using a minimum N50 threshold of 50,000 bp for most genera (\u003cem\u003eEnterococcus, Klebsiella, Acinetobacter, Pseudomonas, and Escherichia\u003c/em\u003e) and 20,000 bp for \u003cem\u003eEnterobacter\u003c/em\u003e and \u003cem\u003eStaphylococcus\u003c/em\u003e. After filtering, 18,916 high-quality assemblies were retained.\u003c/p\u003e\n\u003cp\u003eAntibiogram quality control retained only entries with quantitative susceptibility data reported as MICs, obtained using standardized methods (Clinical and Laboratory Standards Institute [CLSI] or European Committee on Antimicrobial Susceptibility Testing [EUCAST]). To ensure consistency across datasets, susceptibility and resistance classifications were determined according to MIC breakpoints defined in the CLSI 34th edition (https://clsi.org/resources/breakpoint-implementation-toolkit/). Strains with MICs falling into the intermediate category were excluded from analysis for the corresponding antibiotic, as such cases could reflect potential testing, reporting, or reagent discrepancies. 
Additionally, any record showing a mismatch between the reported phenotype and MIC-derived classification was discarded. The Zenodo repository folder (18916_assemblies_antibiograms) contains the fully quality-controlled assemblies and antibiograms used in this study.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAnnotation Pipeline for Analysis of WGS Data\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eTo enable predictive modeling of antimicrobial resistance (AMR), we developed a comprehensive annotation pipeline designed to extract relevant genomic features from bacterial assemblies. The annotation focused on three biologically informative categories: (i) known antimicrobial resistance genes, including both horizontally acquired resistance determinants (e.g., \u0026beta;-lactamases, efflux pumps) and core genes whose mutations are associated with resistance phenotypes; (ii) the upstream regulatory regions (300 bp) of these genes, which may influence expression; and (iii) ribosomal RNA (rRNA) genes (5S, 16S, 23S), which are functionally linked to the mechanism of action of several antibiotics.\u003c/p\u003e\n\u003cp\u003eTo construct a high-confidence database of AMR genes, we integrated protein-coding sequences from three major repositories: the Comprehensive Antibiotic Resistance Database (CARD, version 3.2.9), ReferenceGeneCatalog (version 3.12), and ResFinderDB (accessed June 2024). From CARD, we extracted 5,078 resistance genes by matching protein accession numbers in the Antibiotic Resistance Ontology (ARO). Reference Gene Catalog provided 8,157 gene entries retrieved from the NCBI FTP server. ResFinder contributed 3,150 resistance genes. All the genes were translated into protein sequences using a custom Python script. 
Combined, this resulted in a unified non-redundant dataset of 14,183 antimicrobial resistance protein sequences.\u003c/p\u003e\n\u003cp\u003eTo reduce redundancy and facilitate homology-based searches, protein sequences were clustered using CD-HIT (version 4.8.1) with stringent parameters (-c 0.9, -aL 0.9) to ensure that highly similar but functionally distinct genes (e.g., \u003cem\u003etetM\u003c/em\u003e and \u003cem\u003etetO\u003c/em\u003e) remained in separate clusters. This threshold was chosen based on the findings of Zilhao et al.\u003csup\u003e22\u003c/sup\u003e, which showed that co-occurring tetracycline resistance genes can be mistakenly merged under less stringent cutoffs. Clustering resulted in 2,580 non-redundant protein families. Representative sequences from each cluster were used to build a custom BLAST database.\u003c/p\u003e\n\u003cp\u003eFor each genome assembly, AMR gene detection was performed using blastx (NCBI BLAST+ version 2.15.0+) with an E-value threshold of 10\u003csup\u003e-50\u003c/sup\u003e and output formatted to include standard fields as well as aligned query/subject sequences and subject lengths. Hits were filtered based on \u0026ge;70% identity and global coverage using a Python script. For each valid alignment, the 300 nucleotides upstream of the alignment start site were extracted using the Biostrings R package (version 2.66.0)\u003csup\u003e23\u003c/sup\u003e, with reverse strand orientation handled appropriately.\u003c/p\u003e\n\u003cp\u003eTo complement AMR gene annotations, rRNA genes were identified via BLASTn by aligning 5S, 16S, and 23S reference sequences to each assembly. Separate BLAST databases were created for each rRNA gene type. Given the presence of multiple rRNA operons in bacterial genomes, multiple matches per genome were expected and retained. BLASTn results were parsed using a bash script and analyzed with Python to record the best-scoring non-overlapping hits. 
For downstream analysis, we constructed a pseudo-rRNA operon for each genome by concatenating all detected rRNA gene copies in a fixed order (5S, 16S, 23S), mimicking their natural genomic arrangement based on a reference genome. Unmatched regions or gaps were padded with the character \u0026lsquo;X\u0026rsquo; to ensure sequence uniformity across samples.\u003c/p\u003e\n\u003cp\u003eThis pipeline generated a consistent, high-resolution annotation of each genome, capturing AMR-related protein features, upstream regulatory sequences, and rRNA elements for downstream machine learning applications.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eTraining Machine Learning Models for Predicting Antimicrobial Resistance\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eTo build robust models for antimicrobial resistance (AMR) prediction, we developed a comprehensive pipeline covering data preprocessing, k-mer\u0026ndash;based feature engineering, model training, and performance evaluation. The input included AMR genes (protein level), promoter regions, and rRNA genes (nucleotide level), each processed separately to retain biological relevance. Overlapping k-mers of sizes 3, 4, and 5 were extracted per gene or protein, and only k-mers present in the training set were kept to maintain consistency and reduce noise. We computed k-mer frequencies using the CountVectorizer class of scikit-learn (v1.0.2)\u003csup\u003e24\u003c/sup\u003e, producing high-dimensional, sparse feature vectors capturing local sequence patterns tied to resistance mechanisms.\u003c/p\u003e\n\u003cp\u003eAll analyses were conducted in Python, using pandas (v2.0.3), NumPy (v1.21.5), matplotlib (v3.7.2), seaborn (v0.13.2), SciPy (v1.7.3), and argparse (v1.1). 
For supervised learning, we used Extreme Gradient Boosting (XGBoost, v2.0.3)\u003csup\u003e25\u003c/sup\u003e and Random Forest classifiers implemented via Scikit-learn.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eThe dataset was split into training and test sets with a stratified 80:20 split to preserve class balance. Hyperparameters were tuned via grid search: for XGBoost, we explored max_depth [7, 9, 11, 13, 15, 17, 19, 21, 23], ran 300 boosting rounds, and applied early stopping (patience 10) using 5-fold cross-validation log loss; for Random Forests, we tuned n_estimators [100, 200, 400], max_depth [10, 20, 30, ..., 100], and both bootstrap settings. Evaluation metrics included accuracy, precision, recall, and F1-score. Performance was compared across k-mer sizes to determine the most predictive setup.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFeature Importance and Model Interpretability\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eTo interpret the contributions of individual genomic features to antimicrobial resistance predictions, we employed both model-specific importance metrics and SHAP (SHapley Additive exPlanations) values. For Random Forest and XGBoost models, raw importance scores were extracted using feature_importances_ and get_score() (importance type: \u0026apos;weight\u0026apos;), respectively. These values were normalized to reflect the relative contribution of each feature as a percentage. Features contributing less than 0.001% were filtered out to reduce noise and enhance interpretability.\u003c/p\u003e\n\u003cp\u003eTo provide a model-agnostic and more nuanced explanation of feature influence, SHAP analysis was conducted using the SHAP Python package (v0.46.0). SHAP values were calculated on the held-out test set using TreeExplainer, capturing both the magnitude and direction of each feature\u0026rsquo;s effect on the prediction. 
Specifically, we extracted SHAP values for the predicted class (class 1, resistant) and computed the mean absolute SHAP value per feature to assess importance, along with the average SHAP value to determine the direction of influence (positive or negative).\u003c/p\u003e\n\u003cp\u003eTo link model-derived features with biological context, we annotated each feature with metadata including the gene or region name, associated sequence, and sequence type (protein, promoter, or rRNA). Feature identifiers were constructed by concatenating these attributes, and SHAP-derived importance scores were merged with this metadata to create an integrated, interpretable view. This multi-level approach enabled biologically meaningful insights into which genomic regions drove resistance predictions and highlighted candidate markers for further experimental validation.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eInteractive R/Shiny Application\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eTo enhance accessibility and promote transparency, we developed an interactive R/Shiny web application that enables users to explore the dataset and machine learning results related to antimicrobial resistance prediction. The platform provides descriptive summaries of the data\u0026mdash;such as the distribution of resistance phenotypes across genera and antibiotics\u0026mdash;as well as detailed model performance metrics including accuracy, precision, recall, F1-score, and AUC. SHAP-based feature importance scores are also available, supporting interpretability of model predictions. The application is built with R (v4.4.2) and Shiny (v1.10.0), and is hosted on a Linux-based server with 2 CPU cores, 4 GB RAM, and 19 GB of SSD storage. 
The web app is publicly accessible at https://dianalab.e-ce.uth.gr/amrpredictor/.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eDRO-Inspired Machine Learning Models\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eTo improve the prediction of antimicrobial resistance under uncertain or noisy conditions, we implemented a machine learning pipeline inspired by distributionally robust optimization (DRO). First, we trained an initial XGBoost classifier to estimate prediction uncertainty, calculated as the proximity of predicted probabilities to the decision boundary (|0.5 \u0026minus; probability|), with smaller values indicating greater uncertainty. A tunable \u0026alpha;-quantile (set at 30%) was applied to identify the most uncertain samples, which were then upweighted (weight = 5) to emphasize difficult cases during subsequent training. Additionally, label smoothing was applied to reduce overfitting by slightly shifting the true labels toward 0.5 (using a smoothing factor of 0.05). Finally, we trained an XGBoost regressor with a logistic objective on the smoothed labels and adjusted sample weights. The hyperparameters were selected based on prior performance across the full training dataset, and the classification threshold was fixed at 0.5. Performance was evaluated using standard classification metrics, including precision, recall, F1-score, and the confusion matrix. This strategy aimed to enhance model robustness by accounting for uncertainty and prioritizing high-risk misclassifications.\u003c/p\u003e\n\u003ch3\u003e\u003cstrong\u003eMetagenomic Sequencing and AST Prediction: Real-Time Application to Bloodstream Infections\u003c/strong\u003e\u003c/h3\u003e\n\u003cp\u003eA total of 40 blood cultures, each obtained from an individual patient hospitalized at the 600-bed University Hospital of Larissa, Greece, with a clinically confirmed bloodstream infection, were included to evaluate the performance of our models. 
A 10 mL volume of whole blood from each patient was inoculated into aerobic blood culture bottles (BACTEC, BD) and incubated in the BD BACTEC\u0026trade; FX automated blood culture system (Becton, Dickinson). A positive growth signal was used as an indicator of microbial presence.\u003c/p\u003e\n\u003cp\u003eDNA extraction was performed in three steps: red blood cell (RBC) lysis, host DNA depletion, and microbial DNA isolation. Briefly, 200 \u0026mu;L from the culture-positive blood culture bottle was mixed with 600 \u0026mu;L of RBC Lysis Buffer (Zymo) and incubated for 5 minutes at room temperature. Samples were centrifuged at 2000\u0026times;g for 10 minutes, and the supernatant was carefully removed. The pellet was resuspended in 200 \u0026mu;L of PBS. Host DNA was depleted using the HostZERO Microbial DNA Kit (Zymo), following the manufacturer\u0026rsquo;s protocol. The sample was then placed in a 2 mL tube holder and subjected to mechanical lysis using a bead beater at maximum speed for two 10-minute cycles, with a 2-minute interval. DNA was extracted using the PSS magLEAD 12Gc automated system (BioServices, Stockholm). DNA concentration and purity were assessed using a NanoDrop One spectrophotometer and a Qubit fluorometer (Thermo Fisher Scientific).\u003c/p\u003e\n\u003cp\u003eMetagenomic sequencing was performed on the DNBSEQ-G99 platform (MGI) using the MGIEasy Fast FS Library Prep Set, MGIEasy Dual Barcode Circularization Kit, and the High-throughput Sequencing Set (G99 SM FCL PE150), aiming for 4 million reads per sample. Before the microbial genome of each sample was assembled, the MGI Pathogen Fast Identification (PFI) tool was used for the molecular identification of microorganisms. Four samples were excluded because PFI indicated a mixed bacterial population, with the most abundant bacterium accounting for less than 90%. 
Then, raw reads were assembled using SPAdes (version 4.0.0), guided by the closest reference genome corresponding to the dominant species identified by the MGI Pathogen Fast Identification (PFI) tool.\u003c/p\u003e\n\u003cp\u003eAssembled genomes were processed through our annotation pipeline to detect antimicrobial resistance (AMR) genes and core genes, their associated upstream promoter regions, and ribosomal RNA (rRNA) elements. K-mer encodings (k = 3, 4, 5) were generated for these features. For each assembly, resistance prediction was performed using our pre-trained machine learning models, restricted to the antibiotics with demonstrated high model performance for the corresponding genus.\u003c/p\u003e\n\u003cp\u003ePredicted resistance profiles were compared with conventional antimicrobial susceptibility testing (AST) results. Briefly, 10 \u0026mu;L from each positive blood culture bottle was inoculated onto blood agar and MacConkey agar, and both plates were incubated at 37 \u0026deg;C for at least 24 h. Colonies were then identified and tested for susceptibility using the VITEK 2 automated system (bioM\u0026eacute;rieux), according to the manufacturer\u0026rsquo;s guidelines. The susceptibility results obtained from the VITEK 2 were interpreted using the CLSI 34th edition breakpoints. For comparison, ResFinder (v4.7.2)\u003csup\u003e26,27\u003c/sup\u003e predictions were generated using the tool\u0026rsquo;s preset thresholds, and the predicted phenotypes were obtained from the output pheno_table.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eSystem Configuration\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe analyses described in this paper were conducted on a computational platform equipped with an Intel\u0026reg; Xeon\u0026reg; Gold 6226R CPU @ 2.90 GHz with 376 GB of RAM. The operating system used was Ubuntu Linux version 20.04.5. 
The primary programming languages and their versions used for this research were Python (v3.12.7) and R (v4.2.2).\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eDATA AVAILABILITY\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe source code used for model training, prediction, and analysis is available on GitHub at https://github.com/dianalabgr/amrprediction under the MIT License. The curated genome assemblies and manually checked antibiograms used for model development are available on Zenodo at https://zenodo.org/uploads/16213507 . This Zenodo repository includes:\u003c/p\u003e\n\u003cul\u003e\n \u003cli\u003ethe machine learning models with their performance metrics and SHAP values (models_shap_values.zip),\u003c/li\u003e\n \u003cli\u003ethe encoded k-mer datasets used for training (kmer3_data.zip, kmer4_data.zip, kmer5_data.zip, kmer_data_extra.zip),\u003c/li\u003e\n \u003cli\u003ethe assemblies and antibiograms of the 36 metagenomic blood culture samples, which can be used as a real clinical benchmark dataset (bloodcultures_assemblies_antibiograms.zip),\u003c/li\u003e\n \u003cli\u003ethe k-mer profiles used to predict their resistance phenotypes (kmers_blood_cultures.zip),\u003c/li\u003e\n \u003cli\u003eand the filtered high-quality assemblies and antibiograms used for model training (filtered_assemblies.zip, qced_antibiograms.zip).\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eThe raw metagenomic sequencing data (FASTQ files) from blood cultures of 40 patients with bloodstream infections have been deposited in the NCBI Sequence Read Archive (SRA) under BioProject accession PRJNA1290626. 
This includes 52 Gbases of multispecies data across 40 SRA experiments and 40 BioSamples, collected at the University Hospital of Larissa (registration date: 12-Jul-2025).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eACKNOWLEDGEMENTS\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors acknowledge the members of DIANA-lab for their very useful comments and ideas.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFUNDING\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis research has been co-financed by the European Regional Development Fund of the European Union and Greek national funds through the Operational Program Competitiveness, Entrepreneurship and Innovation, and Greece 2.0, under the call \u0026ldquo;RESEARCH \u0026ndash; CREATE \u0026ndash; INNOVATE\u0026rdquo; (ID 16971), with project ID: TAEDK-06179. Additionally, this research has been supported by \u0026ldquo;ELIXIR-GR: The Greek Research Infrastructure for Data Management and Analysis in Life Sciences\u0026rdquo; [MIS-5002780], implemented under the Action \u0026ldquo;Reinforcement of the Research and Innovation Infrastructure,\u0026rdquo; funded by the Operational Programme \u0026ldquo;Competitiveness, Entrepreneurship and Innovation\u0026rdquo; [NSRF 2014\u0026ndash;2020] and co-financed by Greece and the European Union (European Regional Development Fund). In addition, the work of Anargyros Skoulakis was supported by the Hellenic Foundation for Research and Innovation (HFRI) under the 5th Call for HFRI PhD Fellowships (Fellowship Number: 20480).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCONFLICT OF INTEREST\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eWHO, World Health Organization (2022). 
Antimicrobial Resistance Surveillance in Europe 2022\u0026ndash;2020 Data. World Health Organization, Regional Office for Europe. URL: https://iris.who.int/bitstream/handle/10665/351141/9789289056687-eng.pdf.\u003c/li\u003e\n\u003cli\u003eKritsotakis, E. I., Lagoutari, D., Michailellis, E., Georgakakis, I., \u0026amp; Gikas, A. (2022). Burden of multidrug and extensively drug-resistant ESKAPEE pathogens in a secondary hospital care setting in Greece. \u003cem\u003eEpidemiology \u0026amp; Infection\u003c/em\u003e, \u003cem\u003e150\u003c/em\u003e, e170.\u003c/li\u003e\n\u003cli\u003eMa, Y. X., Wang, C. Y., Li, Y. Y., Li, J., Wan, Q. Q., Chen, J. H., ... \u0026amp; Niu, L. N. (2020). Considerations and caveats in combating ESKAPE pathogens against nosocomial infections. \u003cem\u003eAdvanced Science\u003c/em\u003e, \u003cem\u003e7\u003c/em\u003e(1), 1901872.\u003c/li\u003e\n\u003cli\u003eBoolchandani, M., D\u0026rsquo;Souza, A. W., \u0026amp; Dantas, G. (2019). Sequencing-based methods and resources to study antimicrobial resistance. \u003cem\u003eNature Reviews Genetics\u003c/em\u003e, \u003cem\u003e20\u003c/em\u003e(6), 356-370.\u003c/li\u003e\n\u003cli\u003eNguyen, M., Brettin, T., Long, S. W., Musser, J. M., Olsen, R. J., Olson, R., ... \u0026amp; Davis, J. J. (2018). Developing an in silico minimum inhibitory concentration panel test for Klebsiella pneumoniae. \u003cem\u003eScientific reports\u003c/em\u003e, \u003cem\u003e8\u003c/em\u003e(1), 421.\u003c/li\u003e\n\u003cli\u003eKim, J. I., Maguire, F., Tsang, K. K., Gouliouris, T., Peacock, S. J., McAllister, T. A., ... \u0026amp; Beiko, R. G. (2022). Machine learning for antimicrobial resistance prediction: current practice, limitations, and clinical perspective. \u003cem\u003eClinical microbiology reviews\u003c/em\u003e, \u003cem\u003e35\u003c/em\u003e(3), e00179-21.\u003c/li\u003e\n\u003cli\u003eMacesic, N., Polubriaginof, F., \u0026amp; Tatonetti, N. P. (2017). 
Machine learning: novel bioinformatics approaches for combating antimicrobial resistance. \u003cem\u003eCurrent opinion in infectious diseases\u003c/em\u003e, \u003cem\u003e30\u003c/em\u003e(6), 511-517.\u003c/li\u003e\n\u003cli\u003eOlson, R. D., Assaf, R., Brettin, T., Conrad, N., Cucinell, C., Davis, J. J., ... \u0026amp; Stevens, R. L. (2023). Introducing the bacterial and viral bioinformatics resource center (BV-BRC): a resource combining PATRIC, IRD and ViPR. \u003cem\u003eNucleic acids research\u003c/em\u003e, \u003cem\u003e51\u003c/em\u003e(D1), D678-D689.\u003c/li\u003e\n\u003cli\u003eAlcock, B. P., Huynh, W., Chalil, R., Smith, K. W., Raphenya, A. R., Wlodarski, M. A., ... \u0026amp; McArthur, A. G. (2023). CARD 2023: expanded curation, support for machine learning, and resistome prediction at the Comprehensive Antibiotic Resistance Database. \u003cem\u003eNucleic acids research\u003c/em\u003e, \u003cem\u003e51\u003c/em\u003e(D1), D690-D699.\u003c/li\u003e\n\u003cli\u003eFeldgarden, M., Brover, V., Gonzalez-Escalona, N., Frye, J. G., Haendiges, J., Haft, D. H., ... \u0026amp; Klimke, W. (2021). AMRFinderPlus and the Reference Gene Catalog facilitate examination of the genomic links among antimicrobial resistance, stress response, and virulence. \u003cem\u003eScientific reports\u003c/em\u003e, \u003cem\u003e11\u003c/em\u003e(1), 12728.\u003c/li\u003e\n\u003cli\u003eZankari, E., Hasman, H., Cosentino, S., Vestergaard, M., Rasmussen, S., Lund, O., ... \u0026amp; Larsen, M. V. (2012). Identification of acquired antimicrobial resistance genes. \u003cem\u003eJournal of antimicrobial chemotherapy\u003c/em\u003e, \u003cem\u003e67\u003c/em\u003e(11), 2640-2644.\u003c/li\u003e\n\u003cli\u003eLi, W., \u0026amp; Godzik, A. (2006). Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. 
\u003cem\u003eBioinformatics\u003c/em\u003e, \u003cem\u003e22\u003c/em\u003e(13), 1658-1659.\u003c/li\u003e\n\u003cli\u003eCamacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K., \u0026amp; Madden, T. L. (2009). BLAST+: architecture and applications. \u003cem\u003eBMC bioinformatics\u003c/em\u003e, \u003cem\u003e10\u003c/em\u003e(1), 421.\u003c/li\u003e\n\u003cli\u003eMoradigaravand, D., Palm, M., Farewell, A., Mustonen, V., Warringer, J., \u0026amp; Parts, L. (2018). Prediction of antibiotic resistance in Escherichia coli from large-scale pan-genome data. \u003cem\u003ePLoS computational biology\u003c/em\u003e, \u003cem\u003e14\u003c/em\u003e(12), e1006258.\u003c/li\u003e\n\u003cli\u003eNguyen, M., Long, S. W., McDermott, P. F., Olsen, R. J., Olson, R., Stevens, R. L., ... \u0026amp; Davis, J. J. (2019). Using machine learning to predict antimicrobial MICs and associated genomic features for nontyphoidal Salmonella. \u003cem\u003eJournal of clinical microbiology\u003c/em\u003e, \u003cem\u003e57\u003c/em\u003e(2), 10-1128.\u003c/li\u003e\n\u003cli\u003eRen, Y., Chakraborty, T., Doijad, S., Falgenhauer, L., Falgenhauer, J., Goesmann, A., ... \u0026amp; Heider, D. (2022). Prediction of antimicrobial resistance based on whole-genome sequencing and machine learning. \u003cem\u003eBioinformatics\u003c/em\u003e, \u003cem\u003e38\u003c/em\u003e(2), 325-334.\u003c/li\u003e\n\u003cli\u003eArdila, C. M., Gonz\u0026aacute;lez-Arroyave, D., \u0026amp; Tob\u0026oacute;n, S. (2025). Machine learning for predicting antimicrobial resistance in critical and high-priority pathogens: A systematic review considering antimicrobial susceptibility tests in real-world healthcare settings. \u003cem\u003ePlos one\u003c/em\u003e, \u003cem\u003e20\u003c/em\u003e(2), e0319460.\u003c/li\u003e\n\u003cli\u003eSakagianni, A., Koufopoulou, C., Feretzakis, G., Kalles, D., Verykios, V. S., \u0026amp; Myrianthefs, P. (2023). 
Using machine learning to predict antimicrobial resistance―a literature review. \u003cem\u003eAntibiotics\u003c/em\u003e, \u003cem\u003e12\u003c/em\u003e(3), 452.\u003c/li\u003e\n\u003cli\u003ePrjibelski, A., Antipov, D., Meleshko, D., Lapidus, A., \u0026amp; Korobeynikov, A. (2020). Using SPAdes de novo assembler. \u003cem\u003eCurrent protocols in bioinformatics\u003c/em\u003e, \u003cem\u003e70\u003c/em\u003e(1), e102.\u003c/li\u003e\n\u003cli\u003eLundberg, S. M., \u0026amp; Lee, S. I. (2017). A unified approach to interpreting model predictions. \u003cem\u003eAdvances in neural information processing systems\u003c/em\u003e, \u003cem\u003e30\u003c/em\u003e.\u003c/li\u003e\n\u003cli\u003eMatamoros, S., Hendriksen, R. S., Pataki, B. \u0026Aacute;., Pakseresht, N., Rossello, M., Silvester, N., ... \u0026amp; Schultsz, C. (2020). Accelerating surveillance and research of antimicrobial resistance\u0026ndash;an online repository for sharing of antimicrobial susceptibility data associated with whole-genome sequences. \u003cem\u003eMicrobial genomics\u003c/em\u003e, \u003cem\u003e6\u003c/em\u003e(5), e000342.\u003c/li\u003e\n\u003cli\u003eSchwarz, S., \u0026amp; Noble, W. C. (1994). Tetracycline resistance genes in staphylococci from the skin of pigs. \u003cem\u003eJournal of Applied Microbiology\u003c/em\u003e, \u003cem\u003e76\u003c/em\u003e(4), 320-326.\u003c/li\u003e\n\u003cli\u003ePag\u0026egrave;s H, Aboyoun P, Gentleman R, DebRoy S (2025). Biostrings: Efficient manipulation of biological strings. doi:10.18129/B9.bioc.Biostrings, R package version 2.76.0, https://bioconductor.org/packages/Biostrings.\u003c/li\u003e\n\u003cli\u003ePedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... \u0026amp; Duchesnay, \u0026Eacute;. (2011). Scikit-learn: Machine learning in Python. the Journal of machine Learning research, 12, 2825-2830.\u003c/li\u003e\n\u003cli\u003eChen, T., \u0026amp; Guestrin, C. (2016, August). 
Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining (pp. 785-794).\u003c/li\u003e\n\u003cli\u003eBortolaia, V., Kaas, R. S., Ruppe, E., Roberts, M. C., Schwarz, S., Cattoir, V., ... \u0026amp; Aarestrup, F. M. (2020). ResFinder 4.0 for predictions of phenotypes from genotypes. Journal of Antimicrobial Chemotherapy, 75(12), 3491-3500.\u003c/li\u003e\n\u003cli\u003eClausen, P. T., Aarestrup, F. M., \u0026amp; Lund, O. (2018). Rapid and precise alignment of raw reads against redundant databases with KMA. BMC bioinformatics, 19(1), 307.\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"Machine Learning, Genomic Data, Antimicrobial Resistance, Bacteria, Antibiotics, Clinical Diagnostics","lastPublishedDoi":"10.21203/rs.3.rs-7190203/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-7190203/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"Antimicrobial resistance (AMR) is a mounting global crisis, fueled by the rapid emergence of multidrug-resistant bacteria. 
Among the most concerning culprits are the ESKAPEE bacteria—Enterococcus faecium, Staphylococcus aureus, Klebsiella pneumoniae, Acinetobacter baumannii, Pseudomonas aeruginosa, Enterobacter spp., and Escherichia coli—which are leading causes of hospital-acquired infections worldwide. \r\nIn this study, we developed and validated machine learning models for predicting antimicrobial resistance phenotypes directly from genomic data. We assembled a robust dataset of 18,916 ESKAPEE genome assemblies, each paired with its corresponding antibiogram, covering susceptibility results for 40 different antibiotics. Using this data, we trained Random Forest and Extreme Gradient Boosting (XGBoost) models for each antibiotic separately, which consistently demonstrated excellent predictive performance, achieving over 90% recall and F1 score for almost all pathogen–antibiotic combinations. To maximize the utility and accessibility of our findings, we developed an interactive web platform (https://dianalab.e-ce.uth.gr/amrpredictor/) that allows users to explore prediction outcomes and identify the most informative genomic features driving resistance using Shap values. Furthermore, we rigorously validated our approach in a clinical setting. We applied our prediction pipeline to metagenomic sequencing data obtained from 36 blood culture-positive ESKAPEE samples. This real-world evaluation revealed a strong concordance between our predicted resistance profiles and conventional phenotypic results. Importantly, this metagenomic dataset also serves as a valuable, independent benchmark for future research in developing and evaluating AMR prediction models across ESKAPEE pathogens. 
Our work underscores the transformative potential of integrating genomics and machine learning to provide accurate, interpretable, and clinically actionable predictions for combating antimicrobial resistance.","manuscriptTitle":"Machine Learning Predicts Antimicrobial Resistance from Genomic Data across ESKAPEE Pathogens","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-08-01 07:51:21","doi":"10.21203/rs.3.rs-7190203/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"8ea6f185-b92a-4cb1-b83d-21ca7c3b4674","owner":[],"postedDate":"August 1st, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[{"id":52400121,"name":"Biological sciences/Microbiology/Clinical microbiology"},{"id":52400122,"name":"Biological sciences/Computational biology and bioinformatics/Machine learning"},{"id":52400123,"name":"Biological sciences/Microbiology/Antimicrobials/Antibiotics"},{"id":52400124,"name":"Health sciences/Diseases/Infectious diseases/Bacterial infection"}],"tags":[],"updatedAt":"2025-08-26T07:40:29+00:00","versionOfRecord":[],"versionCreatedAt":"2025-08-01 
07:51:21","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-7190203","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-7190203","identity":"rs-7190203","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
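The Methods text above states that k-mer encodings (k = 3, 4, 5) were generated for the detected AMR genes, core genes, promoter regions, and rRNA elements. The following is a minimal sketch of what such an encoding could look like, not the authors' implementation (their code is in the GitHub repository named under Data Availability). It assumes a plain DNA alphabet, normalised k-mer frequencies rather than raw counts, and simple concatenation of the three k values; the function names `kmer_vocabulary`, `kmer_frequencies`, and `encode_feature` are hypothetical.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"  # assumption: windows containing other symbols (e.g. N) are ignored


def kmer_vocabulary(k):
    """Return all 4**k DNA k-mers in a fixed lexicographic order."""
    return ["".join(p) for p in product(ALPHABET, repeat=k)]


def kmer_frequencies(sequence, k):
    """Relative frequency of every k-mer in `sequence` (overlapping windows)."""
    seq = sequence.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    vocab = kmer_vocabulary(k)
    total = sum(counts[kmer] for kmer in vocab)  # counts only valid ACGT windows
    if total == 0:
        return [0.0] * len(vocab)
    return [counts[kmer] / total for kmer in vocab]


def encode_feature(sequence, ks=(3, 4, 5)):
    """Concatenate the frequency vectors for each k into one feature vector."""
    vector = []
    for k in ks:
        vector.extend(kmer_frequencies(sequence, k))
    return vector


if __name__ == "__main__":
    toy_gene = "ATGACCGTTAGCATGACC"  # made-up sequence, for illustration only
    features = encode_feature(toy_gene)
    print(len(features))  # 64 + 256 + 1024 entries
```

Under these assumptions each genomic feature maps to a fixed-length numeric vector regardless of sequence length, which is the shape of input that tree-based classifiers such as Random Forest or XGBoost expect.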

