Abstract
Background
Helicobacter pylori (H. pylori) infection is widespread globally and is linked to outcomes ranging from chronic gastritis to gastric cancer. However, only a minority of infected individuals progress to malignancy, a progression influenced by a mix of bacterial, host, and environmental factors. Current predictive approaches are limited because they rely mainly on clinical and lifestyle data. Genomic approaches have been used only sparsely, and incorporating them into machine learning models could support earlier and more personalized detection. This study aimed to evaluate the impact of integrating host metadata with genomic features from H. pylori on the prediction of gastric cancer outcomes, and to identify associated variables.
Methods
A total of 1,363 publicly available H. pylori genomes with associated host information, spanning 1991 to 2024, were collected from NCBI and EnteroBase. Demographic features, virulence genes, and sequence-derived and variant-based features were extracted. Machine learning models were then developed to classify infection outcomes as gastric cancer or non-gastric cancer. Logistic regression, an interpretable baseline model, was compared against higher-performance ensemble models (XGBoost and Random Forest). Model performance was assessed using recall, precision, AUROC, and AUPRC.
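As a minimal sketch of how the evaluation metrics named above are defined, the following computes recall, precision, and AUROC for the gastric cancer class in plain Python. The labels and scores are made-up toy values, not study data, and the function names are hypothetical; the study's own evaluation code is not described in this abstract.

```python
# Toy illustration only: labels and scores are invented, not study data.
# 1 = gastric cancer (positive class), 0 = non-gastric cancer.

def recall_precision(y_true, y_pred, positive=1):
    """Recall and precision for the positive class from hard predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision

def auroc(y_true, scores, positive=1):
    """AUROC as the probability that a random positive outranks a random
    negative (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
# Assumed decision threshold of 0.5 for turning scores into class labels.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

recall, precision = recall_precision(y_true, y_pred)
area = auroc(y_true, scores)
print(f"recall={recall:.3f} precision={precision:.3f} AUROC={area:.4f}")
```

Recall is the metric that tracks missed gastric cancer cases. AUROC summarizes how well the scores rank cases across all thresholds, which is why it does not depend on the 0.5 cut-off assumed here.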
Results
The logistic regression model achieved a recall of 0.736 (95% CI: 0.644-0.831) for gastric cancer and an AUROC of 0.888 (95% CI: 0.843-0.929). Both XGBoost and Random Forest outperformed the baseline model, with AUROC values ranging from 0.950 to 0.954 (95% CI: 0.904-0.976). Relative to the baseline, recall for gastric cancer detection improved by 8.3% for XGBoost (0.797, 95% CI: 0.711-0.877) and by 11.4% for Random Forest (0.820, 95% CI: 0.734-0.896). Across models, patient age consistently emerged as the strongest predictor of gastric cancer, and several sequence-derived genomic features beyond the established virulence genes contributed to differences in infection outcome.
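The reported recall gains are consistent with relative improvements over the baseline rather than absolute differences. This reading is an inference from the point estimates above, not something the abstract states explicitly. A short check, using only the recall values reported here (the helper name is hypothetical):

```python
def relative_improvement(baseline, improved):
    """Fractional gain of `improved` over `baseline`."""
    return (improved - baseline) / baseline

baseline_recall = 0.736  # logistic regression
model_recalls = {"XGBoost": 0.797, "Random Forest": 0.820}

# Relative gain in percent, rounded to one decimal place.
relative_gains = {
    name: round(100 * relative_improvement(baseline_recall, r), 1)
    for name, r in model_recalls.items()
}

# Absolute gain in percentage points, for comparison.
absolute_gains = {
    name: round(100 * (r - baseline_recall), 1)
    for name, r in model_recalls.items()
}

print(relative_gains)
print(absolute_gains)
```

Under this reading, the absolute gains in recall are smaller than the quoted percentages: about 6.1 and 8.4 percentage points respectively.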
Conclusion
This study demonstrates that combining pathogen genomics with host demographics can uncover novel risk factors and support early detection with high predictive power. Explainability methods such as SHAP make the models more interpretable for clinical professionals and support informed decision-making. Validation and translation into clinical practice will require broader, more diverse datasets and the inclusion of additional host and lifestyle variables.
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
Figures 2, 3 and 4 revised; new multi-panel figures added for model performance and explanations for each model (new Figures 2-7); supplemental files updated.
6. List of abbreviations
- H. pylori: Helicobacter pylori
- XGBoost: eXtreme Gradient Boosting
- SMOTE-NC: Synthetic Minority Over-sampling Technique for Nominal and Continuous
- SHAP: SHapley Additive exPlanations
- AUROC: Area Under Receiver Operating Characteristic curve
- MALT: Mucosa-associated lymphoid tissue
- WHO: World Health Organisation
- cagA: cytotoxin-associated gene A
- cagPAI: cag pathogenicity island
- vacA: vacuolating cytotoxin A
- BabA: blood group antigen binding adhesin
- oipA: outer inflammatory protein A
- sLeX: sialylated Lewis antigens
- SRA: Sequence Read Archive
- NAC: Nucleic Acid Composition
- MMI: Multivariate Mutual Information
- ROC: Receiver Operating Characteristic
- AUC: Area Under the Curve
- AUPRC: Area Under the Precision-Recall curve
- AP: Average precision