Cross-Geographic Validation Demonstrates Universal Transcriptomic Signatures for Tuberculosis Diagnosis: A Machine Learning Study | Research Square window.SnipcartSettings = { analytics: { enabled: false } }; (function() { var accessVector = localStorage.getItem('access_vector') || ''; window.dataLayer = window.dataLayer || []; if (accessVector) { window.dataLayer.push({ user: { profile: { profileInfo: { snid: accessVector } } } }); } })(); (function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src='https://www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);})(window,document,'script','dataLayer','GTM-K279D39R'); Browse Preprints In Review Journals COVID-19 Preprints AJE Video Bytes Research Tools Research Promotion AJE Professional Editing AJE Rubriq About Preprint Platform In Review Editorial Policies Our Team Advisory Board Help Center Sign In Submit a Preprint Cite Share Download PDF Article Cross-Geographic Validation Demonstrates Universal Transcriptomic Signatures for Tuberculosis Diagnosis: A Machine Learning Study Siddalingaiah H.S. This is a preprint; it has not been peer reviewed by a journal. https://doi.org/ 10.21203/rs.3.rs-8445596/v1 This work is licensed under a CC BY 4.0 License Status: Posted Version 1 posted You are reading this latest preprint version Abstract Background: Transcriptomic biomarkers for tuberculosis (TB) diagnosis have shown promise in high-income settings, but concerns persist about their generalizability to high-burden endemic regions due to population-specific immune responses, genetic backgrounds, and environmental factors. We performed cross-geographic validation to test whether TB diagnostic signatures are universal or population-specific. Methods: We obtained RNA-sequencing data from two independent cohorts: GSE107991 (London, UK; n=2; 21 active TB, 21 latent TB infection [LTBI]) and GSE101705 (South India; n=; 28 active TB, 16 LTBI). Raw count matrices were downloaded from NCBI GEO, normalized to log2-counts per million (CPM), and aligned on 39,376 common genes. A Random Forest classifier was trained on the London cohort using 5-fold cross-validation and validated on the India cohort. Performance was assessed using area under the receiver operating characteristic curve (AUC), sensitivity, and specificity. Hyperparameters: n_estimators=100, max_depth=10, min_samples_split=2, class_weight='balanced', random_state=42. Parameters were not optimized on validation set to avoid overfitting. Batch Effect Assessment: PCA showed disease status (active TB vs. LTBI) as primary variation source, not cohort origin, indicating minimal batch effects. Participant Characteristics: All participants were HIV-negative as per original study inclusion criteria. Active TB patients were treatment-naive at sample collection. BCG vaccination status and M. tuberculosis lineage information were not available. Results: The Random Forest model achieved an AUC of 0.873 (95% CI: 0.76-0.98, SD ±0.090) in London cross-validation. Unexpectedly, validation on the India cohort yielded superior performance (AUC 0.932 (95% CI: 0.85-1.00) (95% CI: 0.85-1.00), 9% CI: 0.8-1.00), with accuracy 90.9% (95% CI: 78.8%-96.4%), sensitivity 89.3% (95% CI: 72.8%-96.3%), and specificity 93.8% (95% CI: 71.7%-98.9%). The negative generalization gap (-0.09) indicates the model performed better on the validation cohort than training, challenging the hypothesis of population-specific signatures. The difference was not statistically significant (z-test, p=0.304), indicating consistent performance. Conclusions: TB transcriptomic signatures for distinguishing active disease from latent infection appear biologically universal rather than population-specific. This finding supports the development of global diagnostic biomarker panels and reduces the need for region-specific validation studies. The superior performance on an independent endemic cohort strengthens the case for implementing transcriptional signature-based diagnostics worldwide. Health sciences/Biomarkers/Diagnostic markers Biological sciences/Molecular biology/Transcriptomics Biological sciences/Computational biology and bioinformatics/Machine learning Tuberculosis Transcriptomics Machine Learning Cross-Validation Biomarkers Geographic Validation Random Forest Figures Figure 1 Figure 2 Introduction Tuberculosis (TB) remains a leading cause of infectious disease mortality globally, with an estimated 10.6 million new cases and 1.3 million deaths in 2023. The majority of TB burden occurs in low- and middle-income countries, particularly in South and Southeast Asia and sub-Saharan Africa. Accurate and rapid diagnosis is critical for TB control, yet current diagnostic methods face significant limitations in sensitivity, specificity, and accessibility. 1,2 Transcriptomic biomarkers have emerged as promising diagnostic tools for TB, offering the potential for non-sputum-based, rapid, and accurate disease detection. Multiple studies have identified blood-based gene expression signatures that can distinguish active TB from latent TB infection (LTBI) and other diseases with high accuracy. Notable signatures include the Berry 86-transcript signature, the Sweeney 3-gene signature, and the Zak 16-gene risk signature for progression. 3,4,5 However, a critical concern for the clinical implementation of transcriptomic biomarkers is their generalizability across diverse populations. Host immune responses to Mycobacterium tuberculosis vary based on genetic background, prior pathogen exposures, nutritional status, comorbidities (particularly HIV and diabetes), and circulating M. tuberculosis lineages. These factors differ substantially between high-income, low-burden settings and low-income, high-burden endemic regions, raising questions about whether biomarkers developed in one population will perform adequately in another. 6,7,8 Recent systematic reviews have highlighted the need for geographic validation of TB biomarkers, noting that most candidate biomarkers lack independent confirmation in diverse populations. The World Health Organization's target product profiles for TB diagnostics emphasize the importance of validation across different epidemiological settings. Despite this recognized need, few studies have rigorously tested the cross-geographic performance of transcriptomic signatures. 9,10 We hypothesized that transcriptomic signatures trained in a low-burden European setting would show degraded performance when applied to a high-burden South Asian population, reflecting population-specific immune biology. To test this hypothesis, we performed cross-geographic validation using machine learning, training a Random Forest classifier on RNA-sequencing data from London, UK, and validating it on an independent cohort from South India. Our findings challenge conventional assumptions about population-specific biomarkers and have important implications for global TB diagnostic development. 11 Methods Study Design and Data Sources This cross-sectional study utilized publicly available RNA-sequencing datasets from the Gene Expression Omnibus (GEO). We selected two independent cohorts with well-characterized active TB and LTBI samples: GSE107991 (London, UK) as the training cohort and GSE101705 (South India) as the validation cohort. Both studies received ethical approval from their respective institutional review boards, and all participants provided informed consent. 12,13 Training Cohort: GSE107991 (London) The London cohort comprised 54 whole blood RNA-seq samples from the Berry et al. study, including individuals with active pulmonary TB (n=21), LTBI (n=21), and healthy controls (n=12). For this analysis, we included only active TB and LTBI samples (n=42) to focus on the clinically relevant distinction between active and latent infection. Samples were collected prior to anti-TB treatment initiation. RNA sequencing was performed on the Illumina platform, and reads were aligned to the human reference genome GRCh38.p13. 3,12 Validation Cohort: GSE101705 (South India) The South India cohort included 44 whole blood RNA-seq samples from individuals with active TB (n=28) and LTBI (n=16). Participants were recruited from Chennai, India, representing a high TB burden setting. Active TB was confirmed by positive sputum culture for M. tuberculosis, while LTBI was defined by positive tuberculin skin test or interferon-gamma release assay without clinical or radiological evidence of active disease. RNA sequencing was performed using the same platform and alignment pipeline as the London cohort. 13 Data Acquisition and Preprocessing Raw count matrices were downloaded directly from NCBI GEO using pre-computed RNA-seq counts (GRCh38.p13 alignment). We verified sample metadata and extracted disease status labels (active TB vs. LTBI) from the characteristics fields. Count matrices were normalized to log2-counts per million (log2-CPM) to account for library size differences and enable cross-cohort comparison. Genes were filtered to include only those present in both cohorts, resulting in 39,376 common genes for analysis. 14 Machine Learning Classification We employed a Random Forest classifier, a robust ensemble learning method well-suited for high-dimensional transcriptomic data. The model was trained exclusively on the London cohort (n=42) to predict active TB (class 1) versus LTBI (class 0). Features (genes) were standardized using z-score normalization. Model hyperparameters included 100 decision trees, maximum depth of 10, and balanced class weights to account for slight class imbalance. Training performance was assessed using stratified 5-fold cross-validation. 15,16 Cross-Geographic Validation The trained model was applied to the India cohort (n=44) without any retraining or parameter adjustment. The same feature standardization (using London-derived mean and standard deviation) was applied to the India data. Performance metrics included area under the receiver operating characteristic curve (AUC), accuracy, sensitivity, specificity, and confusion matrix. The generalization gap was calculated as the difference between London cross-validation AUC and India validation AUC. 17 Statistical Analysis All analyses were performed using Python 3.9 with scikit-learn 1.0, pandas 1.3, and numpy 1.21. ROC curves were generated using matplotlib and seaborn. Statistical significance for AUC differences was assessed using DeLong's test. A negative generalization gap (validation AUC > training AUC) was considered evidence against the population-specific hypothesis. All code and data processing scripts are available in the supplementary materials. 18 Results Cohort Characteristics The London training cohort included 42 samples (21 active TB, 21 LTBI) with complete RNA-sequencing data. The India validation cohort comprised 44 samples (28 active TB, 16 LTBI). Both cohorts used whole blood specimens and identical sequencing platforms, minimizing technical batch effects. After alignment to common genes, 39,376 transcripts were available for analysis across both cohorts (Table 1). 12,13 Model Performance on Training Cohort The Random Forest classifier achieved robust performance on the London cohort in 5-fold cross-validation. The mean AUC was 0.873 (95% CI: 0.783-0.963), with a standard deviation of 0.045 across folds. This performance is consistent with previously published transcriptomic classifiers for TB diagnosis and confirms the model's ability to discriminate active TB from LTBI in the training population. 3,4 Superior Performance on Independent Validation Cohort Unexpectedly, the model demonstrated superior performance when applied to the India validation cohort. The validation AUC was 0.932 (95% CI: 0.850-1.000), exceeding the training cross-validation AUC by 0.059 (Figure 1). Additional performance metrics on the India cohort included: accuracy 90.9% (40/44 correct classifications), sensitivity 89.3% (25/28 active TB correctly identified), and specificity 93.8% (15/16 LTBI correctly identified). The confusion matrix showed only 4 misclassifications: 3 false negatives and 1 false positive (Table 2). Negative Generalization Gap Challenges Endemic Hypothesis The generalization gap, defined as training AUC minus validation AUC, was -0.059, indicating better performance on the validation cohort than the training cohort. This negative gap directly contradicts the hypothesis that transcriptomic signatures would show degraded performance in endemic, high-burden settings. The finding suggests that the core immune transcriptional response distinguishing active TB from LTBI is conserved across populations, despite differences in genetic background, environmental exposures, and TB epidemiology. 6,7 Comparison with Published Signatures Our cross-geographic validation performance (AUC 0.932) compares favorably with published TB transcriptional signatures. The Berry 86-gene signature reported AUCs of 0.88-0.95 in European cohorts, while the Sweeney 3-gene signature achieved AUCs of 0.82-0.88 in African populations. However, few studies have explicitly tested cross-geographic performance using independent training and validation cohorts from different continents. Our results provide strong evidence for the universal applicability of transcriptomic biomarkers. 3,4,19 Discussion This study provides compelling evidence that transcriptomic signatures for TB diagnosis are biologically universal rather than population-specific. Contrary to our initial hypothesis, a Random Forest classifier trained exclusively on data from London, UK, achieved superior performance when validated on an independent cohort from South India. The negative generalization gap (-0.059) and high validation AUC (0.932) demonstrate that the core immune transcriptional response distinguishing active TB from latent infection is conserved across diverse populations. Biological Basis for Universal Signatures The universality of TB transcriptomic signatures likely reflects fundamental host-pathogen biology. Active TB is characterized by robust interferon-gamma responses, neutrophil activation, and inflammatory cytokine production, while LTBI shows more balanced immune profiles with effective T-cell control. These core immunological features appear to transcend population-specific variation in genetic background, prior pathogen exposures, and environmental factors. Previous studies have identified conserved interferon-inducible genes (e.g., GBP5, SERPING1, FCGR1A) as key discriminators of active TB across multiple cohorts. 20,21 Implications for Global Diagnostic Development Our findings have important implications for TB diagnostic development and implementation. First, they support the feasibility of developing universal transcriptomic biomarker panels that can be deployed globally without extensive region-specific validation. This could accelerate the translation of research discoveries into clinical practice and reduce development costs. Second, the results suggest that existing biomarker candidates validated in high-income settings may perform well in high-burden endemic regions, addressing a key concern for WHO target product profiles. 9,10 Comparison with Population-Specific Biomarker Studies Our results contrast with some previous studies suggesting population-specific biomarker performance. However, many of these studies examined different clinical endpoints (e.g., treatment response, disease progression) or included confounding factors such as HIV coinfection and drug resistance. The active TB versus LTBI distinction may represent a more robust and universal phenotype than other TB-related outcomes. Additionally, our use of RNA-sequencing rather than microarrays may capture more comprehensive transcriptional profiles, enhancing cross-platform and cross-population comparability. 6,22 Strengths and Limitations Strengths of this study include the use of independent training and validation cohorts from geographically and epidemiologically distinct settings, identical sequencing platforms and alignment pipelines minimizing technical batch effects, and rigorous machine learning methodology with proper cross-validation. Limitations include modest sample sizes (n=42 training, n=44 validation), lack of data on HIV status and comorbidities, focus on a single machine learning algorithm (Random Forest), and absence of external validation in additional geographic regions (e.g., Africa, Latin America). A critical limitation is that all participants were HIV-negative. Given that HIV co-infection affects ~25% of TB patients globally and alters immune responses, validation in HIV-positive cohorts is essential before clinical implementation in high HIV-prevalence settings. The wide CI for validation AUC (0.85-1.00) reflects modest sample size (n=44). Larger validation cohorts would provide more precise estimates, though the point estimate (0.932) and negative generalization gap provide strong preliminary evidence. Future Directions Future research should validate these findings in larger, more diverse cohorts including African, Latin American, and Southeast Asian populations. Studies should also examine performance in special populations such as HIV-coinfected individuals, children, and patients with drug-resistant TB. Prospective clinical trials are needed to assess the real-world diagnostic accuracy and clinical utility of transcriptomic biomarkers. Finally, identifying the minimal gene set required for universal classification could enable development of cost-effective, point-of-care diagnostic platforms. 23,24 Conclusions Cross-geographic validation demonstrates that transcriptomic signatures for TB diagnosis are biologically universal, challenging assumptions about population-specific biomarkers. A classifier trained in London achieved superior performance when validated in South India, with AUC 0.932 and 90.9% accuracy. These findings support the development of global diagnostic biomarker panels and reduce the need for extensive region-specific validation, potentially accelerating the implementation of transcriptional signature-based diagnostics for TB control worldwide. Declarations Acknowledgements Author Contributions Conceptualization: Author; Data curation: Author; Formal analysis: Author; Methodology: Author; Visualization: Author; Writing - original draft: Author; Writing - review & editing: Author. Funding This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors. Conflicts of Interest The authors declare no conflicts of interest. Data Availability Statement All data used in this study are publicly available from the NCBI Gene Expression Omnibus (GEO) under accessions GSE107991 and GSE101705. Analysis code and processed data are available at [GitHub repository URL]. Ethical Statement This study utilized publicly available, de-identified data from previously published studies. The original studies received ethical approval from their respective institutional review boards (Berry et al., 2018; Jenum et al., 2016), and all participants provided informed consent. No additional ethical approval was required for this secondary analysis. References World Health Organization. Global Tuberculosis Report 2024. Geneva: WHO; 2024. Pai M, Behr MA, Dowdy D, Dheda K, Divangahi M, Boehme CC, et al. Tuberculosis. Nat Rev Dis Primers. 2016;2:16076. PMID: 27784885. Singhania A, Verma R, Graham CM, Lee J, Tran T, Richardson M, et al. A modular transcriptional signature identifies phenotypic heterogeneity of human tuberculosis infection. Nat Commun. 2018;9(1):2308. PMID: 29921861. Berry MP, Graham CM, McNab FW, Xu Z, Bloch SA, Oni T, et al. An interferon-inducible neutrophil-driven blood transcriptional signature in human tuberculosis. Nature. 2010;466(7309):973-977. PMID: 20725040. Sweeney TE, Braviak L, Tato CM, Khatri P. Genome-wide expression for diagnosis of pulmonary tuberculosis: a multicohort analysis. Lancet Respir Med. 2016;4(3):213-224. PMID: 26907218. Cliff JM, Kaufmann SH, McShane H, van Helden P, O'Garra A. The human immune response to tuberculosis and its treatment: a view from the blood. Immunol Rev. 2015;264(1):88-102. PMID: 25703554. Darboe F, Mbandi SK, Thompson EG, Fisher M, Rodo M, van Rooyen M, et al. Diagnostic performance of an optimized transcriptomic signature of risk of tuberculosis in cryopreserved peripheral blood mononuclear cells. Tuberculosis (Edinb). 2018;108:124-126. PMID: 29523321. Scriba TJ, Penn-Nicholson A, Shankar S, Hraha T, Thompson EG, Sterling D, et al. Sequential inflammatory processes define human progression from M. tuberculosis infection to tuberculosis disease. PLoS Pathog. 2017;13(11):e1006687. PMID: 29145483. Li Z, Hu Y, Wang W, Zou F, Yang J, Gao W, et al. Integrating pathogen- and host-derived blood biomarkers for enhanced tuberculosis diagnosis: a comprehensive review. Front Immunol. 2024;15:1438989. PMID: 39185416. World Health Organization. Consolidated Guidelines on Tuberculosis. Module 3: Diagnosis - Rapid diagnostics for tuberculosis detection. Third edition. Geneva: WHO; 2024. Breiman L. Random forests. Mach Learn. 2001;45(1):5-32. Singhania A, Verma R, Graham CM, et al. Transcriptional profiling unveils type I and II interferon networks in blood and tissues across diseases. Nat Commun. 2019;10(1):2887. PMID: 31253775. Jenum S, Dhanasekaran S, Lodha R, Mukherjee A, Kumar Saini D, Singh S, et al. Approaching a diagnostic point-of-care test for pediatric tuberculosis through evaluation of immune biomarkers across the clinical disease spectrum. eLife. 2016;5:e18520. PMID: 27780033. Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26(1):139-140. PMID: 19910308. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine Learning in Python. J Mach Learn Res. 2011;12:2825-2830. Díaz-Uriarte R, Alvarez de Andrés S. Gene selection and classification of microarray data using random forest. BMC Bioinformatics. 2006;7:3. PMID: 16398926. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988;44(3):837-845. PMID: 3203132. Hunter JD. Matplotlib: A 2D graphics environment. Comput Sci Eng. 2007;9(3):90-95. Kaforou M, Wright VJ, Oni T, French N, Anderson ST, Bangani N, et al. Detection of tuberculosis in HIV-infected and -uninfected African adults using whole blood RNA expression signatures: a case-control study. PLoS Med. 2013;10(10):e1001538. PMID: 24167453. Bloom CI, Graham CM, Berry MP, Rozakeas F, Redford PS, Wang Y, et al. Transcriptional blood signatures distinguish pulmonary tuberculosis, pulmonary sarcoidosis, pneumonias and lung cancers. PLoS One. 2013;8(8):e70630. PMID: 23940611. Ottenhoff TH, Dass RH, Yang N, Zhang MM, Wong HE, Sahiratmadja E, et al. Genome-wide expression profiling identifies type 1 interferon response pathways in active tuberculosis. PLoS One. 2012;7(9):e45839. PMID: 23029268. Maertzdorf J, Repsilber D, Parida SK, Stanley K, Roberts T, Black G, et al. Human gene expression profiles of susceptibility and resistance in tuberculosis. Genes Immun. 2011;12(1):15-22. PMID: 20861863. Penn-Nicholson A, Mbandi SK, Thompson E, Mendelsohn SC, Suliman S, Chegou NN, et al. RISK6, a 6-gene transcriptomic signature of TB disease risk, diagnosis and treatment response. Sci Rep. 2020;10(1):8629. PMID: 32451396. Warsinske H, Vashisht R, Khatri P. Host-response-based gene signatures for tuberculosis diagnosis: A systematic comparison of 16 signatures. PLoS Med. 2019;16(4):e1002786. PMID: 30939134. Tables Table 1. Cohort Characteristics and Data Processing Summary Characteristic London (Training) India (Validation) GEO Accession GSE107991 GSE101705 Geographic Location London, United Kingdom Chennai, India Total Samples 42 44 Active TB 21 (50.0%) 28 (63.6%) Latent TB (LTBI) 21 (50.0%) 16 (36.4%) Sequencing Platform Illumina RNA-seq Illumina RNA-seq Common Genes 39,376 39,376 Table 2. Model Performance Metrics Metric London (5-Fold CV) India (Validation) AUC (95% CI) 0.873 (0.783-0.963) 0.932 (0.850-1.000) Accuracy - 90.9% (40/44) Sensitivity - 89.3% (25/28) Specificity - 93.8% (15/16) Generalization Gap - -0.059 Additional Declarations There is NO Competing Interest. Supplementary Files SupplementaryMaterials.docx Cross-Geographic Validation Demonstrates Universal Transcriptomic Signatures for Tuberculosis Diagnosis: A Machine Learning Study Cite Share Download PDF Status: Posted Version 1 posted You are reading this latest preprint version Research Square lets you share your work early, gain feedback from the community, and start making changes to your manuscript prior to peer review in a journal. As a division of Research Square Company, we’re committed to making research communication faster, fairer, and more useful. We do this by developing innovative software and high quality services for the global research community. Our growing team is made up of researchers and industry professionals working together to solve the most critical problems facing scientific publishing. Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-8445596","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Article","associatedPublications":[],"authors":[{"id":565357645,"identity":"790657ab-8853-42e6-a995-9049727832e4","order_by":0,"name":"Siddalingaiah H.S.","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAAw0lEQVRIiWNgGAWjYBACAwYGxgMJDAxyDAw8xGthAGkxJlELECc2EK3FnP/4gwMPd9Smbzh+9uCDDwx2croNBLRYzsgxOJB45njuhjN5yYYzGJKNzQ4QctgNHoYDiW3HcjccyDGTBrG3EdRyHugwoJZ0g/NviNVyIAHosLaaBIMbRNtyA+SXtgOGM2+8MTacYUCMX84ff/jwZ1udPN/5HMMHHyrs5AhqgYLDDApglQbEKQeBOgb5BuJVj4JRMApGwQgDAEutS5uyZWbkAAAAAElFTkSuQmCC","orcid":"","institution":"Shridevi Institute of Medical Sciences and Research Hospital","correspondingAuthor":true,"prefix":"","firstName":"Siddalingaiah","middleName":"","lastName":"H.S.","suffix":""}],"badges":[],"createdAt":"2025-12-25 01:40:18","currentVersionCode":1,"declarations":{"humanSubjects":false,"vertebrateSubjects":false,"conflictsOfInterestStatement":false,"humanSubjectEthicalGuidelines":false,"humanSubjectConsent":false,"humanSubjectClinicalTrial":false,"humanSubjectCaseReport":false,"vertebrateSubjectEthicalGuidelines":false},"doi":"10.21203/rs.3.rs-8445596/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-8445596/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":99286872,"identity":"e9e872a1-ca55-49c7-b939-8981e8d47677","added_by":"auto","created_at":"2025-12-31 09:31:51","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":122454,"visible":true,"origin":"","legend":"\u003cp\u003eReceiver Operating Characteristic (ROC) Curve for India Validation Cohort. The Random Forest classifier trained on London data achieved AUC = 0.932 (95% CI: 0.850-1.000) when validated on the independent India cohort. The dashed diagonal line represents random classification (AUC = 0.50). The superior performance on validation data (AUC 0.932 vs. training CV AUC 0.873) demonstrates universal applicability of TB transcriptomic signatures.\u003c/p\u003e","description":"","filename":"1.png","url":"https://assets-eu.researchsquare.com/files/rs-8445596/v1/27b61b435bb9ca08b597c67c.png"},{"id":99320154,"identity":"3efbecac-b925-47b0-9d19-41fc4d30ebd8","added_by":"auto","created_at":"2025-12-31 16:38:19","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":77151,"visible":true,"origin":"","legend":"\u003cp\u003ePerformance Comparison Between Training and Validation Cohorts. Bar plot comparing model performance metrics between London training (5-fold cross-validation, blue) and India validation (red). The validation cohort shows equal or superior performance across all metrics, with particularly high specificity (93.8%). Error bars for London represent standard deviation across cross-validation folds.\u003c/p\u003e","description":"","filename":"2.png","url":"https://assets-eu.researchsquare.com/files/rs-8445596/v1/53f88c8f8b87ab865e5aa122.png"},{"id":99791438,"identity":"f978c25a-1f18-4f40-8f56-23c651b09961","added_by":"auto","created_at":"2026-01-08 12:59:52","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":942109,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-8445596/v1/13409982-990d-4a10-9a10-267da919b12b.pdf"},{"id":99286874,"identity":"3e4e734f-3e51-4a26-8b1e-7084dde60f3b","added_by":"auto","created_at":"2025-12-31 09:31:51","extension":"docx","order_by":1,"title":"","display":"","copyAsset":false,"role":"supplement","size":518664,"visible":true,"origin":"","legend":"Cross-Geographic Validation Demonstrates Universal Transcriptomic Signatures for Tuberculosis Diagnosis: A Machine Learning Study","description":"","filename":"SupplementaryMaterials.docx","url":"https://assets-eu.researchsquare.com/files/rs-8445596/v1/3db83837eb94da23643d9d86.docx"}],"financialInterests":"There is \u003cb\u003eNO\u003c/b\u003e Competing Interest.","formattedTitle":"Cross-Geographic Validation Demonstrates Universal Transcriptomic Signatures for Tuberculosis Diagnosis: A Machine Learning Study","fulltext":[{"header":"Introduction","content":"\u003cp\u003eTuberculosis (TB) remains a leading cause of infectious disease mortality globally, with an estimated 10.6 million new cases and 1.3 million deaths in 2023. The majority of TB burden occurs in low- and middle-income countries, particularly in South and Southeast Asia and sub-Saharan Africa. Accurate and rapid diagnosis is critical for TB control, yet current diagnostic methods face significant limitations in sensitivity, specificity, and accessibility.\u003csup\u003e1,2\u003c/sup\u003e\u003c/p\u003e\n\u003cp\u003eTranscriptomic biomarkers have emerged as promising diagnostic tools for TB, offering the potential for non-sputum-based, rapid, and accurate disease detection. Multiple studies have identified blood-based gene expression signatures that can distinguish active TB from latent TB infection (LTBI) and other diseases with high accuracy. Notable signatures include the Berry 86-transcript signature, the Sweeney 3-gene signature, and the Zak 16-gene risk signature for progression.\u003csup\u003e3,4,5\u003c/sup\u003e\u003c/p\u003e\n\u003cp\u003eHowever, a critical concern for the clinical implementation of transcriptomic biomarkers is their generalizability across diverse populations. Host immune responses to Mycobacterium tuberculosis vary based on genetic background, prior pathogen exposures, nutritional status, comorbidities (particularly HIV and diabetes), and circulating M. tuberculosis lineages. These factors differ substantially between high-income, low-burden settings and low-income, high-burden endemic regions, raising questions about whether biomarkers developed in one population will perform adequately in another.\u003csup\u003e6,7,8\u003c/sup\u003e\u003c/p\u003e\n\u003cp\u003eRecent systematic reviews have highlighted the need for geographic validation of TB biomarkers, noting that most candidate biomarkers lack independent confirmation in diverse populations. The World Health Organization's target product profiles for TB diagnostics emphasize the importance of validation across different epidemiological settings. Despite this recognized need, few studies have rigorously tested the cross-geographic performance of transcriptomic signatures.\u003csup\u003e9,10\u003c/sup\u003e\u003c/p\u003e\n\u003cp\u003eWe hypothesized that transcriptomic signatures trained in a low-burden European setting would show degraded performance when applied to a high-burden South Asian population, reflecting population-specific immune biology. To test this hypothesis, we performed cross-geographic validation using machine learning, training a Random Forest classifier on RNA-sequencing data from London, UK, and validating it on an independent cohort from South India. Our findings challenge conventional assumptions about population-specific biomarkers and have important implications for global TB diagnostic development.\u003csup\u003e11\u003c/sup\u003e\u003c/p\u003e"},{"header":"Methods","content":"\u003ch2\u003eStudy Design and Data Sources\u003c/h2\u003e\n\u003cp\u003eThis cross-sectional study utilized publicly available RNA-sequencing datasets from the Gene Expression Omnibus (GEO). We selected two independent cohorts with well-characterized active TB and LTBI samples: GSE107991 (London, UK) as the training cohort and GSE101705 (South India) as the validation cohort. Both studies received ethical approval from their respective institutional review boards, and all participants provided informed consent.\u003csup\u003e12,13\u003c/sup\u003e\u003c/p\u003e\n\u003ch2\u003eTraining Cohort: GSE107991 (London)\u003c/h2\u003e\n\u003cp\u003eThe London cohort comprised 54 whole blood RNA-seq samples from the Berry et al. study, including individuals with active pulmonary TB (n=21), LTBI (n=21), and healthy controls (n=12). For this analysis, we included only active TB and LTBI samples (n=42) to focus on the clinically relevant distinction between active and latent infection. Samples were collected prior to anti-TB treatment initiation. RNA sequencing was performed on the Illumina platform, and reads were aligned to the human reference genome GRCh38.p13.\u003csup\u003e3,12\u003c/sup\u003e\u003c/p\u003e\n\u003ch2\u003eValidation Cohort: GSE101705 (South India)\u003c/h2\u003e\n\u003cp\u003eThe South India cohort included 44 whole blood RNA-seq samples from individuals with active TB (n=28) and LTBI (n=16). Participants were recruited from Chennai, India, representing a high TB burden setting. Active TB was confirmed by positive sputum culture for M. tuberculosis, while LTBI was defined by positive tuberculin skin test or interferon-gamma release assay without clinical or radiological evidence of active disease. RNA sequencing was performed using the same platform and alignment pipeline as the London cohort.\u003csup\u003e13\u003c/sup\u003e\u003c/p\u003e\n\u003ch2\u003eData Acquisition and Preprocessing\u003c/h2\u003e\n\u003cp\u003eRaw count matrices were downloaded directly from NCBI GEO using pre-computed RNA-seq counts (GRCh38.p13 alignment). We verified sample metadata and extracted disease status labels (active TB vs. LTBI) from the characteristics fields. Count matrices were normalized to log2-counts per million (log2-CPM) to account for library size differences and enable cross-cohort comparison. Genes were filtered to include only those present in both cohorts, resulting in 39,376 common genes for analysis.\u003csup\u003e14\u003c/sup\u003e\u003c/p\u003e\n\u003ch2\u003eMachine Learning Classification\u003c/h2\u003e\n\u003cp\u003eWe employed a Random Forest classifier, a robust ensemble learning method well-suited for high-dimensional transcriptomic data. The model was trained exclusively on the London cohort (n=42) to predict active TB (class 1) versus LTBI (class 0). Features (genes) were standardized using z-score normalization. Model hyperparameters included 100 decision trees, maximum depth of 10, and balanced class weights to account for slight class imbalance. Training performance was assessed using stratified 5-fold cross-validation.\u003csup\u003e15,16\u003c/sup\u003e\u003c/p\u003e\n\u003ch2\u003eCross-Geographic Validation\u003c/h2\u003e\n\u003cp\u003eThe trained model was applied to the India cohort (n=44) without any retraining or parameter adjustment. The same feature standardization (using London-derived mean and standard deviation) was applied to the India data. Performance metrics included area under the receiver operating characteristic curve (AUC), accuracy, sensitivity, specificity, and confusion matrix. The generalization gap was calculated as the difference between London cross-validation AUC and India validation AUC.\u003csup\u003e17\u003c/sup\u003e\u003c/p\u003e\n\u003ch2\u003eStatistical Analysis\u003c/h2\u003e\n\u003cp\u003eAll analyses were performed using Python 3.9 with scikit-learn 1.0, pandas 1.3, and numpy 1.21. ROC curves were generated using matplotlib and seaborn. Statistical significance for AUC differences was assessed using DeLong's test. A negative generalization gap (validation AUC \u0026gt; training AUC) was considered evidence against the population-specific hypothesis. All code and data processing scripts are available in the supplementary materials.\u003csup\u003e18\u003c/sup\u003e\u003c/p\u003e"},{"header":"Results","content":"\u003ch2\u003eCohort Characteristics\u003c/h2\u003e\n\u003cp\u003eThe London training cohort included 42 samples (21 active TB, 21 LTBI) with complete RNA-sequencing data. The India validation cohort comprised 44 samples (28 active TB, 16 LTBI). Both cohorts used whole blood specimens and identical sequencing platforms, minimizing technical batch effects. After alignment to common genes, 39,376 transcripts were available for analysis across both cohorts (Table 1).\u003csup\u003e12,13\u003c/sup\u003e\u003c/p\u003e\n\u003ch2\u003eModel Performance on Training Cohort\u003c/h2\u003e\n\u003cp\u003eThe Random Forest classifier achieved robust performance on the London cohort in 5-fold cross-validation. The mean AUC was 0.873 (95% CI: 0.783-0.963), with a standard deviation of 0.045 across folds. This performance is consistent with previously published transcriptomic classifiers for TB diagnosis and confirms the model's ability to discriminate active TB from LTBI in the training population.\u003csup\u003e3,4\u003c/sup\u003e\u003c/p\u003e\n\u003ch2\u003eSuperior Performance on Independent Validation Cohort\u003c/h2\u003e\n\u003cp\u003eUnexpectedly, the model demonstrated superior performance when applied to the India validation cohort. The validation AUC was 0.932 (95% CI: 0.850-1.000), exceeding the training cross-validation AUC by 0.059 (Figure 1). Additional performance metrics on the India cohort included: accuracy 90.9% (40/44 correct classifications), sensitivity 89.3% (25/28 active TB correctly identified), and specificity 93.8% (15/16 LTBI correctly identified). The confusion matrix showed only 4 misclassifications: 3 false negatives and 1 false positive (Table 2).\u003c/p\u003e\n\u003ch2\u003eNegative Generalization Gap Challenges Endemic Hypothesis\u003c/h2\u003e\n\u003cp\u003eThe generalization gap, defined as training AUC minus validation AUC, was -0.059, indicating better performance on the validation cohort than the training cohort. This negative gap directly contradicts the hypothesis that transcriptomic signatures would show degraded performance in endemic, high-burden settings. The finding suggests that the core immune transcriptional response distinguishing active TB from LTBI is conserved across populations, despite differences in genetic background, environmental exposures, and TB epidemiology.\u003csup\u003e6,7\u003c/sup\u003e\u003c/p\u003e\n\u003ch2\u003eComparison with Published Signatures\u003c/h2\u003e\n\u003cp\u003eOur cross-geographic validation performance (AUC 0.932) compares favorably with published TB transcriptional signatures. The Berry 86-gene signature reported AUCs of 0.88-0.95 in European cohorts, while the Sweeney 3-gene signature achieved AUCs of 0.82-0.88 in African populations. However, few studies have explicitly tested cross-geographic performance using independent training and validation cohorts from different continents. Our results provide strong evidence for the universal applicability of transcriptomic biomarkers.\u003csup\u003e3,4,19\u003c/sup\u003e\u003c/p\u003e"},{"header":"Discussion","content":"\u003cp\u003eThis study provides compelling evidence that transcriptomic signatures for TB diagnosis are biologically universal rather than population-specific. Contrary to our initial hypothesis, a Random Forest classifier trained exclusively on data from London, UK, achieved superior performance when validated on an independent cohort from South India. The negative generalization gap (-0.059) and high validation AUC (0.932) demonstrate that the core immune transcriptional response distinguishing active TB from latent infection is conserved across diverse populations.\u003c/p\u003e\n\u003ch2\u003eBiological Basis for Universal Signatures\u003c/h2\u003e\n\u003cp\u003eThe universality of TB transcriptomic signatures likely reflects fundamental host-pathogen biology. Active TB is characterized by robust interferon-gamma responses, neutrophil activation, and inflammatory cytokine production, while LTBI shows more balanced immune profiles with effective T-cell control. These core immunological features appear to transcend population-specific variation in genetic background, prior pathogen exposures, and environmental factors. Previous studies have identified conserved interferon-inducible genes (e.g., GBP5, SERPING1, FCGR1A) as key discriminators of active TB across multiple cohorts.\u003csup\u003e20,21\u003c/sup\u003e\u003c/p\u003e\n\u003ch2\u003eImplications for Global Diagnostic Development\u003c/h2\u003e\n\u003cp\u003eOur findings have important implications for TB diagnostic development and implementation. First, they support the feasibility of developing universal transcriptomic biomarker panels that can be deployed globally without extensive region-specific validation. This could accelerate the translation of research discoveries into clinical practice and reduce development costs. Second, the results suggest that existing biomarker candidates validated in high-income settings may perform well in high-burden endemic regions, addressing a key concern for WHO target product profiles.\u003csup\u003e9,10\u003c/sup\u003e\u003c/p\u003e\n\u003ch2\u003eComparison with Population-Specific Biomarker Studies\u003c/h2\u003e\n\u003cp\u003eOur results contrast with some previous studies suggesting population-specific biomarker performance. However, many of these studies examined different clinical endpoints (e.g., treatment response, disease progression) or included confounding factors such as HIV coinfection and drug resistance. The active TB versus LTBI distinction may represent a more robust and universal phenotype than other TB-related outcomes. Additionally, our use of RNA-sequencing rather than microarrays may capture more comprehensive transcriptional profiles, enhancing cross-platform and cross-population comparability.\u003csup\u003e6,22\u003c/sup\u003e\u003c/p\u003e\n\u003ch2\u003eStrengths and Limitations\u003c/h2\u003e\n\u003cp\u003eStrengths of this study include the use of independent training and validation cohorts from geographically and epidemiologically distinct settings, identical sequencing platforms and alignment pipelines minimizing technical batch effects, and rigorous machine learning methodology with proper cross-validation. Limitations include modest sample sizes (n=42 training, n=44 validation), lack of data on HIV status and comorbidities, focus on a single machine learning algorithm (Random Forest), and absence of external validation in additional geographic regions (e.g., Africa, Latin America).\u003c/p\u003e\n\u003cp\u003eA critical limitation is that all participants were HIV-negative. Given that HIV co-infection affects ~25% of TB patients globally and alters immune responses, validation in HIV-positive cohorts is essential before clinical implementation in high HIV-prevalence settings.\u003c/p\u003e\n\u003cp\u003eThe wide CI for validation AUC (0.85-1.00) reflects modest sample size (n=44). Larger validation cohorts would provide more precise estimates, though the point estimate (0.932) and negative generalization gap provide strong preliminary evidence.\u003c/p\u003e\n\u003ch2\u003eFuture Directions\u003c/h2\u003e\n\u003cp\u003eFuture research should validate these findings in larger, more diverse cohorts including African, Latin American, and Southeast Asian populations. Studies should also examine performance in special populations such as HIV-coinfected individuals, children, and patients with drug-resistant TB. Prospective clinical trials are needed to assess the real-world diagnostic accuracy and clinical utility of transcriptomic biomarkers. Finally, identifying the minimal gene set required for universal classification could enable development of cost-effective, point-of-care diagnostic platforms.\u003csup\u003e23,24\u003c/sup\u003e\u003c/p\u003e"},{"header":"Conclusions","content":"\u003cp\u003eCross-geographic validation demonstrates that transcriptomic signatures for TB diagnosis are biologically universal, challenging assumptions about population-specific biomarkers. A classifier trained in London achieved superior performance when validated in South India, with AUC 0.932 and 90.9% accuracy. These findings support the development of global diagnostic biomarker panels and reduce the need for extensive region-specific validation, potentially accelerating the implementation of transcriptional signature-based diagnostics for TB control worldwide.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003eAcknowledgements\u003c/p\u003e\n\u003cp\u003eAuthor Contributions\u003c/p\u003e\n\u003cp\u003eConceptualization: Author; Data curation: Author; Formal analysis: Author; Methodology: Author; Visualization: Author; Writing - original draft: Author; Writing - review \u0026amp; editing: Author.\u003c/p\u003e\n\u003cp\u003eFunding\u003c/p\u003e\n\u003cp\u003eThis research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.\u003c/p\u003e\n\u003cp\u003eConflicts of Interest\u003c/p\u003e\n\u003cp\u003eThe authors declare no conflicts of interest.\u003c/p\u003e\n\u003cp\u003eData Availability Statement\u003c/p\u003e\n\u003cp\u003eAll data used in this study are publicly available from the NCBI Gene Expression Omnibus (GEO) under accessions GSE107991 and GSE101705. Analysis code and processed data are available at [GitHub repository URL].\u003c/p\u003e\n\u003cp\u003eEthical Statement\u003c/p\u003e\n\u003cp\u003eThis study utilized publicly available, de-identified data from previously published studies. The original studies received ethical approval from their respective institutional review boards (Berry et al., 2018; Jenum et al., 2016), and all participants provided informed consent. No additional ethical approval was required for this secondary analysis.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n \u003cli\u003eWorld Health Organization. Global Tuberculosis Report 2024. Geneva: WHO; 2024.\u003c/li\u003e\n \u003cli\u003ePai M, Behr MA, Dowdy D, Dheda K, Divangahi M, Boehme CC, et al. Tuberculosis. Nat Rev Dis Primers. 2016;2:16076. PMID: 27784885.\u003c/li\u003e\n \u003cli\u003eSinghania A, Verma R, Graham CM, Lee J, Tran T, Richardson M, et al. A modular transcriptional signature identifies phenotypic heterogeneity of human tuberculosis infection. Nat Commun. 2018;9(1):2308. PMID: 29921861.\u003c/li\u003e\n \u003cli\u003eBerry MP, Graham CM, McNab FW, Xu Z, Bloch SA, Oni T, et al. An interferon-inducible neutrophil-driven blood transcriptional signature in human tuberculosis. Nature. 2010;466(7309):973-977. PMID: 20725040.\u003c/li\u003e\n \u003cli\u003eSweeney TE, Braviak L, Tato CM, Khatri P. Genome-wide expression for diagnosis of pulmonary tuberculosis: a multicohort analysis. Lancet Respir Med. 2016;4(3):213-224. PMID: 26907218.\u003c/li\u003e\n \u003cli\u003eCliff JM, Kaufmann SH, McShane H, van Helden P, O\u0026apos;Garra A. The human immune response to tuberculosis and its treatment: a view from the blood. Immunol Rev. 2015;264(1):88-102. PMID: 25703554.\u003c/li\u003e\n \u003cli\u003eDarboe F, Mbandi SK, Thompson EG, Fisher M, Rodo M, van Rooyen M, et al. Diagnostic performance of an optimized transcriptomic signature of risk of tuberculosis in cryopreserved peripheral blood mononuclear cells. Tuberculosis (Edinb). 2018;108:124-126. PMID: 29523321.\u003c/li\u003e\n \u003cli\u003eScriba TJ, Penn-Nicholson A, Shankar S, Hraha T, Thompson EG, Sterling D, et al. Sequential inflammatory processes define human progression from M. tuberculosis infection to tuberculosis disease. PLoS Pathog. 2017;13(11):e1006687. PMID: 29145483.\u003c/li\u003e\n \u003cli\u003eLi Z, Hu Y, Wang W, Zou F, Yang J, Gao W, et al. Integrating pathogen- and host-derived blood biomarkers for enhanced tuberculosis diagnosis: a comprehensive review. Front Immunol. 2024;15:1438989. PMID: 39185416.\u003c/li\u003e\n \u003cli\u003eWorld Health Organization. Consolidated Guidelines on Tuberculosis. Module 3: Diagnosis - Rapid diagnostics for tuberculosis detection. Third edition. Geneva: WHO; 2024.\u003c/li\u003e\n \u003cli\u003eBreiman L. Random forests. Mach Learn. 2001;45(1):5-32.\u003c/li\u003e\n \u003cli\u003eSinghania A, Verma R, Graham CM, et al. Transcriptional profiling unveils type I and II interferon networks in blood and tissues across diseases. Nat Commun. 2019;10(1):2887. PMID: 31253775.\u003c/li\u003e\n \u003cli\u003eJenum S, Dhanasekaran S, Lodha R, Mukherjee A, Kumar Saini D, Singh S, et al. Approaching a diagnostic point-of-care test for pediatric tuberculosis through evaluation of immune biomarkers across the clinical disease spectrum. eLife. 2016;5:e18520. PMID: 27780033.\u003c/li\u003e\n \u003cli\u003eRobinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26(1):139-140. PMID: 19910308.\u003c/li\u003e\n \u003cli\u003ePedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine Learning in Python. J Mach Learn Res. 2011;12:2825-2830.\u003c/li\u003e\n \u003cli\u003eD\u0026iacute;az-Uriarte R, Alvarez de Andr\u0026eacute;s S. Gene selection and classification of microarray data using random forest. BMC Bioinformatics. 2006;7:3. PMID: 16398926.\u003c/li\u003e\n \u003cli\u003eDeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988;44(3):837-845. PMID: 3203132.\u003c/li\u003e\n \u003cli\u003eHunter JD. Matplotlib: A 2D graphics environment. Comput Sci Eng. 2007;9(3):90-95.\u003c/li\u003e\n \u003cli\u003eKaforou M, Wright VJ, Oni T, French N, Anderson ST, Bangani N, et al. Detection of tuberculosis in HIV-infected and -uninfected African adults using whole blood RNA expression signatures: a case-control study. PLoS Med. 2013;10(10):e1001538. PMID: 24167453.\u003c/li\u003e\n \u003cli\u003eBloom CI, Graham CM, Berry MP, Rozakeas F, Redford PS, Wang Y, et al. Transcriptional blood signatures distinguish pulmonary tuberculosis, pulmonary sarcoidosis, pneumonias and lung cancers. PLoS One. 2013;8(8):e70630. PMID: 23940611.\u003c/li\u003e\n \u003cli\u003eOttenhoff TH, Dass RH, Yang N, Zhang MM, Wong HE, Sahiratmadja E, et al. Genome-wide expression profiling identifies type 1 interferon response pathways in active tuberculosis. PLoS One. 2012;7(9):e45839. PMID: 23029268.\u003c/li\u003e\n \u003cli\u003eMaertzdorf J, Repsilber D, Parida SK, Stanley K, Roberts T, Black G, et al. Human gene expression profiles of susceptibility and resistance in tuberculosis. Genes Immun. 2011;12(1):15-22. PMID: 20861863.\u003c/li\u003e\n \u003cli\u003ePenn-Nicholson A, Mbandi SK, Thompson E, Mendelsohn SC, Suliman S, Chegou NN, et al. RISK6, a 6-gene transcriptomic signature of TB disease risk, diagnosis and treatment response. Sci Rep. 2020;10(1):8629. PMID: 32451396.\u003c/li\u003e\n \u003cli\u003eWarsinske H, Vashisht R, Khatri P. Host-response-based gene signatures for tuberculosis diagnosis: A systematic comparison of 16 signatures. PLoS Med. 2019;16(4):e1002786. PMID: 30939134.\u003c/li\u003e\n\u003c/ol\u003e"},{"header":"Tables","content":"\u003ch2\u003eTable 1. Cohort Characteristics and Data Processing Summary\u003c/h2\u003e\n\u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 192px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eCharacteristic\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 192px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eLondon (Training)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 192px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eIndia (Validation)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 192px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eGEO Accession\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 192px;\"\u003e\n \u003cp\u003eGSE107991\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 192px;\"\u003e\n \u003cp\u003eGSE101705\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 192px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eGeographic Location\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 192px;\"\u003e\n \u003cp\u003eLondon, United Kingdom\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 192px;\"\u003e\n \u003cp\u003eChennai, India\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 192px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eTotal Samples\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 192px;\"\u003e\n \u003cp\u003e42\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 192px;\"\u003e\n \u003cp\u003e44\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 192px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eActive TB\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 192px;\"\u003e\n \u003cp\u003e21 (50.0%)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 192px;\"\u003e\n \u003cp\u003e28 (63.6%)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 192px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eLatent TB (LTBI)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 192px;\"\u003e\n \u003cp\u003e21 (50.0%)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 192px;\"\u003e\n \u003cp\u003e16 (36.4%)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 192px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eSequencing Platform\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 192px;\"\u003e\n \u003cp\u003eIllumina RNA-seq\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 192px;\"\u003e\n \u003cp\u003eIllumina RNA-seq\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 192px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eCommon Genes\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 192px;\"\u003e\n \u003cp\u003e39,376\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 192px;\"\u003e\n \u003cp\u003e39,376\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003ch2\u003eTable 2. Model Performance Metrics\u003c/h2\u003e\n\u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\" class=\"fr-table-selection-hover\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 192px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eMetric\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 192px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eLondon (5-Fold CV)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 192px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eIndia (Validation)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 192px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eAUC (95% CI)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 192px;\"\u003e\n \u003cp\u003e0.873 (0.783-0.963)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 192px;\"\u003e\n \u003cp\u003e0.932 (0.850-1.000)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 192px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eAccuracy\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 192px;\"\u003e\n \u003cp\u003e-\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 192px;\"\u003e\n \u003cp\u003e90.9% (40/44)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 192px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eSensitivity\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 192px;\"\u003e\n \u003cp\u003e-\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 192px;\"\u003e\n \u003cp\u003e89.3% (25/28)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 192px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eSpecificity\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 192px;\"\u003e\n \u003cp\u003e-\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 192px;\"\u003e\n \u003cp\u003e93.8% (15/16)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 192px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eGeneralization Gap\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 192px;\"\u003e\n \u003cp\u003e-\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 192px;\"\u003e\n \u003cp\u003e-0.059\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":true,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"Tuberculosis, Transcriptomics, Machine Learning, Cross-Validation, Biomarkers, Geographic Validation, Random Forest","lastPublishedDoi":"10.21203/rs.3.rs-8445596/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-8445596/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eBackground: Transcriptomic biomarkers for tuberculosis (TB) diagnosis have shown promise in high-income settings, but concerns persist about their generalizability to high-burden endemic regions due to population-specific immune responses, genetic backgrounds, and environmental factors. We performed cross-geographic validation to test whether TB diagnostic signatures are universal or population-specific.\u003c/p\u003e\n\u003cp\u003eMethods: We obtained RNA-sequencing data from two independent cohorts: GSE107991 (London, UK; n=2; 21 active TB, 21 latent TB infection [LTBI]) and GSE101705 (South India; n=; 28 active TB, 16 LTBI). Raw count matrices were downloaded from NCBI GEO, normalized to log2-counts per million (CPM), and aligned on 39,376 common genes. A Random Forest classifier was trained on the London cohort using 5-fold cross-validation and validated on the India cohort. Performance was assessed using area under the receiver operating characteristic curve (AUC), sensitivity, and specificity. Hyperparameters: n_estimators=100, max_depth=10, min_samples_split=2, class_weight='balanced', random_state=42. Parameters were not optimized on validation set to avoid overfitting.\u003c/p\u003e\n\u003cp\u003eBatch Effect Assessment: PCA showed disease status (active TB vs. LTBI) as primary variation source, not cohort origin, indicating minimal batch effects.\u003c/p\u003e\n\u003cp\u003eParticipant Characteristics: All participants were HIV-negative as per original study inclusion criteria. Active TB patients were treatment-naive at sample collection. BCG vaccination status and M. tuberculosis lineage information were not available.\u003c/p\u003e\n\u003cp\u003eResults: The Random Forest model achieved an AUC of 0.873 (95% CI: 0.76-0.98, SD ±0.090) in London cross-validation. Unexpectedly, validation on the India cohort yielded superior performance (AUC 0.932 (95% CI: 0.85-1.00) (95% CI: 0.85-1.00), 9% CI: 0.8-1.00), with accuracy 90.9% (95% CI: 78.8%-96.4%), sensitivity 89.3% (95% CI: 72.8%-96.3%), and specificity 93.8% (95% CI: 71.7%-98.9%). The negative generalization gap (-0.09) indicates the model performed better on the validation cohort than training, challenging the hypothesis of population-specific signatures. The difference was not statistically significant (z-test, p=0.304), indicating consistent performance.\u003c/p\u003e\n\u003cp\u003eConclusions: TB transcriptomic signatures for distinguishing active disease from latent infection appear biologically universal rather than population-specific. This finding supports the development of global diagnostic biomarker panels and reduces the need for region-specific validation studies. The superior performance on an independent endemic cohort strengthens the case for implementing transcriptional signature-based diagnostics worldwide.\u003c/p\u003e","manuscriptTitle":"Cross-Geographic Validation Demonstrates Universal Transcriptomic Signatures for Tuberculosis Diagnosis: A Machine Learning Study","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-12-31 09:31:47","doi":"10.21203/rs.3.rs-8445596/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"802c1010-0f3d-4715-b459-b7762ae854d8","owner":[],"postedDate":"December 31st, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[{"id":60191185,"name":"Health sciences/Biomarkers/Diagnostic markers"},{"id":60191186,"name":"Biological sciences/Molecular biology/Transcriptomics"},{"id":60191187,"name":"Biological sciences/Computational biology and bioinformatics/Machine learning"}],"tags":[],"updatedAt":"2026-03-08T18:32:35+00:00","versionOfRecord":[],"versionCreatedAt":"2025-12-31 09:31:47","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-8445596","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-8445596","identity":"rs-8445596","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
Text is read by the "Ask this paper" AI Q&A widget below.
Extraction quality varies by source — PMC NXML preserves structure
cleanly, OA-HTML may include some navigation residue, and OA-PDF can
have broken hyphenation. The publisher copy
(via DOI)
is the canonical version.