Multimodal Metagenomic Profiling of Bronchoalveolar Lavage Fluid for Diagnostic Classification of Pulmonary Diseases | Research Square
Dongsheng Han, Fei Yu, Bin Lou, Bin Yang, Yifei Shen, Huifang Liu, and 4 more
This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-6108429/v1 This work is licensed under a CC BY 4.0 License. Status: Published. Journal publication published 07 Oct, 2025 in npj Digital Medicine. Version 1.
Abstract
Recent advances in unbiased metagenomic next-generation sequencing (mNGS) enable simultaneous examination of microbial and host genetic material. In this study, we developed a multimodal machine learning-based diagnostic approach to differentiate lung cancer and pulmonary infections using 402 bronchoalveolar lavage fluid (BALF) mNGS datasets.
The training cohort revealed differences in DNA/RNA microbial composition, bacteriophage abundances, and host responses, including gene expression, transposable element levels, immune cell composition, and tumor fraction derived from copy number variation (CNV). The diagnostic model (Model VI) that integrated these differential features demonstrated an AUC of 0.937 (95% CI = 0.91–0.964) in the training cohort and 0.847 (95% CI = 0.776–0.918) in the validation cohort for distinguishing lung cancer from pulmonary infections. The application of a rule-in and rule-out strategy-based composite predictive model significantly enhanced accuracy (ACC) in distinguishing between lung cancer and tuberculosis (ACC = 0.896), fungal infection (ACC = 0.915), and bacterial infection (ACC = 0.907). These findings underscore the potential of cost-effective mNGS-based analysis for early differentiation between lung cancer and pulmonary infections.
Subject areas: Biological sciences/Biological techniques/Sequencing/Next generation sequencing; Biological sciences/Biological techniques/Sequencing/RNA sequencing
Keywords: metagenomic next-generation sequencing, mNGS, lung cancer, pulmonary infections
Introduction
Lung cancer and pulmonary infections pose significant global health challenges with high incidence, mortality rates, and substantial socioeconomic burdens 1, 2. Clinicians often struggle to differentiate them based solely on clinical and radiological features when rapid and accurate histopathological or microbiological test results are lacking. This leads to misdiagnoses and to delayed or incorrect treatments 3, 4.
Various pathogens causing pulmonary infections, such as bacteria (Pseudomonas, Streptococcus), mycobacteria (Mycobacterium tuberculosis, non-tuberculous mycobacteria), aerobic actinomycetes (Nocardia), fungi (Aspergillus, Mucor, Cryptococcus), and others, can mimic lung cancer, sharing indistinguishable clinical symptoms (e.g., dyspnea, fatigue, cough, and hemoptysis) and radiographic features (e.g., spiculated solid nodules or masses, cavities with nodular margins, and chest wall and mediastinal invasion) 3, 5. Consequently, clinicians often employ multiple testing methods to detect lung infections and cancer 6. An affordable diagnostic method that requires fewer samples and helps clinicians reach faster and more accurate decisions would greatly benefit patient treatment and management. Metagenomic next-generation sequencing (mNGS) is a sequencing technology capable of identifying pathogens within 24 hours or less, even in specimens with microbial nucleic acid concentrations beyond detection limits 7–9. In recent years, it has been widely employed in the diagnosis of various complex infectious diseases and has been confirmed as a powerful tool with excellent diagnostic accuracy in detecting pneumonia-related pathogens 10–12. Recent studies have also confirmed that analyzing transcriptomic data derived from the human sequences in mNGS testing can aid in distinguishing infectious diseases (such as sepsis, acute respiratory infections, and tuberculous meningitis) and non-infectious diseases 13–15. Algorithms based on chromosomal instability and tumor-related copy number variations derived from mNGS data have likewise proved useful for diagnosing malignant tumors 16–18. These studies prompted us to ask whether mNGS data from respiratory tract samples can be used to establish an integrative genomic diagnostic method that combines the microbial and host response characteristics of patients.
This method is anticipated to identify pulmonary infectious diseases that can be mistaken for lung cancer without escalating patient testing expenses, using minimal tests and samples, and within a relatively short timeframe. Here, we conducted mNGS testing on bronchoalveolar lavage fluid samples (BALF-mNGS) from 402 clinical patients with lung cancer or pulmonary infections. We then analyzed the microbial and host response information derived from the metagenomic sequencing data and, on this basis, established and validated an integrated host/microbe metagenomics-driven machine learning approach for the differential diagnosis of lung cancer and pulmonary infections.
Materials and Methods
1. Study design, patient collection and ethics statement
This observational study assessed adults admitted to the First Affiliated Hospital, Zhejiang University School of Medicine (FAHZU), with suspected lung cancer or pulmonary infections. Enrollment occurred between March 8, 2020, and May 27, 2023, for patients aged ≥ 18 who required BALF samples within 72 hours of intubation to establish the etiology. Exclusions involved cases with underlying leukemia, cases with no definitive diagnosis after extensive follow-up, and cases lacking matched DNA and RNA mNGS data from BALF samples (Fig. 1A). A total of 123 lung cancer cases, 279 pulmonary infection cases (tuberculosis, fungal, and bacterial infections), and 32 negative control cases (e.g., immune pneumonitis, organizing pneumonia and drug-related pneumonia) were included. The diagnosis of lung cancer relied on clinical suspicion and positive laboratory results from cytology, flow cytometry and/or tissue biopsy. Pathological information of all samples was determined based on surgically resected tissue sections according to the 2015 WHO Histological Classification of Lung Cancer 19.
The diagnosis of pulmonary infections was based on clinical suspicion and determination of the causative pathogen through standard microbiological diagnostics (cultures, antigen/antibody tests, PCR, sequencing; see Supplementary Data S1). Archival material at FAHZU was retrospectively analyzed under no-patient-contact protocols approved by the FAHZU Institutional Review Board (IIT20220714A). Written consent, given prior to the procedure used to obtain the sample, covered the use of residual samples for research. We then constructed the training set and the validation set in chronological order of collection date. We ranked all lung cancer samples by collection time and separated them into the first 70% and the last 30% (Fig. 1B, Supplementary Data S1). The training set was used for differential analysis, feature selection and ensemble model training. The validation set was used for performance validation and for combining rule-in/rule-out predictions.
2. DNA/RNA extraction, library construction and sequencing
For metagenomic sequencing (DNA sequencing), 1 mL of BALF sample was subjected to host nucleic acid depletion using 1 U benzonase (Sigma) and 0.5% Tween 20 (Sigma), with incubation at 37°C for 5 min. A total of 600 µL of the mixture was transferred to new tubes containing 500 µL of ceramic beads for bead beating using a Minilys Personal TGrinder H24 Homogenizer (catalogue number: OSE-TH-01, Tiangen, China). Then, the nucleic acid from 400 µL of the pretreated sample was extracted and eluted in 60 µL of elution buffer using a QIAamp UCP Pathogen Mini Kit (catalogue number: 50214, Qiagen, Germany). The extracted DNA was quantified using a Qubit dsDNA HS Assay Kit (catalogue number: Q32854, Invitrogen, USA) 9. For metatranscriptome sequencing (RNA sequencing), 1 mL of BALF sample was centrifuged at 12,000 rpm for 10 min.
Then, 200 µL of the precipitate was lysed in TRIzol LS (Thermo Fisher Scientific, Carlsbad, CA, USA), followed by RNA extraction using a Direct-zol RNA Miniprep kit (Zymo Research, Irvine, CA, USA) according to the manufacturer's instructions 20. Following the manufacturers' instructions, 30 µL of DNA was used to generate libraries with the Nextera DNA Flex kit (Illumina, San Diego, CA, USA), and 10 µL of purified RNA was used for cDNA generation and library preparation with an Ovation Trio RNA-Seq Library Preparation Kit (NuGEN, CA, USA). A Qubit dsDNA HS Assay Kit was used to measure the library concentration. Library quality was assessed with an Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA) and a High Sensitivity DNA kit. Libraries were sequenced on an Illumina NextSeq 550 sequencer with a 75-cycle single-end sequencing strategy 20, 21.
3. Microbial annotation, community structure comparison and differential taxon analysis
As described previously, we used a validated mNGS sequencing pipeline for microbial composition analysis 9, 20. In brief, Trimmomatic was used to remove low-quality reads, duplicate reads, reads shorter than 50 bp, and adapter contamination 22. Human sequences were excluded by mapping to the human reference genome (hg38) using SNAP v1.0beta 8. SortMeRNA v4.3.7 was used for ribosomal RNA removal 21. Kraken2 v2.0.7 and Bracken v2.5 created taxonomic profiles using default settings and the default database (https://benlangmead.github.io/aws-indexes/k2) 9, 20. Sequencing reads for detected microbes were normalized as reads per million (RPM) to correct for differences in sequencing depth. The BALF mNGS data from 32 non-infection, non-cancer cases were used as negative controls (NC, Supplementary Data S3). Further analysis was done to identify possible contaminants in the DNA/RNA mNGS datasets. To this end, we compared the relative abundance of taxa between background bronchoscope controls and BAL samples.
Taxa whose median relative abundance was greater in background than in BAL, or whose frequency in NCs was higher than 50% and whose average relative abundance in NCs was higher than 0.1%, were identified as probable contaminants and removed 20, 23. For bacteriophage annotation, the cleaned reads were aligned against a curated phage database (CPD) containing 26,159 representative phage genomes using BLAST (word size: 18, e-value: 0.0005, culling limit: 1) 24. Microbial and phage quantification in DNA (DNA microbial abundances, DMA) and RNA (RNA microbial abundances, RMA) mNGS data relied on relative abundances 9, 20, 24. The α-diversity of the microbial composition in DNA/RNA mNGS data, including the Shannon, Simpson, Chao1, and ACE indices, was computed using the "vegan" package in R after sequence processing. Permutational multivariate ANOVA (PERMANOVA) was conducted using the "vegan" package to determine differences in sample β-diversity (measured by Bray-Curtis distance). Principal coordinates analysis (PCoA) was used to identify differences in microbial community structure. LEfSe was used to assess differences in microbial taxa or bacteriophages between groups 25.
4. Gene expression (GE), transposable elements expression (TEE), cell-type composition analysis (CC)
For the analysis of host gene expression, high-quality data were aligned to the human genome hg38 using HISAT2 with default parameters. Gene-level quantification was performed using featureCounts 26, 27. The gene counts were aggregated using the featureCounts program from the Subread package release 2.0.0 (http://subread.sourceforge.net/) 20, 23. Additionally, trimmed clean reads were mapped using STAR with previously defined parameters 28. TEtranscripts was used to estimate the abundances of transposable elements (TE) and to conduct differential expression analysis.
The GTF file containing transposable element annotations was obtained from https://hammelllab.labsites.cshl.edu/software/#TEtranscripts. All genes and TEs were normalized, corrected for batch effects, and tested for differential expression in each group using the DESeq2 package, applying criteria of FDR ≤ 0.05 and fold-change ≥ 1.5 29. Gene set enrichment analysis (GSEA) for DEGs was carried out using the REACTOME, KEGG, and GO databases 30–32. Significantly enriched pathways or biological processes were determined based on Fisher's exact test (p-value < 0.05), following Benjamini and Hochberg's adjustment 20, 23. To estimate the relative proportions of infiltrating immune cell types, the CIBERSORT algorithm was applied with the original gene signature file LM22 and 1000 permutations 34. Latent variables were calculated with the PLIER R package 35. Continuous nonparametric data of latent variables and cell proportions were compared using the Mann-Whitney U test. P values from multiple testing of latent variables were adjusted using the Benjamini-Hochberg adjustment with a significance level of 0.05.
5. Copy number variants-derived (CNV) tumor fractions calling
The DNA metagenomic sequencing data were used in downstream analyses to identify CNVs with ichorCNA 36, CNVkit and the estimate software package, and to generate ctDNA tumor fractions as previously described and validated in tumor tissue and body fluid 37, 38. The ichorCNA ploidy parameter restart value was set to 2 and the maximum copy number was lowered to 3. The tumor fraction with the highest log-likelihood was retrieved and reported. Continuous nonparametric data were compared using the Mann-Whitney U test. P values less than 0.05 were considered statistically significant.
6. Ensemble machine learning models for DMA, RMA, GE, TEE, CC and CNV
We first performed differential analysis to identify features significantly associated with disease type among microbial DNA/RNA relative abundances (DMA/RMA), transcripts per million (TPM) of host gene expression (GE), TPM of transposable elements (TEE), relative abundance of host cells (CC) and the tumor fraction/CNV score (CNV), respectively. Within each type of data, features with an adjusted p value below the cut-off of 0.05 were selected. After obtaining all candidate features, we calculated the frequency of features in the training set for each classifier capable of performing feature selection across different data types, conducting 1,000 iterations to determine the occurrence frequency of each feature. We then selected the optimal combination of features through sequential forward selection. Using four different classifiers (LASSO, SVM, XGBoost, and Random Forest), we constructed six different models (Models I-VI) in the training set. LASSO was implemented via the glmnet package. The regularization parameter, λ, was determined by 10-fold cross-validation, whereas the L1-L2 trade-off parameter, α, was varied from 0 to 1 (interval = 0.1). For the CoxBoost model, we first used the 10-fold optimBoostPenalty routine to determine the optimal penalty (amount of shrinkage). The SVM model was implemented via the svm package. The regression approach takes censoring into account when formulating the inequality constraints of the support vector problem. Random Forest had two parameters, ntree and mtry, where ntree represented the number of trees in the forest and mtry was the number of randomly selected variables for splitting at each node.
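The adjusted p-value filter described above can be illustrated with a short sketch. It assumes Benjamini-Hochberg adjustment, which the Methods name for latent variables; the adjustment actually used for each data type is not stated, and the feature names and p-values below are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices from smallest to largest p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is capped by the one above it.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def select_features(p_by_feature, cutoff=0.05):
    """Keep features whose adjusted p-value is below the cut-off (0.05 in the text)."""
    names = list(p_by_feature)
    adjusted = benjamini_hochberg([p_by_feature[name] for name in names])
    return [name for name, q in zip(names, adjusted) if q < cutoff]


# Hypothetical raw p-values for four candidate features (illustrative only).
toy_p_values = {"feature_a": 0.01, "feature_b": 0.04, "feature_c": 0.03, "feature_d": 0.20}
selected = select_features(toy_p_values)
print(selected)
```

In this toy input, two features with raw p-values below 0.05 are dropped after adjustment, which is the practical effect of filtering on adjusted rather than raw p-values.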
To integrate the predicted cancer or infection probability scores from LASSO, SVM, XGBoost, and Random Forest, we calculated the probability of cancer as: Pr(Cancer) = α × Pr(Classifier A) + β × Pr(Classifier B), with α + β = 1. Each weight increment (bootstrap value) was 0.1. For each model (I-VI), we picked the combination with the best mean AUC across all comparisons in the validation dataset. Finally, we chose the best classifier along with Model VI to conduct combined rule-in and rule-out predictions (Fig. 1C, Supplementary Figure S1). The R package "mlr3" was used to build the machine learning models 39. Model performance was assessed using accuracy, sensitivity, specificity and the AUC. The DeLong test was used to calculate the significance (p-value) of the difference between two ROC curves 40. All data analyses were performed in RStudio with R version 4.1.0.
Results
1. Clinical features of study cohort
Based on the established criteria (Methods, Fig. 1A), we enrolled a total of 402 patients, consisting of 123 lung cancer patients and 279 patients with pulmonary infections. According to etiological findings, the infection group was further subdivided into three subgroups: pulmonary tuberculosis (n = 86), fungal infection (n = 79), and bacterial infection (n = 114). Most patients, regardless of their subgroup, exhibited similar clinical and imaging characteristics, such as race (all were Chinese), underlying medical conditions, white blood cell (WBC) count, inflammatory indicators such as procalcitonin (PCT) and C-reactive protein (CRP), and chest computed tomography (CT) findings (e.g., patchy shadows and nodules, cavities, mediastinal lymphadenopathy) (Table 1). The median mNGS DNA data per patient was 21.9 million reads (IQR 18.0–27.6 M), with the vast majority of reads (> 95%) being human. The median mNGS RNA data per patient was 19.1 million reads (IQR 13.8–26.2 M).
We compared and screened the differential features within the mNGS data of the lung cancer and pulmonary infection groups in the training cohort to establish a differential diagnosis approach for lung cancer and pulmonary infections. Subsequently, the lung cancer group was compared separately to the tuberculosis, fungal, and bacterial infection groups to develop a diagnostic method capable of rapidly distinguishing lung cancer from infections caused by different pathogens (Fig. 1C, Supplementary Figure S1).

Table 1 Demographic and clinical characteristics of the enrolled patients
Characteristics | Overall | Lung Cancer | Bacterial Infection | Fungal Infection | Tuberculosis | p-value
Patient demographics
Total number, n | 402 | 123 | 114 | 79 | 86 |
Age (median [IQR]) | 59.50 [50.00, 67.50] | 58.00 [51.00, 69.50] | 57.00 [50.00, 66.00] | 57.00 [46.00, 69.00] | 57.50 [35.00, 67.75] | 0.114
Sex = Male, n(%) | 255 (63.4) | 86 (69.9) | 60 (52.6) | 50 (63.3) | 59 (68.6) | 0.086
Underlying conditions, n(%)
Cardiovascular disease | 69 (17.2) | 24 (19.5) | 18 (15.8) | 17 (21.5) | 10 (11.6) | 0.309
Immunological disease | 22 (5.5) | 4 (3.3) | 7 (6.1) | 9 (11.4) | 2 (2.3) | 0.057
Liver insufficiency | 34 (8.5) | 9 (7.3) | 9 (7.9) | 11 (13.9) | 5 (5.8) | 0.293
Renal insufficiency | 63 (15.7) | 17 (13.8) | 18 (15.8) | 15 (19.0) | 13 (15.1) | 0.507
COPD | 126 (31.3) | 33 (26.8) | 39 (34.2) | 27 (34.2) | 27 (31.4) | 0.941
Central nervous system disorder | 21 (5.2) | 7 (5.7) | 8 (7.0) | 5 (6.3) | 1 (1.2) | 0.22
HIV | 3 (0.7) | 0 (0.0) | 0 (0.0) | 2 (2.5) | 1 (1.2) | 0.064
Hypertension | 99 (24.6) | 40 (32.5) | 27 (23.7) | 19 (24.1) | 13 (15.1) | 0.038
Diabetes | 55 (13.7) | 17 (13.8) | 19 (16.7) | 8 (10.1) | 11 (12.8) | 0.643
Laboratory testing, median [IQR]
WBC (10×10^9/L) | 6.50 [5.05, 9.33] | 6.42 [4.95, 9.23] | 7.35 [5.32, 10.31] | 6.58 [4.65, 9.20] | 6.08 [5.07, 7.86] | 0.102
NEUT (%) | 70.30 [61.82, 80.25] | 70.80 [63.35, 81.40] | 71.30 [61.95, 80.67] | 74.50 [60.50, 86.65] | 67.50 [60.20, 73.05] | 0.013
CRP (mg/L) | 17.03 [3.30, 56.53] | 22.27 [4.73, 71.02] | 16.44 [3.30, 59.38] | 10.30 [3.21, 52.62] | 11.20 [3.55, 41.72] | 0.136
PCT (ng/mL) | 0.09 [0.04, 0.36] | 0.10 [0.04, 0.37] | 0.11 [0.05, 0.48] | 0.19 [0.04, 0.55] | 0.05 [0.05, 0.12] | 0.042
Chest CT imaging features, n(%)
Pulmonary emphysema | 90 (22.4) | 39 (31.7) | 22 (19.3) | 17 (21.5) | 12 (14.0) | 0.018
Pulmonary nodule | 136 (33.8) | 52 (42.3) | 22 (19.3) | 32 (40.5) | 30 (34.9) | 0.001
Pulmonary cavity | 50 (12.4) | 10 (8.1) | 14 (12.3) | 8 (10.1) | 18 (20.9) | 0.055
Ground-glass shadow | 58 (14.4) | 23 (18.7) | 11 (9.6) | 15 (19.0) | 9 (10.5) | 0.095
Multiple patchy solid shadows | 284 (70.6) | 82 (66.7) | 80 (70.2) | 53 (67.1) | 69 (80.2) | 0.078
Malignant pleural effusion | 144 (35.8) | 50 (40.7) | 37 (32.5) | 29 (36.7) | 28 (32.6) | 0.533
Pleural thickening | 62 (15.4) | 28 (22.8) | 13 (11.4) | 11 (13.9) | 10 (11.6) | 0.069
Mediastinal lymphadenopathy | 146 (36.3) | 58 (47.2) | 35 (30.7) | 21 (26.6) | 32 (37.2) | 0.012
# Categorical data were compared using the chi-square test or Fisher's exact test.

2. Microbial community structure and specific taxon of different pulmonary diseases
Microbial communities were assessed in a total of 284 samples within the training cohort, comprising 87 cases of lung cancer and 197 cases of pulmonary infection. Given the low biomass of BAL samples in the DNA/RNA mNGS data, we first identified probable contaminant taxa by calculating their frequencies and average relative abundances in negative controls (NCs) and by comparing relative abundances between BAL samples and NCs (Supplementary Figures S2 and S3). In the general comparison, we did not find significant differences in DNA microbial α-diversity between lung cancer and pulmonary infections for any index (Supplementary Figure S4A, Mann-Whitney U test, p-value > 0.05). In the RNA data, however, Richness and Chao1 were higher and the Evenness index was lower in the lung cancer group (Supplementary Figure S4B). In the subgroup comparisons, both DNA and RNA data showed that Richness and Chao1 were higher and the Evenness index was lower in the lung cancer group and the bacterial infection group (Supplementary Figure S4C, D).
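For readers less familiar with the α-diversity indices compared above, the following is a minimal sketch of how they are commonly computed from per-taxon read counts. The study itself used the "vegan" package in R; the function names here are illustrative, Pielou's evenness is assumed because the text does not say which evenness index was used, and the Chao1 formula shown is the bias-corrected form.

```python
import math


def richness(counts):
    """Number of taxa observed at least once."""
    return sum(1 for c in counts if c > 0)


def shannon(counts):
    """Shannon index H = -sum(p * ln p) over observed taxa."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def simpson(counts):
    """Simpson index reported as 1 - sum(p^2)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def pielou_evenness(counts):
    """Pielou's evenness H / ln(S); assumed here, the source does not name its evenness index."""
    s = richness(counts)
    return shannon(counts) / math.log(s) if s > 1 else 0.0


def chao1(counts):
    """Bias-corrected Chao1 richness estimate from singleton (f1) and doubleton (f2) counts."""
    f1 = sum(1 for c in counts if c == 1)
    f2 = sum(1 for c in counts if c == 2)
    return richness(counts) + f1 * (f1 - 1) / (2 * (f2 + 1))


# Hypothetical read counts per taxon for two samples (illustrative only).
even_sample = [10, 10, 10, 10]
skewed_sample = [37, 1, 1, 2, 0]
print(shannon(even_sample), simpson(even_sample), pielou_evenness(even_sample))
print(richness(skewed_sample), chao1(skewed_sample), pielou_evenness(skewed_sample))
```

The skewed sample shows how a community can have the same observed richness as the even one while its Chao1 estimate rises (because of singletons) and its evenness falls, the pattern reported for the lung cancer group.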
On the other hand, β-diversity analysis based on the Bray-Curtis distance indicated that the microbial composition of BALF samples in the cancer group was distinct from that of the infection group as a whole and of each infection subgroup (Fig. 2A, B, PERMANOVA, P < 0.01). For the RNA data, both α-diversity (Supplementary Figure S4B, D, Mann-Whitney U test, p-value < 0.05) and β-diversity (Supplementary Figure S5A, B, PERMANOVA, p-value < 0.01) analyses of the microbiome supported distinct microbial community features in the lower airways among pulmonary diseases. To identify microorganisms specific to different pulmonary diseases, we performed LEfSe analysis. The findings revealed a higher prevalence of S. oralis, P. micra, and P. gingivalis, which are often regarded as oral or airway commensals, in lung cancer compared to pulmonary infection (Fig. 2C, LDA score > 2, adjusted p-value < 0.05). Conversely, pathogenic microorganisms commonly linked with infections, such as M. tuberculosis, P. aeruginosa, A. fumigatus, and C. neoformans, were more frequently detected in pulmonary infection. Notably, the anaerobic bacterium F. nucleatum appeared as a specific microbe in bacterial infections (Fig. 2C). We suggest that this is because the bacterial infections in our study included some patients with lung abscesses. Additionally, we observed that P. aeruginosa and P. gingivalis served as specific microbes in different comparison groups, indicating varying disease-specific microbes (Fig. 2C). This suggests that although certain pathogens (e.g., M. tuberculosis, A. fumigatus, and C. neoformans) have distinct microbial profiles that could serve as valuable indicators for diagnosing pulmonary diseases, differentiation among pulmonary diseases based on lung microbiota alone is limited by the complexity of the lung microbiome.
3. Difference in host immune response, transposable elements expression, and immune cell abundance of different pulmonary diseases
First, to reduce the impact of ribosomal RNA on the effective data, we performed ribosomal RNA removal in the experimental steps. We observed that the percentages of eukaryotic rRNA (1.66%, IQR 1.01%-2.62%) and total rRNA (2.3%, IQR 1.38%-3.98%) were relatively low (Supplementary Table S1, Figure S6). Additionally, we calculated the number of genes detected in each sample, finding a median of 17,827 (IQR 16,832.5–18,738.5). Finally, to discern host immune responses between lung cancer and infection, we conducted BALF host gene expression analyses, revealing substantial variation among groups as depicted in volcano plots (Extended Data Fig. 7A-D). GSEA highlighted significant enrichment of differentially expressed genes (DEGs) in innate immune pathways like T-cell receptor signaling and cytokine-cytokine receptor signaling (Fig. 3A). Employing PLIER on the training datasets, we delineated host transcriptomic profiles across 545 canonical pathways, identifying multiple differentially expressed latent variables (LVs) with distinct biological functions across different groups (Fig. 3B, Mann-Whitney U test, adjusted p-value < 0.05). Specifically, in the cancer group, lower airway transcriptomes exhibited upregulation of the cell cycle (LV102 and LV107), while LV165, annotated as cytokine-cytokine receptor interaction pathways, displayed upregulation, contrary to LV86 in the same pathways (Fig. 3B). Furthermore, we observed upregulation of interferon signaling and the innate immune system in infection groups (Fig. 3D, E), notably driven by pulmonary tuberculosis, which exhibited the well-established upregulation of interferon signaling. For further exploration, we selected differentially expressed immune genes (IMG) from the ImmPort database and interferon-stimulated genes (ISG) from prior research 33, 42.
Notably, TB-associated markers GBP1 and GBP5 were elevated in the TB group (Supplementary Figure S8A, adjusted p-value < 0.01). Four genes emerged as notably upregulated in the cancer group, and all four were chemokines: C-C motif chemokine ligand 7 (CCL7), C-C motif chemokine ligand 8 (CCL8), C-C motif chemokine ligand 13 (CCL13) and pro-platelet basic protein (PPBP), also known as CXCL7 (Supplementary Figure S8A, indicated by red triangle, adjusted p-value < 0.01). Studies suggest that CCL7, highly expressed in tumor tissues, recruits cDC1 cells, aiding antitumor immunity and checkpoint immunotherapy. Additionally, CCL7, CCL8, and CCL13 are linked to tumor-associated macrophages (M2) 43–45. We identified 27 transposable elements among lung cancer and the three infection groups (Supplementary Figure S7E-H, Figure S8B), notably finding significantly higher LTR-ERV (LTR6A and HUERS-P3-int) levels in lung cancer (Supplementary Figure S8B, adjusted p-value = 0.019). To investigate variations in immune cell abundance across different groups, we estimated cell-type levels in host transcriptomes using computational quantification methods, including a deconvolution approach implemented in CIBERSORTx. M1 macrophages were significantly elevated in pulmonary tuberculosis (Fig. 4B, Mann-Whitney U test, p-value < 0.05), whereas M2 macrophage levels were higher in fungal infection, pulmonary tuberculosis and lung cancer (Fig. 4C, Mann-Whitney U test, p-value < 0.01). Neutrophils were enriched in bacterial infection compared with lung cancer (Fig. 4D, Mann-Whitney U test, p-value < 0.01). Furthermore, we observed notably higher monocyte levels in fungal infection (Fig. 4E, Mann-Whitney U test, p-value < 0.01).
4. Copy number variants and CNV-derived tumor fraction of different pulmonary diseases
Because host DNA was depleted during sample preprocessing, we first evaluated whether sufficient host data remained. We analyzed 402 samples, finding a host rate of 98.22% (IQR 97.46, 98.66) and 19.23 million mapped reads (IQR 15.95, 23.66) (Supplementary Table S2). The data volume in this study is higher than in previous studies 17, which supports the reliability of subsequent analyses. To enhance CNV and tumor fraction estimations in BALF mNGS data, we used three distinct software tools. CNVkit revealed slight increases in CNV counts on chromosomes 11 (lung cancer group) and 3 (pulmonary infection group) (Supplementary Figure S9, p-value < 0.05). Higher CNV percentages on chromosome 3 were noted in the infection group (Supplementary Figure S10, p-value 0.05). Subsequently, ichorCNA estimated tumor fractions at 5.96% (lung cancer, 95% CI 4.15%-7.77%) and 6.29% (pulmonary infection, 95% CI 0.54%-12.04%) (Supplementary Figure S13A). Notably, no significant differences in tumor fractions were observed between the lung cancer group and the three infection subgroups (Supplementary Figure S13B). Scores calculated with the 'estimate' software (Stromal, Immune, ESTIMATE, Tumor Purity) showed no differences between lung cancer and any of the pulmonary infection groups (Supplementary Figure S13C, D). This suggests that, unlike Cancer-Negative (Benign) samples 16, BALF samples from infection patients display levels of copy number variation comparable to those seen in cancer patients.
5. Host/microbe metagenomics-based modelling for lung cancer and pulmonary infection diagnosis
We first conducted individual machine learning modeling and dual-model ensemble modeling for Models I-VI. We evaluated their performance based on the mean AUC on the validation dataset. Among them, the optimal combination for Model I was found to be 0.1LASSO + 0.9RF, with a mean AUC of 0.778 (Supplementary Figure S14).
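A combination such as 0.1LASSO + 0.9RF denotes a weighted sum of two classifiers' predicted cancer probabilities, with weights that sum to 1 and vary in steps of 0.1. A minimal sketch of how such a grid could be searched by AUC follows; the probabilities and labels are toy values, and all function names are illustrative rather than taken from the study's mlr3 code.

```python
def auc(labels, scores):
    """AUC as the fraction of (cancer, infection) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def blend(prob_a, prob_b, alpha):
    """Pr(Cancer) = alpha * Pr(Classifier A) + (1 - alpha) * Pr(Classifier B)."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(prob_a, prob_b)]


def best_weight(labels, prob_a, prob_b, steps=10):
    """Try alpha = 0.0, 0.1, ..., 1.0 and return (alpha, AUC) for the best blend."""
    best = None
    for k in range(steps + 1):
        alpha = k / steps
        score = auc(labels, blend(prob_a, prob_b, alpha))
        if best is None or score > best[1]:
            best = (alpha, score)
    return best


# Toy data: 1 = lung cancer, 0 = infection; probabilities are hypothetical.
labels = [1, 1, 1, 0, 0, 0]
prob_lasso = [0.9, 0.4, 0.8, 0.5, 0.2, 0.3]
prob_rf = [0.6, 0.7, 0.5, 0.4, 0.6, 0.1]
alpha, blended_auc = best_weight(labels, prob_lasso, prob_rf)
print(alpha, blended_auc)
```

Because alpha = 0 and alpha = 1 are part of the grid, the selected blend can never score below the better of the two individual classifiers on the data used for selection; choosing the weight on the validation data, as the text describes, means the reported AUC is not an independent estimate.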
For Model II, the best combination was 0.3RF + 0.7XGBoost, with a mean AUC of 0.691 (Supplementary Figure S15). Model III's optimal combination was 0.3LASSO + 0.7RF, achieving a mean AUC of 0.867 (Supplementary Figure S16). For Model IV, the best combination was 0.2LASSO + 0.8SVM, with a mean AUC of 0.584 (Supplementary Figure S17). The optimal combination for Model V was 0.9LASSO + 0.1XGBoost, with a mean AUC of 0.56 (Supplementary Figure S18). For Model VI, the optimal combination was 0.3LASSO + 0.7RF, with a mean AUC of 0.869 (Fig. 5A). We observed that among the various comparison groups, Model VI consistently exhibited the highest AUC (Fig. 5B). Specifically, Model VI significantly outperformed Models I, II, IV, and V in each comparison group (Figs. 5C-F, DeLong test p-value < 0.05). The results showed that Model VI, incorporating differential features from microbial and bacteriophage DNA/RNA abundances, host gene expression, immune cell composition, transposable elements, and CNV-derived tumor fraction, exhibited the highest discriminatory capability in both general and subgroup comparisons. Specifically, for the general comparison, Model VI demonstrated an AUC of 0.937 (95% CI = 0.91–0.964) with 92.0% sensitivity and 81.2% specificity in the training cohort. In the validation cohort, it achieved an AUC of 0.847 (95% CI = 0.776–0.918) with 94.4% sensitivity and 61.0% specificity, effectively distinguishing lung cancer from pulmonary infections (Fig. 5C, Supplementary Table S3). The highlighted host transcriptome features in Model VI included genes involved in the cell cycle and cytokine-cytokine receptor pathways, such as ULBP1, BG3GAT1, and CCL13 (Supplementary Data S2; Fig. 5G, H, and I, Mann-Whitney U test, p-value < 0.05). Notably, CCL13, a downstream gene of EGFR, serves as a typical LUAD biomarker, while ULBP1 and BG3GAT1 are genes regulated by CCL13 for cDC modulation. In the subgroup comparisons, Model VI showcased notable performance.
For instance, in distinguishing lung cancer from bacterial infection, it attained an AUC of 0.847, with 80.6% sensitivity and 82.4% specificity in the validation cohort. In discerning lung cancer from fungal infection, Model VI displayed an AUC of 0.872, a sensitivity of 94.4%, and a specificity of 69.6%. When differentiating lung cancer from pulmonary tuberculosis, Model VI achieved an AUC of 0.909, a sensitivity of 91.7%, and a specificity of 76.0% (Fig. 5B, D, E, and F, Supplementary Table S3). Noteworthy observations included higher levels of MAS1, associated with apoptosis and tissue injuries, in bacterial infection; increased IL23R levels, correlated with TLR4, in pulmonary tuberculosis; and elevated C1QL3 levels in fungal infection, each compared to lung cancer (Fig. 5K, J, and L, Mann-Whitney U test, p-value < 0.01).

6. A composite predictive model for lung cancer and infection diagnosis

With a rule-in and rule-out strategy, we developed a composite predictive model that combines the Model VI used for the general comparison with one of the Model VI classifiers used for a subgroup comparison, aiming to enhance the diagnostic accuracy for lung cancer and infections. Under this strategy, if both the general-comparison Model VI and the subgroup-comparison Model VI classified a patient as lung cancer, the patient was assigned to the rule-in band (i.e., a positive lung cancer diagnosis). If both models classified a patient as infection, the patient was assigned to the rule-out band (i.e., a positive pulmonary infection diagnosis) (Fig. 1C, Supplementary Fig. 1). The validation cohort from each subgroup comparison was used to evaluate the performance of the composite predictive model. Within the lung cancer versus bacterial infection group, a total of 54 patients were categorized, with 27 identified as rule-in and 22 as rule-out (Fig. 6A, Table 2).
Similarly, within the comparison of lung cancer versus fungal infection, 47 patients were classified, comprising 32 rule-in and 11 rule-out cases (Fig. 6B, Table 2). In the evaluation of lung cancer versus pulmonary tuberculosis, 48 patients were allocated, consisting of 31 rule-in and 12 rule-out instances (Fig. 6C, Table 2).

Table 2. Test statistics for combination strategy.

Treated                                   Cancer  Infection  LR*   Specificity#  Sensitivity+
Lung Cancer vs. Bacterial Infection
  Rule-Out                                0       22         0     1             -
  Rule-In                                 27      5          5.4   -             0.844
Lung Cancer vs. Fungal Infection
  Rule-Out                                0       11         0     1             -
  Rule-In                                 32      4          8     -             0.889
Lung Cancer vs. Pulmonary Tuberculosis
  Rule-Out                                0       12         0     1             -
  Rule-In                                 31      5          6.2   -             0.861

LR*: Likelihood Ratio, an indicator of cancer risk. A higher LR signifies a stronger correlation with lung cancer. For example, within the rule-out band for lung cancer versus bacterial infection, there were 22 patients classified as infection and 0 patients classified as having cancer, so the LR is 0/22 = 0.
Specificity#: the accuracy of the rule-out band in correctly identifying infected patients. It is calculated as the number of infected patients correctly identified by the rule-out band (true positives) divided by the sum of true positives and the number of infected patients mistakenly identified as having cancer by the rule-out band.
Sensitivity+: the accuracy of the rule-in band in correctly identifying cancer patients. It is calculated by dividing the number of cancer patients correctly identified by the rule-in band (true positives) by the sum of true positives and the number of cancer patients mistakenly identified as having an infection by the rule-in band.

These results show that the strategy significantly enhanced the diagnostic accuracy (ACC) in distinguishing between lung cancer and bacterial infection, raising it from 0.800 (56/70) to 0.907 (49/54) (Fig. 6A).
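The Table 2 statistics and the composite accuracy figures can be reproduced from the band counts alone. The sketch below is only an illustrative recomputation in Python: the dictionary layout and function names are assumptions, the LR is taken as the ratio of cancer to infection counts within a band (as in the 0/22 example in the table footnote), and each per-band fraction is the share of patients in a band who belong to the class that band predicts.

```python
# Band counts from Table 2: (cancer, infection) patients per band.
TABLE2 = {
    "Lung Cancer vs. Bacterial Infection": {"rule_out": (0, 22), "rule_in": (27, 5)},
    "Lung Cancer vs. Fungal Infection": {"rule_out": (0, 11), "rule_in": (32, 4)},
    "Lung Cancer vs. Pulmonary Tuberculosis": {"rule_out": (0, 12), "rule_in": (31, 5)},
}


def likelihood_ratio(cancer, infection):
    """LR as used in Table 2: cancer count divided by infection count in a band."""
    if infection == 0:
        return float("inf")  # guard only; not needed for the Table 2 counts
    return cancer / infection


def band_summary(bands):
    """Recompute the Table 2 columns and the composite accuracy for one comparison."""
    out_cancer, out_infection = bands["rule_out"]
    in_cancer, in_infection = bands["rule_in"]
    classified = out_cancer + out_infection + in_cancer + in_infection
    return {
        "lr_rule_out": likelihood_ratio(out_cancer, out_infection),
        "lr_rule_in": likelihood_ratio(in_cancer, in_infection),
        # Share of the rule-out band that is truly infection (Specificity# column).
        "rule_out_fraction": out_infection / (out_cancer + out_infection),
        # Share of the rule-in band that is truly cancer (Sensitivity+ column).
        "rule_in_fraction": in_cancer / (in_cancer + in_infection),
        # Accuracy over all patients assigned to either band.
        "accuracy": (in_cancer + out_infection) / classified,
        "n_classified": classified,
    }


for name, bands in TABLE2.items():
    s = band_summary(bands)
    print(f"{name}: LR(rule-in) = {s['lr_rule_in']:.1f}, "
          f"rule-in = {s['rule_in_fraction']:.3f}, "
          f"ACC = {s['accuracy']:.3f} (n = {s['n_classified']})")
```

With these counts the recomputed accuracies are 49/54, 43/47, and 43/48, which round to the reported 0.907, 0.915, and 0.896.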
This enhancement was accompanied by a sensitivity of 100%, reflecting the rule-in band's accuracy in correctly identifying individuals with cancer, and a specificity of 84.4%, demonstrating the rule-out band's accuracy in correctly identifying patients with an infection (Table 2). Similarly, ACC rose significantly from 0.797 (47/59) to 0.915 (43/47), alongside a specificity of 88.9% and a sensitivity of 100%, for distinguishing lung cancer from fungal infection (Fig. 6B, Table 2). Of note, this method yielded 86.1% specificity and 100% sensitivity in distinguishing lung cancer from pulmonary tuberculosis (ACC = 0.896, 43/48) (Fig. 6C, Table 2). Accordingly, this integrated predictive approach provides a highly accurate strategy for making better use of the complex data generated by mNGS to distinguish various pulmonary diseases in a clinically viable manner.

Discussion

In the realm of diagnostics, BALF-based mNGS testing has emerged as a rapid assay to pinpoint pulmonary infection pathogens 12, 46–48. Although over 90% of mNGS reads are of human origin and are often disregarded as "noise", recent research posits that these sequences may harbor valuable biomarkers linked to the host's disease state 15, 49. Our study pioneers a comprehensive host/microbe metagenomics approach that uses BALF mNGS data for diagnosing lung cancer and pulmonary infections. This methodology exhibits exceptional accuracy in distinguishing between lung cancer and diverse pulmonary infections (including pulmonary tuberculosis, fungal infection, and bacterial infection), extending the clinical applicability of BALF mNGS testing. While BALF samples exhibit inherent heterogeneity compared to whole blood or tissue specimens 18, 50, 51, our analytical model demonstrates significant robustness.
Specifically tailored for distinguishing lung cancer from pulmonary infection, our Model VI achieved an AUC of 0.847 (95% CI = 0.776–0.918) in the validation cohort. This cohort encompassed a spectrum of complex pulmonary infections, including bacterial, fungal, and tuberculosis infections, each characterized by substantial variations in host immune responses, pathogen profiles, and microbiota compositions. Our model's performance is comparable to that of the uniform multi-omics models used in other studies for different sample types. For instance, the IMX-BVN model achieved an AUC of 0.86 (95% CI 0.77–0.93) for differentiating acute bacterial infections from others and an AUC of 0.85 (95% CI 0.76–0.93) for distinguishing acute viral infections 52. Whole blood transcriptomics showed an AUC of 0.82 for discerning sepsis from non-sepsis states, while plasma cell-free RNA transcriptomics reached an AUC of 0.77 14. These comparisons indirectly suggest that our model performs similarly well in managing complex pulmonary conditions. Moreover, in the validation cohort differentiating lung cancer from pulmonary tuberculosis, the AUC rose to 0.909 (95% CI = 0.831–0.979), showcasing the advantage of integrating multi-omics into Model VI. Identifying patients with lung cancer or pulmonary infections remains a crucial clinical challenge in many medical settings. The decision to administer empirical antibiotics often relies on an educated guess. If, after first confirming an infection with Model VI, the diagnosis could be further refined to a specific infection subgroup (such as bacteria, fungi, or tuberculosis), clinicians could employ antibiotic therapies more accurately.
The sensitivity of our model is exceptionally high: in the validation cohort, it reached 94.4% in the comparison with pulmonary infection overall, 80.6% with bacterial infection, 94.4% with fungal infection, and 91.7% with tuberculosis. This implies that, in addition to detecting pathogens, our model can also identify almost all lung cancer patients. By combining these results with pathological findings, we can ensure diagnostic accuracy and mitigate the limitation of our model's lower specificity. Furthermore, for patients with low tumor risk or those diagnosed with infections, invasive biopsy procedures can be avoided, thus reducing the potential harm caused by such procedures. To classify patients more precisely, we further developed a more rigorous integrated predictive model based on predefined rule-in and rule-out strategies, enhancing the differentiation accuracy between lung cancer and infection subgroups. The result showed improved accuracy in distinguishing lung cancer from pulmonary tuberculosis (ACC = 0.896), fungal infection (ACC = 0.915), and bacterial infection (ACC = 0.907). Such diagnostic approaches promise more precise clinical diagnoses, thereby yielding greater benefits for patients. In clinical practice, patients with suspected pulmonary infections or other diseases undergo pathogen detection and confirmation using DNA/RNA mNGS. At the same time, by employing our diagnostic model and rule-in/rule-out strategy, we can accurately identify patients with confirmed pulmonary infections, patients with lung tumors, and suspected cases. Additional methodologies then further classify lung tumor patients and suspected cases into confirmed lung tumor patients, non-tumor non-infection patients, and patients with concurrent lung tumors and pulmonary infections. This approach allows for more precise subsequent treatment (Fig. 6D).
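The rule-in and rule-out logic amounts to requiring agreement between the general model and a subgroup model. A minimal Python sketch follows; the function names, the label strings, the 0.5 probability cutoff, and the "indeterminate" label for discordant calls are illustrative assumptions rather than details taken from the study.

```python
from collections import Counter

CANCER = "lung cancer"
INFECTION = "infection"


def call(probability, cutoff=0.5):
    """Turn a model's lung cancer probability into a class label.

    The 0.5 cutoff is a placeholder, not a threshold reported in the study.
    """
    return CANCER if probability >= cutoff else INFECTION


def composite_band(general_call, subgroup_call):
    """Combine the general and subgroup model calls into a band.

    rule-in:       both models call lung cancer
    rule-out:      both models call infection
    indeterminate: the models disagree (left for further work-up)
    """
    if general_call == CANCER and subgroup_call == CANCER:
        return "rule-in"
    if general_call == INFECTION and subgroup_call == INFECTION:
        return "rule-out"
    return "indeterminate"


# Hypothetical (general, subgroup) probabilities for four patients.
patients = [(0.91, 0.88), (0.12, 0.20), (0.75, 0.30), (0.40, 0.65)]
bands = [composite_band(call(g), call(s)) for g, s in patients]
print(Counter(bands))
```

Requiring both models to agree trades coverage for accuracy: discordant patients receive no band and are excluded from the composite accuracy, which is why the denominators of the composite ACC values are smaller than those of the single-model values.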
Our study tested an integrated host-microbe mNGS diagnostic approach, examining microbial (including bacteriophage) DNA/RNA abundance, host gene expression, transposable elements, immune cell composition, and copy number variant (CNV)-derived tumor fraction. Prior research used only one or a few features independently to aid diagnosis, such as the lung cancer microbiome 53. Previous 16S rRNA sequencing revealed higher Firmicutes and TM7 presence in lung cancer versus healthy controls 54. Veillonella and Megasphaera showed promise as lung cancer biomarkers (AUC: 0.888), indicating distinctive bacterial profiles in lung cancer versus benign conditions 54. Our data detected subtle microbial differences between lung cancer and pulmonary infections and among the infection subgroups. Veillonella parvula was notably increased in lung cancer compared to bacterial/fungal infection (Fig. 1C, LDA score > 2, adjusted p-value < 0.05). Yet the microbiome alone had limited predictive power for distinguishing lung cancer from pulmonary infection (AUC = 0.645 in the validation cohort). Extracting more distinctive biological information from sequencing data is therefore crucial for differentiating lung cancer from pulmonary infections. We believe that host immune dysregulation disrupts the composition of the respiratory microbiota. Previous literature has underscored significant changes in the dynamic equilibrium between host and microbiome in conditions such as lung cancer and infections 55, 56. In this study, we independently compared the contributions of microbial/bacteriophage relative abundances (Model I and Model II), host gene expression and immune cell composition (Model III), TE expression levels (Model IV), and CNV-derived tumor fraction (Model V) to distinguishing lung cancer from infections. The results indicate that the host immune response (Model III) reflects the most prominent differences in pulmonary disease status compared to the other categories (Figs. 5D-F, DeLong's ROC test, p-value < 0.05).
Despite the limited cellular content in certain BALF samples, we successfully retrieved a robust human gene expression dataset. These data revealed distinct immune responses across various pulmonary diseases. Analysis using PLIER revealed significant differences among these diseases in latent variables associated with cell cycle, interferon, and cytokine pathways. Notably, our findings highlighted genes involved in cell cycle regulation that concurrently influence PI3K-Akt signaling, p53 signaling, and lung cancer pathways, under the regulation of EGFR 57. Additionally, we identified the GBP5 gene, known for its high diagnostic relevance in active tuberculosis 58, and observed elevated expression of interferon signaling pathways in the pulmonary tuberculosis group compared to the other groups (Supplementary Figure S4A). This further underscores the reliability of our findings regarding the host immune response. Our top three classifier genes for lung cancer and pulmonary infection were B3GAT1, ULBP1, and CCL13. Interestingly, these genes have not previously been linked in host gene expression signatures in bodily fluids related to lung cancer. Specifically, ULBP1, as a ligand for the NKG2D receptor, activates NK cells in lung cancer, fostering NK cell-mediated tumor surveillance and cytotoxicity 59. Expression of ULBP1-6, particularly in squamous-cell carcinoma, correlates with clinical outcomes in NSCLC patients, suggesting a predictive value for clinical prognosis 60. Conversely, CCL13, a ligand for CCR2, contributes to cancer-related processes such as metastasis and immunosuppression. CCR2 expression in M2 macrophages is integral to the bidirectional communication between these macrophages and cancer cells, driving lung cancer progression 61. Additionally, CCL13 derived from M2 tumor-associated macrophages promotes oral cancer metastasis by inducing inflammatory cytokines 45.
Finally, B3GAT1, or beta-1,3-glucuronyltransferase 1, holds significance in cancer, particularly concerning tumor cell motility and specific carbohydrate epitope biosynthesis. Its role in canonical integrin signaling pathways influences tumor cell motility, while its involvement in HNK-1 carbohydrate epitope biosynthesis bears relevance to neurodevelopment and cancer-related processes 62. In this study, we investigated for the first time the expression levels of transposable elements in BALF samples from pulmonary diseases. HERVK11D showed higher expression in lung cancer compared to fungal infection and tuberculosis. Similarly, ERVK-MER11B was more highly expressed in bacterial infection than in tuberculosis (Supplementary Figure S4, adjusted p-value < 0.05). The heightened expression of HERV-K, linked to basal-like and triple-negative breast cancer progression, illustrates altered gene expression driving cancer advancement. HERV-derived long non-coding RNAs also promote cancer progression, signaling significant gene profile shifts in these cancers 63, 64. Additionally, two ERV1 elements were notably higher in lung cancer compared to all pulmonary infections (Supplementary Figure S4, adjusted p-value < 0.05). These findings underscore the importance of repetitive sequences in human health, exemplified by severe COVID-19 pneumonia triggering intense inflammatory responses and HERV dysregulation in BALF samples. For example, HERV-FRD, notably upregulated in COVID-19 BALF, suggests HERVs as potential disease progression biomarkers linked to increased severity in aging 65. Surprisingly, we found for the first time that certain transposable elements were more highly expressed in BALF during pulmonary infection than in lung cancer.
The GSAT satellite, notably higher in bacterial and fungal infections (Supplementary Figure S4, adjusted p-value < 0.05) and regulated by AP-1, holds significance in various pulmonary diseases by impacting gene expression and the inflammatory cell activation that is crucial in pulmonary infections 66, 67. Several studies have employed copy number variants (CNV) from bodily fluids to diagnose pulmonary malignancies 16, 18, 68. In these investigations, whole genome testing of metagenomic data demonstrated heightened diagnostic accuracy for pulmonary malignancies in samples initially identified as negative via conventional testing. Intriguingly, our findings indicated no significant difference in the CNV-derived tumor fraction of BALF samples between lung cancer and pulmonary infections. This reiterates that relying solely on one-dimensional information from conventional BALF mNGS, which is characterized by low-depth sequencing, is insufficient for diagnosing intricate or multifaceted diseases. Despite the insights gained, our study has limitations. Firstly, our cohort lacked viral pneumonia cases due to reduced incidence during China's COVID-19 control measures. In fact, most viral pneumonia exhibits clinical and radiological differences from lung cancer, lessening the need for complex differential diagnostics compared with infections caused by bacteria, fungi, and mycobacteria. Secondly, our study focused on distinguishing infection from cancer, and therefore the established model cannot address differentiation among the infection subgroups. We are conducting another study to develop diagnostic models for distinguishing between infection subgroups, and some progress has been made thus far. In conclusion, we report that integrated host and microbe information from BAL nucleic acid enables accurate diagnosis of lung cancer and pulmonary infections.
Future studies are needed to validate and test the clinical impact of this culture-independent diagnostic approach.

Declarations

Data and materials availability: (1) Code Availability. Essential scripts for implementing the machine learning-based integrative procedure in multiple independent datasets are available on GitHub (https://github.com/whybeeVM/Metagenomic-Analysis-of-Lung-Cancer-and-Pulmonary-Infections). (2) Data Availability. Microbial reads from DNA and RNA mNGS data were deposited in NCBI's Sequence Read Archive (SRA) database under project number PRJNA1056765. Host gene expression profiles derived from RNA sequencing data were deposited in GSE252118.

Conflict of Interest Statement: The authors declare no competing interests.

Funding: This study was supported by the National Key R&D Program of China (2023YFC2308300), the "Leading Geese" Research and Development Plan of Zhejiang Province (No. 2024C03218), and the National Natural Science Foundation of China (No. 82472371).

Author Contribution: Study design, D.H., H.Z., S.Z. and Y.C.; Data collection, F.Y., B.L. and H.T.; Data analysis, D.H., Y.S. and Y.C.; Wrote the paper, D.H. and B.Y. All authors have read and approved the final version of the manuscript.

Acknowledgement: We thank all clinicians who provided detailed diagnostic and treatment data of patients for our study, as well as all infectious disease (ID) physicians, clinical microbiologists and oncologists who received our clinical consultations.

References

Kreier F. Cancer will cost the world $25 trillion over next 30 years. Nature. 2023. Agusti A, Vogelmeier CF, Halpin DMG. Tackling the global burden of lung disease through prevention and early diagnosis. The Lancet Respiratory Medicine. 2022;10(11):1013-1015. Mckelvy BJ, Araujo-Filho JAB, Godoy MCB, et al. Infectious Diseases That May Mimic Lung Cancer. In: Moran CA, Truong MT, de Groot PM, editors. The Thorax: Medical, Radiological, and Pathological Assessment.
Cham: Springer International Publishing; 2023. p. 827-851. Newman-Toker DE, Schaffer AC, Yu-Moe CW, et al. Serious misdiagnosis-related harms in malpractice claims: The "Big Three" - vascular events, infections, and cancers. Diagnosis (Berlin, Germany). 2019;6(3):227. Guimarães MD, Marchiori E, de Souza Portes Meirelles G, et al. Fungal Infection Mimicking Pulmonary Malignancy: Clinical and Radiological Characteristics. Lung. 2013;191(6):655-662. Fabre V, Davis A, Diekema DJ, et al. Principles of diagnostic stewardship: A practical guide from the Society for Healthcare Epidemiology of America Diagnostic Stewardship Task Force. Infection Control & Hospital Epidemiology. 2023;44(2):178-185. Blauwkamp TA, Thair S, Rosen MJ, et al. Analytical and clinical validation of a microbial cell-free DNA sequencing test for infectious disease. Nat Microbiol. 2019;4(4):663-674. Miller S, Naccache SN, Samayoa E, et al. Laboratory validation of a clinical metagenomic sequencing assay for pathogen detection in cerebrospinal fluid. Genome Res. 2019;29(5):831-842. Diao Z, Lai H, Han D, et al. Validation of a Metagenomic Next-Generation Sequencing Assay for Lower Respiratory Pathogen Detection. Microbiology Spectrum. 2023;11(1). Chiu CY, Miller SA. Clinical metagenomics. Nat Rev Genet. 2019;20(6):341-355. Diao Z, Han D, Zhang R, et al. Metagenomics next-generation sequencing tests take the stage in the diagnosis of lower respiratory tract infections. J Adv Res. 2022;38:201-212. Edgeworth JD. Respiratory metagenomics: route to routine service. Curr Opin Infect Dis. 2023;36(2):115-123. Ramachandran PS, Ramesh A, Creswell FV, et al. Integrating central nervous system metagenomics and host response for diagnosis of tuberculosis meningitis and its mimics. Nat Commun. 2022;13(1). Kalantar KL, Neyton L, Abdelghany M, et al. Integrated host-microbe plasma metagenomics for sepsis diagnosis in a prospective cohort of critically ill adults. Nat Microbiol. 2022;7(11):1805-1816. 
Langelier C, Kalantar KL, Moazed F, et al. Integrating host response and unbiased microbe detection for lower respiratory tract infection diagnosis in critically ill adults. Proceedings of the National Academy of Sciences. 2018;115(52):E12353-E12362. Gu W, Talevich E, Hsu E, et al. Detection of cryptogenic malignancies from metagenomic whole genome sequencing of body fluids. Genome Med. 2021;13(1). Gu W, Rauschecker AM, Hsu E, et al. Detection of Neoplasms by Metagenomic Next-Generation Sequencing of Cerebrospinal Fluid. Jama Neurol. 2021;78(11):1355-1366. Guo Y, Li H, Chen H, et al. Metagenomic next-generation sequencing to identify pathogens and cancer in lung biopsy tissue. Ebiomedicine. 2021;73:103639. Travis, W. D. et al. The 2015 World Health Organization Classification of Lung Tumors: Impact of Genetic, Clinical and Radiologic Advances Since the 2004 Classification. J. Thorac. Oncol. 10, 1243-1260 (2015). Sulaiman I, Chung M, Angel L, et al. Microbial signatures in the lower airways of mechanically ventilated COVID-19 patients associated with poor clinical outcome. Nat Microbiol. 2021;6(10):1245-1258. Kopylova E, Noé L, Touzet H. SortMeRNA: fast and accurate filtering of ribosomal RNAs in metatranscriptomic data. Bioinformatics. 2012 Dec 15;28(24):3211-7. Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30(15):2114-2120. Yan Z, Chen B, Yang Y,et.al. Multi-omics analyses of airway host-microbe interactions in chronic obstructive pulmonary disease identify potential therapeutic interventions. Nat Microbiol. 2022 Sep;7(9):1361-1375. Haddock NL, Barkal LJ, Ram-Mohan N, et al. Phage diversity in cell-free DNA identifies bacterial pathogens in human sepsis cases. Nat Microbiol. 2023;8(8):1495-1507. Segata N, Izard J, Waldron L, et al. Metagenomic biomarker discovery and explanation. Genome Biol. 2011;12(6):R60. Kim D, Paggi JM, Park C, et al. 
Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat Biotechnol. 2019;37(8):907-915. Liao Y, Smyth GK, Shi W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics. 2014;30(7):923-930. Jin Y, Tam OH, Paniagua E, et al. TEtranscripts: a package for including transposable elements in differential expression analysis of RNA-seq datasets. Bioinformatics. 2015;31(22):3593-3599. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15(12):550. Subramanian A, Tamayo P, Mootha VK, et al. Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles. Proceedings of the National Academy of Sciences - Pnas. 2005;102(43):15545-15550. Kanehisa M, Furumichi M, Tanabe M, et al. KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res. 2017;45(D1):D353-D361. Gillespie M, Jassal B, Stephan R, et al. The reactome pathway knowledgebase 2022. Nucleic Acids Res. 2022;50(D1):D687-D692. Schoggins JW, Wilson SJ, Panis M, et al. A diverse range of gene products are effectors of the type I interferon antiviral response. Nature. 2011;472(7344):481-485. Steen CB, Liu CL, Alizadeh AA, et al. Profiling Cell Type Abundance and Expression in Bulk Tissues with CIBERSORTx. Methods Mol Biol. 2020;2117:135-157. Mao W, Zaslavsky E, Hartmann BM, et al. Pathway-level information extractor (PLIER) for gene expression data. Nat Methods. 2019;16(7):607-610. Adalsteinsson VA, Ha G, Freeman SS, et al. Scalable whole-exome sequencing of cell-free DNA reveals high concordance with metastatic tumors. Nat Commun. 2017;8(1):1313-1324. Talevich E, Shain AH, Botton T, et al. CNVkit: Genome-Wide Copy Number Detection and Visualization from Targeted DNA Sequencing. Plos Comput Biol. 2016;12(4):e1004873. Yoshihara K, Shahmoradgoli M, Martinez E, et al. 
Inferring tumour purity and stromal and immune cell admixture from expression data. Nat Commun. 2013;4:2612. Lang M, Binder M, Richter J, et.al. mlr3: A modern object-oriented machine learning framework in R. Journal of Open Source Software (2019). Mayhew MB, Buturovic L, Luethy R, et al. A generalizable 29-mRNA neural-network classifier for acute bacterial and viral infections. Nat Commun. 2020;11(1):1177. Ren L, Wang Y, Zhong J, et al. Dynamics of the Upper Respiratory Tract Microbiota and Its Association with Mortality in COVID-19. Am J Respir Crit Care Med. 2021;204(12):1379-1390. Bhattacharya S, Dunn P, Thomas CG, et al. ImmPort, toward repurposing of open access immunological assay data for translational and clinical research. Sci Data. 2018;5:180015. Nakayama T, Lee IT, Le W, et al. Inflammatory molecular endotypes of nasal polyps derived from White and Japanese populations. J Allergy Clin Immun. 2022;149(4):1296-1308. Korbecki J, Kojder K, Simińska D, et al. CC Chemokines in a Tumor: A Review of Pro-Cancer and Anti-Cancer Properties of the Ligands of Receptors CCR1, CCR2, CCR3, and CCR4. Int J Mol Sci. 2020;21(21):8412. Liu Z, Rui T, Lin Z, et al. Tumor-Associated Macrophages Promote Metastasis of Oral Squamous Cell Carcinoma via CCL13 Regulated by Stress Granule. Cancers (Basel). 2022;14(20). Diao Z, Han D, Zhang R, et al. Metagenomics next-generation sequencing tests take the stage in the diagnosis of lower respiratory tract infections. J Adv Res. 2021. Charalampous T, Alcolea-Medina A, Snell LB, et al. Evaluating the potential for respiratory metagenomics to improve treatment of secondary infection and detection of nosocomial transmission on expanded COVID-19 intensive care units. Genome Med. 2021;13(1):182. Charalampous T, Kay GL, Richardson H, et al. Nanopore metagenomics enables rapid clinical diagnosis of bacterial lower respiratory infection. Nat Biotechnol. 2019;37(7):783-792. Mick E, Tsitsiklis A, Kamm J, et al. 
Integrated host/microbe metagenomics enables accurate lower respiratory tract infection diagnosis in critically ill children. J Clin Invest. 2023;133(7). Davidson KR, Ha DM, Schwarz MI, et al. Bronchoalveolar lavage as a diagnostic procedure: a review of known cellular and molecular findings in various lung diseases. J Thorac Dis. 2020;12(9):4991-5019. Chellapandian D, Lehrnbecher T, Phillips B, et al. Bronchoalveolar lavage and lung biopsy in patients with cancer and hematopoietic stem-cell transplantation recipients: a systematic review and meta-analysis. J Clin Oncol. 2015;33(5):501-509. Mayhew MB, Buturovic L, Luethy R, et al. A generalizable 29-mRNA neural-network classifier for acute bacterial and viral infections. Nat Commun. 2020;11(1). Ran Z, Liu J, Wang F, et al. Pulmonary Micro-Ecological Changes and Potential Microbial Markers in Lung Cancer Patients. Front Oncol. 2020;10:576855. Lee SH, Sung JY, Yong D, et al. Characterization of microbiome in bronchoalveolar lavage fluid of patients with lung cancer comparing with benign mass like lesions. Lung Cancer. 2016;102:89-95. Dickson RP, Huffnagle GB. The Lung Microbiome: New Principles for Respiratory Bacteriology in Health and Disease. Plos Pathog. 2015;11(7):e1004923. Man WH, de Steenhuijsen Piters WAA, Bogaert D. The microbiota of the respiratory tract: gatekeeper to respiratory health. Nature Reviews. Microbiology. 2017;15(5):259-270. Da CSG, Shepherd FA, Tsao MS. EGFR mutations and lung cancer. Annu Rev Pathol. 2011;6:49-69. Sweeney TE, Braviak L, Tato CM, et al. Genome-wide expression for diagnosis of pulmonary tuberculosis: a multicohort analysis. The Lancet Respiratory Medicine. 2016;4(3):213-224. Schmiedel D, Mandelboim O. NKG2D Ligands-Critical Targets for Cancer Immune Escape and Therapy. Front Immunol. 2018;9:2040. Gowen BG, Chim B, Marceau CD, et al. A forward genetic screen reveals novel independent regulators of ULBP1, an activating ligand for natural killer cells. Elife. 2015;4. 
Schmall A, Al-Tamari HM, Herold S, et al. Macrophage and cancer cell cross-talk via CCR2 and CX3CR1 is a fundamental mechanism driving lung cancer. Am J Respir Crit Care Med. 2015;191(4):437-447. Jeffries AR, Mungall AJ, Dawson E, et al. beta-1,3-Glucuronyltransferase-1 gene implicated as a candidate for a schizophrenia-like psychosis through molecular analysis of a balanced translocation. Mol Psychiatry. 2003;8(7):654-663. Lemaitre C, Tsang J, Bireau C, et al. A human endogenous retrovirus-derived gene that can contribute to oncogenesis by activating the ERK pathway and inducing migration and invasion. Plos Pathog. 2017;13(6):e1006451. Jin X, Xu XE, Jiang YZ, et al. The endogenous retrovirus-derived long noncoding RNA TROJAN promotes triple-negative breast cancer progression via ZMYND8 degradation. Sci Adv. 2019;5(3):eaat9820. Kitsou K, Kotanidou A, Paraskevis D, et al. Upregulation of Human Endogenous Retroviruses in Bronchoalveolar Lavage Fluid of COVID-19 Patients. Microbiol Spectr. 2021;9(2):e126021. Wang A, Al-Kuhlani M, Johnston SC, et al. Transcription factor complex AP-1 mediates inflammation initiated by Chlamydia pneumoniae infection. Cell Microbiol. 2013;15(5):779-794. Arancio W, Coronnello C. Repetitive Sequence Transcription in Breast Cancer. Cells (Basel, Switzerland). 2022;11(16):2522. Lin P, Chen Y, Xu J, et al. A multicenter-retrospective cohort study of chromosome instability in lung cancer: clinical characteristics and prognosis of patients harboring chromosomal instability detected by metagenomic next-generation sequencing. J Thorac Dis. 2023;15(1):112-122. Additional Declarations No competing interests reported. 
Supplementary Files
2.supplementarymaterials.docx
3.SupplementaryDataS1S3.xlsx

Editorial decision: Revision requested 27 May, 2025
Reviews received at journal 25 May, 2025
Reviewers agreed at journal 16 May, 2025
Reviews received at journal 24 Apr, 2025
Reviewers agreed at journal 18 Apr, 2025
Reviewers invited by journal 28 Feb, 2025
Editor assigned by journal 27 Feb, 2025
Submission checks completed at journal 27 Feb, 2025
First submitted to journal 25 Feb, 2025
22:53:06","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-6108429/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-6108429/v1","draftVersion":[],"editorialEvents":[{"content":"https://doi.org/10.1038/s41746-025-01977-5","type":"published","date":"2025-10-07T15:57:25+00:00"}],"editorialNote":"","failedWorkflow":false,"files":[{"id":77574281,"identity":"f94fdbf1-f0b5-47da-b491-3de8a7daa11c","added_by":"auto","created_at":"2025-03-03 08:44:47","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":324840,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eStudy overview and analysis workflow.\u003c/strong\u003e \u003cstrong\u003e(A), \u003c/strong\u003eEnrollment flow diagram for the patients with suspected lung cancer or pneumonia that was studied. The patients were divided into pulmonary infection group (red) and lung cancer group (orange). The pulmonary infection group was further divided into bacterial infection group (sky blue), fungal infection group (malachite green) and tuberculosis group (dark blue). \u003cstrong\u003e(B), \u003c/strong\u003eStrategies for establishing the training and validation sets and the number of patients in different comparison groups. \u003cstrong\u003e(C),\u003c/strong\u003eGraphical scheme of development and validation a microbe/host mNGS diagnostic approach for the differential diagnosis of lung cancer and pulmonary infections. 
These comparisons encompassed six models: Model I: Microbial and Bacteriophage DNA relative abundances; Model II: Microbial and Bacteriophage RNA relative abundances; Model III: Host gene expression and composition of immune cell; Model IV: Transposable elements expressed levels; Model V: CNV-derived tumor fraction and Model VI: integrating all these features.\u003c/p\u003e","description":"","filename":"floatimage1.png","url":"https://assets-eu.researchsquare.com/files/rs-6108429/v1/1be32a4dd67a3981ec24b70f.png"},{"id":77572547,"identity":"4f780075-bdcb-4c14-b573-b904f4cc7c30","added_by":"auto","created_at":"2025-03-03 08:36:47","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":633180,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eMicrobial and bacteriophage composition analyses in BALF DNA mNGS data.\u003c/strong\u003e \u003cstrong\u003e(A),\u003c/strong\u003e PCoA based on Bray–Curtis dissimilarity index of microbial and bacteriophage composition in comparing lung cancer and pulmonary infection (single-variable PERMANOVA, P value). PCoA1, principal component 1; PCoA2, principal component 2. \u003cstrong\u003e(B), \u003c/strong\u003ePCoA based on Bray–Curtis dissimilarity index of BALF mNGS data in comparing lung cancer and infection subgroups (single-variable PERMANOVA, P value). \u003cstrong\u003e(C),\u003c/strong\u003eBubble plot displaying Lefse analysis results and the relative abundance of microorganisms consistently differentially enriched across various pulmonary disease groups. The size of each bubble corresponds to the median relative abundance of statistically significant findings. A red dashed line delineates positive (on the right) and negative (on the left) fold changes. Bubbles indicate statistical significance (adjusted p-value \u0026lt; 0.05). 
Distinct bubble colors denote different pulmonary diseases, matching the labels atop the bubble graph, illustrating the enrichment of specific species within each pulmonary disease group.\u003c/p\u003e","description":"","filename":"floatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-6108429/v1/640dec690c59169614262621.png"},{"id":77572554,"identity":"53cecfe7-6823-4db7-9dc3-9fb0671e2987","added_by":"auto","created_at":"2025-03-03 08:36:47","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":540992,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eHost immune profiling in different pulmonary diseases. (A),\u003c/strong\u003e Normalized enrichment scores of selected KEGG terms that reached statistical significance (adjusted p-value \u0026lt; 0.05) in the gene set enrichment analysis (GSEA) using differentially expressed (DE) genes between Lung Cancer and pulmonary infectious groups. (\u003cstrong\u003eB), \u003c/strong\u003eHeatmap of 39 significantly differential expressed latent variables (LVs) which had biological function. (Left) Heatmap of differential LVs on average of each group. C-F. GSEA of Cell Cycle, Cytokine-cytokine receptor interaction, Interferon Signaling and Innate Immune System. Each line representing one particular gene set with unique color, and up-regulated genes located in the left approaching the origin of the coordinates, by contrast the down-regulated lay on the right of x-axis. Only gene sets with NOM p-value \u0026lt; 0.05 and FDR q-value \u0026lt; 0.05 were considered significant. 
Only a few leading gene sets are displayed in the plot.\u003c/p\u003e","description":"","filename":"floatimage3.png","url":"https://assets-eu.researchsquare.com/files/rs-6108429/v1/1ecc689c968fed9e79a737e1.png"},{"id":77572586,"identity":"ab9ed193-00b7-4ef5-9989-f83428b3acbc","added_by":"auto","created_at":"2025-03-03 08:36:50","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":252133,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eHost immune cell profiling in different pulmonary diseases. \u003c/strong\u003e(\u003cstrong\u003eA), \u003c/strong\u003eIn silico estimation of cell-type proportions in the bulk RNA-sequencing data using single-cell signatures. Cell-type abundance quantification plots. Comparison of the abundance of Macrophage M1 (\u003cstrong\u003eB\u003c/strong\u003e), Monocytes (\u003cstrong\u003eC\u003c/strong\u003e), Neutrophils (\u003cstrong\u003eD\u003c/strong\u003e) and Mast cells activated (\u003cstrong\u003eE\u003c/strong\u003e) among pulmonary disease groups in the BAL fluids. P-values were obtained using the Wilcoxon rank-sum test (two-sided); *,**,*** represent significance between two groups and N.S. represents no significance between two groups.\u003c/p\u003e","description":"","filename":"floatimage4.png","url":"https://assets-eu.researchsquare.com/files/rs-6108429/v1/e756806ef0cc40ea3236df2e.png"},{"id":77572555,"identity":"b4b7043b-0200-4ccf-8c14-b3449c1bc1ae","added_by":"auto","created_at":"2025-03-03 08:36:47","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":463149,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eHost/microbe metagenomics-based modelling for lung cancer and pulmonary infection diagnosis. 
(A), \u003c/strong\u003eA total of 58 prediction models (4 individual machine learning models: LASSO, Random Forest, XGboost and Support Vector Machine and 54 ensemble models) via Model VI and further calculated AUC of each model across all comparisons in validation datasets. Red label indicated the model with highest average AUC of all comparison.\u003cstrong\u003e (B), \u003c/strong\u003eThe area under the curve (AUC), along with median values and 95% confidence intervals, was calculated for Receiver Operating Characteristic (ROC) curve analyses using various datasets. The training datasets were represented by black squares with error bars, while the validation datasets were denoted by red triangles. \u003cstrong\u003e(C-F), \u003c/strong\u003eROC of validation datasets for classifying lung cancer versus pulmonary infection. Delong test was used for Comparing Two ROC Curves-Paired Design. *,**,*** represent significance between two groups and N.S. represents no significance between two groups. \u003cstrong\u003e(G-L),\u003c/strong\u003e Color distinctions represent various groups associated with pulmonary diseases. The median is visually depicted by black lines. The y-axis in each panel was trimmed at the maximum value among all groups of 1.5*IQR above the third quartile, where IQR is the interquartile range. For each host gene/transposable element (TE), we conducted formal comparisons among groups within the training cohort. 
Pairwise comparisons were performed with a Mann-Whitney test followed by Holm’s correction for multiple testing.\u003c/p\u003e","description":"","filename":"floatimage5.png","url":"https://assets-eu.researchsquare.com/files/rs-6108429/v1/1395cd5bbf282fd26802c9bf.png"},{"id":77572551,"identity":"c467aa1f-2789-4c9f-b6eb-8a653b10b046","added_by":"auto","created_at":"2025-03-03 08:36:47","extension":"png","order_by":6,"title":"Figure 6","display":"","copyAsset":false,"role":"figure","size":308104,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eThe performance of composite predictive model using a rule-in and rule-out strategy. (A), \u003c/strong\u003eIn the validation cohort comparing lung cancer and bacterial infections, only 2 cancer patients were incorrectly classified as bacterial infections (rule-out-band), while 8 bacterial infections were wrongly classified as cancer (rule-in-band). The overall identification accuracy was 90.7% (49/54). \u003cstrong\u003e(B),\u003c/strong\u003e In the validation cohort comparing lung cancer and fungal infections, only 1 cancer patient was incorrectly classified as fungal infection (rule-out-band), and 1 fungal infection was wrongly classified as cancer (rule-in-band). The overall identification accuracy was 91.5% (43/47). \u003cstrong\u003e(C), \u003c/strong\u003eIn the validation cohort comparing lung cancer and pulmonary tuberculosis, only 4 tuberculosis patients were incorrectly classified as cancer (rule-in-band), with an overall identification accuracy of 89.6% (43/48). 
\u003cstrong\u003e(D), \u003c/strong\u003eThe flowchart of our ensemble model and rule-in/rule-out strategy for pulmonary diseases diagnosis in clinical settings.\u003c/p\u003e","description":"","filename":"floatimage6.png","url":"https://assets-eu.researchsquare.com/files/rs-6108429/v1/b13f94ecb4d585ef15437f68.png"},{"id":93420503,"identity":"2ebb124f-7ccf-428c-8ed1-b41d64dff1e6","added_by":"auto","created_at":"2025-10-13 16:10:07","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":4013759,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-6108429/v1/1ae22490-0113-4b1e-a939-5ac7ab416d9a.pdf"},{"id":77572571,"identity":"d612e6e4-3275-4141-b630-2ab8b042efe4","added_by":"auto","created_at":"2025-03-03 08:36:48","extension":"docx","order_by":0,"title":"","display":"","copyAsset":false,"role":"supplement","size":9808801,"visible":true,"origin":"","legend":"","description":"","filename":"2.supplementarymaterials.docx","url":"https://assets-eu.researchsquare.com/files/rs-6108429/v1/69e1dd278fe888c42781d96d.docx"},{"id":77572548,"identity":"ff3e5e2f-32d8-4b68-95b4-74c4edbe8c91","added_by":"auto","created_at":"2025-03-03 08:36:47","extension":"xlsx","order_by":1,"title":"","display":"","copyAsset":false,"role":"supplement","size":55493,"visible":true,"origin":"","legend":"","description":"","filename":"3.SupplementaryDataS1S3.xlsx","url":"https://assets-eu.researchsquare.com/files/rs-6108429/v1/7aeb175b4c1f594860e9f869.xlsx"}],"financialInterests":"No competing interests reported.","formattedTitle":"Multimodal Metagenomic Profiling of Bronchoalveolar Lavage Fluid for Diagnostic Classification of Pulmonary Diseases","fulltext":[{"header":"Introduction","content":"\u003cp\u003eLung cancer and pulmonary infections pose significant global health challenges with high incidence, mortality rates, and substantial socioeconomic 
burdens\u003csup\u003e\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e,\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e\u003c/sup\u003e. Clinicians often struggle to differentiate them based solely on clinical and radiological features, lacking rapid and accurate histopathological or microbiological test results. This leads to misdiagnoses and delays or incorrect treatments\u003csup\u003e\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e,\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eVarious pathogens causing pulmonary infections, such as bacteria (\u003cem\u003ePseudomonas\u003c/em\u003e, \u003cem\u003eStreptococcus\u003c/em\u003e), mycobacteria (\u003cem\u003eMycobacterium tuberculosis\u003c/em\u003e, \u003cem\u003eNon-tuberculous mycobacteria\u003c/em\u003e), aerobic actinomycetes (\u003cem\u003eNocardia\u003c/em\u003e), fungi (\u003cem\u003eAspergillus\u003c/em\u003e, \u003cem\u003eMucor\u003c/em\u003e, \u003cem\u003ecryptococcus\u003c/em\u003e), and others, can mimic lung cancer, sharing indistinguishable clinical symptoms (e.g., dyspnea, fatigue, cough, and hemoptysis) and radiographic features (e.g., spiculated solid nodules or masses, cavities with nodular margins, and chest wall and mediastinal invasion)\u003csup\u003e\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e,\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e\u003c/sup\u003e. Consequently, clinicians often employ multiple testing methods to detect lung infections and cancer\u003csup\u003e\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e\u003c/sup\u003e. 
An affordable diagnostic method requiring fewer samples, aiding clinicians in quicker and accurate decisions, would greatly benefit patient treatment and management.\u003c/p\u003e \u003cp\u003eMetagenomic Next-generation Sequencing (mNGS) is a sequencing technology capable of identifying pathogens in specimens with microbial nucleic acid concentrations beyond detection limits within 24 hours or even less\u003csup\u003e\u003cspan additionalcitationids=\"CR8\" citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e\u003c/sup\u003e. In recent years, it has been widely employed in the diagnosis of various complex infectious diseases and has been confirmed a powerful tool with an excellent diagnostic accuracy in detecting pneumonia-related pathogens\u003csup\u003e\u003cspan additionalcitationids=\"CR11\" citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e\u003c/sup\u003e. Excitingly, recent studies have confirmed that analyzing transcriptomic data derived from human sequences of mNGS testing can aid in distinguishing infectious diseases such as sepsis, acute respiratory infections, tuberculous meningitis, and non-infectious diseases\u003csup\u003e\u003cspan additionalcitationids=\"CR14\" citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e\u003c/sup\u003e. Developing intelligent algorithms based on chromosomal instability and tumor-related copy number variations generated by mNGS data is useful to diagnose malignant tumors\u003csup\u003e\u003cspan additionalcitationids=\"CR17\" citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e\u003c/sup\u003e. 
These studies prompt us to further contemplate whether it is possible to utilize mNGS data from respiratory tract samples to establish an integrative genomic diagnostic method that combines microbial and host response characteristics of the patients. This method is anticipated to identify pulmonary infectious diseases that can be mistaken for lung cancer without escalating patient testing expenses, utilizing minimal tests and samples, and within a relatively short timeframe.\u003c/p\u003e \u003cp\u003eHere, we conducted mNGS testing on bronchoalveolar lavage fluid samples (BALF-mNGS) from 402 clinical patients with lung cancer or pulmonary infections. Subsequently, we analyzed the microbial information and host response information derived from metagenomic sequencing data, and based on this, we established and validated an integrated host/microbe metagenomics-driven machine learning approach for the differential diagnosis of lung cancer and pulmonary infections.\u003c/p\u003e"},{"header":"Materials and Methods","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003e1. Study design, patient collection and ethics statement\u003c/h2\u003e \u003cp\u003eThis observational study assessed adults admitted to the First Affiliated Hospital, Zhejiang University School of Medicine (FAHZU), suspected of lung cancer or pulmonary infections. Enrollment occurred between March 8, 2020, and May 27, 2023, for patients aged\u0026thinsp;\u0026ge;\u0026thinsp;18, requiring BALF samples within 72 hours of intubation to establish the etiology. Exclusions involved cases with underlying leukemia, no definitive diagnosis post-extensive follow-up, or lacking matching DNA and RNA mNGS data from BALF samples (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003eA). 
A total of 123 lung cancer, 279 pulmonary infections including tuberculosis, fungal, and bacterial infections, and 32 negative control cases (e.g., immune pneumonitis, organizing pneumonia and drug-related pneumonia) were included. The diagnosis of lung cancer relies on clinical suspicion and positive laboratory results from cytology, flow cytometry and/or tissue biopsy. Pathological information of all samples was determined based on surgically resected tissue sections according to the 2015 WHO Histological Classification of Lung Cancer\u003csup\u003e\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e\u003c/sup\u003e. The diagnosis of pulmonary infections is based on clinical suspicion and determination of the causative pathogen through standard microbiological diagnostics (cultures, antigen/antibody tests, PCR, sequencing; see Supplementary Data S1). Archival material at FAHZU was retrospectively analyzed under no-patient contact protocols approved by the FAHZU Institutional Review Board (IIT20220714A). Written consent given prior to the procedure used to obtain the sample covered the use of residual samples for research. We then constructed the training and validation sets in chronological order of sample collection date. We ranked all lung cancer samples by collection time and split them into the first 70% and the last 30% (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003eB, Supplementary Data S1). The training set was used for differential analysis, feature selection and ensemble model training. The validation set was used for performance validation and for the combined rule-in/rule-out predictions.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003e2. 
DNA/RNA extraction, library construction and sequencing\u003c/h3\u003e\n\u003cp\u003eFor metagenomic sequencing (DNA sequencing), 1 mL of BALF sample was subjected to depletion of host nucleic acid using 1 U benzonase (Sigma) and 0.5% Tween 20 (Sigma) and incubation at 37\u0026deg;C for 5 min. A total of 600 \u0026micro;L of the mixture was transferred to new tubes containing 500 \u0026micro;L of ceramic beads for bead beating using a Minilys Personal TGrinder H24 Homogenizer (catalogue number: OSE-TH-01, Tiangen, China). Then, the nucleic acid from 400 \u0026micro;L of the pretreated sample was extracted and eluted in 60 \u0026micro;L elution buffer using a QIAamp UCP Pathogen Mini Kit (catalogue number: 50214, Qiagen, Germany). The extracted DNA was quantified using a Qubit dsDNA HS Assay Kit (catalogue number: Q32854, Invitrogen, USA)\u003csup\u003e\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e\u003c/sup\u003e. For metatranscriptome sequencing (RNA sequencing), 1 mL BALF sample was centrifuged at 12,000 rpm for 10 min. Then, 200 \u0026micro;L of the precipitate was lysed in TRIzol LS (Thermo Fisher Scientific, Carlsbad, CA, USA), followed by RNA extraction using a Direct-zol RNA Miniprep kit (Zymo Research, Irvine, CA, USA) according to the manufacturer's instructions\u003csup\u003e\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eAccording to the manufacturer's instructions, 30 \u0026micro;L DNA was used to generate libraries with the Nextera DNA Flex kit (Illumina, San Diego, CA, USA), and 10 \u0026micro;L of purified RNA was used for cDNA generation and library preparation with an Ovation Trio RNA-Seq Library Preparation Kit (NuGEN, CA, USA). A Qubit dsDNA HS Assay Kit was used to measure the library concentration. The library quality was assessed with an Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA) and a High Sensitivity DNA kit. 
The library was sequenced using an Illumina NextSeq 550 sequencer with a 75-cycle single-end sequencing strategy\u003csup\u003e\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e,\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e\n\u003ch3\u003e3. Microbial annotation, community structure comparison and differential taxon analysis\u003c/h3\u003e\n\u003cp\u003eAs previously described, we used a validated mNGS sequencing pipeline for microbial composition analysis\u003csup\u003e\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e,\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e\u003c/sup\u003e. In brief, Trimmomatic was used to remove low-quality, duplicate, and \u0026lt;\u0026thinsp;50 bp reads, as well as adapter contamination\u003csup\u003e\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e\u003c/sup\u003e. Human sequences were excluded by mapping to the human reference genome (hg38) using SNAP v1.0beta\u003csup\u003e\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e\u003c/sup\u003e. SortMeRNA v4.3.7 was used for ribosomal RNA removal\u003csup\u003e\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e\u003c/sup\u003e. Kraken2 v.2.0.7 and Bracken v.2.5 were used to create taxonomic profiles with default settings and the default database (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://benlangmead.github.io/aws-indexes/k2\u003c/span\u003e\u003cspan address=\"https://benlangmead.github.io/aws-indexes/k2\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e)\u003csup\u003e9,20\u003c/sup\u003e. Sequencing reads for detected microbes were normalized as RPM (reads per million) to correct for differences in sequencing depth. 
The BALF mNGS data from 32 non-infection and non-cancer cases were used as negative controls (NC, Supplementary Data S3). Further analysis was performed to identify possible contaminants in the DNA/RNA mNGS datasets. To this end, we compared the relative abundance of taxa between background bronchoscope control and BAL samples. Taxa were identified as probable contaminants and removed if their median relative abundance was greater in the background control than in BAL samples, or if they were detected in more than 50% of NCs with an average relative abundance in NCs above 0.1%\u003csup\u003e\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e,\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eFor bacteriophage annotation, the cleaned reads were aligned against a curated phage database (CPD) containing 26,159 phage representative genomes using BLAST (word size: 18, e-value: 0.0005, culling limit: 1)\u003csup\u003e\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e\u003c/sup\u003e. Microbial and phage counting in DNA (DNA microbial abundances, DMA)/RNA (RNA microbial abundances, RMA) mNGS data relied on relative abundances\u003csup\u003e\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e,\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e,\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eα-diversity indices of the microbial composition in DNA/RNA mNGS data, including the Shannon, Simpson, Chao1, and ACE indices, were computed using the \u0026ldquo;vegan\u0026rdquo; package in R after sequence processing. Permutational multivariate ANOVA (PERMANOVA) was conducted using the \"vegan\" package to determine the difference in sample β-diversity (measured by Bray‒Curtis distance). 
Principal coordinates analysis (PCoA) was used to identify differences in microbial community structure. LEfSe was used to assess differences in microbial taxa and bacteriophages between groups\u003csup\u003e\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e\n\u003ch3\u003e4. Gene expression (GE), transposable element expression (TEE), cell-type composition analysis (CC)\u003c/h3\u003e\n\u003cp\u003eFor the analysis of host gene expression, high-quality data were aligned to the human genome hg38 using HISAT2 with default parameters. Gene-level quantification was performed using featureCounts\u003csup\u003e\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e,\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e\u003c/sup\u003e. The gene counts were aggregated using the featureCounts program from the Subread package release 2.0.0 (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttp://subread.sourceforge.net/\u003c/span\u003e\u003cspan address=\"http://subread.sourceforge.net/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e)\u003csup\u003e20,23\u003c/sup\u003e. Additionally, trimmed clean reads were mapped using STAR with previously defined parameters\u003csup\u003e\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e\u003c/sup\u003e. TEtranscripts software was used to estimate the abundances of transposable elements (TEs) and to conduct differential expression analysis. The GTF file containing transposable element annotations was obtained from \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://hammelllab.labsites.cshl.edu/software/#TEtranscripts\u003c/span\u003e\u003cspan address=\"https://hammelllab.labsites.cshl.edu/software/#TEtranscripts\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. 
All gene and TE counts were normalized and corrected for batch effects, and differential expression in each group was calculated using the DESeq2 package, applying criteria of FDR\u0026thinsp;\u0026le;\u0026thinsp;0.05 and fold-change\u0026thinsp;\u0026ge;\u0026thinsp;1.5\u003csup\u003e29\u003c/sup\u003e. Gene set enrichment analysis (GSEA) for DEGs was carried out using the REACTOME, KEGG, and GO databases\u003csup\u003e\u003cspan additionalcitationids=\"CR31\" citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e\u003c/sup\u003e. Significantly enriched pathways or biological processes were determined based on Fisher's exact test (p-value\u0026thinsp;\u0026lt;\u0026thinsp;0.05), following Benjamini and Hochberg's adjustment\u003csup\u003e\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e,\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e\u003c/sup\u003e. To estimate the relative proportions of infiltrating immune cell types, the CIBERSORT algorithm was applied with the original gene signature file LM22 and 1000 permutations\u003csup\u003e\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e\u003c/sup\u003e. Latent variables were calculated with the PLIER R package\u003csup\u003e\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e\u003c/sup\u003e. Continuous nonparametric data of latent variables and cell proportions were compared using the Mann‒Whitney U test. P values from multiple testing of latent variables were adjusted using the Benjamini-Hochberg procedure with a significance level of 0.05.\u003c/p\u003e\n\u003ch3\u003e5. 
Copy number variation (CNV)-derived tumor fraction calling\u003c/h3\u003e\n\u003cp\u003eThe DNA metagenomic sequencing data were used in downstream analyses to identify CNVs through ichorCNA\u003csup\u003e\u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e36\u003c/span\u003e\u003c/sup\u003e, CNVkit and the estimate software package to generate ctDNA tumor fractions, as previously described and validated in tumor tissue and body fluid\u003csup\u003e\u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e37\u003c/span\u003e,\u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e38\u003c/span\u003e\u003c/sup\u003e. The ichorCNA ploidy parameter restart value was set to 2 and the maximum copy number to use was lowered to 3. The tumor fraction with the highest log-likelihood was retrieved and reported. Continuous nonparametric data were compared using the Mann‒Whitney U test. P values less than 0.05 were considered statistically significant.\u003c/p\u003e \u003cdiv id=\"Sec8\" class=\"Section2\"\u003e \u003ch2\u003e6. Ensemble machine learning models for DMA, RMA, GE, TEE, CC and CNV\u003c/h2\u003e \u003cp\u003eWe first performed differential analysis to identify features significantly associated with disease type among microbial DNA/RNA relative abundances (DMA/RMA), transcripts per million (TPM) of host gene expression (GE), TPM of transposable elements (TEE), relative abundances of host cell types (CC) and the tumor fraction/CNV score (CNV), respectively. Within each type of data, features with an adjusted p value below the cut-off of 0.05 were selected. After obtaining all candidate features, we calculated the frequency of features in the training set for each classifier capable of performing feature selection across different data types, conducting 1,000 iterations to determine the occurrence frequency of each feature. 
We then selected the optimal combination of features through sequential forward selection. Using four different classifiers (LASSO, SVM, XGBoost, and Random Forest), we constructed six models (Model I-VI) in the training set. The LASSO was implemented via the glmnet package. The regularization parameter, λ, was determined by 10-fold cross-validation, whereas the L1-L2 trade-off parameter, α, was varied from 0 to 1 (interval\u0026thinsp;=\u0026thinsp;0.1). For the CoxBoost model, we first used the 10-fold optimBoostPenalty routine to determine the optimal penalty (amount of shrinkage). The SVM model was implemented via the svm package. The regression approach takes censoring into account when formulating the inequality constraints of the support vector problem. Random Forest had two parameters, ntree and mtry, where ntree was the number of trees in the forest and mtry was the number of randomly selected variables considered for splitting at each node. To integrate the cancer or infection probability scores predicted by LASSO, SVM, XGBoost, and Random Forest, we calculated the probability of cancer as:\u003c/p\u003e \u003cp\u003e \u003col\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003ePr(Cancer) = α*Pr(Classifier A) + β*Pr(Classifier B);\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003eα\u0026thinsp;+\u0026thinsp;β\u0026thinsp;=\u0026thinsp;1; Each bootstrap value was 0.1.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003c/ol\u003e \u003c/p\u003e \u003cp\u003eFor each model (I-VI), we selected the combination with the best mean AUC across all comparisons in the validation dataset. 
Finally, we chose the best classifier along with Model VI to conduct combined rule-in and rule-out predictions (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003eC, Supplementary Figure \u003cspan refid=\"MOESM1\" class=\"InternalRef\"\u003eS1\u003c/span\u003e). 
The R package \"mlr3\" was used to build the machine learning models\u003csup\u003e\u003cspan citationid=\"CR39\" class=\"CitationRef\"\u003e39\u003c/span\u003e\u003c/sup\u003e. The accuracy, sensitivity, and specificity of the prediction models were assessed using the AUC. The DeLong test was used to assess the significance of the difference between two ROC curves\u003csup\u003e\u003cspan citationid=\"CR40\" class=\"CitationRef\"\u003e40\u003c/span\u003e\u003c/sup\u003e. All data analyses were performed in RStudio with R version 4.1.0.\u003c/p\u003e \u003c/div\u003e"},{"header":"Results","content":"\u003cdiv id=\"Sec10\" class=\"Section2\"\u003e \u003ch2\u003e1. Clinical features of study cohort\u003c/h2\u003e \u003cp\u003eBased on the established criteria (Methods, Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003eA), we enrolled a total of 402 patients, consisting of 123 lung cancer patients and 279 patients with pulmonary infections. According to etiological findings, the infection group was further subdivided into three subgroups: pulmonary tuberculosis (n\u0026thinsp;=\u0026thinsp;86), fungal infection (n\u0026thinsp;=\u0026thinsp;79), and bacterial infection (n\u0026thinsp;=\u0026thinsp;114). Most patients, regardless of their subgroup, exhibited similar clinical and imaging characteristics, such as race (all were Chinese), underlying medical conditions, white blood cell (WBC) count, inflammatory indicators such as procalcitonin (PCT) and C-reactive protein (CRP), and chest computed tomography (CT) findings (e.g., patchy shadows and nodules, cavities, mediastinal lymphadenopathy) (Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e). The median mNGS DNA data per patient was 21.9\u0026nbsp;million reads (IQR 18.0\u0026ndash;27.6 M), with the vast majority of reads (\u0026gt;\u0026thinsp;95%) being human. 
The median mNGS RNA data per patient was 19.1\u0026nbsp;million reads (IQR 13.8\u0026ndash;26.2 M).\u003c/p\u003e \u003cp\u003eWe compared and screened the differential features within the mNGS data of the lung cancer and pulmonary infection groups in training cohort to establish a differential diagnosis approach for lung cancer and pulmonary infections. Subsequently, the lung cancer group was compared separately to the tuberculosis, fungal, and bacterial infection groups to develop a diagnostic method capable of rapidly distinguishing lung cancer from infections caused by different pathogens (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003eC, Supplementary Figure \u003cspan refid=\"MOESM1\" class=\"InternalRef\"\u003eS1\u003c/span\u003e).\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eDemographic and clinical characteristics of the enrolled patients\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"7\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCharacteristics\u003c/p\u003e 
\u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eOverall\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eLung Cancer\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eBacterial Infection\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eFungal Infection\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003eTuberculosis\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c7\"\u003e \u003cp\u003ep-value\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003ePatient demographics\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e\u0026nbsp;\u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eTotal number, n\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e402\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e123\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e114\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e79\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e86\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e\u0026nbsp;\u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e 
\u003cp\u003eAge (median [IQR])\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e59.50 [50.00, 67.50]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e58.00 [51.00, 69.50]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e57.00 [50.00, 66.00]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e57.00 [46.00, 69.00]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e57.50 [35.00, 67.75]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.114\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eSex\u0026thinsp;=\u0026thinsp;Male, n(%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e255(63.4)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e86(69.9)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e60(52.6)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e50(63.3)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e59(68.6)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.086\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eUnderlying conditions\u003c/b\u003e, \u003cb\u003en(%)\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" 
colname=\"c6\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e\u0026nbsp;\u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCardiovascular disease\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e69 (17.2)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e24 (19.5)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e18 (15.8)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e17 (21.5)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e10 (11.6)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.309\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eImmunological disease\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e22 (5.5)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e4 (3.3)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e7 (6.1)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e9 (11.4)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e2 (2.3)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.057\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLiver insufficiency\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e34 (8.5)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e9 (7.3)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e9 (7.9)\u003c/p\u003e 
\u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e11 (13.9)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e5 (5.8)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.293\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eRenal insufficiency\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e63 (15.7)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e17 (13.8)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e18 (15.8)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e15 (19.0)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e13 (15.1)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.507\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCOPD\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e126 (31.3)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e33 (26.8)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e39 (34.2)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e27 (34.2)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e27 (31.4)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.941\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCentral nervous system disorder\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e21 (5.2)\u003c/p\u003e 
\u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e7 (5.7)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e8 (7.0)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e5 (6.3)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e1 (1.2)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.22\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eHIV\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e3 (0.7)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0 (0.0)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0 (0.0)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e2 (2.5)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e1 (1.2)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.064\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eHypertension\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e99 (24.6)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e40 (32.5)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e27 (23.7)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e19 (24.1)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e13 (15.1)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.038\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" 
colname=\"c1\"\u003e \u003cp\u003eDiabetes\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e55 (13.7)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e17 (13.8)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e19 (16.7)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e8 (10.1)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e11 (12.8)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.643\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eLaboratory testing\u003c/b\u003e, \u003cb\u003emedian [IQR]\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e\u0026nbsp;\u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eWBC (\u0026times;10\u003csup\u003e9\u003c/sup\u003e/L)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e6.50 [5.05, 9.33]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e6.42 [4.95, 9.23]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e7.35 [5.32, 10.31]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e6.58 [4.65, 9.20]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e6.08 [5.07, 7.86]\u003c/p\u003e 
\u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.102\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eNEUT(%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e70.30 [61.82, 80.25]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e70.80 [63.35, 81.40]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e71.30 [61.95, 80.67]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e74.50 [60.50, 86.65]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e67.50 [60.20, 73.05]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.013\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCRP (mg/L)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e17.03 [3.30, 56.53]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e22.27 [4.73, 71.02]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e16.44 [3.30, 59.38]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e10.30 [3.21, 52.62]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e11.20 [3.55, 41.72]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.136\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003ePCT (ng/mL)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0.09 [0.04, 0.36]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.10 [0.04, 0.37]\u003c/p\u003e 
\u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.11 [0.05, 0.48]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.19 [0.04, 0.55]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.05 [0.05, 0.12]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.042\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eChest CT imaging features\u003c/b\u003e, \u003cb\u003en(%)\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e\u0026nbsp;\u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003ePulmonary emphysema\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e90 (22.4)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e39 (31.7)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e22 (19.3)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e17 (21.5)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e12 (14.0)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.018\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003ePulmonary nodule\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e 
\u003cp\u003e136 (33.8)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e52 (42.3)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e22 (19.3)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e32 (40.5)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e30 (34.9)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.001\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003ePulmonary cavity\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e50 (12.4)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e10 (8.1)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e14 (12.3)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e8 (10.1)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e18 (20.9)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.055\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGround-glass shadow\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e58 (14.4)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e23 (18.7)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e11 (9.6)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e15 (19.0)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e9 (10.5)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.095\u003c/p\u003e 
\u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMultiple patchy solid shadows\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e284 (70.6)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e82 (66.7)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e80 (70.2)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e53 (67.1)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e69 (80.2)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.078\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMalignant pleural effusion\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e144 (35.8)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e50 (40.7)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e37 (32.5)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e29 (36.7)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e28 (32.6)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.533\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003ePleural thickening\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e62 (15.4)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e28 (22.8)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e13 (11.4)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e11 
(13.9)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e10 (11.6)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.069\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMediastinal lymphadenopathy\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e146 (36.3)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e58 (47.2)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e35 (30.7)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e21 (26.6)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e32 (37.2)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.012\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e# Categorical data were compared using the chi-square test or Fisher's exact test.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec11\" class=\"Section2\"\u003e \u003ch2\u003e2. Microbial community structure and specific taxa of different pulmonary diseases\u003c/h2\u003e \u003cp\u003eMicrobial communities were assessed in a total of 284 samples within the training cohort, comprising 87 cases of lung cancer and 197 cases of pulmonary infection. 
Given the low biomass of BAL samples in the DNA/RNA mNGS data, we first identified probable contaminant taxa by calculating their frequencies and average relative abundances in negative controls (NCs) and comparing relative abundances between BAL samples and NCs (Supplementary Figure \u003cspan refid=\"MOESM2\" class=\"InternalRef\"\u003eS2\u003c/span\u003e and S3). 
In the overall comparison, DNA microbial α-diversity did not differ significantly between lung cancer and pulmonary infections for any index (Supplementary Figure S4A, Mann-Whitney U test, p-value\u0026thinsp;\u0026gt;\u0026thinsp;0.05). In the RNA data, however, Richness and Chao1 were higher and the Evenness index was lower in the lung cancer group (Supplementary Figure S4B). In the subgroup comparisons, both DNA and RNA data showed higher Richness and Chao1 and a lower Evenness index in the lung cancer and bacterial infection groups (Supplementary Figure S4C, D). In contrast, β-diversity analysis based on the Bray-Curtis distance indicated that the microbial composition of BALF samples from the cancer group was distinct from that of the infection group and of each infection subgroup (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eA, B, PERMANOVA, P\u0026thinsp;\u0026lt;\u0026thinsp;0.01). For the RNA data, both α-diversity (Supplementary Figure S4B, D, Mann-Whitney U test, p-value\u0026thinsp;\u0026lt;\u0026thinsp;0.05) and β-diversity (Supplementary Figure S5A, B, PERMANOVA, p-value\u0026thinsp;\u0026lt;\u0026thinsp;0.01) analyses of the microbiome supported distinct microbial community features in the lower airways among pulmonary diseases. To identify microorganisms specific to each pulmonary disease, we performed LEfSe analysis. The findings revealed a higher prevalence of \u003cem\u003eS. oralis\u003c/em\u003e, \u003cem\u003eP. micra\u003c/em\u003e, and \u003cem\u003eP. gingivalis\u003c/em\u003e, which are often regarded as oral or airway commensals, in lung cancer compared to pulmonary infection (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eC, LDA score\u0026thinsp;\u0026gt;\u0026thinsp;2, adjusted p-value\u0026thinsp;\u0026lt;\u0026thinsp;0.05). Conversely, pathogenic microorganisms commonly linked with infections, such as \u003cem\u003eM. 
tuberculosis\u003c/em\u003e, \u003cem\u003eP. aeruginosa\u003c/em\u003e, \u003cem\u003eA. fumigatus\u003c/em\u003e, and \u003cem\u003eC. neoformans\u003c/em\u003e, were more frequently detected in the pulmonary infection group. Notably, the anaerobic bacterium \u003cem\u003eF. nucleatum\u003c/em\u003e appeared as a specific microbe in bacterial infections (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eC). We suggest that this is because the bacterial infections in our study included some patients with lung abscesses. Additionally, we observed that \u003cem\u003eP. aeruginosa\u003c/em\u003e and \u003cem\u003eP. gingivalis\u003c/em\u003e served as specific microbes in different comparison groups, indicating that disease-specific microbes vary with the comparison (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eC). This suggests that although certain pathogens (e.g., \u003cem\u003eM. tuberculosis\u003c/em\u003e, \u003cem\u003eA. fumigatus\u003c/em\u003e, and \u003cem\u003eC. neoformans\u003c/em\u003e) have distinct microbial profiles that could serve as valuable indicators for diagnosing pulmonary diseases, differentiation among pulmonary diseases based on lung microbiota alone is limited by the complexity of the lung microbiome.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003cb\u003e3. Differences in host immune response, transposable element expression, and immune cell abundance among pulmonary diseases\u003c/b\u003e \u003c/p\u003e \u003cp\u003eFirst, to reduce the impact of ribosomal RNA on the usable data, we performed ribosomal RNA removal during sample processing. 
We observed that the percentages of eukaryotic rRNA (1.66%, IQR 1.01%-2.62%) and total rRNA (2.3%, IQR 1.38%-3.98%) were relatively low (Supplementary Table \u003cspan refid=\"MOESM1\" class=\"InternalRef\"\u003eS1\u003c/span\u003e, Figure S6). 
Additionally, we calculated the number of genes detected in each sample, finding a median of 17,827 (IQR 16,832.5\u0026ndash;18,738.5). Finally, to compare host immune responses between lung cancer and infection, we conducted BALF host gene expression analyses, which revealed substantial differences among groups, as shown in the volcano plots (Extended Data Fig.\u0026nbsp;7A-D). GSEA highlighted significant enrichment of differentially expressed genes (DEGs) in immune pathways such as T-cell receptor signaling and cytokine-cytokine receptor signaling (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eA). Applying PLIER to the training datasets, we delineated host transcriptomic profiles across 545 canonical pathways, identifying multiple differentially expressed latent variables (LVs) with distinct biological functions across groups (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eB, Mann-Whitney U test, adjusted p-value\u0026thinsp;\u0026lt;\u0026thinsp;0.05). Specifically, in the cancer group, lower airway transcriptomes exhibited upregulation of the cell cycle (LV102 and LV107). LV165, annotated to cytokine-cytokine receptor interaction pathways, was also upregulated, in contrast to LV86, which is annotated to the same pathways (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eB). 
Furthermore, we observed upregulation of interferon signaling and the innate immune system in infection groups (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eD, E), notably driven by Pulmonary Tuberculosis, which exhibited the well-established upregulation of interferon signaling.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eFor further exploration, we selected differentially expressed immune genes (IMG) from the ImmPort database and interferon-stimulated genes (ISG) from prior research\u003csup\u003e\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e33\u003c/span\u003e,\u003cspan citationid=\"CR42\" class=\"CitationRef\"\u003e42\u003c/span\u003e\u003c/sup\u003e. Notably, TB-associated markers GBP1 and GBP5 were elevated in the TB group (Supplementary Figure S8A, adjusted p-value\u0026thinsp;\u0026lt;\u0026thinsp;0.01). Four genes emerged as notably upregulated in the cancer group, and intriguingly, these genes were chemokines: C-C motif chemokine ligand 7 (CCL7), C-C motif chemokine ligand 8 (CCL8), C-C motif chemokine ligand 13 (CCL13) and pro-platelet basic protein (PPBP) also known as CXCL7 (Supplementary Figure S8A, indicated by red triangle, adjusted p-value\u0026thinsp;\u0026lt;\u0026thinsp;0.01). Studies suggest that CCL7, highly expressed in tumor tissues, recruits cDC1 cells, aiding antitumor immunity and checkpoint immunotherapy. 
Additionally, CCL7, CCL8, and CCL13 are linked to tumor-associated macrophages (M2)\u003csup\u003e\u003cspan additionalcitationids=\"CR44\" citationid=\"CR43\" class=\"CitationRef\"\u003e43\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR45\" class=\"CitationRef\"\u003e45\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eWe identified 27 transposable elements among lung cancer and the three infection groups (Supplementary Figure S7E-H, Figure S8B), notably finding significantly higher LTR-ERV (LTR6A and HUERS-P3-int) levels in lung cancer (Supplementary Figure S8B, adjusted p-value\u0026thinsp;=\u0026thinsp;0.019).\u003c/p\u003e \u003cp\u003eTo investigate variations in immune cell abundance across different groups, we estimated cell-type levels in host transcriptomes using computational quantification methods, including a deconvolution approach implemented in CIBERSORTx. M1 macrophages were significantly elevated in pulmonary tuberculosis (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eB, Mann-Whitney-U Test, p-value\u0026thinsp;\u0026lt;\u0026thinsp;0.05), whereas M2 macrophage levels were higher in fungal infection, pulmonary tuberculosis and lung cancer (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eC, Mann-Whitney-U Test, p-value\u0026thinsp;\u0026lt;\u0026thinsp;0.01). Neutrophils were enriched in bacterial infection compared with lung cancer (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eD, Mann-Whitney-U test, p-value\u0026thinsp;\u0026lt;\u0026thinsp;0.01).
Furthermore, we observed notably higher monocytes in fungal infection (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eE, Mann-Whitney-U Test, p-value\u0026thinsp;\u0026lt;\u0026thinsp;0.01).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec12\" class=\"Section2\"\u003e \u003ch2\u003e4. Copy number variants and CNV-derived tumor fraction of different pulmonary diseases\u003c/h2\u003e \u003cp\u003eDue to the host DNA removal step during sample preprocessing, we evaluated the sufficiency of host data. We analyzed 402 samples, finding a host rate of 98.22% (IQR 97.46, 98.66) and 19.23\u0026nbsp;million mapped reads (IQR 15.95, 23.66) (Supplementary Table \u003cspan refid=\"MOESM2\" class=\"InternalRef\"\u003eS2\u003c/span\u003e). The data volume in this study is higher than in previous studies\u003csup\u003e17\u003c/sup\u003e, thereby supporting the reliability of subsequent analyses. To enhance CNV and tumor fraction estimations in BALF mNGS data, we used three distinct software tools. CNVkit revealed slight increases in CNV counts on chromosomes 11 (lung cancer group) and 3 (pulmonary infection group) (Supplementary Figure S9, p-value\u0026thinsp;\u0026lt;\u0026thinsp;0.05). Higher CNV percentages on chromosome 3 were noted in the infection group (Supplementary Figure S10, p-value\u0026thinsp;\u0026lt;\u0026thinsp;0.05). However, no significant CNV count or percentage differences emerged when comparing the cancer group with the three infection subgroups (Supplementary Figure S11 and S12, p-value\u0026thinsp;\u0026gt;\u0026thinsp;0.05).\u003c/p\u003e \u003cp\u003eSubsequently, ichorCNA estimated tumor fractions at 5.96% (lung cancer, 95% CI 4.15%-7.77%) and 6.29% (pulmonary infection, 95% CI 0.54%-12.04%) (Supplementary Figure S13A). Notably, no significant differences in tumor fractions were observed between the lung cancer and the three infection subgroups (Supplementary Figure S13B).
Scores calculated with the 'estimate' software (Stromal, Immune, ESTIMATE, Tumor Purity) showed no differences between lung cancer and any of the pulmonary infection groups (Supplementary Figure S13C, D). This suggests that, unlike Cancer-Negative (Benign) samples\u003csup\u003e\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e\u003c/sup\u003e, BALF samples from infection patients display levels of copy number variation comparable to those seen in cancer patients.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec13\" class=\"Section2\"\u003e \u003ch2\u003e5. Host/microbe metagenomics-based modelling for lung cancer and pulmonary infection diagnosis\u003c/h2\u003e \u003cp\u003eWe first conducted individual machine learning modeling and dual-model ensemble modeling for Models I-VI. We evaluated their performance based on the mean AUC on the validation dataset. Among them, the optimal combination for Model I was found to be 0.1LASSO\u0026thinsp;+\u0026thinsp;0.9RF, with a mean AUC of 0.778 (Supplementary Figure S14). For Model II, the best combination was 0.3RF\u0026thinsp;+\u0026thinsp;0.7XGBoost, with a mean AUC of 0.691 (Supplementary Figure S15). Model III's optimal combination was 0.3LASSO\u0026thinsp;+\u0026thinsp;0.7RF, achieving a mean AUC of 0.867 (Supplementary Figure S16). For Model IV, the best combination was 0.2LASSO\u0026thinsp;+\u0026thinsp;0.8SVM, with a mean AUC of 0.584 (Supplementary Figure S17). The optimal combination for Model V was 0.9LASSO\u0026thinsp;+\u0026thinsp;0.1XGBoost, with a mean AUC of 0.56 (Supplementary Figure S18). For Model VI, the optimal combination was 0.3LASSO\u0026thinsp;+\u0026thinsp;0.7RF, with a mean AUC of 0.869 (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003eA). We observed that among the various comparison groups, Model VI consistently exhibited the highest AUC (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003eB).
Specifically, Model VI significantly outperformed Models I, II, IV, and V in each comparison group (Figs.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003eC-F, DeLong Test p-value\u0026thinsp;\u0026lt;\u0026thinsp;0.05).\u003c/p\u003e \u003cp\u003eThe results showed that Model VI, incorporating differential features from microbial and bacteriophage DNA/RNA abundances, host gene expression, immune cell composition, transposable elements, and CNV-derived tumor fraction, exhibited the highest discriminatory capability in both general and subgroup comparisons. Specifically, for the general comparison, Model VI demonstrated an AUC of 0.937 (95% CI\u0026thinsp;=\u0026thinsp;0.91\u0026ndash;0.964) with 92.0% sensitivity and 81.2% specificity in the training cohort. In the validation cohort, it achieved an AUC of 0.847 (95% CI\u0026thinsp;=\u0026thinsp;0.776\u0026ndash;0.918) with 94.4% sensitivity and 61.0% specificity, effectively distinguishing lung cancer from pulmonary infections (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003eC, Supplementary Table S3). The highlighted host transcriptome features in Model VI included genes involved in the cell cycle and cytokine-cytokine receptor pathways, such as ULBP1, B3GAT1, and CCL13 (Supplementary Data S2; Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003eG, H, and I, Mann-Whitney-U test, p-value\u0026thinsp;\u0026lt;\u0026thinsp;0.05). Notably, CCL13, a downstream gene of EGFR, serves as a typical LUAD biomarker, while ULBP1 and B3GAT1 are genes regulated by CCL13 for cDC modulation.\u003c/p\u003e \u003cp\u003eIn the subgroup comparisons, Model VI showcased notable performance. For instance, in distinguishing lung cancer from bacterial infection, it attained an AUC of 0.847, with 80.6% sensitivity and 82.4% specificity in the validation cohort.
Similarly, in discerning lung cancer from fungal infection, Model VI displayed an AUC of 0.872, sensitivity of 94.4%, and specificity of 69.6%. Furthermore, when differentiating lung cancer from pulmonary tuberculosis, Model VI achieved an AUC of 0.909, sensitivity of 91.7%, and specificity of 76.0% (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003eB, D, E, and F, Supplementary Table S3). Noteworthy observations included higher levels of MAS1 associated with apoptosis and tissue injuries in Bacterial Infection, increased IL23R levels correlated with TLR4 in Pulmonary Tuberculosis, and elevated C1QL3 levels in Fungal Infection compared to lung cancer (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003eK, J, and L, Mann-Whitney-U test, p-value\u0026thinsp;\u0026lt;\u0026thinsp;0.01).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec14\" class=\"Section2\"\u003e \u003ch2\u003e6. A composite predictive model for lung cancer and infection diagnosis\u003c/h2\u003e \u003cp\u003eWith a rule-in and rule-out strategy, we developed a composite predictive model that combines the Model VI used for the general comparison with the Model VI used for a given subgroup comparison, aiming to enhance the diagnostic accuracy for lung cancer and infections. In this strategy, if both the general-comparison Model VI and the subgroup-comparison Model VI classified a patient as lung cancer, the patient was assigned to the rule-in band (i.e., a positive lung cancer diagnosis). If both models classified a patient as infection, the patient was assigned to the rule-out band (i.e., a positive pulmonary infection diagnosis) (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003eC, Supplementary Fig.\u0026nbsp;1).\u003c/p\u003e \u003cp\u003eThe validation cohort from each subgroup comparison was utilized to evaluate the performance of the composite predictive model.
Within the lung cancer versus bacterial infection group, a total of 54 patients were categorized, with 27 identified as rule-in and 22 as rule-out (Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003eA, Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e). Similarly, within the comparison of lung cancer versus fungal infection, 47 patients were classified, comprising 32 rule-in and 11 rule-out cases (Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003eB, Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e). Moreover, in the evaluation of lung cancer versus pulmonary tuberculosis, 48 patients were allocated, consisting of 31 rule-in and 12 rule-out instances (Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003eC, Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e).\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eTest statistics for combination strategy.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"7\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cdiv align=\"left\"
class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eTreated\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eCancer\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eInfection\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eLR*\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003eSpecificity#\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c7\"\u003e \u003cp\u003eSensitivity+\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eLung Cancer vs.\u003c/p\u003e \u003cp\u003eBacterial Infection\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eRule-Out\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e22\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e-\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eRule-In\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e27\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e5.4\u003c/p\u003e \u003c/td\u003e 
\u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e-\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e0.844\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eLung Cancer vs.\u003c/p\u003e \u003cp\u003eFungal Infection\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eRule-Out\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e11\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e-\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eRule-In\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e32\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e8\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e-\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e0.889\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eLung Cancer vs.\u003c/p\u003e \u003cp\u003ePulmonary Tuberculosis\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eRule-Out\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd 
align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e12\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e-\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eRule-In\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e31\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e6.2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e-\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e0.861\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eLR*: Likelihood Ratio, serves as an indicator of cancer risks. A higher LR signifies a stronger correlation with lung cancer. For example, within the Rule-out band for lung cancer versus bacterial infection, there were 22 patients classified as infection and 0 patients classified as having cancer. The LR calculation resulted in 0/22\u0026thinsp;=\u0026thinsp;0.\u003c/p\u003e \u003cp\u003eSpecificity#: refers to the accuracy of the rule-out band in correctly identifying infected patients. It is calculated as the number of infected patients correctly identified by the rule-out band (true positives) divided by the sum of true positives and the number of infected patients mistakenly identified as having cancer by the rule-out band.\u003c/p\u003e \u003cp\u003eSensitivity+: refers to the accuracy of the rule-in band in correctly identifying cancer patients. 
This is calculated by dividing the number of cancer patients correctly identified by the rule-in band (true positives) by the sum of true positives and the number of cancer patients mistakenly identified as having an infection by the rule-in band.\u003c/p\u003e \u003cp\u003eFrom the results, it is evident that employing this strategy significantly enhanced the diagnostic accuracy (ACC) in distinguishing between lung cancer and bacterial infection, elevating it from 0.800 (56/70) to 0.907 (49/54) (Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003eA). This enhancement was accompanied by a sensitivity of 100%, reflecting the rule-in band's accuracy in correctly identifying individuals with cancer, and a specificity of 84.4%, demonstrating the rule-out band's accuracy in correctly identifying patients with an infection (Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e). Similarly, there was a significant enhancement in ACC, rising from 0.797 (47/59) to 0.915 (43/47) alongside a specificity of 88.9% and sensitivity of 100% for diagnosing Lung cancer and Fungal Infection (Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003eB, Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e). Of note, this method yielded 86.1% specificity and 100% sensitivity in distinguishing Lung cancer and Pulmonary Tuberculosis (ACC\u0026thinsp;=\u0026thinsp;0.896, 43/48) (Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003eC, Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e). 
Accordingly, this integrated predictive approach indeed provides a highly accurate strategy to better utilize complex data generated by mNGS for distinguishing various pulmonary diseases in a clinically viable manner.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e"},{"header":"Discussion","content":"\u003cp\u003eIn the realm of diagnostics, BALF-based mNGS testing has emerged as a rapid assay to pinpoint pulmonary infection pathogens\u003csup\u003e\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e,\u003cspan additionalcitationids=\"CR47\" citationid=\"CR46\" class=\"CitationRef\"\u003e46\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR48\" class=\"CitationRef\"\u003e48\u003c/span\u003e\u003c/sup\u003e. Despite over 90% of mNGS results being human-origin reads, often disregarded as \"noise\", recent research posits that these sequences may harbor valuable biomarkers linked to the host's disease state\u003csup\u003e\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e,\u003cspan citationid=\"CR49\" class=\"CitationRef\"\u003e49\u003c/span\u003e\u003c/sup\u003e. Our study pioneers a comprehensive host/microbe metagenomics approach, utilizing BALF mNGS data for diagnosing lung cancer and pulmonary infections. 
This innovative methodology exhibits exceptional accuracy in distinguishing between lung cancer and diverse pulmonary infections (including pulmonary tuberculosis, fungal infection, and bacterial infection), amplifying the clinical applicability of BALF mNGS testing.\u003c/p\u003e \u003cp\u003eWhile BALF samples exhibit inherent heterogeneity compared to whole blood or tissue specimens\u003csup\u003e\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e,\u003cspan citationid=\"CR50\" class=\"CitationRef\"\u003e50\u003c/span\u003e,\u003cspan citationid=\"CR51\" class=\"CitationRef\"\u003e51\u003c/span\u003e\u003c/sup\u003e, our analytical model demonstrates significant robustness. Specifically tailored for distinguishing lung cancer from pulmonary infection, our Model VI achieved a notable AUC of 0.847 (95% CI\u0026thinsp;=\u0026thinsp;0.776\u0026ndash;0.918) within the validation cohort. This cohort encompassed a spectrum of complex pulmonary infections, including bacterial, fungal, and tuberculosis infections, each characterized by substantial variations in host immune responses, pathogen profiles, and microbiota compositions. Impressively, our model's performance is comparable to the uniform multi-omics models utilized in other studies for different sample types. For instance, the IMX-BVN model used to differentiate acute bacterial infections from others achieved an AUC of 0.86 (95% CI 0.77\u0026ndash;0.93), while distinguishing acute viral infections scored an AUC of 0.85 (95% CI 0.76\u0026ndash;0.93)\u003csup\u003e\u003cspan citationid=\"CR52\" class=\"CitationRef\"\u003e52\u003c/span\u003e\u003c/sup\u003e. The diagnostic capacity of whole blood transcriptomics in discerning sepsis from non-sepsis states showed an AUC of 0.82, while plasma cell-free RNA transcriptomics reached an AUC of 0.77\u003csup\u003e14\u003c/sup\u003e. 
These comparisons indirectly suggest that our model performs similarly well in managing complex pulmonary conditions. Moreover, in the validation cohort differentiating lung cancer from pulmonary tuberculosis, the AUC rose to 0.909 (95% CI\u0026thinsp;=\u0026thinsp;0.831\u0026ndash;0.979), showcasing the advantage of integrating multi-omics into Model VI. Identifying patients with lung cancer or pulmonary infections remains a crucial clinical challenge in many medical settings. The decision to administer empirical antibiotics often relies on an educated guess. If we could further refine our diagnosis of specific infection subgroups (such as bacteria, fungi, or tuberculosis) after first confirming an infection using Model VI, it could assist clinicians in employing antibiotic therapies more accurately. The sensitivity of our model is high: in the validation cohort, the sensitivity reached 94.4% for pulmonary infection, 80.6% for bacterial infection, 94.4% for fungal infection, and 91.7% for tuberculosis. This implies that, in addition to detecting pathogens, our model can also identify the large majority of lung cancer patients. By combining these results with pathological findings, we can improve diagnostic accuracy and mitigate the limitation of our model's lower specificity. Furthermore, for patients with low tumor risk or those diagnosed with infections, invasive biopsy procedures can be avoided, thus reducing the potential harm caused by such procedures. To classify patients more precisely, we further developed a more rigorous integrated predictive model based on predefined rule-in and rule-out strategies, enhancing the differentiation accuracy between lung cancer and infection subgroups.
The results showed improved accuracy in distinguishing lung cancer from pulmonary tuberculosis (ACC\u0026thinsp;=\u0026thinsp;0.896), fungal infection (ACC\u0026thinsp;=\u0026thinsp;0.915), and bacterial infection (ACC\u0026thinsp;=\u0026thinsp;0.907). Such diagnostic approaches promise more precise clinical diagnoses, thereby yielding greater benefits for patients. In clinical practice, patients with suspected pulmonary infections or other diseases undergo pathogen detection and confirmation using DNA/RNA mNGS. Simultaneously, by employing our diagnostic model and rule-in/rule-out strategy, we can accurately identify patients with confirmed pulmonary infections, lung tumors, and those who are suspected cases. Further classification is achieved through additional methodologies, distinguishing lung tumor patients and suspected cases into confirmed lung tumor patients, non-tumor non-infection patients, and patients with concurrent lung tumors and pulmonary infections. This approach allows for more precise subsequent treatment (Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003eD).\u003c/p\u003e \u003cp\u003eOur study tested an integrated host-microbe mNGS diagnostic approach, examining microbial (including bacteriophage) DNA/RNA abundance, host gene expression, transposable elements, immune cell composition, and copy number variant (CNV)-derived tumor fraction. Prior research used only one or a few features independently to aid diagnosis, such as lung cancer microbiomes\u003csup\u003e\u003cspan citationid=\"CR53\" class=\"CitationRef\"\u003e53\u003c/span\u003e\u003c/sup\u003e. Previous 16S rRNA sequencing revealed higher \u003cem\u003eFirmicutes\u003c/em\u003e and \u003cem\u003eTM7\u003c/em\u003e presence in lung cancer versus healthy controls\u003csup\u003e\u003cspan citationid=\"CR54\" class=\"CitationRef\"\u003e54\u003c/span\u003e\u003c/sup\u003e.
\u003cem\u003eVeillonella\u003c/em\u003e and \u003cem\u003eMegasphaera\u003c/em\u003e showed promise as lung cancer biomarkers (AUC: 0.888), indicating distinctive bacterial profiles in lung cancer versus benign conditions\u003csup\u003e\u003cspan citationid=\"CR54\" class=\"CitationRef\"\u003e54\u003c/span\u003e\u003c/sup\u003e. Our data detected subtle microbial differences between lung cancer and pulmonary infections and infectious subgroups. \u003cem\u003eVeillonella parvula\u003c/em\u003e notably increased in lung cancer compared to bacterial/fungal infection (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003eC, LDA score\u0026thinsp;\u0026gt;\u0026thinsp;2, adjusted p-value\u0026thinsp;\u0026lt;\u0026thinsp;0.05). Yet, the microbiome had limited predictive power for distinguishing lung cancer from pulmonary infection (AUC\u0026thinsp;=\u0026thinsp;0.645 in the validation cohort). Extracting more distinctive biological information from sequencing data is crucial for differentiating lung cancer from pulmonary infections.\u003c/p\u003e \u003cp\u003eWe believe that host immune dysregulation disrupts the composition of respiratory microbiota. Previous literature has underscored significant changes in the dynamic equilibrium between host and microbiome in conditions such as lung cancer and infections\u003csup\u003e\u003cspan citationid=\"CR55\" class=\"CitationRef\"\u003e55\u003c/span\u003e,\u003cspan citationid=\"CR56\" class=\"CitationRef\"\u003e56\u003c/span\u003e\u003c/sup\u003e. In this study, we independently compared the contributions of microbial/bacteriophage relative abundances (Model I and Model II), host gene expression and immune cell composition (Model III), TE expression levels (Model IV), and CNV-derived tumor fraction (Model V) for distinguishing lung cancer from infections.
The results indicate that host immune response (Model III) reflects the most prominent differences in pulmonary disease status compared to other categories (Figs.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003eD-F, DeLong's ROC test, p-value\u0026thinsp;\u0026lt;\u0026thinsp;0.05).\u003c/p\u003e \u003cp\u003eIn spite of the limited cellular content in certain BALF samples from patients, we successfully retrieved a robust human gene expression dataset. These data unveiled distinct immune responses across various pulmonary diseases. Analysis using PLIER revealed significant differences in latent variables associated with cell cycle, interferon, and cytokine pathways among these diseases. Notably, our findings highlighted genes involved in cell cycle regulation concurrently influencing PI3K-Akt signaling, p53 signaling, and lung cancer pathways, under the regulation of EGFR\u003csup\u003e\u003cspan citationid=\"CR57\" class=\"CitationRef\"\u003e57\u003c/span\u003e\u003c/sup\u003e. Additionally, we identified the GBP5 gene, known for its high diagnostic relevance in active tuberculosis\u003csup\u003e\u003cspan citationid=\"CR58\" class=\"CitationRef\"\u003e58\u003c/span\u003e\u003c/sup\u003e, and observed elevated expression levels of interferon signaling pathways in the pulmonary tuberculosis group compared to the other groups (Supplementary Figure S4A). This further underscores the reliability of our findings regarding the host immune response.\u003c/p\u003e \u003cp\u003eOur top three classifier genes for lung cancer and pulmonary infection were identified as B3GAT1, ULBP1, and CCL13. Interestingly, these genes have not been previously linked in host gene expression signatures in bodily fluids related to lung cancer.
Specifically, ULBP1's role as a ligand for the NKG2D receptor activates NK cells in lung cancer, fostering NK cell-mediated tumor surveillance and cytotoxicity \u003csup\u003e\u003cspan citationid=\"CR59\" class=\"CitationRef\"\u003e59\u003c/span\u003e\u003c/sup\u003e. Expression of ULBP1-6, particularly in squamous-cell carcinoma, correlates with clinical outcomes in NSCLC patients, suggesting a predictive value for clinical prognosis\u003csup\u003e\u003cspan citationid=\"CR60\" class=\"CitationRef\"\u003e60\u003c/span\u003e\u003c/sup\u003e. Conversely, CCL13, a ligand for CCR2, contributes to cancer-related processes such as metastasis and immunosuppression. CCR2 expression in M2 macrophages is integral in the bidirectional communication between these macrophages and cancer cells, driving lung cancer progression\u003csup\u003e\u003cspan citationid=\"CR61\" class=\"CitationRef\"\u003e61\u003c/span\u003e\u003c/sup\u003e. Additionally, CCL13, derived from M2 tumor-associated macrophages, promotes oral cancer metastasis by inducing inflammatory cytokines\u003csup\u003e\u003cspan citationid=\"CR45\" class=\"CitationRef\"\u003e45\u003c/span\u003e\u003c/sup\u003e. Finally, B3GAT1, or beta-1,3-glucuronyltransferase 1, holds significance in cancer, particularly concerning tumor cell motility and specific carbohydrate epitope biosynthesis. Its role in canonical integrin signaling pathways influences tumor cell motility, while its involvement in HNK-1 carbohydrate epitope biosynthesis bears relevance to neurodevelopment and cancer-related processes\u003csup\u003e\u003cspan citationid=\"CR62\" class=\"CitationRef\"\u003e62\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eWe investigated for the first time the expression levels of transposable elements in BALF samples from pulmonary diseases in this study. HERVK11D showed higher expression in lung cancer compared to fungal infection and tuberculosis. 
Similarly, ERVK-MER11B was more highly expressed in bacterial infection than in tuberculosis (Supplementary Figure S4, adjusted p-value\u0026thinsp;\u0026lt;\u0026thinsp;0.05). The heightened expression of HERV-K, linked to basal-like and triple-negative breast cancer progression, illustrates altered gene expression driving cancer advancement. HERV-derived long non-coding RNAs also promote cancer progression, signaling significant gene profile shifts in these cancers\u003csup\u003e\u003cspan citationid=\"CR63\" class=\"CitationRef\"\u003e63\u003c/span\u003e,\u003cspan citationid=\"CR64\" class=\"CitationRef\"\u003e64\u003c/span\u003e\u003c/sup\u003e. Additionally, two ERV1 elements were notably higher in lung cancer compared to all pulmonary infections (Supplementary Figure S4, adjusted p-value\u0026thinsp;\u0026lt;\u0026thinsp;0.05). These findings underscore the importance of repetitive sequences in human health, exemplified by severe COVID-19 pneumonia triggering intense inflammatory responses and HERV dysregulation in BALF samples. For example, HERV-FRD, notably upregulated in COVID-19 BALF, suggests HERVs as potential disease progression biomarkers linked to increased severity in aging\u003csup\u003e\u003cspan citationid=\"CR65\" class=\"CitationRef\"\u003e65\u003c/span\u003e\u003c/sup\u003e. Surprisingly, we found for the first time that certain transposable elements were more highly expressed in BALF during pulmonary infection than in lung cancer. 
GSAT satellite, notably higher in bacterial and fungal infections (Supplementary Figure S4, adjusted p-value\u0026thinsp;\u0026lt;\u0026thinsp;0.05), regulated by AP-1, holds significance in various pulmonary diseases by impacting gene expression and inflammatory cell activation crucial in pulmonary infections\u003csup\u003e\u003cspan citationid=\"CR66\" class=\"CitationRef\"\u003e66\u003c/span\u003e,\u003cspan citationid=\"CR67\" class=\"CitationRef\"\u003e67\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eSeveral studies have employed copy number variants (CNV) from bodily fluids to diagnose pulmonary malignancies\u003csup\u003e\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e,\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e,\u003cspan citationid=\"CR68\" class=\"CitationRef\"\u003e68\u003c/span\u003e\u003c/sup\u003e. In these investigations, whole genome testing of metagenomic data demonstrated heightened diagnostic accuracy for pulmonary malignancies in samples initially identified as negative via conventional testing. Intriguingly, our findings indicated no significant distinction in the CNV-derived tumor fraction of BALF samples between lung cancer and pulmonary infections. This reiterates that relying solely on one-dimensional information acquired from conventional BALF mNGS, characterized by low-depth sequencing, is insufficient for diagnosing intricate or multifaceted diseases.\u003c/p\u003e \u003cp\u003eDespite these insights, our study has limitations. Firstly, our cohort lacked viral pneumonia cases due to reduced incidence during China's COVID-19 control measures. In fact, viral pneumonia often differs from lung cancer clinically and radiologically, so it calls for less complex differential diagnosis than infections caused by bacteria, fungi, and mycobacteria. 
Secondly, our study focused on distinguishing infection from cancer, and therefore the established model cannot differentiate among infection subgroups. We are conducting another study to develop diagnostic models for distinguishing between infection subgroups, and some progress has been made thus far.\u003c/p\u003e \u003cp\u003eIn conclusion, we report that integrated host and microbe information from BAL nucleic acid enables accurate diagnosis of lung cancer and pulmonary infections. Future studies are needed to validate and test the clinical impact of this culture-independent diagnostic approach.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eData and materials availability:\u0026nbsp;\u003c/strong\u003e(1)\u003cstrong\u003e\u0026nbsp;\u003c/strong\u003eCode Availability. Essential scripts for implementing the machine learning-based integrative procedure in multiple independent datasets are available on GitHub\u0026nbsp;(https://github.com/whybeeVM/Metagenomic-Analysis-of-Lung-Cancer-and-Pulmonary-Infections). (2) Data Availability. Microbial reads from DNA and RNA mNGS data were deposited in NCBI\u0026apos;s Sequence Read Archive (SRA) database under project number PRJNA1056765. Host gene expression profiles derived from RNA sequencing data were deposited in the Gene Expression Omnibus (GEO) under accession GSE252118.\u003c/p\u003e\u003cp\u003e \u003ch2\u003eConflict of Interest Statement:\u003c/h2\u003e \u003cp\u003eThe authors declare no competing interests.\u003c/p\u003e \u003c/p\u003e\u003ch2\u003eFunding:\u003c/h2\u003e \u003cp\u003eThis study was supported by the National Key R\u0026amp;D Program of China (2023YFC2308300), the \u0026ldquo;Leading Geese\u0026rdquo; Research and Development Plan of Zhejiang Province (No. 2024C03218), and the National Natural Science Foundation of China (No. 82472371).\u003c/p\u003e\u003ch2\u003eAuthor Contribution\u003c/h2\u003e\u003cp\u003eStudy design, D.H., H.Z., S.Z. 
and Y.C.; Data collection, F.Y., B.L. and H.T.; Data analysis, D.H., Y.S. and Y.C.; Writing of the paper, D.H. and B.Y. All authors have read and approved the final version of the manuscript.\u003c/p\u003e\u003ch2\u003eAcknowledgement\u003c/h2\u003e\u003cp\u003eWe thank all clinicians who provided detailed diagnostic and treatment data of patients for our study, as well as all infectious disease (ID) physicians, clinical microbiologists and oncologists who received our clinical consultations.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eKreier F. Cancer will cost the world $25 trillion over next 30 years. Nature. 2023.\u003c/li\u003e\n\u003cli\u003eAgusti A, Vogelmeier CF, Halpin DMG. Tackling the global burden of lung disease through prevention and early diagnosis. The Lancet Respiratory Medicine. 2022;10(11):1013-1015.\u003c/li\u003e\n\u003cli\u003eMckelvy BJ, Araujo-Filho JAB, Godoy MCB, et al. Infectious Diseases That May Mimic Lung Cancer. In: Moran CA, Truong MT, de Groot PM, editors. The Thorax: Medical, Radiological, and Pathological Assessment. Cham: Springer International Publishing; 2023. p. 827-851.\u003c/li\u003e\n\u003cli\u003eNewman-Toker DE, Schaffer AC, Yu-Moe CW, et al. Serious misdiagnosis-related harms in malpractice claims: The \u0026quot;Big Three\u0026quot; - vascular events, infections, and cancers. Diagnosis (Berlin, Germany). 2019;6(3):227.\u003c/li\u003e\n\u003cli\u003eGuimar\u0026atilde;es MD, Marchiori E, de Souza Portes Meirelles G, et al. Fungal Infection Mimicking Pulmonary Malignancy: Clinical and Radiological Characteristics. Lung. 2013;191(6):655-662.\u003c/li\u003e\n\u003cli\u003eFabre V, Davis A, Diekema DJ, et al. Principles of diagnostic stewardship: A practical guide from the Society for Healthcare Epidemiology of America Diagnostic Stewardship Task Force. Infection Control \u0026amp; Hospital Epidemiology. 2023;44(2):178-185.\u003c/li\u003e\n\u003cli\u003eBlauwkamp TA, Thair S, Rosen MJ, et al. 
Analytical and clinical validation of a microbial cell-free DNA sequencing test for infectious disease. Nat Microbiol. 2019;4(4):663-674.\u003c/li\u003e\n\u003cli\u003eMiller S, Naccache SN, Samayoa E, et al. Laboratory validation of a clinical metagenomic sequencing assay for pathogen detection in cerebrospinal fluid. Genome Res. 2019;29(5):831-842.\u003c/li\u003e\n\u003cli\u003eDiao Z, Lai H, Han D, et al. Validation of a Metagenomic Next-Generation Sequencing Assay for Lower Respiratory Pathogen Detection. Microbiology Spectrum. 2023;11(1).\u003c/li\u003e\n\u003cli\u003eChiu CY, Miller SA. Clinical metagenomics. Nat Rev Genet. 2019;20(6):341-355.\u003c/li\u003e\n\u003cli\u003eDiao Z, Han D, Zhang R, et al. Metagenomics next-generation sequencing tests take the stage in the diagnosis of lower respiratory tract infections. J Adv Res. 2022;38:201-212.\u003c/li\u003e\n\u003cli\u003eEdgeworth JD. Respiratory metagenomics: route to routine service. Curr Opin Infect Dis. 2023;36(2):115-123.\u003c/li\u003e\n\u003cli\u003eRamachandran PS, Ramesh A, Creswell FV, et al. Integrating central nervous system metagenomics and host response for diagnosis of tuberculosis meningitis and its mimics. Nat Commun. 2022;13(1).\u003c/li\u003e\n\u003cli\u003eKalantar KL, Neyton L, Abdelghany M, et al. Integrated host-microbe plasma metagenomics for sepsis diagnosis in a prospective cohort of critically ill adults. Nat Microbiol. 2022;7(11):1805-1816.\u003c/li\u003e\n\u003cli\u003eLangelier C, Kalantar KL, Moazed F, et al. Integrating host response and unbiased microbe detection for lower respiratory tract infection diagnosis in critically ill adults. Proceedings of the National Academy of Sciences. 2018;115(52):E12353-E12362.\u003c/li\u003e\n\u003cli\u003eGu W, Talevich E, Hsu E, et al. Detection of cryptogenic malignancies from metagenomic whole genome sequencing of body fluids. Genome Med. 2021;13(1).\u003c/li\u003e\n\u003cli\u003eGu W, Rauschecker AM, Hsu E, et al. 
Detection of Neoplasms by Metagenomic Next-Generation Sequencing of Cerebrospinal Fluid. Jama Neurol. 2021;78(11):1355-1366.\u003c/li\u003e\n\u003cli\u003eGuo Y, Li H, Chen H, et al. Metagenomic next-generation sequencing to identify pathogens and cancer in lung biopsy tissue. Ebiomedicine. 2021;73:103639.\u003c/li\u003e\n\u003cli\u003eTravis WD, et al. The 2015 World Health Organization Classification of Lung Tumors: Impact of Genetic, Clinical and Radiologic Advances Since the 2004 Classification. J Thorac Oncol. 2015;10:1243-1260.\u003c/li\u003e\n\u003cli\u003eSulaiman I, Chung M, Angel L, et al. Microbial signatures in the lower airways of mechanically ventilated COVID-19 patients associated with poor clinical outcome. Nat Microbiol. 2021;6(10):1245-1258.\u003c/li\u003e\n\u003cli\u003eKopylova E, No\u0026eacute; L, Touzet H. SortMeRNA: fast and accurate filtering of ribosomal RNAs in metatranscriptomic data. Bioinformatics. 2012;28(24):3211-3217.\u003c/li\u003e\n\u003cli\u003eBolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30(15):2114-2120.\u003c/li\u003e\n\u003cli\u003eYan Z, Chen B, Yang Y, et al. Multi-omics analyses of airway host-microbe interactions in chronic obstructive pulmonary disease identify potential therapeutic interventions. Nat Microbiol. 2022;7(9):1361-1375.\u003c/li\u003e\n\u003cli\u003eHaddock NL, Barkal LJ, Ram-Mohan N, et al. Phage diversity in cell-free DNA identifies bacterial pathogens in human sepsis cases. Nat Microbiol. 2023;8(8):1495-1507.\u003c/li\u003e\n\u003cli\u003eSegata N, Izard J, Waldron L, et al. Metagenomic biomarker discovery and explanation. Genome Biol. 2011;12(6):R60.\u003c/li\u003e\n\u003cli\u003eKim D, Paggi JM, Park C, et al. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat Biotechnol. 2019;37(8):907-915.\u003c/li\u003e\n\u003cli\u003eLiao Y, Smyth GK, Shi W. 
featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics. 2014;30(7):923-930.\u003c/li\u003e\n\u003cli\u003eJin Y, Tam OH, Paniagua E, et al. TEtranscripts: a package for including transposable elements in differential expression analysis of RNA-seq datasets. Bioinformatics. 2015;31(22):3593-3599.\u003c/li\u003e\n\u003cli\u003eLove MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15(12):550.\u003c/li\u003e\n\u003cli\u003eSubramanian A, Tamayo P, Mootha VK, et al. Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles. Proceedings of the National Academy of Sciences - Pnas. 2005;102(43):15545-15550.\u003c/li\u003e\n\u003cli\u003eKanehisa M, Furumichi M, Tanabe M, et al. KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res. 2017;45(D1):D353-D361.\u003c/li\u003e\n\u003cli\u003eGillespie M, Jassal B, Stephan R, et al. The reactome pathway knowledgebase 2022. Nucleic Acids Res. 2022;50(D1):D687-D692.\u003c/li\u003e\n\u003cli\u003eSchoggins JW, Wilson SJ, Panis M, et al. A diverse range of gene products are effectors of the type I interferon antiviral response. Nature. 2011;472(7344):481-485.\u003c/li\u003e\n\u003cli\u003eSteen CB, Liu CL, Alizadeh AA, et al. Profiling Cell Type Abundance and Expression in Bulk Tissues with CIBERSORTx. Methods Mol Biol. 2020;2117:135-157.\u003c/li\u003e\n\u003cli\u003eMao W, Zaslavsky E, Hartmann BM, et al. Pathway-level information extractor (PLIER) for gene expression data. Nat Methods. 2019;16(7):607-610.\u003c/li\u003e\n\u003cli\u003eAdalsteinsson VA, Ha G, Freeman SS, et al. Scalable whole-exome sequencing of cell-free DNA reveals high concordance with metastatic tumors. Nat Commun. 2017;8(1):1313-1324.\u003c/li\u003e\n\u003cli\u003eTalevich E, Shain AH, Botton T, et al. 
CNVkit: Genome-Wide Copy Number Detection and Visualization from Targeted DNA Sequencing. Plos Comput Biol. 2016;12(4):e1004873.\u003c/li\u003e\n\u003cli\u003eYoshihara K, Shahmoradgoli M, Martinez E, et al. Inferring tumour purity and stromal and immune cell admixture from expression data. Nat Commun. 2013;4:2612.\u003c/li\u003e\n\u003cli\u003eLang M, Binder M, Richter J, et al. mlr3: A modern object-oriented machine learning framework in R. Journal of Open Source Software. 2019.\u003c/li\u003e\n\u003cli\u003eMayhew MB, Buturovic L, Luethy R, et al. A generalizable 29-mRNA neural-network classifier for acute bacterial and viral infections. Nat Commun. 2020;11(1):1177.\u003c/li\u003e\n\u003cli\u003eRen L, Wang Y, Zhong J, et al. Dynamics of the Upper Respiratory Tract Microbiota and Its Association with Mortality in COVID-19. Am J Respir Crit Care Med. 2021;204(12):1379-1390.\u003c/li\u003e\n\u003cli\u003eBhattacharya S, Dunn P, Thomas CG, et al. ImmPort, toward repurposing of open access immunological assay data for translational and clinical research. Sci Data. 2018;5:180015.\u003c/li\u003e\n\u003cli\u003eNakayama T, Lee IT, Le W, et al. Inflammatory molecular endotypes of nasal polyps derived from White and Japanese populations. J Allergy Clin Immun. 2022;149(4):1296-1308.\u003c/li\u003e\n\u003cli\u003eKorbecki J, Kojder K, Simińska D, et al. CC Chemokines in a Tumor: A Review of Pro-Cancer and Anti-Cancer Properties of the Ligands of Receptors CCR1, CCR2, CCR3, and CCR4. Int J Mol Sci. 2020;21(21):8412.\u003c/li\u003e\n\u003cli\u003eLiu Z, Rui T, Lin Z, et al. Tumor-Associated Macrophages Promote Metastasis of Oral Squamous Cell Carcinoma via CCL13 Regulated by Stress Granule. Cancers (Basel). 2022;14(20).\u003c/li\u003e\n\u003cli\u003eDiao Z, Han D, Zhang R, et al. Metagenomics next-generation sequencing tests take the stage in the diagnosis of lower respiratory tract infections. J Adv Res. 
2021.\u003c/li\u003e\n\u003cli\u003eCharalampous T, Alcolea-Medina A, Snell LB, et al. Evaluating the potential for respiratory metagenomics to improve treatment of secondary infection and detection of nosocomial transmission on expanded COVID-19 intensive care units. Genome Med. 2021;13(1):182.\u003c/li\u003e\n\u003cli\u003eCharalampous T, Kay GL, Richardson H, et al. Nanopore metagenomics enables rapid clinical diagnosis of bacterial lower respiratory infection. Nat Biotechnol. 2019;37(7):783-792.\u003c/li\u003e\n\u003cli\u003eMick E, Tsitsiklis A, Kamm J, et al. Integrated host/microbe metagenomics enables accurate lower respiratory tract infection diagnosis in critically ill children. J Clin Invest. 2023;133(7).\u003c/li\u003e\n\u003cli\u003eDavidson KR, Ha DM, Schwarz MI, et al. Bronchoalveolar lavage as a diagnostic procedure: a review of known cellular and molecular findings in various lung diseases. J Thorac Dis. 2020;12(9):4991-5019.\u003c/li\u003e\n\u003cli\u003eChellapandian D, Lehrnbecher T, Phillips B, et al. Bronchoalveolar lavage and lung biopsy in patients with cancer and hematopoietic stem-cell transplantation recipients: a systematic review and meta-analysis. J Clin Oncol. 2015;33(5):501-509.\u003c/li\u003e\n\u003cli\u003eMayhew MB, Buturovic L, Luethy R, et al. A generalizable 29-mRNA neural-network classifier for acute bacterial and viral infections. Nat Commun. 2020;11(1).\u003c/li\u003e\n\u003cli\u003eRan Z, Liu J, Wang F, et al. Pulmonary Micro-Ecological Changes and Potential Microbial Markers in Lung Cancer Patients. Front Oncol. 2020;10:576855.\u003c/li\u003e\n\u003cli\u003eLee SH, Sung JY, Yong D, et al. Characterization of microbiome in bronchoalveolar lavage fluid of patients with lung cancer comparing with benign mass like lesions. Lung Cancer. 2016;102:89-95.\u003c/li\u003e\n\u003cli\u003eDickson RP, Huffnagle GB. The Lung Microbiome: New Principles for Respiratory Bacteriology in Health and Disease. Plos Pathog. 
2015;11(7):e1004923.\u003c/li\u003e\n\u003cli\u003eMan WH, de Steenhuijsen Piters WAA, Bogaert D. The microbiota of the respiratory tract: gatekeeper to respiratory health. Nature Reviews. Microbiology. 2017;15(5):259-270.\u003c/li\u003e\n\u003cli\u003eDa CSG, Shepherd FA, Tsao MS. EGFR mutations and lung cancer. Annu Rev Pathol. 2011;6:49-69.\u003c/li\u003e\n\u003cli\u003eSweeney TE, Braviak L, Tato CM, et al. Genome-wide expression for diagnosis of pulmonary tuberculosis: a multicohort analysis. The Lancet Respiratory Medicine. 2016;4(3):213-224.\u003c/li\u003e\n\u003cli\u003eSchmiedel D, Mandelboim O. NKG2D Ligands-Critical Targets for Cancer Immune Escape and Therapy. Front Immunol. 2018;9:2040.\u003c/li\u003e\n\u003cli\u003eGowen BG, Chim B, Marceau CD, et al. A forward genetic screen reveals novel independent regulators of ULBP1, an activating ligand for natural killer cells. Elife. 2015;4.\u003c/li\u003e\n\u003cli\u003eSchmall A, Al-Tamari HM, Herold S, et al. Macrophage and cancer cell cross-talk via CCR2 and CX3CR1 is a fundamental mechanism driving lung cancer. Am J Respir Crit Care Med. 2015;191(4):437-447.\u003c/li\u003e\n\u003cli\u003eJeffries AR, Mungall AJ, Dawson E, et al. beta-1,3-Glucuronyltransferase-1 gene implicated as a candidate for a schizophrenia-like psychosis through molecular analysis of a balanced translocation. Mol Psychiatry. 2003;8(7):654-663.\u003c/li\u003e\n\u003cli\u003eLemaitre C, Tsang J, Bireau C, et al. A human endogenous retrovirus-derived gene that can contribute to oncogenesis by activating the ERK pathway and inducing migration and invasion. Plos Pathog. 2017;13(6):e1006451.\u003c/li\u003e\n\u003cli\u003eJin X, Xu XE, Jiang YZ, et al. The endogenous retrovirus-derived long noncoding RNA TROJAN promotes triple-negative breast cancer progression via ZMYND8 degradation. Sci Adv. 2019;5(3):eaat9820.\u003c/li\u003e\n\u003cli\u003eKitsou K, Kotanidou A, Paraskevis D, et al. 
Upregulation of Human Endogenous Retroviruses in Bronchoalveolar Lavage Fluid of COVID-19 Patients. Microbiol Spectr. 2021;9(2):e126021.\u003c/li\u003e\n\u003cli\u003eWang A, Al-Kuhlani M, Johnston SC, et al. Transcription factor complex AP-1 mediates inflammation initiated by Chlamydia pneumoniae infection. Cell Microbiol. 2013;15(5):779-794.\u003c/li\u003e\n\u003cli\u003eArancio W, Coronnello C. Repetitive Sequence Transcription in Breast Cancer. Cells (Basel, Switzerland). 2022;11(16):2522.\u003c/li\u003e\n\u003cli\u003eLin P, Chen Y, Xu J, et al. A multicenter-retrospective cohort study of chromosome instability in lung cancer: clinical characteristics and prognosis of patients harboring chromosomal instability detected by metagenomic next-generation sequencing. J Thorac Dis. 2023;15(1):112-122.\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":true,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"npj-digital-medicine","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"npjdigitalmed","sideBox":"Learn more about [npj Digital Medicine](http://www.nature.com/npjdigitalmed/)","snPcode":"41746","submissionUrl":"https://submission.springernature.com/new-submission/41746/3","title":"npj Digital Medicine","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"NPJ","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"metagenomic next-generation sequencing, mNGS, lung cancer, pulmonary infections","lastPublishedDoi":"10.21203/rs.3.rs-6108429/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-6108429/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eRecent advances in unbiased metagenomic next-generation sequencing (mNGS) enable simultaneous examination of microbial and host genetic material. In this study, we developed a multimodal machine learning-based diagnostic approach to differentiate lung cancer and pulmonary infections using 402 bronchoalveolar lavage fluid (BALF) mNGS datasets. The training cohort revealed differences in DNA/RNA microbial composition, bacteriophage abundances, and host responses, including gene expression, transposable element levels, immune cell composition, and tumor fraction derived from copy number variation (CNV). The diagnostic model (Model VI) that integrated these differential features demonstrated an AUC of 0.937 (95% CI\u0026thinsp;=\u0026thinsp;0.91\u0026ndash;0.964) in the training cohort and 0.847 (95% CI\u0026thinsp;=\u0026thinsp;0.776\u0026ndash;0.918) in the validation cohort for distinguishing lung cancer from pulmonary infections. 
The application of a rule-in and rule-out strategy-based composite predictive model significantly enhanced accuracy (ACC) in distinguishing between lung cancer and tuberculosis (ACC\u0026thinsp;=\u0026thinsp;0.896), fungal infection (ACC\u0026thinsp;=\u0026thinsp;0.915), and bacterial infection (ACC\u0026thinsp;=\u0026thinsp;0.907). These findings underscore the potential of cost-effective mNGS-based analysis for early differentiation between lung cancer and pulmonary infections.\u003c/p\u003e","manuscriptTitle":"Multimodal Metagenomic Profiling of Bronchoalveolar Lavage Fluid for Diagnostic Classification of Pulmonary Diseases","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-03-03 08:36:41","doi":"10.21203/rs.3.rs-6108429/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Revision requested","date":"2025-05-27T13:53:44+00:00","index":"","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-05-25T19:49:45+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"182785732154560850968881233343594495199","date":"2025-05-16T11:10:59+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-04-24T14:33:17+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"250054591509395262824506032040114894091","date":"2025-04-18T12:18:10+00:00","index":"hide","fulltext":""},{"type":"reviewersInvited","content":"","date":"2025-02-28T08:19:51+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2025-02-28T00:56:45+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2025-02-27T11:11:28+00:00","index":"","fulltext":""},{"type":"submitted","content":"npj Digital Medicine","date":"2025-02-25T22:37:03+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"npj-digital-medicine","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"npjdigitalmed","sideBox":"Learn more about [npj Digital Medicine](http://www.nature.com/npjdigitalmed/)","snPcode":"41746","submissionUrl":"https://submission.springernature.com/new-submission/41746/3","title":"npj Digital Medicine","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"NPJ","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"a04ac24f-8632-426d-9435-403313b3ae2a","owner":[],"postedDate":"March 3rd, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"published-in-journal","subjectAreas":[{"id":45069934,"name":"Biological sciences/Biological techniques/Sequencing/Next generation sequencing"},{"id":45069935,"name":"Biological sciences/Biological techniques/Sequencing/Rna sequencing"}],"tags":[],"updatedAt":"2025-10-13T16:08:49+00:00","versionOfRecord":{"articleIdentity":"rs-6108429","link":"https://doi.org/10.1038/s41746-025-01977-5","journal":{"identity":"npj-digital-medicine","isVorOnly":false,"title":"npj Digital Medicine"},"publishedOn":"2025-10-07 15:57:25","publishedOnDateReadable":"October 7th, 2025"},"versionCreatedAt":"2025-03-03 08:36:41","video":"","vorDoi":"10.1038/s41746-025-01977-5","vorDoiUrl":"https://doi.org/10.1038/s41746-025-01977-5","workflowStages":[]},"version":"v1","identity":"rs-6108429","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-6108429","identity":"rs-6108429","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}