Improving Early Prostate Cancer Detection Using Fragmentomics and Ensemble Machine Learning Models

doi:10.21203/rs.3.rs-6166592/v1

2025 · doi:10.21203/rs.3.rs-6166592/v1

preprint OA: closed

Research Square · Article

Zhixuan Fu, Qiaozhen Hong, Yipeng Xu, Jingyu Wang, Shaliu Fu, Dan Li, Jifei Zhang, Zhiwen Pan

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-6166592/v1
This work is licensed under a CC BY 4.0 License.
Status: Posted (Version 1).

Abstract

Prostate cancer (PCa) detection remains challenging when prostate-specific antigen (PSA) levels fall within the ambiguous 4–20 ng/mL range. This study aimed to develop a non-invasive, cost-effective method for early PCa detection using circulating tumor DNA (ctDNA) fragmentomic features. Whole-genome sequencing of cell-free DNA (cfDNA) from plasma samples was performed in a training cohort (78 PCa patients, 99 non-cancer controls) and an independent test cohort (57 non-metastatic PCa, 20 metastatic cases, 55 controls).
Three fragmentomic features were analyzed using ensemble machine learning models (GLM, GBM, XGBoost): fragment size distribution (FSD), fragment size coverage (FSC), and copy number variance (CNV). The FSD model performed best (AUROC: 0.898 in training, 0.95 in testing), including in high-risk patients (PSA 4–20 ng/mL; AUROC: 0.907). A composite PCa score integrating FSD, FSC, and CNV achieved AUROCs of 0.901 (training) and 0.89 (testing), outperforming PSA (AUROCs: 0.875 and 0.855, respectively). In the high-risk subgroup (PSA 4–20 ng/mL), the PCa score maintained high sensitivity (79.31%) and specificity (84%). Despite limitations in cohort size and external validation, this study highlights the clinical potential of cfDNA fragmentomics for early PCa detection, especially in diagnostically ambiguous PSA ranges.

Subjects: Biological sciences/Cancer/Cancer screening; Biological sciences/Cancer/Tumour biomarkers; Health sciences/Molecular medicine

Figures: Figure 1

Introduction

Prostate cancer (PCa) ranks as the second most prevalent cancer in men globally and the third leading cause of cancer-related deaths [1]. Prostate-specific antigen (PSA) serves as a pivotal indicator of PCa, with elevated levels proving effective in identifying patients [2]. However, the diagnostic efficacy of PSA diminishes when levels fall between 4 ng/mL and 20 ng/mL, where it has limited ability to differentiate between cancer and non-cancer cohorts. This underscores the need for precise and cost-effective methods for early-stage PCa detection, particularly in high-risk groups.

Liquid biopsy based on circulating tumor DNA (ctDNA) has emerged as a minimally invasive and accurate modality for cancer detection [3]. ctDNA is the fraction of cell-free DNA (cfDNA) in a patient's bloodstream that originates from tumor cells.
Recent studies have highlighted substantial differences in cfDNA fragment size patterns between individuals with and without cancer [4]. Tumor-specific fragmentomic features have the potential to improve early-stage cancer detection substantially and warrant comprehensive validation in PCa.

Results

Superior Performance of FSD in PCa Discrimination

We developed a workflow for PCa detection based on cfDNA whole-genome sequencing (WGS). The workflow, which includes cfDNA sequencing and model construction, is illustrated in Supplementary Fig. 1. We extracted cfDNA from plasma samples of individuals with PCa and individuals without cancer (Supplementary Fig. 1a). The training cohort consisted of 78 previously untreated individuals with non-metastatic PCa (33 localized and 45 locally advanced cases) and 99 age-matched individuals without cancer (49 biopsy-negative and 50 healthy) (Supplementary Fig. 1b). An independent test cohort included 57 individuals with non-metastatic PCa (31 localized and 26 locally advanced), 20 individuals with metastatic hormone-sensitive PCa (mHSPC), and 55 age-matched individuals without cancer (Supplementary Fig. 1c). Our PCa detection model incorporated three types of fragmentomic features: fragment size distribution (FSD), fragment size coverage (FSC), and copy number variance (CNV). Each feature type was used independently to train an ensemble of a Generalized Linear Model (GLM), a Gradient Boosting Model (GBM), and XGBoost (XGB) (Supplementary Fig. 1c). These per-feature models were then integrated using a GLM to produce the final PCa score (Supplementary Methods).
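To make the two-level design concrete, the following Python sketch illustrates the second-level step only: three per-feature model outputs are combined through a logistic (GLM-style) link into one score, which is then compared with the 0.5 cutoff used in this study. It is an illustration under stated assumptions, not the authors' R implementation. The function names, the coefficient values, and the use of `>=` at the cutoff are all hypothetical; the fitted coefficients are not reported in the text.

```python
import math


def logistic(z: float) -> float:
    """Map a linear predictor to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def pca_score(p_fsd: float, p_fsc: float, p_cnv: float,
              coef: dict, intercept: float) -> float:
    """Combine the three per-feature model outputs with a logistic link."""
    z = (intercept
         + coef["fsd"] * p_fsd
         + coef["fsc"] * p_fsc
         + coef["cnv"] * p_cnv)
    return logistic(z)


def classify(score: float, cutoff: float = 0.5) -> str:
    """Label a sample; treating score == cutoff as positive is an assumption."""
    return "PCa" if score >= cutoff else "non-cancer"


# Hypothetical coefficients chosen only for illustration (not fitted values).
COEF = {"fsd": 4.0, "fsc": 1.0, "cnv": 1.0}
INTERCEPT = -3.0

if __name__ == "__main__":
    for outputs in [(0.9, 0.5, 0.6), (0.1, 0.4, 0.3)]:
        score = pca_score(*outputs, COEF, INTERCEPT)
        print(outputs, round(score, 3), classify(score))
```

In a real pipeline the coefficients would be estimated from out-of-fold predictions of the first-level models, so that the second-level model never sees predictions made on a sample's own training data.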
To evaluate how well each fragmentomic feature (FSC, FSD, and CNV) distinguishes patients with PCa from those without, we trained GLM, GBM, and XGB models on the 177 training samples and assessed them using predictions on the held-out folds of five-fold cross-validation. Among the three features, the FSD model performed best, achieving an AUROC of 0.898 (95% CI, 0.854–0.942) across all patients in the training cohort (Supplementary Fig. 2a). In high-risk patients with PSA levels above 4 and below 20 ng/mL, the FSD model reached an AUROC of 0.907 (95% CI, 0.841–0.971), with 89.6% sensitivity and 84.1% specificity (Supplementary Fig. 2b). The FSC and CNV models performed moderately, with AUROC values of 0.694 (95% CI, 0.616–0.772) and 0.766 (95% CI, 0.697–0.836), respectively (Supplementary Fig. 2a). This trend persisted among patients with PSA of 4–20 ng/mL, where the FSC and CNV models yielded AUROC values of 0.706 (95% CI, 0.596–0.816) and 0.714 (95% CI, 0.609–0.819), respectively (Supplementary Fig. 2b).

To further assess the robustness of the fragmentomic ensemble models, we applied them to a separate test cohort of 112 individuals who had not undergone prior intervention (Supplementary Fig. 3). The FSD model performed strongly both across all tested individuals and within the high-risk population with PSA of 4–20 ng/mL, with AUROC values of 0.95 (95% CI, 0.852–1) and 0.923 (95% CI, 0.772–1), respectively (Supplementary Fig. 3a-b). By contrast, the FSC and CNV models performed more weakly across the entire test cohort, with AUROC values of 0.6 (95% CI, 0.417–0.783) and 0.615 (95% CI, 0.425–0.805), respectively (Supplementary Fig. 3a).
Among individuals with PSA levels between 4 and 20 ng/mL, the FSC and CNV models also showed limited discriminatory power (AUROC values of 0.6 and 0.523, respectively) (Supplementary Fig. 3b). Comparing the sensitivity and specificity of the three feature models, we found that the FSD model maintained high sensitivity and specificity across all samples and within the high-risk population, consistent with its performance in the training cohort (Supplementary Fig. 3c).

Integrated PCa Score Outperforms PSA in Diagnostic Accuracy

To improve the accuracy and robustness of fragmentomic models in detecting PCa, we developed a GLM that integrates the FSC, FSD, and CNV models and generates a PCa score for each patient. Across all patients in the training set, the PCa score achieved an AUROC of 0.901 (95% CI, 0.858–0.944), surpassing the diagnostic accuracy of PSA (AUROC: 0.875, 95% CI, 0.826–0.925) (Supplementary Fig. 4a). At a cutoff score of 0.5, the sensitivity and specificity of the PCa score were 78.2% and 83.83%, respectively, with a positive predictive value (PPV) of 79.22% and a negative predictive value (NPV) of 83% (Supplementary Fig. 4c, d). Among patients with PSA levels between 4 and 20 ng/mL, the PCa score exhibited a significantly higher AUROC (0.909, 95% CI, 0.845–0.972) than PSA (0.676, 95% CI, 0.566–0.786), with 85.42% sensitivity and 84.1% specificity, a PPV of 85.42%, and an NPV of 84.1% (Supplementary Fig. 4b, d). Compared with the FSD model alone, the secondary GLM modestly improved performance in the training cohort, suggesting that the FSC and CNV models may contribute to the PCa score. Finally, the PCa score was evaluated in an independent test dataset comprising 112 samples.
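As a quick consistency check on the training-cohort figures reported at the 0.5 cutoff, the four metrics can be recomputed from a confusion matrix using the standard definitions given in the Statistical analysis section. The counts in this Python sketch (61 true positives, 17 false negatives, 83 true negatives, 16 false positives) are not reported in the text. They are inferred here from the cohort sizes (78 PCa, 99 non-cancer) and the published percentages, so they should be read as an assumption.

```python
def diagnostic_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Sensitivity, specificity, PPV and NPV from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # TP / (TP + FN)
        "specificity": tn / (tn + fp),  # TN / (TN + FP)
        "ppv": tp / (tp + fp),          # TP / (TP + FP)
        "npv": tn / (tn + fn),          # TN / (TN + FN)
    }


if __name__ == "__main__":
    # Inferred (not reported) counts for the training cohort: 78 PCa, 99 controls.
    training = diagnostic_metrics(tp=61, fn=17, tn=83, fp=16)
    for name, value in training.items():
        print(f"{name}: {value * 100:.2f}%")
```

Under these inferred counts the recomputed values agree with the reported sensitivity, specificity, PPV and NPV to within rounding in the second decimal place.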
Consistent with its performance in the training cohort, the PCa score reached an AUROC of 0.89 (95% CI, 0.826–0.954) in the test cohort, exceeding the AUROC of PSA, which was 0.855 (95% CI, 0.74–0.97) (Fig. 1a). At a cutoff score of 0.5, the PCa score had a sensitivity of 77.19%, specificity of 83.64%, PPV of 83.02%, and NPV of 77.97%, mirroring the results observed in the training cohort (Fig. 1c, d). Among patients with PSA levels between 4 and 20 ng/mL, the PCa score maintained high accuracy, with an AUROC of 0.886 (95% CI, 0.787–0.985), sensitivity of 79.31%, specificity of 84%, PPV of 88.46%, and NPV of 73.91% (Fig. 1b, d), highlighting the effectiveness of the PCa score for detecting PCa in high-risk populations. Furthermore, when the PCa detection model was applied to the 20 patients with metastatic PCa, their scores fell in a range consistent with that of the non-metastatic group (Fig. 1c), indicating that the model detects PCa signals in both metastatic and non-metastatic samples.

Discussion

This study presents evidence supporting the utility of cfDNA fragmentomic features, particularly fragment size distribution (FSD), in improving the detection of early-stage prostate cancer (PCa), especially within the diagnostically challenging PSA "gray zone" (4–20 ng/mL). The results underscore the potential of liquid biopsy as a non-invasive tool for PCa diagnosis.

The FSD model emerged as the most robust single-feature classifier, achieving an AUROC of 0.898 in the training cohort and 0.95 in the independent test cohort. This performance exceeded that of both FSC (AUROC: 0.694 in training, 0.6 in testing) and CNV (AUROC: 0.766 in training, 0.615 in testing) and aligns with prior observations that tumor-derived cfDNA exhibits fragmentation patterns distinct from those of non-cancerous cfDNA.
The shorter fragment sizes characteristic of tumor DNA likely contribute to the discriminative power of FSD, as cancer cells release DNA through mechanisms such as apoptosis and necrosis, which differ from the physiological processes governing cfDNA release in healthy individuals. Notably, the FSD model maintained high sensitivity (89.6%) and specificity (84.1%) in the PSA gray zone of the training cohort, addressing a critical limitation of conventional PSA testing.

Integrating FSD, FSC, and CNV into a composite PCa score via a secondary GLM improved on PSA alone. In the training cohort, the PCa score achieved an AUROC of 0.901 (vs. PSA: 0.875), with a larger gain in the PSA gray zone (AUROC: 0.909 vs. PSA: 0.676). The advantage over PSA was replicated in the independent test cohort (AUROC: 0.89 vs. PSA: 0.855), supporting generalizability. The PCa score's sensitivity (77.19–85.42%) and specificity (83.64–84.1%) further highlight its clinical utility, particularly given the moderate PPV (79.22–88.46%) and NPV (73.91–84.1%), which are comparable to or better than those of existing biomarkers. Importantly, the PCa score's ability to detect both non-metastatic and metastatic PCa signals (Fig. 1c) suggests broad applicability across disease stages, though further validation in metastatic cohorts is warranted.

The study's focus on patients with PSA levels of 4–20 ng/mL addresses a critical unmet need. Traditional PSA testing in this range suffers from low specificity, leading to unnecessary biopsies and overdiagnosis. The PCa score's AUROC of 0.909 in this subgroup of the training cohort, coupled with 85.42% sensitivity and 84.1% specificity, represents a substantial advance over PSA in this range. By reducing diagnostic ambiguity, this approach could minimize invasive procedures for benign cases while ensuring timely detection of malignancies.
However, the relatively small cohorts (training: 33 localized + 45 locally advanced PCa; test: 31 localized + 26 locally advanced), and the necessarily smaller PSA 4–20 ng/mL subgroups within them, call for caution in extrapolating these results to larger populations.

Despite the advantages of the PCa score and its non-invasive fragmentomic profiling for detecting early-stage PCa, our study has limitations. First, while our analyses produced robust results within the early-stage validation cohort and among high-risk individuals, a more extensive prospective validation study within a broader screening population is needed before clinical implementation. Second, multicenter training and validation studies are necessary to establish the external validity and generalizability of PCa detection models, which the current study could not address. Third, recent research has shown the potential of multimodal approaches in early cancer detection [5], which integrate additional fragmentation features such as breakpoint motifs and end motifs of cfDNA sequences; our study did not integrate these features into the early PCa detection model. Finally, although the PCa score modestly outperformed the single FSD model in the training cohort (AUROC: 0.901 vs. 0.898), it did not improve on the standalone FSD model in the test cohort (AUROC: 0.89 vs. FSD: 0.95), which raises questions about the added value of FSC and CNV. This may reflect the limited training cohort size, overfitting in the training phase, or insufficient contribution from weaker features, underscoring the need for feature optimization.
Despite these limitations, the central observation that non-invasive and cost-effective cfDNA fragmentation analyses can differentiate early-stage PCa patients from non-cancerous individuals underscores the potential of identifying PCa not only within high-risk patient groups but also within the broader general population.

Methods

Patient Plasma Samples and study design

This study enrolled 135 non-metastatic PCa patients, 20 metastatic PCa patients, and 154 non-cancer individuals (76 biopsy-negative patients and 78 healthy volunteers) from Zhejiang Cancer Hospital in China (Supplementary Tables). The 135 non-metastatic PCa patients comprised 64 localized and 71 locally advanced cases. Plasma from all enrolled PCa patients was collected before any drug or surgical treatment, which excludes the influence of clinical therapy on ctDNA abundance. Healthy samples were collected at routine health check-ups and were restricted to individuals with PSA levels below 4 ng/mL to reduce the likelihood of undiagnosed prostate cancer. Biopsy-negative samples were obtained from a prostate cancer clinic, from patients who exhibited symptoms of prostate abnormalities but whose biopsy results were negative. This study was approved by the Ethics Committees of Zhejiang Cancer Hospital (Approval No. IRB-2021-247). All methods were performed in accordance with the relevant guidelines and regulations.

cfDNA extraction, library preparation and whole genome sequencing

For each sample, 5 mL of peripheral blood was collected in EDTA tubes and processed within 2 h. Plasma was isolated by centrifugation at 1,600 g for 20 min at room temperature, followed by a second centrifugation at 15,000 g for 10 min at 4 °C in microcentrifuge tubes. cfDNA was extracted from 2–3 mL of plasma using the Qiagen Circulating Nucleic Acids Kit (Qiagen) according to the manufacturer's instructions, quantified with a Qubit fluorometer (Life Technologies), and stored at −80 °C.
For library preparation, 10 ng of cfDNA was used with the Kapa HyperPrep Kit (Kapa Biosystems). The library was amplified with KAPA HiFi HotStart ReadyMix (KAPA Biosystems) and NEBNext Multiplex Oligos for Illumina (New England BioLabs) as follows: initial denaturation at 95 °C for 3 min; 4 cycles of 98 °C for 20 s, 65 °C for 15 s, and 72 °C for 30 s; and a final extension at 72 °C for 1 min. After purification with Beckman Agencourt AMPure XP beads, amplified libraries were quantified with a Qubit fluorometer (Life Technologies) and a Bioanalyzer 2100 (Agilent), then pooled and sequenced (Wuxi NextCode) on an Illumina NovaSeq 6000 system to generate 150 bp paired-end reads.

Bioinformatics analysis and machine-learning modeling

Sequenced reads were aligned to the hg19 genome using Bowtie2 with default settings. The resulting SAM files were converted to BAM format, duplicate reads were removed, and the reads were sorted and indexed using SAMtools for subsequent analysis. Mapping quality was evaluated using the SAMtools flagstat function. Following previous studies, we calculated three fragment-related features for each patient: fragment size distribution (FSD), fragment size coverage (FSC), and copy number variance (CNV). For the CNV feature, we divided the hg19 genome into 5-Mb bins, excluding bins that overlapped known blacklisted regions, counted reads in each bin with the readCounter function of HMMcopy utils (https://github.com/shahcompbio/hmmcopy_utils), and corrected the counts for GC content with ichorCNA. The FSC and FSD features were calculated using custom scripts from the GitHub repository https://github.com/adamtongji/PCa_frag_manuscript.
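As a rough illustration of the two fragment-size features, the Python sketch below computes them from a list of fragment lengths for one genomic region, using the size ranges stated in these Methods (short 100–150 bp, intermediate 151–220 bp, long 221–300 bp, and 5-bp steps from 100 to 300 bp). It is not the authors' custom script. Several details are assumptions made for the example: coverage is approximated by fragment counts, the 5-bp bins are half-open (100–104, 105–109, ..., 295–299), the function names are hypothetical, and the input is assumed to be the fragment lengths already extracted for a single region or chromosome arm.

```python
from collections import Counter

# Size classes from the Methods (inclusive bounds, in bp).
SHORT = range(100, 151)         # 100-150 bp
INTERMEDIATE = range(151, 221)  # 151-220 bp
LONG = range(221, 301)          # 221-300 bp


def fsc(lengths):
    """Return (short/intermediate, long/intermediate) fragment-count ratios.

    Assumption: coverage is approximated by the number of fragments.
    """
    n_short = sum(1 for n in lengths if n in SHORT)
    n_mid = sum(1 for n in lengths if n in INTERMEDIATE)
    n_long = sum(1 for n in lengths if n in LONG)
    if n_mid == 0:
        raise ValueError("no intermediate fragments in this region")
    return n_short / n_mid, n_long / n_mid


def fsd(lengths, start=100, stop=300, step=5):
    """Return the fraction of all fragments falling in each 5-bp size bin.

    Assumption: bins are half-open, [start, start + step), ..., up to stop.
    The denominator is every fragment in the region, including those
    outside the start-stop window.
    """
    total = len(lengths)
    if total == 0:
        raise ValueError("no fragments in this region")
    counts = Counter((n - start) // step for n in lengths if start <= n < stop)
    n_bins = (stop - start) // step
    return [counts.get(i, 0) / total for i in range(n_bins)]


if __name__ == "__main__":
    example = [120, 140, 166, 167, 170, 180, 250, 310]  # toy fragment lengths
    print(fsc(example))
    print(fsd(example))
```

Applying `fsd` to each chromosome arm in turn and concatenating the resulting vectors would give one feature vector per sample, which is the shape of input a downstream classifier needs.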
In detail, the FSC feature was calculated as the coverage of short (100–150 bp) and long (221–300 bp) cfDNA fragments divided by the coverage of intermediate (151–220 bp) fragments, to account for tumor-derived cfDNA being either shorter or longer than typical fragments. The FSD feature was calculated as the fraction of cfDNA fragments in each 5-bp step from 100 bp to 300 bp among all cfDNA fragments on each chromosome arm. The FSD, FSC, and CNV features were then used for model training and testing with all machine learning algorithms.

We applied three machine learning algorithms, the Generalized Linear Model (GLM), Gradient Boosting Model (GBM), and XGBoost (XGB), to construct ensemble models for each feature type. The GLM and GBM models were implemented with the R caret package, and the XGB model with the R xgboost package. We then constructed an elastic-net GLM (alpha = 0.5 in caret) trained on the output predictions of the CNV, FSD, and FSC models under 5-fold cross-validation (CV), and took the predictions of this GLM as the PCa score. Model performance was evaluated by computing the area under the receiver operating characteristic curve (AUROC) from per-sample predictions. For the training set, we evaluated each model on the held-out fold of each 5-fold CV split, which served as an internal test set. For the independent test set, we applied the final models trained on all training samples.

Statistical analysis

Receiver operating characteristic (ROC) curves were generated using the pROC package. Based on the true positive (TP), true negative (TN), false positive (FP), and false negative (FN) counts of cancer prediction at a threshold of 0.5, we calculated sensitivity [TP/(TP + FN)], specificity [TN/(TN + FP)], positive predictive value (PPV) [TP/(TP + FP)], and negative predictive value (NPV) [TN/(TN + FN)]. The corresponding 95% confidence intervals were calculated using the ROCR package in R.

Declarations

Conflicts of Interest

J.
Wang, S. Fu, and D. Li work at Suzhou Danbei Medical Technology Co., Ltd, a company that focuses on early cancer detection. No potential conflicts of interest were disclosed by the other authors.

Availability of data and materials

The data supporting the conclusions of this study are included in the article and its Supplementary files. All R packages used are available online as described in the Methods section. Custom code for data processing and visualization can be accessed in the GitHub repository (https://github.com/adamtongji/PCa_frag_manuscript).

Ethics approval and consent to participate

This study was approved by the Ethics Committees of Zhejiang Cancer Hospital (Approval No. IRB-2021-247). All participants provided written informed consent. All methods were performed in accordance with the relevant guidelines and regulations, including the Declaration of Helsinki and institutional ethical review protocols.

Author Contribution

Zhixuan Fu, Qiaozhen Hong, Yipeng Xu, and Jingyu Wang wrote the main manuscript text and contributed equally to the article. Fu and Dan Li prepared figures 1-4. All authors reviewed the manuscript.

Funding

This work was supported by grants from the China Postdoctoral Science Foundation (2023M742651, GZC20231946) and the National Natural Science Foundation of China (No. 83172210).

References

1. Sung H, Ferlay J, Siegel RL, Laversanne M, Soerjomataram I, Jemal A, Bray F: Global Cancer Statistics 2020: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries. CA Cancer J Clin 71, 209–249 (2021).
2. Ang M, Rajcic B, Foreman D, Moretti K, O'Callaghan ME: Men presenting with prostate-specific antigen (PSA) values of over 100 ng/mL. BJU Int 117, 68–75 (2016).
3. Tivey A, Church M, Rothwell D, Dive C, Cook N: Circulating tumour DNA - looking beyond the blood. Nat Rev Clin Oncol 19, 600–612 (2022).
4. Cristiano S, Leal A, Phallen J, et al: Genome-wide cell-free DNA fragmentation in patients with cancer. Nature 570, 385–389 (2019).
5. Ma X, Chen Y, Tang W, et al: Multi-dimensional fragmentomic assay for ultrasensitive early detection of colorectal advanced adenoma and adenocarcinoma. J Hematol Oncol 14, 175 (2021).

Additional Declarations

No competing interests reported.

Supplementary Files

Supplementarypaperfig1.pdf
Supplementarypaperfig2.pdf
Supplementarypaperfig3.pdf
Supplementarypaperfig4.pdf
SupplementaryFigures.pdf
Supplementarytables.xlsx
Research Square, ISSN 2693-5015 (online). Preprint record created 2025-03-06.

Author affiliations

Zhixuan Fu, Yipeng Xu, Zhiwen Pan (corresponding author): Zhejiang Cancer Hospital
Qiaozhen Hong, Jifei Zhang: Quzhou Kecheng People's Hospital
Jingyu Wang, Shaliu Fu, Dan Li: Suzhou Danbei Medical Technology Co., Ltd

Figure 1. Performance of the PCa detection model in the test cohort. (a) ROC curve illustrating the performance of the PCa detection model utilizing the cancer score feature and the PSA feature model across the entire test cohort. (b) ROC curves for subsets of the test cohort with PSA levels above 4 and below 20. (c) Violin plot displaying PCa scores in the entire test cohort categorized by non-cancer, non-metastatic, and metastatic participants. The threshold for PCa classification is set at 0.5. (d) Table showing the performance of the PCa detection model in the entire test cohort and its subsets with PSA > 4 and PSA < 20.
Conversely, the FSC and CNV models exhibited moderate performance, with AUROC values of 0.694 (95% CI, 0.616\u0026ndash;0.772) and 0.766 (95% CI, 0.697\u0026ndash;0.836), respectively (Supplementary Fig.\u0026nbsp;2a). This trend persisted among patients within the PSA range of 4 to 20, with the FSC and CNV models yielding AUROC values of 0.706 (95% CI, 0.596\u0026ndash;0.816) and 0.714 (95% CI, 0.609\u0026ndash;0.819), respectively (Supplementary Fig.\u0026nbsp;2b).\u003c/p\u003e \u003cp\u003eTo further assess the robustness of the fragmentomic ensemble models, we applied them to a separate test cohort comprising 112 individuals who had not undergone prior intervention (Supplementary Fig.\u0026nbsp;3). Impressively, the FSD model exhibited robust and outstanding performance across all tested individuals and within the high-risk population with PSA levels ranging between 4 and 20, demonstrating AUROC values of 0.95 (95% CI, 0.852\u0026thinsp;\u0026minus;\u0026thinsp;1) and 0.923 (95% CI, 0.772\u0026thinsp;\u0026minus;\u0026thinsp;1), respectively (Supplementary Fig.\u0026nbsp;3a-b). Conversely, the FSC and CNV models showed relatively diminished performance across the entire test cohort, with AUROC values of 0.6 (95% CI, 0.417\u0026thinsp;\u0026minus;\u0026thinsp;0.783) and 0.615 (95% CI, 0.425\u0026thinsp;\u0026minus;\u0026thinsp;0.805), respectively (Supplementary Fig.\u0026nbsp;3a). Additionally, among individuals with PSA levels between 4 and 20, the FSC and CNV models exhibited limited discriminatory power (with AUROC values of 0.6 and 0.523, respectively) (Supplementary Fig.\u0026nbsp;3b). 
We further compared the sensitivity and specificity of the three feature models and found that the FSD model demonstrated high sensitivity and specificity across all samples and within the high-risk populations, consistent with its performance in the training cohort (Supplementary Fig.\u0026nbsp;3c).\u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003eIntegrated PCa Score Outperforms PSA in Diagnostic Accuracy\u003c/h3\u003e\n\u003cp\u003eTo enhance the accuracy and resilience of fragmentomic models in detecting PCa, we developed a GLM model that integrates the FSC, FSD, and CNV models, generating a PCa score for each patient. Subsequently, we evaluated these scores. Impressively, across all patients in the training set, the PCa score achieved an AUROC of 0.901 (95% CI, 0.858\u0026ndash;0.944), surpassing the diagnostic accuracy of PSA (AUROC: 0.875, 95% CI, 0.826\u0026ndash;0.925) (Supplementary Fig.\u0026nbsp;4a). The corresponding sensitivity and specificity of the PCa score were 78.2% and 83.83%, respectively, with a positive predictive value (PPV) of 79.22% and a negative predictive value (NPV) of 83%, using a cutoff score of 0.5 (Supplementary Fig.\u0026nbsp;4c, d). Specifically, among patients with PSA levels between 4 and 20 ng/mL, the PCa score exhibited a significantly higher AUROC (0.909, 95% CI, 0.845\u0026ndash;0.972) compared to PSA (0.676, 95% CI, 0.566\u0026ndash;0.786), accompanied by 85.42% sensitivity and 84.1% specificity, as well as a PPV of 85.42% and NPV of 84.1% (Supplementary Fig.\u0026nbsp;4b, d). Furthermore, compared with the FSD model, the secondary GLM modestly improved performance in the training cohort, suggesting potential contributions of the FSC and CNV models to the PCa score.\u003c/p\u003e \u003cp\u003eThe PCa score was finally evaluated using an independent test dataset comprising 112 samples. 
Consistent with its high performance in the training cohort, the PCa score exhibited an AUROC of 0.89 (95% CI, 0.826\u0026ndash;0.954), surpassing the AUROC of PSA, which stood at 0.855 (95% CI, 0.74\u0026ndash;0.97) (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003ea). Using a cutoff score of 0.5, the PCa score demonstrated a sensitivity of 77.19%, specificity of 83.64%, PPV of 83.02%, and NPV of 77.97%, mirroring the results observed in the training cohort (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003ec, d). Among patients with PSA levels between 4 and 20 ng/mL, the PCa score maintained high accuracy, with an AUROC of 0.886 (95% CI, 0.787\u0026ndash;0.985), sensitivity of 79.31%, specificity of 84%, PPV of 88.46%, and NPV of 73.91% (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003eb, d), thus highlighting the effectiveness of the PCa score for detecting PCa in high-risk populations. Furthermore, when the PCa detection model was applied to the 20 patients with metastatic PCa in the test cohort, their scores fell in a range consistent with that of the non-metastatic group (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003ec), indicating the detection of PCa signals in both metastatic and non-metastatic samples.\u003c/p\u003e"},{"header":"Discussion","content":"\u003cp\u003eThe study presents compelling evidence supporting the utility of cfDNA fragmentomic features, particularly fragment size distribution (FSD), in enhancing the detection of early-stage prostate cancer (PCa), especially within the diagnostically challenging PSA \"gray zone\" (4\u0026ndash;20 ng/mL). 
The results highlight several critical advancements and underscore the potential of liquid biopsy as a non-invasive tool for PCa diagnosis.\u003c/p\u003e \u003cp\u003eThe FSD model emerged as the most robust single-feature classifier, achieving an AUROC of 0.898 in the training cohort and 0.95 in the independent test cohort. This exceptional performance, surpassing both FSC (AUROC: 0.694\u0026ndash;0.6) and CNV (AUROC: 0.766\u0026ndash;0.615), aligns with prior observations that tumor-derived cfDNA exhibits distinct fragmentation patterns compared to non-cancerous cfDNA. The shorter fragment sizes characteristic of tumor DNA likely contribute to the discriminative power of FSD, as cancer cells release DNA through mechanisms such as apoptosis and necrosis, which differ from the physiological processes governing cfDNA release in healthy individuals. Notably, the FSD model maintained high sensitivity (89.6%) and specificity (84.1%) in the PSA gray zone, addressing a critical limitation of conventional PSA testing.\u003c/p\u003e \u003cp\u003eThe integration of FSD, FSC, and CNV into a composite PCa score via a secondary GLM yielded significant improvements over PSA alone. In the training cohort, the PCa score achieved an AUROC of 0.901 (vs. PSA: 0.875), with enhanced performance in the PSA gray zone (AUROC: 0.909 vs. PSA: 0.676). This improvement was replicated in the independent test cohort (AUROC: 0.89 vs. PSA: 0.855), demonstrating robust generalizability. The PCa score\u0026rsquo;s sensitivity (77.19\u0026ndash;85.42%) and specificity (83.64\u0026ndash;84.1%) further highlight its clinical utility, particularly given the moderate PPV (79.22\u0026ndash;88.46%) and NPV (77.97\u0026ndash;84%), which are comparable to or better than existing biomarkers. 
Importantly, the PCa score\u0026rsquo;s ability to detect both non-metastatic and metastatic PCa signals (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003ec) suggests broad applicability across disease stages, though further validation in metastatic cohorts is warranted.\u003c/p\u003e \u003cp\u003eThe study\u0026rsquo;s focus on patients with PSA levels of 4\u0026ndash;20 ng/mL addresses a critical unmet need. Traditional PSA testing in this range suffers from low specificity, leading to unnecessary biopsies and overdiagnosis. The PCa score\u0026rsquo;s AUROC of 0.909 in this subgroup, coupled with 85.42% sensitivity and 84.1% specificity, represents a paradigm shift. By reducing diagnostic ambiguity, this approach could minimize invasive procedures for benign cases while ensuring timely detection of malignancies. However, the relatively small sample size in this subgroup (training: 33 localized\u0026thinsp;+\u0026thinsp;45 locally advanced PCa; test: 31 localized\u0026thinsp;+\u0026thinsp;27 locally advanced) necessitates caution in extrapolating these results to larger populations.\u003c/p\u003e \u003cp\u003eDespite the significant advantages offered by the PCa score and its non-invasive fragmentomic profile analyses for detecting early-stage PCa, our study possesses certain limitations. Initially, while our analyses produced robust results within the early-stage validation cohort and among high-risk individuals, it is imperative to conduct a more extensive prospective validation study within a broader screening population before clinical implementation. Additionally, conducting multicenter training and validation studies is necessary to ensure the external validity and generalizability of PCa detection models, a factor limited in the current study. 
Recent research has shown the potential of multimodal approaches to early cancer detection [5], which integrate additional fragmentation features such as breakpoint motifs and end motifs of cfDNA sequences. Our study did not integrate these features into the early PCa detection model. Furthermore, although the current PCa detection model appeared to outperform the single FSD model in the training cohort, it did not improve on the accuracy of the FSD model in the test cohort, possibly because of the limited size of the training cohort. In the test cohort, the PCa score\u0026rsquo;s AUROC was lower than that of the standalone FSD model (0.89 vs. 0.95), which raises questions about the added value of FSC and CNV. This may reflect overfitting in the training phase or insufficient contribution from weaker features, underscoring the need for feature optimization. Despite these limitations, the crucial observation that noninvasive and cost-effective cfDNA fragmentation analyses can differentiate early-stage PCa patients from noncancerous individuals underscores the potential of identifying PCa not only within high-risk patient groups but also within the broader general population.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e"},{"header":"Methods","content":"\u003cdiv id=\"Sec7\" class=\"Section2\"\u003e \u003ch2\u003ePatient Plasma Samples and study design\u003c/h2\u003e \u003cp\u003eThis study enrolled 135 non-metastatic PCa patients, 20 metastatic PCa patients, and 154 non-cancer participants (76 Biopsy Negative patients and 78 healthy volunteers) from Zhejiang Cancer Hospital in China (Supplementary Tables). The 135 non-metastatic PCa patients included 64 localized patients and 71 locally advanced patients. Plasma from all enrolled PCa patients was collected before any drug or surgical treatment, which excludes the influence of clinical therapy on ctDNA abundance. 
Healthy samples were collected during routine health check-ups and were restricted to individuals with PSA levels below 4 ng/mL to reduce the likelihood of undiagnosed prostate cancer. Biopsy Negative samples were obtained from a prostate cancer clinic, from patients who exhibited symptoms of prostate abnormalities but whose biopsy results were negative. This study was approved by the Ethics Committees of Zhejiang Cancer Hospital (Approval No. IRB-2021-247). All methods were performed in accordance with the relevant guidelines and regulations.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec8\" class=\"Section2\"\u003e \u003ch2\u003ecfDNA extraction, library preparation and whole genome sequencing\u003c/h2\u003e \u003cp\u003eFor each sample, 5 mL of peripheral blood was collected in EDTA tubes and processed within 2 h. Plasma was isolated by centrifugation at 1600 g for 20 min at room temperature, followed by an additional centrifugation at 15000 g for 10 min at 4 ℃ in microcentrifuge tubes. cfDNA was extracted from 2-3 mL of plasma using the Qiagen Circulating Nucleic Acids Kit (Qiagen) according to the manufacturer\u0026rsquo;s instructions, measured with a Qubit fluorometer (Life Technologies) and stored at -80 ℃. A total of 10 ng of cfDNA was used for library preparation with the Kapa HyperPrep Kit (Kapa Biosystems). The library was amplified with KAPA HiFi Hotstart ReadyMix (KAPA Biosystems) and NEBNext Multiplex Oligos for Illumina (New England BioLabs) as follows: initial denaturation at 95 ℃ for 3 min, followed by 4 cycles of 98 ℃ for 20 s, 65 ℃ for 15 s, and 72 ℃ for 30 s, and a final extension at 72 ℃ for 1 min. 
After purification with Beckman Agencourt AMPure XP beads, amplified libraries were measured with a Qubit fluorometer (Life Technologies) and a Bioanalyzer 2100 (Agilent), and then pooled and sequenced (Wuxi NextCode) on an Illumina NovaSeq 6000 system to generate 150 bp paired-end reads.\u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003eBioinformatics analysis and machine-learning modeling\u003c/h3\u003e\n\u003cp\u003eThe sequenced reads were aligned to the hg19 reference genome using Bowtie2 with default settings. The resulting SAM files were converted to BAM format, duplicate reads were removed, and the reads were then sorted and indexed using SAMtools for subsequent analysis. Mapping quality was evaluated using the SAMtools flagstat function.\u003c/p\u003e \u003cp\u003eFollowing previous studies, we calculated three fragment-related features for each patient: fragment size distribution (FSD), fragment size coverage (FSC) and copy number variance (CNV). For the CNV feature, we divided the hg19 human genome into 5\u0026nbsp;million-base bins, excluding bins that overlapped known blacklisted regions, calculated the read count in each bin with the readCounter function in HMMcopy utils (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://github.com/shahcompbio/hmmcopy_utils\u003c/span\u003e\u003cspan address=\"https://github.com/shahcompbio/hmmcopy_utils\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e), and corrected the counts for GC content with ichorCNA. 
The fragment size coverage (FSC) and fragment size distribution (FSD) features were calculated using custom scripts from the GitHub repository \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://github.com/adamtongji/PCa_frag_manuscript\u003c/span\u003e\u003cspan address=\"https://github.com/adamtongji/PCa_frag_manuscript\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. Specifically, the FSC feature was calculated as the coverage of short (100\u0026ndash;150 bp) and long (221\u0026ndash;300 bp) cfDNA fragments divided by the coverage of intermediate (151\u0026ndash;220 bp) fragments, so that both shorter and longer tumor-derived cfDNA were taken into account. The FSD feature was calculated, for every chromosome arm, as the fraction of all cfDNA fragments falling into each 5 bp size bin between 100 bp and 300 bp. The FSD, FSC and CNV features were subsequently used in the model training and testing steps for all machine learning algorithms.\u003c/p\u003e \u003cp\u003eWe applied three machine learning algorithms, Generalized Linear Model (GLM), Gradient Boosting Model (GBM) and XGBoost (XGB), to construct an ensemble model for each feature. The GLM and GBM models were implemented with the R caret package, and the XGB model was implemented with the R xgboost package. We further constructed an elastic-net GLM (alpha\u0026thinsp;=\u0026thinsp;0.5 in the caret function) trained on the output predictions of the CNV, FSD and FSC models under 5-fold cross-validation (CV), and took the predictions of this GLM as the PCa score. Model performance was evaluated from the per-sample predictions by computing the area under the receiver operating characteristic curve (AUROC). For the training set, we evaluated each model on the held-out folds of the 5-fold CV, which served as internal test datasets. 
For the independent test dataset, we applied the final models, trained on all training samples, to the independent test samples.\u003c/p\u003e \u003cdiv id=\"Sec10\" class=\"Section2\"\u003e \u003ch2\u003eStatistical analysis\u003c/h2\u003e \u003cp\u003eFor statistical analysis, the receiver operating characteristic (ROC) curves were generated using the pROC package. From the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) for cancer prediction at a threshold of 0.5, we calculated the sensitivity [TP/(TP\u0026thinsp;+\u0026thinsp;FN)], specificity [TN/(TN\u0026thinsp;+\u0026thinsp;FP)], positive predictive value (PPV) [TP/(TP\u0026thinsp;+\u0026thinsp;FP)] and negative predictive value (NPV) [TN/(TN\u0026thinsp;+\u0026thinsp;FN)]. The corresponding 95% confidence intervals were calculated using the ROCR package in R.\u003c/p\u003e \u003c/div\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eConflicts of Interest\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eJ. Wang, S. Fu, and L. Dan work at Suzhou Danbei Medical Technology Co., Ltd, a company that focuses on early cancer detection. No potential conflicts of interest were disclosed by the other authors.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAvailability of data and materials\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe data supporting the conclusions of this study are included in the article and its Supplementary files. All R packages used are available online as described in the Methods section. Customized code for data processing and visualization can be accessed on the GitHub repository (https://github.com/adamtongji/PCa_frag_manuscript).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eEthics approval and consent to participate\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis study was approved by the Ethics Committees of Zhejiang Cancer Hospital (Approval No. IRB-2021-247). 
All participants provided written informed consent.\u003c/p\u003e\n\u003cp\u003eAll methods were performed in accordance with the relevant guidelines and regulations, including the Declaration of Helsinki and institutional ethical review protocols.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthor Contribution\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eZhixuan Fu, Qiaozhen Hong, Yipeng Xu, and Jingyu Wang wrote the main manuscript text and contributed equally to the article. Fu and Dan Li prepared figures 1-4. All authors reviewed the manuscript.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFunding\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis work was supported by grants from the China Postdoctoral Science Foundation (2023M742651, GZC20231946) and the National Natural Science Foundation of China (No. 83172210).\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eSung H, Ferlay J, Siegel RL, Laversanne M, Soerjomataram I, Jemal A, Bray F: Global Cancer Statistics 2020: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries. CA Cancer J Clin 71, 209-249 (2021). \u003c/li\u003e\n\u003cli\u003eAng M, Rajcic B, Foreman D, Moretti K, O\u0026apos;Callaghan ME: Men presenting with prostate-specific antigen (PSA) values of over 100 ng/mL. BJU Int 117, 68-75 (2016). \u003c/li\u003e\n\u003cli\u003eTivey A, Church M, Rothwell D, Dive C, Cook N: Circulating tumour DNA - looking beyond the blood. Nat Rev Clin Oncol 19, 600-612 (2022). \u003c/li\u003e\n\u003cli\u003eCristiano S, Leal A, Phallen J, et al: Genome-wide cell-free DNA fragmentation in patients with cancer. Nature 570, 385-389 (2019). \u003c/li\u003e\n\u003cli\u003eMa X, Chen Y, Tang W, et al: Multi-dimensional fragmentomic assay for ultrasensitive early detection of colorectal advanced adenoma and adenocarcinoma. J Hematol Oncol 14, 175 (2021). 
\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"","lastPublishedDoi":"10.21203/rs.3.rs-6166592/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-6166592/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eProstate cancer (PCa) detection remains challenging when prostate-specific antigen (PSA) levels fall within the ambiguous 4\u0026ndash;20 ng/mL range. This study aimed to develop a non-invasive, cost-effective method for early PCa detection using circulating tumor DNA (ctDNA) fragmentomic features. Whole-genome sequencing of cell-free DNA (cfDNA) from plasma samples was performed in a training cohort (78 PCa patients, 99 non-cancer controls) and an independent test cohort (57 non-metastatic PCa, 20 metastatic cases, 55 controls). Three fragmentomic features\u0026mdash;fragment size distribution (FSD), fragment size coverage (FSC), and copy number variance (CNV)\u0026mdash;were analyzed using ensemble machine learning models (GLM, GBM, XGBoost). 
The FSD model demonstrated superior performance (AUROC: 0.898 in training, 0.95 in testing), particularly in high-risk patients (PSA 4\u0026ndash;20 ng/mL; AUROC: 0.907). A composite PCa score integrating FSD, FSC, and CNV achieved AUROCs of 0.901 (training) and 0.89 (testing), outperforming PSA (AUROCs: 0.875 and 0.855, respectively). In high-risk subgroups (PSA 4\u0026ndash;20 ng/mL), the PCa score maintained high sensitivity (79.31%) and specificity (84%). Despite limitations in cohort size and external validation, this study highlights the clinical potential of cfDNA fragmentomics for early PCa detection, especially in diagnostically ambiguous PSA ranges.\u003c/p\u003e","manuscriptTitle":"Improving Early Prostate Cancer Detection Using Fragmentomics and Ensemble Machine Learning Models","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-04-03 11:09:22","doi":"10.21203/rs.3.rs-6166592/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"c9d7dc8c-5748-48aa-aa63-02682f16da63","owner":[],"postedDate":"April 3rd, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[{"id":46533014,"name":"Biological sciences/Cancer/Cancer screening"},{"id":46533015,"name":"Biological sciences/Cancer/Tumour biomarkers"},{"id":46533016,"name":"Health sciences/Molecular medicine"}],"tags":[],"updatedAt":"2026-04-27T02:39:52+00:00","versionOfRecord":[],"versionCreatedAt":"2025-04-03 
11:09:22","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-6166592","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-6166592","identity":"rs-6166592","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}


