A novel prognostic model for colorectal cancer based on epithelial cell marker genes identified and validated by combining Single-Cell and Bulk RNA-Sequencing

Liyang Cai, Xin Guo, Yucheng Zhang, Huajie Xie, Yongfeng Liu, and 4 more

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-4780290/v1
This work is licensed under a CC BY 4.0 License.
Status: Published. Journal publication published 07 Mar, 2025 in Scientific Reports.
Version 1 posted 11

Abstract

Background

Colorectal cancer (CRC) is a prevalent malignant tumor with high global incidence and mortality. It is therefore imperative to understand the molecular mechanisms underlying its development and to identify effective prognostic markers.
These efforts are crucial for pinpointing potential therapeutic targets and improving patient survival. We therefore developed a novel prognostic model intended to provide new theoretical support for clinical prognosis evaluation and treatment.

Methods

We downloaded data from the Gene Expression Omnibus (GEO) and The Cancer Genome Atlas (TCGA) databases. We then performed single-cell analysis and developed a prognostic model for colorectal cancer.

Results

We divided the scRNA-seq dataset (GSE221575) into 19 cell clusters and classified these clusters into 11 distinct cell types using marker genes. Using univariate Cox regression and LASSO (Least Absolute Shrinkage and Selection Operator) analyses, we developed a prognostic model consisting of 9 genes. Based on this 9-gene model, we divided patients into high-risk and low-risk groups using the median risk score. The risk score showed significant positive correlations with M0 macrophages, CD8+ T cells, and M2 macrophages. The enrichment analyses indicated significant enrichment of immune-related pathways in the high-risk group, including HEDGEHOG_SIGNALING, the Wnt signaling pathway, and cell adhesion molecules. Drug sensitivity analysis revealed that the low-risk group was sensitive to 5 chemotherapeutic drugs, while the high-risk group was sensitive to only 1. Additionally, we developed a nomogram for clinical application. Together, these results suggest that the risk score derived from our model is effective for stratifying colorectal cancer samples.

Conclusions

This study applied bioinformatics methods to construct a risk score model. The model showed good predictive performance, offering potential guidance for individualized treatment of colorectal cancer patients. It may also provide insights into the disease's pathogenesis and help identify potential therapeutic targets for further research.
Subject areas: Biological sciences/Cancer/Cancer screening; Biological sciences/Cancer/Gastrointestinal cancer; Biological sciences/Cancer/Tumour biomarkers
Keywords: colorectal cancer, prognostic model, scRNA-seq, epithelial cell marker genes

Background

Colorectal cancer is a prevalent malignant tumor, comprising approximately 10% of global cancer diagnoses and cancer-related deaths each year [1], with nearly 0.9 million deaths annually. Over the past few decades, the incidence of colorectal cancer in high-income countries has stabilized or declined [2], largely due to increased acceptance of colorectal cancer screening and colonoscopic polypectomy among the elderly [3]. In contrast, there has been a global rise in the incidence of colorectal cancer diagnosed in young people, known as early-onset colorectal cancer [4]. Family history, obesity, poor diet (high-fat, low-fiber diets), and long-term inflammatory bowel disease are also considered related to the occurrence of colorectal cancer [5]. Most colorectal cancers originate from polyps, a process that starts with abnormal crypts, evolves into pre-tumor lesions (polyps), and ultimately develops into colorectal cancer within an estimated period of 10–15 years [5]. It is currently believed that the cells of origin for most colorectal cancers are stem cells or stem cell-like cells [6, 7]. These cancer stem cells emerge from the progressive accumulation of genetic and epigenetic alterations that deactivate tumor suppressor genes and activate oncogenes.

At present, the primary treatment modalities for colorectal cancer are surgical resection, chemotherapy, radiotherapy, and targeted therapy. Surgical resection is the cornerstone of curative treatment for colorectal cancer, and the quality of resection plays a pivotal role in prognosis.
Resection quality can be assessed through objective parameters [8]. As adjuvant therapy, fluoropyrimidine-based chemotherapy can improve survival in resected stage III and high-risk stage II colon cancer (such as T4 or poorly differentiated tumors) [9, 10]. Preoperative radiotherapy reduces the risk of local recurrence, with the absolute risk reduction depending on clinical staging and surgical quality [11]. Currently available biomarkers for predicting prognosis and treatment response in CRC patients, such as carcinoembryonic antigen (CEA) and carbohydrate antigen 19-9 (CA19-9), have suboptimal sensitivity and specificity [12, 13]. More precise biomarkers are therefore needed.

With the advancement of high-throughput genomic screening technologies, such as next-generation sequencing and microarray analysis, a multitude of molecular biomarkers and features with potential clinical prognostic and predictive value have been identified through comprehensive association and bioinformatics analyses. Notably, several genomics-based biomarkers, such as mismatch repair (MMR) or microsatellite instability (MSI) status, have entered clinical practice and have been validated as predictive markers for adjuvant chemotherapy in stage II CRC patients [14–16]. An increasing number of studies now employ RNA sequencing (RNA-seq) to analyze gene expression patterns in colorectal cancer. However, RNA-seq is typically conducted in bulk, so the data reflect the average gene expression across a large population of cells [17]. Single-cell RNA sequencing (scRNA-seq), by contrast, offers detailed insights into the characteristics of individual immune cells or tumor cells [18].
scRNA-seq highlights intratumoral heterogeneity by revealing different subpopulations of cells within tumors. It can also quantify and analyze immune cell infiltration patterns within tumor tissues. In this study, we aimed to explore the distribution of different cell types, gene expression characteristics, and their correlations with clinical prognosis in colorectal cancer tissues using both single-cell RNA sequencing and gene chip analysis. Additionally, we developed a novel prognostic model from feature genes identified through single-cell analysis.

Methods

Data source and preprocessing

The GEO database, managed by NCBI, stores gene expression data from various studies. We retrieved the single-cell data files for GSE221575 from GEO, selecting the 4 samples with complete single-cell expression profiles for our single-cell analysis. We also obtained the Series Matrix File data for GSE17536 (platform GPL570), containing expression profiles from 177 patients, and for GSE38832 (platform GPL570), containing expression profiles from 122 patients. The TCGA database (https://portal.gdc.cancer.gov/), recognized as the largest repository of cancer-related genetic information, archives diverse data types such as gene expression data, copy number variations, and SNPs. From TCGA we accessed raw mRNA expression data for colorectal cancer, a total of 701 samples comprising 51 normal samples and 650 tumor samples.

Single-cell analysis

Initially, the expression profiles were imported using the Seurat package.
Subsequently, aberrant cells were excluded by evaluating UMI counts, the number of genes detected per cell, and the mitochondrial gene fraction. The data were then standardized, normalized, and subjected to PCA (Principal Component Analysis) for linear dimensionality reduction. The optimal number of principal components was identified using an elbow plot. UMAP (Uniform Manifold Approximation and Projection) was then employed for nonlinear dimensionality reduction to show the spatial relationships between clusters. Cell types and their associated marker genes were identified and annotated using the CellMarker and PanglaoDB databases, supplemented by the relevant literature.

Contribution of different cell subpopulations to colorectal cancer

We characterized the contribution of each cell subpopulation to the disease by evaluating both cell counts and changes in gene expression. Specifically, we conducted differential gene expression analysis to identify the top 100 highly expressed genes in control versus tumor samples and treated these genes as feature markers for each group. We then computed the differential expression levels and expression proportions of these genes within each cell subtype. The square root of the product of fold change (FC) and percentage proportion (PctProp) was used to gauge the contribution of these genes to the disease.

Model construction and prognosis

A candidate gene set was identified, and lasso regression was employed to develop a prognostic model. The expression values of each selected gene were combined into a risk score formula for every patient, with the coefficients derived from the lasso regression weighting the contribution of each gene to the risk score.
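The weighted sum described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the nine coefficients are those reported in the Results of this paper, while the function names, the handling of ties at the median, and the choice of which cohort supplies the median are hypothetical and not taken from the authors' code.

```python
from statistics import median

# Lasso coefficients as reported in the Results section of this paper.
COEFFICIENTS = {
    "S100P": -0.127895799000468,
    "PIGR": -0.110505479982379,
    "RAB11FIP1": -0.110168920940582,
    "USP53": -0.0657253153577585,
    "CDH1": -0.0447173205642763,
    "LGALS4": -0.026899354451419,
    "ATP10B": -0.0253973538959315,
    "SLC12A2": -0.0148820970575857,
    "LAMB3": 0.0964297473594213,
}


def risk_score(expression):
    """Return the coefficient-weighted sum of expression values for one patient.

    `expression` maps gene symbol -> expression value (hypothetical input format).
    """
    missing = [gene for gene in COEFFICIENTS if gene not in expression]
    if missing:
        raise KeyError(f"missing expression values for: {missing}")
    return sum(COEFFICIENTS[gene] * expression[gene] for gene in COEFFICIENTS)


def stratify_by_median(scores):
    """Label each patient 'high' or 'low' relative to the cohort median.

    Assumption: scores equal to the median are labeled 'low'.
    """
    cutoff = median(scores.values())
    return {pid: ("high" if score > cutoff else "low") for pid, score in scores.items()}
```

Because eight of the nine coefficients are negative, higher expression of those genes lowers the score, while higher LAMB3 expression raises it.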
This approach predicts patient outcomes from the expression levels of the selected genes. Based on the risk score formula, patients were categorized into low-risk and high-risk groups using the median risk score as the threshold. Survival differences between these groups were assessed by Kaplan-Meier analysis and compared using the log-rank test. Lasso regression analysis and stratified analysis were employed to examine the impact of the risk score on predicting patient prognosis. The accuracy of the model's predictions was evaluated using Receiver Operating Characteristic (ROC) curves.

Immune cell infiltration analysis

CIBERSORT is widely used for estimating immune cell composition within a tissue microenvironment. It applies support vector regression to deconvolve the expression matrix into immune cell subtypes. Using a signature of 547 biomarkers, CIBERSORT discriminates among 22 human immune cell phenotypes, including T cells, B cells, plasma cells, and various myeloid subgroups. In this study, we used the CIBERSORT algorithm to estimate the proportions of the 22 immune infiltrating cell types in patient data. We then performed correlation analysis between gene expression and immune cell content.

Drug sensitivity analysis

Based on data from the Genomics of Drug Sensitivity in Cancer (GDSC) database, we used the R package "pRRophetic" to predict the sensitivity of each tumor sample to chemotherapy. We obtained IC50 estimates for each chemotherapy drug using regression methods and validated the accuracy of these predictions through 10-fold cross-validation on the GDSC training set.
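To make the 10-fold cross-validation step concrete, the following is a minimal, self-contained sketch. It is not the pRRophetic implementation: an ordinary least-squares line on a single feature stands in for the package's regression model, and all function names and data are hypothetical.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return mean_y - slope * mean_x, slope


def cross_validated_mse(xs, ys, k=10, seed=0):
    """Mean squared prediction error, with every sample held out exactly once."""
    squared_errors = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        intercept, slope = fit_line([xs[i] for i in train], [ys[i] for i in train])
        squared_errors.extend((ys[i] - (intercept + slope * xs[i])) ** 2 for i in held_out)
    return sum(squared_errors) / len(squared_errors)
```

Each sample is predicted by a model that never saw it during fitting, so the averaged error estimates how well the regression generalizes rather than how well it fits the training data.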
Default parameters were used throughout, including "combat" for batch effect removal and averaging for duplicate gene expression values.

GSVA analysis (gene set variation analysis)

Gene Set Variation Analysis (GSVA) is a non-parametric, unsupervised method for evaluating transcriptome gene set enrichment. GSVA converts gene-level changes to pathway-level changes by scoring gene sets of interest, thereby assessing the biological functions of samples. In this study, gene sets were downloaded from the Molecular Signatures Database (v7.0) and scored using the GSVA algorithm to assess potential biological function changes in different samples.

GSEA analysis

Patients were stratified into high-risk and low-risk groups according to the risk scores generated by the model, and differences in signaling pathways between the two groups were examined using Gene Set Enrichment Analysis (GSEA). The background gene sets were taken from the MsigDB database and used for subtype pathway annotation and differential expression analysis across subtypes. Significantly enriched gene sets (adjusted p-value < 0.05) were ranked by their consistency scores. GSEA was used to explore the associations between tumor subtypes and their biological significance.

Nomogram model construction

A nomogram was built by regression analysis, integrating the risk score with clinical characteristics. Scaled line segments were plotted on a common plane, each segment representing the relationship between a variable and the outcome within the predictive model.
Using a multivariate regression model, a score was assigned to each level of every influencing factor according to its contribution (regression coefficient) to the outcome variable. The total score was then computed by summing these individual scores, yielding the predicted value.

Statistical analysis

Survival curves were generated using the Kaplan-Meier method and compared using the log-rank test. Multivariate analysis was conducted using the Cox proportional hazards model. All statistical analyses were conducted using R (version 4.3.0), with p < 0.05 considered statistically significant.

Results

Definition of clusters and dimensionality reduction for visual representation of the cells

First, we read the expression profiles using the Seurat package and filtered out low-quality cells, retaining cells with nFeature_RNA > 200 & nFeature_RNA < 6000 & percent.mt < 25 & nCount_RNA < 40000, which left 5,241 cells (Fig. 1a-b). We displayed the top 5 genes with the highest standard deviation among these cells (Fig. 1c). The data were then sequentially processed by standardization, normalization, PCA, and Harmony analysis (Fig. 1d-f). Using UMAP analysis, we determined the positional relationships between clusters, identifying 19 cell clusters (Fig. 2a). Annotation assigned all clusters to the following cell categories: Plasma cell, T cell, Profiling NKT cell, Macrophage cell, B cell, Epithelial cell, Fibroblasts, Goblet cell, Mast cell, EC cell, and Cholangiocytes (Fig. 2b). We also present a bubble plot of the classic markers for these 11 cell types (Fig. 2c) and a bar plot showing the proportion of cells in each group (Fig. 2d). We then screened for genes highly expressed in control versus tumor samples.
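The contribution measure defined in the Methods, the square root of the product of fold change (FC) and percentage proportion (PctProp), can be sketched as follows. The source does not state how FC and PctProp are aggregated across the marker genes of a cell type, so this sketch assumes one summary pair per cell type; the function names and input format are hypothetical.

```python
import math


def contribution_score(fold_change, pct_prop):
    """Square root of FC x PctProp, the contribution measure from the Methods."""
    if fold_change < 0 or pct_prop < 0:
        raise ValueError("fold_change and pct_prop must be non-negative")
    return math.sqrt(fold_change * pct_prop)


def rank_cell_types(stats):
    """Rank cell types by contribution score, highest first.

    `stats` maps cell type -> (fold_change, pct_prop); one summary pair per
    cell type is an assumption of this sketch.
    """
    scores = {
        cell_type: contribution_score(fold_change, pct_prop)
        for cell_type, (fold_change, pct_prop) in stats.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Taking the square root of the product is a geometric mean of the two quantities, so a cell type needs both a sizeable expression change and a sizeable expressing fraction to score highly.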
Subsequently, we quantified the differential expression levels and determined the expression proportions of these genes within each cell subtype. The disease contribution was determined by the square root of the FC * PctProp value, with Epithelial cells showing the highest contribution (Fig. 2e). Therefore, we selected highly expressed genes in control vs. tumor samples with avg_log2FC > 1 and p_val_adj < 0.05 as the candidate gene set for subsequent analysis.

Construction and validation of the predictive model based on epithelial cell marker genes

Using the candidate genes obtained in the previous step, we applied the lasso regression feature selection algorithm to identify characteristic genes in colorectal cancer. The processed colorectal cancer dataset from the TCGA database, containing patient survival information, was randomly divided into a training set and a test set at a 4:1 ratio. After lasso regression analysis (Fig. 3a-c), we obtained the optimal risk score value for each sample for subsequent analysis:

RiskScore = S100P * (-0.127895799000468)
          + PIGR * (-0.110505479982379)
          + RAB11FIP1 * (-0.110168920940582)
          + USP53 * (-0.0657253153577585)
          + CDH1 * (-0.0447173205642763)
          + LGALS4 * (-0.026899354451419)
          + ATP10B * (-0.0253973538959315)
          + SLC12A2 * (-0.0148820970575857)
          + LAMB3 * (0.0964297473594213)

Patients were stratified into high-risk and low-risk groups according to their calculated risk scores, and survival analysis was performed using Kaplan-Meier curves. In both the training set and the test set, the survival rate was markedly lower in the high-risk group than in the low-risk group (Fig. 3d-e). ROC curve analysis of both the training and test sets likewise indicated strong validation performance of the model (Fig. 3f-g). We downloaded processed colorectal cancer patient data with survival information from the GEO database (GSE17536 and GSE38832).
Using our model, we predicted the clinical classification of the colorectal cancer patients from the GEO database. We then assessed survival differences between the predicted groups using Kaplan-Meier analysis to evaluate the stability and predictive accuracy of the model. Survival differed significantly between the high-risk and low-risk groups within the external validation set obtained from the GEO database (Fig. 3h). To validate the accuracy of our model, we conducted ROC curve analysis using the external dataset. The results showed robust predictive performance of the model in assessing patient prognosis (Fig. 3i).

Analysis of immune cell infiltration to explore the impact of risk scores on the immune microenvironment in colorectal cancer

The tumor microenvironment (TME) primarily consists of tumor-associated fibroblasts, immune cells, extracellular matrix, various growth factors, inflammatory factors, specific physicochemical characteristics, and the cancer cells themselves. The TME significantly influences tumor diagnosis, survival outcomes, and clinical treatment sensitivity. By analyzing the relationship between risk scores and tumor immune infiltration, we further investigated the potential molecular mechanisms through which risk scores impact colorectal cancer progression. The proportions of immune cells in the high-risk and low-risk groups are illustrated in Fig. 4a. We also compared the differences in immune cell content between these two groups. The high-risk group had significantly lower levels of activated dendritic cells, plasma cells, and resting CD4 memory T cells, while the levels of M0 macrophages and CD8 T cells were significantly higher (Fig. 4b). Subsequently, we investigated the relationship between the risk score and immune cells.
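A correlation between risk scores and estimated immune cell fractions can be computed along the following lines. The text does not specify which correlation coefficient was used, so the Spearman rank correlation here is an assumption, and the function names are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared_rank = (pos + end) / 2 + 1  # mean of ranks pos+1 .. end+1
        for j in range(pos, end + 1):
            ranks[order[j]] = shared_rank
        pos = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(risk_scores, cell_fractions):
    """Spearman correlation: Pearson correlation of the average ranks."""
    if len(risk_scores) != len(cell_fractions) or len(risk_scores) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    return pearson(average_ranks(risk_scores), average_ranks(cell_fractions))
```

A rank-based coefficient is a common choice for deconvolution output because estimated cell fractions are bounded and often skewed, although a Pearson coefficient on the raw values would be an equally plausible reading of the text.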
The risk score was significantly positively correlated with M0 macrophages, CD8 T cells, and M2 macrophages, and significantly negatively correlated with resting CD4 memory T cells, activated dendritic cells, eosinophils, and plasma cells (Fig. 4c).

Further analysis to explore the potential molecular mechanisms of risk scores impacting tumor progression

The treatment of early-stage colorectal cancer with surgery combined with chemotherapy has demonstrated clear efficacy. We used drug sensitivity data from the GDSC database and the R package "pRRophetic" to predict the chemotherapy sensitivity of each tumor sample. This allowed us to explore the relationship between risk scores and sensitivity to common chemotherapy drugs. The risk score was significantly associated with sensitivity to AKT inhibitor VIII, Axitinib, BAY 61-3606, BIBW2992, BMS 708163, and Bicalutamide (Fig. 5a). Next, we examined the signaling pathways involved in the high-risk and low-risk groups to investigate the potential molecular mechanisms by which risk scores influence tumor progression. GSVA results showed that the differential pathways between the two groups were primarily enriched in the HEDGEHOG_SIGNALING, WNT_BETA_CATENIN_SIGNALING, and IL6_JAK_STAT3_SIGNALING pathways (Fig. 5b). GSEA results indicated that the pathways involved included the Wnt signaling pathway, cell adhesion molecules, and the MAPK signaling pathway (Fig. 5c). The molecular interaction network among these pathways is illustrated in Fig. 5d.

Construction of nomogram model

Both univariate and multivariate analyses demonstrated that the risk score is an independent prognostic factor for colorectal cancer patients (Fig. 6a-b). Subsequently, samples were stratified into high-risk and low-risk groups based on the median value of the risk score.
The results of the regression analysis were visualized as a nomogram, which showed that the risk score contributes substantially to the scoring of the nomogram prediction model across all samples (Fig. 6c). Predictions were also made for three-year and five-year survival in colorectal cancer (Fig. 6d).

Clinical indicator analysis further demonstrates the applicability of risk score to colorectal cancer samples

Next, we stratified samples by the values of clinical indicators and displayed the corresponding risk scores using box plots (Fig. 7a-g). Rank-sum tests identified significant differences in risk score distributions among groups defined by clinical indicators such as survival status (Fustat), M, N, T, and Stage (p-value < 0.05). These findings suggest that the risk score derived from the modeling analysis is well-suited for subtyping colorectal cancer samples.

Discussion

Colorectal cancer (CRC) is one of the most common cancers worldwide, with high incidence and mortality rates. It is reported that nearly 1.4 million new cases of CRC and 700,000 CRC-related deaths occur globally each year [19]. Brenner et al. found that patients with early-diagnosed colorectal cancer have a 5-year survival rate exceeding 90% [20]. However, due to inadequate diagnostic methods, colorectal cancer is often diagnosed at advanced stages [21]. Despite significant improvements in diagnosis and treatment, the 5-year survival rate for patients diagnosed with metastatic colorectal cancer remains low, at approximately 12% [22]. There is therefore an urgent need to elucidate the molecular mechanisms of colorectal cancer development and to identify novel biomarkers for early detection and prognosis assessment to improve survival outcomes. Single-cell RNA sequencing (scRNA-seq) has emerged as a valuable tool for transcriptomic profiling of various cancer cell types, which is crucial for identifying potential therapeutic targets.
In this study, we used colorectal cancer scRNA-seq data from the GEO database to define cellular subpopulations within tumors and characterize their contributions to the disease based on cell numbers and gene expression changes. We then selected marker genes with the highest disease relevance from these subpopulations as a candidate gene set for further analysis. This led to the construction of a prognostic risk model with favorable prognostic efficiency, which may also serve as a biomarker for predicting immunotherapy response. Similar viewpoints were proposed by Lu et al. [23], who applied scRNA-seq to analyze the heterogeneity of tumor immune cells and developed a 3-gene biomarker (CLTA, TALDO1, and CSTB) based on tumor immune microenvironment (TIME) heterogeneity to predict survival outcomes and immunotherapy responses. Zheng et al. [24] selected 6 prognosis-related HUB genes from GEO esophageal squamous cell carcinoma (ESCC) and TCGA esophageal cancer datasets, showing significantly increased expression of HUB genes in normal tissues and cells based on scRNA-seq. Further Kaplan-Meier survival analysis and immune infiltration analysis indicated that HUB genes are promising biomarkers for ESCC diagnosis and prognosis. Additionally, studies using scRNA-seq have elucidated intercellular interactions in gliomas, identifying autocrine ligand-receptor signaling that significantly impacts prognosis in glioma patients [25]. Collectively, these findings demonstrate that scRNA-seq can effectively dissect and identify potential prognostic biomarkers, which are crucial for pinpointing therapeutic targets and improving patient survival outcomes.

In our study, the prognostic signature composed of nine marker genes (S100P, PIGR, RAB11FIP1, USP53, CDH1, LGALS4, ATP10B, SLC12A2, and LAMB3) may provide valuable insights into the molecular mechanisms of colorectal cancer (CRC).
For instance, S100P, a 95-amino acid protein and member of the S100 family, plays a crucial role in regulating cell differentiation, proliferation, migration, apoptosis, and other biological functions by interacting with various signaling proteins such as P53, β-catenin, and nuclear factor-κB (NF-κB). Through these interactions, S100P is involved in tumorigenesis and tumor progression. Research has shown that SIX3 can downregulate S100P via the Wnt/β-catenin signaling pathway, thereby inhibiting cell migration and proliferation [ 26 ] . Another study identified that S100P mRNA levels correlate with the activation status of the PI3K/AKT pathway, a classical pathway involved in promoting cancer migration, invasion, proliferation, and drug resistance [ 27 ] . The performance of the prognostic model based on the nine marker genes was validated in both the test and GEO cohorts, yielding consistent results across the two cohorts, indicating good effectiveness and reproducibility of the model. Various validation methods, including univariate, multivariate, and clinical indicator analyses, demonstrated that the nomogram model has high predictive accuracy. Therefore, the nomogram can guide the establishment of personalized examination procedures for CRC patients, promoting the effective utilization of medical resources. Given that the tumor microenvironment (TME) plays a crucial role in anti-tumor responses and significantly influences tumor diagnosis, survival outcomes, and clinical treatment sensitivity [ 28 ] , we investigated the relationship between risk score and tumor immune infiltration. Firstly, we observed significant decreases in activated dendritic cells, plasma cells, and resting CD4 memory T cells in the high-risk group, suggesting that these patients may be in a relatively immunosuppressive state. 
Secondly, the risk score showed significant positive correlations with M0 macrophages, CD8 T cells, and M2 macrophages, and significant negative correlations with resting CD4 memory T cells, activated dendritic cells, eosinophils, and plasma cells. This indicates that the TME of the high-risk group may function to reduce inflammation, promote tumor growth, and suppress immunity.

To better guide CRC treatment, we conducted drug sensitivity analyses on the two risk groups, studying six common chemotherapy drugs for colorectal cancer. The results indicated that the low-risk group is sensitive to five anticancer drugs, while the high-risk group is sensitive to one. These findings provide a reference for the clinical selection of chemotherapy drugs. In future studies, we will further explore the clinical significance of these drugs for CRC patients.

Our study has some inherent limitations. Firstly, all cohorts analyzed were retrospective, and the findings require further validation in prospective cohort studies. Secondly, further mechanistic studies are needed to reveal the exact role of each gene, and the drug sensitivity results need confirmation through cellular experiments. Thirdly, the number and volume of scRNA-seq samples available in public databases are limited, resulting in an incomplete analysis of clinical and pathological parameters, which may introduce bias. Multicenter, large-sample, prospective double-blind trials are therefore needed for further verification.

Conclusion

This study applied various bioinformatics methods to reveal the distribution of different cell types, gene expression characteristics, and their association with clinical prognosis in colorectal cancer tissues.
The constructed risk score model demonstrates good predictive performance, offering valuable insights for the personalized treatment of colorectal cancer patients and guiding further exploration of the disease's pathogenesis and therapeutic targets. Declarations Author Contributions All the authors participated in writing the manuscript and in drawing the figures. All authors read and approved the final manuscript. Declaration of Interests The authors declare no competing interests. Data Availability scRNA-seq and RNA-seq data can be obtained from the GEO and TCGA databases. (GSE221575 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE221575) References Bray F, Ferlay J, Soerjomataram I, et al. Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries[J]. CA Cancer J Clin,2018,68(6):394-424. Wei X, Liang J, Liu J, et al. Anchang Yuyang Decoction inhibits experimental colitis-related carcinogenesis by regulating PPAR signaling pathway and affecting metabolic homeostasis of host and microbiota[J]. J Ethnopharmacol,2024,326:117995. Murphy C C, Wallace K, Sandler R S, et al. Racial Disparities in Incidence of Young-Onset Colorectal Cancer and Patient Survival[J]. Gastroenterology,2019,156(4):958-965. Rho Y S, Gilabert M, Polom K, et al. Comparing Clinical Characteristics and Outcomes of Young-onset and Late-onset Colorectal Cancer: An International Collaborative Study[J]. Clin Colorectal Cancer,2017,16(4):334-342. Dekker E, Tanis P J, Vleugels J, et al. Colorectal cancer[J]. Lancet,2019,394(10207):1467-1480. Medema J P. Cancer stem cells: the challenges ahead[J]. Nat Cell Biol,2013,15(4):338-344. Nassar D, Blanpain C. Cancer Stem Cells: Basic Concepts and Therapeutic Implications[J]. Annu Rev Pathol,2016,11:47-76. Bondeven P, Hagemann-Madsen R H, Laurberg S, et al. Extent and completeness of mesorectal excision evaluated by postoperative magnetic resonance imaging[J]. Br J Surg,2013,100(10):1357-1367. 
André T, Boni C, Navarro M, et al. Improved overall survival with oxaliplatin, fluorouracil, and leucovorin as adjuvant treatment in stage II or III colon cancer in the MOSAIC trial[J]. J Clin Oncol,2009,27(19):3109-3116.
Haller D G, Tabernero J, Maroun J, et al. Capecitabine plus oxaliplatin compared with fluorouracil and folinic acid as adjuvant therapy for stage III colon cancer[J]. J Clin Oncol,2011,29(11):1465-1471.
Ma B, Gao P, Wang H, et al. What has preoperative radio(chemo)therapy brought to localized rectal cancer patients in terms of perioperative and long-term outcomes over the past decades? A systematic review and meta-analysis based on 41,121 patients[J]. Int J Cancer,2017,141(5):1052-1065.
Alderson P, Tan T. The use of Cochrane Reviews in NICE clinical guidelines[J]. Cochrane Database Syst Rev,2011,2011(12):ED000032.
Primrose J N, Perera R, Gray A, et al. Effect of 3 to 5 years of scheduled CEA and CT follow-up to detect recurrence of colorectal cancer: the FACS randomized clinical trial[J]. JAMA,2014,311(3):263-270.
Pagès F, Mlecnik B, Marliot F, et al. International validation of the consensus Immunoscore for the classification of colon cancer: a prognostic and accuracy study[J]. Lancet,2018,391(10135):2128-2139.
Sargent D J, Marsoni S, Monges G, et al. Defective mismatch repair as a predictive marker for lack of efficacy of fluorouracil-based adjuvant therapy in colon cancer[J]. J Clin Oncol,2010,28(20):3219-3226.
Le D T, Durham J N, Smith K N, et al. Mismatch repair deficiency predicts response of solid tumors to PD-1 blockade[J]. Science,2017,357(6349):409-413.
Olsen T K, Baryawno N. Introduction to Single-Cell RNA Sequencing[J]. Curr Protoc Mol Biol,2018,122(1):e57.
Torroja C, Sanchez-Cabo F. Corrigendum: Digitaldlsorter: Deep-Learning on scRNA-Seq to Deconvolute Gene Expression Data[J]. Front Genet,2019,10:1373.
Torre L A, Bray F, Siegel R L, et al. Global cancer statistics, 2012[J]. CA Cancer J Clin,2015,65(2):87-108.
Brenner H, Kloor M, Pox C P. Colorectal cancer[J]. Lancet,2014,383(9927):1490-1502.
Pinsky P F, Doroudi M. Colorectal Cancer Screening[J]. JAMA,2016,316(16):1715.
Siegel R L, Miller K D, Jemal A. Cancer statistics, 2015[J]. CA Cancer J Clin,2015,65(1):5-29.
Lu J, Chen Y, Zhang X, et al. A novel prognostic model based on single-cell RNA sequencing data for hepatocellular carcinoma[J]. Cancer Cell Int,2022,22(1):38.
Zheng L, Li L, Xie J, et al. Six Novel Biomarkers for Diagnosis and Prognosis of Esophageal squamous cell carcinoma: validated by scRNA-seq and qPCR[J]. J Cancer,2021,12(3):899-911.
Yuan D, Tao Y, Chen G, et al. Systematic expression analysis of ligand-receptor pairs reveals important cell-to-cell interactions inside glioma[J]. Cell Commun Signal,2019,17(1):48.
Liu S, Tian Y, Zheng Y, et al. TRIM27 acts as an oncogene and regulates cell proliferation and metastasis in non-small cell lung cancer through SIX3-β-catenin signaling[J]. Aging (Albany NY),2020,12(24):25564-25580.
De Marco C, Laudanna C, Rinaldo N, et al. Specific gene expression signatures induced by the multiple oncogenic alterations that occur within the PTEN/PI3K/AKT pathway in lung cancer[J]. PLoS One,2017,12(6):e0178865.
Pitt J M, Marabelle A, Eggermont A, et al. Targeting the tumor microenvironment: removing obstruction to anticancer immune responses and immunotherapy[J]. Ann Oncol,2016,27(8):1482-1492.

Additional Declarations
No competing interests reported.
Editorial decision: Revision requested (08 Jan, 2025)
Reviews received at journal (07 Jan, 2025)
Reviewers agreed at journal (26 Dec, 2024)
Reviewers agreed at journal (18 Nov, 2024)
Reviews received at journal (18 Oct, 2024)
Reviewers agreed at journal (18 Oct, 2024)
Reviewers invited by journal (08 Aug, 2024)
Editor assigned by journal (31 Jul, 2024)
Editor invited by journal (26 Jul, 2024)
Submission checks completed at journal (24 Jul, 2024)
First submitted to journal (22 Jul, 2024)
Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-4780290","acceptedTermsAndConditions":true,"allowDirectSubmit":false,"archivedVersions":[],"articleType":"Article","associatedPublications":[],"authors":[{"id":339971912,"identity":"02e6a076-4504-4a56-a6c6-f150a88964cd","order_by":0,"name":"Liyang Cai","email":"","orcid":"","institution":"Southern Medical University","correspondingAuthor":false,"prefix":"","firstName":"Liyang","middleName":"","lastName":"Cai","suffix":""},{"id":339971914,"identity":"0858b9b7-0b09-4cfc-97a9-e5b017d5af09","order_by":1,"name":"Xin Guo","email":"","orcid":"","institution":"Guangdong Provincial People's Hospital (Guangdong Academy of Medical Sciences, Southern Medical University","correspondingAuthor":false,"prefix":"","firstName":"Xin","middleName":"","lastName":"Guo","suffix":""},{"id":339971915,"identity":"2fc5a1dc-8c5c-44c7-960f-7592e406ed40","order_by":2,"name":"Yucheng Zhang","email":"","orcid":"","institution":"Guangdong Provincial People's Hospital (Guangdong Academy of Medical Sciences, Southern Medical University","correspondingAuthor":false,"prefix":"","firstName":"Yucheng","middleName":"","lastName":"Zhang","suffix":""},{"id":339971916,"identity":"607b15c5-83b6-4589-8a0f-d43d66c62a00","order_by":3,"name":"Huajie Xie","email":"","orcid":"","institution":"Guangdong Provincial People's Hospital (Guangdong Academy of Medical Sciences, Southern Medical University","correspondingAuthor":false,"prefix":"","firstName":"Huajie","middleName":"","lastName":"Xie","suffix":""},{"id":339971917,"identity":"166e4351-ff01-4f6e-905d-b2b70c66f2fb","order_by":4,"name":"Yongfeng Liu","email":"","orcid":"","institution":"Guangdong Provincial 
People's Hospital (Guangdong Academy of Medical Sciences, Southern Medical University","correspondingAuthor":false,"prefix":"","firstName":"Yongfeng","middleName":"","lastName":"Liu","suffix":""},{"id":339971918,"identity":"fc0ba331-b3eb-49d8-b227-9e7959e977f5","order_by":5,"name":"Jianlong Zhou","email":"","orcid":"","institution":"Guangdong Provincial People's Hospital (Guangdong Academy of Medical Sciences, Southern Medical University","correspondingAuthor":false,"prefix":"","firstName":"Jianlong","middleName":"","lastName":"Zhou","suffix":""},{"id":339971919,"identity":"213fd18d-b448-4a70-b8f1-cd5b4d6510c0","order_by":6,"name":"Huolun Feng","email":"","orcid":"","institution":"Guangdong Provincial People's Hospital (Guangdong Academy of Medical Sciences, Southern Medical University","correspondingAuthor":false,"prefix":"","firstName":"Huolun","middleName":"","lastName":"Feng","suffix":""},{"id":339971920,"identity":"b4891158-018d-411a-ba4d-636f82a0b581","order_by":7,"name":"Jiabin Zheng","email":"","orcid":"","institution":"Guangdong Provincial People's Hospital (Guangdong Academy of Medical Sciences, Southern Medical University","correspondingAuthor":false,"prefix":"","firstName":"Jiabin","middleName":"","lastName":"Zheng","suffix":""},{"id":339971921,"identity":"08ad2225-d0fc-426a-9ffd-f062a72587c0","order_by":8,"name":"Yong Li","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAAzElEQVRIiWNgGAWjYBACPmYwZcPAcICHSC1sEC1ppGiBUIdJ0cLOY/i44Nd5u77bvQcYPu6pZeCf3UDIYTzGxjP7bifPvHMugXHGs+MMEncOENRiJs3bczvZ4EaOATPPgWMMBhIJRGk5R6oWnh8H7KBaaojRwlZszNuQnCB554zBwRkHDvBI3CCghZ//8MbHPH/s7Plu9xg++HCgTo5/BgEtDAwcBgyMbQyJDRLAqAFGEDGxw/6AgeEPgz2DBJhXR4SOUTAKRsEoGGkAAMU6QUBxUKNEAAAAAElFTkSuQmCC","orcid":"","institution":"Guangdong Provincial People's Hospital (Guangdong Academy of Medical Sciences, Southern Medical 
University","correspondingAuthor":true,"prefix":"","firstName":"Yong","middleName":"","lastName":"Li","suffix":""}],"badges":[],"createdAt":"2024-07-22 08:29:47","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-4780290/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-4780290/v1","draftVersion":[],"editorialEvents":[{"content":"https://doi.org/10.1038/s41598-025-91761-y","type":"published","date":"2025-03-07T15:57:06+00:00"}],"editorialNote":"","failedWorkflow":false,"files":[{"id":63287037,"identity":"716fb27f-60d8-42f3-9c3d-450fe872068b","added_by":"auto","created_at":"2024-08-26 13:48:46","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":268710,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eSingle-cell analysis workflow.\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003ea-b\u003c/strong\u003e Quality control of scRNA-seq data. \u003cstrong\u003ec\u003c/strong\u003e Variance plot showing 24,978 genes across all cells, with red dots representing the top 2000 highly variable genes, highlighting the top 5 genes based on standard deviation. \u003cstrong\u003ed-f\u003c/strong\u003e Sequential data processing steps including normalization, scaling, PCA, and harmony analysis.\u003c/p\u003e","description":"","filename":"1.png","url":"https://assets-eu.researchsquare.com/files/rs-4780290/v1/13bedfcdab57c5aacac7d37f.png"},{"id":63288396,"identity":"a9a38abf-8ba4-4aa5-901d-1aa03c0777a3","added_by":"auto","created_at":"2024-08-26 13:56:46","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":140188,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eSingle-cell overview of tumor samples.\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003ea-b\u003c/strong\u003e UMAP analysis identifies 19 cell clusters annotated with cell types. 
\u003cstrong\u003ec-d\u003c/strong\u003e Bubble plots display expression profiles of classical markers for 11 cell types; bar charts show proportions of each cell type. \u003cstrong\u003ee\u003c/strong\u003e Identification of significantly upregulated genes in tumor vs control samples, highlighting Epithelial cells with the highest disease contribution.\u003c/p\u003e","description":"","filename":"2.png","url":"https://assets-eu.researchsquare.com/files/rs-4780290/v1/4eabbc5174b73a20602dde44.png"},{"id":63287038,"identity":"431cde0c-3839-45ce-9871-69d6f5e3567e","added_by":"auto","created_at":"2024-08-26 13:48:46","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":219586,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eConstruction and prediction of the prognostic model.\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003ea-c\u003c/strong\u003e Lasso regression analysis. \u003cstrong\u003ed-e\u003c/strong\u003eKaplan-Meier curves in the training and testing sets. \u003cstrong\u003ef-g\u003c/strong\u003e ROC curves in the training and testing sets. \u003cstrong\u003eh-i\u003c/strong\u003e External validation demonstrating strong prognostic performance of the model for patients.\u003c/p\u003e","description":"","filename":"3.png","url":"https://assets-eu.researchsquare.com/files/rs-4780290/v1/d0b50c2302dcded1606cf691.png"},{"id":63288397,"identity":"2861beda-4373-446a-bf65-9863be5d84ee","added_by":"auto","created_at":"2024-08-26 13:56:46","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":196601,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eImmune cell infiltration analysis.\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003ea\u003c/strong\u003e Proportions of immune cell content between high and low-risk groups. 
\u003cstrong\u003eb\u003c/strong\u003e Differences in immune cell content between high and low-risk groups. \u003cstrong\u003ec\u003c/strong\u003e Relationship between risk scores and immune cells.\u003c/p\u003e","description":"","filename":"4.png","url":"https://assets-eu.researchsquare.com/files/rs-4780290/v1/0a75f8dd4114110f56125494.png"},{"id":63287039,"identity":"1f30022c-bfc0-4462-8ecc-f9000eeecc13","added_by":"auto","created_at":"2024-08-26 13:48:46","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":217802,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003ePotential molecular mechanisms by which the risk score affects tumor progression.\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003ea\u003c/strong\u003e Relationship between the risk score and sensitivity to common chemotherapy drugs. \u003cstrong\u003eb-c\u003c/strong\u003e GSVA and GSEA analyses exploring signaling pathway differences between high and low-risk groups. \u003cstrong\u003ed\u003c/strong\u003e Molecular interaction network among the pathways.\u003c/p\u003e","description":"","filename":"5.png","url":"https://assets-eu.researchsquare.com/files/rs-4780290/v1/c39cf23182ffce7cecc26b20.png"},{"id":63287043,"identity":"0c1bd46e-f836-4afa-8c26-7a3536df02c7","added_by":"auto","created_at":"2024-08-26 13:48:46","extension":"png","order_by":6,"title":"Figure 6","display":"","copyAsset":false,"role":"figure","size":114392,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eConstruction of the nomogram prediction model.\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003ea-b\u003c/strong\u003e Univariate and multivariate analyses identify the risk score as an independent prognostic factor for colorectal cancer patients. \u003cstrong\u003ec\u003c/strong\u003e Regression analysis shows that the risk score significantly contributes to the nomogram prediction model. 
\u003cstrong\u003ed\u003c/strong\u003ePredictions for three-year and five-year survival rates of colorectal cancer patients.\u003c/p\u003e","description":"","filename":"6.png","url":"https://assets-eu.researchsquare.com/files/rs-4780290/v1/def2ab420cc2fc96e19c9319.png"},{"id":63287040,"identity":"b4c21114-301f-4035-aaab-02a3cfbbd77c","added_by":"auto","created_at":"2024-08-26 13:48:46","extension":"png","order_by":7,"title":"Figure 7","display":"","copyAsset":false,"role":"figure","size":253894,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eClinical indicator analysis further demonstrates the applicability of the risk score to colorectal cancer samples.\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003ea-g\u003c/strong\u003e Box plots display the distribution of risk scores across different groups based on clinical indicator values.\u003c/p\u003e","description":"","filename":"7.png","url":"https://assets-eu.researchsquare.com/files/rs-4780290/v1/463c09e3dcc7c2959d9b0069.png"},{"id":78183787,"identity":"2f72803b-bd78-4199-aff1-998b8c4b42a8","added_by":"auto","created_at":"2025-03-10 18:18:26","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":2250233,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-4780290/v1/ed190c79-cebb-4926-b859-34111f63056d.pdf"}],"financialInterests":"No competing interests reported.","formattedTitle":"A novel prognostic model for colorectal cancer based on epithelial cell marker genes identified and validated by combining Single-Cell and Bulk RNA- Sequencing","fulltext":[{"header":"Background","content":"\u003cp\u003eColorectal cancer is a prevalent malignant tumor, comprising approximately 10% of global cancer diagnoses and cancer-related deaths each year \u003csup\u003e[\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e]\u003c/sup\u003e, with 
nearly 9\u0026nbsp;million deaths annually. Over the past few decades, the incidence of colorectal cancer in high-income countries has stabilized or declined\u003csup\u003e[\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e]\u003c/sup\u003e, largely due to increased acceptance of colorectal cancer screening and colonoscopic polypectomy among the elderly\u003csup\u003e[\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e]\u003c/sup\u003e. In contrast, there has been a global rise in the incidence of colorectal cancer diagnosed in young people, known as early-onset colorectal cancer \u003csup\u003e[\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e]\u003c/sup\u003e. Additionally, factors such as family history, obesity, poor diet (high-fat, low-fiber diets), and long-term inflammatory bowel disease are also considered related to the occurrence of colorectal cancer\u003csup\u003e[\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e]\u003c/sup\u003e. Most colorectal cancers originate from polyps, a process that starts with abnormal crypts, evolves into pre-tumor lesions (polyps), and ultimately develops into colorectal cancer within an estimated period of 10\u0026ndash;15 years\u003csup\u003e[\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e]\u003c/sup\u003e. It is currently believed that the cells of origin for most colorectal cancers are stem cells or stem cell-like cells.\u003csup\u003e[\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e, \u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e]\u003c/sup\u003e. 
The emergence of these cancer stem cells stems from the progressive accumulation of genetic and epigenetic alterations that deactivate tumor suppressor genes and activate oncogenes.\u003c/p\u003e \u003cp\u003ePresently, the primary treatment modalities for colorectal cancer encompass surgical resection, chemotherapy, radiotherapy, and targeted therapy. Surgical resection represents the cornerstone of curative treatment for colorectal cancer, with the quality of resection playing a pivotal role in prognosis. Assessment of resection quality can be achieved through objective parameters\u003csup\u003e[\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e]\u003c/sup\u003e. As adjuvant therapy, fluoropyrimidine-based chemotherapy can improve the survival rates of resected stage III and high-risk stage II colon cancer (such as high-risk T4, poorly differentiated)\u003csup\u003e[\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e, \u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e]\u003c/sup\u003e. Preoperative radiotherapy is beneficial in reducing the risk of local recurrence, with the absolute risk reduction depending on clinical staging and surgical quality\u003csup\u003e[\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e]\u003c/sup\u003e. Currently available biomarkers for predicting prognosis and treatment response in CRC patients, such as carcinoembryonic antigen (CEA) and carbohydrate antigen 19\u0026thinsp;\u0026minus;\u0026thinsp;9 (CA19-9), have suboptimal sensitivity and specificity\u003csup\u003e[\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e, \u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e]\u003c/sup\u003e. Therefore, there is a need for more precise biomarkers. 
With the advancement of high-throughput genomic screening technologies, such as next-generation sequencing and microarray analysis, a multitude of molecular biomarkers and features have been identified. These have potential clinical prognostic and predictive value, identified through comprehensive association and bioinformatics analyses. Notably, several genomics-based biomarkers, such as mismatch repair (MMR) or microsatellite instability (MSI) status, have entered clinical practice and have been validated as predictive markers for adjuvant chemotherapy in stage II CRC patients\u003csup\u003e[\u003cspan additionalcitationids=\"CR15\" citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e]\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eWith the rapid advancements in next-generation sequencing technologies, an increasing number of studies are employing RNA sequencing (RNA-seq) to analyze gene expression patterns in colorectal cancer. However, RNA sequencing (RNA-seq) is typically conducted in bulk, where the data reflect the average gene expression patterns across a large population of cells\u003csup\u003e[\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e]\u003c/sup\u003e. Notably, single-cell RNA sequencing (ScRNA-seq) is a cutting-edge sequencing technology that offers detailed insights into the characteristics of individual immune cells or tumor cells\u003csup\u003e[\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e]\u003c/sup\u003e. ScRNA-seq indeed highlights intratumoral heterogeneity by revealing different subpopulations of cells within tumors. It also has the capability to quantify and analyze immune cell infiltration patterns within tumor tissues. 
In this study, we aimed to explore the distribution of different cell types, gene expression characteristics, and their correlations with clinical prognosis in colorectal cancer tissues using both single-cell RNA sequencing and gene chip analysis. Additionally, we developed a novel prognostic model leveraging feature genes identified through single-cell analysis.\u003c/p\u003e"},{"header":"Methods","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003eData source and preprocessing\u003c/h2\u003e \u003cp\u003eThe GEO database, managed by NCBI, stores gene expression data from various studies, aiding researchers in accessing and analyzing biological and biomedical research datasets. We retrieved the single-cell data files for GSE221575 from the GEO database, specifically selecting datasets from 4 samples that feature comprehensive single-cell expression profiles, essential for our detailed single-cell analysis. We additionally obtained the Series Matrix File data for GSE17536 from the NCBI GEO public database, annotated with GPL570, encompassing expression profiles from 177 patients. Furthermore, we acquired the Series Matrix File data for GSE38832 from the NCBI GEO public database, annotated with GPL570, featuring expression profile data from 122 patients. The TCGA database (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://portal.gdc.cancer.gov/\u003c/span\u003e\u003cspan address=\"https://portal.gdc.cancer.gov/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e), recognized as the largest repository of cancer-related genetic information, comprehensively archives diverse data types such as gene expression data, copy number variations, SNPs, and beyond. 
In our study, we accessed raw mRNA expression data specifically for colorectal cancer, encompassing a total of 701 samples, comprising 51 normal samples and 650 tumor samples.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec4\" class=\"Section2\"\u003e \u003ch2\u003eSingle-cell analysis\u003c/h2\u003e \u003cp\u003eInitially, the expression profiles were imported using the Seurat package. Subsequently, aberrant samples were excluded by evaluating UMI counts, the quantity of genes detected per cell, and the mitochondrial gene fraction. The data were subsequently standardized, normalized, and subjected to PCA (Principal Component Analysis) to achieve linear dimensionality reduction. The optimal number of principal components was identified using an elbow plot. Subsequently, UMAP (Uniform Manifold Approximation and Projection) was employed for nonlinear dimensionality reduction to elucidate the spatial relationships between clusters. Cell types and their associated marker genes within the tissue were identified and annotated through utilization of the CellMarker and PanglaoDB databases, supplemented by consulting pertinent literature sources.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec5\" class=\"Section2\"\u003e \u003ch2\u003eContribution of different cell subpopulations to colorectal cancer\u003c/h2\u003e \u003cp\u003eWe characterized the contribution of various cell subpopulations to the disease by evaluating both the cell counts and alterations in gene expression patterns. In summary, our approach involved conducting differential gene expression analysis to pinpoint the top 100 highly expressed genes in control versus tumor samples, treating these genes as feature markers for each group. Subsequently, we computed the differential expression levels and expression proportions of these genes within each cell subtype. 
We utilized the square root of the product of fold change (FC) and percentage proportion (PctProp) to gauge the contribution of these genes to the disease.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec6\" class=\"Section2\"\u003e \u003ch2\u003eModel construction and prognosis\u003c/h2\u003e \u003cp\u003eA candidate gene set was identified, and lasso regression was employed to develop a prognostic model. This involved incorporating the expression values of each selected gene to formulate a risk score formula for every patient. The coefficients derived from the lasso regression analysis were utilized to weight the contribution of each gene in the risk score calculation. This approach helps in predicting patient outcomes based on the expression levels of the selected genes. Based on the risk score formula derived from the lasso regression, patients were categorized into low-risk and high-risk groups using the median risk score as the threshold. To assess survival disparities between these groups, Kaplan-Meier analysis was conducted, and the results were compared using the log-rank test. Lasso regression analysis and stratified analysis were employed to rigorously examine the impact of the risk score on predicting patient prognosis. The precision of the model's predictions was meticulously evaluated by analyzing Receiver Operating Characteristic (ROC) curves, ensuring a thorough assessment of its predictive power.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec7\" class=\"Section2\"\u003e \u003ch2\u003eImmune cell infiltration analysis\u003c/h2\u003e \u003cp\u003eThe CIBERSORT method is indeed widely recognized for its application in evaluating immune cell types within biological microenvironments. This approach utilizes support vector regression to perform deconvolution analysis on the expression matrix of immune cell subtypes. With a robust framework built on 547 biomarkers, CIBERSORT effectively discriminates among 22 distinct human immune cell phenotypes. 
These phenotypes encompass a broad spectrum, encompassing T cells, B cells, plasma cells, and various myeloid subgroups. In this study, we used the CIBERSORT algorithm to analyze patient data, estimating the proportions of 22 immune infiltrating cell types. We then performed correlation analysis between gene expression and immune cell content.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec8\" class=\"Section2\"\u003e \u003ch2\u003eDrug sensitivity analysis\u003c/h2\u003e \u003cp\u003eBased on data from the Genomics of Drug Sensitivity in Cancer (GDSC) database, we utilized the R package \"pRRophetic\" to forecast the sensitivity of each tumor sample to chemotherapy. We obtained IC50 estimates for each specific chemotherapy drug using regression methods and validated the accuracy of these predictions through 10-fold cross-validation on the GDSC training set. In our analysis, default parameters were used throughout, including \"combat\" for batch effect removal and averaging for handling duplicate gene expression data.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec9\" class=\"Section2\"\u003e \u003ch2\u003eGSVA analysis (gene set difference analysis)\u003c/h2\u003e \u003cp\u003eGene Set Variation Analysis (GSVA) is a non-parametric, unsupervised method for evaluating transcriptome gene set enrichment. GSVA converts gene-level changes to pathway-level changes by comprehensively scoring gene sets of interest, thereby assessing the biological functions of samples. 
In this study, gene sets were downloaded from the Molecular Signatures Database (v7.0) and scored using the GSVA algorithm to assess potential biological function changes in different samples.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec10\" class=\"Section2\"\u003e \u003ch2\u003eGSEA analysis\u003c/h2\u003e \u003cp\u003ePatients were meticulously stratified into high-risk and low-risk cohorts according to the nuanced risk scores generated by the model, followed by an intricate exploration of signal pathway variances between these delineated groups using the powerful analytical tool known as Gene Set Enrichment Analysis (GSEA). The background gene sets were meticulously curated from the comprehensive MsigDB database, specifically tailored for subtype pathway annotation and rigorous differential expression analysis across subtypes. Enriched gene sets achieving statistical significance (adjusted p-value\u0026thinsp;\u0026lt;\u0026thinsp;0.05) were meticulously prioritized based on their consistency scores. Utilizing GSEA analysis, a widely adopted approach, allowed for an in-depth exploration into the intricate associations linking tumor subtypes with their profound biological implications.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec11\" class=\"Section2\"\u003e \u003ch2\u003eNomogram model construction\u003c/h2\u003e \u003cp\u003eA sophisticated nomogram, meticulously crafted through rigorous regression analysis, elegantly integrates both the nuanced risk scores and intricate clinical symptoms. Scaled line segments were meticulously plotted on a unified plane, each segment proportionally representing the interdependencies among variables within the predictive model. Through the construction of a multivariate regression model, distinct scores were assigned to each tier of influential factors, delineated by their respective contributions (regression coefficients) to the outcome variable. 
The cumulative total score was then computed by aggregating these individual scores, thereby yielding the predictive value.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec12\" class=\"Section2\"\u003e \u003ch2\u003eStatistical analysis\u003c/h2\u003e \u003cp\u003eSurvival curves were generated using the Kaplan-Meier method and were compared using the log-rank test. Multivariate analysis was conducted using the Cox proportional hazards model. All statistical analyses were conducted using R (version 4.3.0), with p\u0026thinsp;\u0026lt;\u0026thinsp;0.05 considered statistically significant.\u003c/p\u003e \u003c/div\u003e"},{"header":"Results","content":"\u003cdiv id=\"Sec14\" class=\"Section2\"\u003e \u003ch2\u003eDefinition of clusters and dimensionality reduction for visual representation of the cells\u003c/h2\u003e \u003cp\u003eFirst, we read the expression profiles using the Seurat package and filtered out low-expression genes (nFeature_RNA\u0026thinsp;\u0026gt;\u0026thinsp;200 \u0026amp; nFeature_RNA\u0026thinsp;\u0026lt;\u0026thinsp;6000 \u0026amp; percent.mt\u0026thinsp;\u0026lt;\u0026thinsp;25 \u0026amp; nCount_RNA\u0026thinsp;\u0026lt;\u0026thinsp;40000), resulting in 5,241 cells (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003ea-b). We displayed the top 5 genes with the highest standard deviation among these cells (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003ec). The data were then sequentially processed for standardization, normalization, PCA, and Harmony analysis (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003ed-f).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eUsing UMAP analysis, we determined the positional relationships between each cluster, identifying 19 cell clusters (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ea). 
Further annotation of each subtype in this study revealed that all cell clusters were annotated into the following cell categories: Plasma cell, T cell, Profiling NKT cell, Macrophage cell, B cell, Epithelial cell, Fibroblasts, Goblet cell, Mast cell, EC cell, and Cholangiocytes (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eb). We also presented a bubble plot of the classic markers for these 11 cell types (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ec) and a bar plot showing the proportion of cells corresponding to each group (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ed).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eIn comparing control versus tumor samples, we conducted screening to identify highly expressed genes. Subsequently, we quantified the differential expression levels and determined the expression proportions of these genes within each cell subtype. The disease contribution was determined by the square root of the FC * PctProp value, with Epithelial cells showing the highest contribution (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ee). Therefore, we selected highly expressed genes in control vs. tumor samples with avg_log2FC\u0026thinsp;\u0026gt;\u0026thinsp;1 and p_val_adj\u0026thinsp;\u0026lt;\u0026thinsp;0.05 as the candidate gene set for subsequent analysis.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec15\" class=\"Section2\"\u003e \u003ch2\u003eConstruction and validation of the predictive model based on epithelial cell marker genes\u003c/h2\u003e \u003cp\u003eUsing the candidate genes obtained in the previous step, we applied the lasso regression feature selection algorithm to identify characteristic genes in colorectal cancer. 
The processed colorectal cancer dataset from the TCGA database, containing patient survival information, was randomly divided into a training set and a test set at a 4:1 ratio. After lasso regression analysis (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003ea-Fig.\u0026nbsp;3c), we obtained the optimal risk score value for each sample for subsequent analysis. RiskScore\u0026thinsp;=\u0026thinsp;S100P * (-0.127895799000468)\u0026thinsp;+\u0026thinsp;PIGR * (-0.110505479982379)\u0026thinsp;+\u0026thinsp;RAB11FIP1 * (-0.110168920940582)\u0026thinsp;+\u0026thinsp;USP53 * (-0.0657253153577585)\u0026thinsp;+\u0026thinsp;CDH1 * (-0.0447173205642763)\u0026thinsp;+\u0026thinsp;LGALS4 * (-0.026899354451419)\u0026thinsp;+\u0026thinsp;ATP10B * (-0.0253973538959315)\u0026thinsp;+\u0026thinsp;SLC12A2 * (-0.0148820970575857)\u0026thinsp;+\u0026thinsp;LAMB3 * (0.0964297473594213).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003ePatients were stratified into high-risk and low-risk groups according to their calculated risk scores, and subsequent survival analysis was performed utilizing Kaplan-Meier curves. In both the training set and the test set, the survival rate was markedly lower in the high-risk group compared to the low-risk group (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003ed-Fig.\u0026nbsp;3e). Furthermore, ROC curve analysis from both the training and test sets indicated strong validation performance of the model (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003ef-Fig.\u0026nbsp;3g).\u003c/p\u003e \u003cp\u003eWe downloaded processed colorectal cancer patient data with survival information from the GEO database (GSE17536 and GSE38832). Using our model, we predicted the clinical classification of colorectal cancer patients sourced from the GEO database. 
Subsequently, we assessed survival differences between the predicted groups using Kaplan-Meier analysis to evaluate the stability and predictive accuracy of the model. The findings revealed a significant disparity in survival rates between the high-risk and low-risk groups within the external validation set obtained from the GEO database (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eh). To validate the accuracy of our model, we conducted ROC curve analysis using the external dataset. The results indicated robust predictive performance of the model in assessing patient prognosis (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003ei).\u003c/p\u003e \u003cp\u003e \u003cb\u003eAnalysis of immune cell infiltration to explore the impact of risk scores on the immune microenvironment in colorectal cancer\u003c/b\u003e \u003c/p\u003e \u003cp\u003eThe tumor microenvironment (TME) primarily consists of tumor-associated fibroblasts, immune cells, extracellular matrix, various growth factors, inflammatory factors, specific physicochemical characteristics, and the cancer cells themselves. The TME significantly influences tumor diagnosis, survival outcomes, and clinical treatment sensitivity. By analyzing the relationship between risk scores and tumor immune infiltration, we further investigated the potential molecular mechanisms through which risk scores impact colorectal cancer progression.\u003c/p\u003e \u003cp\u003eThe proportions of immune cells in the high-risk and low-risk groups are illustrated in Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003ea. Additionally, we compared the differences in immune cell content between these two groups. 
The results showed that the high-risk group had significantly lower levels of activated dendritic cells, plasma cells, and resting CD4 memory T cells, while the levels of M0 macrophages and CD8 T cells were significantly higher (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eb). Subsequently, we investigated the relationship between the risk score and immune cells. The study results indicated that the risk score was significantly positively correlated with M0 macrophages, CD8 T cells, and M2 macrophages, and significantly negatively correlated with resting CD4 memory T cells, activated dendritic cells, eosinophils, and plasma cells (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003ec).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec16\" class=\"Section2\"\u003e \u003ch2\u003eFurther analysis to explore the potential molecular mechanisms of risk scores impacting tumor progression\u003c/h2\u003e \u003cp\u003eThe treatment of early-stage colorectal cancer with surgery combined with chemotherapy has demonstrated clear efficacy. Our study utilized drug sensitivity data from the GDSC database and employed the R package \"pRRophetic\" to predict the chemotherapy sensitivity of each tumor sample. This approach allowed us to further explore the relationship between risk scores and sensitivity to common chemotherapy drugs. The study results indicated that the risk score was significantly associated with sensitivity to drugs such as AKT inhibitor VIII, Axitinib, BAY 61-3606, BIBW2992, BMS 708163, and Bicalutamide (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003ea).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eNext, we examined the specific signaling pathways involved in the high-risk and low-risk models to investigate the potential molecular mechanisms by which risk scores influence tumor progression. 
GSVA results showed that the differential pathways between the two groups were primarily enriched in the HEDGEHOG_SIGNALING, WNT_BETA_CATENIN_SIGNALING, and IL6_JAK_STAT3_SIGNALING pathways (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003eb). GSEA results indicated that the pathways involved included the Wnt signaling pathway, cell adhesion molecules, and the MAPK signaling pathway (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003ec). The molecular interaction network among these pathways is illustrated in Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003ed.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec17\" class=\"Section2\"\u003e \u003ch2\u003eConstruction of nomogram model\u003c/h2\u003e \u003cp\u003eIn this study, both univariate and multivariate analyses demonstrated that the risk score is an independent prognostic factor for colorectal cancer patients (Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003ea-Fig.\u0026nbsp;6b). Subsequently, samples were stratified into high-risk and low-risk groups based on the median value of the risk score. The results of the regression analysis were visualized as a nomogram, which showed that the risk score contributes significantly to the scoring of the nomogram prediction model across all samples (Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003ec). 
Additionally, predictions were made for both the three-year and five-year survival periods in colorectal cancer (Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003ed).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec18\" class=\"Section2\"\u003e \u003ch2\u003eClinical indicator analysis further demonstrates the applicability of risk score to colorectal cancer samples\u003c/h2\u003e \u003cp\u003eNext, we stratified samples based on the values of clinical indicators and displayed the corresponding risk score values using box plots (Fig.\u0026nbsp;\u003cspan refid=\"Fig7\" class=\"InternalRef\"\u003e7\u003c/span\u003ea-g). Through rank-sum tests, we identified significant differences in risk score distributions among groups defined by clinical indicators such as survival status (Fustat), M, N, T, and Stage (p-value\u0026thinsp;\u0026lt;\u0026thinsp;0.05). These findings suggest that the risk score derived from the modeling analysis is well-suited for subtyping colorectal cancer samples.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e"},{"header":"Discussion","content":"\u003cp\u003eColorectal cancer (CRC) is one of the most common cancers worldwide, with high incidence and mortality rates. It is reported that nearly 1.4\u0026nbsp;million new cases of CRC and 700,000 CRC-related deaths occur globally each year\u003csup\u003e[\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e]\u003c/sup\u003e. Brenner et al. found that patients with early-diagnosed colorectal cancer have a 5-year survival rate exceeding 90%\u003csup\u003e[\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e]\u003c/sup\u003e. However, due to inadequate diagnostic methods, colorectal cancer is often diagnosed at advanced stages\u003csup\u003e[\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e]\u003c/sup\u003e. 
Despite significant improvements in diagnosis and treatment, the 5-year survival rate for patients diagnosed with metastatic colorectal cancer remains low, at approximately 12%\u003csup\u003e[\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e]\u003c/sup\u003e. Therefore, there is an urgent need to elucidate the molecular mechanisms of colorectal cancer development and to identify novel biomarkers for early detection and prognosis assessment to improve survival outcomes.\u003c/p\u003e \u003cp\u003eSingle-cell RNA sequencing (scRNA-seq) has emerged as a valuable tool for transcriptomic profiling of various cancer cell types, crucial for identifying potential therapeutic targets. In this study, we utilized colorectal cancer scRNA-seq data from the GEO database to define cellular subpopulations within tumors and characterize their contributions to the disease based on cell numbers and gene expression changes. We then selected marker genes with the highest disease relevance from these subpopulations as a candidate gene set for further analysis. This led to the construction of a prognostic risk model with favorable prognostic efficiency, which may also serve as a biomarker for predicting immunotherapy response. A similar view was proposed by Lu et al.\u003csup\u003e[\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e]\u003c/sup\u003e. They applied scRNA-seq to analyze the heterogeneity of tumor immune cells, developing a 3-gene biomarker (including CLTA, TALDO1, and CSTB) based on tumor immune microenvironment (TIME) heterogeneity to predict survival outcomes and immunotherapy responses. 
Zheng et al.\u003csup\u003e[\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e]\u003c/sup\u003e selected 6 prognosis-related HUB genes from GEO esophageal squamous cell carcinoma (ESCC) and TCGA esophageal cancer datasets, showing significantly increased expression of HUB genes in normal tissues and cells based on scRNA-seq.\u0026nbsp;Further Kaplan-Meier survival analysis and immune infiltration analysis indicated that HUB genes are promising biomarkers for ESCC diagnosis and prognosis. Additionally, studies utilizing scRNA-seq technology have elucidated intercellular interactions in gliomas, identifying autocrine ligand-receptor signaling that significantly impacts prognosis in glioma patients\u003csup\u003e[\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e]\u003c/sup\u003e. Collectively, these findings demonstrate that scRNA-seq technology can effectively dissect and identify potential prognostic biomarkers, which are crucial for pinpointing therapeutic targets and improving patient survival outcomes.\u003c/p\u003e \u003cp\u003eIn our study, the prognostic signature composed of nine marker genes (S100P, PIGR, RAB11FIP1, USP53, CDH1, LGALS4, ATP10B, SLC12A2, and LAMB3) may provide valuable insights into the molecular mechanisms of colorectal cancer (CRC). For instance, S100P, a 95-amino acid protein and member of the S100 family, plays a crucial role in regulating cell differentiation, proliferation, migration, apoptosis, and other biological functions by interacting with various signaling proteins such as P53, β-catenin, and nuclear factor-κB (NF-κB). Through these interactions, S100P is involved in tumorigenesis and tumor progression. Research has shown that SIX3 can downregulate S100P via the Wnt/β-catenin signaling pathway, thereby inhibiting cell migration and proliferation\u003csup\u003e[\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e]\u003c/sup\u003e. 
Another study identified that S100P mRNA levels correlate with the activation status of the PI3K/AKT pathway, a classical pathway involved in promoting cancer migration, invasion, proliferation, and drug resistance\u003csup\u003e[\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e]\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eThe performance of the prognostic model based on the nine marker genes was validated in both the test and GEO cohorts, yielding consistent results across the two cohorts, indicating good effectiveness and reproducibility of the model. Various validation methods, including univariate, multivariate, and clinical indicator analyses, demonstrated that the nomogram model has high predictive accuracy. Therefore, the nomogram can guide the establishment of personalized examination procedures for CRC patients, promoting the effective utilization of medical resources.\u003c/p\u003e \u003cp\u003eGiven that the tumor microenvironment (TME) plays a crucial role in anti-tumor responses and significantly influences tumor diagnosis, survival outcomes, and clinical treatment sensitivity\u003csup\u003e[\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e]\u003c/sup\u003e, we investigated the relationship between risk score and tumor immune infiltration. Firstly, we observed significant decreases in activated dendritic cells, plasma cells, and resting CD4 memory T cells in the high-risk group, suggesting that these patients may be in a relatively immunosuppressive state. Secondly, the study results showed significant positive correlations between the risk score and M0 macrophages, CD8 T cells, and M2 macrophages, and significant negative correlations with resting CD4 memory T cells, activated dendritic cells, eosinophils, and plasma cells. 
This indicates that the TME of the high-risk group may function to reduce inflammation, promote tumor growth, and suppress immunity.\u003c/p\u003e \u003cp\u003eTo better guide CRC treatment, we conducted drug sensitivity analyses on different risk groups, studying six common chemotherapy drugs for colorectal cancer. The results indicated that the low-risk group is sensitive to five anticancer drugs, while the high-risk group is sensitive to one anticancer drug. These findings provide a reference for the clinical selection of chemotherapy drugs. In future studies, we will further explore the clinical significance of these drugs for CRC patients.\u003c/p\u003e \u003cp\u003eInevitably, our study has some inherent limitations. Firstly, all cohort studies are retrospective and require further validation in prospective cohort studies. Secondly, further mechanistic studies are needed to reveal the exact role of each gene, and drug sensitivity needs further confirmation through cellular experiments. Thirdly, the number and volume of scRNA-seq samples available in public databases are limited, resulting in an incomplete analysis of clinical and pathological parameters, which may lead to potential biases. Therefore, it is necessary to conduct multicenter, large-sample, prospective double-blind trials for further verification in the future.\u003c/p\u003e"},{"header":"Conclusion","content":"\u003cp\u003eThis study comprehensively applied various bioinformatics methods to reveal the distribution of different cell types, gene expression characteristics, and their association with clinical prognosis in colorectal cancer tissues. 
The constructed risk score model demonstrates good predictive performance, offering valuable insights for the personalized treatment of colorectal cancer patients and guiding further exploration of the disease's pathogenesis and therapeutic targets.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eAuthor Contributions\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eAll the authors participated in writing the manuscript and in drawing the figures. All authors read and approved the final manuscript.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eDeclaration of Interests\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors declare no competing interests.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eData Availability\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003escRNA-seq and RNA-seq data can be obtained from the GEO and TCGA databases.\u003c/p\u003e\n\u003cp\u003e(GSE221575 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE221575)\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eBray F, Ferlay J, Soerjomataram I, et al. Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries[J]. CA Cancer J Clin,2018,68(6):394-424.\u003c/li\u003e\n\u003cli\u003eWei X, Liang J, Liu J, et al. Anchang Yuyang Decoction inhibits experimental colitis-related carcinogenesis by regulating PPAR signaling pathway and affecting metabolic homeostasis of host and microbiota[J]. J Ethnopharmacol,2024,326:117995.\u003c/li\u003e\n\u003cli\u003eMurphy C C, Wallace K, Sandler R S, et al. Racial Disparities in Incidence of Young-Onset Colorectal Cancer and Patient Survival[J]. Gastroenterology,2019,156(4):958-965.\u003c/li\u003e\n\u003cli\u003eRho Y S, Gilabert M, Polom K, et al. Comparing Clinical Characteristics and Outcomes of Young-onset and Late-onset Colorectal Cancer: An International Collaborative Study[J]. 
Clin Colorectal Cancer,2017,16(4):334-342.\u003c/li\u003e\n\u003cli\u003eDekker E, Tanis P J, Vleugels J, et al. Colorectal cancer[J]. Lancet,2019,394(10207):1467-1480.\u003c/li\u003e\n\u003cli\u003eMedema J P. Cancer stem cells: the challenges ahead[J]. Nat Cell Biol,2013,15(4):338-344.\u003c/li\u003e\n\u003cli\u003eNassar D, Blanpain C. Cancer Stem Cells: Basic Concepts and Therapeutic Implications[J]. Annu Rev Pathol,2016,11:47-76.\u003c/li\u003e\n\u003cli\u003eBondeven P, Hagemann-Madsen R H, Laurberg S, et al. Extent and completeness of mesorectal excision evaluated by postoperative magnetic resonance imaging[J]. Br J Surg,2013,100(10):1357-1367.\u003c/li\u003e\n\u003cli\u003eAndr\u0026eacute; T, Boni C, Navarro M, et al. Improved overall survival with oxaliplatin, fluorouracil, and leucovorin as adjuvant treatment in stage II or III colon cancer in the MOSAIC trial[J]. J Clin Oncol,2009,27(19):3109-3116.\u003c/li\u003e\n\u003cli\u003eHaller D G, Tabernero J, Maroun J, et al. Capecitabine plus oxaliplatin compared with fluorouracil and folinic acid as adjuvant therapy for stage III colon cancer[J]. J Clin Oncol,2011,29(11):1465-1471.\u003c/li\u003e\n\u003cli\u003eMa B, Gao P, Wang H, et al. What has preoperative radio(chemo)therapy brought to localized rectal cancer patients in terms of perioperative and long-term outcomes over the past decades? A systematic review and meta-analysis based on 41,121 patients[J]. Int J Cancer,2017,141(5):1052-1065.\u003c/li\u003e\n\u003cli\u003eAlderson P, Tan T. The use of Cochrane Reviews in NICE clinical guidelines[J]. Cochrane Database Syst Rev,2011,2011(12):ED000032.\u003c/li\u003e\n\u003cli\u003ePrimrose J N, Perera R, Gray A, et al. Effect of 3 to 5 years of scheduled CEA and CT follow-up to detect recurrence of colorectal cancer: the FACS randomized clinical trial[J]. JAMA,2014,311(3):263-270.\u003c/li\u003e\n\u003cli\u003ePag\u0026egrave;s F, Mlecnik B, Marliot F, et al. 
International validation of the consensus Immunoscore for the classification of colon cancer: a prognostic and accuracy study[J]. Lancet,2018,391(10135):2128-2139.\u003c/li\u003e\n\u003cli\u003eSargent D J, Marsoni S, Monges G, et al. Defective mismatch repair as a predictive marker for lack of efficacy of fluorouracil-based adjuvant therapy in colon cancer[J]. J Clin Oncol,2010,28(20):3219-3226.\u003c/li\u003e\n\u003cli\u003eLe DT, Durham J N, Smith K N, et al. Mismatch repair deficiency predicts response of solid tumors to PD-1 blockade[J]. Science,2017,357(6349):409-413.\u003c/li\u003e\n\u003cli\u003eOlsen T K, Baryawno N. Introduction to Single-Cell RNA Sequencing[J]. Curr Protoc Mol Biol,2018,122(1):e57.\u003c/li\u003e\n\u003cli\u003eTorroja C, Sanchez-Cabo F. Corrigendum: Digitaldlsorter: Deep-Learning on scRNA-Seq to Deconvolute Gene Expression Data[J]. Front Genet,2019,10:1373.\u003c/li\u003e\n\u003cli\u003eTorre L A, Bray F, Siegel R L, et al. Global cancer statistics, 2012[J]. CA Cancer J Clin,2015,65(2):87-108.\u003c/li\u003e\n\u003cli\u003eBrenner H, Kloor M, Pox C P. Colorectal cancer[J]. Lancet,2014,383(9927):1490-1502.\u003c/li\u003e\n\u003cli\u003ePinsky P F, Doroudi M. Colorectal Cancer Screening[J]. JAMA,2016,316(16):1715.\u003c/li\u003e\n\u003cli\u003eSiegel R L, Miller K D, Jemal A. Cancer statistics, 2015[J]. CA Cancer J Clin,2015,65(1):5-29.\u003c/li\u003e\n\u003cli\u003eLu J, Chen Y, Zhang X, et al. A novel prognostic model based on single-cell RNA sequencing data for hepatocellular carcinoma[J]. Cancer Cell Int,2022,22(1):38.\u003c/li\u003e\n\u003cli\u003eZheng L, Li L, Xie J, et al. Six Novel Biomarkers for Diagnosis and Prognosis of Esophageal squamous cell carcinoma: validated by scRNA-seq and qPCR[J]. J Cancer,2021,12(3):899-911.\u003c/li\u003e\n\u003cli\u003eYuan D, Tao Y, Chen G, et al. Systematic expression analysis of ligand-receptor pairs reveals important cell-to-cell interactions inside glioma[J]. 
Cell Commun Signal,2019,17(1):48.\u003c/li\u003e\n\u003cli\u003eLiu S, Tian Y, Zheng Y, et al. TRIM27 acts as an oncogene and regulates cell proliferation and metastasis in non-small cell lung cancer through SIX3-\u0026beta;-catenin signaling[J]. Aging (Albany NY),2020,12(24):25564-25580.\u003c/li\u003e\n\u003cli\u003eDe Marco C, Laudanna C, Rinaldo N, et al. Specific gene expression signatures induced by the multiple oncogenic alterations that occur within the PTEN/PI3K/AKT pathway in lung cancer[J]. PLoS One,2017,12(6):e0178865.\u003c/li\u003e\n\u003cli\u003ePitt J M, Marabelle A, Eggermont A, et al. Targeting the tumor microenvironment: removing obstruction to anticancer immune responses and immunotherapy[J]. Ann Oncol,2016,27(8):1482-1492.\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":true,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"scientific-reports","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"scirep","sideBox":"Learn more about [Scientific Reports](http://www.nature.com/srep/)","snPcode":"","submissionUrl":"","title":"Scientific Reports","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"Scientific Reports","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"colorectal cancer, prognostic model, scRNA-seq, epithelial cell marker genes","lastPublishedDoi":"10.21203/rs.3.rs-4780290/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-4780290/v1","license":{"name":"CC BY 
4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003ch2\u003eBackground\u003c/h2\u003e \u003cp\u003eColorectal cancer (CRC) is a prevalent malignant tumor characterized by high global incidence and mortality rates. Furthermore, it is imperative to comprehend the molecular mechanisms underlying its development and to identify effective prognostic markers. These efforts are crucial for pinpointing potential therapeutic targets and enhancing patient survival rates. Therefore, we developed a novel prognostic model aimed at providing new theoretical support for clinical prognosis evaluation and treatment.\u003c/p\u003e\u003ch2\u003eMethods\u003c/h2\u003e \u003cp\u003eWe downloaded data from the Gene Expression Omnibus (GEO) and The Cancer Genome Atlas (TCGA) databases. Subsequently, we performed single-cell analysis and developed a prognostic model associated with colorectal cancer.\u003c/p\u003e\u003ch2\u003eResults\u003c/h2\u003e \u003cp\u003eWe divided the scRNA-seq dataset (GSE221575) into 19 cell clusters and classified these clusters into 11 distinct cell types using marker genes. Using univariate Cox regression and LASSO (Least Absolute Shrinkage and Selection Operator) analyses, we developed a prognostic model consisting of 9 genes. Based on our 9-gene model, we divided patients into high-risk and low-risk groups using the median risk score. The high-risk group demonstrated significant positive correlations with M0 macrophages, CD8\u0026thinsp;+\u0026thinsp;T cells, and M2 macrophages. The enrichment analyses indicate significant enrichment of immune-related pathways in the high-risk group, including HEDGEHOG_SIGNALING, Wnt signaling pathway, and cell adhesion molecules. Drug sensitivity analysis revealed that the low-risk group was sensitive to 5 chemotherapeutic drugs, while the high-risk group was sensitive to only 1. Additionally, we developed a highly reliable nomogram for clinical application. 
This suggests that the risk score derived from our modeling analysis is highly effective for stratifying colorectal cancer samples.\u003c/p\u003e\u003ch2\u003eConclusions\u003c/h2\u003e \u003cp\u003eThis study comprehensively applied bioinformatics methods to construct a risk score model. The model showed good predictive performance, offering potential guidance for individualized treatment of colorectal cancer patients. Furthermore, it may provide valuable insights into the disease's pathogenesis and identify potential therapeutic targets for further research.\u003c/p\u003e","manuscriptTitle":"A novel prognostic model for colorectal cancer based on epithelial cell marker genes identified and validated by combining Single-Cell and Bulk RNA- Sequencing","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2024-08-26 13:48:41","doi":"10.21203/rs.3.rs-4780290/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Revision requested","date":"2025-01-08T07:24:10+00:00","index":"","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-01-07T21:12:43+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"21398306519715597449502893331967177473","date":"2024-12-27T03:09:29+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"96172213770687887792122963205927092272","date":"2024-11-18T16:42:47+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2024-10-18T23:40:15+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"98838112731994602657893667324405474736","date":"2024-10-18T23:00:58+00:00","index":"hide","fulltext":""},{"type":"reviewersInvited","content":"","date":"2024-08-09T02:26:48+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2024-07-31T21:04:28+00:00","index":"","fulltext":""},{"type":"editorInvited","content":"","date":"2024-07-26T14:39:52+00:00","index":"","fulltext":""},{"type":"ch
ecksComplete","content":"","date":"2024-07-24T04:27:57+00:00","index":"","fulltext":""},{"type":"submitted","content":"Scientific Reports","date":"2024-07-22T08:28:25+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"scientific-reports","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"scirep","sideBox":"Learn more about [Scientific Reports](http://www.nature.com/srep/)","snPcode":"","submissionUrl":"","title":"Scientific Reports","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"Scientific Reports","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"36643091-c60b-4903-b486-bbd479982174","owner":[],"postedDate":"August 26th, 2024","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"published-in-journal","subjectAreas":[{"id":36016596,"name":"Biological sciences/Cancer/Cancer screening"},{"id":36016597,"name":"Biological sciences/Cancer/Gastrointestinal cancer"},{"id":36016598,"name":"Biological sciences/Cancer/Tumour biomarkers"}],"tags":[],"updatedAt":"2025-03-10T17:46:33+00:00","versionOfRecord":{"articleIdentity":"rs-4780290","link":"https://doi.org/10.1038/s41598-025-91761-y","journal":{"identity":"scientific-reports","isVorOnly":false,"title":"Scientific Reports"},"publishedOn":"2025-03-07 15:57:06","publishedOnDateReadable":"March 7th, 2025"},"versionCreatedAt":"2024-08-26 
13:48:41","video":"","vorDoi":"10.1038/s41598-025-91761-y","vorDoiUrl":"https://doi.org/10.1038/s41598-025-91761-y","workflowStages":[]},"version":"v1","identity":"rs-4780290","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-4780290","identity":"rs-4780290","version":["v1"]},"buildId":"qtupq5eGEP_6zYnWcrvyt","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
