Single-Cell Transcriptomics and Machine Learning Algorithms Unveil Metastasis-Associated Cellular Subtypes and Prognostic Signatures in Colorectal Cancer

Research Square
Research Article

Ke Pu

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-6479548/v1
This work is licensed under a CC BY 4.0 License.
Status: Posted (Version 1)

Abstract

Background
Colorectal cancer (CRC) is a prevalent digestive tract malignancy, with liver metastasis occurring in up to 50% of cases. Identifying reliable early metastasis markers is crucial for improving CRC prognosis.

Methods
In this study, we analyzed single-cell RNA sequencing data from CRC patients, including primary tumors, adjacent normal tissues, and liver metastases. Copy number variation (CNV) analysis using the CopyKAT algorithm distinguished tumor from non-tumor cells.
We identified key tumor subtypes influencing metastasis through differential gene expression and pathway analyses. Leveraging 103 combinations of machine learning algorithms, we developed a metastasis-associated risk model based on the identified biomarkers. The model was validated across multiple external datasets.

Results
We delineated five tumor cell subtypes, with EMP1+ cells emerging as a key subtype in CRC metastasis. The machine learning approach identified a five-gene signature (SPINK1, PLAC8, LAMB3, CEACAM5, CDA) for metastasis risk prediction. The risk model significantly stratified patients into high- and low-risk groups across six independent cohorts, with high-risk scores correlating with poorer survival. Gene set enrichment analysis revealed enrichment of epithelial-mesenchymal transition (EMT) pathways in the high-risk group. Mutation analysis showed higher overall mutation frequencies in the high-risk group, particularly in genes such as APC, TP53, and KRAS.

Conclusion
Our single-cell transcriptomics and machine learning approach uncovered novel cellular subtypes and a gene signature associated with CRC metastasis, providing new insights for early diagnosis and potential therapeutic targets.

Keywords: Oncology, Bioinformatics, Computational Biology, Colorectal cancer (CRC), Single-cell RNA sequencing, Machine Learning, Metastasis, Prognostic biomarkers

Introduction
Colorectal cancer (CRC) is the third most common cancer worldwide and the second leading cause of cancer-related deaths. There were an estimated 1,931,590 new cases of CRC in 2020, accounting for 10.01% of all new cancer cases, and 935,173 CRC-related deaths, accounting for 9.39% of all cancer-related deaths (1). Due to improvements in early detection methods and the widespread use of colonoscopy, the annual incidence of CRC in the United States decreased by 46% from 1985 to 2019, showing a stable downward trend.
However, the incidence and mortality rates of CRC in China are still increasing, with a large population base, resulting in a heavier burden of CRC in China( 2 ). Distant metastasis is the key reason for death in colorectal cancer-related deaths, with a 5-year survival rate of approximately 14%. The liver and lung are the main sites of metastasis. Meanwhile, about 20% of newly diagnosed CRC cases have distant metastases, and approximately 50% of patients will develop liver metastases during subsequent follow-up( 3 ), possibly due to blood flow from the gastrointestinal tract through the portal vein to the liver( 4 ). Despite the continuous development of drug therapy, early diagnosis and surgical resection of CRC are still considered effective key measures. Traditional serum markers for the early detection and diagnosis of CRC metastasis include CEA, CA19-9, and CA125, but they reduce the reliability of diagnosis in terms of sensitivity and specificity. Liver color Doppler ultrasound and abdominal CT are not sensitive enough to diagnose early CRC liver metastasis, while PET-CT and liver biopsy are usually not preferred due to economic and operational reasons( 5 ). Therefore, further research on the molecular mechanisms of CRC metastasis progression is essential for identifying key genes for CRC metastasis and prognosis. Benefiting from the technological advancements in RNA sequencing and the availability of large amounts of public data from The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus (GEO), we can analyze transcriptomic data to identify key factors in CRC metastasis, including intrinsic cellular factors (genetic abnormalities, tumor cell heterogeneity, epithelial-mesenchymal transition (EMT)) and the tumor microenvironment (TME)( 6 ). This analysis can lead to the identification of new biomarkers, including DNA, proteins, and RNA, to help us make more accurate predictions( 7 ). 
Machine learning (ML) has been widely applied in emerging technologies such as pathology and imaging. Models that combine machine learning algorithms with transcriptomic data can effectively predict tumor outcomes, treatment responses, and risk markers (8). In this study, through a series of bioinformatics analyses, we identified key tumor cell subgroups related to metastasis. Patients were divided into high- and low-risk groups based on CRC metastasis risk scores, and the two groups differed in prognosis, functional enrichment, and tumor mutation burden.

Methods and Materials

Data Collection
Single-cell RNA sequencing (scRNA-seq) data were obtained from the Gene Expression Omnibus (GEO) database (accession number: GSE221575). This dataset comprises scRNA-seq data from two pairs of colorectal tumor and adjacent normal tissues, as well as one matched liver metastasis tissue sample. Clinical information associated with the samples was also retrieved. Transcriptome expression data and clinical information were downloaded from The Cancer Genome Atlas (TCGA) and GEO (GSE17536, GSE17538, GSE221575, GSE38832, and GSE39582) databases.

Single-cell Data Analysis
Data processing and analysis were performed using the Seurat package (version 4) in R (version 4.3.1). Cells were retained if they met the following criteria: gene count between 300 and 10,000, unique molecular identifier (UMI) count between 1,000 and 100,000, mitochondrial read rate <30%, and hemoglobin read rate <5%. Data normalization was conducted using the LogNormalize method with a scale factor of 10,000. The top 2,000 highly variable genes were identified using the vst method. Batch effect correction was performed using the Harmony algorithm. Principal component analysis (PCA) was conducted, and the optimal number of principal components (PCs) was determined using ElbowPlot and JackStrawPlot.
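The cell-level quality-control thresholds and the LogNormalize step described above can be illustrated with a minimal Python sketch. It is not the authors' Seurat code: the function and constant names are hypothetical, the count bounds are assumed to be inclusive (the text does not say), and LogNormalize is assumed to mean log1p of each count divided by the cell total and multiplied by the scale factor.

```python
import math

# Thresholds taken from the text; names are hypothetical.
MIN_GENES, MAX_GENES = 300, 10_000      # detected genes per cell
MIN_UMIS, MAX_UMIS = 1_000, 100_000     # UMI counts per cell
MAX_MITO_FRACTION = 0.30                # mitochondrial read rate < 30%
MAX_HB_FRACTION = 0.05                  # hemoglobin read rate < 5%
SCALE_FACTOR = 10_000


def passes_qc(n_genes, n_umis, mito_fraction, hb_fraction):
    """Return True if a cell meets all four filtering criteria.

    Assumption: the gene and UMI bounds are inclusive.
    """
    return (
        MIN_GENES <= n_genes <= MAX_GENES
        and MIN_UMIS <= n_umis <= MAX_UMIS
        and mito_fraction < MAX_MITO_FRACTION
        and hb_fraction < MAX_HB_FRACTION
    )


def log_normalize(counts, scale_factor=SCALE_FACTOR):
    """Scale one cell's counts to a common total, then apply log1p."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]
    return [math.log1p(c / total * scale_factor) for c in counts]


if __name__ == "__main__":
    print(passes_qc(n_genes=2500, n_umis=8000, mito_fraction=0.12, hb_fraction=0.01))
    print(log_normalize([1, 0, 3]))
```

In practice these steps operate on a sparse gene-by-cell matrix; the per-cell lists here only keep the arithmetic visible.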
Uniform Manifold Approximation and Projection (UMAP) was employed for dimensionality reduction and visualization. Cell type markers were obtained from the CellMarker 2.0 website (http://117.50.127.228/CellMarker/).

Differential Gene Expression Analysis
Cells were clustered using the Louvain algorithm with a resolution of 0.3. Differential gene expression analysis was performed using the FindAllMarkers function in Seurat, with the following parameters: only positive markers, minimum percentage of cells expressing the gene in either of the two groups ≥25%, and log fold change ≥0.25.

Metabolic Pathway Analysis
Metabolic activities of different cell types were analyzed using the scMetabolism algorithm. The activities of various metabolic pathways were assessed using four distinct methods, AUCell, UCell, singscore, and ssGSEA, as implemented in the irGSEA package.

Gene Set Enrichment Analysis
Gene set enrichment analysis was performed using the irGSEA package with the Molecular Signatures Database (MSigDB) Hallmark gene sets. Enrichment scores were calculated for each cell using the four methods listed above. Results were integrated and visualized using the irGSEA.integrate function, with cells grouped by sample type (normal, tumor, metastasis).

Copy Number Variation (CNV) Analysis
To distinguish between benign and malignant cells in colorectal cancer (CRC) patients, we employed the CopyKAT algorithm (version 1.1.0). This method infers large-scale chromosomal copy number variations from single-cell RNA sequencing data.

Tumor Cell Subtype Classification
Cells identified as malignant by CopyKAT were subjected to further analysis. We performed unsupervised clustering using Seurat (version 4) with the following parameters: 2,000 highly variable genes, 5 principal components, and a resolution of 0.2 for the FindClusters function. This resulted in the identification of five distinct tumor cell subtypes.
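The positive-marker criteria stated under Differential Gene Expression Analysis (detection in at least 25% of cells in either group, log fold change of at least 0.25) can be sketched as a simple filter. This is an illustrative simplification rather than Seurat's implementation: the function names are hypothetical, the statistical test and p-value adjustment are omitted, and the fold change is assumed to be a log2 ratio of mean expression with a pseudocount of 1, since the text does not state the base or pseudocount.

```python
import math


def pct_expressing(values):
    """Fraction of cells with non-zero expression of the gene."""
    return sum(1 for v in values if v > 0) / len(values)


def log2_fold_change(cluster_vals, rest_vals):
    """log2 ratio of mean expression (pseudocount 1) between two groups.

    Inputs are assumed to be log1p-normalized, so expm1 recovers the
    normalized scale before averaging.
    """
    def mean_expr(vals):
        return sum(math.expm1(v) for v in vals) / len(vals)

    return math.log2(mean_expr(cluster_vals) + 1) - math.log2(mean_expr(rest_vals) + 1)


def is_positive_marker(cluster_vals, rest_vals, min_pct=0.25, min_logfc=0.25):
    """Apply the detection-rate and fold-change thresholds from the text."""
    detected = max(pct_expressing(cluster_vals), pct_expressing(rest_vals)) >= min_pct
    return detected and log2_fold_change(cluster_vals, rest_vals) >= min_logfc


if __name__ == "__main__":
    cluster = [math.log1p(3.0)] * 4               # gene expressed in every cluster cell
    rest = [0.0, 0.0, 0.0, math.log1p(4.0)]       # expressed in one of four other cells
    print(log2_fold_change(cluster, rest), is_positive_marker(cluster, rest))
```

Restricting to positive markers means a gene that is lower in the cluster than elsewhere fails the fold-change test regardless of how widely it is detected.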
Subtype Characterization
Each subtype was characterized based on its marker genes, identified through differential expression analysis (FindAllMarkers function in Seurat, logFC threshold = 0.25, adjusted p-value < 0.05). The subtypes were named according to their most prominent marker genes: CCL20 cells, ARGLU1 cells, EMP1 cells, TOP2A cells, and SRGN cells.

Metastasis-related Pathway Analysis
To identify key cell subtypes potentially affecting metastasis, we focused on pathways related to epithelial-mesenchymal transition (EMT). Enrichment scores for EMT-related gene sets were calculated for each tumor subtype using the AUCell method from the GSVA package (version 1.46.0).

Pseudotime Analysis
The Monocle3 algorithm was used to construct a single-cell pseudotime trajectory for the tumor cell subtypes, and the expression changes along the trajectory were displayed for the top 10 DEGs and for the top 10 genes of EMP1 cells.

Cell Interaction Analysis
CellChat was used to determine the differences in cell communication between tumor subtypes, calculate the strength of receptor-ligand interactions, and display the key cell group involved in metastasis (EMP1 cells).

Enrichment Analysis of Tumor Cell Subtypes
The FindAllMarkers function was used to calculate marker genes for each cell cluster compared to all cells in their respective subsets. The clusterProfiler package was used to perform KEGG (Kyoto Encyclopedia of Genes and Genomes) and GO (Gene Ontology) enrichment analysis for selected DEGs.

Construction of Risk Model Based on Machine Learning and Evaluation of Prognostic Value of Risk Model
Machine learning selects informative feature variables from a large number of input variables for developing and evaluating classification and prediction algorithms.
In this study, we evaluated 103 combinations of machine learning methods, selected feature genes based on the best-performing algorithm, and constructed a risk model. Based on the median risk score, patients in each cohort, including the external validation cohorts, were divided into high-risk and low-risk groups, and the prognostic value of the risk model in each cohort was analyzed using the Kaplan-Meier method.

GSEA and TMB Analysis
We analyzed the somatic mutation spectrum of GEO-CRC patient samples using the R package "maftools". The top 15 most frequently mutated genes in samples from the different risk groups were analyzed. To explore the functional enrichment pathways and signature gene sets related to the high and low metastasis risk groups, gene set enrichment analysis was performed using the clusterProfiler R package, and signature gene sets were downloaded from the Molecular Signatures Database (MSigDB, http://software.broadinstitute.org/gsea/msigdb/) (h.all.v7.3).

Results

scRNA-seq analysis delineated distinct cellular compositions in normal, primary tumor, and liver metastasis samples of colorectal cancer
We obtained single-cell transcriptome sequencing data (GSE221575) from two paired colorectal tumor and adjacent normal tissues, as well as one matched liver metastatic tissue. After quality control and batch effect removal, single-cell clustering analysis identified epithelial cells, endothelial cells, fibroblasts, and immune cells distributed across the normal, primary tumor, and liver metastasis samples (Fig. 1 A-C). We also displayed the four cell types according to their marker genes (Fig. 1 D-G). In terms of cell proportions, primary non-metastatic tumor tissues had higher proportions of epithelial cells and fibroblasts, and a lower proportion of immune cells, than the normal and metastatic tissues.
However, apart from endothelial cells, there was no distinct difference in cell proportions between the metastatic tumor and the controls (Fig. 1 H-I).

scRNA-seq analysis delineated distinct signaling pathways in normal, primary tumor, and liver metastasis samples of colorectal cancer
Using the above methods, we identified 1,222 differentially expressed genes among the normal, tumor, and metastasis cohorts. Based on these 1,222 genes, we performed enrichment analysis of metabolic pathway activity in the three groups (Fig. 2 A). We found significant heterogeneity in metabolic pathways among the three groups. For instance, in the normal cohort, tyrosine metabolism and tryptophan metabolism were significantly stronger than in the tumor and metastasis cohorts; in the tumor cohort, ketone body metabolism, purine metabolism, pyrimidine metabolism, and propanoate metabolism were significantly stronger than in the normal and metastasis cohorts. Similar inter-cohort heterogeneity was also observed in the metastasis cohort. Subsequently, we conducted AUCell scoring based on the signaling pathways enriched by the differential genes. As shown in Fig. 2 B, compared with the normal samples, primary and metastatic CRC showed activation of pathways associated with cell proliferation and migration, such as the Wnt/β-catenin, TGF-β, PI3K/AKT/mTOR, KRAS, and EMT (epithelial-mesenchymal transition) signaling pathways, as well as metabolic pathways including glycolysis and cholesterol homeostasis. In metastatic CRC, some pathways were less active, including TP53, Notch, myogenesis, adipogenesis, hypoxia, and apoptosis, whereas others were more active, including interferon-α response, allograft rejection, IL2-STAT5, heme metabolism, fatty acid metabolism, and interferon-β response.
We semi-quantitatively presented the landscapes of the highly activated metastasis-related signaling pathways, including interferon-α response, heme metabolism, allograft rejection, and epithelial-to-mesenchymal transition (Fig. 2 C-F). The enhanced activation of metabolic pathways in the liver metastasis samples may be associated with the cancer-related metabolic microenvironment.

Copy number variation (CNV) analysis was performed on normal, primary tumor, and liver metastasis samples of colorectal cancer
To explore CRC cell subpopulations associated with tumor cell metastasis, we used CopyKAT to classify all cells, based on CNV features, as aneuploid (tumor), diploid (normal), or not defined. The results showed that aneuploid cells were concentrated within epithelial cells and were more abundant in primary and metastatic CRC than in normal samples (Fig. 3 A-C). Subsequently, we analyzed the subpopulations of tumor-associated epithelial cells and identified five: CCL20 cells, ARGLU1 cells, EMP1 cells, TOP2A cells, and SRGN cells (Fig. 3 E). We also annotated these five epithelial subpopulations with their top 3 markers. CCL20+ epithelial cells were characterized by high expression of CCL20, CXCL2, and CXCL3. ARGLU1+ epithelial cells were marked by ARGLU1, PNN, and ANKRD36. EMP1 cells mainly expressed EMP1, CEACAM5, and ERO1A. TOP2A cells highly expressed TOP2A, CENPF, and ASPM, and SRGN cells were characterized by SRGN, LCP1, and RGS1 (Fig. 3 F).

Identification of key epithelial subpopulations influencing metastasis
To observe the distribution and proportion of the five epithelial subpopulations in different samples, we performed UMAP clustering, which showed that primary and liver metastatic samples contained higher numbers and proportions of CCL20+ and EMP1+ epithelial cells than the normal samples (Fig. 4 A-B).
Epithelial-to-mesenchymal transition (EMT) plays an essential role in tumor progression: the loss of tumor cell polarity and adhesion leads to detachment from the basement membrane, promoting cell migration and metastasis (9). AUCell scoring of the five tumor-associated epithelial subpopulations for the EMT signaling pathway showed that EMP1+ and CCL20+ epithelial cells ranked highest, consistent with the landscape plot results, suggesting that EMP1 cells may play a significant role in colorectal cancer metastasis (Fig. 4 C-D).

Enrichment analysis of tumor cell subtypes
A total of 897 differentially expressed genes (DEGs) were identified among the five cell subtypes, of which 423 were upregulated and 474 were downregulated. Using the FindAllMarkers function, we identified characteristic genes for each of the five cell subtypes and plotted volcano plots to display the top 5 significantly upregulated or downregulated genes for each subtype (Fig. 5 A). Using R software, we performed gene ontology (GO) and KEGG pathway analyses for the DEGs in the tumor subtype cells. In biological processes, DEGs of EMP1 cells were enriched in Salmonella infection, tight junctions, adherens junctions, and focal adhesions (Fig. 5 B). DEGs of EMP1 cells were mainly enriched in KEGG signaling pathways related to calcium-binding proteins, calcium-dependent protein binding, DNA-binding transcription factor binding, and calcium-binding proteins involved in cell-cell adhesion, cell-matrix adhesion, and cell adhesion molecule activity (Fig. 5 C). Previous studies suggest that CRC cells regulate calcium-binding proteins and other junctions between epithelial cells to separate normal and tumor cells from each other, which is also the beginning of the cascade of CRC invasion and metastasis.
Based on the results of GO and KEGG analyses of EMP1 cell DEGs, DEGs are enriched in calcium-binding-related pathways, suggesting that EMP1 tumor subtype cells may play a critical role in CRC distant metastasis. Cell trajectory analysis. Using the Monocle3 algorithm, we constructed a single-cell pseudo-time trajectory of tumor subtype cells, and found that the expression levels of the top 10 genes vary along the pseudo-time trajectory. B2M, TM4SF1, FTH1, and S100A11 show gradually increasing expression levels with differentiation progression from normal cells to cancer cells, while MALAT1, CENPF, and TOP2A show decreasing expression levels during differentiation progression (Fig. 6 A, 6 B). Meanwhile, the top 10 genes of EMP1 cells exhibit an upward trend in expression along the pseudo-time trajectory (Fig. 6 C, 6 D). Finally, we present the pseudo-time analysis heatmap of the top differential genes (Fig. 6 E). Cell communication analysis. We conducted cell-cell communication analysis using the "CellChat" package to study the interactions and communications among tumor subtype cells, aiming to infer important biological interactions among these cells and calculate the probability and significance of these interactions. We visualized the relationships and importance of cell-cell interactions using circular plots and bubble plots. We particularly focused on EMP1 cells, which play a crucial role in the tumor metastasis process. We found that EMP1 cells and CCL20 cells act as primary senders and receivers of signals among the major tumor subtype cells, having connections with the other four cell types, but the signal exchange between them is the strongest (Fig. 7 A-C). PRSS3-PARD3 and PRSS3-F2RL1 dominate in ligand-receptor cell communication (Fig. 7 E). 
PARs are transmembrane G-protein-coupled receptors that, when activated by upstream signaling molecules, regulate cell proliferation, apoptosis, adhesion, and migration, thereby contributing to tumor initiation, invasion, and metastasis. EMP1 cells are involved in PAR signaling pathways, indicating their critical role in CRC metastasis and invasion (Fig. 7 F-G).

Developing a prognostic model for metastasis-related genes based on machine learning
To identify genes associated with CRC metastasis, we employed a machine learning approach to fit 103 prediction models and calculated the average C-index of each model on the validation datasets (Fig. 8 A). StepCox [Backward] + RSF was determined to be the optimal algorithm, with the highest average C-index of 0.616. The metastasis-related genes SPINK1, PLAC8, LAMB3, CEACAM5, and CDA were identified as feature genes. Next, based on this algorithm, we calculated risk scores from the metastasis-related genes and constructed a prognostic model. Patients in the CRC cohort were divided into low-risk and high-risk groups based on the median risk score, and in multiple external cohorts, patients in the high-risk group showed significantly shorter overall survival (OS) than those in the low-risk group (Fig. 8 B-G).

Enrichment and mutation analysis
GSEA was conducted on the differentially enriched gene sets in the high- and low-risk groups, revealing that the signaling pathways enriched in the high risk-score (RS) group mainly included apical junction, apical surface, epithelial-mesenchymal transition, hedgehog signaling, and myogenesis. Meanwhile, the low RS group was primarily enriched in E2F targets, KRAS signaling, MYC targets, oxidative phosphorylation, and pancreas β-cells (Fig. 9 A-B). Additionally, we examined the overall gene mutations in the high- and low-risk groups and found that the overall mutation frequency was higher in the high-risk group than in the low-risk group (100% vs. 94.63%).
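The two quantities used to build and evaluate the risk model, the concordance index (C-index) for ranking candidate models and the median split of risk scores, can be illustrated with a short Python sketch. The text does not specify which C-index variant was used or how patients exactly at the median were assigned, so this sketch assumes Harrell's C-index with tied risk scores counted as half-concordant and assigns scores at or below the median to the low-risk group. All function names are hypothetical.

```python
from statistics import median


def harrell_c_index(times, events, risk_scores):
    """Harrell's concordance index for right-censored survival data.

    A pair (i, j) is comparable when patient i had an observed event and
    a shorter follow-up time than patient j. The pair is concordant when
    the patient with the earlier event also has the higher risk score.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # censored patients cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def split_by_median(risk_scores):
    """Label each patient 'high' or 'low' relative to the cohort median."""
    cutoff = median(risk_scores)
    return ["high" if s > cutoff else "low" for s in risk_scores]


if __name__ == "__main__":
    times = [5, 8, 12, 20]           # follow-up time (arbitrary units)
    events = [1, 1, 0, 1]            # 1 = death observed, 0 = censored
    scores = [0.9, 0.4, 0.7, 0.1]    # model risk scores
    print(harrell_c_index(times, events, scores))
    print(split_by_median(scores))
```

A C-index of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which gives context for the reported average of 0.616.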
We identified the top 15 mutated genes in the high and low RS groups, revealing that the high and low-risk groups had exactly the same mutated genes. The top 5 mutated genes in the high-risk group were APC, TP53, TTN, KRAS, PIK3CA, with mutation rates of 87%, 64%, 47%, 43%, 36%, respectively. The top five mutated genes in the low-risk group were APC, TP53, TTN, KRAS, PIK3CA, with mutation rates of 69%, 49%, 46%, 43%, 30%, respectively (Fig. 9 C-D). Discussion Currently, distant metastasis of CRC remains one of the important causes of death, and due to the spatial heterogeneity of colorectal cancer, there are significant differences in treatment and prognosis between primary CRC lesions and CRC metastatic lesions ( 10 ). To address this challenge, we sought to identify metastasis-specific biomarkers based on the heterogeneity of tumor subtypes in CRC, aiming to improve the prediction of metastasis and prognosis in CRC patients. In this study, we conducted GO and KEGG enrichment analyses on differentially expressed genes DEGs across three independent cohorts. Our analyses revealed significant differences in metabolic pathways among the cohorts. Additionally, tumor cells were clustered and annotated into five distinct subtypes. Previous studies have linked tumor subtypes to distinct stages of epithelial-mesenchymal transition (EMT), with invasive and metastatic potential increasing alongside the progression of EMT ( 11 ). Building on these findings, we identified EMP1 cells as a critical tumor cell subtype involved in CRC metastasis. Further enrichment analyses of DEGs in tumor subgroups revealed that the “calcium adhesion protein”-related signaling pathway was highly active in EMP1 cells. This aligns with existing evidence that calmodulin activation or inhibition plays a vital role in cancer progression and metastasis by modulating N-cadherin upregulation and E-cadherin downregulation, key processes in EMT ( 12 ). 
These results highlight the significant role of EMP1 cells in driving CRC metastasis. To refine the identification of metastatic biomarkers, we applied a machine learning approach, which pinpointed SPINK1, PLAC8, LAMB3, CEACAM5, and CDA as key feature genes. Using these genes, we developed a metastatic risk signature and validated it in an external cohort. Patients were stratified into high-risk and low-risk groups based on their metastatic risk scores. Gene Set Enrichment Analysis (GSEA) demonstrated that high-risk group genes were significantly enriched in metastasis-associated pathways, including Wnt signaling, EMT, and Hedgehog signaling, further corroborating our findings. Liver metastasis of CRC is a highly heterogeneous condition influenced by the unique microenvironment of the liver. This heterogeneity manifests both inter-tumor and intra-tumor, affecting gene expression, tumor microenvironment, and biological behavior. The processes of cancer stem cells, epithelial-mesenchymal transition (EMT), and the tumor microenvironment are recognized as key drivers of distant metastasis in CRC ( 13 ). Recent research has identified cancer stem cells marked by the expression of EMP1 as significant contributors to CRC liver metastasis ( 14 ). Targeted elimination of EMP1 cells has shown promise in preventing postoperative recurrence in mice, indicating the special role of EMP1-marked high recurrence cells in initiating CRC metastasis ( 15 ). Similarly, the special role of the L1 cell adhesion molecule (L1CAM) cells in initiating CRC metastasis also has been highlighted ( 16 ). 
EMT plays a crucial role in cancer progression, equipping epithelial cells with mesenchymal-like characteristics associated with enhanced motility, invasiveness, and metastatic potential (17). Previous studies suggest that the EMP1 gene promotes tumor proliferation and metastasis through EMT and focal adhesions in various cancers, including ovarian cancer, osteosarcoma, and bladder cancer (18–20). Tumor subgroups marked by EMP1 and CCL20 exhibit strong intercellular communication within tumor cell subpopulations. CCL20 can stimulate CRC cells to increase proliferation and migration in vitro, as well as p130 phosphorylation (a protein associated with cell adhesion and migration, and other focal adhesion-related proteins/scaffold proteins) (21). This may indicate a crucial role of CCL20 in CRC and colorectal liver metastasis.

In our study, feature genes identified through the overlapping of three machine learning methods have shown promising predictive value for CRC overall survival (OS). Serine protease inhibitor Kazal type 1 (SPINK1) is a pancreatic trypsin inhibitor. Previous studies have shown that high expression of SPINK1 is associated with poor prognosis in various cancers, including pancreatic cancer, prostate cancer, ovarian cancer, breast cancer, liver cancer, lung cancer, and colorectal cancer (22–27). In the study by Chen YT et al., SPINK1 induced EGFR signaling-mediated epithelial-mesenchymal transition (EMT), which is associated with promoting CRC metastasis. In vitro experiments in the same study demonstrated that upregulation of endogenous SPINK1 can counteract the decreased proliferation, migration, and invasion abilities of CRC tumors induced by EGFR inhibitors (28). Genes related to tumor progression, such as Placenta-specific 8 (PLAC8), have been found in fecal specimens of CRC patients (29).
In mouse in vivo experiments, CRC cells with higher PLAC8 expression grew faster than those with lower PLAC8 expression, indicating that downregulation of PLAC8 in CRC may alter the expression of proliferation genes, thereby slowing the growth and migration of CRC cells (19). LAMB3 encodes the β3 subunit of laminin-332 (Ln-332), and there is evidence that Ln-332 can interact with integrins on the cell surface, thereby participating in the transduction of multiple cancer signaling pathways (30–32). Research by Zhu Z et al. shows that overexpression of LAMB3 promotes cell proliferation and migration, while low expression of LAMB3 has the opposite effect (33). The CEA protein encoded by CEACAM5 is expressed at low levels in healthy human tissues and is widely used clinically as a biomarker for blood or solid tumors of epithelial tissue origin. A high expression level of CEA is associated with poor prognosis and has been considered an independent prognostic factor in many studies (34). CEACAM5 has also been validated as an oncogene that promotes tumor migration, and F-box and WD repeat domain 7 (FBW7) regulates CEACAM5 expression in a HIF1α-dependent manner, thereby inhibiting CRC metastasis and progression (35). The mechanism by which CEACAM5 is involved in CRC metastasis may be achieved through loss of nestin-mediated apoptosis (36). Cytidine deaminase (CDA) is an enzyme in the pyrimidine salvage pathway that catalyzes the deamination of cytidine and deoxycytidine. CDA is frequently overexpressed in many cancers, including pancreatic cancer, gastric cancer, testicular cancer, and vaginal cancer (37). In the study by Heo H et al., knockdown of CDA expression in LR cells reversed EMT, thereby reducing lung cancer metastasis, indicating that overexpression of CDA may also be associated with EMT (38).
In summary, using public transcriptomic databases, we employed machine learning to screen a large number of variables and identified the EMP1 tumor cell subgroup as a key population influencing CRC metastasis. Building on this, we constructed risk features for CRC metastasis, which aid our understanding of CRC progression and provide a new strategy and foundation for developing biomarkers for the early detection and diagnosis of CRC.
Conclusion
We employed an optimal combination of machine learning algorithms to screen for biomarkers associated with CRC metastasis (SPINK1, PLAC8, LAMB3, CEACAM5, and CDA), which may offer new insights into the early diagnosis of CRC metastasis. This study lacks validation of the expression levels of the feature genes in CRC and metastatic lesions through in vivo and in vitro experiments. Furthermore, we have not yet clarified the mechanisms by which the risk features affect CRC metastasis, which will be a direction for future work.
References
1. Xi Y, Xu P. Global colorectal cancer burden in 2020 and projections to 2040. Translational oncology. 2021;14(10):101174.
2. Li Q, Wu H, Cao M, Li H, He S, Yang F, et al. Colorectal cancer burden, trends and risk factors in China: A review and comparison with the United States. Chinese journal of cancer research = Chung-kuo yen cheng yen chiu. 2022;34(5):483-95.
3. Benson AB 3rd, Venook AP, Cederquist L, Chan E, Chen YJ, Cooper HS, et al. Colon Cancer, Version 1.2017, NCCN Clinical Practice Guidelines in Oncology. Journal of the National Comprehensive Cancer Network : JNCCN. 2017;15(3):370-98.
4. Valderrama-Treviño AI, Barrera-Mera B, Ceballos-Villalva JC, Montalvo-Javé EE. Hepatic Metastasis from Colorectal Cancer. Euroasian journal of hepato-gastroenterology. 2017;7(2):166-75.
5. Zhou H, Liu Z, Wang Y, Wen X, Amador EH, Yuan L, et al. Colorectal liver metastasis: molecular mechanism and interventional therapy. Signal transduction and targeted therapy. 2022;7(1):70.
6. Shin AE, Giancotti FG, Rustgi AK. Metastatic colorectal cancer: mechanisms and emerging therapeutics. Trends in pharmacological sciences. 2023;44(4):222-36.
7. Loktionov A. Biomarkers for detecting colorectal cancer non-invasively: DNA, RNA or proteins? World journal of gastrointestinal oncology. 2020;12(2):124-48.
8. Swanson K, Wu E, Zhang A, Alizadeh AA, Zou J. From patterns to patients: Advances in clinical machine learning for cancer diagnosis, prognosis, and treatment. Cell. 2023;186(8):1772-91.
9. Campbell K. Contribution of epithelial-mesenchymal transitions to organogenesis and cancer metastasis. Current opinion in cell biology. 2018;55:30-5.
10. Chen H, Zhai C, Xu X, Wang H, Han W, Shen J. Multilevel Heterogeneity of Colorectal Cancer Liver Metastasis. Cancers. 2023;16(1).
11. Pastushenko I, Brisebarre A, Sifrim A, Fioramonti M, Revenco T, Boumahdi S, et al. Identification of the tumour transition states occurring during EMT. Nature. 2018;556(7702):463-8.
12. Kaszak I, Witkowska-Piłaszewicz O, Niewiadomska Z, Dworecka-Kaszak B, Ngosa Toka F, Jurka P. Role of Cadherins in Cancer-A Review. International journal of molecular sciences. 2020;21(20).
13. Zhao W, Dai S, Yue L, Xu F, Gu J, Dai X, et al. Emerging mechanisms progress of colorectal cancer liver metastasis. Frontiers in endocrinology. 2022;13:1081585.
14. Gvozdenovic A, Aceto N. EMP1-positive cells found guilty of metastatic relapse in colorectal cancer. Developmental cell. 2022;57(24):2673-4.
15. Cañellas-Socias A, Cortina C, Hernando-Momblona X, Palomo-Ponce S, Mulholland EJ, Turon G, et al. Metastatic recurrence in colorectal cancer arises from residual EMP1(+) cells. Nature. 2022;611(7936):603-13.
16. Ganesh K, Basnet H, Kaygusuz Y, Laughney AM, He L, Sharma R, et al. L1CAM defines the regenerative origin of metastasis-initiating cells in colorectal cancer. Nature cancer. 2020;1(1):28-45.
17. Li J, Liu J, Wang H, Ma J, Wang Y, Xu W. Single-cell analyses EMP1 as a marker of the ratio of M1/M2 macrophages is associated with EMT, immune infiltration, and prognosis in bladder cancer. Bladder (San Francisco, Calif). 2023;10:e21200011.
18. Ahmat Amin MKB, Shimizu A, Zankov DP, Sato A, Kurita S, Ito M, et al. Epithelial membrane protein 1 promotes tumor metastasis by enhancing cell migration via copine-III and Rac1. Oncogene. 2018;37(40):5416-34.
19. Huang CC, Shen MH, Chen SK, Yang SH, Liu CY, Guo JW, et al. Gut butyrate-producing organisms correlate to Placenta Specific 8 protein: Importance to colorectal cancer progression. Journal of advanced research. 2020;22:7-20.
20. Wang M, Liu T, Hu X, Yin A, Liu J, Wang X. EMP1 promotes the malignant progression of osteosarcoma through the IRX2/MMP9 axis. Panminerva medica. 2020;62(3):150-4.
21. Brand S, Olszak T, Beigel F, Diebold J, Otte JM, Eichhorst ST, et al. Cell differentiation dependent expressed CCR6 mediates ERK-1/2, SAPK/JNK, and Akt signaling resulting in proliferation and migration of colorectal cancer cells. Journal of cellular biochemistry. 2006;97(4):709-23.
22. Kelloniemi E, Rintala E, Finne P, Stenman UH. Tumor-associated trypsin inhibitor as a prognostic factor during follow-up of bladder cancer. Urology. 2003;62(2):249-53.
23. Mehner C, Oberg AL, Kalli KR, Nassar A, Hockla A, Pendlebury D, et al. Serine protease inhibitor Kazal type 1 (SPINK1) drives proliferation and anoikis resistance in a subset of ovarian cancers. Oncotarget. 2015;6(34):35737-54.
24. Soon WW, Miller LD, Black MA, Dalmasso C, Chan XB, Pang B, et al. Combined genomic and phenotype screening reveals secretory factor SPINK1 as an invasion and survival factor associated with patient prognosis in breast cancer. EMBO molecular medicine. 2011;3(8):451-64.
25. Xu L, Lu C, Huang Y, Zhou J, Wang X, Liu C, et al. SPINK1 promotes cell growth and metastasis of lung adenocarcinoma and acts as a novel prognostic biomarker. BMB reports. 2018;51(12):648-53.
26. Ying HY, Gong CJ, Feng Y, Jing DD, Lu LG. Serine protease inhibitor Kazal type 1 (SPINK1) downregulates E-cadherin and induces EMT of hepatoma cells to promote hepatocellular carcinoma metastasis via the MEK/ERK signaling pathway. Journal of digestive diseases. 2017;18(6):349-58.
27. Zhang X, Yin X, Shen P, Sun G, Yang Y, Liu J, et al. The association between SPINK1 and clinical outcomes in patients with prostate cancer: a systematic review and meta-analysis. OncoTargets and therapy. 2017;10:3123-30.
28. Chen YT, Tseng TT, Tsai HP, Kuo SH, Huang MY, Wang JY, et al. Serine protease inhibitor Kazal type 1 (SPINK1) promotes proliferation, migration, invasion and radiation resistance in rectal cancer patients receiving concurrent chemoradiotherapy: a potential target for precision medicine. Human cell. 2022;35(6):1912-27.
29. Li C, Ma H, Wang Y, Cao Z, Graves-Deal R, Powell AE, et al. Excess PLAC8 promotes an unconventional ERK2-dependent EMT in colon cancer. The Journal of clinical investigation. 2014;124(5):2172-87.
30. Hintermann E, Bilban M, Sharabi A, Quaranta V. Inhibitory role of alpha 6 beta 4-associated erbB-2 and phosphoinositide 3-kinase in keratinocyte haptotactic migration dependent on alpha 3 beta 1 integrin. The Journal of cell biology. 2001;153(3):465-78.
31. Kariya Y, Miyazaki K. The basement membrane protein laminin-5 acts as a soluble cell motility factor. Experimental cell research. 2004;297(2):508-20.
32. Nikolopoulos SN, Blaikie P, Yoshioka T, Guo W, Puri C, Tacchetti C, et al. Targeted deletion of the integrin beta4 signaling domain suppresses laminin-5-dependent nuclear entry of mitogen-activated protein kinases and NF-kappaB, causing defects in epidermal growth and migration. Molecular and cellular biology. 2005;25(14):6090-102.
33. Zhu Z, Song J, Guo Y, Huang Z, Chen X, Dang X, et al. LAMB3 promotes tumour progression through the AKT-FOXO3/4 axis and is transcriptionally regulated by the BRD2/acetylated ELK4 complex in colorectal cancer. Oncogene. 2020;39(24):4666-80.
34. Thirunavukarasu P, Sukumar S, Sathaiah M, Mahan M, Pragatheeshwar KD, Pingpank JF, et al. C-stage in colon cancer: implications of carcinoembryonic antigen biomarker in staging, prognosis, and management. Journal of the National Cancer Institute. 2011;103(8):689-97.
35. Li Q, Li Y, Li J, Ma Y, Dai W, Mo S, et al. FBW7 suppresses metastasis of colorectal cancer by inhibiting HIF1α/CEACAM5 functional axis. International journal of biological sciences. 2018;14(7):726-35.
36. Ordoñez C, Screaton RA, Ilantzis C, Stanners CP. Human carcinoembryonic antigen functions as a general inhibitor of anoikis. Cancer research. 2000;60(13):3419-24.
37. Zauri M, Berridge G, Thézénas ML, Pugh KM, Goldin R, Kessler BM, et al. CDA directs metabolism of epigenetic nucleosides revealing a therapeutic window in cancer. Nature. 2015;524(7563):114-8.
38. Heo H, Kim JH, Lim HJ, Kim JH, Kim M, Koh J, et al. DNA methylome and single-cell transcriptome analyses reveal CDA as a potential druggable target for ALK inhibitor-resistant lung cancer therapy. Experimental & molecular medicine. 2022;54(8):1236-49.
Additional Declarations
The authors declare no competing interests.
Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-6479548","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":444869205,"identity":"c7fb0c2f-8167-45cc-8997-e3e632c6d72f","order_by":0,"name":"Ke Pu","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAAw0lEQVRIiWNgGAWjYFACNsYHCRVQNg+RWpgNPpwhUQub5Mw2UrSYS6QlSPPOu2Ov236A8cHbNgZ5c0JaLGekHTDm3fYscduZBGbDuW0MhjsbCGgxuJHekMy77XCC2Q0GNmneNoYEgwNEaDnMO+ewPVAL+28itaQdbJzZcJhxG9AWZqK0WPY8S2b4cOww0C+JzZJzzkkYbiCkxZw9zfxHQg3QYccPH/zwpsxGnrDDEEzGBiAhQUA9qpZRMApGwSgYBTgAALXRQfu1/J7XAAAAAElFTkSuQmCC","orcid":"https://orcid.org/0000-0002-9627-4887","institution":"Department of Gastroenterology, Affiliated Hospital of North Sichuan Medical College, Sichuan, China","correspondingAuthor":true,"prefix":"","firstName":"Ke","middleName":"","lastName":"Pu","suffix":""}],"badges":[],"createdAt":"2025-04-18 14:03:16","currentVersionCode":1,"declarations":{"humanSubjects":false,"vertebrateSubjects":true,"conflictsOfInterestStatement":false,"humanSubjectEthicalGuidelines":false,"humanSubjectConsent":false,"humanSubjectClinicalTrial":false,"humanSubjectCaseReport":false,"vertebrateSubjectEthicalGuidelines":true},"doi":"10.21203/rs.3.rs-6479548/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-6479548/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":81174579,"identity":"6faeb08e-2468-4805-a598-2df35cbb2721","added_by":"auto","created_at":"2025-04-23 06:06:39","extension":"png","order_by":1,"title":"Figure 
1","display":"","copyAsset":false,"role":"figure","size":5990968,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eUMAP Single-cell clustering analysis of CRC samples. (A) \u003c/strong\u003eUMAP plots of cellular compositions of three different sample types. (\u003cstrong\u003eB\u003c/strong\u003e) UMAP plots of cellular compositions from different sample sources. (\u003cstrong\u003eC\u003c/strong\u003e) Cell types identified in CRC by scRNA-seq. (\u003cstrong\u003eD\u003c/strong\u003e) Expression of marker genes for cell types. (\u003cstrong\u003eE\u003c/strong\u003e) UMAP plots of cellular compositions in CRC normal, primary tumor, and liver metastasis samples. (\u003cstrong\u003eF\u003c/strong\u003e) Proportional representation of cell types in three sample types.\u003c/p\u003e","description":"","filename":"figure1.png","url":"https://assets-eu.researchsquare.com/files/rs-6479548/v1/4ae5bfb1ccfc4cbb831a25c6.png"},{"id":81174587,"identity":"9a2c842e-0549-4822-8451-12825360d79d","added_by":"auto","created_at":"2025-04-23 06:06:39","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":10160397,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eFunctional enrichment analysis of DEGs between normal, primary tumor, and liver metastasis tissues in CRC.\u003c/strong\u003e (\u003cstrong\u003eA\u003c/strong\u003e) Calculation of metabolic pathway activity in different types of patients using scMetabolism. (\u003cstrong\u003eB\u003c/strong\u003e) Calculation of pathway activity in different types of patients using AUcell. 
(\u003cstrong\u003eC\u003c/strong\u003e) Most active pathway in liver metastasis.\u003c/p\u003e","description":"","filename":"figure2.png","url":"https://assets-eu.researchsquare.com/files/rs-6479548/v1/2089438e6ad11fb8406d7f39.png"},{"id":81175697,"identity":"a4f513a4-d5ac-41dc-9b96-85b4b13edf22","added_by":"auto","created_at":"2025-04-23 06:14:39","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":5891524,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eIdentification of tumor cell subpopulations in CRC samples. \u003c/strong\u003e(\u003cstrong\u003eA\u003c/strong\u003e) Utilizing Copycat algorithm to calculate copy numbers to infer whether cells are normal (diploid) or tumor cells (aneuploid), with NA representing low-quality cells filtered out. (\u003cstrong\u003eB\u003c/strong\u003e) UMAP distribution plots showed the diploid, aneuploidy, and not defined CNV distribution of epithelial cells in different sample. (\u003cstrong\u003eC\u003c/strong\u003e) Proportion plots of diploid and aneuploid from different samples. (\u003cstrong\u003eD-E\u003c/strong\u003e) Subsequent dimension reduction clustering of extracted tumor cells into five cell subtypes. (\u003cstrong\u003eF\u003c/strong\u003e) Bubble plots of the top three genes for the five tumor cell subtypes.\u003c/p\u003e","description":"","filename":"figure3.png","url":"https://assets-eu.researchsquare.com/files/rs-6479548/v1/0b90946f442b04e591658495.png"},{"id":81174581,"identity":"35e46203-6610-49d1-b277-13241c2caee7","added_by":"auto","created_at":"2025-04-23 06:06:39","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":5086881,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eIdentification of key epithelial subpopulations influencing metastasis. 
\u003c/strong\u003e(\u003cstrong\u003eA\u003c/strong\u003e)\u003cstrong\u003e \u003c/strong\u003eUMAP plots of the normal, primary tumor, and liver metastasis tissues. (\u003cstrong\u003eB\u003c/strong\u003e) Proportion plots of five epithelial subpopulations in the three samples. (\u003cstrong\u003eC\u003c/strong\u003e) Identification of key epithelial subpopulations influencing metastasis based on AUCell scores. (\u003cstrong\u003eD\u003c/strong\u003e) Density plots of metastasis scores.\u003c/p\u003e","description":"","filename":"figure4.png","url":"https://assets-eu.researchsquare.com/files/rs-6479548/v1/9343e55d2ca50303cb239a35.png"},{"id":81174596,"identity":"94f8bfb7-9df4-4e62-b65b-26f40053c514","added_by":"auto","created_at":"2025-04-23 06:06:40","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":4195833,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eEnrichment analysis of DEGs among tumor subtypes cells. \u003c/strong\u003e(\u003cstrong\u003eA\u003c/strong\u003e) Top higher and lower expressed genes of tumor subtypes cells. (\u003cstrong\u003eB\u003c/strong\u003e) GO analysis of grouped differential express genes (DEGs). (\u003cstrong\u003eC\u003c/strong\u003e) KEGG analysis of grouped DEGs.\u003c/p\u003e","description":"","filename":"figure5.png","url":"https://assets-eu.researchsquare.com/files/rs-6479548/v1/2185100cc451ecf7c6c321fc.png"},{"id":81174591,"identity":"197ab563-e2c4-4eb3-9003-c6999f11d7b9","added_by":"auto","created_at":"2025-04-23 06:06:39","extension":"png","order_by":6,"title":"Figure 6","display":"","copyAsset":false,"role":"figure","size":11969836,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003ePseudo-time analysis of cells. 
\u003c/strong\u003e(\u003cstrong\u003eA\u003c/strong\u003e) Pseudo-time expression trajectory plot of TOP 10 genes during the tumor progression from the normal cells to cancel cells, with darker to lighter colors representing the pseudo-time sequence. (\u003cstrong\u003eB\u003c/strong\u003e) Expression changes of TOP 10 genes across differentiated cell types (EMP cells, CCL20 cells, ARGLU1 cells, TOP2A cells). (\u003cstrong\u003eC\u003c/strong\u003e) Pseudo-time expression trajectory plot of TOP 10 genes in EMP1 cells, with darker to lighter colors representing the pseudo-time sequence. (\u003cstrong\u003eD\u003c/strong\u003e) Pseudo-time expression trajectory plot of TOP 10 genes in EMP1 cells, with darker to lighter colors representing the pseudo-time sequence (EMP cells, CCL20 cells, ARGLU1 cells, TOP2A cells). (\u003cstrong\u003eE\u003c/strong\u003e) Heatmap of top 10 pseudo-time differential genes.\u003c/p\u003e","description":"","filename":"figure6.png","url":"https://assets-eu.researchsquare.com/files/rs-6479548/v1/3185be2f7df2f6532bc8dbf7.png"},{"id":81174580,"identity":"77cb2772-f0c8-445c-be1a-5b87943b58e3","added_by":"auto","created_at":"2025-04-23 06:06:39","extension":"png","order_by":7,"title":"Figure 7","display":"","copyAsset":false,"role":"figure","size":3446306,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eCell communication analysis.\u003c/strong\u003e (\u003cstrong\u003eA\u003c/strong\u003e)Network diagram showing the quantity of cell communication among the 5 tumor subtypes, where node color represents different cell types, node size represents cell quantity, line color between nodes corresponds to cell types, and line thickness represents the number of detected ligand-receptor pairs between different cell types. 
(\u003cstrong\u003eB\u003c/strong\u003e) Network diagram showing the interaction strength among cells, where node color represents different cell types, line color corresponds to cell types, and line thickness represents the interaction strength between different cell types. (\u003cstrong\u003eC\u003c/strong\u003e) Interaction diagram between EMP1 cells and other subtype cells. (\u003cstrong\u003eD\u003c/strong\u003e) Bubble plot showing the interaction of receptors and ligands among different types of cells. (\u003cstrong\u003eE\u003c/strong\u003e) Expression plot of PARs genes. (\u003cstrong\u003eF-G\u003c/strong\u003e) Chord diagram of cell communication.\u003c/p\u003e","description":"","filename":"figure7.png","url":"https://assets-eu.researchsquare.com/files/rs-6479548/v1/6aa08bb068aa3525d72f6ad7.png"},{"id":81174590,"identity":"758013a7-a3cc-4001-8b44-3841b2c60656","added_by":"auto","created_at":"2025-04-23 06:06:39","extension":"png","order_by":8,"title":"Figure 8","display":"","copyAsset":false,"role":"figure","size":17118648,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eA machine learning model for liver metastasis-related genes.\u003c/strong\u003e (\u003cstrong\u003eA\u003c/strong\u003e)We employed the LOOCV framework to predict across 107 models and computed the C-index of 107 machine learning predictive models combinations in six validation cohorts. 
(\u003cstrong\u003eB-G\u003c/strong\u003e) Kaplan-Meier curves of OS for CRC metastasis-related genes in six validation cohorts (GSE17536, GES17538, GSE38832, GSE39582, GSE72970, TCGA).\u003c/p\u003e","description":"","filename":"figure8.png","url":"https://assets-eu.researchsquare.com/files/rs-6479548/v1/1fece124fcdba2102ce04b24.png"},{"id":81174604,"identity":"48f00cab-f893-4f37-9fc3-f1bc34b6d5b6","added_by":"auto","created_at":"2025-04-23 06:06:40","extension":"png","order_by":9,"title":"Figure 9","display":"","copyAsset":false,"role":"figure","size":3736707,"visible":true,"origin":"","legend":"\u003cp\u003eGene Set Enrichment Analysis (GSEA) of CRC transfer-related risk features (\u003cstrong\u003eA-B\u003c/strong\u003e) and mutational analysis of risk groups in tumors (\u003cstrong\u003eC-D\u003c/strong\u003e).\u003c/p\u003e","description":"","filename":"figure9.png","url":"https://assets-eu.researchsquare.com/files/rs-6479548/v1/30bf3b00402d6225a50fcda0.png"},{"id":81176955,"identity":"99227050-9102-402b-80e6-2c17463a8e16","added_by":"auto","created_at":"2025-04-23 06:31:11","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":66562585,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-6479548/v1/5f80495a-6332-4e2a-a60a-6322fb8f7654.pdf"}],"financialInterests":"The authors declare no competing interests.","formattedTitle":"\u003cp\u003e\u003cstrong\u003eSinle-Cell Transcriptomics and Machine Learning Algorithms Unveil Metastasis-Associated Cellular Subtypes and Prognostic Signatures in Colorectal Cancer\u003c/strong\u003e\u003c/p\u003e","fulltext":[{"header":"Introduction","content":"\u003cp\u003eColorectal cancer (CRC) is the third most common cancer worldwide and the second leading cause of cancer-related deaths. 
It is estimated that there were 1,931,590 new cases of CRC in 2020, accounting for 10.01% of all new cancer cases; and 935,173 CRC-related deaths in 2020, accounting for 9.39% of all cancer-related deaths(\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e). Due to improvements in early detection methods and the widespread use of colonoscopy, the annual incidence of CRC in the United States decreased by 46% from 1985 to 2019, showing a stable downward trend. However, the incidence and mortality rates of CRC in China are still increasing, with a large population base, resulting in a heavier burden of CRC in China(\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e). Distant metastasis is the key reason for death in colorectal cancer-related deaths, with a 5-year survival rate of approximately 14%. The liver and lung are the main sites of metastasis. Meanwhile, about 20% of newly diagnosed CRC cases have distant metastases, and approximately 50% of patients will develop liver metastases during subsequent follow-up(\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e), possibly due to blood flow from the gastrointestinal tract through the portal vein to the liver(\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eDespite the continuous development of drug therapy, early diagnosis and surgical resection of CRC are still considered effective key measures. Traditional serum markers for the early detection and diagnosis of CRC metastasis include CEA, CA19-9, and CA125, but they reduce the reliability of diagnosis in terms of sensitivity and specificity. Liver color Doppler ultrasound and abdominal CT are not sensitive enough to diagnose early CRC liver metastasis, while PET-CT and liver biopsy are usually not preferred due to economic and operational reasons(\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e). 
Therefore, further research on the molecular mechanisms of CRC metastasis progression is essential for identifying key genes for CRC metastasis and prognosis. Benefiting from the technological advancements in RNA sequencing and the availability of large amounts of public data from The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus (GEO), we can analyze transcriptomic data to identify key factors in CRC metastasis, including intrinsic cellular factors (genetic abnormalities, tumor cell heterogeneity, epithelial-mesenchymal transition (EMT)) and the tumor microenvironment (TME)(\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e). This analysis can lead to the identification of new biomarkers, including DNA, proteins, and RNA, to help us make more accurate predictions(\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eMachine Learning (ML) has been widely applied in emerging technologies such as pathology and imaging. By combining machine learning algorithms and using transcriptomic data to build models, it effectively predicts tumor outcomes, treatment responses, and risk markers(\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e). In this study, through a series of bioinformatics analyses, we identified key tumor cell subgroups related to metastasis. Patients were divided into high and low-risk groups based on CRC metastasis risk scores, each with different prognostic statuses, functional enrichments, and tumor mutation burdens.\u003c/p\u003e"},{"header":"Methods and Materials","content":"\u003cp\u003e\u003cstrong\u003eData Collection\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eSingle-cell RNA sequencing (scRNA-seq) data were obtained from the Gene Expression Omnibus (GEO) database (accession number: GSE221575). 
This dataset comprises scRNA-seq data from two pairs of colorectal tumor and adjacent normal tissues, as well as one matched liver metastasis tissue sample. Clinical information associated with the samples was also retrieved.e. Transcriptome expression data and clinical information data were downloaded from The Cancer Genome Atlas (TCGA) and GEO (GSE17536, GSE17538, GSE221575, GSE38832, and GSE39582) databases.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eSingle-cell Data Analysis\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eData processing and analysis were performed using the Seurat package (version V4) in R (version 4.3.1). Cells were filtered based on the following criteria: gene count between 300 and 10,000, unique molecular identifier (UMI) count between 1,000 and 100,000, mitochondrial read rate \u0026lt;30%, and hemoglobin read rate \u0026lt;5%. Data normalization was conducted using the LogNormalize method with a scale factor of 10,000. The top 2,000 highly variable genes were identified using the vst method. Batch effect correction was performed using the Harmony algorithm. Principal component analysis (PCA) was conducted, and the optimal number of principal components (PCs) was determined using ElbowPlot and JackStrawPlot. Uniform Manifold Approximation and Projection (UMAP) was employed for dimensionality reduction and visualization. Cell type markers were obtained from the CellMaker 2.0 website (http://117.50.127.228/CellMarker/).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eDifferential Gene Expression Analysis.\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eCells were clustered using the Louvain algorithm with a resolution of 0.3. 
Differential gene expression analysis was performed using the FindAllMarkers function in Seurat, with the following parameters: only positive markers, minimum percentage of cells expressing the gene in either of the two groups ≥25%, and log fold change ≥0.25.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eMetabolic Pathway Analysis\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eMetabolic activities of different cell types were analyzed using the scMetabolism algorithm.\u0026nbsp;The activities of various metabolic pathways were assessed using four distinct methods: AUCell, UCell, singscore, and ssgsea, as implemented in the irGSEA package.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eGene Set Enrichment Analysis.\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eGene set enrichment analysis was performed using the irGSEA package. The Molecular Signatures Database (MSigDB) Hallmark gene sets were used for this analysis. Enrichment scores were calculated for each cell using the aforementioned four methods. Results were integrated and visualized using the irGSEA.integrate function, with cells grouped by patient type (normal, tumor, metastasis).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCopy Number Variation (CNV) Analysis\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eTo distinguish between benign and malignant cells in colorectal cancer (CRC) patients, we employed the CopyKAT algorithm (version 1.1.0). This method infers large-scale chromosomal copy number variations from single-cell RNA sequencing data.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eTumor Cell Subtype Classification\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eCells identified as malignant by CopyKAT were subjected to further analysis. We performed unsupervised clustering using Seurat (version V4) with the following parameters: 2000 highly variable genes, 5 principal components, and a resolution of 0.2 for the FindClusters function. 
This resulted in the identification of five distinct tumor cell subtypes.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eSubtype Characterization\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eEach subtype was characterized based on its marker genes, identified through differential expression analysis (FindAllMarkers function in Seurat, logFC threshold = 0.25, adjusted p-value \u0026lt; 0.05). The subtypes were named according to their most prominent marker genes: CCL20 cells, ARGLU1 cells, EMP1 cells, TOP2A cells, and SRGN cells.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eMetastasis-related Pathway Analysis\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eTo identify key cell subtypes potentially affecting metastasis, we focused on pathways related to epithelial-mesenchymal transition (EMT). Enrichment scores for EMT-related gene sets were calculated for each tumor subtype using the AUCell method from the GSVA package (version 1.46.0).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003ePseudotime Analysis\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe Monocle3 algorithm was used to construct a single-cell pseudotime trajectory for tumor cell subtypes, and the expression changes of the top 10 genes of DEGs and EMP1 cells in the pseudotime trajectory were displayed.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCell Interaction Analysis\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eCellchat was used to determine the differences in cell communication between tumor subtypes, calculate the strength of receptor-ligand interactions, and display the key cell groups involved in metastasis (EPM1 cells).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eEnrichment Analysis of Tumor Cell Subtypes\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe FindAllMarkers function was used to calculate marker genes for each cell cluster compared to all cells in their respective subsets. 
The ClusterProfiler package was used to perform KEGG (Kyoto Encyclopedia of Genes and Genomes) and GO enrichment analysis for selected DEGs.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConstruction of Risk Model Based on Machine Learning and Evaluation of Prognostic Value of Risk Model\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eMachine learning selects feature variables from a large number of input variables for developing and evaluating classification and prediction algorithms. In this study, we evaluated 103 combinations of machine learning methods, selected feature genes based on the best algorithm, and constructed a risk model.Based on the median of risk features, each cohort and external validation cohort were divided into high-risk and low-risk groups, and the prognostic value of the risk model in each cohort was analyzed using the Kaplan-Meier method.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eGSEA and TMB Analysis\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eWe analyzed the somatic mutation spectrum of GEO-CRC patient samples using the R package \"maftools\". The top 15 genes with the most mutations in samples from different risk groups were analyzed. To explore the functional enrichment pathways and signature gene sets related to high and low metastasis risk groups, gene enrichment analysis was performed using the clusterProfiler R package, and signature gene sets were downloaded from the Molecular Signatures Database (MSigDB, http://software.broadinstitute.org/gsea/msigdb/) (h.all.v7.3).\u003c/p\u003e"},{"header":"Results","content":"\u003cp\u003e \u003cb\u003escRNA-seq analysis delineated distinct cellular compositions in normal, primary tumor, and liver metastasis samples of colorectal cancer.\u003c/b\u003e \u003c/p\u003e \u003cp\u003eWe obtained single-cell sequencing transcriptome data (GSE221575) from two paired colorectal tumor and adjacent normal tissues, as well as one matched liver metastatic tissue. 
After quality control and batch effect removal, single-cell clustering analysis demonstrated the total cells such as epithelial cells, endothelial cells, fibroblasts, and immune cells distributed in the normal, primary tumor and liver metastasis samples (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003eA-C). Besides, we also displayed the four cell types according to their molecular biomarkers (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003eD-G). In terms of the cell proportions, primary non-metastatic tumor tissues had the higher proportions of epithelial cells and fibroblasts, and lower proportion of immune cells than those in the normal and metastasis tumor. However, it seemed as if no distinctly cell proportion difference between the metastatic tumor and the controls except for the endothelial cells (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003eH-I).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003cb\u003escRNA-seq analysis delineated distinct signaling pathways in normal, primary tumor, and liver metastasis samples of colorectal cancer.\u003c/b\u003e \u003c/p\u003e \u003cp\u003eUsing the above methods, we identified differential genes (n\u0026thinsp;=\u0026thinsp;1222) in the normal, tumor, and metastasis cohorts. Based on these 1222 differential genes, we performed enrichment analysis of metabolic pathways activity in the three groups (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eA). 
We found significant heterogeneity in metabolic pathways among the three groups. For instance, in the normal cohort, tyrosine metabolism and tryptophan metabolism were significantly stronger than in the tumor and metastasis cohorts, whereas in the tumor cohort, ketone body metabolism, purine metabolism, pyrimidine metabolism, and propanoate metabolism were significantly stronger than in the normal and metastasis cohorts. Similar inter-cohort heterogeneity was also observed for the metastasis cohort.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eSubsequently, we conducted AUCell scoring based on the signaling pathways enriched by differential genes. As shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eB, compared with the normal samples, primary and metastatic CRC showed activation of pathways associated with cell proliferation and migration, such as the Wnt/β-catenin, TGF-β, PI3K/AKT/mTOR, KRAS, and EMT (epithelial-mesenchymal transition) signaling pathways, as well as cell metabolic pathways including glycolysis and cholesterol homeostasis. Metastatic CRC showed reduced activity of some pathways, such as TP53, Notch, myogenesis, adipogenesis, hypoxia, and apoptosis, and increased activity of others, including interferon-α response, allograft rejection, IL2-STAT5, heme metabolism, fatty acid metabolism, and interferon-β response. We semi-quantitatively presented the landscapes of the highly activated metastasis-related signaling pathways, including interferon-α response, heme metabolism, allograft rejection, and epithelial-to-mesenchymal transition (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eC-F). 
The enhanced activation of metabolic pathways in these liver metastasis CRC cohorts may be associated with the cancer-related metabolic microenvironment.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003cb\u003eCopy number variation (CNV) analysis was performed on normal, primary tumor, and liver metastasis samples of colorectal cancer\u003c/b\u003e \u003c/p\u003e \u003cp\u003eTo explore CRC cell subpopulations associated with tumor cell metastasis, we used CopyKAT to classify all cells, based on CNV features, into aneuploid (tumor), diploid (normal), and not defined cell types. The results showed that primary and metastatic CRC contained more aneuploid cells than normal samples, and that these cells aggregated within epithelial cells (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eA-C). Subsequently, we analyzed the subpopulations of tumor-associated epithelial cells and identified five: CCL20 cells, ARGLU1 cells, EMP1 cells, TOP2A cells, and SRGN cells (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eE). We also annotated these five epithelial subpopulations with their top 3 markers. CCL20\u0026thinsp;+\u0026thinsp;epithelial cells were characterized by high expression levels of CCL20, CXCL2, and CXCL3. ARGLU1\u0026thinsp;+\u0026thinsp;epithelial cells were marked by ARGLU1, PNN, and ANKRD36. EMP1 cells mainly expressed the markers EMP1, CEACAM5, and ERO1A. 
TOP2A cells highly expressed TOP2A, CENPF, and ASPM, and SRGN cells were characterized by the markers SRGN, LCP1, and RGS1 (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eF).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003cb\u003eIdentification of key epithelial subpopulations influencing metastasis.\u003c/b\u003e \u003c/p\u003e \u003cp\u003eTo examine the distribution and proportion of the five epithelial subpopulations across samples, we performed UMAP clustering, which showed that primary and liver metastatic samples had higher numbers and proportions of CCL20\u0026thinsp;+\u0026thinsp;epithelial cells and EMP1\u0026thinsp;+\u0026thinsp;epithelial cells than the normal sample (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eA-B).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eEpithelial-to-mesenchymal transition (EMT) plays an essential role in tumor progression: the loss of tumor cell polarity and adhesion leads to detachment from the basement membrane, promoting cell migration and metastasis (\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e). AUCell scoring of the EMT signaling pathway across the five tumor-associated epithelial subpopulations showed that EMP1\u0026thinsp;+\u0026thinsp;epithelial cells and CCL20\u0026thinsp;+\u0026thinsp;epithelial cells had the highest scores, consistent with the landscape plot results, suggesting that EMP1 cells may play a significant role in colorectal cancer metastasis (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eC-D).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003cb\u003eEnrichment analysis of tumor cell subtypes.\u003c/b\u003e \u003c/p\u003e \u003cp\u003eA total of 897 differentially expressed genes (DEGs) were identified among the five cell subtypes, of which 423 were upregulated and 474 were downregulated. 
Using the FindAllMarkers function, we identified characteristic genes for each of the five cell subtypes and plotted volcano plots to display the top 5 significantly upregulated or downregulated genes for each subtype (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003eA). Using R software, we performed gene ontology (GO) and KEGG pathway analyses for the DEGs in the tumor subtype cells. In biological processes, DEGs of EMP1 cells were enriched in Salmonella infection, tight junctions, adherens junctions, and focal adhesions (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003eB). DEGs of EMP1 cells were mainly enriched in KEGG signaling pathways related to calcium-binding proteins, calcium-dependent protein binding, DNA-binding transcription factor binding, and calcium-binding proteins involved in cell-cell adhesion, cell-matrix adhesion, and cell adhesion molecule activity (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003eC). Previous studies suggest that CRC cells regulate calcium-binding proteins and other junctions between epithelial cells to separate normal and tumor cells from each other, which is also the beginning of the cascade reaction of CRC invasion and metastasis. The GO and KEGG analyses show that EMP1 cell DEGs are enriched in calcium-binding-related pathways, suggesting that EMP1 tumor subtype cells may play a critical role in CRC distant metastasis.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003cb\u003eCell trajectory analysis.\u003c/b\u003e \u003c/p\u003e \u003cp\u003eUsing the Monocle3 algorithm, we constructed a single-cell pseudo-time trajectory of tumor subtype cells, and found that the expression levels of the top 10 genes vary along the pseudo-time trajectory. 
B2M, TM4SF1, FTH1, and S100A11 show gradually increasing expression levels with differentiation progression from normal cells to cancer cells, while MALAT1, CENPF, and TOP2A show decreasing expression levels during differentiation progression (Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003eA, \u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003eB). Meanwhile, the top 10 genes of EMP1 cells exhibit an upward trend in expression along the pseudo-time trajectory (Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003eC, \u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003eD). Finally, we present the pseudo-time analysis heatmap of the top differential genes (Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003eE).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003cb\u003eCell communication analysis.\u003c/b\u003e \u003c/p\u003e \u003cp\u003eWe conducted cell-cell communication analysis using the \"CellChat\" package to study the interactions among tumor subtype cells, aiming to infer important biological interactions among these cells and calculate the probability and significance of these interactions. We visualized the relationships and importance of cell-cell interactions using circular plots and bubble plots. We particularly focused on EMP1 cells, which the analyses above implicate in the tumor metastasis process. We found that EMP1 cells and CCL20 cells act as the primary senders and receivers of signals among the major tumor subtype cells: both connect with the other cell types, and the signal exchange between the two is the strongest (Fig.\u0026nbsp;\u003cspan refid=\"Fig7\" class=\"InternalRef\"\u003e7\u003c/span\u003eA-C). 
PRSS3-PARD3 and PRSS3-F2RL1 dominate the ligand-receptor communication (Fig.\u0026nbsp;\u003cspan refid=\"Fig7\" class=\"InternalRef\"\u003e7\u003c/span\u003eE). Protease-activated receptors (PARs) are transmembrane G-protein-coupled receptors that, when activated by upstream signaling molecules, regulate cell proliferation, apoptosis, adhesion, and migration, thereby contributing to tumor initiation, invasion, and metastasis. EMP1 cells are involved in PAR signaling pathways, suggesting that they play a critical role in CRC metastasis and invasion (Fig.\u0026nbsp;\u003cspan refid=\"Fig7\" class=\"InternalRef\"\u003e7\u003c/span\u003eF-G).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003cb\u003eDeveloping a prognostic model for metastasis-related genes based on machine learning.\u003c/b\u003e \u003c/p\u003e \u003cp\u003eTo identify genes associated with CRC metastasis, we employed a machine learning approach to fit 103 prediction models and calculated the average C-index of each model on the validation dataset (Fig.\u0026nbsp;\u003cspan refid=\"Fig8\" class=\"InternalRef\"\u003e8\u003c/span\u003eA). StepCox [Backward]\u0026thinsp;+\u0026thinsp;RSF was determined as the optimal algorithm, with the highest average C-index of 0.616. Metastasis-related genes including SPINK1, PLAC8, LAMB3, CEACAM5, and CDA were identified as feature genes. Next, based on this algorithm, we calculated risk scores from the metastasis-related genes and constructed a prognostic model. 
Patients in the CRC cohort were divided into low-risk and high-risk groups based on the median risk score, and in multiple external cohorts, patients in the high-risk group showed significantly shorter overall survival (OS) than those in the low-risk group (Fig.\u0026nbsp;\u003cspan refid=\"Fig8\" class=\"InternalRef\"\u003e8\u003c/span\u003eB-G).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003cb\u003eEnrichment and mutation analysis.\u003c/b\u003e \u003c/p\u003e \u003cp\u003eGSEA was conducted on the differentially enriched gene sets in the high- and low-risk groups, revealing that the signaling pathways enriched in the high risk score (RS) group mainly included apical junction, apical surface, epithelial-mesenchymal transition, hedgehog signaling, and myogenesis. Meanwhile, the pathways enriched in the low RS group were primarily E2F targets, KRAS signaling, MYC targets, oxidative phosphorylation, and pancreas β-cells (Fig.\u0026nbsp;\u003cspan refid=\"Fig9\" class=\"InternalRef\"\u003e9\u003c/span\u003eA-B). Additionally, we examined the overall gene mutations in the high- and low-risk groups and found that the overall mutation frequency was higher in the high-risk group than in the low-risk group (100% vs. 94.63%). We identified the top 15 mutated genes in the high and low RS groups and found that the two groups had exactly the same mutated genes. The top 5 mutated genes in the high-risk group were APC, TP53, TTN, KRAS, and PIK3CA, with mutation rates of 87%, 64%, 47%, 43%, and 36%, respectively. 
The top five mutated genes in the low-risk group were APC, TP53, TTN, KRAS, and PIK3CA, with mutation rates of 69%, 49%, 46%, 43%, and 30%, respectively (Fig.\u0026nbsp;\u003cspan refid=\"Fig9\" class=\"InternalRef\"\u003e9\u003c/span\u003eC-D).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e"},{"header":"Discussion","content":"\u003cp\u003eDistant metastasis of CRC remains an important cause of death, and due to the spatial heterogeneity of colorectal cancer, there are significant differences in treatment and prognosis between primary CRC lesions and CRC metastatic lesions (\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e). To address this challenge, we sought to identify metastasis-specific biomarkers based on the heterogeneity of tumor subtypes in CRC, aiming to improve the prediction of metastasis and prognosis in CRC patients. In this study, we conducted GO and KEGG enrichment analyses on differentially expressed genes (DEGs) across three independent cohorts. Our analyses revealed significant differences in metabolic pathways among the cohorts. Additionally, tumor cells were clustered and annotated into five distinct subtypes.\u003c/p\u003e \u003cp\u003ePrevious studies have linked tumor subtypes to distinct stages of epithelial-mesenchymal transition (EMT), with invasive and metastatic potential increasing alongside the progression of EMT (\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e). Building on these findings, we identified EMP1 cells as a critical tumor cell subtype involved in CRC metastasis. Further enrichment analyses of DEGs in tumor subgroups revealed that the \u0026ldquo;calcium adhesion protein\u0026rdquo;-related signaling pathway was highly active in EMP1 cells. 
This aligns with existing evidence that calmodulin activation or inhibition plays a vital role in cancer progression and metastasis by modulating N-cadherin upregulation and E-cadherin downregulation, key processes in EMT (\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e). These results highlight the significant role of EMP1 cells in driving CRC metastasis. To refine the identification of metastatic biomarkers, we applied a machine learning approach, which pinpointed SPINK1, PLAC8, LAMB3, CEACAM5, and CDA as key feature genes. Using these genes, we developed a metastatic risk signature and validated it in an external cohort. Patients were stratified into high-risk and low-risk groups based on their metastatic risk scores. Gene Set Enrichment Analysis (GSEA) demonstrated that high-risk group genes were significantly enriched in metastasis-associated pathways, including Wnt signaling, EMT, and Hedgehog signaling, further corroborating our findings.\u003c/p\u003e \u003cp\u003eLiver metastasis of CRC is a highly heterogeneous condition influenced by the unique microenvironment of the liver. This heterogeneity manifests both inter-tumor and intra-tumor, affecting gene expression, tumor microenvironment, and biological behavior. The processes of cancer stem cells, epithelial-mesenchymal transition (EMT), and the tumor microenvironment are recognized as key drivers of distant metastasis in CRC (\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e). Recent research has identified cancer stem cells marked by the expression of EMP1 as significant contributors to CRC liver metastasis (\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e). 
Targeted elimination of EMP1 cells has shown promise in preventing postoperative recurrence in mice, indicating the special role of EMP1-marked high recurrence cells in initiating CRC metastasis (\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e). Similarly, the special role of L1 cell adhesion molecule (L1CAM) cells in initiating CRC metastasis has also been highlighted (\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e). EMT plays a crucial role in cancer progression, equipping epithelial cells with mesenchymal-like characteristics associated with enhanced motility, invasiveness, and metastatic potential (\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e). Previous studies have implicated the EMP1 gene in promoting tumor proliferation and metastasis through EMT and focal adhesions in various cancers, including ovarian cancer, osteosarcoma, and bladder cancer (\u003cspan additionalcitationids=\"CR19\" citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e). Tumor subgroups marked by EMP1 and CCL20 exhibit strong intercellular communication within tumor cell subpopulations. CCL20 can stimulate CRC cells to increase proliferation and migration in vitro, as well as p130 phosphorylation (a protein associated with cell adhesion and migration, and other focal adhesion-related proteins/scaffold proteins) (\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e). This may indicate a crucial role of CCL20 in CRC and colorectal liver metastasis.\u003c/p\u003e \u003cp\u003eIn our study, feature genes identified through the overlapping of three machine learning methods have shown promising predictive value for CRC overall survival (OS). Serine protease inhibitor Kazal type 1 (SPINK1) is a pancreatic trypsin inhibitor. 
Previous studies have shown that high expression of SPINK1 is associated with poor prognosis in various cancers, including pancreatic cancer, prostate cancer, ovarian cancer, breast cancer, liver cancer, lung cancer, and colorectal cancer (\u003cspan additionalcitationids=\"CR23 CR24 CR25 CR26\" citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e). In the study by Chen YT et al., SPINK1 induced epithelial-mesenchymal transition (EMT) mediated by EGFR signaling, which is associated with promoting CRC metastasis. In addition, in vitro experiments demonstrated that upregulation of endogenous SPINK1 can counteract the decreased proliferation, migration, and invasion abilities of CRC tumors induced by EGFR inhibitors (\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e). Genes related to tumor progression, such as Placenta-specific 8 (PLAC8), have been found in fecal specimens of CRC patients (\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e). In mouse in vivo experiments, CRC cells with higher levels of PLAC8 expression grow faster than those with lower levels of PLAC8 expression, indicating that downregulation of PLAC8 in CRC may alter the expression of proliferation genes, thereby slowing down the growth and migration of CRC cells (\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e). LAMB3 is a gene encoding the laminin-332 (Ln-332) β3 subunit, and there is evidence that Ln-332 can interact with integrins on the cell surface, thereby participating in the transduction of multiple cancer signaling pathways (\u003cspan additionalcitationids=\"CR31\" citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e). Research by Zhu Z et al. 
shows that overexpression of LAMB3 promotes cell proliferation and migration, while low expression of LAMB3 has the opposite effect (\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e33\u003c/span\u003e). The CEA protein encoded by CEACAM5 is expressed at low levels in healthy human tissues and is widely used clinically as a biomarker for blood or solid tumors of epithelial tissue origin. A high expression level of CEA has been associated with poor prognosis and was considered an independent prognostic factor in many studies (\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e). CEACAM5 has also been validated as an oncogene that promotes tumor migration, and F-box and WD repeat domain 7 (FBW7) regulates CEACAM5 expression in a HIF1α-dependent manner, thereby inhibiting CRC metastasis and progression (\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e). The mechanism by which CEACAM5 is involved in CRC metastasis may be the loss of anoikis (\u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e36\u003c/span\u003e). Cytidine deaminase (CDA) is an enzyme in the pyrimidine salvage pathway that catalyzes the deamination of cytidine and deoxycytidine. CDA is frequently overexpressed in many cancers, including pancreatic cancer, gastric cancer, testicular cancer, and vaginal cancer (\u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e37\u003c/span\u003e). In the study by Heo H et al., knockdown of CDA expression in LR cells reversed EMT, thereby reducing lung cancer metastasis, indicating that overexpression of CDA may also be associated with EMT (\u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e38\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eIn summary, based on public transcriptomic databases, we employed machine learning to screen a large number of variables, confirming the crucial EMP1 tumor cell subgroup influencing CRC metastasis. 
Building upon this, we constructed risk features for CRC metastasis, aiding in our understanding of CRC progression and providing a new strategy and foundation for the development of biomarkers for early detection and diagnosis of CRC.\u003c/p\u003e"},{"header":"Conclusion","content":"\u003cp\u003eWe employed an optimal combination of machine learning algorithms to screen for biomarkers associated with CRC metastasis (SPINK1, PLAC8, LAMB3, CEACAM5, CDA), which may offer new insights into early diagnosis of CRC metastasis. This study lacks validation of feature gene expression levels in CRC and metastatic lesions through in vivo and in vitro experiments. Furthermore, we have not further clarified the mechanisms by which risk features impact CRC metastasis, which will be a direction for future efforts.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eXi Y, Xu P. Global colorectal cancer burden in 2020 and projections to 2040. Translational oncology. 2021;14(10):101174.\u003c/li\u003e\n\u003cli\u003eLi Q, Wu H, Cao M, Li H, He S, Yang F, et al. Colorectal cancer burden, trends and risk factors in China: A review and comparison with the United States. Chinese journal of cancer research = Chung-kuo yen cheng yen chiu. 2022;34(5):483-95.\u003c/li\u003e\n\u003cli\u003eBenson AB, 3rd, Venook AP, Cederquist L, Chan E, Chen YJ, Cooper HS, et al. Colon Cancer, Version 1.2017, NCCN Clinical Practice Guidelines in Oncology. Journal of the National Comprehensive Cancer Network : JNCCN. 2017;15(3):370-98.\u003c/li\u003e\n\u003cli\u003eValderrama-Trevi\u0026ntilde;o AI, Barrera-Mera B, Ceballos-Villalva JC, Montalvo-Jav\u0026eacute; EE. Hepatic Metastasis from Colorectal Cancer. Euroasian journal of hepato-gastroenterology. 2017;7(2):166-75.\u003c/li\u003e\n\u003cli\u003eZhou H, Liu Z, Wang Y, Wen X, Amador EH, Yuan L, et al. Colorectal liver metastasis: molecular mechanism and interventional therapy. Signal transduction and targeted therapy. 
2022;7(1):70.\u003c/li\u003e\n\u003cli\u003eShin AE, Giancotti FG, Rustgi AK. Metastatic colorectal cancer: mechanisms and emerging therapeutics. Trends in pharmacological sciences. 2023;44(4):222-36.\u003c/li\u003e\n\u003cli\u003eLoktionov A. Biomarkers for detecting colorectal cancer non-invasively: DNA, RNA or proteins? World journal of gastrointestinal oncology. 2020;12(2):124-48.\u003c/li\u003e\n\u003cli\u003eSwanson K, Wu E, Zhang A, Alizadeh AA, Zou J. From patterns to patients: Advances in clinical machine learning for cancer diagnosis, prognosis, and treatment. Cell. 2023;186(8):1772-91.\u003c/li\u003e\n\u003cli\u003eCampbell K. Contribution of epithelial-mesenchymal transitions to organogenesis and cancer metastasis. Current opinion in cell biology. 2018;55:30-5.\u003c/li\u003e\n\u003cli\u003eChen H, Zhai C, Xu X, Wang H, Han W, Shen J. Multilevel Heterogeneity of Colorectal Cancer Liver Metastasis. Cancers. 2023;16(1).\u003c/li\u003e\n\u003cli\u003ePastushenko I, Brisebarre A, Sifrim A, Fioramonti M, Revenco T, Boumahdi S, et al. Identification of the tumour transition states occurring during EMT. Nature. 2018;556(7702):463-8.\u003c/li\u003e\n\u003cli\u003eKaszak I, Witkowska-Piłaszewicz O, Niewiadomska Z, Dworecka-Kaszak B, Ngosa Toka F, Jurka P. Role of Cadherins in Cancer-A Review. International journal of molecular sciences. 2020;21(20).\u003c/li\u003e\n\u003cli\u003eZhao W, Dai S, Yue L, Xu F, Gu J, Dai X, et al. Emerging mechanisms progress of colorectal cancer liver metastasis. Frontiers in endocrinology. 2022;13:1081585.\u003c/li\u003e\n\u003cli\u003eGvozdenovic A, Aceto N. EMP1-positive cells found guilty of metastatic relapse in colorectal cancer. Developmental cell. 2022;57(24):2673-4.\u003c/li\u003e\n\u003cli\u003eCa\u0026ntilde;ellas-Socias A, Cortina C, Hernando-Momblona X, Palomo-Ponce S, Mulholland EJ, Turon G, et al. Metastatic recurrence in colorectal cancer arises from residual EMP1(+) cells. Nature. 
2022;611(7936):603-13.\u003c/li\u003e\n\u003cli\u003eGanesh K, Basnet H, Kaygusuz Y, Laughney AM, He L, Sharma R, et al. L1CAM defines the regenerative origin of metastasis-initiating cells in colorectal cancer. Nature cancer. 2020;1(1):28-45.\u003c/li\u003e\n\u003cli\u003eLi J, Liu J, Wang H, Ma J, Wang Y, Xu W. Single-cell analyses EMP1 as a marker of the ratio of M1/M2 macrophages is associated with EMT, immune infiltration, and prognosis in bladder cancer. Bladder (San Francisco, Calif). 2023;10:e21200011.\u003c/li\u003e\n\u003cli\u003eAhmat Amin MKB, Shimizu A, Zankov DP, Sato A, Kurita S, Ito M, et al. Epithelial membrane protein 1 promotes tumor metastasis by enhancing cell migration via copine-III and Rac1. Oncogene. 2018;37(40):5416-34.\u003c/li\u003e\n\u003cli\u003eHuang CC, Shen MH, Chen SK, Yang SH, Liu CY, Guo JW, et al. Gut butyrate-producing organisms correlate to Placenta Specific 8 protein: Importance to colorectal cancer progression. Journal of advanced research. 2020;22:7-20.\u003c/li\u003e\n\u003cli\u003eWang M, Liu T, Hu X, Yin A, Liu J, Wang X. EMP1 promotes the malignant progression of osteosarcoma through the IRX2/MMP9 axis. Panminerva medica. 2020;62(3):150-4.\u003c/li\u003e\n\u003cli\u003eBrand S, Olszak T, Beigel F, Diebold J, Otte JM, Eichhorst ST, et al. Cell differentiation dependent expressed CCR6 mediates ERK-1/2, SAPK/JNK, and Akt signaling resulting in proliferation and migration of colorectal cancer cells. Journal of cellular biochemistry. 2006;97(4):709-23.\u003c/li\u003e\n\u003cli\u003eKelloniemi E, Rintala E, Finne P, Stenman UH. Tumor-associated trypsin inhibitor as a prognostic factor during follow-up of bladder cancer. Urology. 2003;62(2):249-53.\u003c/li\u003e\n\u003cli\u003eMehner C, Oberg AL, Kalli KR, Nassar A, Hockla A, Pendlebury D, et al. Serine protease inhibitor Kazal type 1 (SPINK1) drives proliferation and anoikis resistance in a subset of ovarian cancers. Oncotarget. 
2015;6(34):35737-54.\u003c/li\u003e\n\u003cli\u003eSoon WW, Miller LD, Black MA, Dalmasso C, Chan XB, Pang B, et al. Combined genomic and phenotype screening reveals secretory factor SPINK1 as an invasion and survival factor associated with patient prognosis in breast cancer. EMBO molecular medicine. 2011;3(8):451-64.\u003c/li\u003e\n\u003cli\u003eXu L, Lu C, Huang Y, Zhou J, Wang X, Liu C, et al. SPINK1 promotes cell growth and metastasis of lung adenocarcinoma and acts as a novel prognostic biomarker. BMB reports. 2018;51(12):648-53.\u003c/li\u003e\n\u003cli\u003eYing HY, Gong CJ, Feng Y, Jing DD, Lu LG. Serine protease inhibitor Kazal type 1 (SPINK1) downregulates E-cadherin and induces EMT of hepatoma cells to promote hepatocellular carcinoma metastasis via the MEK/ERK signaling pathway. Journal of digestive diseases. 2017;18(6):349-58.\u003c/li\u003e\n\u003cli\u003eZhang X, Yin X, Shen P, Sun G, Yang Y, Liu J, et al. The association between SPINK1 and clinical outcomes in patients with prostate cancer: a systematic review and meta-analysis. OncoTargets and therapy. 2017;10:3123-30.\u003c/li\u003e\n\u003cli\u003eChen YT, Tseng TT, Tsai HP, Kuo SH, Huang MY, Wang JY, et al. Serine protease inhibitor Kazal type 1 (SPINK1) promotes proliferation, migration, invasion and radiation resistance in rectal cancer patients receiving concurrent chemoradiotherapy: a potential target for precision medicine. Human cell. 2022;35(6):1912-27.\u003c/li\u003e\n\u003cli\u003eLi C, Ma H, Wang Y, Cao Z, Graves-Deal R, Powell AE, et al. Excess PLAC8 promotes an unconventional ERK2-dependent EMT in colon cancer. The Journal of clinical investigation. 2014;124(5):2172-87.\u003c/li\u003e\n\u003cli\u003eHintermann E, Bilban M, Sharabi A, Quaranta V. Inhibitory role of alpha 6 beta 4-associated erbB-2 and phosphoinositide 3-kinase in keratinocyte haptotactic migration dependent on alpha 3 beta 1 integrin. The Journal of cell biology. 
2001;153(3):465-78.\u003c/li\u003e\n\u003cli\u003eKariya Y, Miyazaki K. The basement membrane protein laminin-5 acts as a soluble cell motility factor. Experimental cell research. 2004;297(2):508-20.\u003c/li\u003e\n\u003cli\u003eNikolopoulos SN, Blaikie P, Yoshioka T, Guo W, Puri C, Tacchetti C, et al. Targeted deletion of the integrin beta4 signaling domain suppresses laminin-5-dependent nuclear entry of mitogen-activated protein kinases and NF-kappaB, causing defects in epidermal growth and migration. Molecular and cellular biology. 2005;25(14):6090-102.\u003c/li\u003e\n\u003cli\u003eZhu Z, Song J, Guo Y, Huang Z, Chen X, Dang X, et al. LAMB3 promotes tumour progression through the AKT-FOXO3/4 axis and is transcriptionally regulated by the BRD2/acetylated ELK4 complex in colorectal cancer. Oncogene. 2020;39(24):4666-80.\u003c/li\u003e\n\u003cli\u003eThirunavukarasu P, Sukumar S, Sathaiah M, Mahan M, Pragatheeshwar KD, Pingpank JF, et al. C-stage in colon cancer: implications of carcinoembryonic antigen biomarker in staging, prognosis, and management. Journal of the National Cancer Institute. 2011;103(8):689-97.\u003c/li\u003e\n\u003cli\u003eLi Q, Li Y, Li J, Ma Y, Dai W, Mo S, et al. FBW7 suppresses metastasis of colorectal cancer by inhibiting HIF1\u0026alpha;/CEACAM5 functional axis. International journal of biological sciences. 2018;14(7):726-35.\u003c/li\u003e\n\u003cli\u003eOrdo\u0026ntilde;ez C, Screaton RA, Ilantzis C, Stanners CP. Human carcinoembryonic antigen functions as a general inhibitor of anoikis. Cancer research. 2000;60(13):3419-24.\u003c/li\u003e\n\u003cli\u003eZauri M, Berridge G, Th\u0026eacute;z\u0026eacute;nas ML, Pugh KM, Goldin R, Kessler BM, et al. CDA directs metabolism of epigenetic nucleosides revealing a therapeutic window in cancer. Nature. 2015;524(7563):114-8.\u003c/li\u003e\n\u003cli\u003eHeo H, Kim JH, Lim HJ, Kim JH, Kim M, Koh J, et al. 
DNA methylome and single-cell transcriptome analyses reveal CDA as a potential druggable target for ALK inhibitor-resistant lung cancer therapy. Experimental \u0026amp; molecular medicine. 2022;54(8):1236-49.\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":true,"highlight":"","institution":"Affiliated Hospital of North Sichuan Medical College","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"Colorectal cancer (CRC), Single-cell RNA sequencing, Machine Learning, Metastasis, Prognostic biomarkers","lastPublishedDoi":"10.21203/rs.3.rs-6479548/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-6479548/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003ch2\u003eBackground\u003c/h2\u003e \u003cp\u003eColorectal cancer (CRC) is a prevalent digestive tract malignancy, with liver metastasis occurring in up to 50% of cases. Identifying reliable early metastasis markers is crucial for improving CRC prognosis.\u003c/p\u003e\u003ch2\u003eMethods\u003c/h2\u003e \u003cp\u003eIn this study, we analyzed single-cell RNA sequencing data from CRC patients, including primary tumors, adjacent normal tissues, and liver metastases. 
Copy number variation (CNV) analysis using the CopyKAT algorithm distinguished tumor from non-tumor cells. We identified key tumor subtypes influencing metastasis through differential gene expression and pathway analyses. Leveraging 103 machine learning algorithms, we developed a metastasis-associated risk model based on identified biomarkers. The model was validated across multiple external datasets.\u003c/p\u003e\u003ch2\u003eResults\u003c/h2\u003e \u003cp\u003eWe delineated five tumor cell subtypes, with EMP1+ cells emerging as a key subtype in CRC metastasis. The machine learning approach identified a five-gene signature (SPINK1, PLAC8, LAMB3, CEACAM5, CDA) for metastasis risk prediction. The risk model significantly stratified patients into high- and low-risk groups across six independent cohorts, with high-risk scores correlating with poorer survival. Gene set enrichment analysis revealed enrichment of epithelial-mesenchymal transition (EMT) pathways in the high-risk group. 
Mutation analysis showed higher overall mutation frequencies in the high-risk group, particularly in genes like APC, TP53, and KRAS.\u003c/p\u003e\u003ch2\u003eConclusion\u003c/h2\u003e \u003cp\u003eOur single-cell transcriptomics and machine learning approach uncovered novel cellular subtypes and a gene signature associated with CRC metastasis, providing new insights for early diagnosis and potential therapeutic targets.\u003c/p\u003e","manuscriptTitle":"Sinle-Cell Transcriptomics and Machine Learning Algorithms Unveil Metastasis-Associated Cellular Subtypes and Prognostic Signatures in Colorectal Cancer","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-04-23 06:06:34","doi":"10.21203/rs.3.rs-6479548/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"981bf0fe-24d9-420d-bcd5-04680ed41820","owner":[],"postedDate":"April 23rd, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[{"id":47367095,"name":"Oncology"},{"id":47367096,"name":"Bioinformatics"},{"id":47367097,"name":"Computational Biology"}],"tags":[],"updatedAt":"2025-04-23T06:06:34+00:00","versionOfRecord":[],"versionCreatedAt":"2025-04-23 
06:06:34","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-6479548","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-6479548","identity":"rs-6479548","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}

Source provenance

europepmc
last seen: 2026-05-19T01:45:01.086888+00:00