AtML: Accurate identification of Arabidopsis thaliana root cell identity via single-cell sequencing analysis using interpretable machine learning

preprint OA: closed
Research Article (Research Square)

Shicong Yu, Xiangzheng Fu, Xiaoshu Deng, Hao Wang, Shen Yan, Shuqin Zheng, and 2 more

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-4349116/v1
This work is licensed under a CC BY 4.0 License.
Status: Posted (Version 1)

Abstract

Single-cell transcriptomics technologies are essential for understanding the developmental trajectory of plant roots. Although methods for interpreting single-cell transcriptomics data in Arabidopsis are advancing rapidly, precise annotation of cell identity and the occasional lack of canonical markers for certain cell types remain major challenges in plant single-cell RNA sequencing (scRNA-seq) analysis.
In this work, we trained a machine learning system, AtML, on sequencing data from six cell subpopulations (6000 cells in total) to predict Arabidopsis root cell stages and to mine biomarkers through model interpretability. Performance testing on a held-out test dataset showed that AtML achieved 96.50% accuracy and 96.51% recall. Using the interpretability provided by AtML, we identified 160 important marker genes that contribute to the understanding of cell type annotations. In conclusion, AtML efficiently identifies Arabidopsis root cell stages and provides insights into cellular heterogeneity in studies of Arabidopsis root development.

Keywords: Machine learning, Marker genes, scRNA-seq, Arabidopsis root tips, Cell subpopulations

Introduction

Roots are underground organs that developed through long-term adaptation to terrestrial life, and they play crucial roles in various physiological functions. Primarily, they absorb water and inorganic salts, which are subsequently transported to the stem and leaves[1]. In addition to nutrient uptake, roots serve as anchors: their strong branching capability stabilizes plants firmly in the soil[2]. Moreover, roots play a significant role in assimilating diverse inorganic salts into organic substances via complex biochemical reactions[3,4]. They are also vital sites for the synthesis of plant hormones, including cytokinins, auxins, abscisic acid, gibberellins, and ethylene, and thereby exert considerable influence on overall plant development[5]. Furthermore, certain types of roots can expand and function as storage organs, facilitating both storage and reproduction. In recent years, remarkable advances have been made in molecular biology research on root systems.
A significant portion of this research has focused on the model plant Arabidopsis thaliana. The primary root of Arabidopsis comprises distinct layers, including the epidermis, cortex, and vascular cylinder. As the primary root undergoes growth and development[6–8], it generates lateral roots, thereby continually expanding the complexity and extent of the root system[4,9]. The intricate development of the Arabidopsis root system is governed by multifaceted molecular processes, which calls for investigation at a finer level, particularly within specific cell subpopulations.

Cells are the smallest functional units in plants, and physiological processes are typically carried out collaboratively by multiple cells. However, gene expression varies substantially among cells, so more precise transcriptome technologies have been applied[10,11]. Lopez-Anido et al. constructed models of cell state differentiation in Arabidopsis leaf tissue and analysed the potential heterogeneity of epidermal stomatal lineage cells[12]. Liu et al. used a single-cell transcriptome atlas of early-stage Arabidopsis seedlings to identify new marker genes for leaf vein cells[13]. Kim et al. determined the roles of various tissues in Arabidopsis true leaves from the perspective of metabolic pathways, providing knowledge about the leaf vascular system and the relationships among leaf cell types[14]. For the underground part of Arabidopsis, previous research constructed a high-resolution gene expression map of the developmental process from stem cells to enucleated sieve elements in the root protophloem across seven developmental stages[15]. With the availability of scRNA-seq data, cell type identification has become an essential step for a multitude of downstream analyses[16–22].
In certain scenarios, the absence of reliable markers for essential cell populations makes it difficult to define cell types accurately. Using an efficient machine learning prediction model to extract molecular markers and discern cell subpopulations from existing single-cell datasets has proven to be a time-efficient and resource-saving strategy[23–27]. To address the limitations mentioned above, we propose an ensemble computing framework named AtML, which captures cell subpopulation biomarkers of Arabidopsis root tips and predicts cell stages (Fig. 1). The AtML model combines the maximal information coefficient (MIC) and XGBoost to assess the importance of genes in predicting Arabidopsis root tip cell subpopulations. Furthermore, we applied AtML to data not included in the training dataset and demonstrated its strong predictive performance. By conducting a biological analysis of the optimal genes, we identified potential lineage-specific genes, which could help biologists better understand the heterogeneity of Arabidopsis root tips. Our work uses a machine learning approach to aid the development of markers for single-cell sequencing in Arabidopsis, thereby providing new insights and more accurate markers for cell type identification.

Results

Identification of Significant Genes Using Feature Selection and Machine Learning

To identify essential genes related to the subpopulations of Arabidopsis root tip cells, three feature selection methods (MIC, CV2, and F score) were used to evaluate the significance of the 24,280 genes, and the genes were ranked by their contribution values. Genes with an importance score less than or equal to zero were excluded. Next, machine learning models were combined with incremental feature selection (IFS) to determine the optimal gene subsets (Fig. 2A-C).
Single-cell gene expression matrices (normalized raw read counts) were used as input features to train four machine learning models (KNN, XGBoost, SVM, and RFC), which were evaluated by fivefold cross-validation. On the training dataset, MIC combined with XGBoost (MIC_XGBoost) reached an accuracy of 96.88% with only the top 160 genes, within 0.01 percentage points of the highest accuracy in Table 1 (XGBoost with CV2, 96.89%, which required 1100 genes). XGBoost and SVM also performed well when combined with MIC or the F score (accuracy above 95%), whereas KNN and RFC were less accurate. Using the 160 optimal genes on the test data, MIC_XGBoost also showed the best performance, with an accuracy, precision, recall, and F1-measure of 96.50%, 96.49%, 96.51%, and 96.49%, respectively (Table 2). Because CV2, MIC, and the F score might be biased towards the same genes, we compared the top 300 genes ranked by each of the three feature selection methods. Few genes were shared among the MIC, F score and CV2 lists, which supports the complementarity of the three methods (Fig. 2D).

Table 1 Performance evaluation of different feature selections combined with machine learning schemes (training dataset)

Method    Feature selection   No. of features   Accuracy
KNN       F score             510               93.13%
XGBoost   F score             310               96.18%
SVM       F score             2200              95.47%
RFC       F score             1400              91.27%
KNN       CV2                 1400              89.54%
XGBoost   CV2                 1100              96.89%
SVM       CV2                 2200              96.43%
RFC       CV2                 4800              90.95%
KNN       MIC                 310               94.87%
XGBoost   MIC                 160               96.88%
SVM       MIC                 1700              96.68%
RFC       MIC                 1100              92.65%

Table 2 Performance comparison between AtML and the other algorithms (test dataset)

Method    Feature selection   No. of features   Accuracy   Precision   Recall   F1-measure
KNN       F score             510               92.03%     92.34%      92.06%   91.99%
XGBoost   F score             310               96.31%     96.45%      96.41%   96.42%
SVM       F score             2200              95.54%     95.52%      95.51%   95.48%
RFC       F score             1400              91.58%     92.09%      91.44%   91.36%
KNN       CV2                 1400              89.91%     90.68%      89.88%   89.97%
XGBoost   CV2                 1100              96.31%     96.38%      95.42%   94.40%
SVM       CV2                 2200              95.41%     95.39%      95.40%   95.38%
RFC       CV2                 4800              92.15%     92.02%      91.03%   90.91%
KNN       MIC                 310               94.16%     94.36%      94.20%   94.17%
XGBoost   MIC                 160               96.50%     96.49%      96.51%   96.49%
SVM       MIC                 1700              96.41%     96.39%      94.42%   96.40%
RFC       MIC                 1100              92.25%     92.67%      92.15%   92.13%

Investigating AtML model interpretability

To demonstrate the consistently high performance of the AtML model across the cell subpopulations, we used the receiver operating characteristic (ROC) curve and the confusion matrix. The low misclassification rate in both analyses demonstrated the power of AtML in predicting the six Arabidopsis root cell subpopulations (Fig. 3A and B). To validate the performance of the proposed model, we conducted cluster analyses on two datasets, one containing all 24,280 genes and one containing the 160 selected genes. Uniform manifold approximation and projection (UMAP) and correlation analysis of the 6000 single cells indicated that the overall performance of the 160 marker genes was better than that of all the genes (Fig. 3C and D). These results illustrate the ability of the proposed AtML model to extract potential marker genes, contributing to the understanding of cell type annotations. We also performed an association analysis between the marker genes and the six Arabidopsis root cell subpopulations, which confirmed the high accuracy and specificity of the marker genes (Fig. 3E). Similarly, we used SHAP to analyse the mean impact of each gene on the magnitude of the model output and selected the 20 genes with the highest SHAP values for display.
Existing molecular biology studies support several of these 20 genes. AT1G65570, named root cap polygalacturonase (RCPG), was identified as a glycosyl hydrolase involved in root cap removal[28]. AT3G15680 and AT5G25490, named JUL1 and JUL2, respectively, negatively regulate phloem development, and their expression in the vascular bundles of roots is restricted to the elongation and maturation zones[29]. In earlier studies, At5g09530, a gene containing the PELPK motif that is most abundant in the mature vasculature of roots, was found to be closely associated with normal root development[30]. Additionally, AT3G51280, named altered phosphate starvation response 1 (APSR1), is expressed in primary root tips and is involved in root meristem maintenance[31]. Furthermore, CTL2 (chitinase-like protein 2; AT3G16920) regulates cellulose assembly to maintain normal root development. These SHAP results are consistent with the strong performance of our model (Fig. 3F).

Expression analysis of the AtML gene set

Analysis of the expression of the 160 genes in the six cell subpopulations showed higher expression in trichoblast cells, whereas expression was similar in the remaining five subpopulations (Fig. 4A). We then used UMAP to visualize marker genes that have been reported for the relevant cell subpopulations (Fig. 4B). For example, AT4G22214 and AT2G01540 are specifically expressed in trichoblasts, and AT3G08030 and AT1G62510 are highly specific to the atrichoblast and cortex, respectively. Moreover, we identified several new potential marker genes with specific expression in individual subpopulations of Arabidopsis root cells. For example, AT1G30750, AT3G11550 and AT5G20820 were specifically expressed in the endodermis.
These highly ranked genes can be used as biomarkers to identify subpopulations of Arabidopsis thaliana root cells and provide support for further biological findings (Fig. 4C). Using multiple genes to characterize cell subpopulations in the Arabidopsis root tip could make their labelling more accurate. We used Scanpy to display the 160 genes screened by AtML, highlighting the five genes with the highest expression in each cell subpopulation. Together, these genes can serve as candidate marker genes for more accurate labelling of different cell subsets (Fig. 4D).

To further assess AtML, we compared the expression levels of the top 20 genes in each cell subpopulation with those in the remaining five clusters. These genes showed significantly greater expression in their own cell subpopulation than in the other five. Notably, AT1G62510, AT5G62210, and AT3G59930 showed distinctive expression in the cortex, while AT1G30750, AT3G11550, AT4G11190, and AT5G20820 were specific to the endodermis. Furthermore, AT4G12545 and AT2G16005 were specifically expressed in Atrichoblast; AT3G16440, AT1G50060, and AT4G38410 in Lateral_Root_Cap; and AT2G01540 and AT4G02270 in Trichoblast (Fig. 5A, supplementary Fig. S1). This observation indicates that the AtML model identified candidate genes capable of distinguishing different cell subpopulations and is well suited to processing scRNA-seq data (Fig. 5A and B and supplementary Fig. S1).

We then used the single-cell expression profiles of both the entire gene set and the curated subset of 160 genes as input for partition-based graph abstraction (PAGA) to delineate the biological landscape.
The graphs derived from this analysis showed a consistent topological structure with robust connections, particularly between atrichoblasts and trichoblasts. This observation suggests that AtML can screen for essential molecular markers effectively while eliminating redundant information from the dataset (Fig. 5C and D).

Discussion

scRNA-seq technology was first applied to study the developmental dynamics of plant root tip cells[32]. Through meticulous screening, researchers isolated gene markers that classify the cell populations of the rice root system into 21 categories[33]. Single-cell transcriptome analyses of peanut leaves and cotton anthers have also demonstrated the power of this technology to address cellular heterogeneity[34,35]. In this study, we aimed to address a longstanding challenge in the analysis of scRNA-seq data from Arabidopsis root cells, namely the scarcity of marker genes for cell types combined with high variability and poor reproducibility of manual assignments across research groups and experiments[28,36,37]. We introduced AtML, an expression atlas-based ensemble learning framework, to overcome these limitations and provide a robust solution for the identification of significant genes. Using XGBoost combined with the MIC method, AtML was designed to capture gene expression patterns and molecular events within Arabidopsis root cells. Notably, this study integrates single-cell Arabidopsis data with artificial intelligence methods, which highlights the pioneering nature of AtML in advancing the field. Moreover, the model identified a robust set of genes as potential cell-type markers for subpopulations of Arabidopsis root tip cells.
These candidate marker genes play intricate regulatory roles during the construction of Arabidopsis crown roots; for example, RCPG (AT1G65570) is a key regulator that precisely controls root cap maturation and cell detachment in root tips[38]. Visualization of the expression patterns of the optimal genes showed that these genes retain essential biological patterns, indicating promising applications in annotating Arabidopsis scRNA-seq datasets. However, it is essential to acknowledge the study's limitations, particularly the small sample size and the absence of external datasets for model validation. Collaborative efforts in data collection may prove instrumental in enhancing AtML's robustness and generalizability. Despite these limitations, our work serves as a valuable resource for exploring the physiological functions of Arabidopsis root cell types at both the molecular and single-cell levels. This study also provides insights into the unique molecular events influencing the development of resistant cells in Arabidopsis. We expect AtML to provide insights into the genetic basis of root cell fate determination in Arabidopsis.

Materials and Methods

Dataset construction and preprocessing

Single-cell transcriptome data from the root tips of Arabidopsis plants, comprising 6000 cells, were obtained from the National Center for Biotechnology Information Gene Expression Omnibus (accession GSE152766). The dataset covers the specific Arabidopsis root cell subpopulations of interest and is publicly accessible, which ensures transparency and verifiability. It also has a moderate sample size and high-quality sequencing data, so it was selected for analysis. The scRNA-seq data were aligned to an Arabidopsis genome BSgenome object (“BSgenome.Athaliana.TAIR.TAIR9”) with the TAIR10 gene annotation file and counted using the Cell Ranger pipeline, resulting in 25,261 genes.
The dataset included six cell subpopulations: Endodermis (1000), Lateral Root Cap (1000), Atrichoblast (1000), Trichoblast (1000), Cortex (1000) and Procambium (1000). The dataset was split into a training dataset and a test dataset at a 7:3 ratio. More dataset details are provided in the Supporting Information (S1 Table). The Python packages NumPy (version 1.21.6), pandas (version 1.3.5) and Scanpy (version 1.9.1) were used to read and process the data.

Biological analysis and visualization

We carried out an in-depth analysis to assess the effectiveness of the 160 marker genes in distinguishing cell subpopulations. To identify the specific cell subpopulations associated with these marker genes, we used the clustering analysis of Scanpy (version 1.9.1) with all settings at their default values. We also used pandas (version 1.4.4) to conduct Pearson's correlation analysis on the six Arabidopsis root cell populations, focusing on the 160 marker genes. For visualization and cell trajectory analysis, we used UMAP through the umap-learn Python package (version 0.3.9) and PAGA via Scanpy, respectively, with default settings for both methods. These analyses were run on both the complete feature dataset and the reduced dataset comprising only the 160 selected genes.

MIC

The core idea of the maximal information coefficient (MIC) is that if a relationship exists between two variables, a grid can be drawn on their scatterplot that partitions the data so as to capture this relationship. The mutual information is computed for each candidate grid, and these values are then normalized to allow a fair comparison across grids of different dimensions; the MIC is the highest normalized value, which keeps the assessment of relationships consistent regardless of their scale or complexity[39–41].
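As a rough illustration of the grid idea described above, the following sketch estimates the mutual information of two variables on a grid and normalizes it by the logarithm of the smaller grid dimension, then keeps the largest normalized value. It is a simplified approximation and not the authors' implementation: the function names are hypothetical, the bins are fixed equal-frequency bins rather than the optimized partitions searched by the MINE algorithm, and the grid budget exponent `alpha=0.6` is assumed here as a commonly used default.

```python
import numpy as np


def mutual_information(x_bins, y_bins, nx, ny):
    """Mutual information I(X;Y) in nats for two integer-coded (binned) variables."""
    joint = np.zeros((nx, ny))
    np.add.at(joint, (x_bins, y_bins), 1.0)      # count points per grid cell
    joint /= joint.sum()                         # joint probabilities p(x, y)
    px = joint.sum(axis=1, keepdims=True)        # marginal p(x), shape (nx, 1)
    py = joint.sum(axis=0, keepdims=True)        # marginal p(y), shape (1, ny)
    nz = joint > 0                               # skip empty cells (0 * log 0 = 0)
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz])))


def equal_frequency_bins(values, k):
    """Assign each value to one of k rank-based bins of roughly equal size."""
    ranks = np.argsort(np.argsort(values, kind="stable"), kind="stable")
    return (ranks * k) // len(values)


def approx_mic(x, y, alpha=0.6):
    """Largest normalized grid mutual information over grids with nx * ny <= n**alpha.

    Assumption: equal-frequency bins stand in for the optimized grid search,
    so this value approximates the MIC and does not reproduce it exactly.
    """
    budget = len(x) ** alpha
    best = 0.0
    nx = 2
    while nx * 2 <= budget:
        ny = 2
        while nx * ny <= budget:
            mi = mutual_information(
                equal_frequency_bins(x, nx), equal_frequency_bins(y, ny), nx, ny
            )
            best = max(best, mi / np.log(min(nx, ny)))  # normalize to [0, 1]
            ny += 1
        nx += 1
    return best
```

In a gene-ranking setting, `x` would be the expression of one gene across cells and `y` an encoding of the cell subpopulation; a strictly monotone relationship scores close to 1 and independent variables score close to 0.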
$$I\left(X;Y\right)=\sum_{x,y}p\left(x,y\right)\log\frac{p\left(x,y\right)}{p\left(x\right)p\left(y\right)}=H\left(X\right)-H\left(X\mid Y\right) \qquad (2)$$

where \(p(x,y)\) is the joint probability of \(X\) and \(Y\), \(p(x)\) and \(p(y)\) are the corresponding marginal probabilities, \(H(X)\) is the entropy of \(X\), and \(H(X\mid Y)\) is the conditional entropy of \(X\) given \(Y\).

Model construction of AtML

The gene expression profiles of Arabidopsis root cell subpopulations served as the input features for training the machine learning model. During exploratory data analysis, identifying the relationships between features and their weights was essential for eliminating less relevant or weaker information. The MIC, coefficient of variation squared (CV2)[42], and F score[43] were used to evaluate and rank the significance of each gene, and genes with weights of zero or less were excluded. Incremental feature selection (IFS)[44] was used to train the base models systematically, including the KNN[45], XGBoost[46–48], SVM[49–52], and random forest classifier (RFC)[53–55] models, enabling a thorough comparison of their predictive capabilities. Using the optimal gene set identified in this way, XGBoost was chosen to develop the AtML model.
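The ranking, incremental feature selection and evaluation steps can be sketched as follows. This is a minimal illustration under stated assumptions rather than the AtML code: the data are synthetic, scikit-learn's `mutual_info_classif` stands in for the MIC ranking, a KNN classifier stands in for XGBoost to keep the example light, the helper name `incremental_feature_selection` is hypothetical, and precision, recall and F1 are macro-averaged over classes, which the text does not specify.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier


def incremental_feature_selection(X, y, ranking, make_model, step=10, cv=5):
    """Cross-validate the model on the top-k ranked features for growing k.

    Returns the best k and a dict mapping each tested k to its mean CV accuracy.
    """
    sizes = list(range(step, len(ranking) + 1, step)) or [len(ranking)]
    scores = {}
    for k in sizes:
        subset = X[:, ranking[:k]]
        scores[k] = cross_val_score(make_model(), subset, y, cv=cv, scoring="accuracy").mean()
    best_k = max(scores, key=scores.get)  # smallest k among ties
    return best_k, scores


def make_model():
    # Stand-in classifier (assumption); the paper's final model is XGBoost.
    return KNeighborsClassifier(n_neighbors=5)


# Synthetic stand-in for a cells x genes matrix with three "subpopulations".
X, y = make_classification(
    n_samples=600, n_features=60, n_informative=8, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, class_sep=2.0, random_state=0,
)
# 7:3 split into training and test data, as described in the Methods.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Rank features by weight (highest first) and drop weights of zero or less.
weights = mutual_info_classif(X_train, y_train, random_state=0)
ranking = np.argsort(weights)[::-1]
ranking = ranking[weights[ranking] > 0]

best_k, scores = incremental_feature_selection(X_train, y_train, ranking, make_model, step=5)

# Refit on the selected features and evaluate on the held-out test data.
selected = ranking[:best_k]
model = make_model().fit(X_train[:, selected], y_train)
y_pred = model.predict(X_test[:, selected])
acc = accuracy_score(y_test, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="macro", zero_division=0
)
print(f"best_k={best_k} acc={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
```

Selecting the subset size by cross-validation on the training split only, and touching the test split once at the end, keeps the reported test metrics from being biased by the feature selection step.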
Model evaluation

Four classic metrics were used to quantify the performance of the model predictions, namely the accuracy (Acc), recall (Re), precision (Pre), and F1 measure (F1), defined as[56–63]:

$$\text{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN} \qquad (3)$$

$$\text{Recall}=\frac{TP}{TP+FN} \qquad (4)$$

$$\text{Precision}=\frac{TP}{TP+FP} \qquad (5)$$

$$\text{F1 measure}=\frac{2\times\left(\text{Precision}\times\text{Recall}\right)}{\text{Precision}+\text{Recall}} \qquad (6)$$

where \(TP\), \(TN\), \(FP\) and \(FN\) represent the numbers of true positives, true negatives, false positives and false negatives, respectively. In addition, the ROC curve was used to evaluate the performance of AtML[64,65].

Abbreviations

scRNA-seq: Single-cell RNA sequencing
MIC: Maximal information coefficient
CV2: Coefficient of variation squared
IFS: Incremental feature selection
KNN: K-nearest neighbors
XGBoost: Extreme gradient boosting
SVM: Support vector machine
RFC: Random forest classifier
MIC_XGBoost: MIC combined with XGBoost
ROC: Receiver operating characteristic
UMAP: Uniform manifold approximation and projection
SHAP: SHapley Additive exPlanations

Declarations

Competing interests
The authors declare no competing financial or non-financial interests.

Funding
This work was supported by the National Key R&D Program of China (2022YFF0711802).

Ethics approval and consent to participate
Not applicable.

Consent for publication
All participants agreed to have their data published in an open access online publication.

Availability of data and materials
Single-cell transcriptome data from the root tips of Arabidopsis plants comprising 6000 cells were collected from the National Center for Biotechnology Information (GSE152766).
The dataset covers the specific Arabidopsis root cell subpopulations of interest and is publicly accessible, which ensures transparency and verifiability.

Authors' contributions
XD, XF and SY directed the research design. SY collected the data. YS, HW and XF designed the models and tested the model performance. XF and HW completed the main coding work of the web server. SY, SY, ZS, and HW analyzed the results. HW and SY drafted the manuscript, and JN and RL commented on and revised drafts.

Acknowledgements
The authors extend their sincere gratitude to the reviewers for their constructive suggestions for this article. This work was supported by the National Key R&D Program of China (2022YFF0711802).

References

Miyashima S, Nakajima K. The root endodermis: a hub of developmental signals and nutrient flow. Plant Signal Behav. 2011;6:1954–8.
Andersen TG, Barberon M, Geldner N. Suberization—the second life of an endodermal cell. Curr Opin Plant Biol. 2015;28:9–15.
Silva NDG, D, Murmu J, Chabot D, Hubbard K, Ryser P, Molina I, et al. Root suberin plays important roles in reducing water loss and sodium uptake in Arabidopsis thaliana. Metabolites. 2021;11:735.
Barberon M. The endodermis as a checkpoint for nutrients. New Phytol. 2017.
Sun J, Niu QW, Tarkowski PZB, Tarkowska D, Sandberg G, et al. The Arabidopsis AtIPT8/PGA22 gene encodes an isopentenyl transferase that is involved in de novo cytokinin biosynthesis. Plant Physiol. 2003;131:167–76.
Koizumi K, Hayashi T, Gallagher K. SCARECROW reinforces SHORT-ROOT signaling and inhibits periclinal cell divisions in the ground tissue by maintaining SHR at high levels in the endodermis. Plant Signal Behav. 2012;7:1573–7.
Helariutta Y, Fukaki H, Wysocka-Diller J, Nakajima K, Jung J, Sena G, et al. The SHORT-ROOT gene controls radial patterning of the Arabidopsis root through radial signaling. Cell. 2000;101:555–67.
Dolan L, Janmaat K, Willemsen V, Linstead P, Poethig S, Roberts K, et al. Cellular organisation of the Arabidopsis thaliana root. Development.
1993;119:71–84.
Menand B, Yi K, Jouannic S, Hoffmann L, Ryan E, Linstead P, et al. An ancient mechanism controls the development of cells with a rooting function in land plants. Science. 2007;316:1477–80.
Navin N, Hicks J. Future medical applications of single-cell sequencing in cancer. Genome Med. 2011;3:1–12.
Navin N, Hicks J. Future medical applications of single-cell sequencing in cancer. Genome Med. 2011;3:31.
Lopez-Anido CB, Vatén A, Smoot NK, Sharma NGV, Gong Y, et al. Single-cell resolution of lineage trajectories in the Arabidopsis stomatal lineage and developing leaf. Dev Cell. 2021;56:1043–55.e1044.
Liu Z, Wang J, Zhou Y, Zhang Y, Yu X, et al. Identification of novel regulators required for early development of vein pattern in the cotyledons by single-cell RNA‐sequencing. Plant J. 2022;110:7–22.
Kim J-Y, Symeonidi E, Pang TY, Denyer T, Weidauer D, Bezrutczyk M, et al. Distinct identities of leaf phloem cells revealed by single cell transcriptomics. Plant Cell. 2021;33:511–30.
Roszak P, Heo J-, Toyokura K, Sugiyama Y, de Luis Balaguer MA, et al. Cell-by-cell dissection of phloem development links a maturation gradient to cell specialization. Science. 2021;374:eaba5531.
Kiselev VY, Andrews TS. Correction: Challenges in unsupervised clustering of single-cell RNA-seq data. Nat Rev Genet. 2019;20:310.
Zou G, Lin Y, Han T. L. DEMOC: a deep embedded multi-omics learning approach for clustering single-cell CITE-seq data. Brief Bioinform. 2022;23:bbac347.
Zhang Z, Cui F, Cao C, Wang Q, Zou Q. Single-cell RNA analysis reveals the potential risk of organ-specific cell types vulnerable to SARS-CoV-2 infections. Comput Biol Med. 2022;140:105092.
Xu J, Xu J, Meng Y, Lu CCL, Zeng X, et al. Graph embedding and Gaussian mixture variational autoencoder network for end-to-end analysis of single-cell RNA sequencing data. Cell Rep Methods. 2023;3.
Zhao M, He W, Zou Q, Guo F. A hybrid deep learning framework for gene regulatory network inference from single-cell transcriptomic data. Brief Bioinform. 2022;23:bbab568.
Zhao M, He W, Zou Q, Guo F. A comprehensive overview and critical evaluation of gene regulatory network inference technologies. Brief Bioinform. 2021;22:bbab009.
Dai C, Jiang Y, Yin C, Su R, Zeng X, Zou Q, et al. scIMC: a platform for benchmarking comparison and visualization analysis of scRNA-seq data imputation methods. Nucleic Acids Res. 2022;50:4877–99.
Jin S, Zeng X, Xia F, Huang W, Liu X. Application of deep learning methods in biological networks. Brief Bioinform. 2021;22:1902–17.
Wang J, Chen Y, Zou Q. Inferring gene regulatory network from single-cell transcriptomes with graph autoencoder model. PLoS Genet. 2023;19:e1010942.
Liu Y, Shen X, Gong Y, Liu Y, Song B, Zeng X. Sequence Alignment/Map format: a comprehensive review of approaches and applications. Brief Bioinform. 2023;24:bbad320.
Chen L, Yu L, Gao L. Potent antibiotic design via guided search from antibacterial activity evaluations. Bioinformatics. 2023;39:btad059.
Wang R, Jiang Y, Jin J, Yin CYH, Wang F, et al. DeepBIO: an automated and interpretable deep-learning platform for high-throughput biological sequence prediction, functional annotation and visualization analysis. Nucleic Acids Res. 2023;51:3017–29.
Kamiya M, Higashio S-Y, Isomoto A, Kim J-M, Seki M, Miyashima S, et al. Control of root cap maturation and cell detachment by BEARSKIN transcription factors in Arabidopsis. Development. 2016;143:4063–72.
Cho H, Cho HS, Nam H, Jo H, Park C, et al. Translational control of phloem development by RNA G-quadruplex–JULGI determines plant sink strength. Nat Plants. 2018;4:376–90.
Rashid A. M K. PELPK1 (At5g09530) contains a unique pentapeptide repeat and is a positive regulator of germination in Arabidopsis thaliana. Plant Cell Rep. 2011;30:1735–45.
González-Mendoza V, Zurita-Silva A, Sánchez-Calderón L, Sánchez-Sandoval ME, Oropeza-Aburto A, Gutiérrez-Alanís D, et al. APSR1, a novel gene required for meristem maintenance, is negatively regulated by low phosphate availability. Plant Sci. 2013;205:2–12.
Shahan R, Hsu C-W, Nolan TM, Cole BJ, Taylor IW, Greenstreet L, et al. A single-cell Arabidopsis root atlas reveals developmental trajectories in wild-type and cell identity mutants. Dev Cell. 2022;57:543–60.e549.
Zhang T-Q, Liu Y, Lin W-H. J-W. Single-cell transcriptome atlas and chromatin accessibility landscape reveal differentiation trajectories in the rice root. Nat Commun. 2021;12:2053.
Qin Y, Sun M, Li W, Xu MSL, Liu Y, et al. Single-cell RNA‐seq reveals fate determination control of an individual fibre cell initiation in cotton (Gossypium hirsutum). Plant Biotechnol J. 2022;20:2372–88.
Liu H, Hu D, Du PWL, Liang X, Li H, et al. Single-cell RNA‐seq describes the transcriptome landscape and identifies critical transcription factors in the leaf blade of the allotetraploid peanut (Arachis hypogaea L.). Plant Biotechnol J. 2021;19:2261–76.
Li G, Xu A, Sim S, Priest JR, Tian X, Khan T, et al. Transcriptomic profiling maps anatomically patterned subpopulations among single embryonic cardiac cells. Dev Cell. 2016;39:491–507.
Galdos FX, Xu S, Goodyer WR, Duan L, Huang YV, Lee S, et al. devCellPy is a machine learning-enabled pipeline for automated annotation of complex multilayered single-cell transcriptomic data. Nat Commun. 2022;13:5271.
Reshef DN, Reshef YA, Finucane HK, Grossman SR, McVean PJ, et al. Detecting novel associations in large data sets. Science. 2011;334:1518–24.
Reshef DN, Reshef YA, Finucane HK, Grossman SR, Mcvean G, Turnbaugh PJ, et al. Detecting novel associations in large data sets. Science. 334.
Albanese D, Filosi M, Visintainer RRS, Jurman G, Furlanello C. Minerva and minepy: a C engine for the MINE suite and its R, Python and MATLAB wrappers. Bioinformatics. 2012;29:407–8.
Zhou X, Wang X, Dougherty E, Russ D, Suh E. Gene clustering based on clusterwide mutual information. J Comput Biol. 2004;11:147–61.
Liang P, Zheng L, Long C, Yang W, Yang L, Zuo Y. HelPredictor models single-cell transcriptome to predict human embryo lineage allocation. Brief Bioinform. 2021;22.
Liang P, Yang W, Chen X, Long CZL, Li H, et al. Machine learning of single-cell transcriptome highly identifies mRNA signature by comparing F-score selection with DGE analysis. Mol Ther Nucleic Acids. 2020;20:155–63.
Yang H, Luo Y, Ren X, Wu M, Peng B, et al. Risk prediction of diabetes: big data mining with fusion of multifarious physical examination indicators. Inf Fusion. 2021.
Krishnapuram B, Shah M, Smola A, Aggarwal C, Shen D, Rastogi R. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016.
Chang CC, Lin CJ. LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol. 2007;2.
Zhu H, Hao H, Yu L. Identifying disease-related microbes based on multi-scale variational graph autoencoder embedding Wasserstein distance. BMC Biol. 2023;21:294.
Li H, Pang Y, Liu B. BioSeq-BLM: a platform for analyzing DNA, RNA, and protein sequences based on biological language models. Nucleic Acids Res. 2021;49:e129.
Yan J, Xu Y, Cheng QJS, Wang Q, Xiao Y, et al. LightGBM: accelerated genomically designed crop breeding through ensemble learning. Genome Biol. 2021;22:271.
Wang Y, Zhai Y, Ding Y, Zou Q. SBSM-Pro: Support Bio-sequence Machine for Proteins. arXiv preprint. 2023;arXiv:2308.10275.
Zhang HY, Zou Q, Ju Y, Song CG. D. Distance-based support vector machine to predict DNA N6-methyladenine modification. Curr Bioinform. 2022;17:473–82.
Liu B, Gao XZH. BioSeq-Analysis2.0: an updated platform for analyzing DNA, RNA and protein sequences at sequence level and residue level based on machine learning approaches. Nucleic Acids Res. 2019;47:e127.
Scornet E. Random forests and kernel methods. IEEE Trans Inf Theory. 2015;62:1485–500.
Zou X, Ren L, Cai P, Zhang Y, Ding H, Deng K, et al. Accurately identifying hemagglutinin using sequence information and machine learning methods. Front Med (Lausanne). 2023;10:1281880.
Zhu W, Yuan SS, Li J, Huang CB, Lin H, Liao B. A first computational frame for recognizing heparin-binding protein. Diagnostics (Basel). 2023;13.
Joshi PMVRR. SVM Based Approach Predicting Adverse Drug React Curr Bioinf. 2021;16:422–32.
Geete KPMR. Transcription factor binding site prediction using deep neural networks. Curr Bioinform. 2020;15:1137–52.
Ao C, Zhou W, Gao L, Dong B, Yu L. Prediction of antioxidant proteins using hybrid feature representation method and random forest. Genomics. 2020;112:4666–74.
Fu X, Zhu W, Cai L, Liao B, Peng L, Chen Y, et al. Improved pre-miRNAs identification through mutual information of pre-miRNA sequences and structures. Front Genet. 2019;10:119.
Fu X, Liao B, Zhu WCL. New 3D graphical representation for RNA structure analysis and its application in the pre-miRNA identification of plants. RSC Adv. 2018;8:30833–41.
Qian Y, Ding Y, Zou QGF. Multi-view kernel sparse representation for identification of membrane protein types. IEEE/ACM Trans Comput Biol Bioinform. 2023;20:1234–45.
Ai C, Yang H, Ding YTJGF. Low rank matrix factorization algorithm based on multi-graph regularization for detecting drug-disease association. IEEE/ACM Trans Comput Biol Bioinform. 2023;20:3033–43.
Tang Y, Pang YLB. IDP-Seq2Seq: identification of intrinsically disordered regions based on sequence to sequence learning. Bioinformatics. 2021;36:5177–86.
Zeng X, Zhang X, Zou Q. Integrative approaches for predicting microRNA function and prioritizing disease-related microRNA using biological interaction networks. Brief Bioinform.
2016;17:193–203. Zulfiqar H, Guo Z, Ahmad RM, Ahmed Z,Cai P, Chen X et al. Deep-STP: a deep learning-based approach to predict snake toxin proteins by using word embeddings. Front Med.2024;10. Additional Declarations No competing interests reported. Supplementary Files supplementalinformation.docx {"props":{"pageProps":{"initialData":{"identity":"rs-4349116","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":301985999,"identity":"675ccadf-180c-44ee-acab-7da30f426b97","order_by":0,"name":"Shicong Yu","email":"","orcid":"","institution":"Sichuan Agricultural University","correspondingAuthor":false,"prefix":"","firstName":"Shicong","middleName":"","lastName":"Yu","suffix":""},{"id":301986000,"identity":"7e83363a-61ad-4df7-abf1-a3d2c2f1d8f4","order_by":1,"name":"Xiangzheng Fu","email":"","orcid":"","institution":"Hunan 
University","correspondingAuthor":false,"prefix":"","firstName":"Xiangzheng","middleName":"","lastName":"Fu","suffix":""},{"id":301986001,"identity":"84808d77-e6bd-4eaf-984e-6824a5a366d7","order_by":2,"name":"Xiaoshu Deng","email":"","orcid":"","institution":"Sichuan Agricultural University","correspondingAuthor":true,"prefix":"","firstName":"Xiaoshu","middleName":"","lastName":"Deng","suffix":""},{"id":301986002,"identity":"1bd5205f-181a-4675-b8e5-888053a97b0c","order_by":3,"name":"Hao wang","email":"","orcid":"","institution":"Chinese Academy of Agricultural Sciences","correspondingAuthor":false,"prefix":"","firstName":"Hao","middleName":"","lastName":"wang","suffix":""},{"id":301986005,"identity":"ff234a4f-6c2b-438a-88eb-3b9a870fee4e","order_by":4,"name":"shen yan","email":"","orcid":"","institution":"Chinese Academy of Agricultural Sciences","correspondingAuthor":false,"prefix":"","firstName":"shen","middleName":"","lastName":"yan","suffix":""},{"id":301986008,"identity":"384481c4-9416-4c3d-8b33-ddff846bdc18","order_by":5,"name":"shuqin Zheng","email":"","orcid":"","institution":"Sichuan Agricultural University","correspondingAuthor":false,"prefix":"","firstName":"shuqin","middleName":"","lastName":"Zheng","suffix":""},{"id":301986010,"identity":"aac4ccc1-7e18-48fb-a7a4-4388fb590721","order_by":6,"name":"Jingpeng Ning","email":"","orcid":"","institution":"Sichuan Agricultural 
University","correspondingAuthor":false,"prefix":"","firstName":"Jingpeng","middleName":"","lastName":"Ning","suffix":""},{"id":301986012,"identity":"bbb866d6-a656-4d6a-9d6b-6e5332682c99","order_by":7,"name":"xiaoshu deng","email":"","orcid":"","institution":"Chongqing Academy of Chinese Matera Medica","correspondingAuthor":false,"prefix":"","firstName":"xiaoshu","middleName":"","lastName":"deng","suffix":""}],"badges":[],"createdAt":"2024-04-30 12:36:17","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-4349116/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-4349116/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":56504280,"identity":"5e25570d-0d79-449e-8c53-e4e3f4cb0a00","added_by":"auto","created_at":"2024-05-15 04:52:00","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":286295,"visible":true,"origin":"","legend":"\u003cp\u003eThe workflow for constructing AtML.\u003c/p\u003e","description":"","filename":"Fig1.png","url":"https://assets-eu.researchsquare.com/files/rs-4349116/v1/95ae044839ce5e335776d592.png"},{"id":56504279,"identity":"3e3b48f3-e108-4904-8918-ee61d4ef2f81","added_by":"auto","created_at":"2024-05-15 04:52:00","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":118561,"visible":true,"origin":"","legend":"\u003cp\u003eThe results of feature selection. 
The IFS curves illustrate the efficacy of three feature selection methods (CV2, MIC and F score) across various gene subsets with the four classifiers, while the comparative Venn diagram in part D highlights the overlap among the top 300 genes identified by CV2, MIC and F score.\u003c/p\u003e","description":"","filename":"Fig2.png","url":"https://assets-eu.researchsquare.com/files/rs-4349116/v1/5e63314f78945ceeb15ca3f9.png"},{"id":56505240,"identity":"13bd3832-302d-458a-b2bc-638f7e53d1a3","added_by":"auto","created_at":"2024-05-15 05:08:00","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":457033,"visible":true,"origin":"","legend":"\u003cp\u003ePredictive performance of AtML. (A) ROC curves for AtML on the training set. (B) Confusion matrix for AtML on the test dataset, using the 160 genes identified by the MIC_XGB algorithm. (C and D) Clustering of 6,000 cells using the complete gene set (C) and the 160 marker genes (D). Each point represents a sample, and colours indicate sample categories. (E) Correlation analysis of the 160 marker genes among the six subpopulations of rice root cells. (F) Bar graph of the mean absolute SHAP values of the top 20 genes for MIC_XGB.\u003c/p\u003e","description":"","filename":"Fig3.png","url":"https://assets-eu.researchsquare.com/files/rs-4349116/v1/1b73b3960caa367ad3f8e3d4.png"},{"id":56504945,"identity":"9b7280eb-cbb1-40d8-9bc3-3bf77c275284","added_by":"auto","created_at":"2024-05-15 05:00:00","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":317060,"visible":true,"origin":"","legend":"\u003cp\u003eBiological analysis of 160 marker genes. 
(A) Boxplots illustrating the mean expression levels (after normalizing the raw read counts) of 160 marker genes across six Arabidopsis root cell subpopulations. (B) Expression patterns of marker genes within various subpopulations of placental cells, as identified by the TURF optimal gene set. (C and D) Highlighted marker genes with high expression identified through Scanpy analysis.\u003c/p\u003e","description":"","filename":"Fig4.png","url":"https://assets-eu.researchsquare.com/files/rs-4349116/v1/ab313b02d4cafef5ca73aeea.png"},{"id":56505239,"identity":"f534524f-9342-4d1d-a29b-6807480a5562","added_by":"auto","created_at":"2024-05-15 05:08:00","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":329286,"visible":true,"origin":"","legend":"\u003cp\u003eComputational analysis of 160 marker genes. (A and B) The selection of marker genes by MIC_XGB (comprising 160 marker genes) was compared through split violin plots, which display the expression levels of these genes in specific cell types on the left in blue and their aggregate expression across the other five cell types on the right in orange. (C and D) PAGA was used to conduct expression trajectory analyses of the 160 marker genes (depicted as descending) and the entire genome (illustrated as ascending) across rice root cell subpopulations, with colour coding by cell type. 
The line thickness indicates the degree of cell connectivity.\u003c/p\u003e","description":"","filename":"Fig5.png","url":"https://assets-eu.researchsquare.com/files/rs-4349116/v1/75148b54c77e01f60824085c.png"},{"id":58787375,"identity":"1af13b98-eece-478c-a666-79617123524e","added_by":"auto","created_at":"2024-06-21 06:26:08","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":1916576,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-4349116/v1/904cbcca-82f0-4480-858b-c34b7ea29aa5.pdf"},{"id":56504284,"identity":"5eb84768-642a-4bd8-8988-f249b92fa968","added_by":"auto","created_at":"2024-05-15 04:52:00","extension":"docx","order_by":8,"title":"","display":"","copyAsset":false,"role":"supplement","size":249753,"visible":true,"origin":"","legend":"","description":"","filename":"supplementalinformation.docx","url":"https://assets-eu.researchsquare.com/files/rs-4349116/v1/9b72cd368b7377ac376892fa.docx"}],"financialInterests":"No competing interests reported.","formattedTitle":"AtML: Accurate identification of Arabidopsis thaliana root cell identity via single-cell sequencing analysis using interpretable machine learning","fulltext":[{"header":"Introduction","content":"\u003cp\u003eRoots are underground organs that develop through long-term adaptation to terrestrial life. Roots also play crucial roles in various physiological functions. Primarily, they facilitate the absorption of water and inorganic salts, which are subsequently transported to the stem and leaves[\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e]. In addition to their role in nutrient uptake, roots serve as anchors, as they utilize strong branching capabilities to firmly stabilize plants in conducive soil[\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e]. 
Moreover, roots play a significant role in assimilating diverse inorganic salts into organic substances via complex biochemical reactions[\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e, \u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e]. They also serve as vital sites for synthesizing plant hormones, including cytokinins, auxins, abscisic acid, gibberellins, and ethylene, thereby exerting considerable influence on overall plant development[\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e]. Furthermore, certain types of roots possess the ability to expand and function as storage organs, thus facilitating both storage and reproductive processes.\u003c/p\u003e \u003cp\u003eIn recent years, remarkable advancements have been made in the domain of molecular biology research focused on root systems. A significant portion of this research has focused on the model plant \u003cem\u003eArabidopsis thaliana\u003c/em\u003e. 
The primary root of Arabidopsis comprises distinct layers, including the epidermis, cortex, and vascular cylinder. As the primary root undergoes growth and development[\u003cspan additionalcitationids=\"CR7\" citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e], it generates lateral roots, thereby continually expanding the complexity and extent of the root system[\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e, \u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e]. The intricate development of the Arabidopsis root system is governed by multifaceted molecular processes, thus necessitating an in-depth investigation at a more refined level, particularly within specific cell subpopulations.\u003c/p\u003e \u003cp\u003eCells are the smallest functional units in plants, and physiological processes are typically carried out collaboratively by multiple cells. However, gene expression varies significantly among cells. Therefore, more precise transcriptome technologies have been applied[\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e, \u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e]. Lopez-Anido constructed different models of cell state differentiation in the leaf tissue of Arabidopsis, analysing the potential heterogeneity of epidermal stomatal lineage cells[\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e]. Liu utilized the single-cell transcriptome atlas of early-stage Arabidopsis seedlings to identify new marker genes for leaf vein cells[\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e]. 
Kim determined the roles of various tissues in Arabidopsis true leaves starting from metabolic pathways, thus providing knowledge regarding the leaf vascular system and the relationships of leaf cell types[\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e]. Previous research on the underground part of Arabidopsis involved the construction of a high-resolution genetic map of the development process from stem cells to nucleus-free sieve tubes in the original endodermis of the Arabidopsis root system at seven developmental stages[\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eWith the availability of scRNA-seq data, cell type identification has become an essential step for a multitude of downstream analyses[\u003cspan additionalcitationids=\"CR17 CR18 CR19 CR20 CR21\" citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e]. In certain scenarios, the absence of reliable markers for essential cell populations poses a challenge in accurately defining cell types. Implementing an efficient machine learning prediction model to extract molecular markers and discern cell subpopulations from existing single-cell datasets has been proven to be a time-efficient and resource-saving strategy[\u003cspan additionalcitationids=\"CR24 CR25 CR26\" citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eTo address the limitations mentioned above, we proposed an ensemble computing framework named AtML, which enables the model to capture cell subpopulation biomarkers of Arabidopsis root tips and predict cell stages (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e). 
The AtML model combines MIC and XGBoost to assess the importance of genes in predicting Arabidopsis root tip cell subpopulations. Furthermore, we successfully applied our AtML model to data not included in the training dataset and demonstrated its superior predictive performance. By conducting a biological analysis of the optimal genes, we identified potential lineage-specific genes, which could help biologists better understand the heterogeneity of Arabidopsis root tips. Our work utilized a machine learning approach to aid in the development of markers for single-cell sequencing in Arabidopsis, thereby providing new insights and more accurate markers for cell type identification.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e"},{"header":"Results","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003eIdentification of Significant Genes Using Feature Selection and Machine Learning\u003c/h2\u003e \u003cp\u003eTo identify essential genes related to the subpopulations of Arabidopsis root tip cells, three feature selection methods (MIC, CV2, and F score) were used. These methods were utilized to evaluate the significance of the 24,280 genes, and subsequent rankings were established based on their contribution values. Genes with an importance score less than or equal to zero were systematically excluded. Next, machine learning models were combined with incremental feature selection (IFS) to determine the optimal gene subsets (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eA-C). 
Based on fivefold cross-validation, single-cell gene expression matrices (normalized raw read count) were used as input features to train machine learning models (KNN, XGBoost, SVM, and RFC).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eThe training dataset results showed that the MIC combined with XGBoost (MIC_XGBoost) achieved optimal prediction performance for the top 160 genes, with an accuracy of 96.88% (Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e). Notably, except for the KNN model, the other three learning models combined with the MIC and F score also obtained superior prediction performance. Using the 160 optimal genes on the test data, MIC_XGBoost also demonstrated the best performance, achieving an accuracy, precision, recall, and F1-measure of 96.50%, 95.49%, 94.51%, and 94.49%, respectively (Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e). To mitigate the potential bias introduced by CV2, MIC, and the F score favouring the same genes, we selected the top 300 genes based on the score rankings of these three feature selection methods for comparison. 
The three top-300 lists shared few genes, indicating that MIC, F score and CV2 capture largely distinct gene sets rather than favouring the same genes (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eD).\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003ePerformance evaluation of different feature selections combined with machine learning schemes (training dataset)\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"4\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMethod\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eFeature selection\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eNo. 
of features\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eAccuracy\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eKNN\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eF score\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e510\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e93.13%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eXGBoost\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eF score\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e310\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e96.18%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eSVM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eF score\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e2200\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e95.47%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eRFC\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eF score\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e1400\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e91.27%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e 
\u003cp\u003eKNN\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eCV2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e1400\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e89.54%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eXGBoost\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eCV2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e1100\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e96.89%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eSVM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eCV2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e2200\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e96.43%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eRFC\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eCV2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e4800\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e90.95%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eKNN\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMIC\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e310\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" 
colname=\"c4\"\u003e \u003cp\u003e94.87%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eXGBoost\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMIC\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e160\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e96.88%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eSVM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMIC\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e1700\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e96.68%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eRFC\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMIC\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e1100\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e92.65%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003ePerformance comparison between AtML and the other algorithms (test dataset)\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"7\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" 
class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMethod\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eFeature selection\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eNo. of features\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eAccuracy\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003ePrecision\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003eRecall\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c7\"\u003e \u003cp\u003eF1-measure\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eKNN\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eF score\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e510\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e92.03%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e92.34%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e92.06%\u003c/p\u003e \u003c/td\u003e 
\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e91.99%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eXGBoost\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eF score\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e310\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e96.31%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e96.45%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e96.41%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e96.42%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eSVM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eF score\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e2200\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e95.54%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e95.52%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e95.51%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e95.48%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eRFC\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eF score\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e1400\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e 
\u003cp\u003e91.58%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e92.09%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e91.44%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e91.36%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eKNN\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eCV2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e1400\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e89.91%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e90.68%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e89.88%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e89.97%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eXGBoost\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eCV2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e1100\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e96.31%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e96.38%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e95.42%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e94.40%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eSVM\u003c/p\u003e \u003c/td\u003e \u003ctd 
align=\"left\" colname=\"c2\"\u003e \u003cp\u003eCV2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e2200\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e95.41%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e95.39%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e95.40%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e95.38%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eRFC\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eCV2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e4800\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e92.15%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e92.02%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e91.03%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e90.91%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eKNN\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMIC\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e310\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e94.16%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e94.36%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e94.20%\u003c/p\u003e \u003c/td\u003e \u003ctd 
align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e94.17%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eXGBoost\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMIC\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e160\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e96.50%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e96.49%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e96.51%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e96.49%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eSVM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMIC\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e1700\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e96.41%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e96.39%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e94.42%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e96.40%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eRFC\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMIC\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e1100\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e92.25%\u003c/p\u003e 
\u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e92.67%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e92.15%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e92.13%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec4\" class=\"Section2\"\u003e \u003ch2\u003eInvestigating AtML model interpretability\u003c/h2\u003e \u003cp\u003eTo assess whether the AtML model performs consistently well across cell subpopulations, we examined the receiver operating characteristic (ROC) curve and the confusion matrix. Both showed a low misclassification rate, demonstrating the ability of AtML to predict the identity of the six Arabidopsis root cell subpopulations (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eA and B). To validate the proposed model, we conducted cluster analyses on two datasets: one containing 24,280 genes and one containing only the 160 selected genes. Uniform manifold approximation and projection (UMAP) and correlation analysis of 6000 single cells indicated that the 160 marker genes performed better overall than the full gene set (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eC and D). These results illustrated the ability of the proposed AtML model to extract informative candidate genes, which contributes to the understanding of cell type annotations. 
We also used the marker genes for association analysis with the six Arabidopsis root cell subpopulations, which confirmed the high accuracy and specificity of the marker genes (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eE).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eSimilarly, we used SHAP to analyse the average impact of each gene on the magnitude of the model output and selected the 20 genes with the highest SHAP values for display. Several of these 20 genes are supported by existing molecular biology studies. AT1G65570, named root cap polygalacturonase (\u003cem\u003eRCPG\u003c/em\u003e), was identified as a glycosyl hydrolase involved in root cap removal[\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e]. AT3G15680 and AT5G25490, named \u003cem\u003eJul1\u003c/em\u003e and \u003cem\u003eJul2\u003c/em\u003e, respectively, negatively regulate phloem development, and their expression in vascular bundles of roots is restricted to the elongation and maturation zones[\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e]. In earlier studies, AT5G09530, a gene containing the PELPK motif that is most abundant in the mature vasculature of roots, was found to be closely associated with normal root development[\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e]. Additionally, AT3G51280, named altered phosphate starvation response 1 (\u003cem\u003eAPSR1\u003c/em\u003e), is expressed in primary root tips and is involved in root meristem maintenance[\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e]. Furthermore, AT3G16920, which encodes chitinase-like protein 2 (\u003cem\u003eCTL2\u003c/em\u003e), regulates cellulose assembly to maintain normal root development. 
Together, the SHAP analysis indicated that the predictions of our model are driven by genes with established roles in root biology (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eF).\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec5\" class=\"Section2\"\u003e \u003ch2\u003eExpression analysis of the AtML gene set\u003c/h2\u003e \u003cp\u003eAnalysis of the expression of the 160 genes in the six cell subpopulations showed higher gene expression in trichoblast cells, whereas gene expression was similar in the remaining five cell subpopulations (Fig.\u0026nbsp;4A). Subsequently, we used UMAP to visualize the marker genes that have been reported for the relevant cell subpopulations (Fig.\u0026nbsp;4B). For example, AT4G22214 and AT2G01540 are specifically expressed in trichoblasts, and AT3G08030 and AT1G62510 are highly specific to the atrichoblast and cortex, respectively. Moreover, we identified several new candidate marker genes that are specifically expressed within individual cell subpopulations of Arabidopsis roots. We observed that AT1G30750, AT3G11550 and AT5G20820 were specifically expressed in the endodermis. These highly ranked genes can be used as biomarkers to identify subpopulations of \u003cem\u003eArabidopsis thaliana\u003c/em\u003e root cells and provide support for further biological findings (Fig.\u0026nbsp;4C). Using multiple genes to characterize cell subpopulations in the Arabidopsis root tip could allow them to be labelled more accurately. We utilized Scanpy to display the 160 genes screened by AtML, highlighting the five genes with the highest expression in each cell subpopulation. These genes can collectively serve as candidate marker genes for more accurate labelling of different cell subsets (Fig.\u0026nbsp;4D).\u003c/p\u003e \u003cp\u003eTo further evaluate AtML, we compared the expression levels of the top 20 genes in each cell subpopulation with those in the remaining five clusters. 
The results revealed that these genes exhibited significantly greater expression in each cell subpopulation than in the other five cell subpopulations. Notably, specific genes, such as AT1G62510, AT5G62210, and AT3G59930, exhibited distinctive expression in the cortex, while AT1G30750, AT3G11550, AT4G11190, and AT5G20820 exhibited specificity in the endodermis. Furthermore, AT4G12545 and AT2G16005 exhibited specific expression in Atrichoblast; AT3G16440, AT1G50060, and AT4G38410 exhibited specific expression in Lateral_Root_Cap; and AT2G01540 and AT4G02270 exhibited specific expression in Trichoblast (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e5\u003c/span\u003eA, supplementary Fig. \u003cspan refid=\"MOESM1\" class=\"InternalRef\"\u003eS1\u003c/span\u003e). This observation indicated that the AtML model accurately identified candidate genes capable of distinguishing different cell subpopulations and exhibited invaluable advantages in processing single-cell RNA sequencing (scRNA-seq) data (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e5\u003c/span\u003eA and B and supplementary Fig.\u003cspan refid=\"MOESM1\" class=\"InternalRef\"\u003eS1\u003c/span\u003e). Employing both single-cell expression profiles encompassing the entire gene set as well as a curated subset of 160 genes, we utilized these datasets as input for the construction of partition-based graph abstraction (PAGA) to delineate the biological landscape. The graphical representation derived from this analysis revealed a consistent topological structure, highlighting robust connections, particularly between atrichoblasts and trichoblasts. 
This observation suggested that AtML could be used to screen for essential molecular markers effectively while eliminating redundant information within the dataset (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e5\u003c/span\u003eC and D).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e"},{"header":"Discussion","content":"\u003cp\u003escRNA-seq technology was first applied to study the developmental dynamics of plant root tip cells[\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e]. Through meticulous screening, researchers isolated several gene markers that could classify the cell populations in the rice root system into 21 categories[\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e33\u003c/span\u003e]. Single-cell transcriptome analyses of peanut leaves and cotton anthers have also demonstrated the powerful ability of this technology to address cellular heterogeneity[\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e, \u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e]. In this study, we aimed to address a longstanding challenge in the analysis of single-cell RNA sequencing (scRNA-seq) data from Arabidopsis root cells, namely the scarcity of marker genes for cell types combined with high variability and poor reproducibility in manual assignments across research groups and experiments[\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e, \u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e36\u003c/span\u003e, \u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e37\u003c/span\u003e]. Herein, we introduced a novel model, AtML, which is an expression atlas-based ensemble learning framework. 
We aimed to overcome these limitations and provide a robust solution for the identification of significant genes.\u003c/p\u003e \u003cp\u003eUsing XGBoost combined with the MIC method, AtML was designed to capture the gene expression patterns and molecular events within Arabidopsis root cells. Notably, this study integrates single-cell Arabidopsis data with artificial intelligence methods, thus highlighting the pioneering nature of AtML in advancing the field. Moreover, the model identified a robust set of genes as potential cell-type markers for subpopulations of Arabidopsis root tip cells. These candidate marker genes play intricate regulatory roles during the development of Arabidopsis roots; for example, \u003cem\u003eRCPG\u003c/em\u003e (AT1G65570) is a key regulator that precisely controls root cap maturation and cell detachment in root tips[\u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e38\u003c/span\u003e]. Visualization of the expression patterns of the optimal gene set demonstrated that the selected genes retain essential biological patterns, thus indicating promising applications in annotating Arabidopsis scRNA-seq datasets.\u003c/p\u003e \u003cp\u003eHowever, it is essential to acknowledge the study's limitations, particularly the small sample size and the absence of external datasets for model validation. Collaborative efforts in data collection may prove instrumental in enhancing AtML's robustness and generalizability. Despite these limitations, our work serves as a valuable resource for exploring the physiological functions of Arabidopsis root cell types at both the molecular and single-cell levels. This study also provides insights into the unique molecular events influencing the development of resistant cells in Arabidopsis. 
We expect AtML to provide insights into the genetic basis of root cell fate determination in Arabidopsis.\u003c/p\u003e "},{"header":"Materials and Methods","content":"\u003cdiv id=\"Sec8\" class=\"Section3\"\u003e\n\u003ch2\u003eDataset construction and preprocessing\u003c/h2\u003e\n\u003cp\u003eSingle-cell transcriptome data from the root tips of Arabidopsis plants comprising 6000 cells were collected from the Gene Expression Omnibus of the National Center for Biotechnology Information (GSE152766). The dataset covers specific Arabidopsis root cell subpopulations of interest and is easily accessible, thus ensuring transparency and verifiability. In addition, the dataset included a moderate sample size and high-quality sequencing data; therefore, it was selected for analysis. The scRNA-seq data were aligned to an Arabidopsis genome BSgenome object (\u0026ldquo;BSgenome.Athaliana.TAIR.TAIR9\u0026rdquo;) with the TAIR10 gene annotation file and counted using the Cell Ranger pipelines, resulting in 25,261 genes. The dataset included six different cell subpopulations: Endodermis (1000), Lateral Root Cap (1000), Atrichoblast (1000), Trichoblast (1000), Cortex (1000) and Procambium (1000). The dataset was split into a training dataset and a test dataset at a 7:3 ratio. More dataset details are provided in the Supporting Information (S1 Table). The Python packages NumPy (version 1.21.6), Pandas (version 1.3.5) and Scanpy (version 1.9.1) were used to read and process the data.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec9\" class=\"Section2\"\u003e\n\u003ch2\u003eBiological analysis and visualization\u003c/h2\u003e\n\u003cp\u003eWe carried out an in-depth analysis to assess the effectiveness of the 160 marker genes in distinguishing cell subpopulations. To identify specific cell subpopulations associated with these marker genes, we utilized the clustering analysis capability of Scanpy software (version 1.9.1), with all the settings remaining at their default values. 
We also used Pandas (version 1.4.4) to conduct Pearson\u0026rsquo;s correlation analysis on the six Arabidopsis root cell populations, focusing on the 160 marker genes. For visual representation and cell trajectory analysis, we used the UMAP technique through the umap-learn Python package (version 0.3.9) and PAGA analysis via Scanpy, respectively, adhering to the default settings for both methods. This approach included analyses of both the complete feature dataset and a reduced dataset comprising only the 160 selected genes, and the standard parameters were used throughout the analyses.\u003c/p\u003e\n\u003cdiv id=\"Sec10\" class=\"Section3\"\u003e\n\u003ch2\u003eMIC\u003c/h2\u003e\n\u003cp\u003eThe core concept of the maximal information coefficient (MIC) is that if a relationship exists between two variables, a grid can be drawn on the scatterplot of these two variables that partitions the data so as to capture this relationship. The mutual information of the two variables is computed for each candidate grid, and these values are subsequently normalized to allow a fair comparison across grids of varying dimensions; the MIC is the highest normalized value, which ensures consistency in the assessment of relationships between variables regardless of their scale or complexity[\u003cspan class=\"CitationRef\"\u003e39\u003c/span\u003e\u0026ndash;\u003cspan class=\"CitationRef\"\u003e41\u003c/span\u003e].\u003c/p\u003e\n\u003cdiv id=\"Equa\" class=\"Equation\"\u003e\n\u003cdiv id=\"FileID_Equa\" class=\"mathdisplay\"\u003e$$\\begin{array}{c}I\\left(X;Y\\right)=\\sum _{x,y}p\\left(x,y\\right)\\log \\frac{p\\left(x,y\\right)}{p\\left(x\\right)p\\left(y\\right)}=H\\left(X\\right)-H\\left(X\\mid Y\\right) \\left(2\\right) \\end{array}$$\u003c/div\u003e\n\u003c/div\u003e\n\u003cp\u003ewhere \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(p\\left(x,y\\right)\\)\u003c/span\u003e\u003c/span\u003e is the joint probability of \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(X\\)\u003c/span\u003e\u003c/span\u003e and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(Y\\)\u003c/span\u003e\u003c/span\u003e, \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(p\\left(x\\right)\\)\u003c/span\u003e\u003c/span\u003e and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(p\\left(y\\right)\\)\u003c/span\u003e\u003c/span\u003e are the corresponding marginal probabilities, \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(H\\left(X\\right)\\)\u003c/span\u003e\u003c/span\u003e is the entropy of \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(X\\)\u003c/span\u003e\u003c/span\u003e and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(H\\left(X\\mid Y\\right)\\)\u003c/span\u003e\u003c/span\u003e is the conditional entropy of \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(X\\)\u003c/span\u003e\u003c/span\u003e given \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(Y\\)\u003c/span\u003e\u003c/span\u003e.\u003c/p\u003e\n\u003c/div\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec11\" class=\"Section2\"\u003e\n\u003ch2\u003eModel construction of AtML\u003c/h2\u003e\n\u003cp\u003eThe gene expression profiles of Arabidopsis root cell subpopulations served as the input features for training the machine learning model. During the exploratory data analysis phase, the identification of critical relationships and weights between features was essential for eliminating less relevant or weaker information. The MIC, coefficient of variation squared (CV2)[\u003cspan class=\"CitationRef\"\u003e42\u003c/span\u003e], and F score[\u003cspan class=\"CitationRef\"\u003e43\u003c/span\u003e] were used to evaluate and prioritize the significance of each gene within the training model, and genes with weights of zero or less were excluded.\u003c/p\u003e\n\u003cp\u003eIncremental feature selection (IFS)[\u003cspan class=\"CitationRef\"\u003e44\u003c/span\u003e] was utilized to systematically train base models, including the KNN[\u003cspan class=\"CitationRef\"\u003e45\u003c/span\u003e], XGBoost[\u003cspan class=\"CitationRef\"\u003e46\u003c/span\u003e\u0026ndash;\u003cspan class=\"CitationRef\"\u003e48\u003c/span\u003e], SVM[\u003cspan class=\"CitationRef\"\u003e49\u003c/span\u003e\u0026ndash;\u003cspan class=\"CitationRef\"\u003e52\u003c/span\u003e], and random forest classifier (RFC)[\u003cspan class=\"CitationRef\"\u003e53\u003c/span\u003e\u0026ndash;\u003cspan class=\"CitationRef\"\u003e55\u003c/span\u003e] models, thus enabling a thorough 
comparison of their predictive capabilities. Using the optimal gene set identified in this way, we chose XGBoost to develop the AtML model.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec12\" class=\"Section2\"\u003e\n\u003ch2\u003eModel evaluation\u003c/h2\u003e\n\u003cp\u003eFour classic metrics were used to quantify the performance of the model predictions, namely, the accuracy (Acc), recall (Re), precision (Pre), and F1 measure (F1), defined as [\u003cspan class=\"CitationRef\"\u003e56\u003c/span\u003e\u0026ndash;\u003cspan class=\"CitationRef\"\u003e63\u003c/span\u003e]:\u003c/p\u003e\n\u003cdiv id=\"Equb\" class=\"Equation\"\u003e\n\u003cdiv id=\"FileID_Equb\" class=\"mathdisplay\"\u003e$$\\text{Accuracy}=\\frac{TP+TN}{TP+TN+FP+FN} \\left(3\\right)$$\u003c/div\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Equc\" class=\"Equation\"\u003e\n\u003cdiv id=\"FileID_Equc\" class=\"mathdisplay\"\u003e$$\\text{Recall}=\\frac{TP}{TP+FN} \\left(4\\right)$$\u003c/div\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Equd\" class=\"Equation\"\u003e\n\u003cdiv id=\"FileID_Equd\" class=\"mathdisplay\"\u003e$$\\text{Precision}=\\frac{TP}{TP+FP} \\left(5\\right)$$\u003c/div\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Eque\" class=\"Equation\"\u003e\n\u003cdiv id=\"FileID_Eque\" class=\"mathdisplay\"\u003e$$\\text{F1 measure}=\\frac{2\\times \\text{Precision}\\times \\text{Recall}}{\\text{Precision}+\\text{Recall}} \\left(6\\right)$$\u003c/div\u003e\n\u003c/div\u003e\n\u003cp\u003ewhere \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(TP, TN, FP \\text{ and } FN\\)\u003c/span\u003e\u003c/span\u003e represent the numbers of true positives, true negatives, false positives and false negatives, respectively. 
In addition, the ROC curve was used to evaluate the performance of the AtML [\u003cspan class=\"CitationRef\"\u003e64\u003c/span\u003e, \u003cspan class=\"CitationRef\"\u003e65\u003c/span\u003e].\u003c/p\u003e\n\u003c/div\u003e"},{"header":"Abbreviations","content":"\u003cdiv class=\"DefinitionList\"\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003escRNA-seq\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eSingle-cell RNA sequencing\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eMIC\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eMaximal Information Coefficient\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eCV2\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eCoefficient of Variation Squared\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eIFS\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eIncremental feature selection\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eKNN\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eK-Nearest Neighbors\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eXGBoost\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eExtreme Gradient Boosting\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eSVM\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eSupport Vector Machine\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eRFC\u003c/div\u003e \u003cdiv 
class=\"Description\"\u003e \u003cp\u003eRandom Forest Classifier\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eMIC_XGBoost\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eMIC combined with XGBoost\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eROC\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eReceiver Operating Characteristic\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eUMAP\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eUniform manifold approximation and projection\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eSHAP\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eSHapley Additive exPlanations\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003c/div\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eCompeting interests\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors declare no Competing Financial or Non-Financial Interests.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFunding\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis work was supported by the National Key R\u0026amp;D Program of China (2022YFF0711802).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eEthics approval and consent to participate\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConsent for publication\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eAll participants agreed to have their data published in an open access online publication.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAvailability of data and materials\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eSingle-cell transcriptome data from the root tips of 
Arabidopsis plants comprising 6000 cells were collected from the Gene Expression Omnibus of the National Center for Biotechnology Information (GSE152766). The dataset covers specific Arabidopsis root cell subpopulations of interest and is easily accessible, thus ensuring transparency and verifiability.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthors\u0026apos; contributions\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eXD, XF and SY directed the research design. SY collected the data. YS, HW and XF designed the models and tested the model performance. XF and HW completed the main coding work of the web server. SY, SY, ZS, and HW analyzed the results. HW and SY drafted the manuscript, and JN and RL commented on and revised drafts.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAcknowledgements\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors extend their sincere gratitude to the reviewers for their constructive suggestions for this article. This work was supported by the National Key R\u0026amp;D Program of China (2022YFF0711802).\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eMiyashima S, Nakajima K. The root endodermis: a hub of developmental signals and nutrient flow. Plant Signal Behav. 2011;6:1954\u0026ndash;8.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAndersen TG, Barberon M, Geldner N. Suberization\u0026mdash;the second life of an endodermal cell. Curr Opin Plant Biol. 2015;28:9\u0026ndash;15.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSilva NDG, D, Murmu J, Chabot D, Hubbard K, Ryser P, Molina I, et al. Root Suberin Plays Important Roles in Reducing Water Loss and Sodium Uptake in Arabidopsis thaliana. Metabolites. 2021;11:735.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBarberon M. The endodermis as a checkpoint for nutrients. New Phytol. 2017.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSun J, Niu Q W, Tarkowski PZB, Tarkowska D, Sandberg G, et al. 
The Arabidopsis AtIPT8/PGA22 Gene Encodes an Isopentenyl Transferase That Is Involved in De Novo Cytokinin Biosynthesis. Plant Physiol. 2003;131:167\u0026ndash;76.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKoizumi K, Hayashi T, Gallagher K. SCARECROW reinforces SHORT-ROOT signaling and inhibits periclinal cell divisions in the ground tissue by maintaining SHR at high levels in the endodermis. Plant Signal Behav. 2012;7:1573\u0026ndash;7.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHelariutta Y, Fukaki H, Wysocka-Diller J, Nakajima K, Jung J, Sena G, et al. The SHORT-ROOT gene controls radial patterning of the Arabidopsis root through radial signaling. Cell. 2000;101:555\u0026ndash;67.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDolan L, Janmaat K, Willemsen V, Linstead P, Poethig S, Roberts K, et al. Cellular organisation of the Arabidopsis thaliana root. Development. 1993;119:71\u0026ndash;84.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMenand B, Yi K, Jouannic S, Hoffmann L, Ryan E, Linstead P, et al. An ancient mechanism controls the development of cells with a rooting function in land plants. Science. 2007;316:1477\u0026ndash;80.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNavin N, Hicks J. Future medical applications of single-cell sequencing in cancer. Genome Med. 2011;3:1\u0026ndash;12.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNavin N, Hicks J. Future medical applications of single-cell sequencing in cancer. Genome Med. 2011;3:31.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLopez-Anido C, B, Vat\u0026eacute;n A, Smoot NK, Sharma NGV, Gong Y, et al. Single-cell resolution of lineage trajectories in the Arabidopsis stomatal lineage and developing leaf. Dev Cell. 2021;56:1043\u0026ndash;55.e1044.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLiu Z, Wang J, Zhou Y, Zhang Y, Yu X, et al. 
Identification of novel regulators required for early development of vein pattern in the cotyledons by single-cell RNA‐sequencing. Plant J. 2022;110:7\u0026ndash;22.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKim J-Y, Symeonidi E, Pang TY, Denyer T, Weidauer D, Bezrutczyk M, et al. Distinct identities of leaf phloem cells revealed by single cell transcriptomics. Plant Cell. 2021;33:511\u0026ndash;30.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRoszak P, Heo J-, Toyokura K, Sugiyama Y, de Luis Balaguer MA, et al. Cell-by-cell dissection of phloem development links a maturation gradient to cell specialization. Science. 2021;374:eaba5531.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKiselev VY, Andrews TS. Correction: Challenges in unsupervised clustering of single-cell RNA-seq data. Nature Reviews Genetics. 2019;20:310.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZou G, Lin Y, Han T. L. DEMOC: a deep embedded multi-omics learning approach for clustering single-cell CITE-seq data. Brief Bioinform. 2022;23:bbac347.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhang Z, Cui F, Cao C, Wang Q, Zou Q. Single-cell RNA analysis reveals the potential risk of organ-specific cell types vulnerable to SARS-CoV-2 infections. Computers in Biology and Medicine. 2022;140:105092.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eXu J, Xu J, Meng Y, Lu CCL, Zeng X, et al. Graph embedding and Gaussian mixture variational autoencoder network for end-to-end analysis of single-cell RNA sequencing data. Cell Rep Methods. 2023;3.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhao M, He W, Zou Q, Guo F. A hybrid deep learning framework for gene regulatory network inference from single-cell transcriptomic data. Briefings in Bioinformatics. 2022;23:bbab568.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhao M, He W, 
Zou Q, Guo F. A comprehensive overview and critical evaluation of gene regulatory network inference technologies. Brief Bioinform. 2021;22:bbab009.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDai C, Jiang Y, Yin C, Su R, Zeng X, Zou Q, et al. scIMC: a platform for benchmarking comparison and visualization analysis of scRNA-seq data imputation methods. Nucleic Acids Res. 2022;50:4877\u0026ndash;99.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eJin S, Zeng X, Xia F, Huang W, Liu X. Application of deep learning methods in biological networks. Briefings in Bioinformatics. 2021;22:1902\u0026ndash;17.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWang J, Chen Y, Zou Q. Inferring gene regulatory network from single-cell transcriptomes with graph autoencoder model. PLoS Genet. 2023;19:e1010942.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLiu Y, Shen X, Gong Y, Liu Y, Song B, Zeng X. Sequence Alignment/Map format: a comprehensive review of approaches and applications. Briefings in Bioinformatics. 2023;24:bbad320.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChen L, Yu L, Gao L. Potent antibiotic design via guided search from antibacterial activity evaluations. Bioinformatics. 2023;39:btad059.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWang R, Jiang Y, Jin J, Yin CYH, Wang F, et al. DeepBIO: an automated and interpretable deep-learning platform for high-throughput biological sequence prediction, functional annotation and visualization analysis. Nucleic Acids Res. 2023;51:3017\u0026ndash;29.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKamiya M, Higashio S-Y, Isomoto A, Kim J-M, Seki M, Miyashima S, et al. Control of root cap maturation and cell detachment by BEARSKIN transcription factors in Arabidopsis. Development. 2016;143:4063\u0026ndash;72.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCho H, Cho HS, Nam H, Jo H, Park C, et al. 
Translational control of phloem development by RNA G-quadruplex\u0026ndash;JULGI determines plant sink strength. Nat Plants. 2018;4:376\u0026ndash;90.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRashid A. M K. PELPK1 (At5g09530) contains a unique pentapeptide repeat and is a positive regulator of germination in Arabidopsis thaliana. Plant Cell Rep. 2011;30:1735\u0026ndash;45.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGonz\u0026aacute;lez-Mendoza V, Zurita-Silva A, S\u0026aacute;nchez-Calder\u0026oacute;n L, S\u0026aacute;nchez-Sandoval ME, Oropeza-Aburto A, Guti\u0026eacute;rrez-Alan\u0026iacute;s D, et al. APSR1, a novel gene required for meristem maintenance, is negatively regulated by low phosphate availability. Plant Sci. 2013;205:2\u0026ndash;12.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eShahan R, Hsu C-W, Nolan TM, Cole BJ, Taylor IW, Greenstreet L, et al. A single-cell Arabidopsis root atlas reveals developmental trajectories in wild-type and cell identity mutants. Developmental Cell. 2022;57:543\u0026ndash;60. e549.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhang T-Q, Liu Y, Lin W-H. J-W. Single-cell transcriptome atlas and chromatin accessibility landscape reveal differentiation trajectories in the rice root. Nature Communications. 2021;12:2053.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eQin Y, Sun M, Li W, Xu MSL, Liu Y, et al. Single-cell RNA‐seq reveals fate determination control of an individual fibre cell initiation in cotton (Gossypium hirsutum). Plant Biotechnol J. 2022;20:2372\u0026ndash;88.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLiu H, Hu D, Du PWL, Liang X, Li H, et al. 
Single-cell RNA‐seq describes the transcriptome landscape and identifies critical transcription factors in the leaf blade of the allotetraploid peanut (Arachis hypogaea L.). Plant Biotechnol J. 2021;19:2261\u0026ndash;76.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLi G, Xu A, Sim S, Priest JR, Tian X, Khan T, et al. Transcriptomic profiling maps anatomically patterned subpopulations among single embryonic cardiac cells. Dev Cell. 2016;39:491\u0026ndash;507.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGaldos FX, Xu S, Goodyer WR, Duan L, Huang YV, Lee S, et al. devCellPy is a machine learning-enabled pipeline for automated annotation of complex multilayered single-cell transcriptomic data. Nat Commun. 2022;13:5271.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eReshef DN, Reshef YA, Finucane HK, Grossman SR, McVean PJ, et al. Detecting novel associations in large data sets. Science. 2011;334:1518\u0026ndash;24.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eReshef DN, Reshef YA, Finucane HK, Grossman SR, McVean G, Turnbaugh PJ, et al. Detecting Novel Associations in Large Data Sets. Science. 2011;334:1518\u0026ndash;24.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAlbanese D, Filosi M, Visintainer RRS, Jurman G, Furlanello C. Minerva and minepy: a C engine for the MINE suite and its R, Python and MATLAB wrappers. Bioinformatics. 2012;29:407\u0026ndash;8.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhou X, Wang X, Dougherty E, Russ D, Suh E. Gene clustering based on clusterwide mutual information. J Comput Biol. 2004;11:147\u0026ndash;61.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLiang P, Zheng L, Long C, Yang W, Yang L, 
Zuo Y. HelPredictor models single-cell transcriptome to predict human embryo lineage allocation. Briefings in Bioinformatics. 2021;22.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLiang P, Yang W, Chen X, Long CZL, Li H, et al. Machine Learning of Single-Cell Transcriptome Highly Identifies mRNA Signature by Comparing F-Score Selection with DGE Analysis. Molecular Therapy Nucleic Acids. 2020;20:155\u0026ndash;63.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eYang H, Luo Y, Ren X, Wu M, Peng B, et al. Risk Prediction of Diabetes: Big data mining with fusion of multifarious physical examination indicators. Information Fusion. 2021.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKrishnapuram B, Shah M, Smola A, Aggarwal C, Shen D, Rastogi R. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChang CC. J. LIBSVM: A library for support vector machines. ACM Trans Intell Syst Technol. 2007;2.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhu H, Hao H, Yu L. Identifying disease-related microbes based on multi-scale variational graph autoencoder embedding Wasserstein distance. BMC Biol. 2023;21:294.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLi H, Pang Y, Liu B. BioSeq-BLM: a platform for analyzing DNA, RNA, and protein sequences based on biological language models. Nucleic Acids Res. 2021;49:e129.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eYan J, Xu Y, Cheng QJS, Wang Q, Xiao Y, et al. LightGBM: accelerated genomically designed crop breeding through ensemble learning. Genome Biology. 2021;22:271.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWang Y, Zhai Y, Ding Y, Zou Q. 
SBSM-Pro: Support Bio-sequence Machine for Proteins. arXiv preprint. 2023;arXiv:2308.10275.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhang HY, Zou Q, Ju Y, Song CG. D. Distance-based Support Vector Machine to Predict DNA N6-methyladenine Modification. Curr Bioinform. 2022;17:473\u0026ndash;82.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLiu B, Gao XZH. BioSeq-Analysis2.0: an updated platform for analyzing DNA, RNA and protein sequences at sequence level and residue level based on machine learning approaches. Nucleic Acids Res. 2019;47:e127.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eScornet E. Random forests and kernel methods. IEEE Trans Inf Theory. 2015;62:1485\u0026ndash;500.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZou X, Ren L, Cai P, Zhang Y, Ding H, Deng K, et al. Accurately identifying hemagglutinin using sequence information and machine learning methods. Front Med (Lausanne). 2023;10:1281880.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhu W, Yuan SS, Li J, Huang CB, Lin H, Liao B. A First Computational Frame for Recognizing Heparin-Binding Protein. Diagnostics (Basel). 2023;13.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eJoshi PMVRR. SVM Based Approach Predicting Adverse Drug Reactions. Curr Bioinform. 2021;16:422\u0026ndash;32.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGeete KPMR. Transcription Factor Binding Site Prediction Using Deep Neural Networks. Curr Bioinform. 2020;15:1137\u0026ndash;52.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAo C, Zhou W, Gao L, Dong B, Yu L. Prediction of antioxidant proteins using hybrid feature representation method and random forest. Genomics. 2020;112:4666\u0026ndash;74.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFu X, Zhu W, Cai L, Liao B, Peng L, Chen Y, et al. Improved Pre-miRNAs Identification Through Mutual Information of Pre-miRNA Sequences and Structures. Front Genet. 
2019;10:119.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFu X, Liao B, Zhu WCL. New 3D graphical representation for RNA structure analysis and its application in the pre-miRNA identification of plants. RSC Adv. 2018;8:30833\u0026ndash;41.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eQian Y, Ding Y, Zou QGF. Multi-View Kernel Sparse Representation for Identification of Membrane Protein Types. IEEE/ACM Trans Comput Biol Bioinform. 2023;20:1234\u0026ndash;45.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAi C, Yang H, Ding YTJGF. Low Rank Matrix Factorization Algorithm Based on Multi-Graph Regularization for Detecting Drug-Disease Association. IEEE/ACM Trans Comput Biol Bioinform. 2023;20:3033\u0026ndash;43.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eTang Y, Pang YLB. IDP-Seq2Seq: identification of intrinsically disordered regions based on sequence to sequence learning. Bioinformatics. 2021;36:5177\u0026ndash;86.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZeng X, Zhang X, Zou Q. Integrative approaches for predicting microRNA function and prioritizing disease-related microRNA using biological interaction networks. Brief Bioinform. 2016;17:193\u0026ndash;203.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZulfiqar H, Guo Z, Ahmad RM, Ahmed Z, Cai P, Chen X, et al. Deep-STP: a deep learning-based approach to predict snake toxin proteins by using word embeddings. 
Front Med.2024;10.\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"Machine learning, Marker genes, scRNA-seq, Arabidopsis root tips, Cell subpopulations","lastPublishedDoi":"10.21203/rs.3.rs-4349116/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-4349116/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eSingle-cell transcriptomics technologies are essential for understanding the developmental trajectory of plant roots. Although methods for interpreting single-cell transcriptomics data are rapidly advancing in Arabidopsis, precisely annotating cell identity and the occasional lack of canonical markers for certain cell types are major challenges in plant single-cell RNA sequencing (scRNA-seq) analysis. In this work, we trained a machine learning system, namely, AtML, using sequencing datasets from six cell subpopulations (comprising a total of 6000 cells) to predict Arabidopsis root cell stages and mine biomarkers via complete model interpretability. 
The results of performance testing using an external dataset revealed that AtML achieved 96.50% accuracy and 96.51% recall. With the power of interpretability provided by AtML, our model recognized 160 important marker genes, thus contributing to the understanding of cell type annotations. In conclusion, we trained AtML to efficiently identify Arabidopsis root cell stages, thereby providing insights into cellular heterogeneity in Arabidopsis root development studies.\u003c/p\u003e","manuscriptTitle":"AtML: Accurate identification of Arabidopsis thaliana root cell identity via single-cell sequencing analysis using interpretable machine learning","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2024-05-15 04:51:55","doi":"10.21203/rs.3.rs-4349116/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"97cd2e02-2a37-491f-9fde-927a484652e6","owner":[],"postedDate":"May 15th, 2024","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[],"tags":[],"updatedAt":"2024-06-21T06:17:59+00:00","versionOfRecord":[],"versionCreatedAt":"2024-05-15 
04:51:55","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-4349116","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-4349116","identity":"rs-4349116","version":["v1"]},"buildId":"qtupq5eGEP_6zYnWcrvyt","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}

