Guiding clustering and annotation in single-cell RNA sequencing using the average overlap metric

doi:10.1101/2025.05.06.652497

2025 · doi:10.1101/2025.05.06.652497

preprint OA: closed CC-BY-NC-ND-4.0

Abstract

Defining cell types using unsupervised clustering algorithms based on transcriptional similarity is a powerful application of single-cell RNA sequencing. A single clustering resolution may not yield clusters that represent both broad, well-defined populations and smaller subpopulations simultaneously. Therefore, when cell identities are not known prior to sequencing, robust comparison and annotation of inferred de novo clusters remains a challenge. In this work, we define the distance between single-cell clusters by proposing the use of the average overlap metric to compare ranked lists of differentially expressed genes in a top-weighted manner. We first benchmark our approach in a truth-known dataset comprised of highly similar yet distinct T-cell populations and show that evaluating clusters with average overlap results in a consistent, precise, and biologically meaningful recapitulation of true cell identities. We then apply our approach to data of unsorted mouse thymocytes and characterize stages of T-cell development in the thymus, including minor populations of double-negative (CD4-CD8-) T-cells that are notoriously difficult to confidently detect in unsorted single-cell data. We demonstrate that measuring cluster similarity with average overlap of marker gene rankings enables robust, reproducible characterization of single cells and clarifies biological interpretation of their underlying identities in highly homogeneous populations.

bioRxiv preprint doi: https://doi.org/10.1101/2025.05.06.652497; this version posted May 10, 2025. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

Introduction

Understanding complex biological processes, such as hematopoiesis or tumorigenesis, requires an accurate identification of the types of cells present in a tissue sample and a mapping of their interactions with one another. Bulk RNA sequencing methods measure average gene expression across all cells [1] and cannot measure cell-to-cell gene expression variation that may arise due to functional differentiation, such as that during T-cell development, or time-dependent processes that occur across tumor clonal evolution. Meanwhile, single-cell RNA sequencing (scRNA-seq) technologies have been developed to address these challenges, leading to an improved understanding of molecular processes in complex diseases [2].

Defining cell types using unsupervised clustering algorithms based on transcriptional similarity is a powerful application of scRNA-seq. There are several steps involved in the computational analysis of scRNA-seq data [2-4], including initial quality control, normalization, clustering, and identifying differentially expressed genes, which are considered as marker genes [5] for each inferred cluster. One can then analyze these marker genes and consult the literature and reference databases to assign biological identities to each cluster. These steps are currently implemented as a complete workflow in commonly used analytical packages like Seurat [6, 7] and Scanpy [8].

Still, challenges remain in clustering of scRNA-seq data, stemming from the nature of measurements from single cells [9]. Cell identities are not known before sequencing without a prior sorting process, and noise in measuring gene expression, especially for genes expressed in small subpopulations, can confound cell clusters inferred by unsupervised community detection algorithms.
Additionally, parameters that determine the number and membership of clusters, such as the resolution in Leiden or Louvain methods [10, 11], are applied globally across an entire dataset, despite the possible presence of distinct subpopulations with varying sizes. To avert over- or under-clustering and subsequent mischaracterization of novel cell populations and their marker genes, one can directly annotate cells using their expression of pre-defined sets of marker genes [12-15]. However, direct annotation is constrained to identities present in reference datasets and may miss novel biological populations uncharacterized in the literature. Meanwhile, solutions for optimizing clustering parameters have been proposed [16-20], but consistent biological interpretation of inferred clusters remains a necessary, unresolved step.

Inferring similar cell state or identity for clusters that share key marker genes is analogous to a subjective call on two lists’ similarity based on the presence of a few key members. Conversely, the mere presence of the same genes in two or more clusters’ marker gene sets may not by itself indicate the same cell state or identity. Differences in the magnitude of differential expression in each cluster, and by extension, differences in the rankings of marker genes, may indicate a relevant differentiation trajectory or novel subpopulation. Yet, methods are lacking for the quantification of marker gene list similarity between single-cell clusters to annotate cell identity.
Metrics that compare two lists based on the ranking of their elements address this challenge by providing a measure to compare single-cell marker gene lists from cluster-based differential expression analyses. While rank-based methods for scRNA-seq analyses have been developed [21-23], none specifically address cluster similarity based on differential expression, especially when clusters are computationally determined in an unsupervised manner through community detection algorithms like Leiden or Louvain. To date, no quantification of the similarity of differentially expressed genes and their rankings has been employed in downstream scRNA-seq analysis. Here, we provide a definition for similarity between single-cell clusters based on a rank-based metric called average overlap [24] and show that it can be used to calculate the significance of related cell identities in a truth-known dataset with consistency, precision, and meaningful biological interpretation. We then demonstrate the utility of average overlap for guiding the clustering and annotation of cells in the thymus, revealing individual stages of thymocyte development.

Results

Average overlap metric for cluster marker gene comparison

To compare single-cell clusters derived from unsupervised community detection algorithms, we first defined each cluster’s marker gene set by its most differentially expressed genes relative to the rest of the cells. We then ranked the marker gene set based on the significance of differential expression and hypothesized that clusters that share marker genes with similar rankings are more likely to have similar cell identity and/or state. To test this hypothesis, we used a rank-based metric called average overlap (AO) [24], which is a top-weighted measure of ranked list similarity. In the context of marker gene sets, AO weighs differences in the rankings of the most differentially expressed genes more heavily than differences among genes lower in the set. AO is defined as the mean of the overlaps between two ranked lists calculated at a range of depths into the lists (Fig. 1a). AO scores range from 0 for completely dissimilar lists to 1 for identical lists. When calculated over randomly shuffled lists, AO closely follows a normal distribution (Supplementary Fig. 1), which provides a statistical means for assigning significance to AO scores and calculating the likelihood that two clusters share the same cell identity.
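The definition above can be made concrete with a short sketch. It assumes the standard form of average overlap, the mean over depths d = 1..k of |A[:d] ∩ B[:d]| / d, and it approximates the shuffled-list null with a simple permutation z-score. The function names, the number of shuffles, and the toy gene lists are illustrative assumptions, not the interface of the `sc_average_overlap` package.

```python
import random
import statistics


def average_overlap(list_a, list_b, depth=None):
    """Average overlap of two ranked lists, evaluated down to `depth`.

    AO = (1 / k) * sum over d = 1..k of |A[:d] & B[:d]| / d
    """
    max_depth = min(len(list_a), len(list_b))
    if depth is None:
        depth = max_depth
    if not 1 <= depth <= max_depth:
        raise ValueError("depth must be between 1 and the shorter list length")
    seen_a, seen_b = set(), set()
    total = 0.0
    for d in range(1, depth + 1):
        # Grow both prefixes by one rank and record their overlap at depth d.
        seen_a.add(list_a[d - 1])
        seen_b.add(list_b[d - 1])
        total += len(seen_a & seen_b) / d
    return total / depth


def ao_z_score(list_a, list_b, n_shuffles=1000, seed=0):
    """Z-score of the observed AO against a null built by shuffling list_b.

    Assumption: the null is approximated by random permutations of list_b,
    relying on the near-normal shape of shuffled AO scores noted in the text.
    """
    rng = random.Random(seed)
    observed = average_overlap(list_a, list_b)
    shuffled = list(list_b)
    null_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        null_scores.append(average_overlap(list_a, shuffled))
    return (observed - statistics.fmean(null_scores)) / statistics.stdev(null_scores)


# Toy rankings (hypothetical gene orderings, for illustration only).
cluster_a = ["Cd8a", "Cd8b1", "Nkg7", "Ccl5", "Gzmk", "Il7r"]
cluster_b = ["Cd8b1", "Cd8a", "Nkg7", "Ccl5", "Gzmk", "Il7r"]  # swap at the top
cluster_c = ["Cd8a", "Cd8b1", "Nkg7", "Ccl5", "Il7r", "Gzmk"]  # swap at the bottom
print(average_overlap(cluster_a, cluster_b), average_overlap(cluster_a, cluster_c))
```

Because every depth contributes equally and the top ranks appear in every prefix, a swap near the top of the list lowers the score more than the same swap near the bottom, which is the top-weighted behavior described above.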
Benchmarking the average overlap metric in truth-known data

To evaluate AO for quantifying differences between highly similar cell types, we used T-cell populations from a dataset with truth-known, biological cell identities, including equal numbers of purified T-helper, memory, naive, and regulatory CD4 T-cells as well as cytotoxic CD8 T-cells [20, 25]. We refer to this dataset as Zhengmix8eq.

We used the Piccolo workflow [26] to process and normalize the raw counts and the Leiden method [11] (resolution = 2.0) to produce single-cell clusters. After determining each cluster’s top-k marker genes (k = 5, 10, 25, 50, and 100), we calculated pairwise AO distances between the Leiden clusters based on their rankings of the union of all marker gene sets. To obtain five cell populations that corresponded to the five T-cell populations present in Zhengmix8eq, we performed hierarchical clustering of the initial Leiden clusters using the pairwise AO distances, and then iteratively merged the clusters with the highest AO similarity (Fig. 1b). We also labelled each final cluster based on the enrichment of ground truth labels and quantified the agreement between the five derived cell populations and known T-cell identities using the adjusted Rand index (ARI) (Fig. 1c), adjusted mutual information (AMI), and the Fowlkes-Mallows score (FMS) (Supplementary Fig. 2). We repeated this analysis 100 times, each with a new random seed for running the Leiden method. To benchmark AO against other metrics utilized for assessing similarities between single-cell clusters, we performed the same hierarchical clustering with pairwise Pearson, Spearman, and Kendall-Tau correlations, using the marker genes’ normalized expression counts. We also benchmarked the use of correlation metrics applied to the first 50 principal components, which summarize expression across all genes.

Five cell populations obtained from merging the Leiden clusters based on AO similarity showed the highest correspondence to true T-cell populations present in the dataset compared to all other metrics (Fig. 1c). Furthermore, AO reproducibly separated a CD8 cytotoxic T-cell population from the CD4 T-cell populations. AO was able to distinguish a CD8-enriched population in 100% of iterations with 5, 10, and 25 marker genes and in 97% of iterations with 50 and 100 marker genes. In contrast, CD8-enriched populations were not detected when we merged the Leiden clusters with Pearson, Spearman, or Kendall-Tau correlation of marker genes’ expression counts in 63-77%, 24-94%, and 30-96% of iterations, respectively (Fig. 1d).

Secondly, AO similarity consistently measured larger distances between the CD8 and CD4 populations than the pairwise distances among the CD4 sub-populations. A ratio of the inter-CD8-to-CD4 distances versus the intra-CD4 distances that is greater than one indicates larger separation of the CD8 group and the CD4 groups. When using AO, the distance ratios were consistently greater than one across all marker gene numbers of 5 (range: 1.12-1.28, standard deviation (s.d.): 0.029), 10 (range: 1.06-1.15, s.d.: 0.018), 25 (range: 1.015-1.10, s.d.: 0.022), 50 (range: 0.97-1.10, s.d.: 0.034), and 100 (range: 0.96-1.09, s.d.: 0.039). In comparison, correlation-based methods showed substantially poorer performance when applied to marker gene expression counts.
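The distance ratio described above can be sketched as follows. The text does not state how the inter- and intra-group distances are aggregated, so this sketch assumes a ratio of means; the function name and the toy distance matrix are made up for illustration.

```python
from statistics import fmean


def outgroup_distance_ratio(distance, labels, outgroup="CD8"):
    """Mean outgroup-to-rest distance divided by mean distance within the rest.

    distance: square, symmetric list of lists of pairwise distances.
    labels:   one population label per row of `distance`.
    Assumption: both numerator and denominator are simple means.
    """
    out_idx = [i for i, lab in enumerate(labels) if lab == outgroup]
    in_idx = [i for i, lab in enumerate(labels) if lab != outgroup]
    # Distances between the outgroup (e.g. CD8) and every other population.
    inter = [distance[i][j] for i in out_idx for j in in_idx]
    # Distances among the remaining populations, each unordered pair once.
    intra = [distance[i][j] for i in in_idx for j in in_idx if i < j]
    if not inter or not intra:
        raise ValueError("need one outgroup and at least two other populations")
    return fmean(inter) / fmean(intra)


# Made-up distances between four merged populations (not values from the paper).
labels = ["CD8", "CD4 helper", "CD4 memory", "CD4 naive"]
distance = [
    [0.0, 0.9, 0.8, 0.9],
    [0.9, 0.0, 0.6, 0.7],
    [0.8, 0.6, 0.0, 0.5],
    [0.9, 0.7, 0.5, 0.0],
]
ratio = outgroup_distance_ratio(distance, labels)
print(ratio > 1)  # a ratio above one means CD8 sits farther out than the CD4 groups sit from each other
```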
Pearson, Spearman, and Kendall-Tau distances yielded distance ratios greater than one, and thus distinguished CD8 populations from CD4 sub-populations, in only 75%, 81%, and 80% of iterations, respectively, at their best performance, which was achieved with 5 marker genes per cluster (Pearson range: 0.94-2.0, s.d.: 0.38; Spearman range: 0.89-5.09, s.d.: 0.65; Kendall-Tau range: 0.84-2.05, s.d.: 0.24) (Fig. 1e).

Measuring single-cell cluster similarities using a smaller number of marker genes per cluster resulted in better performance for AO and other tested rank-based metrics. Curiously, although Spearman and Kendall-Tau correlations were able to recover CD8 populations distinct from CD4 sub-populations in 100% of iterations when they were applied to 50 principal components (PCs) (Fig. 1e), merging Leiden clusters using PC-based distances resulted in the weakest correspondence to true T-cell labels present in the dataset (Fig. 1c).

Put together, AO robustly and reproducibly measured similarities between single-cell clusters based on the ranking of marker genes and significantly outperformed correlation-based metrics in identifying and characterizing T-cell populations present in Zhengmix8eq.

Thymic T-cell development at single-cell resolution

Specific stages of T-cell development in the thymus have been characterized with single-cell transcriptomic studies, starting from double negative (DN) populations DN1-DN4, advancing to immature single positive (ISP) and double positive (DP), and eventually mature CD4 and CD8 T-cells.
While the T-cell development trajectory in the thymus has been partially resolved in normal and diseased states, characterizing double negative cell subpopulations using single-cell approaches has been especially challenging, in particular when measuring total thymocytes without prior cell sorting, leading to grouping of all DN subpopulations together without specifically differentiating DN1-DN4 states [27-30]. To address this challenge and to demonstrate the utility of the AO metric in elucidating the relationships between thymic cell populations, we used our methodology to analyze previously published single-cell transcriptomic data from mouse thymocytes [28]. We processed the data using the Piccolo workflow and used the Leiden method for cell clustering (resolution = 1.0), obtaining 19 single-cell clusters (Fig. 2a). Marker gene sets for each single-cell cluster were defined by performing differential gene expression analysis relative to all other cells. We then combined each cluster’s marker gene set, composed of its top 25 genes ranked by significance, and performed hierarchical clustering using pairwise AO distances between each cluster’s rankings of the marker genes, testing the hypothesis that T-cell clusters with high AO correspond to cell populations with similar identity and state (Fig. 2b,c).
To guide annotating T-cell identities for unsupervised clusters, we utilized the MyGeneSet tool, available as a part of the Immunological Genome Project (ImmGen) [31], and mapped the expression of each cluster’s top 50 marker genes across 12 sorted mouse thymic cell populations using composite z-scores (Fig. 2d). This analysis linked the expression profiles of the 19 de novo single-cell clusters to MPP4 (a subset of multipotent progenitors from the bone marrow), DN1, DN2a, DN2b, DN3, DN4, ISP, DP, mature CD4 and CD8, natural killer, and γδ T-cells in ImmGen cell populations [32].

The cell cycle characterizes thymic single-cell clusters

Marker genes in multiple single-cell clusters included those related to phases of the cell cycle. Hierarchical clustering based on AO of marker gene sets identified distinct groups of single-cell clusters; therefore, we hypothesized that similarity in the rankings of differentially expressed genes, as quantified by AO, would group the clusters based on cell cycle phases. We tested this hypothesis by scoring each cell based on the expression of known cell cycle genes and predicting its cell cycle phase [33] (Fig. 2b). We found that clusters 1, 5, and 9 were enriched with cells in the S phase while clusters 2, 6, 7, 8, 10, and 11 were enriched with cells in the G2/M phase (Fig. 2c). The remaining clusters were enriched with cells in the G1 phase and split into two groups: clusters 0, 3, 4, 16, 17, and 18, and clusters 12, 13, 14, and 15 (Fig. 2c).
These marker genes associated with the cell cycle were also highly expressed in multiple ImmGen bulk populations, including the DN2b, DN4, and ISP populations (Fig. 2d). These results suggest that the groups that emerge from AO clustering primarily reflect cell cycle signatures known to play crucial roles in the identities of thymic cell populations, particularly pointing to stages of T-cell development involving rapid expansion and proliferation such as those that occur after T-lineage commitment and β-selection [34].

Annotation of thymic single-cell clusters during development

Using the annotations from MyGeneSet and ImmGen sorted populations, we found that the earliest thymocyte progenitors (MPP4 and DN1 cells) mapped most strongly to cluster 0, which had marker genes that did not show significant expression in any other ImmGen thymic population. At the same time, cluster 0 did not show high similarity to any other cluster according to AO; thus, we annotated cluster 0 as a mix of multipotent progenitors (MPP) and DN1 cells.

Cluster 1 corresponded strongly to DN2a populations and cluster 2 was inferred as DN2b cells based on its marker genes’ expression specifically in the DN2a/b and earlier progenitor populations. Marker genes for cluster 2 were also highly expressed in proliferating DN4 and ISP populations, stemming from a 36% overlap with known cell cycle genes, including Hmmr and Nusap1.

Clusters 3 and 4 marker genes were highly expressed in DN3 populations and showed marked AO similarity.
In addition, marker genes for cluster 5 showed relatively high expression in DN3 cells. For deeper characterization of these cells into DN3a and DN3b populations, we scored cells in clusters 3, 4, and 5 against genes upregulated in purified wildtype DN3a and DN3b cells [35]. While cluster 4 highly correlated with DN3a, and cluster 3 modestly correlated with both phases, cluster 5, which is in the S phase of the cell cycle, showed similarity to DN3b cells, together indicating a transition of cells from DN3a to DN3b states (Supplementary Fig. 3). Accordingly, we merged clusters 3 and 4 together to form a DN3a population, and annotated cluster 5 as the DN3b population.

Clusters 6, 7, 8, and 9 mapped strongly to ImmGen DN4 and ISP populations. While cluster 9 was distinguished as DN4/ISP cells in the S phase of the cell cycle, clusters 6, 7, and 8 grouped together based on their pairwise AO and association with the G2/M phase of the cell cycle. Of note is cluster 8, which contained cells with low expression of a select set of genes that were not expressed in any other clusters (Supplementary Fig. 4). Relative to cells in clusters 6 and 7, cells in cluster 8 showed an upregulation of mt-Co1, mt-Co2, mt-Co3, and other mitochondrial genes. In fact, mitochondrial genes made up on average 7.8% of this cluster’s expressed transcripts, indicating that a portion of cluster 8 would fall just under the 12% mitochondrial expression threshold we used for filtering low-quality cells at the start of analysis. Put together, we interpreted cluster 8 as representing apoptotic DN4/ISP cells (Supplementary Fig. 5).

Cluster 10 showed the highest correspondence to ImmGen ISPs, while cluster 11 was inferred as an intermediate population between the ISP and DP stages based on its marker genes’ expression in late development T-cells (Supplementary Fig. 4). Clusters 12, 13, and 14 were identified as parts of a larger DP population, as they each mapped strongly to ImmGen DP cells and showed the highest AO similarities across all clusters. The maturing stages of T-cell development were represented by cells in cluster 15 as single positive CD4 and cluster 16 as single positive CD8 T-cells. Finally, clusters 17 and 18 were assigned to smaller groups of natural killer cells and γδ T-cells, respectively.

Lastly, we explored characterizing the stages of T-cell development in our data using orthogonal methods. First, we asked whether pseudotime inference could recover cell identities in these developing T-cells and used diffusion pseudotime [36], specifying cluster 0 as the starting point. The inferred trajectory matched our annotated stages of development and supported a later developmental trajectory for cells in clusters 6 and 7 compared to cells in cluster 2, hence partially distinguishing DN2b from DN4/ISP cells despite their being in the same cell cycle phase (Supplementary Fig. 6a). This pattern, however, did not hold when cluster 2 was chosen as the analysis’s starting point (Supplementary Fig. 6b). Cells in clusters 6 and 7 were assigned the same starting pseudotime value of 0 as the chosen cluster 2, indicating that pseudotime calculations can be confounded by strong cell cycle signatures. Second, we asked whether Pearson correlation applied to the expression counts of the marker genes could infer T-cell similarities.
The resulting hierarchical clustering, however, did not consistently group the single-cell clusters based on phases of cell cycle or stages of T-cell development (Supplementary Fig. 7), highlighting the strength of AO in inferring biologically meaningful relationships by using the rankings of differentially expressed genes as opposed to their counts, especially in the context of highly similar cells.

Overall, our extended analyses of truth-known hematopoietic populations as well as unsorted mouse thymocytes showed that AO could accurately guide the clustering and annotation of highly homogeneous cell populations and help characterize the trajectory of T-cell development in the thymus, including the detection of elusive double-negative (CD4-CD8-) cells. We have implemented the AO metric in Python and have published the ‘sc_average_overlap’ Python package, which is available at https://github.com/chrisvthai/sc_average_overlap and is designed to work seamlessly within the Scanpy framework.

Discussion

In this work, we propose using average overlap, a top-weighted metric that quantifies the similarity of ranked lists, to compare clusters in scRNA-seq analysis. Hierarchical clustering using rank- and correlation-based metrics to compare the transcriptomic profiles of inferred clusters in scRNA-seq analysis is not new. Current implementations, including those in Seurat and Scanpy, utilize distances based on gene expression counts or principal components. In contrast, AO measures cluster similarity by relying on marker gene rankings derived from differential gene expression analysis.

We compared AO to correlation-based metrics using Leiden clusters derived from a biological, ground-truth dataset containing five transcriptionally similar T-cell populations [20, 25]. We showed that merging unsupervised clusters into five groups with AO based on gene rankings resulted in cell labels that corresponded to ground truth populations more accurately than other tested methods. AO was also far more consistent in its performance and, over many Leiden clustering iterations, produced merged populations that were biologically meaningful by differentiating CD8 cytotoxic populations from the CD4 clusters. Expression counts across highly similar populations are highly correlated irrespective of the feature selection procedure and the number of cluster marker genes used.
This high transcriptional correlation often results in inconsistencies that make it a challenge to distinguish subtle differences between T-cell subpopulations and to infer cell clusters that map back to individual subpopulations. Restricting analysis to just gene rankings based on differential expression in our implementation of AO allows robust measurement of subtle transcriptional differences between single-cell clusters.

Moreover, AO performed well compared to other distance metrics due to its top-weighted property. While performance for all metrics diminished as more marker genes were included, AO-based hierarchical clustering was minimally affected. Reincorporating information about gene expression variation by way of utilizing principal components recovered the performance of correlation-based metrics to an extent. We interpret this observation as evidence for the effect commonly known as the curse of dimensionality [37], where the distances between clusters become increasingly similar as a higher number of marker genes are added. In contrast, differences at the top of rankings, as measured by AO, correspond to the most differentially expressed genes and reflect cellular differences more precisely than the genes at the bottom of the rankings. As such, AO is not as sensitive to the curse of dimensionality as the other benchmarked metrics.

AO can quantify subtle differences in otherwise highly transcriptionally correlated cell populations.
In our thymus data, AO-based hierarchical clustering revealed cell cycle genes as the strongest drivers of marker gene similarity. However, showing high correlation with ImmGen reference expression profiles was not sufficient to characterize cell cycle signatures in single-cell clusters that corresponded to DN2b, DN4, and ISP cells at once. By measuring differential gene rankings, AO inferred cell cycle phases with great specificity and guided identification of stages of T-cell development involving increased expansion and proliferation. Additionally, AO helped characterize a small subset of cells in the G2/M phase as a population of apoptotic DN4/ISP cells. Cells in this cluster were retained simply due to the choice of mitochondrial expression threshold used to account for low-quality cells, yet showed relatively high correspondence to ImmGen DN4 and ISP populations, supporting their biological relevance. Thymocytes may undergo positive selection based on their ability to bind to MHC ligands; thymocytes unable to bind MHC undergo death by neglect [38]. The combination of relatively high mitochondrial gene expression, which typically indicates dying cells, high correspondence to bulk DN4 and ISP populations in ImmGen, and the detection and quantification of unique marker genes via the Leiden clustering algorithm and the AO metric supports this interpretation. Altogether, unique subpopulations of thymocytes characterized with the aid of AO may be missed when relying solely on reference transcriptomic profiles.

The analysis and comparison of cluster marker genes are necessary steps in uncovering novel cell populations. As we have demonstrated with our analysis of thymus data, one clustering resolution may not yield clusters that simultaneously represent broad well-defined populations as well as smaller novel subpopulations.
Moreover, while some clusters may separate due to existing biological variations, others may arise due to noisy clustering. AO-measured distances based on the ranking of differentially expressed genes provide a quantitative method for evaluating transcriptional similarity independent of clustering algorithms and parameters used. In the context of differentiating cells, alternative methods such as pseudotime inference, which calculates the relative position of a cell across gene expression gradients, can theoretically aid cluster annotation [36, 39, 40]. However, pseudotime algorithms rely heavily on the choice of starting population, and as we showed in an application to our thymus data, they may group cells in the same cell cycle phase despite transcriptional correspondence to different biological populations. Additionally, when a dataset contains many distinct groups of cells, a single trajectory cannot be inferred and a quantitative method for cluster comparison is still needed.

In conclusion, we propose average overlap as a metric for direct quantification of marker gene similarity with broad applicability in diverse biological settings, suited to explore and clarify the heterogeneity of cells in the most challenging contexts that arise from differentiating cells or involve clonal populations, such as those driving cancer progression.

Methods

The sc_average_overlap Python package. We have implemented the average overlap metric in Python and have published the ‘sc_average_overlap’ Python package, which is available at https://github.com/chrisvthai/sc_average_overlap and is designed to work seamlessly with Scanpy, a commonly used Python library for scRNA-seq analyses. The package includes functions for computing pairwise AO scores between two clusters, performing hierarchical clustering of cell populations, and storing the resulting dendrogram in Scanpy’s native AnnData object. For hierarchical clustering, the AO values are subtracted from one to generate a distance metric, since AO ranges between 0 and 1, with higher overlap scores meaning a lower distance, and vice versa.

Benchmarking average overlap’s performance. To benchmark AO when used as the metric for hierarchical clustering of cell populations in single-cell RNA-seq data, we utilized the Zhengmix8eq dataset, generated for the purpose of benchmarking clustering performance in scRNA-seq analysis20. We used only five T-cell populations, which were CD8 cytotoxic cells and CD4 populations including helper, memory, naïve, and regulatory cells. We processed the data using Piccolo26 to perform feature selection and normalization, selecting 3,000 highly variable genes for downstream analysis. We then performed PCA and Leiden clustering with a resolution of 2.0 using Scanpy. We performed this process and all subsequent downstream analyses 100 times in total, each with a different random seed given to the Leiden algorithm.
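As an illustration of the score described for the package above, the following is a minimal sketch of average overlap between two ranked gene lists and of the 1 - AO distance used for hierarchical clustering. It follows the definition summarized in Figure 1a (the mean, over depths, of the overlap between the two top-d prefixes). The function names average_overlap and ao_distance and the toy gene lists are hypothetical; they are not taken from the sc_average_overlap package, whose actual interface may differ.

```python
def average_overlap(ranking_a, ranking_b, depth=None):
    """Average overlap (AO) of two ranked lists, evaluated down to `depth`.

    At each depth d the overlap is |top-d of A ∩ top-d of B| / d; AO is the
    mean of these overlaps over d = 1..depth, so agreement near the top of
    the lists contributes to more terms than agreement near the bottom.
    """
    max_depth = min(len(ranking_a), len(ranking_b))
    if depth is None:
        depth = max_depth
    if not 1 <= depth <= max_depth:
        raise ValueError("depth must be between 1 and the shorter list length")

    seen_a, seen_b = set(), set()
    overlap_sum = 0.0
    for d in range(1, depth + 1):
        seen_a.add(ranking_a[d - 1])
        seen_b.add(ranking_b[d - 1])
        overlap_sum += len(seen_a & seen_b) / d
    return overlap_sum / depth


def ao_distance(ranking_a, ranking_b, depth=None):
    """Distance derived from AO: identical rankings give 0, disjoint ones 1."""
    return 1.0 - average_overlap(ranking_a, ranking_b, depth)


# Toy rankings of five genes in two clusters (illustrative values only).
cluster_1 = ["G1", "G2", "G3", "G4", "G5"]
cluster_2 = ["G2", "G1", "G3", "G5", "G4"]

# Overlaps at depths 1..5 are 0, 1, 1, 0.75, 1, so AO = 3.75 / 5 = 0.75.
print(average_overlap(cluster_1, cluster_2))  # 0.75
print(ao_distance(cluster_1, cluster_2))      # 0.25
```

Because the overlap at each depth is symmetric in its two arguments, distances computed this way for every pair of clusters form a symmetric matrix that a standard hierarchical clustering routine could take as input.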
We generated marker genes for each resulting single-cell cluster by performing differential expression analysis using the Wilcoxon rank-sum test, comparing each cluster’s gene expression against that of the remaining cells and ranking genes by significance. We then grouped single-cell clusters using hierarchical clustering. The metrics used for the hierarchical clustering included Pearson correlation, Spearman correlation, Kendall-Tau coefficient, and AO, and were calculated with either the set of all cluster marker genes, composed of the union of each unsupervised cluster’s top 5, 10, 25, 50, or 100 differentially expressed marker genes, or the values of the first 50 principal components. When using the set of all cluster marker genes, Pearson, Spearman, and Kendall-Tau coefficients were calculated using each cluster’s expression values for each gene, while AO was calculated using the rankings of the cluster marker genes. When using principal component values, only Pearson, Spearman, and Kendall-Tau coefficients were calculated.

Once hierarchical clustering was computed, the clusters were iteratively merged in an automated manner, guided by the pair of clusters with the highest AO similarity, until five final clusters remained, corresponding to the five represented T-cell populations in the Zhengmix8eq dataset. Performance was evaluated in two ways. Firstly, using the ground truth labels, we calculated adjusted Rand index (ARI), adjusted mutual information (AMI), and the Fowlkes-Mallows score (FMS).
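To make the first evaluation concrete, the sketch below computes two of the three scores, ARI and the Fowlkes-Mallows score, from pair counts. It uses the standard textbook formulation rather than the authors' code, and in practice a library implementation would normally be used. The function names and the toy labels are hypothetical.

```python
from collections import Counter
from math import comb, sqrt


def pair_counts(labels_true, labels_pred):
    """Count item pairs grouped together by both labelings, by the truth,
    and by the prediction, plus the total number of pairs."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    if len(labels_true) < 2:
        raise ValueError("at least two items are required")
    both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    true = sum(comb(c, 2) for c in Counter(labels_true).values())
    pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    return both, true, pred, comb(len(labels_true), 2)


def adjusted_rand_index(labels_true, labels_pred):
    """Rand index corrected for chance agreement (1.0 = identical partitions)."""
    both, true, pred, total = pair_counts(labels_true, labels_pred)
    expected = true * pred / total
    maximum = 0.5 * (true + pred)
    if maximum == expected:
        # Only happens when both partitions are trivial and identical.
        return 1.0
    return (both - expected) / (maximum - expected)


def fowlkes_mallows(labels_true, labels_pred):
    """Geometric mean of pairwise precision and recall."""
    both, true, pred, _ = pair_counts(labels_true, labels_pred)
    if true == 0 or pred == 0:
        return 0.0
    return both / sqrt(true * pred)


# Toy example: four cells, two true populations, three predicted clusters.
truth = ["CD8", "CD8", "CD4", "CD4"]
predicted = [0, 0, 1, 2]
print(adjusted_rand_index(truth, predicted))  # 4/7, about 0.571
print(fowlkes_mallows(truth, predicted))      # 1/sqrt(2), about 0.707
```

Both scores depend only on which cells are grouped together, so they are unaffected by how the final clusters happen to be numbered.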
Secondly, based on labeling each of the five final clusters according to its most enriched-for ground-truth label, we evaluated the separation of CD8 populations from CD4 populations. For the metrics and feature sets used, we counted the occurrences where no clusters enriched with CD8 cytotoxic cells were identified. Additionally, in each clustering iteration, we calculated the ratio of the mean pairwise distances between the CD8 population (if identified) and each CD4 population versus the mean pairwise distances among the CD4 populations. In each occurrence where no CD8 cluster was detected, we set the mean distance ratio between CD8 and CD4 groups to 1.0 to indicate no detected difference in separation between CD8 and CD4 groups.

Thymocyte analysis using average overlap. We applied the AO metric to a previously published dataset of mouse thymocytes28. We first performed feature selection and normalization using Piccolo, excluding cells expressing more than 12% mitochondrial genes or 80% ribosomal genes and selecting 3,000 highly variable genes for downstream analysis. We performed Principal Component Analysis (PCA) and Leiden clustering at resolution 1.0 using Scanpy, and visualized the single-cell clusters using Uniform Manifold Approximation and Projection (UMAP). We performed differential gene expression analysis using the Wilcoxon rank-sum test and assigned the top 25 differentially expressed genes, ranked by significance, as a set of marker genes in each cluster.
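The quality-control thresholds mentioned above can be illustrated with a minimal per-cell filter. This sketch assumes that the percentages refer to the fraction of a cell's total counts assigned to mitochondrial or ribosomal genes, and that such genes can be recognized by the mouse gene-symbol prefixes mt-, Rps, and Rpl. Both are assumptions made for illustration, as are the function name and the toy counts; the filtering in the study may have been implemented differently.

```python
def passes_qc(gene_counts, max_mito_fraction=0.12, max_ribo_fraction=0.80,
              mito_prefixes=("mt-",), ribo_prefixes=("Rps", "Rpl")):
    """Return True if a cell is at or below both count-fraction thresholds.

    gene_counts maps gene symbol -> raw count for one cell. The prefixes are
    an assumed convention for mouse gene symbols, not taken from the paper.
    """
    total = sum(gene_counts.values())
    if total == 0:
        return False  # a cell with no counts carries no usable signal
    mito = sum(n for gene, n in gene_counts.items() if gene.startswith(mito_prefixes))
    ribo = sum(n for gene, n in gene_counts.items() if gene.startswith(ribo_prefixes))
    return mito / total <= max_mito_fraction and ribo / total <= max_ribo_fraction


# Toy cells (illustrative counts only).
cells = {
    "cell_a": {"mt-Co1": 5, "Rps3": 20, "Cd3e": 75},   # 5% mito, 20% ribo
    "cell_b": {"mt-Co1": 20, "Rpl4": 10, "Cd3e": 70},  # 20% mito: excluded
    "cell_c": {"mt-Nd1": 2, "Rps3": 85, "Cd3e": 13},   # 85% ribo: excluded
}
kept = [name for name, counts in cells.items() if passes_qc(counts)]
print(kept)  # ['cell_a']
```

Keeping the thresholds as parameters makes it easy to see how a looser or stricter mitochondrial cutoff changes which cells are retained.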
As we were interested in T-cell populations only, we excluded small populations of granulocytes and B-cells detected through scoring against known markers of each cell type. We combined all clusters’ marker genes into a single set for calculating AO, which was then used to compute pairwise distances between the clusters and to perform hierarchical clustering.

We obtained and visualized the expression of each set of cluster marker genes in 12 sorted populations of thymic cells in mice from the ImmGen project using the MyGeneSet tool31. To summarize the expression of each single-cell cluster’s marker genes in ImmGen sorted bulk populations, we first transformed the expression counts into a normalized z-score and then combined the transformed values into a single composite z-score using Stouffer’s method.

We used Scanpy’s score_genes function for scoring each cell according to phases of the cell cycle using a list of 97 known cell cycle marker genes33. We used the same scoring method to distinguish between DN3a and DN3b populations based on a list of genes upregulated in purified wildtype DN3a/b cells35.

FUNDING
This work was supported by the National Institutes of Health (R01CA236936 and R01CA285513), the V Foundation (T2019-012 and T2023-024), The Leukemia & Lymphoma Society (Scholar Award 1386-23), the New Jersey Commission on Cancer Research (COCR23PRG006) and Rutgers Cancer Institute of New Jersey Biomedical Informatics Shared Resource (P30CA072720-5917). CT was a fellow of the Biotechnology Training Program at Rutgers University (NIH T32 GM135141). AS was supported by the New Jersey Commission on Cancer Research (COCR24PDF015).

AUTHOR CONTRIBUTION
CT and HK conceived the study. CT performed the analytical experiments and conducted all analyses with contribution and input from AS. DH and HK supervised the work. All authors drafted the manuscript. All authors read and approved the final manuscript.

CONFLICTS OF INTEREST
CT, AS, and DH declare no existing conflict of interest. HK is a full-time employee of Regeneron Pharmaceuticals.

FIGURES AND FIGURE LEGENDS

FIGURE 1.
Workflow and benchmarking of average overlap for quantitative comparison of inferred single-cell clusters in scRNA-seq data. a, Calculating the average overlap distance between two ranked lists and comparison with correlation-based metrics. Differentially expressed genes (G1 to G5) are ranked in clusters C1 and C2 and the overlap between the ranked lists is calculated at each depth. Average overlap of the entire lists is the mean of all overlaps. b, A schematic demonstrating the workflow for hierarchical clustering using average overlap, following community detection, marker gene identification, and pairwise distance calculation. c, Adjusted Rand Index (ARI) performance measures based on agreement with ground truth labels in five derived cell populations, across different metrics and number of marker genes used, using the T-cells in the Zhengmix8eq dataset. Adjusted mutual information (AMI) and the Fowlkes-Mallows score (FMS) are shown in Supplementary Fig. 2. d, Percentage of benchmarking trials in which a final cluster enriched with ground-truth CD8 naïve cytotoxic population was identified. e, Ratio of CD8 to CD4 distances versus intra-CD4 distances. A ratio higher than 1 indicates a higher degree of separation between CD8 cytotoxic and CD4 populations versus the separation among the CD4 populations.

FIGURE 2. Utilizing average overlap to characterize stages of T-cell development in a mouse thymus. a, A UMAP plot of mouse thymocytes showing 19 inferred clusters using the Leiden algorithm. b, Cluster tree generated from hierarchical clustering of Leiden clusters with average overlap of marker gene rankings, based on the combined set of each cluster’s top 25 marker genes. Clusters largely group based on differential expression of genes related to the cell cycle. c, A UMAP plot of the mouse thymocytes, where each cell is annotated with its inferred cell cycle phase. d, A heatmap summarizing expression of Leiden cluster marker genes in sorted bulk populations of thymocytes from ImmGen. Each cluster is also colored with its inferred cell cycle phase. e, Final annotations of developing mouse thymocytes.

SUPPLEMENTARY FIGURE LEGENDS

Supplementary Figure 1. Distributions of pairwise average overlap scores on randomly shuffled lists. Across 2,000 iterations, randomly shuffling two ranked lists that contain the same set of elements yields average overlap distances that closely follow a normal distribution with a mean of 0.5 and a variance that is inversely correlated with the length of the list.

Supplementary Figure 2. a, Adjusted Mutual Information (AMI) and b, Fowlkes-Mallows index (FMI) performance measures based on the enrichment of ground truth labels in five derived cell populations, across different metrics and number of marker genes used in the T-cells in the Zhengmix8eq dataset.

Supplementary Figure 3. Upregulation of DN3a- and DN3b-related genes in developing mouse thymocytes. Cells were scored by expression of genes upregulated in purified wildtype DN3a and DN3b cells, as determined in the study by Vogel et al.35.

Supplementary Figure 4. a, The expression of top 5 marker genes for each annotated cell population. b, The expression of marker genes expressed in more than 50% of their respective cell population.

Supplementary Figure 5. Pairwise differential expression between apoptotic and normal DN4/ISP cells in G2/M phase. Differentially expressed genes between these groups were determined with the Wilcoxon rank-sum test. Normalized expression of the top 20 genes upregulated in the apoptotic group is shown.

Supplementary Figure 6. Diffusion pseudotime inferred on developing thymocytes. Among the group of cells that map to double negative thymocytes, when specifying cluster 0 as the starting point, the inferred trajectory closely matches our annotated stages of development. However, when setting cluster 2 as the start, two additional groups of cells, clusters 7 and 8, also share the same pseudotime value of 0.

Supplementary Figure 7. Hierarchical clustering of Leiden clusters in thymus data using Pearson correlation. Pearson correlation is calculated with the same set of top 25 marker genes in each Leiden cluster as in the previous analysis using the average overlap metric. Grouping based on either cell cycle phase or neighboring stages of T-cell development is inconsistent.

References

1. Li, X. & Wang, C.Y. From bulk, single-cell to spatial RNA sequencing. Int J Oral Sci 13, 36 (2021).
2. Luecken, M.D. & Theis, F.J. Current best practices in single-cell RNA-seq analysis: a tutorial. Molecular Systems Biology 15, e8746 (2019).
3. Haque, A., Engel, J., Teichmann, S.A. & Lonnberg, T. A practical guide to single-cell RNA-sequencing for biomedical research and clinical applications. Genome Med 9, 75 (2017).
4. Hwang, B., Lee, J.H. & Bang, D. Single-cell RNA sequencing technologies and bioinformatics pipelines. Exp Mol Med 50, 1-14 (2018).
5. Pullin, J.M. & McCarthy, D.J. A comparison of marker gene selection methods for single-cell RNA sequencing data. Genome Biol 25, 56 (2024).
6. Hao, Y. et al. Dictionary learning for integrative, multimodal and scalable single-cell analysis. Nat Biotechnol 42, 293-304 (2024).
7. Satija, R., Farrell, J.A., Gennert, D., Schier, A.F. & Regev, A. Spatial reconstruction of single-cell gene expression data. Nat Biotechnol 33, 495-502 (2015).
8. Wolf, F.A., Angerer, P. & Theis, F.J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol 19, 15 (2018).
9. Kiselev, V.Y., Andrews, T.S. & Hemberg, M. Challenges in unsupervised clustering of single-cell RNA-seq data. Nat Rev Genet 20, 273-282 (2019).
10. Blondel, V.D., Guillaume, J.-L., Lambiotte, R. & Lefebvre, E. Fast unfolding of communities in large networks. Journal of statistical mechanics: theory and experiment 2008, P10008 (2008).
11. Traag, V.A., Waltman, L. & van Eck, N.J. From Louvain to Leiden: guaranteeing well-connected communities. Sci Rep 9, 5233 (2019).
12. Hou, R., Denisenko, E. & Forrest, A.R.R. scMatch: a single-cell gene expression profile annotation tool using reference datasets. Bioinformatics 35, 4688-4695 (2019).
13. Aran, D. et al. Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage. Nat Immunol 20, 163-172 (2019).
14. Ianevski, A., Giri, A.K. & Aittokallio, T. Fully-automated and ultra-fast cell-type identification using specific marker combinations from single-cell transcriptomic data. Nat Commun 13, 1246 (2022).
15. Pasquini, G., Rojo Arias, J.E., Schafer, P. & Busskamp, V. Automated methods for cell type annotation on scRNA-seq data. Comput Struct Biotechnol J 19, 961-969 (2021).
16. Patterson-Cross, R.B., Levine, A.J. & Menon, V. Selecting single cell clustering parameter values using subsampling-based robustness metrics. BMC Bioinformatics 22, 39 (2021).
17. Kim, T. et al. Impact of similarity metrics on single-cell RNA-seq data clustering. Brief Bioinform 20, 2316-2326 (2019).
18. Yu, L., Cao, Y., Yang, J.Y.H. & Yang, P. Benchmarking clustering algorithms on estimating the number of cell types from single-cell RNA-sequencing data. Genome Biol 23, 49 (2022).
19. Peyvandipour, A., Shafi, A., Saberian, N. & Draghici, S. Identification of cell types from single cell data using stable clustering. Sci Rep 10, 12349 (2020).
20. Duo, A., Robinson, M.D. & Soneson, C. A systematic performance evaluation of clustering methods for single-cell RNA-seq data. F1000Res 7, 1141 (2018).
21. Xu, Y. et al. A Gene Rank Based Approach for Single Cell Similarity Assessment and Clustering. IEEE/ACM Trans Comput Biol Bioinform 18, 431-442 (2021).
22. Vargo, A.H.S. & Gilbert, A.C. A rank-based marker selection method for high throughput scRNA-seq data. BMC Bioinformatics 21, 477 (2020).
23. Oulas, A., Savva, K., Karathanasis, N. & Spyrou, G.M. Ranking of cell clusters in a single-cell RNA-sequencing analysis framework using prior knowledge. PLoS Comput Biol 20, e1011550 (2024).
24. Webber, W., Moffat, A. & Zobel, J. A Similarity Measure for Indefinite Rankings. ACM Transactions on Information Systems 28, 20:1-20:38 (2010).
25. Zheng, G.X. et al. Massively parallel digital transcriptional profiling of single cells. Nat Commun 8, 14049 (2017).
26. Singh, A. & Khiabanian, H. Feature selection followed by a novel residuals-based normalization that includes variance stabilization simplifies and improves single-cell gene expression analysis. BMC Bioinformatics 25, 248 (2024).
27. Oh, S. et al. Distinct subpopulations of DN1 thymocytes exhibit preferential gammadelta T lineage potential. Front Immunol 14, 1106652 (2023).
28. Tottone, L. et al. A Tumor Suppressor Enhancer of PTEN in T-cell development and leukemia. Blood Cancer Discov 2, 92-109 (2021).
29. Park, J.E. et al. A cell atlas of human thymic development defines T cell repertoire formation. Science 367 (2020).
30. Belver, L. et al. GATA3-Controlled Nucleosome Eviction Drives MYC Enhancer Activity in T-cell Development and Leukemia. Cancer Discov 9, 1774-1791 (2019).
31. Heng, T.S., Painter, M.W. & Immunological Genome Project Consortium. The Immunological Genome Project: networks of gene expression in immune cells. Nat Immunol 9, 1091-1094 (2008).
32. Yoshida, H. et al. The cis-Regulatory Atlas of the Mouse Immune System. Cell 176, 897-912 e820 (2019).
33. Tirosh, I. et al. Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq. Science 352, 189-196 (2016).
34. Oh, S. et al. Mapping the two distinct proliferative bursts early in T-cell development. Immunol Cell Biol 101, 766-774 (2023).
35. Vogel, K.U., Bell, L.S., Galloway, A., Ahlfors, H. & Turner, M. The RNA-Binding Proteins Zfp36l1 and Zfp36l2 Enforce the Thymic beta-Selection Checkpoint by Limiting DNA Damage Response Signaling and Cell Cycle Progression. J Immunol 197, 2673-2685 (2016).
36. Haghverdi, L., Buttner, M., Wolf, F.A., Buettner, F. & Theis, F.J. Diffusion pseudotime robustly reconstructs lineage branching. Nat Methods 13, 845-848 (2016).
37. Bellman, R. & Rand Corporation. Dynamic programming. (Princeton University Press, Princeton; 1957).
38. Baldwin, T.A., Hogquist, K.A. & Jameson, S.C. The fourth way? Harnessing aggressive tendencies in the thymus. J Immunol 173, 6515-6520 (2004).
39. Street, K. et al. Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics. BMC Genomics 19, 477 (2018).
40. Trapnell, C. et al. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat Biotechnol 32, 381-386 (2014).

License: CC-BY-NC-ND-4.0