Abstract
Defining cell types using unsupervised clustering algorithms based on transcriptional
similarity is a powerful application of single-cell RNA sequencing. A single clustering
resolution may not yield clusters that represent both broad, well-defined populations and
smaller subpopulations simultaneously. Therefore, when cell identities are not known
prior to sequencing, robust comparison and annotation of inferred de novo clusters
remain a challenge. In this work, we define the distance between single-cell clusters
using the average overlap metric, which compares ranked lists of differentially
expressed genes in a top-weighted manner. We first benchmark our approach in a
truth-known dataset composed of highly similar yet distinct T-cell populations and show that
evaluating clusters with average overlap results in a consistent, precise, and biologically
meaningful recapitulation of true cell identities. We then apply our approach to data from
unsorted mouse thymocytes and characterize stages of T-cell development in the thymus,
including minor populations of double-negative (CD4-CD8-) T-cells that are notoriously
difficult to confidently detect in unsorted single-cell data. We demonstrate that measuring
cluster similarity with average overlap of marker gene rankings enables robust,
reproducible characterization of single cells and clarifies biological interpretation of their
underlying identities in highly homogeneous populations.
bioRxiv preprint doi: https://doi.org/10.1101/2025.05.06.652497; this version posted May 10, 2025.
The copyright holder for this preprint (which was not certified by peer review) is the author/funder,
who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a
CC-BY-NC-ND 4.0 International license.
Introduction
Understanding complex biological processes, such as hematopoiesis or tumorigenesis,
requires an accurate identification of the types of cells present in a tissue sample and a
mapping of their interactions with one another. Bulk RNA sequencing methods measure
average gene expression across all cells [1] and cannot measure cell-to-cell variation in gene
expression that may arise from functional differentiation, such as that occurring during
T-cell development, or from time-dependent processes that occur across tumor clonal
evolution. Single-cell RNA sequencing (scRNA-seq) technologies have been
developed to address these challenges, leading to an improved understanding of
molecular processes in complex diseases [2].

Defining cell types using unsupervised clustering algorithms based on transcriptional
similarity is a powerful application of scRNA-seq. There are several steps involved in the
computational analysis of scRNA-seq data [2-4], including initial quality control,
normalization, clustering, and identification of differentially expressed genes, which are
treated as marker genes [5] for each inferred cluster. One can then analyze these
marker genes and consult the literature and reference databases to assign biological
identities to each cluster. These steps are currently implemented as a complete workflow
in commonly used analytical packages like Seurat [6, 7] and Scanpy [8].

Still, challenges remain in clustering scRNA-seq data, stemming from the nature of
measurements from single cells [9]. Cell identities are not known before sequencing without
a prior sorting process, and noise in measuring gene expression, especially for genes
expressed in small subpopulations, can confound cell clusters inferred by unsupervised
community detection algorithms. Additionally, parameters that determine the number and
membership of clusters, such as the resolution in the Leiden or Louvain methods [10, 11], are
applied globally across an entire dataset, despite the possible presence of distinct
subpopulations of varying sizes. To avert over- or under-clustering and subsequent
mischaracterization of novel cell populations and their marker genes, one can directly
annotate cells using their expression of pre-defined sets of marker genes [12-15]. However,
direct annotation is constrained to identities present in reference datasets and may miss
novel biological populations uncharacterized in the literature. Meanwhile, solutions for
optimizing clustering parameters have been proposed [16-20], but consistent biological
interpretation of inferred clusters remains a necessary yet unresolved step.

Inferring similar cell state or identity for clusters that share key marker genes is analogous
to a subjective call on two lists’ similarity based on the presence of a few key members.
Conversely, the mere presence of the same genes in two or more clusters’ marker gene
sets may not by itself indicate the same cell state or identity. Differences in the magnitude
of differential expression in each cluster and, by extension, differences in the rankings of
marker genes may indicate a relevant differentiation trajectory or novel subpopulation.
Yet methods are lacking for the quantification of marker gene list similarity between
single-cell clusters to annotate cell identity.

Metrics that compare two lists based on the ranking of their elements address this
challenge by providing a measure to compare single-cell marker gene lists from cluster-based
differential expression analyses. While rank-based methods for scRNA-seq
analyses have been developed [21-23], none specifically address cluster similarity based on
differential expression, especially when clusters are computationally determined in an
unsupervised manner through community detection algorithms like Leiden or Louvain. To
date, no quantification of the similarity of differentially expressed genes and their rankings
has been employed in downstream scRNA-seq analysis. Here, we provide a definition for
similarity between single-cell clusters based on a rank-based metric called average
overlap [24] and show that it can be used to calculate the significance of related cell identities
in a truth-known dataset with consistency, precision, and meaningful biological
interpretation. We then demonstrate the utility of average overlap for guiding the
clustering and annotation of cells in the thymus, revealing individual stages of thymocyte
development.

Results

Average overlap metric for cluster marker gene comparison
To compare single-cell clusters derived from unsupervised community detection
algorithms, we first defined each cluster’s marker gene set by its most differentially
expressed genes relative to the rest of the cells. We then ranked the marker gene set
based on the significance of differential expression and hypothesized that clusters that
share marker genes with similar rankings are more likely to have similar cell identity
and/or state. To test this hypothesis, we used a rank-based metric called average overlap
(AO) [24], which is a top-weighted measure of ranked list similarity. In the context of marker
gene sets, AO weighs differences in the rankings of the most differentially expressed
genes more heavily than differences among genes lower in the set. AO is defined as the mean of the
overlaps between two ranked lists calculated at a range of depths into the lists (Fig. 1a).
AO scores range from 0 to 1, from completely dissimilar to completely identical lists.
When calculated over randomly shuffled lists, AO closely follows a normal distribution
(Supplementary Fig. 1), which provides a statistical means for assigning significance to
AO scores and calculating the likelihood that two clusters share the same cell identity.
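
The definition above can be sketched in a few lines of Python. This is an illustrative sketch, not the code of the sc_average_overlap package: the function names (`average_overlap`, `shuffled_null`, `ao_z_score`), the number of shuffles, and the toy gene labels are assumptions made for this example. It computes AO as the mean, over depths 1 to k, of the fraction of items shared by the two top-d prefixes, and then compares an observed score against scores from randomly shuffled lists using a normal approximation.

```python
import random
import statistics


def average_overlap(list_a, list_b, depth=None):
    """Average overlap (AO) of two ranked lists, evaluated down to `depth`."""
    max_depth = min(len(list_a), len(list_b))
    depth = max_depth if depth is None else min(depth, max_depth)
    if depth < 1:
        raise ValueError("both lists need at least one element")
    seen_a, seen_b = set(), set()
    total = 0.0
    for d in range(1, depth + 1):
        seen_a.add(list_a[d - 1])
        seen_b.add(list_b[d - 1])
        # Fraction of the two top-d prefixes that is shared.
        total += len(seen_a & seen_b) / d
    return total / depth


def shuffled_null(items, n_shuffles=1000, seed=0):
    """AO scores between pairs of independent random orderings of `items`."""
    rng = random.Random(seed)
    items = list(items)
    return [
        average_overlap(rng.sample(items, len(items)), rng.sample(items, len(items)))
        for _ in range(n_shuffles)
    ]


def ao_z_score(list_a, list_b, n_shuffles=1000, seed=0):
    """z-score of the observed AO against the shuffled null.

    Assumes both lists rank the same set of items (for example, the union
    of marker genes) and that the null is approximately normal.
    """
    null = shuffled_null(list_a, n_shuffles, seed)
    observed = average_overlap(list_a, list_b)
    return (observed - statistics.mean(null)) / statistics.stdev(null)


# Hypothetical gene labels, used only to illustrate top-weighting.
reference = ["g1", "g2", "g3", "g4"]
top_swap = ["g2", "g1", "g3", "g4"]      # disagreement at the top of the list
bottom_swap = ["g1", "g2", "g4", "g3"]   # disagreement at the bottom

print(average_overlap(reference, top_swap))     # 0.75
print(average_overlap(reference, bottom_swap))  # about 0.917
```

Swapping the two top-ranked genes lowers the score more than swapping the two bottom-ranked genes, which is the top-weighted behavior described in the text.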
Benchmarking the average overlap metric in truth-known data
To evaluate AO for quantifying differences between highly similar cell types, we used
T-cell populations from a dataset with truth-known, biological cell identities, including equal
numbers of purified T-helper, memory, naive, and regulatory CD4 T-cells as well as
cytotoxic CD8 T-cells [20, 25]. We refer to this dataset as Zhengmix8eq.

We used the Piccolo workflow [26] to process and normalize the raw counts and the Leiden
method [11] (resolution = 2.0) to produce single-cell clusters. After determining each
cluster’s top-k marker genes (k = 5, 10, 25, 50, and 100), we calculated pairwise AO
distances between the Leiden clusters based on their rankings of the union of all marker
gene sets. To obtain five cell populations that corresponded to the five T-cell
populations present in Zhengmix8eq, we performed hierarchical clustering of the initial Leiden
clusters using the pairwise AO distances, and then iteratively merged the clusters with
the highest AO similarity (Fig. 1b). We also labelled each final cluster based on the
enrichment of ground truth labels and quantified the agreement between the five derived
cell populations and known T-cell identities using the adjusted Rand index (ARI) (Fig. 1c),
adjusted mutual information (AMI), and the Fowlkes-Mallows score (FMS)
(Supplementary Fig. 2). We repeated this analysis 100 times, each with a new random seed
for running the Leiden method. To benchmark AO against other metrics utilized for
assessing similarities between single-cell clusters, we performed the same hierarchical
clustering with pairwise Pearson, Spearman, and Kendall Tau correlations, using the
marker genes’ normalized expression counts. We also benchmarked the use of
correlation metrics applied to the first 50 principal components, which summarize
expression across all genes.
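
The merging procedure described above can be illustrated with a small, self-contained sketch. Several details here are assumptions rather than the authors' implementation: the helper names (`union_rankings`, `ao_distances`, `merge_until`), the use of average linkage, and the toy rankings with hypothetical gene labels. Each cluster is assumed to come with a full ranking of genes by differential expression significance; the sketch restricts every ranking to the union of all top-k marker genes, converts AO to a distance as 1 - AO, and merges the closest groups until the requested number remains.

```python
from itertools import combinations


def average_overlap(a, b):
    """Mean fractional overlap of the top-d prefixes, d = 1..min(len(a), len(b))."""
    depth = min(len(a), len(b))
    seen_a, seen_b, total = set(), set(), 0.0
    for d in range(1, depth + 1):
        seen_a.add(a[d - 1])
        seen_b.add(b[d - 1])
        total += len(seen_a & seen_b) / d
    return total / depth


def union_rankings(full_rankings, k):
    """Restrict each cluster's ranking to the union of all clusters' top-k genes."""
    union = set()
    for genes in full_rankings.values():
        union.update(genes[:k])
    return {c: [g for g in genes if g in union] for c, genes in full_rankings.items()}


def ao_distances(rankings):
    """Pairwise distances, defined as 1 - AO, keyed by unordered cluster pair."""
    return {
        frozenset((a, b)): 1.0 - average_overlap(rankings[a], rankings[b])
        for a, b in combinations(rankings, 2)
    }


def merge_until(rankings, n_groups):
    """Agglomerative merging (average linkage, an assumption) down to n_groups."""
    dist = ao_distances(rankings)
    groups = [(c,) for c in rankings]

    def linkage(g1, g2):
        pairs = [dist[frozenset((a, b))] for a in g1 for b in g2]
        return sum(pairs) / len(pairs)

    while len(groups) > n_groups:
        g1, g2 = min(combinations(groups, 2), key=lambda pair: linkage(*pair))
        groups.remove(g1)
        groups.remove(g2)
        groups.append(g1 + g2)
    return groups


# Toy input: four clusters, each with a full gene ranking (hypothetical labels).
full = {
    "c0": ["g1", "g2", "g3", "g7", "g4", "g5", "g6"],
    "c1": ["g1", "g3", "g2", "g7", "g4", "g5", "g6"],
    "c2": ["g6", "g5", "g4", "g7", "g3", "g2", "g1"],
    "c3": ["g6", "g4", "g5", "g7", "g3", "g2", "g1"],
}
rankings = union_rankings(full, k=2)   # g7 is in no top-2 set, so it is dropped
merged = merge_until(rankings, n_groups=2)
print(merged)
```

In this toy case, clusters c0 and c1 agree at the top of their rankings and end up in one group, while c2 and c3 form the other.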
The five cell populations obtained from merging the Leiden clusters based on AO similarity
showed the highest correspondence to the true T-cell populations present in the dataset
compared to all other metrics (Fig. 1c). Furthermore, AO reproducibly separated a CD8
cytotoxic T-cell population from the CD4 T-cell populations. AO was able to
distinguish a CD8-enriched population in 100% of iterations with 5, 10, and 25 marker
genes and in 97% of iterations with 50 and 100 marker genes. In contrast, when we merged
the Leiden clusters using Pearson, Spearman, or Kendall Tau correlation of marker genes’
expression counts, CD8-enriched populations were not detected in 63–77%, 24–94%, and
30–96% of iterations, respectively (Fig. 1d).
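
For readers unfamiliar with the agreement score behind Fig. 1c, the following is a minimal pure-Python sketch of the adjusted Rand index, together with a helper that labels each merged group by its most frequent ground-truth label. The function names and toy labels are assumptions for illustration; an established library implementation would normally be used in practice.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same cells."""
    n = len(labels_true)
    if n != len(labels_pred) or n < 2:
        raise ValueError("need two labelings of the same length (at least 2 cells)")
    # Pair counts within contingency cells, true classes, and predicted groups.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings place all cells together
    return (sum_cells - expected) / (max_index - expected)


def label_by_enrichment(group_ids, labels_true):
    """Name each group after its most frequent ground-truth label."""
    counts = {}
    for group, truth in zip(group_ids, labels_true):
        counts.setdefault(group, Counter())[truth] += 1
    return {group: c.most_common(1)[0][0] for group, c in counts.items()}


# Toy example with hypothetical labels.
truth = ["CD8", "CD8", "CD8", "CD4 naive", "CD4 naive", "CD4 naive"]
groups = ["A", "A", "A", "B", "B", "A"]
print(adjusted_rand_index(truth, groups))
print(label_by_enrichment(groups, truth))
```

The index equals 1 for identical partitions regardless of how the groups are named, and is close to 0 for random assignments.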
Secondly, AO similarity consistently measured larger distances between the CD8 and
CD4 populations than the pairwise distances among the CD4 subpopulations. A ratio of
the inter-CD8-to-CD4 distances to the intra-CD4 distances that is greater than one
indicates larger separation between the CD8 group and the CD4 groups. When using AO, the
distance ratios were consistently greater than one with 5
(range: 1.12–1.28, standard deviation (s.d.): 0.029), 10 (range: 1.06–1.15, s.d.: 0.018), and 25
(range: 1.015–1.10, s.d.: 0.022) marker genes, and fell only slightly below one at their
minimum with 50 (range: 0.97–1.10, s.d.: 0.034) and 100 (range: 0.96–1.09, s.d.: 0.039)
marker genes. In comparison, correlation-based methods showed substantially poorer
performance when applied to marker gene expression counts. Pearson, Spearman, and
Kendall Tau distances distinguished CD8 populations from CD4 subpopulations in the
same manner, with distance ratios greater than one, in only 75%, 81%, and 80% of iterations,
respectively, at their best performance, which was achieved with 5 marker genes per
cluster (Pearson range: 0.94–2.0, s.d.: 0.38; Spearman range: 0.89–5.09, s.d.: 0.65;
Kendall Tau range: 0.84–2.05, s.d.: 0.24) (Fig. 1e).
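
The distance ratio used in this comparison can be written out as a short sketch. How the individual distances are aggregated is not stated in this section, so taking the mean of the inter-group and intra-group distances is an assumption, as are the function name and the toy distance values.

```python
from itertools import combinations
from statistics import mean


def cd8_to_cd4_distance_ratio(distances, cd8_groups, cd4_groups):
    """Mean CD8-to-CD4 distance divided by the mean distance among CD4 groups.

    `distances` maps an unordered pair of group names (a frozenset) to a
    distance such as 1 - AO. A ratio above one means the CD8 group sits
    farther from the CD4 groups than the CD4 groups sit from each other.
    """
    inter = [distances[frozenset((a, b))] for a in cd8_groups for b in cd4_groups]
    intra = [distances[frozenset(pair)] for pair in combinations(cd4_groups, 2)]
    return mean(inter) / mean(intra)


# Toy distances between four merged populations (values are made up).
cd8 = ["CD8 cytotoxic"]
cd4 = ["CD4 helper", "CD4 memory", "CD4 naive"]
toy = {frozenset((cd8[0], name)): 0.6 for name in cd4}
toy.update({frozenset(pair): 0.5 for pair in combinations(cd4, 2)})

ratio = cd8_to_cd4_distance_ratio(toy, cd8, cd4)
print(round(ratio, 3))  # 1.2
```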
Measuring single-cell cluster similarities using a smaller number of marker genes per
cluster resulted in better performance for AO and the other tested rank-based metrics.
Curiously, although Spearman and Kendall Tau correlations were able to recover CD8
populations distinct from CD4 subpopulations in 100% of iterations when they were
applied to 50 principal components (PCs) (Fig. 1e), merging Leiden clusters using PC-based
distances resulted in the weakest correspondence to the true T-cell labels present in
the dataset (Fig. 1c).

Put together, AO robustly and reproducibly measured similarities between single-cell
clusters based on the ranking of marker genes and significantly outperformed correlation-based
metrics in identifying and characterizing T-cell populations present in
Zhengmix8eq.

Thymic T-cell development at single-cell resolution
Specific stages of T-cell development in the thymus have been characterized with single-cell
transcriptomic studies, starting from double negative (DN) populations DN1-DN4,
advancing to immature single positive (ISP) and double positive (DP), and eventually
mature CD4 and CD8 T-cells. While the T-cell development trajectory in the thymus has
been partially resolved in normal and diseased states, characterizing double negative cell
subpopulations using single-cell approaches has been especially challenging, in
particular when measuring total thymocytes without prior cell sorting, leading to grouping
of all DN subpopulations together without specifically differentiating DN1-DN4 states [27-30].
To address this challenge and to demonstrate the utility of the AO metric in elucidating
the relationships between thymic cell populations, we used our methodology to analyze
previously published single-cell transcriptomic data from mouse thymocytes [28]. We
processed the data using the Piccolo workflow and used the Leiden method for cell
clustering (resolution = 1.0), obtaining 19 single-cell clusters (Fig. 2a). Marker gene sets
for each single-cell cluster were defined by performing differential gene expression
analysis relative to all other cells. We then combined each cluster’s marker gene set,
composed of its top 25 genes ranked by significance, and performed hierarchical
clustering using pairwise AO distances between each cluster’s rankings of the marker
genes, testing the hypothesis that T-cell clusters with high AO correspond to cell
populations with similar identity and state (Fig. 2b,c). To guide the annotation of T-cell identities
for unsupervised clusters, we utilized the MyGeneSet tool, available as a part of the
Immunological Genome Project (ImmGen) [31], and mapped the expression of each cluster’s
top 50 marker genes across 12 sorted mouse thymic cell populations using composite
z-scores (Fig. 2d). This analysis linked the expression profiles of the 19 de novo single-cell
clusters to MPP4 (a subset of multipotent progenitors from the bone marrow), DN1, DN2a,
DN2b, DN3, DN4, ISP, DP, mature CD4 and CD8, natural killer, and T-cells in ImmGen
cell populations [32].

The cell cycle characterizes thymic single-cell clusters
Marker genes in multiple single-cell clusters included those related to phases of the cell
cycle. Hierarchical clustering based on AO of marker gene sets identified distinct groups
of single-cell clusters; therefore, we hypothesized that similarity in the rankings of
differentially expressed genes, as quantified by AO, would group the clusters based on
cell cycle phases. We tested this hypothesis by scoring each cell based on the expression
of known cell cycle genes and predicting its cell cycle phase [33] (Fig. 2b). We found that
clusters 1, 5, and 9 were enriched with cells in the S phase while clusters 2, 6, 7, 8, 10,
and 11 were enriched with cells in the G2/M phase (Fig. 2c). The remaining clusters were
enriched with cells in the G1 phase, split into two groups: clusters 0, 3, 4, 16, 17, and
18, and clusters 12, 13, 14, and 15 (Fig. 2c). Marker genes associated with the
cell cycle were also highly expressed in multiple ImmGen bulk populations, including the
DN2b, DN4, and ISP populations (Fig. 2d). These results suggest that the groups that
emerge from AO clustering primarily reflect cell cycle signatures known to play crucial
roles in the identities of thymic cell populations, particularly pointing to stages of T-cell
development involving rapid expansion and proliferation such as those that occur after
T-lineage commitment and β-selection [34].
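
The phase-scoring step can be illustrated with a simplified sketch. The exact scoring procedure is not given in this section, so everything below is an assumption: the score is taken as the mean expression of a phase gene list minus the mean of a background list, the phase rule (G1 when both scores are negative, otherwise the phase with the higher score) follows the convention of common single-cell toolkits, and the short gene lists are illustrative placeholders rather than the lists used in the study.

```python
from collections import Counter
from statistics import mean

# Illustrative, heavily truncated gene lists (assumptions for this example).
S_GENES = ["Pcna", "Mcm5"]
G2M_GENES = ["Hmmr", "Nusap1"]
BACKGROUND = ["Actb", "Gapdh"]


def signature_score(expression, signature, background=BACKGROUND):
    """Mean expression of the signature genes minus mean of background genes."""
    sig = [expression.get(g, 0.0) for g in signature]
    bg = [expression.get(g, 0.0) for g in background]
    return mean(sig) - mean(bg)


def assign_phase(expression):
    """G1 if both scores are negative, otherwise the higher-scoring phase."""
    s = signature_score(expression, S_GENES)
    g2m = signature_score(expression, G2M_GENES)
    if s < 0 and g2m < 0:
        return "G1"
    return "G2M" if g2m > s else "S"


def dominant_phase_per_cluster(cluster_ids, phases):
    """Most frequent predicted phase among the cells of each cluster."""
    counts = {}
    for cluster, phase in zip(cluster_ids, phases):
        counts.setdefault(cluster, Counter())[phase] += 1
    return {cluster: c.most_common(1)[0][0] for cluster, c in counts.items()}


# Toy cells: normalized expression per gene (made-up values).
s_cell = {"Pcna": 5, "Mcm5": 4, "Hmmr": 0, "Nusap1": 0, "Actb": 2, "Gapdh": 2}
g2m_cell = {"Pcna": 0, "Mcm5": 0, "Hmmr": 6, "Nusap1": 5, "Actb": 2, "Gapdh": 2}
g1_cell = {"Pcna": 0, "Mcm5": 0, "Hmmr": 0, "Nusap1": 0, "Actb": 2, "Gapdh": 2}

cells = [s_cell, s_cell, g1_cell, g2m_cell, g2m_cell, g2m_cell]
clusters = ["x", "x", "x", "y", "y", "y"]
phases = [assign_phase(c) for c in cells]
print(dominant_phase_per_cluster(clusters, phases))  # {'x': 'S', 'y': 'G2M'}
```

A per-cluster summary of this kind is one simple way to express the statement that a cluster is enriched with cells in a given phase.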
Annotation of thymic single-cell clusters during development
Using the annotations from MyGeneSet and ImmGen sorted populations, we found that
the earliest thymocyte progenitors (MPP4 and DN1 cells) mapped most strongly to cluster
0, which had marker genes that did not show significant expression in any other ImmGen
thymic population. At the same time, cluster 0 did not show high similarity to any other
cluster according to AO; thus, we annotated cluster 0 as a mix of multipotent progenitors
(MPP) and DN1 cells.

Cluster 1 corresponded strongly to DN2a populations, and cluster 2 was inferred as DN2b
cells based on its marker genes’ expression specifically in the DN2a/b and earlier
progenitor populations. Marker genes for cluster 2 were also highly expressed in
proliferating DN4 and ISP populations, stemming from a 36% overlap with known cell
cycle genes, including Hmmr and Nusap1.

Marker genes for clusters 3 and 4 were highly expressed in DN3 populations and showed
marked AO similarity. In addition, marker genes for cluster 5 showed relatively high
expression in DN3 cells. For deeper characterization of these cells into DN3a and DN3b
populations, we scored cells in clusters 3, 4, and 5 against genes upregulated in purified
wildtype DN3a and DN3b cells [35]. While cluster 4 correlated highly with DN3a, and cluster 3
correlated modestly with both states, cluster 5, which is in the S phase of the cell cycle,
showed similarity to DN3b cells, together indicating a transition of cells from DN3a to
DN3b states (Supplementary Fig. 3). Accordingly, we merged clusters 3 and 4
to form a DN3a population and annotated cluster 5 as the DN3b population.

Clusters 6, 7, 8, and 9 mapped strongly to ImmGen DN4 and ISP populations. While
cluster 9 was distinguished as DN4/ISP cells in the S phase of the cell cycle, clusters 6,
7, and 8 grouped together based on their pairwise AO and association with the G2/M
phase of the cell cycle. Of note is cluster 8, which contained cells with low expression of
a select set of genes that were not expressed in any other clusters (Supplementary Fig.
4). Relative to cells in clusters 6 and 7, cells in cluster 8 showed an upregulation of
mt-Co1, mt-Co2, mt-Co3, and other mitochondrial genes. In fact, mitochondrial genes made
up on average 7.8% of this cluster’s expressed transcripts, indicating that a portion of
cluster 8 would fall just under the 12% mitochondrial expression threshold we used for
filtering low-quality cells at the start of analysis. Put together, we interpreted cluster 8 as
representing apoptotic DN4/ISP cells (Supplementary Fig. 5).
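
The mitochondrial quantities mentioned above are simple to compute, as the following sketch shows. It assumes that mitochondrial genes are recognized by the mouse `mt-` prefix (as in mt-Co1) and that cells at or above the 12% threshold are removed; the function names, the strictness of the comparison, and the toy counts are assumptions for illustration.

```python
def mito_fraction(counts, prefix="mt-"):
    """Fraction of a cell's transcript counts that come from mitochondrial genes."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    mito = sum(n for gene, n in counts.items() if gene.startswith(prefix))
    return mito / total


def passes_mito_filter(counts, threshold=0.12):
    """Keep a cell only if its mitochondrial fraction is below the threshold."""
    return mito_fraction(counts) < threshold


# Toy cells with made-up counts.
retained_cell = {"mt-Co1": 4, "mt-Co2": 2, "mt-Co3": 2, "Cd3e": 42, "Ptcra": 50}
filtered_cell = {"mt-Co1": 9, "mt-Co2": 6, "Cd3e": 35, "Ptcra": 50}

print(mito_fraction(retained_cell), passes_mito_filter(retained_cell))  # 0.08 True
print(mito_fraction(filtered_cell), passes_mito_filter(filtered_cell))  # 0.15 False
```

A cell such as the first one is retained by the filter even though its mitochondrial fraction is elevated, which is the situation described for cluster 8.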
Cluster 10 showed the highest correspondence to ImmGen ISPs, while cluster 11 was
inferred as an intermediate population between the ISP and DP stages based on its
marker genes’ expression in late-development T-cells (Supplementary Fig. 4). Clusters
12, 13, and 14 were identified as parts of a larger DP population, as they each mapped
strongly to ImmGen DP cells and showed the highest AO similarities across all clusters.
The maturing stages of T-cell development were represented by cells in cluster 15 as
single positive CD4 and cluster 16 as single positive CD8 T-cells. Finally, clusters 17 and
18 were assigned to smaller groups of natural killer cells and T-cells, respectively.

Lastly, we explored characterizing the stages of T-cell development in our data using
orthogonal methods. First, we asked whether pseudotime inference could recover cell
identities in these developing T-cells and used diffusion pseudotime [36], specifying cluster
0 as the starting point. The inferred trajectory matched our annotated stages of
development and supported a later developmental trajectory for cells in clusters 6 and 7
compared to cells in cluster 2, hence partially distinguishing DN2b from DN4/ISP cells
despite their being in the same cell cycle phase (Supplementary Fig. 6a). This pattern,
however, did not hold when cluster 2 was chosen as the analysis’s starting point
(Supplementary Fig. 6b). Cells in clusters 6 and 7 were assigned the same starting
pseudotime value of 0 as the chosen cluster 2, indicating that pseudotime calculations
can be confounded by strong cell cycle signatures. Second, we asked whether Pearson
correlation applied to the expression counts of the marker genes could infer T-cell
similarities. The resulting hierarchical clustering, however, did not consistently group the
single-cell clusters based on phases of the cell cycle or stages of T-cell development
(Supplementary Fig. 7), highlighting the strength of AO in inferring biologically
meaningful relationships by using the rankings of differentially expressed genes as
opposed to their counts, especially in the context of highly similar cells.

Overall, our extended analyses of truth-known hematopoietic populations as well as
unsorted mouse thymocytes showed that AO could accurately guide the clustering and
annotation of highly homogeneous cell populations and help characterize the trajectory of
T-cell development in the thymus, including the detection of elusive double-negative
(CD4-CD8-) cells. We have implemented the AO metric in Python and have published the
‘sc_average_overlap’ Python package, which is available at
https://github.com/chrisvthai/sc_average_overlap and designed to work seamlessly within
the Scanpy framework.

Discussion

In this work, we propose using average overlap, a top-weighted metric that quantifies the
similarity of ranked lists, to compare clusters in scRNA-seq analysis. Hierarchical
clustering using rank- and correlation-based metrics to compare the transcriptomic
profiles of inferred clusters in scRNA-seq analysis is not new. Current implementations,
including those in Seurat and Scanpy, utilize distances based on gene expression counts
or principal components. In contrast, AO measures cluster similarity by relying on marker
gene rankings derived from differential gene expression analysis.

We compared AO to correlation-based metrics using Leiden clusters derived from a
biological, ground-truth dataset containing five transcriptionally similar T-cell
populations [20, 25]. We showed that merging unsupervised clusters into five groups with AO
based on gene rankings resulted in cell labels that corresponded to ground truth
populations more accurately than other tested methods. AO was also far more consistent in
its performance and, over many Leiden clustering iterations, produced merged
populations that were biologically meaningful by differentiating CD8 cytotoxic populations
from the CD4 clusters. Expression counts across highly similar populations are highly
correlated irrespective of the feature selection procedure and the number of cluster
marker genes used. This high transcriptional correlation often results in inconsistencies
that make it challenging to distinguish subtle differences between T-cell subpopulations
and to infer cell clusters that map back to individual subpopulations. Restricting analysis
to gene rankings based on differential expression in our implementation of AO allows
robust measurement of subtle transcriptional differences between single-cell clusters.

Moreover, AO performed well compared to other distance metrics due to its top-weighted
property. While performance for all metrics diminished as more marker genes were
included, AO-based hierarchical clustering was minimally affected. Reincorporating
information about gene expression variation by way of principal components
recovered the performance of correlation-based metrics to an extent. We interpret this
observation as evidence for the effect commonly known as the curse of dimensionality [37],
where distances between clusters become increasingly similar as
more marker genes are added. In contrast, differences at the top of rankings,
as measured by AO, correspond to the most differentially expressed genes and reflect
cellular differences more precisely than the genes at the bottom of the rankings. As such,
AO is not as sensitive to the curse of dimensionality as the other benchmarked metrics.
AO can quantify subtle differences in otherwise highly transcriptionally correlated cell
populations. In our thymus data, AO-based hierarchical clustering revealed cell cycle
genes as the strongest drivers of marker gene similarity. However, showing high
correlation with ImmGen reference expression profiles was not sufficient to characterize
cell cycle signatures in single-cell clusters that corresponded to DN2b, DN4, and ISP cells
at once. By measuring differential gene rankings, AO inferred cell cycle phases with great
specificity and guided identification of stages of T-cell development involving increased
expansion and proliferation. Additionally, AO helped characterize a small subset of cells
in the G2/M phase as a population of apoptotic DN4/ISP cells. Cells in this cluster were retained
simply due to the choice of mitochondrial expression threshold used to account for
low-quality cells, yet showed relatively high correspondence to ImmGen DN4 and ISP
populations, supporting their biological relevance. Thymocytes may undergo positive
selection based on their ability to bind to MHC ligands; thymocytes unable to bind MHC
undergo death by neglect [38]. The combination of relatively high mitochondrial gene
expression, which typically indicates dying cells, high correspondence to bulk DN4 and
ISP populations in ImmGen, and the detection and quantification of unique marker genes
via the Leiden clustering algorithm and the AO metric supports this interpretation.
Altogether, unique subpopulations of thymocytes characterized with the aid of AO may be
missed when relying solely on reference transcriptomic profiles.

The analysis and comparison of cluster marker genes are necessary steps in uncovering
novel cell populations. As we have demonstrated with our analysis of thymus data, one
clustering resolution may not yield clusters that simultaneously represent broad
well-defined populations as well as smaller novel subpopulations. Moreover, while some
clusters may separate due to existing biological variations, others may arise due to noisy
clustering. AO-measured distances based on the ranking of differentially expressed
genes provide a quantitative method for evaluating transcriptional similarity independent
of the clustering algorithms and parameters used. In the context of differentiating cells,
alternative methods such as pseudotime inference, which calculates the relative position
of a cell along gene expression gradients, can theoretically aid cluster annotation36, 39,
40. However, pseudotime algorithms rely heavily on the choice of starting population, and
as we showed in an application to our thymus data, they may group cells in the same cell
cycle phase despite transcriptional correspondence to different biological populations.
Additionally, when the data comprise many distinct groups of cells, a trajectory cannot
be inferred at all, and a quantitative method for cluster comparison is still needed.
In conclusion, we propose average overlap as a metric for direct quantification of marker
gene similarity. It is broadly applicable across diverse biological settings and is suited
to exploring and clarifying cellular heterogeneity in the most challenging contexts, such
as those that arise from differentiating cells or involve clonal populations, including
those driving cancer progression.

Methods
The sc_average_overlap Python package. We have implemented the average overlap
metric in Python and published the ‘sc_average_overlap’ Python package, which is
available at https://github.com/chrisvthai/sc_average_overlap and is designed to work
seamlessly with Scanpy, a commonly used Python library for scRNA-seq analyses. The
package includes functions for computing pairwise AO scores between two clusters,
performing hierarchical clustering of cell populations, and storing the resulting
dendrogram in Scanpy’s native AnnData object. For hierarchical clustering, the AO values
are subtracted from one to generate a distance metric, since AO ranges between 0 and
1, with higher overlap scores corresponding to lower distances.
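As a minimal sketch of the computation described above, and not the package's actual API, the following Python snippet computes average overlap between two ranked lists as the mean, over each depth d, of the fraction of items shared by the two top-d prefixes, and then converts it to a distance as 1 - AO. The function names `average_overlap` and `ao_distance_matrix` and the toy gene labels are hypothetical, and each list is assumed to contain no duplicate entries.

```python
from itertools import combinations


def average_overlap(ranked_a, ranked_b, depth=None):
    """Mean over depths 1..depth of |top-d(a) & top-d(b)| / d."""
    if depth is None:
        depth = min(len(ranked_a), len(ranked_b))
    if depth == 0:
        raise ValueError("both ranked lists must be non-empty")
    seen_a, seen_b = set(), set()
    total = 0.0
    for d in range(1, depth + 1):
        # Grow both prefixes by one item and measure their agreement.
        seen_a.add(ranked_a[d - 1])
        seen_b.add(ranked_b[d - 1])
        total += len(seen_a & seen_b) / d
    return total / depth


def ao_distance_matrix(rankings):
    """Nested dict of pairwise distances (1 - AO) between named rankings."""
    names = list(rankings)
    dist = {name: {name: 0.0} for name in names}
    for a, b in combinations(names, 2):
        d = 1.0 - average_overlap(rankings[a], rankings[b])
        dist[a][b] = dist[b][a] = d
    return dist


# Toy rankings of five genes in three clusters (illustrative values only).
rankings = {
    "C1": ["G1", "G2", "G3", "G4", "G5"],
    "C2": ["G2", "G1", "G3", "G5", "G4"],
    "C3": ["G5", "G4", "G3", "G2", "G1"],
}
distances = ao_distance_matrix(rankings)
print(distances["C1"]["C2"])  # 0.25: the lists agree at most depths
print(distances["C1"]["C3"])  # larger: C3 reverses the ranking of C1
```

Because early depths contribute to every later prefix, disagreements near the top of the lists lower the score more than disagreements near the bottom, which is the top-weighted behavior the metric is chosen for.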
Benchmarking average overlap’s performance. To benchmark AO when used as the
metric for hierarchical clustering of cell populations in single-cell RNA-seq data, we
utilized the Zhengmix8eq dataset, generated for the purpose of benchmarking clustering
performance in scRNA-seq analysis20. We used only five T-cell populations, which were
CD8 cytotoxic cells and CD4 populations including helper, memory, naïve, and regulatory
cells. We processed the data using Piccolo26 to perform feature selection and
normalization, selecting 3,000 highly variable genes for downstream analysis. We then
performed PCA and Leiden clustering with a resolution of 2.0 using Scanpy. We
repeated this process and all subsequent downstream analyses 100 times in total, each
time with a different random seed given to the Leiden algorithm.
We generated marker genes for each resulting single-cell cluster by performing
differential expression analysis using the Wilcoxon rank-sum test, comparing each
cluster’s gene expression against that of the remaining cells and ranking genes by
significance. We then grouped single-cell clusters using hierarchical clustering. The
metrics used for the hierarchical clustering included Pearson correlation, Spearman
correlation, the Kendall-Tau coefficient, and AO, and were calculated with either the set
of all cluster marker genes, composed of the union of each unsupervised cluster’s top 5,
10, 25, 50, or 100 differentially expressed marker genes, or the values of the first 50
principal components. When using the set of all cluster marker genes, Pearson, Spearman,
and Kendall-Tau coefficients were calculated using each cluster’s expression values for
each gene, while
AO was calculated using the rankings of the cluster marker genes. When using principal
component values, only Pearson, Spearman, and Kendall-Tau coefficients were
calculated.
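The construction of the combined marker set can be sketched as follows. This is one plausible reading of the procedure, stated as an assumption: the union of every cluster's top-k genes is formed, and each cluster's full significance-ordered ranking is then restricted to that union so that all clusters rank the same set of genes. The function names and the toy gene rankings are hypothetical.

```python
def marker_union(full_rankings, top_k):
    """Union of the top_k genes of every cluster."""
    union = set()
    for genes in full_rankings.values():
        union.update(genes[:top_k])
    return union


def restrict_rankings(full_rankings, top_k):
    """Keep only union genes in each cluster's ranking, preserving order."""
    union = marker_union(full_rankings, top_k)
    return {
        cluster: [gene for gene in genes if gene in union]
        for cluster, genes in full_rankings.items()
    }


# Toy rankings, most significant gene first (illustrative values only).
full_rankings = {
    "cluster_0": ["CD8A", "CD8B", "CD4", "IL7R", "FOXP3", "SELL"],
    "cluster_1": ["CD4", "IL7R", "SELL", "CD8A", "FOXP3", "CD8B"],
    "cluster_2": ["FOXP3", "CD4", "IL7R", "CD8B", "SELL", "CD8A"],
}
restricted = restrict_rankings(full_rankings, top_k=2)
for cluster, genes in restricted.items():
    print(cluster, genes)
```

With `top_k=2`, SELL never appears in any cluster's top two genes and is therefore dropped, leaving every cluster with a ranking of the same five genes that can be compared directly with a rank-based metric.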
Once hierarchical clustering was computed, the clusters were iteratively merged in an
automated manner, guided by the pair of clusters with the highest AO similarity, until five
final clusters remained, corresponding to the five represented T-cell populations in the
Zhengmix8eq dataset. Performance was evaluated in two ways. First, using the ground
truth labels, we calculated the adjusted Rand index (ARI), adjusted mutual information
(AMI), and the Fowlkes-Mallows score (FMS). Second, after labeling each of the five final
clusters according to its most enriched ground-truth label, we evaluated the separation
of CD8 populations from CD4 populations. For each metric and feature set used, we
counted the occurrences where no cluster enriched with CD8 cytotoxic cells was
identified. Additionally, in each clustering iteration, we calculated the ratio of the mean
pairwise distance between the CD8 population (if identified) and each CD4 population
to the mean pairwise distance among the CD4 populations. In each occurrence
where no CD8 cluster was detected, we set the mean distance ratio between CD8 and
CD4 groups to 1.0 to indicate no detected difference in separation between CD8 and CD4
groups.
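The separation ratio described above can be illustrated with a short sketch. It assumes the pairwise distances between the final clusters are available as a nested dictionary keyed by cluster label; the function names, labels, and distance values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from itertools import combinations
from statistics import mean


def symmetric(pairs):
    """Build a symmetric nested dict from {(a, b): distance}."""
    dist = {}
    for (a, b), d in pairs.items():
        dist.setdefault(a, {})[b] = d
        dist.setdefault(b, {})[a] = d
    return dist


def cd8_cd4_distance_ratio(dist, cd8_label, cd4_labels):
    """Mean CD8-to-CD4 distance divided by mean distance among CD4 clusters."""
    if cd8_label not in dist:
        # No CD8-enriched cluster was identified: report no separation.
        return 1.0
    between = mean(dist[cd8_label][c] for c in cd4_labels)
    within = mean(dist[a][b] for a, b in combinations(cd4_labels, 2))
    return between / within


cd4 = ["helper", "memory", "naive"]
toy = symmetric({
    ("CD8", "helper"): 0.8, ("CD8", "memory"): 0.6, ("CD8", "naive"): 0.7,
    ("helper", "memory"): 0.3, ("helper", "naive"): 0.4, ("memory", "naive"): 0.5,
})
ratio = cd8_cd4_distance_ratio(toy, "CD8", cd4)
print(round(ratio, 3))  # mean 0.7 between groups over mean 0.4 within CD4
```

A ratio above 1 in this toy case reflects a CD8 cluster that sits farther from the CD4 clusters than they sit from one another; the fallback of 1.0 mirrors the convention stated in the text for runs in which no CD8 cluster is recovered.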
Thymocyte analysis using average overlap. We applied the AO metric to a previously
published dataset of mouse thymocytes28. We first performed feature selection and
normalization using Piccolo, excluding cells expressing more than 12% mitochondrial
genes or 80% ribosomal genes and selecting 3,000 highly variable genes for downstream
analysis. We performed Principal Component Analysis (PCA) and Leiden clustering at
resolution 1.0 using Scanpy, and visualized the single-cell clusters using Uniform Manifold
Approximation and Projection (UMAP). We performed differential gene expression
analysis using the Wilcoxon rank-sum test and assigned the top 25 differentially
expressed genes, ranked by significance, as the set of marker genes for each cluster. As
we were interested in T-cell populations only, we excluded small populations of
granulocytes and B-cells detected through scoring against known markers of each cell
type. We combined all clusters’ marker genes into a single set for calculating AO, which
was then used to compute pair-wise distances between the clusters and to perform
hierarchical clustering.
We obtained and visualized the expression of each set of cluster marker genes in 12
sorted populations of thymic cells in mice from the ImmGen project using the MyGeneSet
tool31. To summarize the expression of each single-cell cluster’s marker genes in ImmGen
sorted bulk populations, we first transformed the expression counts into normalized
z-scores and then combined the transformed values into a single composite z-score using
Stouffer’s method.
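A minimal sketch of the composite score follows. It assumes, since the text does not specify these details, that each marker gene is z-scored across the sorted populations using the population standard deviation, and that the per-gene z-scores are combined with the unweighted form of Stouffer's method, Z = (z_1 + ... + z_k) / sqrt(k). The function names and expression values are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev


def zscores(values):
    """Z-score a gene's expression across populations."""
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        # A gene with constant expression carries no information.
        return [0.0] * len(values)
    return [(v - mu) / sd for v in values]


def stouffer(z_values):
    """Unweighted Stouffer combination of k z-scores."""
    return sum(z_values) / sqrt(len(z_values))


def composite_scores(expression):
    """expression: {gene: [value in population 0, population 1, ...]}.

    Returns one composite z-score per population.
    """
    per_gene = [zscores(values) for values in expression.values()]
    # zip(*per_gene) regroups the z-scores by population.
    return [stouffer(column) for column in zip(*per_gene)]


# Two toy marker genes measured in three sorted populations.
expression = {
    "geneA": [1.0, 2.0, 3.0],
    "geneB": [2.0, 4.0, 6.0],
}
scores = composite_scores(expression)
print([round(s, 3) for s in scores])
```

Dividing by the square root of the number of genes keeps the composite on a z-score scale, so marker sets of different sizes remain comparable across clusters.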
We used Scanpy’s score_genes function for scoring each cell according to phases of the
cell cycle using a list of 97 known cell cycle marker genes33. We used the same scoring
method to distinguish between DN3a and DN3b populations based on a list of genes
upregulated in purified wildtype DN3a/b cells35.

FUNDING
This work was supported by the National Institutes of Health (R01CA236936 and
R01CA285513), the V Foundation (T2019-012 and T2023-024), The Leukemia &
Lymphoma Society (Scholar Award 1386-23), the New Jersey Commission on Cancer
Research (COCR23PRG006), and the Rutgers Cancer Institute of New Jersey Biomedical
Informatics Shared Resource (P30CA072720-5917). CT was a fellow of the
Biotechnology Training Program at Rutgers University (NIH T32 GM135141). AS was
supported by the New Jersey Commission on Cancer Research (COCR24PDF015).

AUTHOR CONTRIBUTION
CT and HK conceived the study. CT performed the analytical experiments and conducted
all analyses with contribution and input from AS. DH and HK supervised the work. All
authors drafted the manuscript. All authors read and approved the final manuscript.

CONFLICTS OF INTEREST
CT, AS, and DH declare no existing conflict of interest. HK is a full-time employee of
Regeneron Pharmaceuticals.

FIGURES AND FIGURE LEGENDS

FIGURE 1. Workflow and benchmarking of average overlap for quantitative
comparison of inferred single-cell clusters in scRNA-seq data. a, Calculating the
average overlap distance between two ranked lists and comparison with correlation-based
metrics. Differentially expressed genes (G1 to G5) are ranked in clusters C1 and
C2 and the overlap between the ranked lists is calculated at each depth. Average overlap
of the entire lists is the mean of all overlaps. b, A schematic demonstrating the workflow
for hierarchical clustering using average overlap, following community detection, marker
gene identification, and pair-wise distance calculation. c, Adjusted Rand Index (ARI)
performance measures based on agreement with ground truth labels in five derived cell
populations, across different metrics and numbers of marker genes used, using the T-cells
in the Zhengmix8eq dataset. Adjusted mutual information (AMI) and the Fowlkes-Mallows
score (FMS) are shown in Supplementary Fig. 2. d, Percentage of benchmarking trials
in which a final cluster enriched with the ground-truth CD8 naïve cytotoxic population was
identified. e, Ratio of CD8-to-CD4 distances versus intra-CD4 distances. A ratio higher
than 1 indicates a higher degree of separation between CD8 cytotoxic and CD4 populations
than among the CD4 populations.

FIGURE 2. Utilizing average overlap to characterize stages of T-cell development in
a mouse thymus. a, A UMAP plot of mouse thymocytes showing 19 inferred clusters
using the Leiden algorithm. b, Cluster tree generated from hierarchical clustering of
Leiden clusters with average overlap of marker gene rankings, based on the combined
set of each cluster’s top 25 marker genes. Clusters largely group based on differential
expression of genes related to the cell cycle. c, A UMAP plot of the mouse thymocytes,
where each cell is annotated with its inferred cell cycle phase. d, A heatmap summarizing
expression of Leiden cluster marker genes in sorted bulk populations of thymocytes from
ImmGen. Each cluster is also colored with its inferred cell cycle phase. e, Final
annotations of developing mouse thymocytes.

SUPPLEMENTARY FIGURE LEGENDS
Supplementary Figure 1. Distributions of pair-wise average overlap scores on
randomly shuffled lists. Across 2,000 iterations, randomly shuffling two ranked lists that
contain the same set of elements yields average overlap scores that closely follow a
normal distribution with a mean of 0.5 and a variance that is inversely correlated with the
length of the list.
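The behavior summarized in this legend can be reproduced approximately with a small simulation. The sketch below is not the authors' code: the function names, the list lengths, and the seed are assumptions. For two independent random orderings of the same n items, the expected agreement at depth d is d/n, so the expected average overlap is (n + 1) / (2n), which approaches 0.5 as the lists grow.

```python
import random
from statistics import mean, pvariance


def average_overlap(ranked_a, ranked_b):
    """Mean over all depths d of |top-d(a) & top-d(b)| / d."""
    seen_a, seen_b = set(), set()
    total = 0.0
    for d in range(1, len(ranked_a) + 1):
        seen_a.add(ranked_a[d - 1])
        seen_b.add(ranked_b[d - 1])
        total += len(seen_a & seen_b) / d
    return total / len(ranked_a)


def shuffled_ao_scores(list_length, iterations=2000, seed=0):
    """AO scores for pairs of independently shuffled copies of one list."""
    rng = random.Random(seed)
    items = list(range(list_length))
    scores = []
    for _ in range(iterations):
        a, b = items[:], items[:]
        rng.shuffle(a)
        rng.shuffle(b)
        scores.append(average_overlap(a, b))
    return scores


short_scores = shuffled_ao_scores(10)
long_scores = shuffled_ao_scores(100)
print(mean(short_scores), pvariance(short_scores))
print(mean(long_scores), pvariance(long_scores))
```

Under these assumptions the simulated means should lie near 0.55 for lists of 10 items and near 0.505 for lists of 100 items, with a visibly smaller spread for the longer lists.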

Supplementary Figure 2. a, Adjusted Mutual Information (AMI) and b, Fowlkes-Mallows
score (FMS) performance measures based on the enrichment of ground truth labels in five
derived cell populations, across different metrics and numbers of marker genes used in
the T-cells in the Zhengmix8eq dataset.

Supplementary Figure 3. Upregulation of DN3a- and DN3b-related genes in
developing mouse thymocytes. Cells were scored by expression of genes upregulated
in purified wildtype DN3a and DN3b cells, as determined in the study by Vogel et al.35.

Supplementary Figure 4. a, Expression of the top 5 marker genes for each annotated
cell population. b, Expression of marker genes expressed in more than 50% of their
respective cell population.

Supplementary Figure 5. Pair-wise differential expression between apoptotic and
normal DN4/ISP cells in G2/M phase. Differentially expressed genes between these
groups were determined with the Wilcoxon rank-sum test. Normalized expression of the
top 20 genes upregulated in the apoptotic group is shown.

Supplementary Figure 6. Diffusion pseudotime inferred on developing thymocytes.
Among the group of cells that map to double negative thymocytes, when specifying
cluster 0 as the starting point, the inferred trajectory closely matches our annotated stages
of development. However, when setting cluster 2 as the start, two additional groups of
cells, clusters 7 and 8, also share the same pseudotime value of 0.

Supplementary Figure 7. Hierarchical clustering of Leiden clusters in thymus data
using Pearson correlation. Pearson correlation is calculated with the same set of top
25 marker genes in each Leiden cluster as in the previous analysis using the average
overlap metric. Grouping based on either cell cycle phase or neighboring stages of T-cell
development is inconsistent.

References
1. Li, X. & Wang, C.Y. From bulk, single-cell to spatial RNA sequencing. Int J Oral Sci
13, 36 (2021).
2. Luecken, M.D. & Theis, F.J. Current best practices in single-cell RNA-seq analysis:
a tutorial. Molecular Systems Biology 15, e8746 (2019).
3. Haque, A., Engel, J., Teichmann, S.A. & Lonnberg, T. A practical guide to
single-cell RNA-sequencing for biomedical research and clinical applications. Genome
Med 9, 75 (2017).
4. Hwang, B., Lee, J.H. & Bang, D. Single-cell RNA sequencing technologies and
bioinformatics pipelines. Exp Mol Med 50, 1-14 (2018).
5. Pullin, J.M. & McCarthy, D.J. A comparison of marker gene selection methods for
single-cell RNA sequencing data. Genome Biol 25, 56 (2024).
6. Hao, Y. et al. Dictionary learning for integrative, multimodal and scalable
single-cell analysis. Nat Biotechnol 42, 293-304 (2024).
7. Satija, R., Farrell, J.A., Gennert, D., Schier, A.F. & Regev, A. Spatial reconstruction
of single-cell gene expression data. Nat Biotechnol 33, 495-502 (2015).
8. Wolf, F.A., Angerer, P. & Theis, F.J. SCANPY: large-scale single-cell gene
expression data analysis. Genome Biol 19, 15 (2018).
9. Kiselev, V.Y., Andrews, T.S. & Hemberg, M. Challenges in unsupervised clustering
of single-cell RNA-seq data. Nat Rev Genet 20, 273-282 (2019).
10. Blondel, V.D., Guillaume, J.-L., Lambiotte, R. & Lefebvre, E. Fast unfolding of
communities in large networks. Journal of statistical mechanics: theory and
experiment 2008, P10008 (2008).
11. Traag, V.A., Waltman, L. & van Eck, N.J. From Louvain to Leiden: guaranteeing
well-connected communities. Sci Rep 9, 5233 (2019).
12. Hou, R., Denisenko, E. & Forrest, A.R.R. scMatch: a single-cell gene expression
profile annotation tool using reference datasets. Bioinformatics 35, 4688-4695
(2019).
13. Aran, D. et al. Reference-based analysis of lung single-cell sequencing reveals a
transitional profibrotic macrophage. Nat Immunol 20, 163-172 (2019).
14. Ianevski, A., Giri, A.K. & Aittokallio, T. Fully-automated and ultra-fast cell-type
identification using specific marker combinations from single-cell transcriptomic
data. Nat Commun 13, 1246 (2022).
15. Pasquini, G., Rojo Arias, J.E., Schafer, P. & Busskamp, V. Automated methods for
cell type annotation on scRNA-seq data. Comput Struct Biotechnol J 19, 961-969
(2021).
16. Patterson-Cross, R.B., Levine, A.J. & Menon, V. Selecting single cell clustering
parameter values using subsampling-based robustness metrics. BMC
Bioinformatics 22, 39 (2021).
17. Kim, T. et al. Impact of similarity metrics on single-cell RNA-seq data clustering.
Brief Bioinform 20, 2316-2326 (2019).
18. Yu, L., Cao, Y., Yang, J.Y.H. & Yang, P. Benchmarking clustering algorithms on
estimating the number of cell types from single-cell RNA-sequencing data.
Genome Biol 23, 49 (2022).
19. Peyvandipour, A., Shafi, A., Saberian, N. & Draghici, S. Identification of cell types
from single cell data using stable clustering. Sci Rep 10, 12349 (2020).
20. Duo, A., Robinson, M.D. & Soneson, C. A systematic performance evaluation of
clustering methods for single-cell RNA-seq data. F1000Res 7, 1141 (2018).
21. Xu, Y. et al. A Gene Rank Based Approach for Single Cell Similarity Assessment
and Clustering. IEEE/ACM Trans Comput Biol Bioinform 18, 431-442 (2021).
22. Vargo, A.H.S. & Gilbert, A.C. A rank-based marker selection method for high
throughput scRNA-seq data. BMC Bioinformatics 21, 477 (2020).
23. Oulas, A., Savva, K., Karathanasis, N. & Spyrou, G.M. Ranking of cell clusters in
a single-cell RNA-sequencing analysis framework using prior knowledge. PLoS
Comput Biol 20, e1011550 (2024).
24. Webber, W., Moffat, A. & Zobel, J. A Similarity Measure for Indefinite Rankings.
ACM Transactions on Information Systems 28, 20.21-20.38 (2010).
25. Zheng, G.X. et al. Massively parallel digital transcriptional profiling of single cells.
Nat Commun 8, 14049 (2017).
26. Singh, A. & Khiabanian, H. Feature selection followed by a novel residuals-based
normalization that includes variance stabilization simplifies and improves
single-cell gene expression analysis. BMC Bioinformatics 25, 248 (2024).
27. Oh, S. et al. Distinct subpopulations of DN1 thymocytes exhibit preferential
gammadelta T lineage potential. Front Immunol 14, 1106652 (2023).
28. Tottone, L. et al. A Tumor Suppressor Enhancer of PTEN in T-cell development
and leukemia. Blood Cancer Discov 2, 92-109 (2021).
29. Park, J.E. et al. A cell atlas of human thymic development defines T cell repertoire
formation. Science 367 (2020).
30. Belver, L. et al. GATA3-Controlled Nucleosome Eviction Drives MYC Enhancer
Activity in T-cell Development and Leukemia. Cancer Discov 9, 1774-1791 (2019).
31. Heng, T.S., Painter, M.W. & Immunological Genome Project, C. The Immunological
Genome Project: networks of gene expression in immune cells. Nat Immunol 9,
1091-1094 (2008).
32. Yoshida, H. et al. The cis-Regulatory Atlas of the Mouse Immune System. Cell 176,
897-912 e820 (2019).
33. Tirosh, I. et al. Dissecting the multicellular ecosystem of metastatic melanoma by
single-cell RNA-seq. Science 352, 189-196 (2016).
34. Oh, S. et al. Mapping the two distinct proliferative bursts early in T-cell
development. Immunol Cell Biol 101, 766-774 (2023).
35. Vogel, K.U., Bell, L.S., Galloway, A., Ahlfors, H. & Turner, M. The RNA-Binding
Proteins Zfp36l1 and Zfp36l2 Enforce the Thymic beta-Selection Checkpoint by
Limiting DNA Damage Response Signaling and Cell Cycle Progression. J Immunol
197, 2673-2685 (2016).
36. Haghverdi, L., Buttner, M., Wolf, F.A., Buettner, F. & Theis, F.J. Diffusion
pseudotime robustly reconstructs lineage branching. Nat Methods 13, 845-848
(2016).
37. Bellman, R. & Rand Corporation. Dynamic programming. (Princeton University
Press, Princeton; 1957).
38. Baldwin, T.A., Hogquist, K.A. & Jameson, S.C. The fourth way? Harnessing
aggressive tendencies in the thymus. J Immunol 173, 6515-6520 (2004).
39. Street, K. et al. Slingshot: cell lineage and pseudotime inference for single-cell
transcriptomics. BMC Genomics 19, 477 (2018).
40. Trapnell, C. et al. The dynamics and regulators of cell fate decisions are revealed
by pseudotemporal ordering of single cells. Nat Biotechnol 32, 381-386 (2014).