Abstract
20
Spectral flow cytometry provides greater insights into cellular heterogeneity by simultaneous 21
measurement of up to 50 markers. However, analyzing such high-dimensional (HD) data is 22
complex through traditional manual gating strategy. To address this gap, we developed CAFE as 23
an open-source Python-based web application with a graphical user interface. Built with 24
Streamlit, CAFE incorporates libraries such as Scanpy for single-cell analysis, Pandas and 25
PyArrow for efficient data handling, and Matplotlib, Seaborn, Plotly for creating customizable 26
figures. Its robust toolset includes density-based down-sampling, dimensionality reduction, 27
batch correction, Leiden-based clustering, cluster merging and annotation. Using CAFE, we 28
demonstrated analysis of a human PBMC dataset of 350,000 cells identifying 16 distinct cell 29
clusters. CAFE can generate publication-ready figures in real time via interactive slider controls 30
and dropdown menus, eliminating the need for coding expertise and making HD data analysis 31
accessible to all. CAFE is licensed under MIT and is freely available at 32
https://github.com/mhbsiam/cafe. 33
34
Keywords
Cytometry, Bioinformatics, Protocol, Leiden, 35
Word count: 3986 36
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted December 7, 2024. ; https://doi.org/10.1101/2024.12.03.626714doi: bioRxiv preprint
37
Introduction
38
Flow cytometry is a widely used technique in immunology to identify and quantify immune cells 39
based on specific surface markers 1. The development of spectral flow cytometry (SFCM) has 40
further expanded immunophenotyping capabilities allowing the simultaneous analysis of a 41
greater number of parameters through the complete emission spectra of fluorophores 1. 42
Compared to conventional flow cytometry, SFCM uses spectral unmixing algorithms to 43
deconvolute the overlapping signals and achieves enhanced resolution and sensitivity to 44
distinguish between different cell populations 2. SFCM can incorporate broader range of 45
antibodies with up to 50 colors in a single panel improving upon the conventional FCM where 46
the number of parameters is limited by the instrument constraints 3. Incorporating more 47
parameters substantially increases the complexity in gating strategy which largely relies on 48
established convention and prior knowledge 4,5. Additional gating steps and combinations of 49
markers used to subset cells complicate the interpretation of such high-dimensional data. 50
Several clustering methods are available to identify cell populations such as FlowSOM 6, xShift7, 51
SPADE8 and Phenograph 9. SPADE and FlowSOM utilize hierarchical clustering with the latter 52
employing self-organizing maps (SOMs) to cluster cells, whereas xShift detects clusters based 53
on shifts in local cell density6–8. Phenograph, by contrast, constructs a K-nearest neighbor graph 54
and applies the Louvain algorithm to identify cell clusters, but Louvain can produce poorly 55
connected or disconnected communities9,10. 56
Recently, the Leiden clustering algorithm has emerged as a faster and more accurate 57
alternative to improve community detection in networks 10. Single-cell RNA sequencing (scRNA-58
seq) tools: Seurat11 (R) and Scanpy12 (Python) have integrated Leiden algorithms for community 59
detection. However, running Leiden within Seurat resulted in drawbacks including higher 60
memory usage, longer calculation time and random crashes in docker containers 13. Scanpy 61
resolves these issues, and unlike Seurat, Scanpy improves visualization quality by using 62
consistent KNN and SNN graphs for both clustering and uniform manifold approximation and 63
projection (UMAP) 13,14. In February 2020, Phenograph version 1.5.3 was released, which 64
incorporated an option to use Leiden for clustering; however, the default parameter is set to 65
Louvain through the latest release (v.1.5.7). In our previous work, we showed that the use of 66
Leiden algorithm in community detection for SFCM data provides superior result to Phenograph 67
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted December 7, 2024. ; https://doi.org/10.1101/2024.12.03.626714doi: bioRxiv preprint
(Louvain), FlowSOM, and xShift 15. Currently, there is a scarcity of open-source tools to utilize 68
Leiden algorithm for SCFM data analysis16. 69
Here we present CAFE, Cell Analyzer for Flow Experiment, a user-friendly, web application 70
developed in Python that works across Windows, MacOS, and Linux. The app is lightweight and 71
can perform high-dimensional SFCM data analysis using a standard computing machine (i.e., 72
Apple M1 chip with 16gb RAM), and it provides the flexibility to be deployed on HPC clusters for 73
enhanced scalability. Once installed, the tool runs entirely offline and does not require an active 74
internet connection to load files. This also enables users to maintain compliance with data 75
security, especially with protected health information (PHI) when analyzing patient samples. 76
CAFE can be used to process data, reduce dimension, batch correction, run Leiden clustering, 77
perform statistics and generate a wide range of figures. Figures can be adjusted and viewed 78
within the tool in real time. Additionally, the tool offers Kernel Density Estimation (KDE)-based 79
data downsampling, advanced clustering with predefined markers, cluster quality evaluation, 80
merging subclusters into metaclusters, and cell type annotation. Designed as an open-source, 81
interactive data analysis platform, CAFE enables biologists with no-coding experience to 82
analyze SFCM data and create publication quality visualization with customizable parameters. 83
CAFE is freely available to download at: https://github.com/mhbsiam/cafe 84
85
Methods
86
Implementation 87
The CAFE webtool was developed using Python programming language due to its compatibility 88
with Scanpy library 12. Figure 1 illustrates the components and workflows of CAFE. Streamlit 17 89
library was used to develop the web interface that provides dynamic updates based on user 90
inputs in the graphical user interface (GUI) without writing or editing code directly. Streamlit was 91
chosen due to its simplicity in development and compatibility with other Python libraries across 92
operating systems. Streamlit v1.39.0 is compatible with any modern HTML5 web browser. For 93
data loading and processing, we relied on Pandas v2.2.3 with PyArrow v18.0.0 which achieves 94
faster data loading and processing compared to Pandas alone. We used NumPy v1.26.4 for 95
data type and range selection, RGB array creation for color handling, and grid setup for 96
subplots. Seaborn v0.13.2, Matplotlib v3.9.2, and Plotly v4.24.1 libraries were used for data 97
visualization and users are provided with options to adjust parameters: plot size, color profile, 98
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted December 7, 2024. ; https://doi.org/10.1101/2024.12.03.626714doi: bioRxiv preprint
and output formats in PNG, JPG, SVG, or PDF. CAFE integrates AnnData 12, a widely used 99
framework in single-cell RNA sequencing analysis that allows for efficient storage and 100
manipulation of both sparse and dense matrices along with metadata. CAFE outputs an 101
AnnData object which can be used outside of CAFE if users wish to deposit their data with 102
analysis or perform custom analyses using other tools. 103
The Scanpy library was used to perform key analyses including dimension reduction, batch 104
correction, and Leiden clustering. The user has the option to reduce dimension (sc.tl.pca) of the 105
data through Principal Component Analysis (PCA) or skip it. PCA is a linear dimensionality 106
reduction technique that retains the global structure of the data by capturing the variances 107
across all dimensions. Because the app performs PCA through Scanpy library, by default, the 108
number of components retained is limited to the lesser of the two values: the number of cells or 109
the number of markers. Also, the Singular Value Decomposition (SVD) solver was set to “auto” 110
that chooses the most appropriate solver based on the size of the dataset; however, users have 111
options to set a percentage of variances they want to retain, and the type of solver used. The 112
reduced dataset is stored and can be further processed for batch correction (sc.pp.combat) 113
using ComBat (Combined Batch) 18. This is particularly useful if a user has collected samples in 114
different batches as the algorithm standardizes the data by making it comparable and removing 115
unwanted variability. 116
To group cells into distinct clusters based on marker expression profiles, Leiden clustering is run 117
(sc.tl.leiden) and users can select either ‘iGraph’ or ‘leidenalg’ algorithm flavor 10,19. To define 118
clustering resolution, a user can choose from 0.1 to 2.0 where the lowest value provides the 119
lowest number of clusters. The user can fine tune Leiden calculation by altering the number of 120
neighbors and minimum distance values in Uniform Manifold Approximation and Projection 121
(UMAP) calculation within the app. CAFE generates AnnData object (H5AD) file, CSV outputs, 122
and visual outputs including UMAP plots, dot-heatmaps, expression pots, and barplots as high-123
resolution images and provides download buttons to save them to a desired folder. The app 124
allows various visualization settings, with changes made and displayed immediately within the 125
app. The generated AnnData object (h5ad file) can be further used to perform a range of 126
statistical analyses. 127
The app includes advanced functionalities for clustering and cluster evaluation. Because setting 128
up appropriate values for Leiden resolution and UMAP parameters is central to obtaining quality 129
clustering results, a user can leverage CAFE’s Cluster Evaluation tab to generate multiple 130
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted December 7, 2024. ; https://doi.org/10.1101/2024.12.03.626714doi: bioRxiv preprint
AnnData files with various combinations of these parameters and compare UMAP plots as well 131
as Silhouette score, Calinski-Harabasz score, Davies-Bouldin score, and Elbow method to 132
assess clustering results. Additionally, CAFE provides clustering with pre-selected markers, 133
merging subclusters into metaclusters, and annotation of clusters directly within the app. The 134
advanced Downsampling tab offers to downsample (e.g. 20,000 events per sample) data using 135
a PCA-KDE based method. This approach combines PCA and KDE (Kernel Density Estimation) 136
based algorithm from scipy.stats using Gaussian kernel function and silverman bandwidth 20. 137
KDE is applied to the PCA-transformed data to estimate the density of data points. Based on 138
these density estimates, the code probabilistically downsamples data, thus reducing sampling 139
bias and preserving original data distribution. This offers an informed approach compared to 140
simple random downsampling, and this can be used to filter out noise while retaining meaningful 141
biological information in a smaller dataset. 142
143
Statistical analysis 144
In the visualization tab under statistical analyses section, users can perform different statistical 145
tests and generate plots. The Shapiro-Wilk test from “scipy.stats” is used to determine if marker 146
expression within each group or cluster follows a normal distribution. Based on these results, 147
the app recommends either parametric (T-test) or non-parametric (Mann-Whitney U test) tests 148
using “scipy.stats”. For comparing multiple clusters, the app allows users to perform ANOVA 149
(scipy.stats.f_oneway) or Kruskal-Wallis (scipy.stats.kruskal) tests. To reduce statistical artifacts, 150
multiple testing correction is applied using the Benjamini-Hochberg False Discovery Rate (FDR) 151
through “statsmodels.stats.multitest”. Additionally, effect size measures are computed to 152
complement statistical p-values, with users able to choose between parametric tests (Cohen’s 153
d) or non-parametric tests (Cliff’s Delta). Cohen’s d is calculated using basic functions from 154
Numpy, while Cliff’s Delta is computed with the “cliffs_delta” package. To assess associations 155
between clusters and groups, we used Chi-square testing from “scipy.stats.chi2_contingency” 156
and contingency tables with “pandas.crosstab”. Residual calculations were displayed in 157
Streamlit as tables to help users understand which clusters are more prevalent within certain 158
groups. 159
160
Performance and reproducibility 161
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted December 7, 2024. ; https://doi.org/10.1101/2024.12.03.626714doi: bioRxiv preprint
We have set a global setting for Scanpy (sc.settings.n_jobs = -1) to use all available CPU cores. 162
For advanced clustering, multi-threading was achieved using Python’s joblib library. Two other 163
libraries were used, watchdog v 5.0.2 and iGraph v 0.10.819. Watchdog helps in monitoring file 164
change events and improves performance of Streamlit by providing real-time feedback. iGraph 165
is designed to handle complex networks and graph operations and is used by Scanpy as part of 166
the Leiden clustering to perform graph operations. We recommend ‘iGraph’ over ‘leidenalg’ as 167
iGraph is implemented in C and achieves advantages in performance compared to high-level 168
interpreted languages such as Python. To export Plotly figures, we have used Kaleido engine 169
v0.2.1. We have tested the app with various datasets using an Apple M3 Pro System with 18GB 170
of random-access memory (RAM). CAFE is primarily intended to be used using local computer; 171
however, it can be scaled up using any High-Performance Computing (HPC) system that 172
supports an HTML5 web browser. We also provided scripts in our GitHub page to generate 173
AnnData with dimension reduction and Leiden clustering through command-line interface (CLI) 174
based HPC systems. 175
176
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted December 7, 2024. ; https://doi.org/10.1101/2024.12.03.626714doi: bioRxiv preprint
177
Figure 1: The flowchart outlines steps and components of CAFE’s workflow. 178
Preprocessing includes compensation, data scaling/transformation using a standard FCM 179
software and scaled CSV files are then exported and renamed as Sample_Group.csv. Data 180
processing performs error checks and concatenation of CSV files into an AnnData object/H5AD 181
file. Major steps requiring user input include dimension reduction, batch correction, UMAP 182
(UMAP Uniform Manifold Approximation and Projection) and community detection. Outputs are 183
downloadable as CSV, H5AD, PNG, JPG, SVG and PDF files. 184
185
Results
186
To demonstrate the functionality of the app, we have analyzed 35-color spectral flow cytometry 187
data (Publicly available at FlowRepository: FR-FCM-Z3WR) of human peripheral blood 188
mononuclear cells (PBMC) obtained from COVID-19 hospitalized patients and healthy 189
controls21. A total of 10 samples were analyzed with 5 from each group. For best practices, we 190
installed and ran CAFE through Pixi package manager. Users can also install and run the app 191
using Anaconda package manager as described in our Github documentation. Once initiated 192
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted December 7, 2024. ; https://doi.org/10.1101/2024.12.03.626714doi: bioRxiv preprint
through a terminal (pixi run cafe, or python cafe.py), a web browser opens with the CAFE app at 193
localhost on port 8501. The default data loading limit is set to 3GB, but a user can change the 194
value from the cafe.py script if necessary. 195
Data processing 196
The uploaded public data were available as doublets-debris removed and CD45+ gated; so we 197
obtained the CSV files just by exporting scaled values from FlowJo v10.10.0. Data scaling is 198
generally recommended for high resolution clustering but there may be instances where users 199
may use raw values. Data can be similarly exported from other flow cytometry software such as 200
FCS Express. It is required that flow cytometry data have proper compensation. We recommend 201
manual inspection of flow cytometry data and removal of debris, dead cells, and doublets prior 202
to exporting the scaled files. A user can also gate on appropriate cell type and export the data to 203
obtain more focused clustering results. To streamline downstream analysis, we have 204
implemented a naming convention for the CSV files. Each CSV file name must begin with a 205
unique “SampleName” followed by “GroupName”, separated by an underscore; for instance, 206
“Sample01_Control.csv” and “Sample02_Treatment.csv”. After loading the data, the app will 207
import the required libraries and perform initial checks for data structure and incorporate 208
SampleID and GroupName into the dataframe based on the CSV file names. Within the 209
dataframe, rows containing any missing values are skipped and anomalies in data structure are 210
reported. In this study, we reduced the dataset size using the advanced KDE-based 211
downsampling option in CAFE to downsample data to 35,000 cells per sample for a total of 212
350,000 cells and 12.25M data points (number of cells multiplied by number of markers). This is 213
an optional step prior to data processing. After loading the files, the app processed (7.8 sec) and 214
combined the expression data and metadata without errors to create an AnnData object. 215
Dimension reduction and batch effects 216
After generating the AnnData object (h5ad file), we selected dimension reduction using PCA 217
with SVD solver set to auto and retained components with 95% variance. The app ran PCA 218
(2.16 sec), kept 12 components and generated (PC1 x PC2) graphs by Groups. Depending on 219
the data size, a user can choose from auto, full, arpack, and randomized SVD solver. 220
Randomized, for example, is better suited for larger datasets as it provides a balance between 221
speed and accuracy. For batch correction, we applied ComBat (1.06 sec) and proceeded to 222
Leiden clustering. 223
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted December 7, 2024. ; https://doi.org/10.1101/2024.12.03.626714doi: bioRxiv preprint
Leiden clustering and metaclustering 224
For the dataset, we applied Leiden resolution of 1.0 with flavor set to iGraph, UMAP 225
n_neighbors to 15, min_dist to 0.1 and distance calculation method as Euclidean. A user has the 226
option to use a slider control to choose from resolution values 0.01 to 2.0. To find the optimal 227
resolution, we initially made use of Advanced Cluster Evaluation option in CAFE to generate a 228
series of AnnData files with varied Leiden resolution and n_neighbor values and observed the 229
UMAPs to find distinct clusters that are biologically meaningful for the dataset. With Leiden 230
resolution of 1.0, we initially obtained a total of 30 clusters for the PBMC dataset which took 231
11.5 minutes for calculation. Once clustering was completed, CAFE generated a frequency table 232
of each sample by Leiden cluster for the number of cells, frequency of cells, and median 233
fluorescence intensity of each marker for each cluster. Using these 3 tables, users can perform 234
statistical analyses to compare cluster count and frequency by groups and expression of marker 235
proteins within clusters by group. Using the Advanced Cluster Merging option, we merged the 236
subclusters with similar profile into corresponding metaclusters resulting in a new total of 16 237
clusters. 238
239
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted December 7, 2024. ; https://doi.org/10.1101/2024.12.03.626714doi: bioRxiv preprint
240
Figure 2: Profiling of Human PBMCs Reveal s Distinct Immune Subpopulations and 241
Marker Expression Patterns. (a) UMAP plots showing selected marker expression intensities 242
across all cells in the UMAP space to highlight lineage-specific marker distribution. (b) Dot plot 243
of all marker expression across all identified PBMC cell types. Dendrogram highlighted distinct 244
marker-based groupings. (c) UMAP visualization showing 16 distinct clusters with annotated cell 245
types including Naive CD4 and CD8 T cells, central memory CD4 and CD8 T cells (Tcm), 246
effector memory CD8 T cells (Tem), terminally differentiated effector memory CD8 T cells 247
(Temra), mucosal-associated invariant (MAIT) T cells, classical monocytes (cMO), intermediate 248
monocytes (iMO), B cells, NK cells, gamma delta ( γδ ) T cells (Tgd), conventional dendritic cell 249
(cDC) and plasmacytoid dendritic cell (pDC). 250
Characterization of PBMC Subpopulations 251
To characterize the phenotypic properties, we examined the expression of surface markers for 252
each identified cluster by protein expression UMAP plots (Figure 2a). We found high CD3 253
expression in T cells clusters with CD4 and CD8 expression showing corresponding T cell 254
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted December 7, 2024. ; https://doi.org/10.1101/2024.12.03.626714doi: bioRxiv preprint
subtypes. CD8+ effector memory (Tem) and central memory (Tcm) subsets were differentiated 255
by high CCR7 and CD27 expression in Tcm. We used CD45RA expression to identify terminally 256
differentiated Tem cells (Temra). Monocyte clusters were identified by CD14 and CD16 257
expression, distinguishing classical monocytes (cMO) from other monocyte subsets, while 258
natural killer (NK) cells showed high levels of CD56, corresponding with CD56 Bright NK cells. 259
Based on shared marker profiles and hierarchical ranking, T cell subsets (Tem, Tcm, and Temra) 260
formed a distinct grouping separate from B cells and myeloid-derived cells, reflecting the 261
differential expression of lineage-specific markers. 262
Based on the expression profiles of marker proteins, we annotated the clusters using CAFE’s 263
Advanced Annotation tab and classified them into 16 distinct cell types. We also used the 264
dotplot to confirm annotations of the cell types (Figure 2b). For instance, the B cell-specific 265
marker CD19 and CD20 were used to identify the B cell cluster, the CD14 marker to identify 266
monocytes, and the CD16 marker to identify NK cells. High CD20 expression in B cell cluster 267
indicated their mature stage in immune response. Our annotated UMAP (Figure 2c ) shows well-268
defined clusters that correspond to PBMC lineages, including Naive CD4+ and CD8+ T cell and 269
Tcm for both CD4+ and CD8+ subsets. We also identified Tem and Temra cells, as well as 270
mucosal associated invariant CD8+ T cell (MAIT). We also identified cMO, intermediate 271
monocytes (iMO), B cells, NK cells, Gamma delta (γδ ) T cells (Tgd) and dendritic cell (DC) types 272
(cDC and pDC). 273
274
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted December 7, 2024. ; https://doi.org/10.1101/2024.12.03.626714doi: bioRxiv preprint
275
Figure 3: Comparative Analysis of Immune Cell Subpopulations in Healthy and COVID-19 276
Individuals. (a) UMAP plots displaying distinct clustering patterns and differential distribution of 277
cell types in PBMC across healthy and COVID-19 group. ( b) Sankey diagram illustrates the 278
distribution of cells across groups, with thicker flow indicating more cells. (c) Composite bar-strip 279
plot summarizing cell count distribution across cell subpopulations. Dots represent each 280
individual samples colored by group. ( d) Stacked bar chat showing distribution of cells in 281
percentage across two groups. ( e) Effect size calculated using Cohen’s d indicating changes in 282
the number of cells in COVID-19 compared to reference healthy control. ( f) Comparison of 283
individual cell type frequencies between healthy and COVID-19 groups with p-values for 284
statistical significance. N=9/group 285
286
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted December 7, 2024. ; https://doi.org/10.1101/2024.12.03.626714doi: bioRxiv preprint
Distinct Cellular and Molecular Signatures Observed in COVID-19 Compared to Healthy 287
Controls 288
UMAP analysis of COVID-19 hospitalized patients compared to healthy controls revealed 289
distinct clustering patterns between groups, particularly among monocytes, NK cells, and CD8 T 290
cells (Figure 3a). To understand changes between the two groups, CAFE offers varied 291
visualization options, for instance, we used a Sankey diagram to demonstrate that MAIT cells 292
and Tgd are much less abundant in COVID-19 patients compared to healthy controls (Figure 293
3b). We also found that CD8 Tcm and B cells were significantly expanded in COVID-19 patients. 294
A composite bar-strip plot also demonstrates the distribution of cells in frequency where each 295
dot represented each sample colored by specific group (Figure 3c). The total number of cells in 296
iMO were largely reduced in COVID-19 patients compared to healthy controls (Figure 3d). 297
These data may indicate a possible shift from an innate response towards an adaptive 298
response. To quantify the effect size of changes observed, we compared cell types within the 299
COVID-19 group to healthy controls as a reference and found changes in naive CD8, TMAIT, 300
and iMO cells have a larger effect size, demonstrating a bigger difference between the two 301
groups (Figure 3e). We further compared these cell types by plotting box plots for individual cell 302
types (Figure 3f) which demonstrated a non-statistically significant increase in Tcm CD8 303
(p=0.0777) and statistically significant decrease in naive CD8 cells (p=0.0372) in COVID-19 304
patients compared to healthy controls. 305
306
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted December 7, 2024. ; https://doi.org/10.1101/2024.12.03.626714doi: bioRxiv preprint
307
Figure 4: Marker expression and distribution differences between COVID-19 and healthy 308
individuals. (a) Violin plot showed the median expression levels of CD161 across immune cell 309
subtypes. (b) Sankey diagram illustrated marker expression in CD8+ T cells, with thicker flows 310
indicating more cells expressing that marker. ( c) Bar chart showed median expression of 311
markers across all cell types between COVID-19 and healthy individuals where positive values 312
indicated upregulation. Box plots displayed (d) the number of cells expressing IgG, IgM and IgD 313
in B cells, and (e) the number of CD8+ T cells expressing CD25, HLA-DR, and CD38. 314
315
Altered Marker Expression Profiles in COVID-19 Patients 316
CAFE outputs a file with expression data for all markers on each cluster by sample for use in 317
external plotting and statistical software. In addition, further exploration of specific markers 318
within clusters can be performed within CAFE. We used this approach to identify that MAIT cells 319
have the greatest expression of CD161, as shown by violin plot (Figure 4a). We also examined 320
marker distribution in CD8 T cell populations and found that more cells within the Tem CD8 321
population appeared activated based on greater HLA-DR, CXCR3, and CCR5 expression 322
compared to other CD8 T cell subsets (Figure 4b). We found that median expression of CD8, 323
CD14, CD11b, IgG were increased in COVID-19 patients compared to healthy controls across 324
all clusters (Figure 4c). These reflect the overall differences in some of the cell populations we 325
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted December 7, 2024. ; https://doi.org/10.1101/2024.12.03.626714doi: bioRxiv preprint
observed between groups. We examined marker expression within the B cell cluster and found 326
that more B cells expressed IgG in COVID-19 samples compared to healthy controls although 327
the difference was not statistically significant (p=0.1631), while IgD expression was statistically 328
significantly reduced (p=0.0162) in COVID-19 samples compared to healthy controls (Figure 329
4d). We also examined activation markers in CD8+ T cells and found that COVID-19 patients 330
had more CD8+ T cells that expressed CD25 and HLA-DR than healthy controls (Figure 4e). 331
332
333
Discussion
334
335
As research in immunology increasingly relies on high-dimensional cytometric data, there is a 336
growing need for a user-friendly analysis tools for everyday use. Here, we present CAFE as a 337
free and open-source tool designed to address the analytical and accessibility issues posed by 338
SFCM data. CAFE uses a GUI and interactive controls to enable immunologists to analyze 339
complex data without needing specialized coding knowledge. 340
341
Using a jupyter notebook, we have previously shown the ability of Scanpy’s Leiden function 342
(scanpy.tl.leiden) to analyze a 50-color human PBMC dataset 15. CAFE acts as a wrapper 343
combining packages within Streamlit to provide an interactive web interface, offering more 344
accessible and extensive functionality than Jupyter notebooks’ CLI. We demonstrated analysis 345
of a 35-color human PBMC dataset using CAFE in this study with 350,000 cells and 12.25M 346
data points. Major steps including data processing, dimension reduction, batch correction and 347
Leiden clustering were completed in under 12 minutes using an Apple M3 Pro laptop. In our 348
analysis, we observed COVID-19 patients with altered immune cell distributions and marker 349
expression profiles, consistent with prior findings, as well as MAIT cells expressing high CD161 350
and B cells expressing high CD2021. 351
While developing CAFE, we have balanced compatibility and performance and included many 352
options for customization of how the code processes and analyzes data while integrating default 353
options and tooltips to help guide users. Our implementation of Pandas with PyArrow 354
significantly improves processing speed over Pandas alone. However, transitioning to Polars’ 355
lazy evaluation framework could further speed up processing once compatibility issues between 356
ARM and x86 machines are resolved. CAFE’s web app design and functionality also revolved 357
around simplicity as we de-emphasized features that are not commonly used in order to 358
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted December 7, 2024. ; https://doi.org/10.1101/2024.12.03.626714doi: bioRxiv preprint
streamline user-experience. For data dimension reduction, we adhered to a convention of using 359
PCA as the primary method and using PCA-reduced representations for constructing UMAP 360
neighbor graphs, as opposed to utilizing UMAP directly for primary dimensionality reduction. 361
Although PCA is designed for linear data, it effectively reduces noise and enhances clustering 362
performance. Users have the choice to skip PCA to perform Leiden clustering on the raw data 363
or use UMAP embeddings (i.e., X_umap) to use UMAP reduced data for clustering. One viable 364
alternative to PCA for non-linear data structure is Kernel PCA, but we have skipped adding the 365
kernel PCA option in the CAFE workflow because it may not be practical since it is 366
computationally taxing. 367
UMAP parameters and Leiden resolution largely influence the number of clusters for community 368
detection. Leiden clustering is performed on the graph structure, so evaluation of clustering 369
quality solely based on methods such as elbow or silhouette score is not ideal. Rather a 370
combined approach with prior biological knowledge can inform the most correct clustering 371
resolution. We recommend using CAFE’s advanced clustering evaluation tab to generate plots 372
with a range of varied UMAP parameters and Leiden resolutions for visual inspection. Using this 373
approach, users can select the most appropriate clustering resolution for each dataset. Since 374
this is an unsupervised algorithm, setting up an incorrect resolution can heavily skew the 375
interpretation of data. 376
Manual gating continues to be the gold standard in flow cytometry analysis, but it is limited by 377
sequentially drilling down into subsets of cells with 2-dimensional bi-axial gating. Thus, our goal 378
was to complement this hypothesis-driven approach with the unsupervised computational 379
algorithms. In this way, users can perform hypothesis-driven analysis with manual gating and 380
hypothesis-generating analysis with unsupervised clustering. Compared to other open-source 381
tools, CAFE provides a wide range of publication-ready visualization options. Due to its 382
underlying code in python, CAFE is highly scalable to datasets of millions of cells and takes 383
advantage of multi-threading to obtain higher performance. Previously, python-based Pytometry 384
and CRUSTY integrated unbiased clustering algorithms within their tools 22,23. Pytometry 385
incorporates the Leiden algorithm 10, which has been shown to be an improvement over the 386
predecessor Louvain algorithm 24; however, Pytometry requires coding using Python. CRUSTY 387
incorporates an easy-to-use GUI but it does not offer the Leiden algorithm and relies on 388
FlowSOM and Phenograph. Another limitation of CRUSTY is that the cloud-based service limits 389
users to analyzing 100,000 total events. There are also limited visualization and analysis 390
options in CRUSTY and they rely on most of the Phenograph and FlowSOM default settings 391
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted December 7, 2024. ; https://doi.org/10.1101/2024.12.03.626714doi: bioRxiv preprint
which cannot be customized. Cloud-based solutions may face limitations in availability, 392
scalability, and data security. Users may be prohibited from uploading data to cloud-based 393
systems that have protected health information due to HIPAA. Among a few other GUI based 394
tools, Cytoflow25, Floreada (floreada.io), EasyFlow 26 allow for flow cytometry data analysis but 395
do not offer clustering. FlowPy (flowpy.wikidot.com) allows for clustering but uses k-mean 396
clustering rather than the most advanced algorithms currently in use (i.e. Leiden). Additional 397
tools like terraFlow 27 and CellEngine (CellCarta, Montreal, Canada) are for-profit spectral flow 398
cytometry analysis softwares; however, the price of these may be restrictive for many users. 399
Finally, FlowJo is a staple for many immunologi sts and has some native clustering capabilities. 400
It also supports plugins for additional clustering algorithms, but these add-ons do not offer much 401
customization in the clustering parameters. 402
403
CAFE, while addressing many of these limitations as an open-source alternative, has its own 404
practical considerations. CAFE is intended to be run locally which requires installing a package 405
manager such as Pixi or Anaconda/Miniconda3 through terminal. Performance is also 406
dependent on the user’s machine. For larger dat asets, we recommend utilizing our provided 407
scripts in Github to run data processing step through an HPC cluster by allocating more RAM. 408
Once the user has Anndata file (.h5ad file) generated with cluster information, all the 409
downstream analysis and figure generation steps become significantly less computationally 410
demanding. Ultimately, CAFE’s aim is to become a secure, scalable, and open-source platform 411
accessible to a broad range of researchers to run complex analyses through a simple intuitive 412
graphical user interface. 413
414
Acknowledgements
This work was supported by the National Institutes of Health/National 415
Institute on Aging Grant R00 AG068309 (to D.J.T.); Nathan Shock Center, which is supported by 416
the National Institutes of Health/National Institute on Aging Grant P30 AG050886; and this 417
research was conducted while (Daniel Tyrrell) was an AFAR Grant for Junior Faculty awardee 418
A24063 (to D.J.T.). 419
420
Data availability: The raw data are publicly available at FlowRepository: FR-FCM-Z3WR. The 421
downsampled .csv files are available from Figshare: 27940719. The processed Anndata object 422
after dimensionality reduction and clustering is also available from Figshare: 27940752. 423
424
Code availability: CAFE is freely available at https://github.com/mhbsiam/cafe. 425
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted December 7, 2024. ; https://doi.org/10.1101/2024.12.03.626714doi: bioRxiv preprint
426
Contributions: M.H.B.S. and D.J.T. conceived the software development. M.H.B.S. created the 427
software and web application; M.H.B.S. analyzed the data. M.H.B.S. and D.J.T. wrote the paper. 428
All authors reviewed and edited the paper. 429
430
Corresponding author: Correspondence to Daniel J. Tyrrell at
[email protected]. 431
432
Ethics declarations: The authors declare they have no competing interests. 433
434
References
435
436
1. McKinnon, K. M. Flow Cytometry: An Overview. Current Protocols in Immunology 120, 437
5.1.1-5.1.11 (2018). 438
2. Nolan, J. P. The evolution of spectral flow cytometry. Cytometry Part A 101, 812–817 (2022). 439
3. Konecny, A. J., Mage, P. L., Tyznik, A. J., Prlic, M. & Mair, F. OMIP-102: 50-color 440
phenotyping of the human immune system with in-depth assessment of T cells and dendritic 441
cells. Cytometry Part A 105, 430–436 (2024). 442
4. Tyrrell, D. J. et al. Clonally expanded memory CD8+ T cells accumulate in atherosclerotic 443
plaques and are pro-atherogenic in aged mice. Nat Aging 3, 1576–1590 (2023). 444
5. Na, S., Choo, Y ., Yoon, T. H. & Paek, E. CyGate Provides a Robust Solution for Automatic 445
Gating of Single Cell Cytometry Data. Anal Chem 95, 16918–16926 (2023). 446
6. Van Gassen, S. et al. FlowSOM: Using self-organizing maps for visualization and 447
interpretation of cytometry data. Cytometry A 87, 636–645 (2015). 448
7. Samusik, N., Good, Z., Spitzer, M. H., Davis, K. L. & Nolan, G. P. Automated mapping of 449
phenotype space with single-cell data. Nat Methods 13, 493–496 (2016). 450
8. Qiu, P. et al. Extracting a cellular hierarchy from high-dimensional cytometry data with 451
SPADE. Nat Biotechnol 29, 886–891 (2011). 452
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted December 7, 2024. ; https://doi.org/10.1101/2024.12.03.626714doi: bioRxiv preprint
9. Levine, J. H. et al. Data-Driven Phenotypic Dissection of AML Reveals Progenitor-like Cells 453
that Correlate with Prognosis. Cell 162, 184–197 (2015). 454
10. Traag, V. A., Waltman, L. & van Eck, N. J. From Louvain to Leiden: guaranteeing well-455
connected communities. Sci Rep 9, 5233 (2019). 456
11. Hao, Y . et al. Integrated analysis of multimodal single-cell data. Cell 184, 3573-3587.e29 457
(2021). 458
12. Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data 459
analysis. Genome Biology 19, 15 (2018). 460
13. Rich, J. M. et al. The impact of package selection and versioning on single-cell RNA-seq 461
analysis. bioRxiv 2024.04.04.588111 (2024) doi:10.1101/2024.04.04.588111. 462
14. McInnes, L., Healy, J., Saul, N. & Großberger, L. UMAP: Uniform Manifold Approximation 463
and Projection. Journal of Open Source Software 3, 861 (2018). 464
15. Vardaman, D. et al. Development of a Spectral Flow Cytometry Analysis Pipeline for High-465
dimensional Immune Cell Characterization. J Immunol 213, 1713–1724 (2024). 466
16. Burton, R. J. et al. CytoPy: An autonomous cytometry analysis framework. PLoS Comput 467
Biol 17, e1009071 (2021). 468
17. Streamlit A faster way to build and share data apps. https://streamlit.io/ (2021). 469
18. Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression data 470
using empirical Bayes methods. Biostatistics 8, 118–127 (2007). 471
19. Csardi, G. & Nepusz, T. The Igraph Software Package for Complex Network Research. 472
InterJournal Complex Systems, 1695 (2005). 473
20. Silverman, B. W. Density Estimation for Statistics and Data Analysis. (Routledge, New York, 474
2017). doi:10.1201/9781315140919. 475
21. Yu, C. et al. Mucosal-associated invariant T cell responses differ by sex in COVID-19. Med 476
2, 755- 772.e5 (2021). 477
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted December 7, 2024. ; https://doi.org/10.1101/2024.12.03.626714doi: bioRxiv preprint
22. Büttner, M., Hempel, F., Ryborz, T., Theis, F. J. & Schultze, J. L. Pytometry: Flow and mass 478
cytometry analytics in Python. 2022.10.10.511546 Preprint at 479
https://doi.org/10.1101/2022.10.10.511546 (2022). 480
23. Puccio, S. et al. CRUSTY: a versatile web platform for the rapid analysis and visualization of 481
high-dimensional flow cytometry data. Nat Commun 14, 5102 (2023). 482
24. Blondel, V. D., Guillaume, J.-L., Lambiotte, R. & Lefebvre, E. Fast unfolding of communities 483
in large networks. J. Stat. Mech. 2008, P10008 (2008). 484
25. Teague, B. Cytoflow: A Python Toolbox for Flow Cytometry. 2022.07.22.501078 Preprint at 485
https://doi.org/10.1101/2022.07.22.501078 (2022). 486
26. Ma, Y ., Eizenberg-Magar, I. & Antebi, Y . EasyFlow: An open-source, user-friendly cytometry 487
analyzer with graphic user interface (GUI). PLOS ONE 19, e0308873 (2024). 488
27. Freeman, D. et al. terraFlow, a high-parameter analysis tool, reveals T cell exhaustion and 489
dysfunctional cytokine production in classical Hodgkin’s lymphoma. Cell Reports 43, (2024). 490
491
Figure Legends: 492
Figure 1: The flowchart outlines steps and components of CAFE’s workflow. 493
Preprocessing includes compensation, data scaling/transformation using a standard FCM 494
software and scaled CSV files are then exported and renamed as Sample_Group.csv. Data 495
processing performs error checks and concatenation of CSV file s into an AnnData object/H5AD 496
file. Major steps requiring user input include dimension reduction, batch correction, UMAP 497
(UMAP Uniform Manifold Approximation and Projection) and community detection. Outputs are 498
downloadable as CSV, H5AD, PNG, JPG, SVG and PDF files. 499
Figure 2: Profiling of Human PBMCs Reveal s Distinct Immune Subpopulations and 500
Marker Expression Patterns. (a) UMAP plots showing selected marker expression intensities 501
across all cells in the UMAP space to highlight lineage-specific marker distribution. (b) Dot plot 502
of all marker expression across all identified PBMC cell types. Dendrogram highlighted distinct 503
marker-based groupings. (c) UMAP visualization showing 16 distinct clusters with annotated cell 504
types including Naive CD4 and CD8 T cells, central memory CD4 and CD8 T cells (Tcm), 505
effector memory CD8 T cells (Tem), terminally differentiated effector memory CD8 T cells 506
(Temra), mucosal-associated invariant (MAIT) T cells, classical monocytes (cMO), intermediate 507
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted December 7, 2024. ; https://doi.org/10.1101/2024.12.03.626714doi: bioRxiv preprint
monocytes (iMO), B cells, NK cells, gamma delta ( γδ ) T cells (Tgd), conventional dendritic cell 508
(cDC) and plasmacytoid dendritic cell (pDC). 509
Figure 3: Comparative Analysis of Immune Cell Subpopulations in Healthy and COVID-19 510
Individuals. (a) UMAP plots displaying distinct clustering patterns and differential distribution of 511
cell types in PBMC across healthy and COVID-19 group. ( b) Sankey diagram illustrates the 512
distribution of cells across groups, with thicker flow indicating more cells. (c) Composite bar-strip 513
plot summarizing cell count distribution across cell subpopulations. Dots represent each 514
individual samples colored by group. ( d) Stacked bar chat showing distribution of cells in 515
percentage across two groups. ( e) Effect size calculated using Cohen’s d indicating changes in 516
the number of cells in COVID-19 compared to reference healthy control. ( f) Comparison of 517
individual cell type frequencies between healthy and COVID-19 groups with p-values for 518
statistical significance. N=9/group. 519
Figure 4: Marker expression and distribution differences between COVID-19 and healthy 520
individuals. (a) Violin plot showed the median expression levels of CD161 across immune cell 521
subtypes. (b) Sankey diagram illustrated marker expression in CD8+ T cells, with thicker flows 522
indicating more cells expressing that marker. ( c) Bar chart showed median expression of 523
markers across all cell types between COVID-19 and healthy individuals where positive values 524
indicated upregulation. Box plots displayed (d) the number of cells expressing IgG, IgM and IgD 525
in B cells, and (e) the number of CD8+ T cells expressing CD25, HLA-DR, and CD38. 526
527
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted December 7, 2024. ; https://doi.org/10.1101/2024.12.03.626714doi: bioRxiv preprint
CSV files Data
Processing
Dimension
reduction Run PCA
H5AD file
Batch
correction
No
Yes
Community
detection
Run ComBat
Export files
KNN graphUMAP
embeddingSNN graph UMAP
X_pca
X_umap
Use X_pca, X_umap or none
Yes
No
CSV files
Counts
Frequency
Visualization
UMAP Plots
Bar Plots
Dot Plots
KDE Plots
Violin Plots
Sankey Plots
Heatmaps
Statistics
Annotate Clusters
Merge Clusters
Cluster evaluation
Downsampling
Preprocessing
optional
optional
optional
Violin Plots
Sankey Plots
Heatmaps
Box Plots
Strip Plots
Dendrogram
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted December 7, 2024. ; https://doi.org/10.1101/2024.12.03.626714doi: bioRxiv preprint
c
b
a
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted December 7, 2024. ; https://doi.org/10.1101/2024.12.03.626714doi: bioRxiv preprint
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted December 7, 2024. ; https://doi.org/10.1101/2024.12.03.626714doi: bioRxiv preprint
NK cell
Tcm CD4
Tem CD8
Naive CD4
Tem CD4
Temra CD8
B cell
Tcm CD8
TMAIT
Tgd cell
pDC
cDC
Naive CD8
cMO
Basophil
iMO
−20k
0
20k
40k
60k
Violin Plot of CD161 Expression across cell_type
cell_type
CD161 Expression
CD8 CD11b CD14 HLA-DR IgG IgM CD38 CD25 CD161
−2000
−1000
0
1000
2000
3000
4000
5000
Group
COVID19
Healthy
Median Expression of Markers Among Groups
Markers
Median Expression
Healthy COVID19
1,300
1,400
1,500
1,600
1,700
1,800
Healthy COVID19
5,000
10,000
15,000
Healthy COVID19
2,500
3,000
3,500
4,000
4,500
5,000
Marker Expression Comparisons
CD25 HLA-DR CD38
p-value = 0.1917 p-value = 0.3504 p-value = 0.6506
Healthy COVID19
4,000
6,000
8,000
Healthy COVID19
10,000
20,000
30,000
40,000
50,000
60,000
70,000
Healthy COVID19
20,000
40,000
60,000
80,000
100,000
Marker Expression Comparisons
IgG IgM IgD
p-value = 0.1631 p-value = 0.2759 p-value = 0.0182
T
T
HLA-DR
CXCR3
CCR5
CCR7
CXCR5
CCR6
Tem CD8
Naive CD8
TMAIT
Tcm CD8
Temra CD8
a b
c d
e
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted December 7, 2024. ; https://doi.org/10.1101/2024.12.03.626714doi: bioRxiv preprint
Text is read by the "Ask this paper" AI Q&A widget below.
Extraction quality varies by source — PMC NXML preserves structure
cleanly, OA-HTML may include some navigation residue, and OA-PDF can
have broken hyphenation. The publisher copy
(via DOI)
is the canonical version.