CAFE: An Integrated Web App for High-Dimensional Analysis and Visualization in Spectral Flow Cytometry

preprint OA: closed CC-BY-NC-ND-4.0
📄 Open PDF Full text JSON View at publisher
Full text 53,466 characters · extracted from oa-pdf · 8 sections · click to expand

Abstract

20 Spectral flow cytometry provides greater insights into cellular heterogeneity by simultaneous 21 measurement of up to 50 markers. However, analyzing such high-dimensional (HD) data is 22 complex through traditional manual gating strategy. To address this gap, we developed CAFE as 23 an open-source Python-based web application with a graphical user interface. Built with 24 Streamlit, CAFE incorporates libraries such as Scanpy for single-cell analysis, Pandas and 25 PyArrow for efficient data handling, and Matplotlib, Seaborn, Plotly for creating customizable 26 figures. Its robust toolset includes density-based down-sampling, dimensionality reduction, 27 batch correction, Leiden-based clustering, cluster merging and annotation. Using CAFE, we 28 demonstrated analysis of a human PBMC dataset of 350,000 cells identifying 16 distinct cell 29 clusters. CAFE can generate publication-ready figures in real time via interactive slider controls 30 and dropdown menus, eliminating the need for coding expertise and making HD data analysis 31 accessible to all. CAFE is licensed under MIT and is freely available at 32 https://github.com/mhbsiam/cafe. 33 34

Keywords

Cytometry, Bioinformatics, Protocol, Leiden, 35 Word count: 3986 36 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted December 7, 2024. ; https://doi.org/10.1101/2024.12.03.626714doi: bioRxiv preprint 37

Introduction

38 Flow cytometry is a widely used technique in immunology to identify and quantify immune cells 39 based on specific surface markers 1. The development of spectral flow cytometry (SFCM) has 40 further expanded immunophenotyping capabilities allowing the simultaneous analysis of a 41 greater number of parameters through the complete emission spectra of fluorophores 1. 42 Compared to conventional flow cytometry, SFCM uses spectral unmixing algorithms to 43 deconvolute the overlapping signals and achieves enhanced resolution and sensitivity to 44 distinguish between different cell populations 2. SFCM can incorporate broader range of 45 antibodies with up to 50 colors in a single panel improving upon the conventional FCM where 46 the number of parameters is limited by the instrument constraints 3. Incorporating more 47 parameters substantially increases the complexity in gating strategy which largely relies on 48 established convention and prior knowledge 4,5. Additional gating steps and combinations of 49 markers used to subset cells complicate the interpretation of such high-dimensional data. 50 Several clustering methods are available to identify cell populations such as FlowSOM 6, xShift7, 51 SPADE8 and Phenograph 9. SPADE and FlowSOM utilize hierarchical clustering with the latter 52 employing self-organizing maps (SOMs) to cluster cells, whereas xShift detects clusters based 53 on shifts in local cell density6–8. Phenograph, by contrast, constructs a K-nearest neighbor graph 54 and applies the Louvain algorithm to identify cell clusters, but Louvain can produce poorly 55 connected or disconnected communities9,10. 56 Recently, the Leiden clustering algorithm has emerged as a faster and more accurate 57 alternative to improve community detection in networks 10. Single-cell RNA sequencing (scRNA-58 seq) tools: Seurat11 (R) and Scanpy12 (Python) have integrated Leiden algorithms for community 59 detection. However, running Leiden within Seurat resulted in drawbacks including higher 60 memory usage, longer calculation time and random crashes in docker containers 13. Scanpy 61 resolves these issues, and unlike Seurat, Scanpy improves visualization quality by using 62 consistent KNN and SNN graphs for both clustering and uniform manifold approximation and 63 projection (UMAP) 13,14. In February 2020, Phenograph version 1.5.3 was released, which 64 incorporated an option to use Leiden for clustering; however, the default parameter is set to 65 Louvain through the latest release (v.1.5.7). In our previous work, we showed that the use of 66 Leiden algorithm in community detection for SFCM data provides superior result to Phenograph 67 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted December 7, 2024. ; https://doi.org/10.1101/2024.12.03.626714doi: bioRxiv preprint (Louvain), FlowSOM, and xShift 15. Currently, there is a scarcity of open-source tools to utilize 68 Leiden algorithm for SCFM data analysis16. 69 Here we present CAFE, Cell Analyzer for Flow Experiment, a user-friendly, web application 70 developed in Python that works across Windows, MacOS, and Linux. The app is lightweight and 71 can perform high-dimensional SFCM data analysis using a standard computing machine (i.e., 72 Apple M1 chip with 16gb RAM), and it provides the flexibility to be deployed on HPC clusters for 73 enhanced scalability. Once installed, the tool runs entirely offline and does not require an active 74 internet connection to load files. This also enables users to maintain compliance with data 75 security, especially with protected health information (PHI) when analyzing patient samples. 76 CAFE can be used to process data, reduce dimension, batch correction, run Leiden clustering, 77 perform statistics and generate a wide range of figures. Figures can be adjusted and viewed 78 within the tool in real time. Additionally, the tool offers Kernel Density Estimation (KDE)-based 79 data downsampling, advanced clustering with predefined markers, cluster quality evaluation, 80 merging subclusters into metaclusters, and cell type annotation. Designed as an open-source, 81 interactive data analysis platform, CAFE enables biologists with no-coding experience to 82 analyze SFCM data and create publication quality visualization with customizable parameters. 83 CAFE is freely available to download at: https://github.com/mhbsiam/cafe 84 85

Methods

86 Implementation 87 The CAFE webtool was developed using Python programming language due to its compatibility 88 with Scanpy library 12. Figure 1 illustrates the components and workflows of CAFE. Streamlit 17 89 library was used to develop the web interface that provides dynamic updates based on user 90 inputs in the graphical user interface (GUI) without writing or editing code directly. Streamlit was 91 chosen due to its simplicity in development and compatibility with other Python libraries across 92 operating systems. Streamlit v1.39.0 is compatible with any modern HTML5 web browser. For 93 data loading and processing, we relied on Pandas v2.2.3 with PyArrow v18.0.0 which achieves 94 faster data loading and processing compared to Pandas alone. We used NumPy v1.26.4 for 95 data type and range selection, RGB array creation for color handling, and grid setup for 96 subplots. Seaborn v0.13.2, Matplotlib v3.9.2, and Plotly v4.24.1 libraries were used for data 97 visualization and users are provided with options to adjust parameters: plot size, color profile, 98 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted December 7, 2024. ; https://doi.org/10.1101/2024.12.03.626714doi: bioRxiv preprint and output formats in PNG, JPG, SVG, or PDF. CAFE integrates AnnData 12, a widely used 99 framework in single-cell RNA sequencing analysis that allows for efficient storage and 100 manipulation of both sparse and dense matrices along with metadata. CAFE outputs an 101 AnnData object which can be used outside of CAFE if users wish to deposit their data with 102 analysis or perform custom analyses using other tools. 103 The Scanpy library was used to perform key analyses including dimension reduction, batch 104 correction, and Leiden clustering. The user has the option to reduce dimension (sc.tl.pca) of the 105 data through Principal Component Analysis (PCA) or skip it. PCA is a linear dimensionality 106 reduction technique that retains the global structure of the data by capturing the variances 107 across all dimensions. Because the app performs PCA through Scanpy library, by default, the 108 number of components retained is limited to the lesser of the two values: the number of cells or 109 the number of markers. Also, the Singular Value Decomposition (SVD) solver was set to “auto” 110 that chooses the most appropriate solver based on the size of the dataset; however, users have 111 options to set a percentage of variances they want to retain, and the type of solver used. The 112 reduced dataset is stored and can be further processed for batch correction (sc.pp.combat) 113 using ComBat (Combined Batch) 18. This is particularly useful if a user has collected samples in 114 different batches as the algorithm standardizes the data by making it comparable and removing 115 unwanted variability. 116 To group cells into distinct clusters based on marker expression profiles, Leiden clustering is run 117 (sc.tl.leiden) and users can select either ‘iGraph’ or ‘leidenalg’ algorithm flavor 10,19. To define 118 clustering resolution, a user can choose from 0.1 to 2.0 where the lowest value provides the 119 lowest number of clusters. The user can fine tune Leiden calculation by altering the number of 120 neighbors and minimum distance values in Uniform Manifold Approximation and Projection 121 (UMAP) calculation within the app. CAFE generates AnnData object (H5AD) file, CSV outputs, 122 and visual outputs including UMAP plots, dot-heatmaps, expression pots, and barplots as high-123 resolution images and provides download buttons to save them to a desired folder. The app 124 allows various visualization settings, with changes made and displayed immediately within the 125 app. The generated AnnData object (h5ad file) can be further used to perform a range of 126 statistical analyses. 127 The app includes advanced functionalities for clustering and cluster evaluation. Because setting 128 up appropriate values for Leiden resolution and UMAP parameters is central to obtaining quality 129 clustering results, a user can leverage CAFE’s Cluster Evaluation tab to generate multiple 130 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted December 7, 2024. ; https://doi.org/10.1101/2024.12.03.626714doi: bioRxiv preprint AnnData files with various combinations of these parameters and compare UMAP plots as well 131 as Silhouette score, Calinski-Harabasz score, Davies-Bouldin score, and Elbow method to 132 assess clustering results. Additionally, CAFE provides clustering with pre-selected markers, 133 merging subclusters into metaclusters, and annotation of clusters directly within the app. The 134 advanced Downsampling tab offers to downsample (e.g. 20,000 events per sample) data using 135 a PCA-KDE based method. This approach combines PCA and KDE (Kernel Density Estimation) 136 based algorithm from scipy.stats using Gaussian kernel function and silverman bandwidth 20. 137 KDE is applied to the PCA-transformed data to estimate the density of data points. Based on 138 these density estimates, the code probabilistically downsamples data, thus reducing sampling 139 bias and preserving original data distribution. This offers an informed approach compared to 140 simple random downsampling, and this can be used to filter out noise while retaining meaningful 141 biological information in a smaller dataset. 142 143 Statistical analysis 144 In the visualization tab under statistical analyses section, users can perform different statistical 145 tests and generate plots. The Shapiro-Wilk test from “scipy.stats” is used to determine if marker 146 expression within each group or cluster follows a normal distribution. Based on these results, 147 the app recommends either parametric (T-test) or non-parametric (Mann-Whitney U test) tests 148 using “scipy.stats”. For comparing multiple clusters, the app allows users to perform ANOVA 149 (scipy.stats.f_oneway) or Kruskal-Wallis (scipy.stats.kruskal) tests. To reduce statistical artifacts, 150 multiple testing correction is applied using the Benjamini-Hochberg False Discovery Rate (FDR) 151 through “statsmodels.stats.multitest”. Additionally, effect size measures are computed to 152 complement statistical p-values, with users able to choose between parametric tests (Cohen’s 153 d) or non-parametric tests (Cliff’s Delta). Cohen’s d is calculated using basic functions from 154 Numpy, while Cliff’s Delta is computed with the “cliffs_delta” package. To assess associations 155 between clusters and groups, we used Chi-square testing from “scipy.stats.chi2_contingency” 156 and contingency tables with “pandas.crosstab”. Residual calculations were displayed in 157 Streamlit as tables to help users understand which clusters are more prevalent within certain 158 groups. 159 160 Performance and reproducibility 161 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted December 7, 2024. ; https://doi.org/10.1101/2024.12.03.626714doi: bioRxiv preprint We have set a global setting for Scanpy (sc.settings.n_jobs = -1) to use all available CPU cores. 162 For advanced clustering, multi-threading was achieved using Python’s joblib library. Two other 163 libraries were used, watchdog v 5.0.2 and iGraph v 0.10.819. Watchdog helps in monitoring file 164 change events and improves performance of Streamlit by providing real-time feedback. iGraph 165 is designed to handle complex networks and graph operations and is used by Scanpy as part of 166 the Leiden clustering to perform graph operations. We recommend ‘iGraph’ over ‘leidenalg’ as 167 iGraph is implemented in C and achieves advantages in performance compared to high-level 168 interpreted languages such as Python. To export Plotly figures, we have used Kaleido engine 169 v0.2.1. We have tested the app with various datasets using an Apple M3 Pro System with 18GB 170 of random-access memory (RAM). CAFE is primarily intended to be used using local computer; 171 however, it can be scaled up using any High-Performance Computing (HPC) system that 172 supports an HTML5 web browser. We also provided scripts in our GitHub page to generate 173 AnnData with dimension reduction and Leiden clustering through command-line interface (CLI) 174 based HPC systems. 175 176 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted December 7, 2024. ; https://doi.org/10.1101/2024.12.03.626714doi: bioRxiv preprint 177 Figure 1: The flowchart outlines steps and components of CAFE’s workflow. 178 Preprocessing includes compensation, data scaling/transformation using a standard FCM 179 software and scaled CSV files are then exported and renamed as Sample_Group.csv. Data 180 processing performs error checks and concatenation of CSV files into an AnnData object/H5AD 181 file. Major steps requiring user input include dimension reduction, batch correction, UMAP 182 (UMAP Uniform Manifold Approximation and Projection) and community detection. Outputs are 183 downloadable as CSV, H5AD, PNG, JPG, SVG and PDF files. 184 185

Results

186 To demonstrate the functionality of the app, we have analyzed 35-color spectral flow cytometry 187 data (Publicly available at FlowRepository: FR-FCM-Z3WR) of human peripheral blood 188 mononuclear cells (PBMC) obtained from COVID-19 hospitalized patients and healthy 189 controls21. A total of 10 samples were analyzed with 5 from each group. For best practices, we 190 installed and ran CAFE through Pixi package manager. Users can also install and run the app 191 using Anaconda package manager as described in our Github documentation. Once initiated 192 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted December 7, 2024. ; https://doi.org/10.1101/2024.12.03.626714doi: bioRxiv preprint through a terminal (pixi run cafe, or python cafe.py), a web browser opens with the CAFE app at 193 localhost on port 8501. The default data loading limit is set to 3GB, but a user can change the 194 value from the cafe.py script if necessary. 195 Data processing 196 The uploaded public data were available as doublets-debris removed and CD45+ gated; so we 197 obtained the CSV files just by exporting scaled values from FlowJo v10.10.0. Data scaling is 198 generally recommended for high resolution clustering but there may be instances where users 199 may use raw values. Data can be similarly exported from other flow cytometry software such as 200 FCS Express. It is required that flow cytometry data have proper compensation. We recommend 201 manual inspection of flow cytometry data and removal of debris, dead cells, and doublets prior 202 to exporting the scaled files. A user can also gate on appropriate cell type and export the data to 203 obtain more focused clustering results. To streamline downstream analysis, we have 204 implemented a naming convention for the CSV files. Each CSV file name must begin with a 205 unique “SampleName” followed by “GroupName”, separated by an underscore; for instance, 206 “Sample01_Control.csv” and “Sample02_Treatment.csv”. After loading the data, the app will 207 import the required libraries and perform initial checks for data structure and incorporate 208 SampleID and GroupName into the dataframe based on the CSV file names. Within the 209 dataframe, rows containing any missing values are skipped and anomalies in data structure are 210 reported. In this study, we reduced the dataset size using the advanced KDE-based 211 downsampling option in CAFE to downsample data to 35,000 cells per sample for a total of 212 350,000 cells and 12.25M data points (number of cells multiplied by number of markers). This is 213 an optional step prior to data processing. After loading the files, the app processed (7.8 sec) and 214 combined the expression data and metadata without errors to create an AnnData object. 215 Dimension reduction and batch effects 216 After generating the AnnData object (h5ad file), we selected dimension reduction using PCA 217 with SVD solver set to auto and retained components with 95% variance. The app ran PCA 218 (2.16 sec), kept 12 components and generated (PC1 x PC2) graphs by Groups. Depending on 219 the data size, a user can choose from auto, full, arpack, and randomized SVD solver. 220 Randomized, for example, is better suited for larger datasets as it provides a balance between 221 speed and accuracy. For batch correction, we applied ComBat (1.06 sec) and proceeded to 222 Leiden clustering. 223 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted December 7, 2024. ; https://doi.org/10.1101/2024.12.03.626714doi: bioRxiv preprint Leiden clustering and metaclustering 224 For the dataset, we applied Leiden resolution of 1.0 with flavor set to iGraph, UMAP 225 n_neighbors to 15, min_dist to 0.1 and distance calculation method as Euclidean. A user has the 226 option to use a slider control to choose from resolution values 0.01 to 2.0. To find the optimal 227 resolution, we initially made use of Advanced Cluster Evaluation option in CAFE to generate a 228 series of AnnData files with varied Leiden resolution and n_neighbor values and observed the 229 UMAPs to find distinct clusters that are biologically meaningful for the dataset. With Leiden 230 resolution of 1.0, we initially obtained a total of 30 clusters for the PBMC dataset which took 231 11.5 minutes for calculation. Once clustering was completed, CAFE generated a frequency table 232 of each sample by Leiden cluster for the number of cells, frequency of cells, and median 233 fluorescence intensity of each marker for each cluster. Using these 3 tables, users can perform 234 statistical analyses to compare cluster count and frequency by groups and expression of marker 235 proteins within clusters by group. Using the Advanced Cluster Merging option, we merged the 236 subclusters with similar profile into corresponding metaclusters resulting in a new total of 16 237 clusters. 238 239 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted December 7, 2024. ; https://doi.org/10.1101/2024.12.03.626714doi: bioRxiv preprint 240 Figure 2: Profiling of Human PBMCs Reveal s Distinct Immune Subpopulations and 241 Marker Expression Patterns. (a) UMAP plots showing selected marker expression intensities 242 across all cells in the UMAP space to highlight lineage-specific marker distribution. (b) Dot plot 243 of all marker expression across all identified PBMC cell types. Dendrogram highlighted distinct 244 marker-based groupings. (c) UMAP visualization showing 16 distinct clusters with annotated cell 245 types including Naive CD4 and CD8 T cells, central memory CD4 and CD8 T cells (Tcm), 246 effector memory CD8 T cells (Tem), terminally differentiated effector memory CD8 T cells 247 (Temra), mucosal-associated invariant (MAIT) T cells, classical monocytes (cMO), intermediate 248 monocytes (iMO), B cells, NK cells, gamma delta ( γδ ) T cells (Tgd), conventional dendritic cell 249 (cDC) and plasmacytoid dendritic cell (pDC). 250 Characterization of PBMC Subpopulations 251 To characterize the phenotypic properties, we examined the expression of surface markers for 252 each identified cluster by protein expression UMAP plots (Figure 2a). We found high CD3 253 expression in T cells clusters with CD4 and CD8 expression showing corresponding T cell 254 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted December 7, 2024. ; https://doi.org/10.1101/2024.12.03.626714doi: bioRxiv preprint subtypes. CD8+ effector memory (Tem) and central memory (Tcm) subsets were differentiated 255 by high CCR7 and CD27 expression in Tcm. We used CD45RA expression to identify terminally 256 differentiated Tem cells (Temra). Monocyte clusters were identified by CD14 and CD16 257 expression, distinguishing classical monocytes (cMO) from other monocyte subsets, while 258 natural killer (NK) cells showed high levels of CD56, corresponding with CD56 Bright NK cells. 259 Based on shared marker profiles and hierarchical ranking, T cell subsets (Tem, Tcm, and Temra) 260 formed a distinct grouping separate from B cells and myeloid-derived cells, reflecting the 261 differential expression of lineage-specific markers. 262 Based on the expression profiles of marker proteins, we annotated the clusters using CAFE’s 263 Advanced Annotation tab and classified them into 16 distinct cell types. We also used the 264 dotplot to confirm annotations of the cell types (Figure 2b). For instance, the B cell-specific 265 marker CD19 and CD20 were used to identify the B cell cluster, the CD14 marker to identify 266 monocytes, and the CD16 marker to identify NK cells. High CD20 expression in B cell cluster 267 indicated their mature stage in immune response. Our annotated UMAP (Figure 2c ) shows well-268 defined clusters that correspond to PBMC lineages, including Naive CD4+ and CD8+ T cell and 269 Tcm for both CD4+ and CD8+ subsets. We also identified Tem and Temra cells, as well as 270 mucosal associated invariant CD8+ T cell (MAIT). We also identified cMO, intermediate 271 monocytes (iMO), B cells, NK cells, Gamma delta (γδ ) T cells (Tgd) and dendritic cell (DC) types 272 (cDC and pDC). 273 274 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted December 7, 2024. ; https://doi.org/10.1101/2024.12.03.626714doi: bioRxiv preprint 275 Figure 3: Comparative Analysis of Immune Cell Subpopulations in Healthy and COVID-19 276 Individuals. (a) UMAP plots displaying distinct clustering patterns and differential distribution of 277 cell types in PBMC across healthy and COVID-19 group. ( b) Sankey diagram illustrates the 278 distribution of cells across groups, with thicker flow indicating more cells. (c) Composite bar-strip 279 plot summarizing cell count distribution across cell subpopulations. Dots represent each 280 individual samples colored by group. ( d) Stacked bar chat showing distribution of cells in 281 percentage across two groups. ( e) Effect size calculated using Cohen’s d indicating changes in 282 the number of cells in COVID-19 compared to reference healthy control. ( f) Comparison of 283 individual cell type frequencies between healthy and COVID-19 groups with p-values for 284 statistical significance. N=9/group 285 286 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted December 7, 2024. ; https://doi.org/10.1101/2024.12.03.626714doi: bioRxiv preprint Distinct Cellular and Molecular Signatures Observed in COVID-19 Compared to Healthy 287 Controls 288 UMAP analysis of COVID-19 hospitalized patients compared to healthy controls revealed 289 distinct clustering patterns between groups, particularly among monocytes, NK cells, and CD8 T 290 cells (Figure 3a). To understand changes between the two groups, CAFE offers varied 291 visualization options, for instance, we used a Sankey diagram to demonstrate that MAIT cells 292 and Tgd are much less abundant in COVID-19 patients compared to healthy controls (Figure 293 3b). We also found that CD8 Tcm and B cells were significantly expanded in COVID-19 patients. 294 A composite bar-strip plot also demonstrates the distribution of cells in frequency where each 295 dot represented each sample colored by specific group (Figure 3c). The total number of cells in 296 iMO were largely reduced in COVID-19 patients compared to healthy controls (Figure 3d). 297 These data may indicate a possible shift from an innate response towards an adaptive 298 response. To quantify the effect size of changes observed, we compared cell types within the 299 COVID-19 group to healthy controls as a reference and found changes in naive CD8, TMAIT, 300 and iMO cells have a larger effect size, demonstrating a bigger difference between the two 301 groups (Figure 3e). We further compared these cell types by plotting box plots for individual cell 302 types (Figure 3f) which demonstrated a non-statistically significant increase in Tcm CD8 303 (p=0.0777) and statistically significant decrease in naive CD8 cells (p=0.0372) in COVID-19 304 patients compared to healthy controls. 305 306 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted December 7, 2024. ; https://doi.org/10.1101/2024.12.03.626714doi: bioRxiv preprint 307 Figure 4: Marker expression and distribution differences between COVID-19 and healthy 308 individuals. (a) Violin plot showed the median expression levels of CD161 across immune cell 309 subtypes. (b) Sankey diagram illustrated marker expression in CD8+ T cells, with thicker flows 310 indicating more cells expressing that marker. ( c) Bar chart showed median expression of 311 markers across all cell types between COVID-19 and healthy individuals where positive values 312 indicated upregulation. Box plots displayed (d) the number of cells expressing IgG, IgM and IgD 313 in B cells, and (e) the number of CD8+ T cells expressing CD25, HLA-DR, and CD38. 314 315 Altered Marker Expression Profiles in COVID-19 Patients 316 CAFE outputs a file with expression data for all markers on each cluster by sample for use in 317 external plotting and statistical software. In addition, further exploration of specific markers 318 within clusters can be performed within CAFE. We used this approach to identify that MAIT cells 319 have the greatest expression of CD161, as shown by violin plot (Figure 4a). We also examined 320 marker distribution in CD8 T cell populations and found that more cells within the Tem CD8 321 population appeared activated based on greater HLA-DR, CXCR3, and CCR5 expression 322 compared to other CD8 T cell subsets (Figure 4b). We found that median expression of CD8, 323 CD14, CD11b, IgG were increased in COVID-19 patients compared to healthy controls across 324 all clusters (Figure 4c). These reflect the overall differences in some of the cell populations we 325 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted December 7, 2024. ; https://doi.org/10.1101/2024.12.03.626714doi: bioRxiv preprint observed between groups. We examined marker expression within the B cell cluster and found 326 that more B cells expressed IgG in COVID-19 samples compared to healthy controls although 327 the difference was not statistically significant (p=0.1631), while IgD expression was statistically 328 significantly reduced (p=0.0162) in COVID-19 samples compared to healthy controls (Figure 329 4d). We also examined activation markers in CD8+ T cells and found that COVID-19 patients 330 had more CD8+ T cells that expressed CD25 and HLA-DR than healthy controls (Figure 4e). 331 332 333

Discussion

334 335 As research in immunology increasingly relies on high-dimensional cytometric data, there is a 336 growing need for a user-friendly analysis tools for everyday use. Here, we present CAFE as a 337 free and open-source tool designed to address the analytical and accessibility issues posed by 338 SFCM data. CAFE uses a GUI and interactive controls to enable immunologists to analyze 339 complex data without needing specialized coding knowledge. 340 341 Using a jupyter notebook, we have previously shown the ability of Scanpy’s Leiden function 342 (scanpy.tl.leiden) to analyze a 50-color human PBMC dataset 15. CAFE acts as a wrapper 343 combining packages within Streamlit to provide an interactive web interface, offering more 344 accessible and extensive functionality than Jupyter notebooks’ CLI. We demonstrated analysis 345 of a 35-color human PBMC dataset using CAFE in this study with 350,000 cells and 12.25M 346 data points. Major steps including data processing, dimension reduction, batch correction and 347 Leiden clustering were completed in under 12 minutes using an Apple M3 Pro laptop. In our 348 analysis, we observed COVID-19 patients with altered immune cell distributions and marker 349 expression profiles, consistent with prior findings, as well as MAIT cells expressing high CD161 350 and B cells expressing high CD2021. 351 While developing CAFE, we have balanced compatibility and performance and included many 352 options for customization of how the code processes and analyzes data while integrating default 353 options and tooltips to help guide users. Our implementation of Pandas with PyArrow 354 significantly improves processing speed over Pandas alone. However, transitioning to Polars’ 355 lazy evaluation framework could further speed up processing once compatibility issues between 356 ARM and x86 machines are resolved. CAFE’s web app design and functionality also revolved 357 around simplicity as we de-emphasized features that are not commonly used in order to 358 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted December 7, 2024. ; https://doi.org/10.1101/2024.12.03.626714doi: bioRxiv preprint streamline user-experience. For data dimension reduction, we adhered to a convention of using 359 PCA as the primary method and using PCA-reduced representations for constructing UMAP 360 neighbor graphs, as opposed to utilizing UMAP directly for primary dimensionality reduction. 361 Although PCA is designed for linear data, it effectively reduces noise and enhances clustering 362 performance. Users have the choice to skip PCA to perform Leiden clustering on the raw data 363 or use UMAP embeddings (i.e., X_umap) to use UMAP reduced data for clustering. One viable 364 alternative to PCA for non-linear data structure is Kernel PCA, but we have skipped adding the 365 kernel PCA option in the CAFE workflow because it may not be practical since it is 366 computationally taxing. 367 UMAP parameters and Leiden resolution largely influence the number of clusters for community 368 detection. Leiden clustering is performed on the graph structure, so evaluation of clustering 369 quality solely based on methods such as elbow or silhouette score is not ideal. Rather a 370 combined approach with prior biological knowledge can inform the most correct clustering 371 resolution. We recommend using CAFE’s advanced clustering evaluation tab to generate plots 372 with a range of varied UMAP parameters and Leiden resolutions for visual inspection. Using this 373 approach, users can select the most appropriate clustering resolution for each dataset. Since 374 this is an unsupervised algorithm, setting up an incorrect resolution can heavily skew the 375 interpretation of data. 376 Manual gating continues to be the gold standard in flow cytometry analysis, but it is limited by 377 sequentially drilling down into subsets of cells with 2-dimensional bi-axial gating. Thus, our goal 378 was to complement this hypothesis-driven approach with the unsupervised computational 379 algorithms. In this way, users can perform hypothesis-driven analysis with manual gating and 380 hypothesis-generating analysis with unsupervised clustering. Compared to other open-source 381 tools, CAFE provides a wide range of publication-ready visualization options. Due to its 382 underlying code in python, CAFE is highly scalable to datasets of millions of cells and takes 383 advantage of multi-threading to obtain higher performance. Previously, python-based Pytometry 384 and CRUSTY integrated unbiased clustering algorithms within their tools 22,23. Pytometry 385 incorporates the Leiden algorithm 10, which has been shown to be an improvement over the 386 predecessor Louvain algorithm 24; however, Pytometry requires coding using Python. CRUSTY 387 incorporates an easy-to-use GUI but it does not offer the Leiden algorithm and relies on 388 FlowSOM and Phenograph. Another limitation of CRUSTY is that the cloud-based service limits 389 users to analyzing 100,000 total events. There are also limited visualization and analysis 390 options in CRUSTY and they rely on most of the Phenograph and FlowSOM default settings 391 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted December 7, 2024. ; https://doi.org/10.1101/2024.12.03.626714doi: bioRxiv preprint which cannot be customized. Cloud-based solutions may face limitations in availability, 392 scalability, and data security. Users may be prohibited from uploading data to cloud-based 393 systems that have protected health information due to HIPAA. Among a few other GUI based 394 tools, Cytoflow25, Floreada (floreada.io), EasyFlow 26 allow for flow cytometry data analysis but 395 do not offer clustering. FlowPy (flowpy.wikidot.com) allows for clustering but uses k-mean 396 clustering rather than the most advanced algorithms currently in use (i.e. Leiden). Additional 397 tools like terraFlow 27 and CellEngine (CellCarta, Montreal, Canada) are for-profit spectral flow 398 cytometry analysis softwares; however, the price of these may be restrictive for many users. 399 Finally, FlowJo is a staple for many immunologi sts and has some native clustering capabilities. 400 It also supports plugins for additional clustering algorithms, but these add-ons do not offer much 401 customization in the clustering parameters. 402 403 CAFE, while addressing many of these limitations as an open-source alternative, has its own 404 practical considerations. CAFE is intended to be run locally which requires installing a package 405 manager such as Pixi or Anaconda/Miniconda3 through terminal. Performance is also 406 dependent on the user’s machine. For larger dat asets, we recommend utilizing our provided 407 scripts in Github to run data processing step through an HPC cluster by allocating more RAM. 408 Once the user has Anndata file (.h5ad file) generated with cluster information, all the 409 downstream analysis and figure generation steps become significantly less computationally 410 demanding. Ultimately, CAFE’s aim is to become a secure, scalable, and open-source platform 411 accessible to a broad range of researchers to run complex analyses through a simple intuitive 412 graphical user interface. 413 414

Acknowledgements

This work was supported by the National Institutes of Health/National 415 Institute on Aging Grant R00 AG068309 (to D.J.T.); Nathan Shock Center, which is supported by 416 the National Institutes of Health/National Institute on Aging Grant P30 AG050886; and this 417 research was conducted while (Daniel Tyrrell) was an AFAR Grant for Junior Faculty awardee 418 A24063 (to D.J.T.). 419 420 Data availability: The raw data are publicly available at FlowRepository: FR-FCM-Z3WR. The 421 downsampled .csv files are available from Figshare: 27940719. The processed Anndata object 422 after dimensionality reduction and clustering is also available from Figshare: 27940752. 423 424 Code availability: CAFE is freely available at https://github.com/mhbsiam/cafe. 425 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted December 7, 2024. ; https://doi.org/10.1101/2024.12.03.626714doi: bioRxiv preprint 426 Contributions: M.H.B.S. and D.J.T. conceived the software development. M.H.B.S. created the 427 software and web application; M.H.B.S. analyzed the data. M.H.B.S. and D.J.T. wrote the paper. 428 All authors reviewed and edited the paper. 429 430 Corresponding author: Correspondence to Daniel J. Tyrrell at [email protected]. 431 432 Ethics declarations: The authors declare they have no competing interests. 433 434

References

435 436 1. McKinnon, K. M. Flow Cytometry: An Overview. Current Protocols in Immunology 120, 437 5.1.1-5.1.11 (2018). 438 2. Nolan, J. P. The evolution of spectral flow cytometry. Cytometry Part A 101, 812–817 (2022). 439 3. Konecny, A. J., Mage, P. L., Tyznik, A. J., Prlic, M. & Mair, F. OMIP-102: 50-color 440 phenotyping of the human immune system with in-depth assessment of T cells and dendritic 441 cells. Cytometry Part A 105, 430–436 (2024). 442 4. Tyrrell, D. J. et al. Clonally expanded memory CD8+ T cells accumulate in atherosclerotic 443 plaques and are pro-atherogenic in aged mice. Nat Aging 3, 1576–1590 (2023). 444 5. Na, S., Choo, Y ., Yoon, T. H. & Paek, E. CyGate Provides a Robust Solution for Automatic 445 Gating of Single Cell Cytometry Data. Anal Chem 95, 16918–16926 (2023). 446 6. Van Gassen, S. et al. FlowSOM: Using self-organizing maps for visualization and 447 interpretation of cytometry data. Cytometry A 87, 636–645 (2015). 448 7. Samusik, N., Good, Z., Spitzer, M. H., Davis, K. L. & Nolan, G. P. Automated mapping of 449 phenotype space with single-cell data. Nat Methods 13, 493–496 (2016). 450 8. Qiu, P. et al. Extracting a cellular hierarchy from high-dimensional cytometry data with 451 SPADE. Nat Biotechnol 29, 886–891 (2011). 452 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted December 7, 2024. ; https://doi.org/10.1101/2024.12.03.626714doi: bioRxiv preprint 9. Levine, J. H. et al. Data-Driven Phenotypic Dissection of AML Reveals Progenitor-like Cells 453 that Correlate with Prognosis. Cell 162, 184–197 (2015). 454 10. Traag, V. A., Waltman, L. & van Eck, N. J. From Louvain to Leiden: guaranteeing well-455 connected communities. Sci Rep 9, 5233 (2019). 456 11. Hao, Y . et al. Integrated analysis of multimodal single-cell data. Cell 184, 3573-3587.e29 457 (2021). 458 12. Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data 459 analysis. Genome Biology 19, 15 (2018). 460 13. Rich, J. M. et al. The impact of package selection and versioning on single-cell RNA-seq 461 analysis. bioRxiv 2024.04.04.588111 (2024) doi:10.1101/2024.04.04.588111. 462 14. McInnes, L., Healy, J., Saul, N. & Großberger, L. UMAP: Uniform Manifold Approximation 463 and Projection. Journal of Open Source Software 3, 861 (2018). 464 15. Vardaman, D. et al. Development of a Spectral Flow Cytometry Analysis Pipeline for High-465 dimensional Immune Cell Characterization. J Immunol 213, 1713–1724 (2024). 466 16. Burton, R. J. et al. CytoPy: An autonomous cytometry analysis framework. PLoS Comput 467 Biol 17, e1009071 (2021). 468 17. Streamlit  A faster way to build and share data apps. https://streamlit.io/ (2021). 469 18. Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression data 470 using empirical Bayes methods. Biostatistics 8, 118–127 (2007). 471 19. Csardi, G. & Nepusz, T. The Igraph Software Package for Complex Network Research. 472 InterJournal Complex Systems, 1695 (2005). 473 20. Silverman, B. W. Density Estimation for Statistics and Data Analysis. (Routledge, New York, 474 2017). doi:10.1201/9781315140919. 475 21. Yu, C. et al. Mucosal-associated invariant T cell responses differ by sex in COVID-19. Med 476 2, 755- 772.e5 (2021). 477 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted December 7, 2024. ; https://doi.org/10.1101/2024.12.03.626714doi: bioRxiv preprint 22. Büttner, M., Hempel, F., Ryborz, T., Theis, F. J. & Schultze, J. L. Pytometry: Flow and mass 478 cytometry analytics in Python. 2022.10.10.511546 Preprint at 479 https://doi.org/10.1101/2022.10.10.511546 (2022). 480 23. Puccio, S. et al. CRUSTY: a versatile web platform for the rapid analysis and visualization of 481 high-dimensional flow cytometry data. Nat Commun 14, 5102 (2023). 482 24. Blondel, V. D., Guillaume, J.-L., Lambiotte, R. & Lefebvre, E. Fast unfolding of communities 483 in large networks. J. Stat. Mech. 2008, P10008 (2008). 484 25. Teague, B. Cytoflow: A Python Toolbox for Flow Cytometry. 2022.07.22.501078 Preprint at 485 https://doi.org/10.1101/2022.07.22.501078 (2022). 486 26. Ma, Y ., Eizenberg-Magar, I. & Antebi, Y . EasyFlow: An open-source, user-friendly cytometry 487 analyzer with graphic user interface (GUI). PLOS ONE 19, e0308873 (2024). 488 27. Freeman, D. et al. terraFlow, a high-parameter analysis tool, reveals T cell exhaustion and 489 dysfunctional cytokine production in classical Hodgkin’s lymphoma. Cell Reports 43, (2024). 490 491 Figure Legends: 492 Figure 1: The flowchart outlines steps and components of CAFE’s workflow. 493 Preprocessing includes compensation, data scaling/transformation using a standard FCM 494 software and scaled CSV files are then exported and renamed as Sample_Group.csv. Data 495 processing performs error checks and concatenation of CSV file s into an AnnData object/H5AD 496 file. Major steps requiring user input include dimension reduction, batch correction, UMAP 497 (UMAP Uniform Manifold Approximation and Projection) and community detection. Outputs are 498 downloadable as CSV, H5AD, PNG, JPG, SVG and PDF files. 499 Figure 2: Profiling of Human PBMCs Reveal s Distinct Immune Subpopulations and 500 Marker Expression Patterns. (a) UMAP plots showing selected marker expression intensities 501 across all cells in the UMAP space to highlight lineage-specific marker distribution. (b) Dot plot 502 of all marker expression across all identified PBMC cell types. Dendrogram highlighted distinct 503 marker-based groupings. (c) UMAP visualization showing 16 distinct clusters with annotated cell 504 types including Naive CD4 and CD8 T cells, central memory CD4 and CD8 T cells (Tcm), 505 effector memory CD8 T cells (Tem), terminally differentiated effector memory CD8 T cells 506 (Temra), mucosal-associated invariant (MAIT) T cells, classical monocytes (cMO), intermediate 507 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted December 7, 2024. ; https://doi.org/10.1101/2024.12.03.626714doi: bioRxiv preprint monocytes (iMO), B cells, NK cells, gamma delta ( γδ ) T cells (Tgd), conventional dendritic cell 508 (cDC) and plasmacytoid dendritic cell (pDC). 509 Figure 3: Comparative Analysis of Immune Cell Subpopulations in Healthy and COVID-19 510 Individuals. (a) UMAP plots displaying distinct clustering patterns and differential distribution of 511 cell types in PBMC across healthy and COVID-19 group. ( b) Sankey diagram illustrates the 512 distribution of cells across groups, with thicker flow indicating more cells. (c) Composite bar-strip 513 plot summarizing cell count distribution across cell subpopulations. Dots represent each 514 individual samples colored by group. ( d) Stacked bar chat showing distribution of cells in 515 percentage across two groups. ( e) Effect size calculated using Cohen’s d indicating changes in 516 the number of cells in COVID-19 compared to reference healthy control. ( f) Comparison of 517 individual cell type frequencies between healthy and COVID-19 groups with p-values for 518 statistical significance. N=9/group. 519 Figure 4: Marker expression and distribution differences between COVID-19 and healthy 520 individuals. (a) Violin plot showed the median expression levels of CD161 across immune cell 521 subtypes. (b) Sankey diagram illustrated marker expression in CD8+ T cells, with thicker flows 522 indicating more cells expressing that marker. ( c) Bar chart showed median expression of 523 markers across all cell types between COVID-19 and healthy individuals where positive values 524 indicated upregulation. Box plots displayed (d) the number of cells expressing IgG, IgM and IgD 525 in B cells, and (e) the number of CD8+ T cells expressing CD25, HLA-DR, and CD38. 526 527 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted December 7, 2024. ; https://doi.org/10.1101/2024.12.03.626714doi: bioRxiv preprint CSV files Data Processing Dimension reduction Run PCA H5AD file Batch correction No Yes Community detection Run ComBat Export files KNN graphUMAP embeddingSNN graph UMAP X_pca X_umap Use X_pca, X_umap or none Yes No CSV files Counts Frequency Visualization UMAP Plots Bar Plots Dot Plots KDE Plots Violin Plots Sankey Plots Heatmaps Statistics Annotate Clusters Merge Clusters Cluster evaluation Downsampling Preprocessing optional optional optional Violin Plots Sankey Plots Heatmaps Box Plots Strip Plots Dendrogram .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted December 7, 2024. ; https://doi.org/10.1101/2024.12.03.626714doi: bioRxiv preprint c b a .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted December 7, 2024. ; https://doi.org/10.1101/2024.12.03.626714doi: bioRxiv preprint .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted December 7, 2024. ; https://doi.org/10.1101/2024.12.03.626714doi: bioRxiv preprint NK cell Tcm CD4 Tem CD8 Naive CD4 Tem CD4 Temra CD8 B cell Tcm CD8 TMAIT Tgd cell pDC cDC Naive CD8 cMO Basophil iMO −20k 0 20k 40k 60k Violin Plot of CD161 Expression across cell_type cell_type CD161 Expression CD8 CD11b CD14 HLA-DR IgG IgM CD38 CD25 CD161 −2000 −1000 0 1000 2000 3000 4000 5000 Group COVID19 Healthy Median Expression of Markers Among Groups Markers Median Expression Healthy COVID19 1,300 1,400 1,500 1,600 1,700 1,800 Healthy COVID19 5,000 10,000 15,000 Healthy COVID19 2,500 3,000 3,500 4,000 4,500 5,000 Marker Expression Comparisons CD25 HLA-DR CD38 p-value = 0.1917 p-value = 0.3504 p-value = 0.6506 Healthy COVID19 4,000 6,000 8,000 Healthy COVID19 10,000 20,000 30,000 40,000 50,000 60,000 70,000 Healthy COVID19 20,000 40,000 60,000 80,000 100,000 Marker Expression Comparisons IgG IgM IgD p-value = 0.1631 p-value = 0.2759 p-value = 0.0182 T T HLA-DR CXCR3 CCR5 CCR7 CXCR5 CCR6 Tem CD8 Naive CD8 TMAIT Tcm CD8 Temra CD8 a b c d e .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted December 7, 2024. ; https://doi.org/10.1101/2024.12.03.626714doi: bioRxiv preprint

Text is read by the "Ask this paper" AI Q&A widget below. Extraction quality varies by source — PMC NXML preserves structure cleanly, OA-HTML may include some navigation residue, and OA-PDF can have broken hyphenation. The publisher copy (via DOI) is the canonical version.

My notes (saved in your browser only)

Ask this paper AI returns verbatim quotes from the full text · source: oa-pdf

Answers must be backed by verbatim quotes from this paper's full text. Hallucinated quotes are dropped automatically; if no verbatim passage answers the question, we say so. How this works

Citation neighborhood (no data yet)

We don't have any in-corpus citations linked to this paper yet. This is a recent paper (2024) — citers typically take a year or two to land, and the OpenAlex reference graph may still be filling in.

Source provenance

europepmc
last seen: 2026-05-20T01:45:00.602351+00:00
unpaywall
last seen: 2026-05-23T02:00:01.238055+00:00
License: CC-BY-NC-ND-4.0