Abstract
Single cell technologies have advanced at a rapid pace, providing assays for various molecular
phenotypes. Droplet-based single cell technologies, particularly those based on nuclei isolation,
such as simultaneous RNA+ATAC single-cell multiome, are susceptible to exogenous ambient
molecule contamination, which can increase noise in cell type-level associations. We reasoned
that genotype-based sample multiplexing can provide an opportunity to infer this ambient
contamination by leveraging DNA variation in sequenced reads. Thus, we developed ambimux,
a likelihood-based method to estimate ambient fractions and demultiplex single-cell multiome
experiments using genotype-level data. Ambimux models the ambient or nuclear probability at
the read level and thus can classify empty droplets and estimate droplet-specific ambient
molecule fractions in each modality. We first evaluated our method using simulated data sets
across a range of parameters. We found that ambimux closely estimated the ground truth
droplet contamination fractions in the RNA (MAE=0.048) and ATAC (MAE=0.042) modalities. As
a result, ambimux maintained high specificity (>95%) and was able to correctly assign singlets
at considerably high ambient fractions (up to 60%) for both RNA and ATAC modalities. In
comparison with models that do not consider ambient contamination, these only maintained
similar sensitivity levels at considerably lower ambient fractions (up to 25%). We then generated
a real data set of seven visceral adipose tissue biopsies run on a single 10x Multiome channel.
We ran ambimux and detected 4,986 singlets, capturing similar numbers as other methods.
Then, we sought to evaluate the fidelity of the ambient fraction estimates from ambimux. We
split singlets into ambient-enriched (>5% contamination in both modalities) or nuclear-enriched
(<5% in both) droplets and performed gene-peak linkage analysis. Low ambient droplets
resulted in more significant hits with gene-peak links enriched at the transcription start site
relative to high ambient droplets, suggesting that the ambient droplets identified by ambimux
hamper the identification of biologically meaningful signals. In summary, we developed a joint
single-cell multiome demultiplexing method, ambimux, that accurately models and estimates
ambient molecule contamination in each modality.
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 26, 2025. ; https://doi.org/10.1101/2025.08.21.671671doi: bioRxiv preprint
Introduction
The ability to profile multiple molecular phenotypes in single cells has provided powerful
opportunities to study the interaction between regulatory elements and gene expression1,2.
Single cell technologies can profile molecular phenotypes beyond gene expression, including
open chromatin3, methylation4, histone modifications5, and chromatin configuration6. These
assays allow for the characterization of cis-regulatory elements (CREs) that regulate gene
expression in a cell-type-specific manner7. These analyses are further aided by computational
Methods
that can integrate separate single-cell ATAC-seq and RNA-seq experiments8.
However, these approaches only infer the joint epigenomic and transcriptomic profiles of cell
types without direct measurements in the same cell. To overcome these challenges, so-called
multiome technologies have been developed to jointly profile open chromatin and RNA in the
same cell. This can provide valuable insights in connecting CREs with gene expression,
particularly in differentiating cell types where the epigenetic state may temporally mismatch
gene expression
9.
While joint profiling of single cells provides clear advantages over single-modality assays, it is
currently costly and thus difficult to scale over many samples
10. These limitations may hinder
population-scale studies of gene regulation at the single-cell level. Multiplexing by pooling
samples provides a feasible approach to increase sample sizes while keeping costs fixed11,12.
Additionally, multiplexing allows for better detection of heterotypic doublets. When samples are
genetically distinct, they can be pooled without additional experimental approaches required by
antibody- or lipid-based hashing strategies
13,14. Previous computational approaches developed
for droplet-based single-cell RNA-seq experiments leverage genetic variation as a natural
barcode in each cell to assign droplets to individual donors
12,15,16. However, these methods are
designed to run on a single modality, and do not leverage both RNA and ATAC reads in the
same droplet. Furthermore, these methods typically require prior specification of candidate
droplets. Genetic multiplexing offers the advantage to detect empty droplets based on variant
calls, which may be superior to detection via expression or accessibility.
An additional challenge encountered with RNA and ATAC single-cell multiome experiments is
the potential contamination of exogenous ambient molecules in droplets
14,17–19. This issue is
more prominent with nuclei, especially if isolated from solid or frozen tissues20. As current
single-cell multiome protocols rely on nuclei isolation9, methods to evaluate ambient molecule
contamination would be valuable for quality control or covariate correction purposes.
To address these limitations, we developed ambimux, a computational approach to demultiplex
single-cell multiome experiments and estimate ambient molecule fractions for each modality.
Our method estimates ambient molecule fraction per modality, providing a valuable filtering
metric. Additionally, ambimux classifies empty droplets based on genetic data, removing the
need for prior filtering. We demonstrate via simulations that ambimux outperforms existing
Methods
for demultiplexing contaminated multiome experiments, recovering a higher number of
singlets with greater accuracy. We also show that ambimux accurately estimates ambient
fractions in RNA and ATAC modalities in these synthetic data. Finally, we apply ambimux to a
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 26, 2025. ; https://doi.org/10.1101/2025.08.21.671671doi: bioRxiv preprint
single-cell multiome dataset from seven human visceral adipose tissue samples in a single run.
Overall, ambimux provides a fast and scalable approach for demultiplexing single-cell multiome
experiments and accounting for ambient molecule contamination.
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 26, 2025. ; https://doi.org/10.1101/2025.08.21.671671doi: bioRxiv preprint
Results
Overview of the ambimux model
Ambimux is a likelihood-based method designed for demultiplexing single-cell multiome RNA-
and ATAC-seq experiments, while simultaneously accounting for ambient contamination. It
offers two novel features that are particularly beneficial for droplet filtering. First, ambimux can
classify all generated droplets as empty, singlet, or doublet, eliminating the need for prior
barcode filtering. Second, it estimates the fraction of ambient RNA and DNA molecules in each
droplet, enhancing assessment and filtering for downstream analyses. Ambimux accomplishes
this by using variant-overlapping base calls within individual reads along with sample genotypes
without relying on expression or accessibility data.
Ambimux accurately demultiplexes samples in contaminated simulated multiple data.
We hypothesized that explicit modeling of ambient molecules per droplet would lead to
improved demultiplexing. To test this, we simulated three multiome datasets with low (mean =
10%), medium (mean = 20%), and high (mean = 30%) ambient fractions (Methods; Fig. 1a).
Each simulation consisted of 10,000 nuclei with a 10% doublet rate and eight samples
multiplexed uniformly (see methods). We ran ambimux and compared the performance with
three other demultiplexing methods: Demuxlet
12, Vireo15, and SouporCell16. We ran these
Methods
in each of the RNA and ATAC modalities separately and ran ambimux jointly on both.
First, we compared precision across all methods by calculating the percent of classified singlets
with a correct donor assignment. Ambimux, Demuxlet, and Vireo had a precision of greater than
99% across each of the three ambient simulations (Supplementary Fig. 1a,b). The precision of
SouporCell was somewhat lower, ranging from 92%-98% and decaying slightly with increasing
contamination (Supplementary Fig. 1a,b). Overall, these results suggest that demultiplexing
under ambient contamination is not highly susceptible to false positives.
Next, we compared recall (sensitivity) across the methods by calculating how many singlets
each method could recover. Of the 9,000 singlets in each of the three contamination datasets,
we calculated the percent of these assigned as singlets to the correct donor. We observed that
ambimux was able to accurately classify over 99% of the 9,000 singlets in each of the low,
medium, and high ambient datasets (Fig. 1b,c). In contrast, all other methods showed a lower
recovery of singlets in the higher contamination datasets (Fig. 1b,c). For example, Vireo RNA,
the next best method, had a sensitivity of 98.4% in the low contamination data, but only 89.4%
in the highly contaminated singlets. Thus, we found that ambient contamination generally
decreased the ability to recover singlets when not accounted for.
As both coverage and contamination can likely affect demultiplexing, we sought to identify the
contribution of each to the ability to accurately assign singlets. First, we evaluated at what point
ambient contamination started to reduce sensitivity. For both Vireo and Demuxlet, sensitivity
started to degrade at around 25% ambient contamination, and most singlets were not recovered
at around 50% (Fig. 1d). With SouporCell, 20-30% of calls were ambiguous across the entire
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 26, 2025. ; https://doi.org/10.1101/2025.08.21.671671doi: bioRxiv preprint
range of ambient contamination, even at 0% (Fig. 1d). In contrast, ambimux maintained over
99% accuracy in singlet calls even when ambient fractions approached 60% (Fig. 1d). Droplet
coverage also affected singlet recovery in single-modality runs. In both Vireo, Demuxlet, and
SouporCell, lower coverage led to more ambiguous calls, especially with droplets containing
less than 1,000 UMIs or fragments (Supplementary Fig. 1c). In contrast, ambimux was largely
unaffected as both modalities contributed reads for demultiplexing (Supplementary Fig. 1c).
Accurate estimation of droplet-specific ambient contamination.
Multiplexing of single-cell experiments with genetically distinct individuals provides a unique
opportunity to assess ambient molecule contamination. The distinct “genotype” of the ambient
pool allows for probabilistic discrimination between donor and ambient molecules. To leverage
this, we incorporated the droplet fraction of ambient molecules as a parameter (see methods).
To test the accuracy of these contamination estimates, we combined the three simulated
datasets above and compared the estimates with the simulated true values. Overall, we found
these estimates to be highly accurate in both modalities (Fig. 2a,b), with a mean absolute error
(MAE) of 4.8% and 4.2% in the RNA and ATAC, respectively (Fig. 2a). We further hypothesized
that accuracy in droplet ambient estimation would be influenced by coverage. To test this, we fit
a local loess regression curve between the droplet absolute error and the number of variant-
overlapping reads (informative reads). As expected, we found lower errors in ambient estimates
with increasing coverage, with similar parameters in both modalities (Fig. 2c; Supplementary
Fig. 2). For example, droplets with 100 informative RNA reads were predicted to have an error
of 7.3%, while those with 1,000 informative RNA reads had a predicted error of only 2.7% (Fig.
2c).
We also compared our background fraction estimates with that of CellBender
18, a deep
generative model for ambient reads removal in single-cell RNA-seq using count data. As our
simulations generated a distinct background distribution, we reasoned that CellBender could be
applied to our three synthetic contaminated datasets. In addition to RNA, we tested ATAC
counts, although we note that the authors never evaluated CellBender in this modality. We
found that CellBender failed to properly estimate background fractions (Supplementary Fig. 3).
After combining the results from the three datasets, the mean absolute errors were 19.0% and
18.4% in the RNA and ATAC modalities, respectively (Supplementary Fig. 3a). Upon further
inspection, we found that CellBender underestimated the ambient fraction on average, and the
MAE increased with higher background reads (Supplementary Fig. 3b). Importantly, background
estimates from CellBender were uncorrelated with the simulated ambient fractions for each of
the 3 ambient datasets (Pearson R < 0.01) (Supplementary Fig. 3c). Overall, these comparisons
highlight how modeling genotypes can improve estimation of ambient read fractions when
compared to read counts alone.
Ambimux improves demultiplexing of contaminated droplets in single modalities.
While developed for single-cell multiome data, ambimux can easily run on a single modality. We
tested ambimux on the RNA and ATAC modalities separately, using the three simulated
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 26, 2025. ; https://doi.org/10.1101/2025.08.21.671671doi: bioRxiv preprint
datasets described above (low, medium, and high contamination). This allowed us to assess the
benefits of ambient modeling independently of coverage, as the RNA+ATAC approach doubled
the number of reads on average compared to the single-modality approach. We again
compared ambimux with Demuxlet
12, Vireo15, and SouporCell16 and assessed sensitivity in
accurately recovering the 9,000 singlets in each dataset (Fig. 2d). In the RNA modality, Vireo
accurately recovered the most singlets in the low (98.4%) and medium (95.5%) contamination
datasets, while ambimux accurately recovered 97.3% and 95.2%, respectively (Fig. 2d). In the
high contamination dataset, ambimux performed best, accurately assigning 92.5% of singlets,
whereas Vireo recovered 89.4% (Fig. 2d). Ambimux also outperformed all methods in the ATAC
modality, recovering 98.3%, 97.3%, and 95.2% of the 9,000 singlets in the low, medium, and
high ambient datasets, respectively (Fig. 2d). These results show that ambimux is robust to
ambient contamination even when run on either RNA or ATAC modalities individually, although
joint RNA+ATAC calling performed best (recall > 99%) in all three simulated datasets.
Ambimux maintains accuracy across pooling strategies.
An important question in the design of multiplexed experiments is how many samples to pool.
Often, this will involve balancing depth vs. breadth with budgetary constraints
21. We sought to
test the performance of ambimux across a wide range of pooling numbers. We simulated
synthetic data sets consisting of 2, 4, 8, 16, 32, and 64 pooled samples. As before, each pooling
experiment contained 9,000 singlets and 1,000 doublets. Droplet background fractions were
sampled from an equal mixture of low, medium, and high ambient distributions (Methods). We
observed that ambimux maintained excellent performance across all pooling numbers, with
precisions and recalls greater than 99.9% and 99.8%, respectively (Fig. 3a). We also found that
ambimux could accurately estimate ambient fractions with mean absolute errors of 4.6-6.1% for
the RNA and 4.1-5.3% for the ATAC (Fig. 3b). Interestingly, the accuracy of background fraction
estimates increased with higher pooling numbers (Fig. 3b).
To properly model ambient molecules in a droplet, ambimux requires the allele frequencies of
the ambient pool. While donor genotypes are given, the ambient pool “genotype” must be
estimated. To do so, we model the ambient allele frequency of a SNP as the average of the
donor genotypes weighted by estimated background donor proportions (see methods). We thus
asked whether our model can properly estimate these fractions of donors in a pool. In a
simulation of 16 samples with a 10-fold variation in abundance, we found that the estimated
donor proportions accurately reflected the true simulated values in both the background pool
and singlet droplets (Fig. 3c). This allowed for accurate estimation of ambient fractions with an
MAE of 0.041 and 0.046 for the ATAC and RNA, respectively (Supplementary Fig. 4a).
Additionally, ambimux maintained high precision (> 99.9%) and sensitivity (99.9%) in this
simulation (Supplementary Fig. 4b,c). Our results on these simulated data show that ambimux is
robust to variations in sample yield, both in the ambient and singlet droplet pools.
Next, we asked whether ambimux could accurately demultiplex datasets in which genotypes
were missing or samples dropped out. We tested this by simulating a multiome dataset of eight
donors and demultiplexing after removing donor genotypes or adding new donor genotypes in
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 26, 2025. ; https://doi.org/10.1101/2025.08.21.671671doi: bioRxiv preprint
addition to the 8. We performed a sweep by removing 1, 2, 3, and 4 sample genotypes,
simulating cases of missing genotypes, and adding 1, 2, 3, and 4 sample genotypes, simulating
cases with experimental dropout. We found that the addition of sample genotypes had minimal
effects on recall and precision (Fig. 3d), suggesting ambimux can accurately detect and assign
singlets after dropout of samples in a pool. When genotypes of pooled donors were missing,
recall of the genotyped donors was largely unaffected. For example, the recall from
demultiplexing 6 out of 8 samples was 0.747, near the maximum of 0.75 (Fig. 3d). However,
missing genotype samples detrimentally affected the ability to accurately assign singlets to
donors (Fig. 3d). The precision was 0.890 when only 4 of the 8 genotypes were present, while
the precision was 0.999 with the full data (Fig. 3d). Missing or added genotypes also slightly
worsened the accuracy of ambient estimates. While demultiplexing with the eight pooled
samples resulted in an MAE of 0.043 and 0.047 for the ATAC and RNA respectively,
demultiplexing with 4 missing genotypes resulted in MAEs of 0.088 and 0.091, for example
(Supplementary Fig. 5a). With missing genotypes, we found that there were singlets with higher,
overestimated ambient fractions above 0.5 (Supplementary Fig. 5b), likely originating from
ungenotyped samples assigned to an incorrect donor and fit with a high ambient fraction.
Ambient contamination reduces power to detect differential abundance
Many single-cell analyses involve detecting feature-level differences (such as gene expression
or ATAC accessibility) between conditions
21, including cell-type-specific marker identification
and disease association22. We investigated the extent to which ambient contamination can
affect differential abundance (DA) analysis of peaks or genes. To test this, we generated three
synthetic datasets with low (mean 10%), medium (mean 20%), and high (mean 30%)
Background
contamination levels. Each dataset comprised eight pooled individuals, with four
containing a disease cell subtype. We simulated various log fold-changes for DA features in the
disease cell-type, ranging from 0.1 to 2.0 for 1,000 genes and 1,000 peaks (Fig. 4a). DA of the
1,000 features in each modality showed that ambient contamination decreased the power to
detect differential abundance (Fig. 4b). Specifically, in the RNA modality, we detected 334, 286,
and 234 DA genes in the low, medium, and high ambient datasets, respectively (Fig. 4b). The
ATAC modality showed a similar trend, with 27, 20, and 10 DA peaks detected (Fig. 4b). As
expected, features with high fold-changes were most detectable, and estimated log fold-
changes correlated with the true log fold-changes for each dataset (Supplementary Fig. 6).
After confirming that higher background levels led to decreased power to detect DA, we sought
to understand how filtering out contaminated droplets could affect DA results. To investigate
this, we combined the three synthetic datasets above and tested DA after removing droplets
above various ambient fraction thresholds. We used a range of 0.1 to 0.5 for ambient fraction
removal. Filtering out the fewest droplets at a threshold of 0.5 resulted in the highest number of
DA genes, with 137 DA peaks and 501 DA genes in the ATAC and RNA modalities, respectively
(Fig. 4c). In contrast, removing singlets with an ambient fraction above 0.1 resulted in a much
lower number of significant features, with 3 DA peaks and 217 DA genes (Fig. 4c). These
observations are likely due to the fact that ambient contamination and healthy-disease states
were simulated independently of each other, such that contamination was not a confounder for
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 26, 2025. ; https://doi.org/10.1101/2025.08.21.671671doi: bioRxiv preprint
DA testing. Overall, these results suggest that ambient contamination reduces power to detect
differential feature abundance, and further filtering of features also reduces power.
Characterizing ambient contamination and filtering in a single-cell multiome dataset from
visceral adipose tissue.
Encouraged by our results on synthetic data, we next investigated the performance of our
Method
on observed data. We ran the 10x single-cell Multiome assay on seven pooled visceral
adipose tissue (VAT) samples from the KOBS cohort23,24. After mapping with CellRanger Arc
v2.0.225, ambimux identified 4,986 singlets and 512 doublets (9.3%). Several test droplets with
low coverage were classified as empty, and droplets with a low coverage in only one modality
could be assigned to a singlet with a higher coverage in the other modality (Fig. 5a).
Next, we compared ambimux droplet calls with those from demuxlet, Vireo, and SouporCell.
After restricting analysis to 6,414 candidate droplets with at least 100 UMIs and fragments,
ambimux identified 4,824 singlets, while other methods classified a range of 3,857 to 5,334
singlets (Fig. 5b). On average, 92.4% of droplets called as singlets in one method were called
as singlets in another method (Fig. 5c). Ambimux singlets overlapped with at least 94.2% of
singlets called in the other approaches and showed slightly higher overlap with ATAC-based
demultiplexing (Fig. 5c). Among the intersecting singlets, methods had high concordance (mean
= 98.5%) in their sample assignments (Supplementary Fig. 7a). We then compared singlets that
were unique to ambimux when compared to other methods applied to either ATAC or RNA. The
270 (ATAC) and 183 (RNA) droplets specific to ambimux tended to have higher contamination
scores or lower coverage (Fig. 5d). On the other hand, singlets specific to the other methods
tended to be called as doublets by ambimux (Supplementary Fig. 7b).
We considered that lysed or broken nuclei could release randomly fragmented DNA with loss of
chromatin structure and thus be accessible to the Tn5 transposase. This could lead to
extranuclear DNA preferentially originating from inaccessible regions that make up most of the
genome. We tested this in the VAT multiome dataset by using either fragments within peaks
(intra-peak) or between peaks (inter-peak) for demultiplexing. In line with our hypothesis, the
mean ATAC ambient fraction was higher from the inter-peak fragments (22.4%) compared to
those from the intra-peak fragments (6.3%) (paired Wilcoxon p < 2.2x10-16) (Fig. 6a). While
higher than the intra-peak ambient fractions, the inter-peak proportion (22.4%) is lower than the
fraction of the genome that these regions occupy (97.3%). Notably, the intra-peak contamination
estimates were within the same range as the RNA contamination estimates (mean 7.4%) (Fig.
6b). Due to these results, and the fact that downstream analyses rely on peak counts, we used
only intra-peak molecules for demultiplexing in this VAT dataset.
Next, we investigated whether RNA and ATAC ambient fractions were correlated with common
summary statistics in visceral adipose tissue singlets. Notably, the ambient fractions between
modalities were weakly correlated (Fig. 6c), with a Spearman coefficient of 0.08 (p = 4.9 x 10-8),
suggesting that background contamination of RNA and ATAC molecules occurs independently.
We also found that lower coverage droplets tended to have a somewhat higher amount of
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 26, 2025. ; https://doi.org/10.1101/2025.08.21.671671doi: bioRxiv preprint
contamination in both modalities (Fig. 6d), with Spearman correlations of -0.11 and -0.13 for the
ATAC and RNA, respectively. This is consistent with a model of uniform ambient molecule
abundance that would result in low coverage droplets containing a higher proportion of
background. Lastly, we found that the percent of mitochondrial reads, a typical filtering and QC
metric8,18, showed only slightly positive correlations with ambient contamination, with ATAC and
RNA Spearman correlations of 0.10 and 0.21, respectively (Supplementary Fig. 8a).
Finally, we were interested in seeing how ambient fractions could affect downstream analysis.
To do so, we performed gene-to-peak link analysis on clean and contaminated visceral adipose
tissue singlets. First, we clustered droplets and assigned them to 15 cell-types, including
vascular, stromal, myeloid, and adipocyte cells (Supplementary Fig. 8b). We found that ambient
contamination varied across cell-types (Fig. 6e), although this was largely correlated with cell-
type coverage (Supplementary Fig. 8c). We then split the nuclei into equally sized low ambient
( 5% in both RNA and ATAC) droplets (N =
167), controlling for read depth and cell-type (Supplementary Fig. 9). We found 877 and 508
significant links (FDR-adjusted p < 0.05) in the low and high groups respectively (Fig. 6f).
Notably, the associated peaks in the clean droplets tended to be closer to the transcription start
site (TSS) than those of the contaminated droplets (Fig. 6g). Overall, these results suggest that
ambient contamination can degrade the power to detect significant gene-peak links and lead to
potentially more spurious associations.
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 26, 2025. ; https://doi.org/10.1101/2025.08.21.671671doi: bioRxiv preprint
Discussion
We developed ambimux for integrated demultiplexing of single-cell multiome experiments under
ambient contamination. We show that our method outperforms existing approaches in sensitivity
and specificity. Using synthetic data, we were able to show how our approach correctly handles
experiments with high ambient contamination and identify points where other methods break. As
part of the model, ambimux also outputs modality-specific ambient fraction estimates per
droplet. Our results on simulations show that ambimux can accurately estimate droplet
contamination with mean absolute errors as low as 2.7% for higher coverage droplets. We also
show that ambimux is capable of demultiplexing a wide range of pooling numbers, from just 2
donors to 64. Importantly, ambimux is robust to donor proportion variation, and is robust to
sample dropouts and missing genotypes. Finally, we applied ambimux to a real dataset of seven
pooled VAT samples and showed comparable results to competing methods.
While highly optimized protocols and cell line samples can yield higher quality data, intact cell or
nuclei isolation from tissue remains challenging in many cases
26. Furthermore, generating cDNA
single-cell libraries from fresh biopsies can present logistical challenges, particularly in scaling
to larger sample sizes. Therefore, frozen tissues might be the only practical approach. However,
frozen tissues are particularly subject to lysis and can lead to increased ambient molecule
concentrations
20,26. As single-cell RNA, ATAC, and multiome methods are applied to frozen
tissues, ambient fraction estimates from genetic data can provide a high confidence method to
assess the quality of the experiment, filter individual cells, and correct for potential confounding.
We have shown here that ambimux can accurately estimate these contamination fractions
across a wide range of experimental setups.
We hypothesize that ambimux will help enable population-scale single-cell multiome studies in
tissues. By providing a robust method to demultiplex samples and estimate background
contamination, ambimux allows researchers to increase pooling and sample sizes for costly
multiome experiments. This approach can help to better identify subtle regulatory mechanisms
across populations
21,27. Moreover, by providing grounded contamination estimates for use as a
covariate or filtering threshold, ambimux may help improve the accuracy of downstream
analyses, such as peak-to-gene links. Improved analyses will help advance our understanding
of cell-type-specific gene regulation and offer new insights into developmental processes,
disease mechanisms, and tissue heterogeneity
28,29.
Despite its advantages, ambimux has some limitations. First, the method relies on genetic
variation between samples, which may limit its applicability in cases where samples are
genetically identical, such as cell lines, or when working with model organisms with limited
genetic diversity. Additionally, we observed a drop in ambient estimation accuracy in low
coverage droplets, as sparsity of informative reads can lead to high variance in parameter
inference. Finally, ambimux was designed for multiplexed experiments in which all donors have
genotype data available. In practice, it may be challenging to obtain this for every donor. While
we observed that ambimux is robust to some missing genotypes, explicit modeling of this may
be preferable.
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 26, 2025. ; https://doi.org/10.1101/2025.08.21.671671doi: bioRxiv preprint
Looking ahead, several promising avenues for further development of ambimux can be
identified. A key area is the extension of the model for ambient molecule correction. Ambimux
already models ambient contamination at the read level internally. By leveraging this modeling
approach, functionality could be developed to correct and remove ambient molecules from read
counts, similar to the approach used by CellBender
18. This would provide a more accurate
representation of true cellular content, potentially improving downstream analyses such as
differential feature abundance. As alluded to earlier, an extension of a genotype-free or missing-
genotype framework would further expand the utility of ambimux in these cases.
In conclusion, ambimux represents a significant addition to the single-cell field. By offering
multimodal ambient contamination estimates, ambimux addresses common challenges faced
with large-scale multiome studies in tissues. Its ability to handle various experimental designs
and contamination levels, coupled with its potential to facilitate larger-scale studies, positions
ambimux as a valuable tool for single-cell studies of gene regulation.
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 26, 2025. ; https://doi.org/10.1101/2025.08.21.671671doi: bioRxiv preprint
Methods
Model Description
The foundation of our approach is to model the observed variant-overlapping base calls in reads
using a Bernoulli distribution. Our Bernoulli modeling approach is inspired by the work of Kang
et al
12,30, and we extend this by incorporating parameters for empty droplets and ambient reads.
We take a likelihood-based approach and use posterior probabilities to assign donors to
droplets. Since the model is the same for both RNA and ATAC modalities, we use the term
‘molecule’ to refer to either a deduplicated ATAC-seq fragment or a deduplicated RNA-seq
unique molecular index (UMI). We first define the components of the probability model and then
describe the estimation procedure.
In a multiome experiment, let D be the number of droplets containing at least one molecule. A
droplet d can have either 0, 1, or 2 nuclei encapsulated, denoted by the latent variable H
d. We
assume that when multiple nuclei are present in a droplet, their combined genotype resembles
that of the ambient pool, so we do not consider cases with 3 or more nuclei. This assumption
simplifies the estimation process by avoiding the complexity of modeling higher order
combinations.
Although the number of cells or nuclei encapsulated in a droplet roughly follows a Poisson
distribution, nuclei can adhere to each other, resulting in higher rates of doublets
31. To account
for this, we model Hd using a categorical distribution parameterized by λ.
Let N be the number of samples pooled in an experiment. The composition of donors in a
droplet, denoted by Sd, is conditional on the droplet type Hd. There are three possible droplet
types: empty droplets containing no donors, singlets containing one donor, and doublets
containing two donors. The experiment-wide probability of obtaining a singlet sample {i} or
doublet samples {i, j} is modeled using a Categorical distribution parameterized by π c, which is a
vector representing the proportions of each donor in the cells/nuclei.
For a singlet, the probability of being assigned to donor i is given by πci. For a doublet, the
probability of being assigned to donors i and j is given by the product πciπcj, assuming that the
probability of each donor in a doublet is independent of the other donor. To improve efficiency,
self-doublets are ignored, and only unique doublet possibilities are modeled.
Assume a droplet d contains M
d molecules. For a singlet or doublet, each molecule m can
originate from a sample si (i = 1, …, N) or from the ambient pool s0. We introduce the latent
random variable Tdm to indicate the origin of each molecule, which can be either a single sample
or the ambient pool. The probability of a molecule originating from the ambient pool is modeled
as a Bernoulli distribution with parameter αdhs, where h and s denote the specific Hd (number of
cells) and Sd (sample identity) for the droplet. If the droplet is empty (i.e., Hd = 0), then αdhs = 1.
For a doublet, we assume that each donor sample has an equal probability of contributing a
molecule. Note that we specify a droplet contamination parameter
αdhs specific to each
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 26, 2025. ; https://doi.org/10.1101/2025.08.21.671671doi: bioRxiv preprint
combination of Hd and Sd. Thus, for a clean singlet, the parameter estimate will be close to 0 for
its assigned donor but should be much higher for an incorrect donor assignment.
We model the probability of observing a base call at variant sites within a read. Let Bdmv
represent the observed base call at variant site v in molecule m of droplet d. Base calls b can be
either 0 (reference allele) or 1 (alternate allele). Our method focuses on bi-allelic SNPs,
disregarding other types of genetic variants.
The observed base depends on whether a sequencing error occurred, represented by the latent
variable E
dmv. Sequencing errors follow a Bernoulli distribution with probability τdmv, derived from
the Phred quality score of the base call.
In the absence of an error (Edmv = 0), the observed base call follows a Bernoulli distribution
parameterized by γiv, which is the alternate allele frequency of variant v for origin i (donor or
ambient pool). Donor allele frequencies are provided, while ambient allele frequencies are
estimated from the data. In the event of a sequencing error, we assume equal probability for
observing any base.
Taken together, the probability of a droplet is given by
/g1868 /g4666 /g1850 /g3031,/g1852 /g3031;Θ /g4667 /g3404/g1868 /g4666 /g1834 /g3031;/g2019 /g4667 /g1868/g4666/g1845 /g3031|/g1834 /g3031;/g2024 /g3030/g4667/g3537/g1868 /g4666 /g1846 /g3031/g3040|/g1834 /g3031,/g1845 /g3031;/g2009 /g3031/g4667
/g3014/g3279
/g3040/g2880/g2869
/g3537/g1868 /g4666 /g1828 /g3031/g3040/g3049|/g1846 /g3031/g3040;/g2011 ,/g2028 ,/g2024 /g3028/g4667
/g3023/g3279/g3288
/g3049/g2880/g2869
where Xd gives the observed data, Zd gives the latent variables Hd, Sd, Td1, …, Tdm, and Θ gives
the parameters λ, πc, α, πa, γ, and τ.
Parameter estimation and model fitting
The purpose of our method is threefold: 1) identify empty, singlet, and doublet droplets, 2)
assign droplets to donors, and 3) estimate ambient fractions in each droplet. We achieve this
using a combination of gradient-based methods and expectation maximization (EM)
32. The
output of interest consists of the posterior probabilities of Hd and Sd, and the modality-specific
droplet ambient estimates αd. While all droplets are input to the model, we only estimate
parameters for droplets with at least U = 100 molecules in either modality, treating all others as
empty, consistent with previous works
20,33. For single-modality data, the other modality is
ignored.
Accurate demultiplexing and ambient estimation relies on the alternate allele frequencies of
ambient molecules. Rather than directly calculating empirical allele frequencies from ambient
molecules, we model these frequencies as a function of the donor fractions in the ambient pool
and their allele frequencies. This approach regularizes estimates for variants with low coverage
and reduces noise. We set the ambient "genotypes" as a weighted sum of donor genotypes:
/g2011 /g3028/g3049/g3404 ∑ /g2024 /g3028/g3036/g2011 /g3036/g3049, where γiv are the given donor genotypes and πai are the donor proportions in the
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 26, 2025. ; https://doi.org/10.1101/2025.08.21.671671doi: bioRxiv preprint
ambient pool. We estimate πa by maximizing the log likelihood of the data using gradient ascent,
projecting the updated parameter onto the simplex34 to ensure non-negative proportions sum to
1 (Supplementary Note 1). For efficiency, we calculate πa using only fixed empty droplets and
keep this estimate constant.
To estimate droplet ambient parameters αdhs for all test droplets across Hd and Sd, we use
Newton-Raphson optimization for each droplet independently. We maximize the log likelihood
after marginalizing over Tdm, adding a prior to avoid collapsing the likelihood (Supplementary
Note 1).
With updated values for αdhs, we estimate λ and πc by maximizing the expected log likelihood,
where the expectation is taken with respect to the posterior of the latent variables. Each iteration
involves calculating the expected log likelihood and maximizing it with respect to
λ and πc. The
derivations of the update equations are provided in Supplementary Note 1.
Prior to estimation, we initialize λ0 as the proportion of fixed empty droplets, while λ1 and λ2 are
set to 0.9 and 0.1 times the proportion of test droplets, respectively. The values of πc and πa are
initialized uniformly.
The complete estimation procedure can be summarized as follows:
1. Initialize all parameters and estimate a fixed value for ambient sample fractions πa from the
empty droplets.
2. Iterate until convergence:
a) Estimate ambient fractions αdhs using Newton-Raphson.
b) Estimate λ and πc by maximizing the expected log likelihood.
c) Set the prior parameter β to the weighted average of αdhs across singlets.
3. Terminate iterations when the mean absolute difference in the parameter estimates is less
than
ε = 10-6.
Simulation of ambient-contaminated single-cell multiome experiments
While methods exist for simulating single-cell experiments
35, to our knowledge, none can
explicitly simulate multiplexed samples with genotypes and controlled ambient contamination.
To benchmark ambimux, we developed a simulation method called ambisim
19. This single-cell
multiome simulator generates droplets from a pool of donors, sampling variant alleles during
read generation.
Ambisim generates multiome droplets from a set of barcodes with information on donor, cell
type, and read types (donor vs. ambient). It produces two sets of droplets: empty and cellular.
For empty droplets, the number of reads in both modalities is drawn from a negative binomial
distribution (mean = 2, size = 0.1). For cellular droplets, reads are similarly drawn but with mean
= 5,000 and size = 1.5, constrained between 200 and 100,000 reads.
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 26, 2025. ; https://doi.org/10.1101/2025.08.21.671671doi: bioRxiv preprint
In cellular droplets, reads are divided into donor and ambient. The fraction of ambient reads is
sampled from a beta distribution, with parameters varying by experiment: low ambient (shape
2,18; mean = 0.1), medium (shape 4,16; mean = 0.2), and high (shape 6,14; mean = 0.3). When
evaluating the accuracy of the ambient estimates, we pool the low, medium, and high datasets,
unless otherwise stated.
For singlets and doublets, one or two cell types are sampled from a uniform categorical
distribution. Genes or peaks are then sampled based on a Multinomial distribution of the
randomly drawn cell type(s) profiles.
In the gene expression modality, genes are sampled from the cell type's expression profile. For
ambient reads, an average profile weighted by cell type frequency is used. An isoform is
randomly selected with uniform probabilities, then classified as spliced or unspliced (probability
0.4 for cellular, 0.6 for ambient). This reflects the fact that nuclear RNA will be enriched for
unspliced mRNA relative to cytoplasmic or total cell RNA. The location of the single-end read
sequence is sampled uniformly within the cDNA sequence of the spliced or unspliced isoform.
The ATAC modality employs a similar procedure but accounts for inter-peak reads. Peaks are
sampled using a Multinomial distribution with cell type accessibility probabilities. The ambient
peak distribution combines cell type distributions weighted by their frequency. For inter-peak
reads, its genomic location is sampled randomly from the genome. Whether a read falls inside
or outside a peak depends on whether the droplet is empty. For an empty droplet, we sample a
read coming from a peak using a Beta distribution with shape parameters (1, 9). For non-empty
droplets, we sample a read coming from a peak using a Beta distribution with shape parameters
(4, 6). To generate the paired-end read, we sample an insert size from 0 to 150 uniformly.
For both ATAC-seq and RNA-seq, the read sequences are further modified as follows. First, all
variants overlapping the read alignment are selected. For each variant in a cellular read, an
allele is sampled from a Bernoulli using the allele frequency of the read’s donor and the base at
the read site is replaced with this allele. The procedure is similar for an ambient read, where the
ambient allele frequency is used instead of the donor’s allele frequency. To obtain the final read
sequence we sample a sequencing error with a probability of 0.01 for each base pair. Given an
error, the base is replaced randomly with any of the four nucleotides.
For all simulations, we set the cell types and gene and peak probability distributions from the
single-cell multiome visceral adipose tissue data set (see below). Briefly, the top 3,000 barcodes
ranked by gene expression UMIs were clustered using Seurat v4.3.0
8 with a resolution of 0.5
and otherwise default parameters. This resulted in nine cell types, and the proportion, gene
expression, and peak probabilities for each were calculated empirically. Donors and genotypes
were simulated from 1,000 Genomes
36. We randomly selected unrelated European donors for
all simulated data. For each dataset and demultiplexing run, we kept biallelic SNPs and
removed SNPs monomorphic in the pooled donors. This set of SNPs was used for both
generating the fastq files and demultiplexing donors.
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 26, 2025. ; https://doi.org/10.1101/2025.08.21.671671doi: bioRxiv preprint
All simulated fastqs were aligned with CellRanger Arc v2.0.225 using GRCh38 and GENCODE
v4137, the same references from which the data were generated. Additionally, we provided the
peaks BED file to the ‘--peaks’ option to ensure that all simulations generated the same peak
count data.
Differential abundance analysis of RNA and ATAC features
To evaluate the effect of ambient contamination on differential expression and differential
accessibility, termed differential abundance (DA) here, we slightly modified the data generation
to include a disease cell subtype. We took the adipocyte cell-type and modified the multinomial
probabilities. We generated log2 fold-changes ranging from -2 to 2 for 1,000 randomly selected
features with an expression probability above 1 x 10
-5. For four individuals we replaced the
adipocyte cell-type with the disease cell-type. The three datasets were then generated as
described above with low, medium, and high ambient contamination.
To detect DA features, we first demultiplexed the aligned data with ambimux and extracted the
assigned singlets. The counts from CellRanger Arc were preprocessed using Signac
38, where
the RNA counts were normalized to sum to 1,000 and log transformed. Finally, we performed a
standard Wilcoxon test using the built-in “FindMarkers” function in Seurat
8 between the healthy
and disease cell-type. DA testing was performed only for the 1,000 disease features in each
modality. For evaluating the effect of filtering droplets by ambient contamination, we ran
FindMarkers using singlets for which the estimated ambient fraction of that modality was below
the threshold. We defined significant DA features using a Bonferroni-corrected p-value threshold
of 0.05.
Single nucleus multiome sequencing of visceral adipose tissue from seven participants
in the KOBS cohort
Visceral adipose tissue (VAT) biopsies were obtained from seven participants of the Kuopio
OBesity Surgery Study (KOBS) cohort
23,24 undergoing bariatric surgery. The participants of the
KOBS cohort were recruited in the University of Eastern Finland and Kuopio University Hospital,
Kuopio, Finland. All participants provided a written informed consent, and the KOBS study was
approved by the Ethics Committee of the Northern Savo Hospital District, in accordance with the
Declaration of Helsinki.
The seven VAT samples were processed using the 10x Single Cell Multiome ATAC + Gene
Expression kit on a 10x Chromium controller. Briefly, we pooled seven VAT biopsies in equal
ratios and isolated nuclei for droplet encapsulation, GEM formation, and cell barcoding. The
fresh-frozen samples were first combined and then manually minced over dry ice. The tissue
was then lysed in 500
μ L of buffer (10 mM Tris-HCl, 10 mM NaCl, 3 mM MgCl2, 0.1% Tween-
20, 0.1% IGEPAL CA-630, 0.01% Digitonin, 1% BSA, 1 mM DTT, and 1 U/μ L RNase inhibitor)
for 15 minutes on ice. The lysate was then mixed with 500 μ l of wash buffer (10 mM Tris-HCl,
10 mM NaCl, 3 mM MgCl2, 1% BSA, 0.1% Tween-20, 1 mM DTT, and 1 U/μ L RNase inhibitor)
and filtered through a 70 μ m FlowMi cell strainer. The nuclei were then centrifuged for 5 minutes
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 26, 2025. ; https://doi.org/10.1101/2025.08.21.671671doi: bioRxiv preprint
at 500 x g at 4°C. This was followed by a resuspension in 1 ml wash buffer, filtering with a 40
μ m FlowMi cell strainer, and another centrifugation at 500 x g for 5 minutes at 4°C. The
resulting pellet was suspended in 30 μ l of chilled nuclei buffer (1X Nuclei Buffer (10x
Genomics), 1 mM DTT, and 1 U/μ l RNase inhibitor). Concentration and quality were assessed
using a Countess II Automated Cell Counter with trypan blue and DAPI staining. Joint single
nucleus RNA and ATAC libraries were constructed with Single Cell Multiome ATAC + Gene
Expression Reagent Kit (10x Genomics). Concentration and quality of cDNA and libraries were
assessed using an Agilent Bioanalyzer. Finally, libraries were sequenced on an Illumina
NovaSeq X Plus for the RNA and an Illumina NextSeq 500 for the ATAC, targeting 400 million
reads in each modality. Reads were aligned with CellRanger Arc v2.0.2
25 using GRCh38 and
GENCODE v4137. Peak calls and read counts from CellRanger were then used for downstream
analyses.
Genotyping and imputation of KOBS participants
We genotyped the participants of the KOBS cohort using the Illumina Infinium Global Screening
Array-24 v1. We used plink v1.9
39 for basic filtering. Individuals with missingness > 2% were
excluded, and we verified reported sex. We then removed unstranded or strand-ambiguous
SNPs, monomorphic SNPs, SNPs with missingness > 2%, and SNPs with a Hardy-Weinberg
Equilibrium (HWE) p-value < 1x10
-6. The genotypes were then phased and imputed using the
HRC reference panel r1.1 201640 on the Michigan imputation server. SNPs with an allele
mismatch to the reference were removed, phased with EAGLE v2.441, and imputed with
minimac442. For demultiplexing, we removed monomorphic and kept biallelic genotyped and
imputed SNPs with R2 > 0.99, resulting in 3,995,059 SNPs.
Demultiplexing of simulated and VAT pooled multiome data
Demultiplexing of the simulated and VAT pooled multiome samples were performed in the same
manner as follows. For ambimux, we used all inter + intra peak reads for the simulations as
ambient fractions were kept constant for all genomic regions. We used intra peak reads for
demultiplexing the VAT data with ambimux unless specified otherwise. For Vireo v0.5.8
15, we
first ran a pileup with cellSNP (Cellsnp-lite v1.2.3)43 using the BAM and VCF files. We set the
UMI specifier to “UB” for the RNA runs and “None” for the ATAC, and required a minimum read
count of 1 for variant pileup. We ran Vireo on the cellSNP pileup output using default
parameters. For SouporCell v2.4
16, we used ‘--skip_remap True’ and 200 restarts for both RNA
and ATAC, ‘--no_umi True’ for the ATAC, and default settings otherwise. For Demuxlet v2, we
ran the popscle implementation (https://github.com/statgen/popscle
), where the pileup was
generated first for input to Demuxlet. We set the minimum base quality score to 19, set an
empty UMI tag with the ‘--tag-UMI’ parameter for the ATAC only, and used default parameters
otherwise. Ambimux was run on all droplets without prior filtering, while Vireo, SouporCell, and
Demuxlet were run on only the candidate set of droplets. These were defined as the known cell-
containing barcodes for the simulations. For the VAT dataset, the candidates were defined as
the 6,414 droplets with at least 100 RNA UMIs and 100 ATAC fragments from the 10x
CellRanger barcode summary output. Comparisons between methods were restricted to these
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 26, 2025. ; https://doi.org/10.1101/2025.08.21.671671doi: bioRxiv preprint
candidate droplets.
We called singlets and doublets and assigned donors based on the posterior probabilities. The
posterior probabilities were calculated from the log probabilities for SouporCell and from the log
likelihood difference between singlet and doublet for Demuxlet. We used a posterior probability
of 0.90 to classify droplets as singlet or doublet and classified them as ambiguous otherwise.
For ambimux, we also classified empty droplets using the same threshold. For donor
assignment, cells identified as singlets were assigned to the donor with the highest likelihood
score.
Clustering and gene-peak link analysis of VAT pooled multiome data
To evaluate how ambient contamination affects the ability to detect gene-peak links, we
clustered and separated the data by ambient fraction. First, we kept singlets identified by
ambimux with a posterior probability greater than 0.90. To ensure enough coverage, we kept
droplets with at least 200 RNA UMIs and 500 ATAC fragments. We processed and clustered the
nuclei using Signac v1.10
38, normalizing the RNA counts to sum to 1,000 and log transforming.
For clustering and cell-type identification, we selected the top 2,000 variable genes using the
variance-stabilizing transform, scaled the counts, and ran PCA. Then, we applied leiden
clustering using a resolution of 0.5 on the top 50 PCs. Cell-types were assigned manually based
on marker genes using ‘FindAllMarkers’. For visualization, we ran Uniform Manifold
Approximation and Projection (UMAP) on the multimodal neighbor graph after processing the
ATAC data with TFIDF and SVD on peaks with at least 5 counts.
We then separated droplets into low and high ambient groups ensuring equal coverage and cell-
type representation. To do so, we first separated droplets by low ambient contamination
(ambient fraction estimate less than 0.05 in both modalities) and high ambient contamination
(ambient fraction estimate greater than 0.05 in both modalities). Then we binned droplets based
on cell-type and coverage, using 10 equally-spaced bins based on log UMIs and log fragments.
As the high ambient group had fewer nuclei, we sampled equal numbers of droplets without
replacement from each group based on the bin distribution of the high ambient group. Finally,
we ran the LinkPeaks Signac function
9,38 on peaks within 500,000 base pairs of the target gene
TSS and genes with at least 1 count in at least 10% of nuclei. P-values were corrected for
multiple testing using FDR.
Precision and recall curves
For the simulated data, we generated precision and recall curves against singlet posterior
probability thresholds. Precision was defined as the fraction of singlets with a correct donor
assignment over the number of called singlets. Recall was defined as the fraction of singlets
with a correct donor assignment over the total number of singlets in the data.
Estimation of ambient fractions with CellBender
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 26, 2025. ; https://doi.org/10.1101/2025.08.21.671671doi: bioRxiv preprint
We compared our ambient fraction estimation results with those from CellBender v0.2.218, a
Method
to remove ambient reads from the count matrix alone. We used the three simulated
datasets of low, medium, and high contamination to have a ground truth metric for comparison.
Evaluations were run for the RNA and ATAC modalities, although we note that CellBender was
not explicitly designed for ATAC data. We set ‘--expected-cells 10000` and ‘--total-droplets-
included 11000’ to accurately reflect the number of droplets simulated and used default
parameters otherwise. Ambient fraction estimates were then extracted from the
‘background_fraction’ field of the h5 output for the singlets.
Data Availability
The visceral adipose tissue data (n=7) will be available under GEO accession number XXX
upon acceptance. Due to privacy concerns, the genotype data (n=7) are available from the
corresponding authors upon reasonable request and the data sharing involves a standard data
sharing agreement.
Code Availability
The ambimux software is available for download and use at
https://github.com/marcalva/ambimux
. The ambisim software used to generate the simulated
data can be found at https://github.com/marcalva/ambisim.
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 26, 2025. ; https://doi.org/10.1101/2025.08.21.671671doi: bioRxiv preprint
References
1. Trevino, A. E. et al. Chromatin and gene-regulatory dynamics of the developing human
cerebral cortex at single-cell resolution. Cell 184, 5053-5069.e23 (2021).
2. Gur, C. et al. LGR5 expressing skin fibroblasts define a major cellular hub perturbed in
scleroderma. Cell 185, 1373-1388.e20 (2022).
3. Buenrostro, J. D. et al. Single-cell chromatin accessibility reveals principles of regulatory
variation. Nature 523, 486–490 (2015).
4. Luo, C. et al. Single-cell methylomes identify neuronal subtypes and regulatory elements in
mammalian cortex. Science 357, 600–604 (2017).
5. Rotem, A. et al. Single-cell ChIP-seq reveals cell subpopulations defined by chromatin state.
Nat. Biotechnol. 33, 1165–1172 (2015).
6. Flyamer, I. M. et al. Single-nucleus Hi-C reveals unique chromatin reorganization at oocyte-
to-zygote transition. Nature 544, 110–114 (2017).
7. Preissl, S., Gaulton, K. J. & Ren, B. Characterizing cis-regulatory elements using single-cell
epigenomics. Nat. Rev. Genet. 24, 21–43 (2023).
8. Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888-1902.e21
(2019).
9. Ma, S. et al. Chromatin Potential Identified by Shared Single-Cell Profiling of RNA and
Chromatin. Cell 183, 1103-1116.e20 (2020).
10. De Rop, F. V. et al. Hydrop enables droplet-based single-cell ATAC-seq and single-cell
RNA-seq using dissolvable hydrogel beads. ELife 11, (2022).
11. Stoeckius, M. et al. Cell Hashing with barcoded antibodies enables multiplexing and doublet
detection for single cell genomics. Genome Biol. 19, 224 (2018).
12. Kang, H. M. et al. Multiplexed droplet single-cell RNA-sequencing using natural genetic
variation. Nat. Biotechnol. 36, 89–94 (2018).
13. Mylka, V. et al. Comparative analysis of antibody- and lipid-based multiplexing methods for
single-cell RNA-seq. Genome Biol. 23, 55 ( 2022).
14. Schaefer, N. K., Pavlovic, B. J. & Pollen, A. A. CellBouncer, A Unified Toolkit for Single-Cell
Demultiplexing and Ambient RNA Analysis, Reveals Hominid Mitochondrial
Incompatibilities. Preprint at bioRxiv https://doi.org/10.1101/2025.03.23.644821 (2025).
15. Huang, Y., McCarthy, D. J. & Stegle, O. Vireo: Bayesian demultiplexing of pooled single-cell
RNA-seq data without genotype reference. Genome Biol. 20, 273 (2019).
16. Heaton, H. et al. Souporcell: robust clustering of single-cell RNA-seq data by genotype
without reference genotypes. Nat. Methods 17, 615–620 (2020).
17. Young, M. D. & Behjati, S. SoupX removes ambient RNA contamination from droplet-based
single-cell RNA sequencing data. Gigascience 9, (2020).
18. Fleming, S. J. et al. Unsupervised removal of systematic background noise from droplet-
based single-cell experiments using CellBender. Nat. Methods 20, 1323–1335 (2023).
19. Li, T. et al. The impact of ambient contamination on demultiplexing methods for single-
nucleus multiome experiments. eLife https://doi.org/10.7554/elife.106769.1 (2025).
20. Alvarez, M. et al. Enhancing droplet-based single-nucleus RNA-seq resolution using the
semi-supervised machine learning classifier DIEM. Sci. Rep. 10, 11019 (2020).
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 26, 2025. ; https://doi.org/10.1101/2025.08.21.671671doi: bioRxiv preprint
21. Schmid, K. T. et al. scPower accelerates and optimizes the design of multi-sample single
cell transcriptomic studies. Nat. Commun. 12, 6625 (2021).
22. Anderson, A. G. et al. Single nucleus multiomics identifies ZEB1 and MAFB as candidate
regulators of Alzheimer’s disease-specific -regulatory elements. Cell Genom. 3, 100263
(2023).
23. Männistö, V. T. et al. Lipoprotein subclass metabolism in nonalcoholic steatohepatitis. J.
Lipid Res. 55, 2676–2684 (2014).
24. Pihlajamäki, J. et al. Cholesterol absorption decreases after Roux-en-Y gastric bypass but
not after gastric banding. Metabolism 59, 866–872 (2010).
25. Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat.
Commun. 8, 14049 (2017).
26. Slyper, M. et al. A single-cell and single-nucleus RNA-Seq toolbox for fresh and frozen
human tumors. Nat. Med. 26, 792–802 (2020).
27. Mandric, I. et al. Optimized design of single-cell RNA sequencing experiments for cell-type-
specific eQTL analysis. Nat. Commun. 11, 5504 (2020).
28. Mitra, S. et al. Single-cell multi-ome regression models identify functional and disease-
associated enhancers and enable chromatin potential analysis. Nat. Genet. 56, 627–636
(2024).
29. Wang, S. K. et al. Single-cell multiome of the human retina and deep learning nominate
causal variants in complex eye diseases. Cell Genom. 2, (2022).
30. Jun, G. et al. Detecting and estimating contamination of human DNA samples in sequencing
and array-based genotype data. Am. J. Hum. Genet. 91, 839–848 (2012).
31. Habib, N. et al. Massively parallel single-nucleus RNA-seq with DroNc-seq. Nat. Methods
14, 955–958 (2017).
32. Dempster, A. P., Laird, N. M. & Rubin, D. B. Maximum likelihood from incomplete data via
the EM Algorithm. J. R. Stat. Soc. 39, 1–22 (1977).
33. Lun, A. T. L. et al. EmptyDrops: distinguishing cells from empty droplets in droplet-based
single-cell RNA sequencing data. Genome Biol. 20, 63 ( 2019).
34. Chen, Y. & Ye, X. Projection Onto A Simplex. Preprint at arXiv
https://arxiv.org/abs/1101.6081 (2011).
35. Cao, Y., Yang, P. & Yang, J. Y. H. A benchmark study of simulation methods for single-cell
RNA sequencing data. Nat. Commun. 12, 6911 (2021).
36. The 1000 Genomes Project Consortium et al. A global reference for human genetic
variation. Nature 526, 68–74 (2015).
37. Harrow, J. et al. GENCODE: the reference human genome annotation for The ENCODE
Project. Genome Res. 22, 1760–1774 (2012).
38. Stuart, T., Srivastava, A., Madad, S., Lareau, C. A. & Satija, R. Single-cell chromatin state
analysis with Signac. Nat. Methods 18, 1333–1341 (2021).
39. Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger and richer
datasets. Gigascience 4, 7 (2015).
40. McCarthy, S. et al. A reference panel of 64,976 haplotypes for genotype imputation. Nat.
Genet. 48, 1279–1283 (2016).
41. Loh, P.-R. et al. Reference-based phasing using the Haplotype Reference Consortium
panel. Nat. Genet. 48, 1443–1448 (2016).
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 26, 2025. ; https://doi.org/10.1101/2025.08.21.671671doi: bioRxiv preprint
42. Das, S. et al. Next-generation genotype imputation service and methods. Nat. Genet. 48,
1284–1287 (2016).
43. Huang, X. & Huang, Y. Cellsnp-lite: an efficient tool for genotyping single cells.
Bioinformatics 37, 4569–4571 (2021).
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 26, 2025. ; https://doi.org/10.1101/2025.08.21.671671doi: bioRxiv preprint
Acknowledgements
We would like to acknowledge and thank the participants of the KOBS cohort who participated
in this study. We would like to acknowledge the Single Cell Genomics Core and Biocenter
Finland for infrastructure support.
Funding
This study was supported by NIH grants R01HL170604 (PP), R01DK132775 (PP), HG012079
(NZ) and R01MH125252 (NZ), and the Academy of Finland (333021, 335973, MUK). This
research was partly supported by the European Research Council (ERC) under the European
Union’s Horizon 2020 research and innovation program (Grant 802825 to MUK). MUK was
supported by Sigrid Juselius Foundation, Finnish Foundation for Cardiovascular Research
and by the European Union (ERC, SECRET, 101125115). Views and opinions expressed are
however those of the author(s) only and do not necessarily reflect those of the European Union
or the European Research Council. Neither the European Union nor the granting authority can
be held responsible for them. T.Ö. was supported by the Research Council of Finland,
Competitive Funding to Strengthen University Research Profiles, 7th Call, profiling measure
TransMed, funding decision number 352968. Kuopio Obesity Surgery Study was supported by
the Kuopio University Hospital Project grants (EVO/VTR grants 2005
‐ 2024) and the Academy of
Finland grant (Contract no. 138006).
Contributions
MA, EH, NZ, and PP conceived the project. MA designed the approach with contributions from
TL, ER, ZC, OA, EH, CL, NZ, and PP. MA wrote the software. UTA, IS, TO, DK, VM, JP, MUK,
and PP contributed towards generation of the visceral adipose tissue data. TL, STL, and AK
performed data analyses and interpretation of the visceral adipose tissue data. DK, VM, JP,
MUK, and PP collected cohort materials and data. MA, NZ, and PP wrote and contributed to the
final manuscript. All authors read and approved of the final manuscript.
Corresponding authors
Correspondence to Marcus Alvarez (
[email protected]
), Noah Zaitlen
(
[email protected]), and Päivi Pajukanta (
[email protected]).
Ethics approval and consent to participate
The KOBS study was approved by the Ethics Committee of the Northern Savo Hospital District.
The study adhered to the principles outlined in the Declaration of Helsinki. All participants
provided written informed consent.
Competing interests
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 26, 2025. ; https://doi.org/10.1101/2025.08.21.671671doi: bioRxiv preprint
The authors declare no competing interests.
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 26, 2025. ; https://doi.org/10.1101/2025.08.21.671671doi: bioRxiv preprint
Figure 1: Ambimux recovers more contaminated singlets in a simulated multiome
experiment. a, Distribution of the proportion of ambient reads in the low (top), medium (middle),
and high (bottom) simulated single-cell multiome datasets. The vertical red lines indicate the
mean (0.1, 0.2, and 0.3). b, c, Recall (sensitivity) of singlet assignments in the three simulated
datasets from ambimux, Demuxlet12, SouporCell16, and Vireo15. Ambimux was run on the
combined modalities, while all other methods were applied to ATAC (b) and RNA (c) separately.
The dashed vertical line shows the 90% posterior probability threshold. d, Relationship between
droplet ambient fraction and classification accuracy across methods and modalities. The
stacked bars show the proportion of singlets (SNG) classified as accurate (assigned to
corrected donor), inaccurate (incorrect donor), non-singlets (assigned as doublet or empty), and
ambiguous.
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 26, 2025. ; https://doi.org/10.1101/2025.08.21.671671doi: bioRxiv preprint
Figure 2: Accurate estimation of ambient proportions in simulated multiome datasets. a,
Mean absolute error (MAE) of ATAC and RNA ambient proportion estimates by ambimux after
combining results from the three simulated datasets of low (mean 0.1), medium (mean 0.2), and
high (mean 0.3) contamination. b, Correlation between true and estimated ambient fractions for
the ATAC (left) and RNA (right) modalities in the combined results. c, MAE of ambient fraction
estimates grouped by coverage, demonstrating improved estimation accuracy with increasing
coverage. d, Comparison between ambimux and competing methods in singlet (SNG)
classification when run on either ATAC (left) and RNA (right) data separately. Results are
grouped by ambient dataset. The bar plots show the proportion of ground truth singlets
assigned as correct singlets, incorrect singlets, incorrect doublets/empty, and ambiguous.
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 26, 2025. ; https://doi.org/10.1101/2025.08.21.671671doi: bioRxiv preprint
Figure 3: Ambimux is robust to variations in pooling numbers and sample dropout. a,
Recall (left) and precision (right) curves for singlet assignment when pooling 2, 4, 8, 16, 32, and
64 simulated genetically distinct donors from 1000 Genomes36. The x-axis shows the singlet
posterior probability threshold while colors indicate pooling number. b, The mean absolute error
(MAE) for ATAC (top) and RNA (bottom) of the ambimux ambient fraction estimates based on
the number of pooled samples (y-axis). c, Comparison of ground truth and estimated donor
proportions in the background (top) and nuclei (bottom) pool in a simulated pool of 16 donors.
The data were generated such that the proportion of the 16 donors in the background pool was
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 26, 2025. ; https://doi.org/10.1101/2025.08.21.671671doi: bioRxiv preprint
independent of that of the singlets. d, Recall (left) and precision (right) curves for ambimux
demultiplexing with simulated donor experimental dropout (adding genotype samples) and
missing genotypes (removing genotype samples). Ambimux was run on the same simulated
pool of eight donors but varying the genotype samples used for demultiplexing, either removing
donors (negative numbers) or adding donor genotypes (positive numbers).
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 26, 2025. ; https://doi.org/10.1101/2025.08.21.671671doi: bioRxiv preprint
Figure 4: Ambient contamination reduces power to detect differentially abundant peaks
or genes. a, UMAP visualization of three simulated datasets with ambient fraction means of
10% (left), 20% (middle) and 30% (right). Each dataset was simulated from the same 9 cell-
types. We artificially added a disease condition for four of the eight donors, where we introduced
varying log fold-changes for 1,000 peaks and 1,000 genes in one cell-type. b, Number of
significant differentially abundant (DA) features (Bonferroni < 0.05) in the disease cell-type
between conditions for each of the three datasets. This shows ambient contamination
decreases power to detect significant DA features. c, Number of significant DA features
(Bonferroni < 0.05) as in (b), but after filtering out droplets by various ambient fraction
thresholds in the combined data. The three ambient datasets were merged, and DA feature
testing was performed after filtering by the ambient fraction threshold indicated in the x-axis.
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 26, 2025. ; https://doi.org/10.1101/2025.08.21.671671doi: bioRxiv preprint
Fig. 5: Ambimux demultiplexing of seven visceral adipose tissue samples. a, Droplet
classification by ambimux run on multiome data generated from seven pooled visceral adipose
tissue samples. Low coverage droplets tended to be classified as empty, while higher count
droplets tended to be classified as doublets. b, Classification of 6,414 candidate droplets across
methods. Ambimux was run on RNA+ATAC, while Vireo, Demuxlet, and SouporCell were run
on either ATAC or RNA. Note that only ambimux is able to assign empty droplets. c,
Concordance of singlet classification between methods. The overlap is defined as the number of
droplets classified as singlets by both methods divided by the minimum number of singlets
called by either method. d, Scatterplot showing singlet agreement between ambimux and the
three methods from above. Singlets are plotted against estimated ambient fraction and read
coverage for ATAC (left) and RNA (right) and colored by whether the singlet was called by
ambimux only (blue), called by both ambimux and at least one other method (red), or called by
at least one other method only (green).
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 26, 2025. ; https://doi.org/10.1101/2025.08.21.671671doi: bioRxiv preprint
Figure 6: Estimation of ambient fractions in pooled multiome of seven visceral adipose
tissue donors. a, Distribution of ambient fraction estimates from inter- vs. intra-peak reads.
Inter-peak read ambient fraction is estimated from reads originating from outside peaks, while
intra-peak read ambient fraction is estimated from reads inside peaks. b, c, Relationship
between ATAC and RNA ambient fraction estimates in singlets. The density plot shows a similar
distribution of ambient fractions for both modalities (b), while the scatterplot shows a low
correlation between the two (c). d, Correlation between read coverage and ambimux ambient
fraction estimates in singlets for ATAC (left) and RNA (right). e, Distribution of ambimux ambient
fraction estimates per cell-type in the ATAC and RNA modalities. f, g, Number of significant
gene-peak links (FDR-corrected p < 0.05) in low- and high-ambient droplets (f), and the
distance between the peak and transcription start site (TSS) for these gene-peak links (g). Low-
and high-ambient droplets were defined as those with less than 5% or more than 5% estimated
ambient percent in both modalities, respectively.
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 26, 2025. ; https://doi.org/10.1101/2025.08.21.671671doi: bioRxiv preprint
Text is read by the "Ask this paper" AI Q&A widget below.
Extraction quality varies by source — PMC NXML preserves structure
cleanly, OA-HTML may include some navigation residue, and OA-PDF can
have broken hyphenation. The publisher copy
(via DOI)
is the canonical version.