{"paper_id":"681ca5e2-ea5e-4871-b728-80becf9682ad","body_text":"Integrated ambient modeling and genetic demultiplexing of single-cell RNA+ATAC \nmultiome experiments with Ambimux \n \nMarcus Alvarez1, Terence Li1, Seung Hyuk T. Lee1, Uma Thanigai Arasu2, Ilakya Selvarajan2, \nTiit Örd2, Elior Rahmani3, Zeyuan (Johnson) Chen4, Oren Avram3,4,5, Asha Kar1,6, Dorota \nKaminska7,8, Ville Männistö9, Eran Halperin1,3,4,5,10, Jussi Pihlajamäki7,11, Chongyuan Luo1,12, \nMinna U. Kaikkonen2, Noah Zaitlen3,13,*, Päivi Pajukanta1,10,* \n \n1Department of Human Genetics, David Geffen School of Medicine at UCLA, Los Angeles, CA, \nUSA \n2A. I. Virtanen Institute for Molecular Sciences, University of Eastern Finland, Kuopio, Finland \n3Department of Computational Medicine, David Geffen School of Medicine, University of \nCalifornia Los Angeles, Los Angeles, CA, USA \n4Department of Computer Science, University of California Los Angeles, Los Angeles, CA, USA \n5Department of Anesthesiology and Perioperative Medicine, David Geffen School of Medicine, \nUniversity of California Los Angeles, Los Angeles, CA, USA \n6Bioinformatics Interdepartmental Program, UCLA, Los Angeles, CA, USA \n7Institute of Public Health and Clinical Nutrition, University of Eastern Finland, Kuopio, Finland \n8Department of Medicine, Division of Digestive Diseases, UCLA, Los Angeles, CA, USA \n9Institute of Clinical Medicine, University of Eastern Finland and Kuopio University Hospital, \nKuopio, Finland \n10Institute of Precision Health, University of California Los Angeles, Los Angeles, CA, USA \n11Department of Medicine, Endocrinology and Clinical Nutrition, Kuopio University Hospital, \nKuopio, Finland \n12Center for Neurobehavioral Genetics, Semel Institute for Neuroscience and Human Behavior, \nDavid Geffen School of Medicine, University of California Los Angeles, Los Angeles, CA, USA \n13Department of Neurology, University of California Los Angeles, Los Angeles, CA, USA \n \n*These authors contributed 
equally to this manuscript \n \n  \nbioRxiv preprint doi: https://doi.org/10.1101/2025.08.21.671671; this version posted August 26, 2025. The copyright holder for this preprint \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \navailable under a CC-BY 4.0 International license. \n\nAbstract \n \nSingle-cell technologies have advanced at a rapid pace, providing assays for various molecular \nphenotypes. Droplet-based single-cell technologies, particularly those based on nuclei isolation, \nsuch as simultaneous RNA+ATAC single-cell multiome, are susceptible to exogenous ambient \nmolecule contamination, which can increase noise in cell type-level associations. We reasoned \nthat genotype-based sample multiplexing can provide an opportunity to infer this ambient \ncontamination by leveraging DNA variation in sequenced reads. Thus, we developed ambimux, \na likelihood-based method to estimate ambient fractions and demultiplex single-cell multiome \nexperiments using genotype-level data. Ambimux models the ambient or nuclear probability at \nthe read level and thus can classify empty droplets and estimate droplet-specific ambient \nmolecule fractions in each modality. We first evaluated our method using simulated data sets \nacross a range of parameters. We found that ambimux closely estimated the ground truth \ndroplet contamination fractions in the RNA (MAE=0.048) and ATAC (MAE=0.042) modalities. As \na result, ambimux maintained high specificity (>95%) and was able to correctly assign singlets \nat considerably high ambient fractions (up to 60%) for both RNA and ATAC modalities. In \ncomparison, models that do not consider ambient contamination maintained similar \nsensitivity levels only at considerably lower ambient fractions (up to 25%). We then generated \na real data set of seven visceral adipose tissue biopsies run on a single 10x Multiome channel. 
\nWe ran ambimux and detected 4,986 singlets, capturing similar numbers as other methods. \nThen, we sought to evaluate the fidelity of the ambient fraction estimates from ambimux. We \nsplit singlets into ambient-enriched (>5% contamination in both modalities) or nuclear-enriched \n(<5% in both) droplets and performed gene-peak linkage analysis. Low ambient droplets \nresulted in more significant hits with gene-peak links enriched at the transcription start site \nrelative to high ambient droplets, suggesting that the ambient droplets identified by ambimux \nhamper the identification of biologically meaningful signals. In summary, we developed a joint \nsingle-cell multiome demultiplexing method, ambimux, that accurately models and estimates \nambient molecule contamination in each modality. \n \nIntroduction \n \nThe ability to profile multiple molecular phenotypes in single cells has provided powerful \nopportunities to study the interaction between regulatory elements and gene expression1,2. \nSingle cell technologies can profile molecular phenotypes beyond gene expression, including \nopen chromatin3, methylation4, histone modifications5, and chromatin configuration6. These \nassays allow for the characterization of cis-regulatory elements (CREs) that regulate gene \nexpression in a cell-type-specific manner7. These analyses are further aided by computational \nmethods that can integrate separate single-cell ATAC-seq and RNA-seq experiments8. \nHowever, these approaches only infer the joint epigenomic and transcriptomic profiles of cell \ntypes without direct measurements in the same cell. 
To overcome these challenges, so-called \nmultiome technologies have been developed to jointly profile open chromatin and RNA in the \nsame cell. This can provide valuable insights in connecting CREs with gene expression, \nparticularly in differentiating cell types where the epigenetic state may temporally mismatch \ngene expression9. \n \nWhile joint profiling of single cells provides clear advantages over single-modality assays, it is \ncurrently costly and thus difficult to scale over many samples10. These limitations may hinder \npopulation-scale studies of gene regulation at the single-cell level. Multiplexing by pooling \nsamples provides a feasible approach to increase sample sizes while keeping costs fixed11,12. \nAdditionally, multiplexing allows for better detection of heterotypic doublets. When samples are \ngenetically distinct, they can be pooled without additional experimental approaches required by \nantibody- or lipid-based hashing strategies13,14. Previous computational approaches developed \nfor droplet-based single-cell RNA-seq experiments leverage genetic variation as a natural \nbarcode in each cell to assign droplets to individual donors12,15,16. However, these methods are \ndesigned to run on a single modality, and do not leverage both RNA and ATAC reads in the \nsame droplet. Furthermore, these methods typically require prior specification of candidate \ndroplets. Genetic multiplexing offers the advantage of detecting empty droplets based on variant \ncalls, which may be superior to detection via expression or accessibility. \n \nAn additional challenge encountered with RNA and ATAC single-cell multiome experiments is \nthe potential contamination of exogenous ambient molecules in droplets14,17–19. This issue is \nmore prominent with nuclei, especially if isolated from solid or frozen tissues20. 
As current \nsingle-cell multiome protocols rely on nuclei isolation9, methods to evaluate ambient molecule \ncontamination would be valuable for quality control or covariate correction purposes. \n \nTo address these limitations, we developed ambimux, a computational approach to demultiplex \nsingle-cell multiome experiments and estimate ambient molecule fractions for each modality. \nThese per-modality ambient fraction estimates provide a valuable filtering \nmetric. Additionally, ambimux classifies empty droplets based on genetic data, removing the \nneed for prior filtering. We demonstrate via simulations that ambimux outperforms existing \nmethods for demultiplexing contaminated multiome experiments, recovering a higher number of \nsinglets with greater accuracy. We also show that ambimux accurately estimates ambient \nfractions in RNA and ATAC modalities in these synthetic data. Finally, we apply ambimux to a \nsingle-cell multiome dataset from seven human visceral adipose tissue samples in a single run. \nOverall, ambimux provides a fast and scalable approach for demultiplexing single-cell multiome \nexperiments and accounting for ambient molecule contamination. \n \n
Results \n \nOverview of the ambimux model \n \nAmbimux is a likelihood-based method designed for demultiplexing single-cell multiome RNA- \nand ATAC-seq experiments, while simultaneously accounting for ambient contamination. It \noffers two novel features that are particularly beneficial for droplet filtering. First, ambimux can \nclassify all generated droplets as empty, singlet, or doublet, eliminating the need for prior \nbarcode filtering. Second, it estimates the fraction of ambient RNA and DNA molecules in each \ndroplet, enhancing assessment and filtering for downstream analyses. Ambimux accomplishes \nthis by using variant-overlapping base calls within individual reads along with sample genotypes \nwithout relying on expression or accessibility data. \n \nAmbimux accurately demultiplexes samples in contaminated simulated multiome data. \n \nWe hypothesized that explicit modeling of ambient molecules per droplet would lead to \nimproved demultiplexing. To test this, we simulated three multiome datasets with low (mean = \n10%), medium (mean = 20%), and high (mean = 30%) ambient fractions (Methods; Fig. 1a). \nEach simulation consisted of 10,000 nuclei with a 10% doublet rate and eight samples \nmultiplexed uniformly (Methods). We ran ambimux and compared the performance with \nthree other demultiplexing methods: Demuxlet12, Vireo15, and SouporCell16. We ran these \nmethods in each of the RNA and ATAC modalities separately and ran ambimux jointly on both. \nFirst, we compared precision across all methods by calculating the percent of classified singlets \nwith a correct donor assignment. Ambimux, Demuxlet, and Vireo had a precision of greater than \n99% across each of the three ambient simulations (Supplementary Fig. 1a,b). The precision of \nSouporCell was somewhat lower, ranging from 92%-98% and decaying slightly with increasing \ncontamination (Supplementary Fig. 1a,b). 
Overall, these results suggest that demultiplexing \nunder ambient contamination is not highly susceptible to false positives. \n \nNext, we compared recall (sensitivity) across the methods by calculating how many singlets \neach method could recover. Of the 9,000 singlets in each of the three contamination datasets, \nwe calculated the percent of these assigned as singlets to the correct donor. We observed that \nambimux was able to accurately classify over 99% of the 9,000 singlets in each of the low, \nmedium, and high ambient datasets (Fig. 1b,c). In contrast, all other methods showed a lower \nrecovery of singlets in the higher contamination datasets (Fig. 1b,c). For example, Vireo RNA, \nthe next best method, had a sensitivity of 98.4% in the low contamination data, but only 89.4% \nin the highly contaminated singlets. Thus, we found that ambient contamination generally \ndecreased the ability to recover singlets when not accounted for. \n \nAs both coverage and contamination can likely affect demultiplexing, we sought to identify the \ncontribution of each to the ability to accurately assign singlets. First, we evaluated at what point \nambient contamination started to reduce sensitivity. For both Vireo and Demuxlet, sensitivity \nstarted to degrade at around 25% ambient contamination, and most singlets were not recovered \nat around 50% (Fig. 1d). With SouporCell, 20-30% of calls were ambiguous across the entire \nrange of ambient contamination, even at 0% (Fig. 1d). In contrast, ambimux maintained over \n99% accuracy in singlet calls even when ambient fractions approached 60% (Fig. 1d). 
Droplet \ncoverage also affected singlet recovery in single-modality runs. In Vireo, Demuxlet, and \nSouporCell, lower coverage led to more ambiguous calls, especially with droplets containing \nfewer than 1,000 UMIs or fragments (Supplementary Fig. 1c). In contrast, ambimux was largely \nunaffected as both modalities contributed reads for demultiplexing (Supplementary Fig. 1c). \n \nAccurate estimation of droplet-specific ambient contamination. \n \nMultiplexing of single-cell experiments with genetically distinct individuals provides a unique \nopportunity to assess ambient molecule contamination. The distinct “genotype” of the ambient \npool allows for probabilistic discrimination between donor and ambient molecules. To leverage \nthis, we incorporated the droplet fraction of ambient molecules as a parameter (Methods). \nTo test the accuracy of these contamination estimates, we combined the three simulated \ndatasets above and compared the estimates with the simulated true values. Overall, we found \nthese estimates to be highly accurate in both modalities (Fig. 2a,b), with a mean absolute error \n(MAE) of 4.8% and 4.2% in the RNA and ATAC, respectively (Fig. 2a). We further hypothesized \nthat accuracy in droplet ambient estimation would be influenced by coverage. To test this, we fit \na loess regression curve between the droplet absolute error and the number of \nvariant-overlapping reads (informative reads). As expected, we found lower errors in ambient estimates \nwith increasing coverage, with similar parameters in both modalities (Fig. 2c; Supplementary \nFig. 2). For example, droplets with 100 informative RNA reads were predicted to have an error \nof 7.3%, while those with 1,000 informative RNA reads had a predicted error of only 2.7% (Fig. \n2c). \n \nWe also compared our background fraction estimates with those of CellBender18, a deep \ngenerative model for ambient read removal in single-cell RNA-seq using count data. 
As our \nsimulations generated a distinct background distribution, we reasoned that CellBender could be \napplied to our three synthetic contaminated datasets. In addition to RNA, we tested ATAC \ncounts, although we note that the authors never evaluated CellBender in this modality. We \nfound that CellBender failed to properly estimate background fractions (Supplementary Fig. 3). \nAfter combining the results from the three datasets, the mean absolute errors were 19.0% and \n18.4% in the RNA and ATAC modalities, respectively (Supplementary Fig. 3a). Upon further \ninspection, we found that CellBender underestimated the ambient fraction on average, and the \nMAE increased with higher background reads (Supplementary Fig. 3b). Importantly, background \nestimates from CellBender were uncorrelated with the simulated ambient fractions for each of \nthe 3 ambient datasets (Pearson R < 0.01) (Supplementary Fig. 3c). Overall, these comparisons \nhighlight how modeling genotypes can improve estimation of ambient read fractions when \ncompared to read counts alone. \n \nAmbimux improves demultiplexing of contaminated droplets in single modalities. \n \nWhile developed for single-cell multiome data, ambimux can easily run on a single modality. We \ntested ambimux on the RNA and ATAC modalities separately, using the three simulated \ndatasets described above (low, medium, and high contamination). This allowed us to assess the \nbenefits of ambient modeling independently of coverage, as the RNA+ATAC approach doubled \nthe number of reads on average compared to the single-modality approach. 
We again \ncompared ambimux with Demuxlet12, Vireo15, and SouporCell16 and assessed sensitivity in \naccurately recovering the 9,000 singlets in each dataset (Fig. 2d). In the RNA modality, Vireo \naccurately recovered the most singlets in the low (98.4%) and medium (95.5%) contamination \ndatasets, while ambimux accurately recovered 97.3% and 95.2%, respectively (Fig. 2d). In the \nhigh contamination dataset, ambimux performed best, accurately assigning 92.5% of singlets, \nwhereas Vireo recovered 89.4% (Fig. 2d). Ambimux also outperformed all methods in the ATAC \nmodality, recovering 98.3%, 97.3%, and 95.2% of the 9,000 singlets in the low, medium, and \nhigh ambient datasets, respectively (Fig. 2d). These results show that ambimux is robust to \nambient contamination even when run on either RNA or ATAC modalities individually, although \njoint RNA+ATAC calling performed best (recall > 99%) in all three simulated datasets. \n \nAmbimux maintains accuracy across pooling strategies. \n \nAn important question in the design of multiplexed experiments is how many samples to pool. \nOften, this will involve balancing depth vs. breadth with budgetary constraints21. We sought to \ntest the performance of ambimux across a wide range of pooling numbers. We simulated \nsynthetic data sets consisting of 2, 4, 8, 16, 32, and 64 pooled samples. As before, each pooling \nexperiment contained 9,000 singlets and 1,000 doublets. Droplet background fractions were \nsampled from an equal mixture of low, medium, and high ambient distributions (Methods). We \nobserved that ambimux maintained excellent performance across all pooling numbers, with \nprecisions and recalls greater than 99.9% and 99.8%, respectively (Fig. 3a). We also found that \nambimux could accurately estimate ambient fractions with mean absolute errors of 4.6-6.1% for \nthe RNA and 4.1-5.3% for the ATAC (Fig. 3b). 
Interestingly, the accuracy of background fraction \nestimates increased with higher pooling numbers (Fig. 3b). \n \nTo properly model ambient molecules in a droplet, ambimux requires the allele frequencies of \nthe ambient pool. While donor genotypes are given, the ambient pool “genotype” must be \nestimated. To do so, we model the ambient allele frequency of a SNP as the average of the \ndonor genotypes weighted by estimated background donor proportions (Methods). We thus \nasked whether our model can properly estimate these fractions of donors in a pool. In a \nsimulation of 16 samples with a 10-fold variation in abundance, we found that the estimated \ndonor proportions accurately reflected the true simulated values in both the background pool \nand singlet droplets (Fig. 3c). This allowed for accurate estimation of ambient fractions with an \nMAE of 0.041 and 0.046 for the ATAC and RNA, respectively (Supplementary Fig. 4a). \nAdditionally, ambimux maintained high precision (> 99.9%) and sensitivity (99.9%) in this \nsimulation (Supplementary Fig. 4b,c). Our results on these simulated data show that ambimux is \nrobust to variations in sample yield, both in the ambient and singlet droplet pools. \n \nNext, we asked whether ambimux could accurately demultiplex datasets in which genotypes \nwere missing or samples dropped out. We tested this by simulating a multiome dataset of eight \ndonors and demultiplexing after removing donor genotypes or adding new donor genotypes in \naddition to the eight. 
We performed a sweep by removing 1, 2, 3, and 4 sample genotypes, \nsimulating cases of missing genotypes, and adding 1, 2, 3, and 4 sample genotypes, simulating \ncases with experimental dropout. We found that the addition of sample genotypes had minimal \neffects on recall and precision (Fig. 3d), suggesting ambimux can accurately detect and assign \nsinglets after dropout of samples in a pool. When genotypes of pooled donors were missing, \nrecall of the genotyped donors was largely unaffected. For example, the recall from \ndemultiplexing 6 out of 8 samples was 0.747, near the maximum of 0.75 (Fig. 3d). However, \nmissing genotype samples detrimentally affected the ability to accurately assign singlets to \ndonors (Fig. 3d). The precision was 0.890 when only 4 of the 8 genotypes were present, while \nthe precision was 0.999 with the full data (Fig. 3d). Missing or added genotypes also slightly \nworsened the accuracy of ambient estimates. While demultiplexing with the eight pooled \nsamples resulted in an MAE of 0.043 and 0.047 for the ATAC and RNA respectively, \ndemultiplexing with 4 missing genotypes resulted in MAEs of 0.088 and 0.091, for example \n(Supplementary Fig. 5a). With missing genotypes, we found that there were singlets with higher, \noverestimated ambient fractions above 0.5 (Supplementary Fig. 5b), likely originating from \nungenotyped samples assigned to an incorrect donor and fit with a high ambient fraction. \n \nAmbient contamination reduces power to detect differential abundance \n \nMany single-cell analyses involve detecting feature-level differences (such as gene expression \nor ATAC accessibility) between conditions21, including cell-type-specific marker identification \nand disease association22. We investigated the extent to which ambient contamination can \naffect differential abundance (DA) analysis of peaks or genes. 
To test this, we generated three \nsynthetic datasets with low (mean 10%), medium (mean 20%), and high (mean 30%) \nbackground contamination levels. Each dataset comprised eight pooled individuals, with four \ncontaining a disease cell subtype. We simulated various log fold-changes for DA features in the \ndisease cell-type, ranging from 0.1 to 2.0 for 1,000 genes and 1,000 peaks (Fig. 4a). DA of the \n1,000 features in each modality showed that ambient contamination decreased the power to \ndetect differential abundance (Fig. 4b). Specifically, in the RNA modality, we detected 334, 286, \nand 234 DA genes in the low, medium, and high ambient datasets, respectively (Fig. 4b). The \nATAC modality showed a similar trend, with 27, 20, and 10 DA peaks detected (Fig. 4b). As \nexpected, features with high fold-changes were most detectable, and estimated \nlog fold-changes correlated with the true log fold-changes for each dataset (Supplementary Fig. 6). \n \nAfter confirming that higher background levels led to decreased power to detect DA, we sought \nto understand how filtering out contaminated droplets could affect DA results. To investigate \nthis, we combined the three synthetic datasets above and tested DA after removing droplets \nabove various ambient fraction thresholds. We used a range of 0.1 to 0.5 for ambient fraction \nremoval. Filtering out the fewest droplets at a threshold of 0.5 resulted in the highest number of \nDA features, with 137 DA peaks and 501 DA genes in the ATAC and RNA modalities, respectively \n(Fig. 4c). In contrast, removing singlets with an ambient fraction above 0.1 resulted in a much \nlower number of significant features, with 3 DA peaks and 217 DA genes (Fig. 4c). 
These \nobservations are likely due to the fact that ambient contamination and healthy-disease states \nwere simulated independently of each other, such that contamination was not a confounder for \nDA testing. Overall, these results suggest that ambient contamination reduces power to detect \ndifferential feature abundance, and further filtering of droplets also reduces power. \n \nCharacterizing ambient contamination and filtering in a single-cell multiome dataset from \nvisceral adipose tissue. \n \nEncouraged by our results on synthetic data, we next investigated the performance of our \nmethod on observed data. We ran the 10x single-cell Multiome assay on seven pooled visceral \nadipose tissue (VAT) samples from the KOBS cohort23,24. After mapping with CellRanger Arc \nv2.0.225, ambimux identified 4,986 singlets and 512 doublets (9.3%). Several test droplets with \nlow coverage were classified as empty, and droplets with a low coverage in only one modality \ncould be assigned to a singlet with a higher coverage in the other modality (Fig. 5a). \n \nNext, we compared ambimux droplet calls with those from Demuxlet, Vireo, and SouporCell. \nAfter restricting analysis to 6,414 candidate droplets with at least 100 UMIs and fragments, \nambimux identified 4,824 singlets, while other methods classified a range of 3,857 to 5,334 \nsinglets (Fig. 5b). On average, 92.4% of droplets called as singlets in one method were called \nas singlets in another method (Fig. 5c). Ambimux singlets overlapped with at least 94.2% of \nsinglets called in the other approaches and showed slightly higher overlap with ATAC-based \ndemultiplexing (Fig. 
5c). Among the intersecting singlets, methods had high concordance (mean \n= 98.5%) in their sample assignments (Supplementary Fig. 7a). We then compared singlets that \nwere unique to ambimux when compared to other methods applied to either ATAC or RNA. The \n270 (ATAC) and 183 (RNA) droplets specific to ambimux tended to have higher contamination \nscores or lower coverage (Fig. 5d). On the other hand, singlets specific to the other methods \ntended to be called as doublets by ambimux (Supplementary Fig. 7b). \n \nWe considered that lysed or broken nuclei could release randomly fragmented DNA with loss of \nchromatin structure and thus be accessible to the Tn5 transposase. This could lead to \nextranuclear DNA preferentially originating from inaccessible regions that make up most of the \ngenome. We tested this in the VAT multiome dataset by using either fragments within peaks \n(intra-peak) or between peaks (inter-peak) for demultiplexing. In line with our hypothesis, the \nmean ATAC ambient fraction was higher from the inter-peak fragments (22.4%) compared to \nthose from the intra-peak fragments (6.3%) (paired Wilcoxon p < 2.2 x 10^-16) (Fig. 6a). While \nhigher than the intra-peak ambient fractions, the inter-peak proportion (22.4%) is lower than the \nfraction of the genome that these regions occupy (97.3%). Notably, the intra-peak contamination \nestimates were within the same range as the RNA contamination estimates (mean 7.4%) (Fig. \n6b). Due to these results, and the fact that downstream analyses rely on peak counts, we used \nonly intra-peak molecules for demultiplexing in this VAT dataset. \n \nNext, we investigated whether RNA and ATAC ambient fractions were correlated with common \nsummary statistics in visceral adipose tissue singlets. Notably, the ambient fractions between \nmodalities were weakly correlated (Fig. 
6c), with a Spearman coefficient of 0.08 (p = 4.9 x 10^-8), \nsuggesting that background contamination of RNA and ATAC molecules occurs independently. \nWe also found that lower coverage droplets tended to have a somewhat higher amount of \ncontamination in both modalities (Fig. 6d), with Spearman correlations of -0.11 and -0.13 for the \nATAC and RNA, respectively. This is consistent with a model of uniform ambient molecule \nabundance that would result in low coverage droplets containing a higher proportion of \nbackground. Lastly, we found that the percent of mitochondrial reads, a typical filtering and QC \nmetric8,18, showed only slightly positive correlations with ambient contamination, with ATAC and \nRNA Spearman correlations of 0.10 and 0.21, respectively (Supplementary Fig. 8a). \n \nFinally, we were interested in seeing how ambient fractions could affect downstream analysis. \nTo do so, we performed gene-to-peak link analysis on clean and contaminated visceral adipose \ntissue singlets. First, we clustered droplets and assigned them to 15 cell-types, including \nvascular, stromal, myeloid, and adipocyte cells (Supplementary Fig. 8b). We found that ambient \ncontamination varied across cell-types (Fig. 6e), although this was largely correlated with \ncell-type coverage (Supplementary Fig. 8c). We then split the nuclei into equally sized low ambient \n(< 5% in both RNA and ATAC) and high ambient (> 5% in both RNA and ATAC) droplets (N = \n167), controlling for read depth and cell-type (Supplementary Fig. 9). We found 877 and 508 \nsignificant links (FDR-adjusted p < 0.05) in the low and high groups respectively (Fig. 
6f). \nNotably, the associated peaks in the clean droplets tended to be closer to the transcription start \nsite (TSS) than those of the contaminated droplets (Fig. 6g). Overall, these results suggest that \nambient contamination can degrade the power to detect significant gene-peak links and lead to \npotentially more spurious associations. \n \nDiscussion \n \nWe developed ambimux for integrated demultiplexing of single-cell multiome experiments under \nambient contamination. We show that our method outperforms existing approaches in sensitivity \nand specificity. Using synthetic data, we were able to show how our approach correctly handles \nexperiments with high ambient contamination and identify points where other methods break down. As \npart of the model, ambimux also outputs modality-specific ambient fraction estimates per \ndroplet. Our results on simulations show that ambimux can accurately estimate droplet \ncontamination with mean absolute errors as low as 2.7% for higher coverage droplets. We also \nshow that ambimux is capable of demultiplexing a wide range of pooling numbers, from just 2 \ndonors to 64. Importantly, ambimux is robust to variation in donor proportions, \nsample dropouts, and missing genotypes. Finally, we applied ambimux to a real dataset of seven \npooled VAT samples and showed comparable results to competing methods. \n \nWhile highly optimized protocols and cell line samples can yield higher quality data, intact cell or \nnuclei isolation from tissue remains challenging in many cases26. 
Furthermore, generating cDNA \nsingle-cell libraries from fresh biopsies can present logistical challenges, particularly in scaling \nto larger sample sizes. Therefore, frozen tissues might be the only practical approach. However, \nfrozen tissues are particularly subject to lysis and can lead to increased ambient molecule \nconcentrations\n20,26. As single-cell RNA, ATAC, and multiome methods are applied to frozen \ntissues, ambient fraction estimates from genetic data can provide a high confidence method to \nassess the quality of the experiment, filter individual cells, and correct for potential confounding. \nWe have shown here that ambimux can accurately estimate these contamination fractions \nacross a wide range of experimental setups. \n \nWe hypothesize that ambimux will help enable population-scale single-cell multiome studies in \ntissues. By providing a robust method to demultiplex samples and estimate background \ncontamination, ambimux allows researchers to increase pooling and sample sizes for costly \nmultiome experiments. This approach can help to better identify subtle regulatory mechanisms \nacross populations\n21,27. Moreover, by providing grounded contamination estimates for use as a \ncovariate or filtering threshold, ambimux may help improve the accuracy of downstream \nanalyses, such as peak-to-gene links. Improved analyses will help advance our understanding \nof cell-type-specific gene regulation and offer new insights into developmental processes, \ndisease mechanisms, and tissue heterogeneity\n28,29. \n \nDespite its advantages, ambimux has some limitations. First, the method relies on genetic \nvariation between samples, which may limit its applicability in cases where samples are \ngenetically identical, such as cell lines, or when working with model organisms with limited \ngenetic diversity. 
Additionally, we observed a drop in ambient estimation accuracy in low \ncoverage droplets, as sparsity of informative reads can lead to high variance in parameter \ninference. Finally, ambimux was designed for multiplexed experiments in which all donors have \ngenotype data available. In practice, it may be challenging to obtain this for every donor. While \nwe observed that ambimux is robust to some missing genotypes, explicit modeling of this may \nbe preferable. \n \nLooking ahead, several promising avenues for further development of ambimux can be \nidentified. A key area is the extension of the model for ambient molecule correction. Ambimux \nalready models ambient contamination at the read level internally. By leveraging this modeling \napproach, functionality could be developed to correct and remove ambient molecules from read \ncounts, similar to the approach used by CellBender\n18. This would provide a more accurate \nrepresentation of true cellular content, potentially improving downstream analyses such as \ndifferential feature abundance. As alluded to earlier, an extension of a genotype-free or missing-\ngenotype framework would further expand the utility of ambimux in these cases. \n \nIn conclusion, ambimux represents a significant addition to the single-cell field. By offering \nmultimodal ambient contamination estimates, ambimux addresses common challenges faced \nwith large-scale multiome studies in tissues. 
Its ability to handle various experimental designs \nand contamination levels, coupled with its potential to facilitate larger-scale studies, positions \nambimux as a valuable tool for single-cell studies of gene regulation. \n  \nMethods \n \nModel Description \n \nThe foundation of our approach is to model the observed variant-overlapping base calls in reads \nusing a Bernoulli distribution. Our Bernoulli modeling approach is inspired by the work of Kang \net al\n12,30, and we extend this by incorporating parameters for empty droplets and ambient reads. \nWe take a likelihood-based approach and use posterior probabilities to assign donors to \ndroplets. Since the model is the same for both RNA and ATAC modalities, we use the term \n‘molecule’ to refer to either a deduplicated ATAC-seq fragment or a deduplicated RNA-seq \nunique molecular index (UMI). We first define the components of the probability model and then \ndescribe the estimation procedure. \n \nIn a multiome experiment, let D be the number of droplets containing at least one molecule. A \ndroplet d can have either 0, 1, or 2 nuclei encapsulated, denoted by the latent variable Hd. We \nassume that when multiple nuclei are present in a droplet, their combined genotype resembles \nthat of the ambient pool, so we do not consider cases with 3 or more nuclei. This assumption \nsimplifies the estimation process by avoiding the complexity of modeling higher order \ncombinations. \n \nAlthough the number of cells or nuclei encapsulated in a droplet roughly follows a Poisson \ndistribution, nuclei can adhere to each other, resulting in higher rates of doublets\n31. 
To account \nfor this, we model Hd using a categorical distribution parameterized by λ. \n \nLet N be the number of samples pooled in an experiment. The composition of donors in a \ndroplet, denoted by Sd, is conditional on the droplet type Hd. There are three possible droplet \ntypes: empty droplets containing no donors, singlets containing one donor, and doublets \ncontaining two donors. The experiment-wide probability of obtaining a singlet sample {i} or \ndoublet samples {i, j} is modeled using a Categorical distribution parameterized by πc, which is a \nvector representing the proportions of each donor in the cells/nuclei. \n \nFor a singlet, the probability of being assigned to donor i is given by πci. For a doublet, the \nprobability of being assigned to donors i and j is given by the product πciπcj, assuming that the \nprobability of each donor in a doublet is independent of the other donor. To improve efficiency, \nself-doublets are ignored, and only unique doublet possibilities are modeled. \n \nAssume a droplet d contains Md molecules. For a singlet or doublet, each molecule m can \noriginate from a sample si (i = 1, …, N) or from the ambient pool s0. We introduce the latent \nrandom variable Tdm to indicate the origin of each molecule, which can be either a single sample \nor the ambient pool. The probability of a molecule originating from the ambient pool is modeled \nas a Bernoulli distribution with parameter αdhs, where h and s denote the specific Hd (number of \ncells) and Sd (sample identity) for the droplet. If the droplet is empty (i.e., Hd = 0), then αdhs = 1. \nFor a doublet, we assume that each donor sample has an equal probability of contributing a \nmolecule. Note that we specify a droplet contamination parameter \nαdhs specific to each \ncombination of Hd and Sd. Thus, for a clean singlet, the parameter estimate will be close to 0 for \nits assigned donor but should be much higher for an incorrect donor assignment. \n \nWe model the probability of observing a base call at variant sites within a read. Let Bdmv \nrepresent the observed base call at variant site v in molecule m of droplet d. Base calls b can be \neither 0 (reference allele) or 1 (alternate allele). Our method focuses on bi-allelic SNPs, \ndisregarding other types of genetic variants. \n \nThe observed base depends on whether a sequencing error occurred, represented by the latent \nvariable Edmv. Sequencing errors follow a Bernoulli distribution with probability τdmv, derived from \nthe Phred quality score of the base call. \n \nIn the absence of an error (Edmv = 0), the observed base call follows a Bernoulli distribution \nparameterized by γiv, which is the alternate allele frequency of variant v for origin i (donor or \nambient pool). Donor allele frequencies are provided, while ambient allele frequencies are \nestimated from the data. In the event of a sequencing error, we assume equal probability for \nobserving any base. \n \nTaken together, the probability of a droplet is given by \n \np(X_d, Z_d; Θ) = p(H_d; λ) p(S_d | H_d; π_c) ∏_{m=1}^{M_d} p(T_{dm} | H_d, S_d; α_d) ∏_{v=1}^{V_{dm}} p(B_{dmv} | T_{dm}; γ, τ, π_a) \n \nwhere Xd gives the observed data, Zd gives the latent variables Hd, Sd, Td1, …, Tdm, and Θ gives \nthe parameters λ, πc, α, πa, γ, and τ. 
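As a rough illustration of the base-call part of this factorization, the sketch below computes the likelihood of one molecule after summing over its origin Tdm (source donor vs. ambient pool) and over the error indicator Edmv at each site. It is not the ambimux implementation. The function names are hypothetical, only a singlet with a single source donor is handled, and the value 0.5 for the chance of reading the alternate allele after a sequencing error is an assumption made here for a binary reference/alternate encoding.

```python
import math

# Illustrative sketch only; names and the err_alt default are assumptions.

def base_call_prob(b, gamma, tau, err_alt=0.5):
    # P(B = b) at one variant site: with probability (1 - tau) the call is
    # drawn from the alternate allele frequency gamma, and with probability
    # tau it is an error that shows the alternate allele with chance err_alt.
    p_alt = (1.0 - tau) * gamma + tau * err_alt
    return p_alt if b == 1 else 1.0 - p_alt

def molecule_likelihood(calls, gamma_src, gamma_amb, taus, alpha):
    # Marginal likelihood of one molecule: a mixture over its origin, with
    # weight alpha on the ambient pool and (1 - alpha) on the source donor.
    lik_src = 1.0
    lik_amb = 1.0
    for b, g_s, g_a, tau in zip(calls, gamma_src, gamma_amb, taus):
        lik_src *= base_call_prob(b, g_s, tau)
        lik_amb *= base_call_prob(b, g_a, tau)
    return (1.0 - alpha) * lik_src + alpha * lik_amb

def droplet_log_likelihood(molecules, alpha):
    # Sum of log marginal likelihoods over the molecules of one droplet,
    # for a fixed droplet type and donor assignment.
    return sum(math.log(molecule_likelihood(*mol, alpha)) for mol in molecules)

# One molecule covering one SNP: an alternate call, a homozygous-alternate
# donor (gamma = 1.0), ambient frequency 0.5, and error probability 0.01.
mol = ([1], [1.0], [0.5], [0.01])
print(molecule_likelihood(*mol, alpha=0.2))
```

Evaluating droplet_log_likelihood over a grid of alpha values for each candidate donor gives an intuition for why a clean singlet favors a small alpha under its true donor and a large alpha under an incorrect one.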
\n \nParameter estimation and model fitting \n \nThe purpose of our method is threefold: 1) identify empty, singlet, and doublet droplets, 2) \nassign droplets to donors, and 3) estimate ambient fractions in each droplet. We achieve this \nusing a combination of gradient-based methods and expectation maximization (EM) \n32. The \noutput of interest consists of the posterior probabilities of Hd and Sd, and the modality-specific \ndroplet ambient estimates αd. While all droplets are input to the model, we only estimate \nparameters for droplets with at least U = 100 molecules in either modality, treating all others as \nempty, consistent with previous works\n20,33. For single-modality data, the other modality is \nignored. \n \nAccurate demultiplexing and ambient estimation rely on the alternate allele frequencies of \nambient molecules. Rather than directly calculating empirical allele frequencies from ambient \nmolecules, we model these frequencies as a function of the donor fractions in the ambient pool \nand their allele frequencies. This approach regularizes estimates for variants with low coverage \nand reduces noise. We set the ambient \"genotypes\" as a weighted sum of donor genotypes: \nγ_{av} = ∑_i π_{ai} γ_{iv}, where γiv are the given donor genotypes and πai are the donor proportions in the \nambient pool. We estimate πa by maximizing the log likelihood of the data using gradient ascent, \nprojecting the updated parameter onto the simplex (ref. 34) to ensure that the proportions are non-negative and sum to \n1 (Supplementary Note 1). 
For efficiency, we calculate πa using only fixed empty droplets and \nkeep this estimate constant. \n \nTo estimate droplet ambient parameters αdhs for all test droplets across Hd and Sd, we use \nNewton-Raphson optimization for each droplet independently. We maximize the log likelihood \nafter marginalizing over Tdm, adding a prior to avoid collapsing the likelihood (Supplementary \nNote 1). \n \nWith updated values for αdhs, we estimate λ and πc by maximizing the expected log likelihood, \nwhere the expectation is taken with respect to the posterior of the latent variables. Each iteration \ninvolves calculating the expected log likelihood and maximizing it with respect to \nλ and πc. The \nderivations of the update equations are provided in Supplementary Note 1. \n \nPrior to estimation, we initialize λ0 as the proportion of fixed empty droplets, while λ1 and λ2 are \nset to 0.9 and 0.1 times the proportion of test droplets, respectively. The values of πc and πa are \ninitialized uniformly. \n \nThe complete estimation procedure can be summarized as follows: \n1. Initialize all parameters and estimate a fixed value for ambient sample fractions πa from the \nempty droplets. \n2. Iterate until convergence: \n   a) Estimate ambient fractions αdhs using Newton-Raphson. \n   b) Estimate λ and πc by maximizing the expected log likelihood. \n   c) Set the prior parameter β to the weighted average of αdhs across singlets. \n3. Terminate iterations when the mean absolute difference in the parameter estimates is less \nthan ε = 10^-6. \n \nSimulation of ambient-contaminated single-cell multiome experiments \n \nWhile methods exist for simulating single-cell experiments\n35, to our knowledge, none can \nexplicitly simulate multiplexed samples with genotypes and controlled ambient contamination. \nTo benchmark ambimux, we developed a simulation method called ambisim\n19. 
This single-cell \nmultiome simulator generates droplets from a pool of donors, sampling variant alleles during \nread generation. \n \nAmbisim generates multiome droplets from a set of barcodes with information on donor, cell \ntype, and read types (donor vs. ambient). It produces two sets of droplets: empty and cellular. \nFor empty droplets, the number of reads in both modalities is drawn from a negative binomial \ndistribution (mean = 2, size = 0.1). For cellular droplets, reads are similarly drawn but with mean \n= 5,000 and size = 1.5, constrained between 200 and 100,000 reads. \n \nIn cellular droplets, reads are divided into donor and ambient. The fraction of ambient reads is \nsampled from a beta distribution, with parameters varying by experiment: low ambient (shape \n2,18; mean = 0.1), medium (shape 4,16; mean = 0.2), and high (shape 6,14; mean = 0.3). When \nevaluating the accuracy of the ambient estimates, we pool the low, medium, and high datasets, \nunless otherwise stated. \n \nFor singlets and doublets, one or two cell types are sampled from a uniform categorical \ndistribution. Genes or peaks are then sampled based on a Multinomial distribution of the \nrandomly drawn cell type(s) profiles. \n \nIn the gene expression modality, genes are sampled from the cell type's expression profile. For \nambient reads, an average profile weighted by cell type frequency is used. An isoform is \nrandomly selected with uniform probabilities, then classified as spliced or unspliced (probability \n0.4 for cellular, 0.6 for ambient). 
This reflects the fact that nuclear RNA will be enriched for \nunspliced mRNA relative to cytoplasmic or total cell RNA. The location of the single-end read \nsequence is sampled uniformly within the cDNA sequence of the spliced or unspliced isoform. \n \nThe ATAC modality employs a similar procedure but accounts for inter-peak reads. Peaks are \nsampled using a Multinomial distribution with cell type accessibility probabilities. The ambient \npeak distribution combines cell type distributions weighted by their frequency. For inter-peak \nreads, its genomic location is sampled randomly from the genome. Whether a read falls inside \nor outside a peak depends on whether the droplet is empty. For an empty droplet, we sample a \nread coming from a peak using a Beta distribution with shape parameters (1, 9). For non-empty \ndroplets, we sample a read coming from a peak using a Beta distribution with shape parameters \n(4, 6). To generate the paired-end read, we sample an insert size from 0 to 150 uniformly. \n \nFor both ATAC-seq and RNA-seq, the read sequences are further modified as follows. First, all \nvariants overlapping the read alignment are selected. For each variant in a cellular read, an \nallele is sampled from a Bernoulli using the allele frequency of the read’s donor and the base at \nthe read site is replaced with this allele. The procedure is similar for an ambient read, where the \nambient allele frequency is used instead of the donor’s allele frequency. To obtain the final read \nsequence we sample a sequencing error with a probability of 0.01 for each base pair. Given an \nerror, the base is replaced randomly with any of the four nucleotides. \n \nFor all simulations, we set the cell types and gene and peak probability distributions from the \nsingle-cell multiome visceral adipose tissue data set (see below). 
Briefly, the top 3,000 barcodes \nranked by gene expression UMIs were clustered using Seurat v4.3.0 (ref. 8) with a resolution of 0.5 \nand otherwise default parameters. This resulted in nine cell types, and the proportion, gene \nexpression, and peak probabilities for each were calculated empirically. Donors and genotypes \nwere simulated from the 1000 Genomes Project\n36. We randomly selected unrelated European donors for \nall simulated data. For each dataset and demultiplexing run, we kept biallelic SNPs and \nremoved SNPs monomorphic in the pooled donors. This set of SNPs was used for both \ngenerating the fastq files and demultiplexing donors. \n \nAll simulated fastqs were aligned with CellRanger Arc v2.0.2 (ref. 25) using GRCh38 and GENCODE \nv41 (ref. 37), the same references from which the data were generated. Additionally, we provided the \npeaks BED file to the ‘--peaks’ option to ensure that all simulations generated the same peak \ncount data. \n \nDifferential abundance analysis of RNA and ATAC features \n \nTo evaluate the effect of ambient contamination on differential expression and differential \naccessibility, termed differential abundance (DA) here, we slightly modified the data generation \nto include a disease cell subtype. We took the adipocyte cell-type and modified the multinomial \nprobabilities. We generated log2 fold-changes ranging from -2 to 2 for 1,000 randomly selected \nfeatures with an expression probability above 1 x 10^-5. For four individuals we replaced the \nadipocyte cell-type with the disease cell-type. The three datasets were then generated as \ndescribed above with low, medium, and high ambient contamination. 
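The construction of the disease cell-type profile can be sketched as follows. This is a hedged illustration rather than the ambisim code: the function name is hypothetical, the uniform draw of log2 fold-changes on [-2, 2] is an assumption (the text gives only the range), and renormalizing the shifted profile so that it sums to 1 is likewise assumed.

```python
import random

# Illustrative sketch only; the uniform draw and the renormalization are
# assumptions, not details taken from the text.

def make_disease_profile(probs, n_features=1000, min_prob=1e-5,
                         max_abs_lfc=2.0, seed=0):
    rng = random.Random(seed)
    # Only features above the probability floor are eligible for a shift.
    eligible = [i for i, p in enumerate(probs) if p > min_prob]
    chosen = rng.sample(eligible, min(n_features, len(eligible)))
    # log2 fold-change per feature; zero means the feature is unchanged.
    lfc = [0.0] * len(probs)
    for i in chosen:
        lfc[i] = rng.uniform(-max_abs_lfc, max_abs_lfc)
    # Apply the fold-changes, then renormalize to a valid multinomial.
    shifted = [p * 2.0 ** f for p, f in zip(probs, lfc)]
    total = sum(shifted)
    return [s / total for s in shifted], lfc

# Toy healthy profile: 5,000 equally likely features.
healthy = [1.0 / 5000] * 5000
disease, lfc = make_disease_profile(healthy)
print(sum(1 for f in lfc if f != 0.0), sum(disease))
```

Because of the renormalization, the realized fold-change of each feature differs slightly from its drawn value, so an evaluation would compare detected DA features against the drawn set of features rather than against exact effect sizes.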
\n \nTo detect DA features, we first demultiplexed the aligned data with ambimux and extracted the \nassigned singlets. The counts from CellRanger Arc were preprocessed using Signac\n38, where \nthe RNA counts were normalized to sum to 1,000 and log transformed. Finally, we performed a \nstandard Wilcoxon test using the built-in “FindMarkers” function in Seurat\n8 between the healthy \nand disease cell-types. DA testing was performed only for the 1,000 disease features in each \nmodality. For evaluating the effect of filtering droplets by ambient contamination, we ran \nFindMarkers using singlets for which the estimated ambient fraction of that modality was below \nthe threshold. We defined significant DA features using a Bonferroni-corrected p-value threshold \nof 0.05. \n \nSingle nucleus multiome sequencing of visceral adipose tissue from seven participants \nin the KOBS cohort \n \nVisceral adipose tissue (VAT) biopsies were obtained from seven participants of the Kuopio \nOBesity Surgery Study (KOBS) cohort\n23,24 undergoing bariatric surgery. The participants of the \nKOBS cohort were recruited at the University of Eastern Finland and Kuopio University Hospital, \nKuopio, Finland. All participants provided written informed consent, and the KOBS study was \napproved by the Ethics Committee of the Northern Savo Hospital District, in accordance with the \nDeclaration of Helsinki. \n \nThe seven VAT samples were processed using the 10x Single Cell Multiome ATAC + Gene \nExpression kit on a 10x Chromium controller. Briefly, we pooled seven VAT biopsies in equal \nratios and isolated nuclei for droplet encapsulation, GEM formation, and cell barcoding. The \nfresh-frozen samples were first combined and then manually minced over dry ice. The tissue \nwas then lysed in 500 μL of buffer (10 mM Tris-HCl, 10 mM NaCl, 3 mM MgCl2, 0.1% Tween-20, \n0.1% IGEPAL CA-630, 0.01% Digitonin, 1% BSA, 1 mM DTT, and 1 U/μL RNase inhibitor) \nfor 15 minutes on ice. 
The lysate was then mixed with 500 μL of wash buffer (10 mM Tris-HCl, \n10 mM NaCl, 3 mM MgCl2, 1% BSA, 0.1% Tween-20, 1 mM DTT, and 1 U/μL RNase inhibitor) \nand filtered through a 70 μm FlowMi cell strainer. The nuclei were then centrifuged for 5 minutes \nat 500 x g at 4°C. This was followed by a resuspension in 1 mL wash buffer, filtering with a 40 \nμm FlowMi cell strainer, and another centrifugation at 500 x g for 5 minutes at 4°C. The \nresulting pellet was suspended in 30 μL of chilled nuclei buffer (1X Nuclei Buffer (10x \nGenomics), 1 mM DTT, and 1 U/μL RNase inhibitor). Concentration and quality were assessed \nusing a Countess II Automated Cell Counter with trypan blue and DAPI staining. Joint single \nnucleus RNA and ATAC libraries were constructed with Single Cell Multiome ATAC + Gene \nExpression Reagent Kit (10x Genomics). Concentration and quality of cDNA and libraries were \nassessed using an Agilent Bioanalyzer. Finally, libraries were sequenced on an Illumina \nNovaSeq X Plus for the RNA and an Illumina NextSeq 500 for the ATAC, targeting 400 million \nreads in each modality. Reads were aligned with CellRanger Arc v2.0.2 (ref. 25) using GRCh38 and \nGENCODE v41 (ref. 37). Peak calls and read counts from CellRanger were then used for downstream \nanalyses. \n \nGenotyping and imputation of KOBS participants \n \nWe genotyped the participants of the KOBS cohort using the Illumina Infinium Global Screening \nArray-24 v1. We used plink v1.9 (ref. 39) for basic filtering. Individuals with missingness > 2% were \nexcluded, and we verified reported sex. 
We then removed unstranded or strand-ambiguous \nSNPs, monomorphic SNPs, SNPs with missingness > 2%, and SNPs with a Hardy-Weinberg \nEquilibrium (HWE) p-value < 1 x 10^-6. The genotypes were then phased and imputed using the \nHRC reference panel r1.1 2016 (ref. 40) on the Michigan imputation server. SNPs with an allele \nmismatch to the reference were removed, and the remaining SNPs were phased with EAGLE v2.4 (ref. 41) and imputed with \nminimac4 (ref. 42). For demultiplexing, we removed monomorphic SNPs and kept biallelic genotyped and \nimputed SNPs with R^2 > 0.99, resulting in 3,995,059 SNPs. \n \nDemultiplexing of simulated and VAT pooled multiome data \n \nDemultiplexing of the simulated and VAT pooled multiome samples was performed in the same \nmanner as follows. For ambimux, we used all inter + intra peak reads for the simulations as \nambient fractions were kept constant for all genomic regions. We used intra peak reads for \ndemultiplexing the VAT data with ambimux unless specified otherwise. For Vireo v0.5.8 (ref. 15), we \nfirst ran a pileup with cellSNP (Cellsnp-lite v1.2.3; ref. 43) using the BAM and VCF files. We set the \nUMI specifier to “UB” for the RNA runs and “None” for the ATAC, and required a minimum read \ncount of 1 for variant pileup. We ran Vireo on the cellSNP pileup output using default \nparameters. For SouporCell v2.4 (ref. 16), we used ‘--skip_remap True’ and 200 restarts for both RNA \nand ATAC, ‘--no_umi True’ for the ATAC, and default settings otherwise. For Demuxlet v2, we \nran the popscle implementation (https://github.com/statgen/popscle), where the pileup was \ngenerated first for input to Demuxlet. We set the minimum base quality score to 19, set an \nempty UMI tag with the ‘--tag-UMI’ parameter for the ATAC only, and used default parameters \notherwise. Ambimux was run on all droplets without prior filtering, while Vireo, SouporCell, and \nDemuxlet were run on only the candidate set of droplets. These were defined as the known cell-\ncontaining barcodes for the simulations. 
For the VAT dataset, the candidates were defined as \nthe 6,414 droplets with at least 100 RNA UMIs and 100 ATAC fragments from the 10x \nCellRanger barcode summary output. Comparisons between methods were restricted to these \ncandidate droplets. \n \nWe called singlets and doublets and assigned donors based on the posterior probabilities. The \nposterior probabilities were calculated from the log probabilities for SouporCell and from the log \nlikelihood difference between singlet and doublet for Demuxlet. We used a posterior probability \nthreshold of 0.90 to classify droplets as singlet or doublet and classified them as ambiguous otherwise. \nFor ambimux, we also classified empty droplets using the same threshold. For donor \nassignment, cells identified as singlets were assigned to the donor with the highest likelihood \nscore. \n \nClustering and gene-peak link analysis of VAT pooled multiome data \n \nTo evaluate how ambient contamination affects the ability to detect gene-peak links, we \nclustered and separated the data by ambient fraction. First, we kept singlets identified by \nambimux with a posterior probability greater than 0.90. To ensure enough coverage, we kept \ndroplets with at least 200 RNA UMIs and 500 ATAC fragments. We processed and clustered the \nnuclei using Signac v1.10 (ref. 38), normalizing the RNA counts to sum to 1,000 and log transforming. \nFor clustering and cell-type identification, we selected the top 2,000 variable genes using the \nvariance-stabilizing transform, scaled the counts, and ran PCA. Then, we applied Leiden \nclustering using a resolution of 0.5 on the top 50 PCs. 
Cell-types were assigned manually based \non marker genes using ‘FindAllMarkers’. For visualization, we ran Uniform Manifold \nApproximation and Projection (UMAP) on the multimodal neighbor graph after processing the \nATAC data with TF-IDF and SVD on peaks with at least 5 counts. \n \nWe then separated droplets into low and high ambient groups ensuring equal coverage and cell-\ntype representation. To do so, we first separated droplets by low ambient contamination \n(ambient fraction estimate less than 0.05 in both modalities) and high ambient contamination \n(ambient fraction estimate greater than 0.05 in both modalities). Then we binned droplets based \non cell-type and coverage, using 10 equally-spaced bins based on log UMIs and log fragments. \nAs the high ambient group had fewer nuclei, we sampled equal numbers of droplets without \nreplacement from each group based on the bin distribution of the high ambient group. Finally, \nwe ran the LinkPeaks Signac function\n9,38 on peaks within 500,000 base pairs of the target gene \nTSS and genes with at least 1 count in at least 10% of nuclei. P-values were corrected for \nmultiple testing using FDR. \n \nPrecision and recall curves \n \nFor the simulated data, we generated precision and recall curves against singlet posterior \nprobability thresholds. Precision was defined as the number of called singlets with a correct donor \nassignment divided by the number of called singlets. Recall was defined as the number of called singlets \nwith a correct donor assignment divided by the total number of singlets in the data. \n \nEstimation of ambient fractions with CellBender \n \nWe compared our ambient fraction estimation results with those from CellBender v0.2.2 (ref. 18), a \nmethod to remove ambient reads from the count matrix alone. We used the three simulated \ndatasets of low, medium, and high contamination to have a ground truth metric for comparison. \nEvaluations were run for the RNA and ATAC modalities, although we note that CellBender was \nnot explicitly designed for ATAC data. We set ‘--expected-cells 10000’ and \n‘--total-droplets-included 11000’ to accurately reflect the number of droplets simulated and used default \nparameters otherwise. Ambient fraction estimates were then extracted from the \n‘background_fraction’ field of the h5 output for the singlets. \n \nData Availability \n \nThe visceral adipose tissue data (n=7) will be available under GEO accession number XXX \nupon acceptance. Due to privacy concerns, the genotype data (n=7) are available from the \ncorresponding authors upon reasonable request and the data sharing involves a standard data \nsharing agreement. \n \nCode Availability \n \nThe ambimux software is available for download and use at \nhttps://github.com/marcalva/ambimux. The ambisim software used to generate the simulated \ndata can be found at https://github.com/marcalva/ambisim. \n \nReferences \n \n1. Trevino, A. E. et al. Chromatin and gene-regulatory dynamics of the developing human \ncerebral cortex at single-cell resolution. Cell 184, 5053-5069.e23 (2021). \n2. Gur, C. et al. LGR5 expressing skin fibroblasts define a major cellular hub perturbed in \nscleroderma. 
Cell 185, 1373-1388.e20 (2022). \n3. Buenrostro, J. D. et al. Single-cell chromatin accessibility reveals principles of regulatory \nvariation. Nature 523, 486–490 (2015). \n4. Luo, C. et al. Single-cell methylomes identify neuronal subtypes and regulatory elements in \nmammalian cortex. Science 357, 600–604 (2017). \n5. Rotem, A. et al. Single-cell ChIP-seq reveals cell subpopulations defined by chromatin state. \nNat. Biotechnol. 33, 1165–1172 (2015). \n6. Flyamer, I. M. et al. Single-nucleus Hi-C reveals unique chromatin reorganization at oocyte-\nto-zygote transition. Nature 544, 110–114 (2017). \n7. Preissl, S., Gaulton, K. J. & Ren, B. Characterizing cis-regulatory elements using single-cell \nepigenomics. Nat. Rev. Genet. 24, 21–43 (2023). \n8. Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888-1902.e21 \n(2019). \n9. Ma, S. et al. Chromatin Potential Identified by Shared Single-Cell Profiling of RNA and \nChromatin. Cell 183, 1103-1116.e20 (2020). \n10. De Rop, F. V. et al. Hydrop enables droplet-based single-cell ATAC-seq and single-cell \nRNA-seq using dissolvable hydrogel beads. ELife 11, (2022). \n11. Stoeckius, M. et al. Cell Hashing with barcoded antibodies enables multiplexing and doublet \ndetection for single cell genomics. Genome Biol. 19, 224 (2018). \n12. Kang, H. M. et al. Multiplexed droplet single-cell RNA-sequencing using natural genetic \nvariation. Nat. Biotechnol. 36, 89–94 (2018). \n13. Mylka, V. et al. Comparative analysis of antibody- and lipid-based multiplexing methods for \nsingle-cell RNA-seq. Genome Biol. 23, 55 ( 2022). \n14. Schaefer, N. K., Pavlovic, B. J. & Pollen, A. A. CellBouncer, A Unified Toolkit for Single-Cell \nDemultiplexing and Ambient RNA Analysis, Reveals Hominid Mitochondrial \nIncompatibilities. Preprint at bioRxiv https://doi.org/10.1101/2025.03.23.644821 (2025). \n15. Huang, Y., McCarthy, D. J. & Stegle, O. 
Vireo: Bayesian demultiplexing of pooled single-cell \nRNA-seq data without genotype reference. Genome Biol. 20, 273 (2019). \n16. Heaton, H. et al. Souporcell: robust clustering of single-cell RNA-seq data by genotype \nwithout reference genotypes. Nat. Methods 17, 615–620 (2020). \n17. Young, M. D. & Behjati, S. SoupX removes ambient RNA contamination from droplet-based \nsingle-cell RNA sequencing data. Gigascience 9, (2020). \n18. Fleming, S. J. et al. Unsupervised removal of systematic background noise from droplet-\nbased single-cell experiments using CellBender. Nat. Methods 20, 1323–1335 (2023). \n19. Li, T. et al. The impact of ambient contamination on demultiplexing methods for single-\nnucleus multiome experiments. eLife https://doi.org/10.7554/elife.106769.1 (2025). \n20. Alvarez, M. et al. Enhancing droplet-based single-nucleus RNA-seq resolution using the \nsemi-supervised machine learning classifier DIEM. Sci. Rep. 10, 11019 (2020). \n21. Schmid, K. T. et al. scPower accelerates and optimizes the design of multi-sample single \ncell transcriptomic studies. Nat. Commun. 12, 6625 (2021). \n22. Anderson, A. G. et al. Single nucleus multiomics identifies ZEB1 and MAFB as candidate \nregulators of Alzheimer’s disease-specific cis-regulatory elements. Cell Genom. 3, 100263 \n(2023). \n23. Männistö, V. T. et al. Lipoprotein subclass metabolism in nonalcoholic steatohepatitis. J. \nLipid Res. 55, 2676–2684 (2014). \n24. Pihlajamäki, J. et al. Cholesterol absorption decreases after Roux-en-Y gastric bypass but \nnot after gastric banding. Metabolism 59, 866–872 (2010). \n25. Zheng, G. X. Y. et al. 
Massively parallel digital transcriptional profiling of single cells. Nat. \nCommun. 8, 14049 (2017). \n26. Slyper, M. et al. A single-cell and single-nucleus RNA-Seq toolbox for fresh and frozen \nhuman tumors. Nat. Med. 26, 792–802 (2020). \n27. Mandric, I. et al. Optimized design of single-cell RNA sequencing experiments for cell-type-\nspecific eQTL analysis. Nat. Commun. 11, 5504 (2020). \n28. Mitra, S. et al. Single-cell multi-ome regression models identify functional and disease-\nassociated enhancers and enable chromatin potential analysis. Nat. Genet. 56, 627–636 \n(2024). \n29. Wang, S. K. et al. Single-cell multiome of the human retina and deep learning nominate \ncausal variants in complex eye diseases. Cell Genom. 2, (2022). \n30. Jun, G. et al. Detecting and estimating contamination of human DNA samples in sequencing \nand array-based genotype data. Am. J. Hum. Genet. 91, 839–848 (2012). \n31. Habib, N. et al. Massively parallel single-nucleus RNA-seq with DroNc-seq. Nat. Methods \n14, 955–958 (2017). \n32. Dempster, A. P., Laird, N. M. & Rubin, D. B. Maximum likelihood from incomplete data via \nthe EM Algorithm. J. R. Stat. Soc. 39, 1–22 (1977). \n33. Lun, A. T. L. et al. EmptyDrops: distinguishing cells from empty droplets in droplet-based \nsingle-cell RNA sequencing data. Genome Biol. 20, 63 (2019). \n34. Chen, Y. & Ye, X. Projection Onto A Simplex. Preprint at arXiv \nhttps://arxiv.org/abs/1101.6081 (2011). \n35. Cao, Y., Yang, P. & Yang, J. Y. H. A benchmark study of simulation methods for single-cell \nRNA sequencing data. Nat. Commun. 12, 6911 (2021). \n36. The 1000 Genomes Project Consortium et al. A global reference for human genetic \nvariation. Nature 526, 68–74 (2015). \n37. Harrow, J. et al. GENCODE: the reference human genome annotation for The ENCODE \nProject. Genome Res. 22, 1760–1774 (2012). \n38. Stuart, T., Srivastava, A., Madad, S., Lareau, C. A. & Satija, R. Single-cell chromatin state \nanalysis with Signac. Nat. 
Methods 18, 1333–1341 (2021). \n39. Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger and richer \ndatasets. Gigascience 4, 7 (2015). \n40. McCarthy, S. et al. A reference panel of 64,976 haplotypes for genotype imputation. Nat. \nGenet. 48, 1279–1283 (2016). \n41. Loh, P.-R. et al. Reference-based phasing using the Haplotype Reference Consortium \npanel. Nat. Genet. 48, 1443–1448 (2016). \n42. Das, S. et al. Next-generation genotype imputation service and methods. Nat. Genet. 48, \n1284–1287 (2016). \n43. Huang, X. & Huang, Y. Cellsnp-lite: an efficient tool for genotyping single cells. \nBioinformatics 37, 4569–4571 (2021). \n \nAcknowledgements \n \nWe thank the participants of the KOBS cohort who took part in this study. We acknowledge \nthe Single Cell Genomics Core and Biocenter Finland for infrastructure support. \n \nFunding \n \nThis study was supported by NIH grants R01HL170604 (PP), R01DK132775 (PP), HG012079 \n(NZ) and R01MH125252 (NZ), and the Academy of Finland (333021, 335973, MUK). This \nresearch was partly supported by the European Research Council (ERC) under the European \nUnion’s Horizon 2020 research and innovation program (Grant 802825 to MUK). 
MUK was \nsupported by the Sigrid Juselius Foundation, the Finnish Foundation for Cardiovascular Research, \nand the European Union (ERC, SECRET, 101125115). Views and opinions expressed are \nhowever those of the author(s) only and do not necessarily reflect those of the European Union \nor the European Research Council. Neither the European Union nor the granting authority can \nbe held responsible for them. TÖ was supported by the Research Council of Finland, \nCompetitive Funding to Strengthen University Research Profiles, 7th Call, profiling measure \nTransMed, funding decision number 352968. The Kuopio Obesity Surgery Study was supported by \nthe Kuopio University Hospital Project grants (EVO/VTR grants 2005–2024) and the Academy of \nFinland grant (Contract no. 138006). \n \nContributions \n \nMA, EH, NZ, and PP conceived the project. MA designed the approach with contributions from \nTL, ER, ZC, OA, EH, CL, NZ, and PP. MA wrote the software. UTA, IS, TÖ, DK, VM, JP, MUK, \nand PP contributed to the generation of the visceral adipose tissue data. TL, STL, and AK \nperformed data analyses and interpretation of the visceral adipose tissue data. DK, VM, JP, \nMUK, and PP collected cohort materials and data. MA, NZ, and PP wrote and contributed to the \nfinal manuscript. All authors read and approved the final manuscript. \n \nCorresponding authors \n \nCorrespondence to Marcus Alvarez (marcus.alvarez@ucsf.edu), Noah Zaitlen \n(nzaitlen@g.ucla.edu), and Päivi Pajukanta (ppajukanta@mednet.ucla.edu). \n \nEthics approval and consent to participate \n \nThe KOBS study was approved by the Ethics Committee of the Northern Savo Hospital District. \nThe study adhered to the principles outlined in the Declaration of Helsinki. All participants \nprovided written informed consent. 
\n \nCompeting interests \n \nThe authors declare no competing interests. \n \nFigure 1: Ambimux recovers more contaminated singlets in a simulated multiome \nexperiment. a, Distribution of the proportion of ambient reads in the low (top), medium (middle), \nand high (bottom) simulated single-cell multiome datasets. The vertical red lines indicate the \nmeans (0.1, 0.2, and 0.3). b, c, Recall (sensitivity) of singlet assignments in the three simulated \ndatasets from ambimux, Demuxlet (ref. 12), SouporCell (ref. 16), and Vireo (ref. 15). Ambimux was run on the \ncombined modalities, while all other methods were applied to ATAC (b) and RNA (c) separately. \nThe dashed vertical line shows the 90% posterior probability threshold. d, Relationship between \ndroplet ambient fraction and classification accuracy across methods and modalities. The \nstacked bars show the proportion of singlets (SNG) classified as accurate (assigned to the \ncorrect donor), inaccurate (incorrect donor), non-singlets (assigned as doublet or empty), and \nambiguous. 
\n \nFigure 2: Accurate estimation of ambient proportions in simulated multiome datasets. a, \nMean absolute error (MAE) of ATAC and RNA ambient proportion estimates by ambimux after \ncombining results from the three simulated datasets of low (mean 0.1), medium (mean 0.2), and \nhigh (mean 0.3) contamination. b, Correlation between true and estimated ambient fractions for \nthe ATAC (left) and RNA (right) modalities in the combined results. c, MAE of ambient fraction \nestimates grouped by coverage, demonstrating improved estimation accuracy with increasing \ncoverage. d, Comparison between ambimux and competing methods in singlet (SNG) \nclassification when run on either ATAC (left) or RNA (right) data separately. Results are \ngrouped by ambient dataset. The bar plots show the proportion of ground truth singlets \nassigned as correct singlets, incorrect singlets, incorrect doublets/empty, and ambiguous. \n \nFigure 3: Ambimux is robust to variations in pooling numbers and sample dropout. a, \nRecall (left) and precision (right) curves for singlet assignment when pooling 2, 4, 8, 16, 32, and \n64 simulated genetically distinct donors from 1000 Genomes (ref. 36). The x-axis shows the singlet \nposterior probability threshold while colors indicate pooling number. b, The mean absolute error \n(MAE) for ATAC (top) and RNA (bottom) of the ambimux ambient fraction estimates based on \nthe number of pooled samples (y-axis). 
c, Comparison of ground truth and estimated donor \nproportions in the background (top) and nuclei (bottom) pool in a simulated pool of 16 donors. \nThe data were generated such that the proportion of the 16 donors in the background pool was \nindependent of that of the singlets. d, Recall (left) and precision (right) curves for ambimux \ndemultiplexing with simulated donor experimental dropout (adding genotype samples) and \nmissing genotypes (removing genotype samples). Ambimux was run on the same simulated \npool of eight donors while varying the genotype samples used for demultiplexing, either removing \ndonors (negative numbers) or adding donor genotypes (positive numbers). \n \nFigure 4: Ambient contamination reduces power to detect differentially abundant peaks \nor genes. a, UMAP visualization of three simulated datasets with ambient fraction means of \n10% (left), 20% (middle) and 30% (right). Each dataset was simulated from the same nine cell \ntypes. We artificially added a disease condition for four of the eight donors, where we introduced \nvarying log fold-changes for 1,000 peaks and 1,000 genes in one cell type. b, Number of \nsignificant differentially abundant (DA) features (Bonferroni-adjusted p < 0.05) in the disease cell type \nbetween conditions for each of the three datasets. 
This shows that ambient contamination \ndecreases power to detect significant DA features. c, Number of significant DA features \n(Bonferroni-adjusted p < 0.05) as in (b), but after filtering out droplets by various ambient fraction \nthresholds in the combined data. The three ambient datasets were merged, and DA feature \ntesting was performed after filtering by the ambient fraction threshold indicated on the x-axis. \n \nFigure 5: Ambimux demultiplexing of seven visceral adipose tissue samples. a, Droplet \nclassification by ambimux run on multiome data generated from seven pooled visceral adipose \ntissue samples. Low-coverage droplets tended to be classified as empty, while higher-count \ndroplets tended to be classified as doublets. b, Classification of 6,414 candidate droplets across \nmethods. Ambimux was run on RNA+ATAC, while Vireo, Demuxlet, and SouporCell were run \non either ATAC or RNA. Only ambimux can classify droplets as empty. c, \nConcordance of singlet classification between methods. The overlap is defined as the number of \ndroplets classified as singlets by both methods divided by the smaller of the two methods’ \nsinglet counts. d, Scatterplot showing singlet agreement between ambimux and the \nthree methods from above. Singlets are plotted against estimated ambient fraction and read \ncoverage for ATAC (left) and RNA (right) and colored by whether the singlet was called by \nambimux only (blue), called by both ambimux and at least one other method (red), or called by \nat least one other method only (green). 
\n \nFigure 6: Estimation of ambient fractions in pooled multiome of seven visceral adipose \ntissue donors. a, Distribution of ambient fraction estimates from inter- vs. intra-peak reads. \nInter-peak read ambient fraction is estimated from reads originating from outside peaks, while \nintra-peak read ambient fraction is estimated from reads inside peaks. b, c, Relationship \nbetween ATAC and RNA ambient fraction estimates in singlets. The density plot shows a similar \ndistribution of ambient fractions for both modalities (b), while the scatterplot shows a low \ncorrelation between the two (c). d, Correlation between read coverage and ambimux ambient \nfraction estimates in singlets for ATAC (left) and RNA (right). e, Distribution of ambimux ambient \nfraction estimates per cell type in the ATAC and RNA modalities. f, g, Number of significant \ngene-peak links (FDR-corrected p < 0.05) in low- and high-ambient droplets (f), and the \ndistance between the peak and transcription start site (TSS) for these gene-peak links (g). Low- \nand high-ambient droplets were defined as those with less than 5% or more than 5% estimated \nambient fraction in both modalities, respectively. ","source_license":"CC-BY-4.0","license_restricted":false}