Tissue-specific enhancer-gene maps from multimodal single-cell data identify causal disease alleles.

OA: closed
Full text 72,486 characters · extracted from pmc-nxml · 4 sections · click to expand

Methods

This study complies with all relevant ethical regulations as outlined and approved by the institutional review board of Mass General Brigham (protocol approval number 2021P001846) and the institutional review board of the Hospital for Special Surgery (#2014–233). Written informed consent was obtained from all study participants. Synovial tissues from patients with rheumatoid arthritis and osteoarthritis were collected from synovectomy or arthroplasty procedures followed by cryopreservation 110 . For rheumatoid arthritis samples, we examined histologic sections of synovial tissue and selected samples with inflammatory features. Sex and age of participants can be found in Supplementary Table 8 . Next, cryopreserved synovial tissue fragments were dissociated by a mechanical and enzymatic digestion 110 , followed by flow sorting to enrich for live synovial cells. For each tissue sample, the viable cells were isolated and lysed to extract and load approximately 10,000 nuclei according to manufacturer protocol (10x Genomics). Joint single-cell (sc)RNA-seq and scATAC–seq libraries were prepared using the 10x Genomics Single Cell Multiome ATAC + Gene Expression kit according to manufacturer’s instructions. Libraries were sequenced with paired-end reads on an Illumina Novaseq to a target depth of 30,000 read pairs per nucleus both for messenger RNA and for ATAC libraries. scRNA-seq fastq files were inputted into the Cell Ranger ARC pipeline (version 2.0.0) from 10x Genomics to generate barcoded count matrix of gene expression. For the scATAC–seq fastq files, we used Cell Ranger ARC to process barcodes and to map the reads to the hg38 genome by BWA-MEM with default parameters. To deduplicate reads from polymerase chain reaction amplification bias within a cell while keeping reads originating from the same positions but from different cells, we used in-house scripts. More detailed information on data and QC steps is described in ref. 58 . In addition to our arthritis-tissue multimodal dataset, we downloaded all publicly available multimodal RNA-seq/ATAC–seq datasets from adult human tissues ( n dataset = 9, as of April 2022). We processed the downloaded matrices of RNA-seq and ATAC–seq data and fragment files if available by using Signac (version 1.9.0), without re-aligning the original reads to reference genome due to lack of availability of raw sequence data. We applied QC to both the nuclear RNA data and the ATAC data based on RNA counts, ATAC fragments, nucleosome signal and TSS enrichment. We defined the QC threshold based on the distribution of these metrics in each of the datasets as described in Supplementary Table 9 . We only kept cells that had passed QC in both RNA-seq and ATAC–seq. Then to identify open chromatin regions (peaks), we used macs2 (version 2.2.7.1) to call open chromatin peaks using post-QC ATAC–seq data. We thus obtained count matrices of gene expression and ATAC peaks with corresponding cell barcodes. When cell barcodes after the QC from original publications were available or when fragment files are unavailable, we used the downloaded post-QC matrices for downstream analyses. Gene expression counts were normalized using the Normalize Data function (Seurat 111 version 4.3.0), scaled using the ScaleData function (Seurat), and batch corrected using Harmony 112 (version 0.1.1). We visualized the cells in two low-dimensional embeddings with uniform manifold approximation and projection (UMAP) by using 20 batch-corrected principal components from these normalized gene expression matrices ( Fig. 1c ). When original cell labels are provided by the authors, we used those labels to obtain broad cell type categories. When they are not available, we performed reference-query mapping by Seurat and PBMC reference object to define broad cell type labels. ATAC peak matrix was binarized to have 1 if a count is >0 and 0 otherwise. We defined cis -peaks as any peaks whose center is within the window ±500 kb from a given gene body. We modeled the association between peak’s binarized accessibility and the target gene’s expression with Poisson distribution: (1) E i ~ Poisson ( λ i ) log ( λ i ) = β 0 + β peak X peak + β % mito X % mito + β n UMI X n UMI + β batch X batch where E i is the observed expression count of i th gene, and λ i is the expected count under Poisson distribution. β peak indicates the effect of chromatin accessibility of a peak on i th gene expression. β %mito , β n UMI and β batch each represents the effect of covariates, percentage of mitochondrial reads per cell as a measure of cell quality, the number of unique molecular identifiers (UMIs) in the cell, and the batch, respectively. To empirically assess error and significance of β peak for each peak–gene combination, we used bootstrapping procedures. We resampled cells with replacement in each bootstrapping procedure and re-estimated β ′ peak within those resampled cells. We repeated this procedure N times, where we adaptively increased N (that is, the total number of bootstrapping) from at least 100 and up to 50,000, depending on the significance of β peak in each chunk of bootstrapping trials to reduce the computational burden. After N times of bootstrapping, we assessed the distribution of N β ′ peak s against null hypothesis ( β ′ peak = 0) to derive the significance of β peak (that is, two-sided bootstrapping-based P value for this peak–gene combination by counting the instances where the statistics are equal or more extreme than the null hypothesis of β ′ peak = 0; Extended Data Fig. 2 ). To avoid spurious associations from rare ATAC peak and rare gene expression, we quality controlled the cis -peak–gene pairs we test so that both peak and gene should have been expressed in at least 5% of the cells we analyze. We finally defined a set of significant peak–gene pairs for each cell type based on bootstrapping-based P values and FDR correction for multiple testing (Benjamini–Hochberg correction). When we tested the calibration of statistics from SCENT or other regression strategies ( Extended Data Fig. 1 ), we used null dataset where we randomly permuted cell labels in the ATAC–seq and ran the regression model we tested. We also analyzed single-cell multimodal datasets with ArchR 49 (version 1.0.2) and Signac 46 , 50 (version 1.9.0), which both have a function to define peak–gene linkages. In brief, ArchR takes multimodal data and creates low-overlapping aggregates of single cells based on k -nearest neighbor graph. Then it correlates peak accessibility with gene expression by Pearson correlation of aggregated and log 2 -normalized peak count and gene count. Signac computes the Pearson or Spearman correlation coefficient r (corSparse function in R) for each gene and for each peak within 500 kb of the gene TSS. Signac then compares the observed correlation coefficient with an expected correlation coefficient for each peak given the guanine–cytosine content, accessibility and length of the peak. Signac defines P value for each gene–peak links from the z score based on this comparison. We ran both methods on arthritis-tissue dataset with default parameters. We output statistics for all peak–gene pairs we tested without any cutoff for correlation r or P values. To obtain significant peak–gene linkages in ArchR and Signac that are comparable with those defined by SCENT with FDR <10%, we used the FDR value in the output from the peak2gene function of ArchR software and selected any linkages with FDR <10% as a significant set of linkages. Because Signac does not directly output FDR values, we computed FDR using P values in the output from Signac LinkPeaks function (either method = ‘pearson’ or method = ‘spearman’) by Benjamini–Hochberg correction and used this FDR to define a significant set of linkages with FDR <10%. We thus defined significant peak–gene linkages as those with FDR <10% and used varying correlation r to assess the precision and recall in the causal variant enrichment analysis (see later sections in Methods ). Because ArchR requires fragment files from ATAC–seq to run the basic functions, we ran ArchR and Signac for eight multimodal datasets in which fragment files are available. Since we have the same immune-related cell types across different multimodal datasets, we evaluated the concordance of enhancer–gene map in a discovery dataset (arthritis-tissue dataset) when compared with other replication datasets including immune-related cell types (Public PBMC, NeurIPS, SHARE-seq and NEAT-seq datasets). To this end, we used the most stringent FDR threshold for defining an enhancer–gene map in arthritis-tissue dataset that had the largest number of significant peak–gene linkages (FDR <1%). We then used more lenient threshold for defining an enhancer–gene map in replication datasets (FDR <10%), which is a similar strategy used in assessing replication in GWAS. For each cell type and for each replication dataset, we took the intersection of enhancer-gene links defined as significant in both datasets. We assessed the directional concordance (that is, concordance of the sign of β peak ) and the Pearson’s correlation r of β peak between the discovery and the replication for these peak–gene pairs. We then performed the same analysis for enhancer-gene map from ArchR and Signac software. To compare the evolutional conservation across species between our annotated peaks and the other peaks, we used phastCons 66 score. We downloaded the phastCons score for multiple alignments of 99 vertebrate genomes from ref. 113 . We lifted them over to GRCh38 by LiftOver software (version 2016). We used SCENT results for arthritis-tissue, Public PBMC and NeurIPS for conservation score analysis as representative datasets with the largest numbers of cells. Because each gene should have variable functional importance and conservation, we assessed each gene separately. For each gene, we took (1) an annotation of interest for the gene and (2) all cis -noncoding regions (<500 kb from a gene), and computed the mean phastCons score of each of two sets of the peaks. As annotations to be tested, we used (a) exonic regions of the gene, (b) SCENT peaks for the gene and (c) all ATAC peaks in cis -regions from the gene (<500 kb). Then, we took the difference between two mean differences (ΔphastCons score), and computed the mean differences across all the genes (mean ΔphastCons score) as follows: mean Δ phastCons score= 1 n gene ∑ gene ( phastCons ¯ g , in_annot − phastCons ¯ g , noncoding ) . By bootstrapping the genes, we calculated the 95% CI of the mean ΔphastCons score. If this metric is positive, that indicates that the annotated regions are more conserved than noncoding regions. We also calculated similar ΔphastCons score by comparing the SCENT peaks with TSS-distance-matched non-SCENT peaks in each dataset. mean Δ phastCons score= = 1 n gene ∑ gene ( phastCons ¯ g , peak_in_SCENT − phastCons ¯ g , peak_non_SCENT_matched ) By bootstrapping the genes, we again calculated the 95% CI of the mean ΔphastCons score. If this metric is positive, that indicates that SCENT peaks are more conserved than TSS-distance-matched non-SCENT peaks. To assess the effect of TSS distance when comparing SCENT peaks with non-SCENT peaks, we matched each one of the SCENT peak–gene pairs to one non-SCENT peak–gene pair, where the peak had the most similar TSS distance to the same gene among all the ATAC peaks in cis in each of the dataset. We confirmed that the resulting TSS-distance-matched non-SCENT peak–gene pairs demonstrated the similar distributions of TSS distance when compared with the SCENT peak–gene pairs ( Supplementary Fig. 3b ). We sought to investigate the relationship between the number of significant SCENT peaks for each gene and the gene’s evolutionary constraint. We used pLI (the probability of being loss-of-function intolerant) and LOEUF as metrics for the gene’s loss-of-function intolerance within human population. We downloaded both pLI and LOEUF scores from gnomAD browser 114 . We inverse-normal transformed the raw number of significant SCENT peaks for each gene, since the raw number of significant SCENT peaks for each gene is skewed toward zero ( Extended Data Fig. 4a ). We performed linear regression between the normalized number of significant SCENT peaks and pLI or LOEUF score with accounting for gene length, which could be potential confounding factor for pLI and LOEUF 69 , 70 . To validate our SCENT enhancer–gene links, we used published CRISPR-Flow FISH experiments as potential ground-truth positive enhancer element-gene links and negative enhancer element–gene links. We downloaded the experimental results from the Supplementary Table 5 of original publication 39 conducted in multiple cell lines and cell types (K562, Jurkat, THP1, GM12878, BJAB, NCCIT, hepatocytes and LNCAP). We used ‘Perturbation Target’ as candidate enhancer elements. We defined 283 positive enhancer element–gene links when they are ‘TRUE’ for ‘Regulated’ column (that is, the element–gene pair is significant and the effect size is negative) and 5,472 negative enhancer element–gene links when they are ‘FALSE’ for ‘Regulated’ column. We lifted them over to GRCh38 and obtained final sets of 278 positive links and 5,470 negative links. We used the SCENT enhancer–gene maps from eight multimodal datasets, excluding SHARE-seq dataset in which we were unable to perform statistical enrichment test due to low overlap with the designed target element–gene pairs in the CRISPR-Flow FISH experiments. For each dataset, we used ‘bedtools (version 2.26.0) intersect’ to categorize SCENT peak–gene links and non-SCENT ATAC peak–gene pairs into either CRISPR-positive or CRISPR-negative groups, based on whether these peaks overlapped with positive or negative CRISPR-Flow FISH links for the same gene ( Supplementary Table 3 ). We finally performed two-sided Fisher’s exact test to assess the enrichment of CRISPR-positive links within SCENT peak–gene links in each dataset. To assess if the SCENT enhancer–gene linkage is more likely within the contact map of active enhancers and target genes constructed from H3K27ac HiChIP experiment, we downloaded high-confidence contact loops by Hi-C combined with enhancer activity marked by H3K27ac level in naive T cells, Th17 T cells and regulatory T cells from Supplementary Table 2 of ref. 68 . We lifted the genomic coordinates to GRCh38 in each cell type, and annotated the promoter regions with gene names to be used as putative target genes if they fall within 1 kb from gene’s TSS. We took the union of the enhancer–promoter contacts across these three cell subtypes as experimentally validated T cell linkage. We analyzed six datasets that included T cells (that is, arthritis-tissue, public PBMC, NeurIPS, Dogma-seq (stimulated and control) and NEAT-seq datasets) to test whether the T-cell-specific enhancer–gene linkage from SCENT was enriched for the H3K27ac HiChIP enhancer–gene links. We additionally combined the significant linkages across all six datasets to create union SCENT peak–gene linkages in T cells. For each dataset and the combined linkage, we used ‘bedtools intersect’ to categorize SCENT peak–gene links and non-SCENT ATAC peak–gene pairs based on whether these peaks overlapped with H3K27ac HiChIP contact loops for the same gene. We performed two-sided Fisher’s exact test to assess the enrichment of SCENT peak–gene links within H3K27ac HiChIP contact loops in each dataset and the combined dataset. For cell types with more than 5,000 cells across datasets, we concatenated SCENT peak–gene linkages across all the datasets to create cell-type-specific SCENT tracks. We collected a set of SCENT peak–gene linkages for the same cell type and used ‘bedtools merge’ function (for each gene) to obtain a union of SCENT peaks for each gene. Similarly, we created aggregated SCENT tracks across all the cell types and all datasets. We collected all sets of SCENT peak–gene linkages and used ‘bedtools merge’ function (for each gene) to obtain a union of SCENT peaks for each gene across all the cell types and all datasets. We defined a causal enrichment for eQTL within SCENT enhancers and other annotations by using statistically fine-mapped variant–gene combinations from GTEx. We used publicly available statistics analyzed by CAVIAR software 20 , and selected variants with PIP >0.2 as putatively causal (fine-mapped) variants for primary analyses. For the primary enrichment analysis, we aggregated fine-mapped variants from all the 49 tissues. For cell-type-specific SCENT enrichment analysis ( Extended Data Fig. 6j ), we used fine-mapped variants from each tissue separately. We intersected these putatively causal variants with our annotations. We then retained any variants which the linking method (SCENT, ArchR, Signac and Cicero) connected to the same gene as GTEx phenotype gene. Enrichment gene _ i = # causal_var_in_annot gene_ i / ∑ common_var_in_annot gene_ i # causal_var gene_ i / ∑ common_var_in_cis gene_ i Overall Enrichment = 1 n ∑ i = 1 n Enrichment gene − i For each gene i (expression phenotype), we divided the number of putatively causal variants within an annotation normalized by the number of common variants within an annotation by the number of all causal variants for gene i normalized by the number of all common variants within cis -region from for gene i . To calculate common variants within annotation or within locus, we used 1,000 Genomes Project genotype. We selected any variants with minor allele frequency >1% in European population as a set of common variants to be intersected with each annotation. To derive score, we took the Overall Enrichment mean across all the genes. To have further insights into precision and recall and compare against ArchR peak2gene and Signac LinkPeaks functions, we varied the threshold for defining a set of significant peak–gene linkages in each software (that is, FDR in SCENT {0.50, 0.30, 0.20, 0.10, 0.05, 0.02}, Peason’s correlation r {any, 0, 0.1, 0.3, 0.5, 0.7} in ArchR, and correlation score {any, 0, 0.05, 0.1, 0.15} in Signac). We used the same myeloid cells in the arthritis-tissue dataset and a set of eQTL fine-mapped variants in GTEx blood tissue for this benchmark across all three methods. We then used each set of peak–gene linkages to recalculate causal variant enrichment Overall Enrichment score ( Fig. 3b ). We also assessed the impact of PIP threshold in defining a set of statistically fine-mapped variants on the causal variant enrichment analysis. To do so, we redefined the set of putative causal variants with more stringent PIP thresholds (PIP >0.5 and PIP >0.7), and re-computed the calculate causal variant enrichment Overall Enrichment score. We sought to evaluate the effect of regulatory elements’ proximity to TSS on enrichment for eQTL causal variants. We created annotated genome regions based on TSS distance to a given gene (for example, within 1 kb, from 1 kb to 10 kb, and so on). We concatenated these elements across all 23,715 genes to create genome-wide annotations reflecting TSS proximity ( Supplementary Fig. 5a ). We used the eQTL fine-mapped variants with PIP >0.2 from GTEx for all tissues as putative causal variants. We computed Overall Enrichment described above as the degree of causal variant enrichment within these regulatory elements in each TSS distance bin. As other benchmarking for assessing the effect of the components of SCENT on the causal variant enrichment, we also created peak–gene linkage using the Poisson regression but without nonparametric bootstrapping for the same dataset of myeloid cells in the arthritis-tissue dataset. We used the nominal P values for the term X peak from the Poisson regression (equation (1)) to perform FDR correction to obtain significant peak–gene pairs using the Poisson regression alone. We then used the FDR thresholds {0.30, 0.20, 0.10, 0.05, 0.02, 0.01} for assessing the recall–precision tradeoffs as described in the previous section. We used GWAS fine-mapping results in FinnGen release 6 (ref. 79 ) upon registration and publicly available GWAS fine-mapping results in UK Biobank 80 , 115 . For FinnGen traits, we downloaded all the fine-mapping results by SuSIE software 22 and systematically selected any traits with case count >1,000. We then selected noncoding fine-mapped loci that did not include any nonsynonymous or splicing variants with PIP >0.5. We thus analyzed 1,046 traits and 5,753 loci in total after QC. For UK Biobank, we analyzed the fine-mapping results by SuSIE software for all 94 traits including binary and quantitative traits. Since the genomic coordinates for the UK Biobank fine-mapping results were hg19, we lifted them over to GRCh38 by using LiftOver software. We again selected noncoding fine-mapped loci that did not include any nonsynonymous or splicing variants with PIP >0.5. We thus analyzed 7,274 loci in total after QC. We analyzed three additional autoimmune GWAS fine-mapping results for rheumatoid arthritis 26 , type 1 diabetes 90 and inflammatory bowel disease 29 , given our special interest in immune-mediated traits. We similarly selected noncoding fine-mapped loci that did not include any nonsynonymous or splicing variants with PIP >0.5, and lifted the results over to GRCh38 by using LiftOver software. We defined 117 loci for rheumatoid arthritis, 77 loci for type 1 diabetes and 86 loci for inflammatory bowel disease. We defined causal variant enrichment statistics for GWAS within SCENT enhancers and other annotations by using statistically fine-mapped variants from FinnGen 79 and UK Biobank 80 that we described in the previous section. We selected variants with PIP >0.2 as putatively causal variants for primary analyses. Enrichment trait _ i = # causal_var_in_annot trait_ i / ∑ common_var_in_annot trait_ i # causal_var trait_ i / ∑ common_var_across_loci trait_ i Overall Enrichment = 1 n ∑ i = 1 n Enrichment trait _ i For each trait i , we divided the number of putatively causal variants within an annotation (across all loci for trait i ) normalized by the number of common variants within an annotation by the number of all causal variants for trait i normalized by the number of all common variants within all significant loci analyzed for the trait i . To calculate common variants within annotation or within locus, we again used 1000 Genomes Project variants with minor allele frequency >1% in European population. To derive Overall Enrichment score, we took the mean across all the traits. For each trait i and putative causal gene pair, we calculated the distance between the TSS of the gene and the most likely causal variant which had the largest PIP when multiple variants were nominated for a single gene by SCENT ( Supplementary Fig. 6a ). For each putative causal gene for the trait i , we also sorted all the genes on the basis of the distance between the gene’s TSS and the most likely causal variant (from the smallest to the largest). We then obtained the rank of the putative causal gene from SCENT among the sorted gene list to see how often the SCENT gene is the closest gene from the most likely causal variant. We downloaded per-group EpiMap enhancer–gene links from ref. 116 . We lifted the genomic coordinates to GRCh38 by using LiftOver software. When we assessed aggregated EpiMap enhancer–gene links across all the 31 tissue groups, we used ‘bedtools merge’ function for each gene to create a union of all enhancer–gene links ( Fig. 3c , d ). For tissue-specific enrichment analyses, we analyzed the 31 group-specific tracks separately ( Extended Data Fig. 7a , b ). To benchmark the precision and recall, we used EpiMap correlation scores to define variable sets of enhancer–gene links from EpiMap based on the threshold of EpiMap correlation score. We downloaded ABC predictions in 131 cell types and tissues from ref. 117 . We lifted the genomic coordinates to GRCh38 by using LiftOver software. When we assessed aggregated ABC enhancer–gene links across all the groups, we used ‘bedtools merge’ function for each gene to create a union of all enhancer-gene links across 131 cell types ( Fig. 3c , d ). For cell-type-specific analyses, we aggregated cell lines or cell types to be corresponding with our cell types and analyzed each of these tracks separately (B cell, T cell, myeloid cells and fibroblasts; Extended Data Fig. 7a , b ). To benchmark the precision and recall, we used ABC scores to define variable sets of enhancer–gene links from ABC model based on the threshold of ABC score. To assess precision and recall and compare against bulk-tissue based methods (that is, EpiMap and ABC model), we used sets of significant peak–gene linkages in each method with varying thresholds (that is, FDR in SCENT {0.5, 0.3, 0.2, 0.1, 0.05, 0.02}, EpiMap correlation score {0, 0.4, 0.8, 0.9} in EpiMap, and ABC score {0, 0.05, 0.1, 0.2} for ABC model). We then used each set of peak–gene linkages to recalculate causal variant enrichment for GWAS ( Fig. 3d ). We also assessed the impact of PIP threshold in defining a set of statistically fine-mapped variants on the causal variant enrichment analysis. To do so, we redefined the set of putative causal variants with more stringent PIP thresholds (PIP >0.5 and PIP >0.7), and recomputed the calculate causal variant enrichment Overall Enrichment score. To compare the capability of identifying putative causal variants for GWAS between SCENT and previous single-cell-based enhancer–gene maps by ArchR or Signac ( Extended Data Fig. 8c , d ), we created an integrated enhancer–gene maps across available datasets for these methods. To this end, we used eight multimodal datasets (excluding NeurIPS dataset) in which fragment files for ATAC–seq in addition to count matrices of RNA-seq and ATAC–seq are available to run ArchR. For each of these methods (SCENT, ArchR and Signac), we took the union of significant enhancer–gene linkages with FDR <10% from eight datasets by using ‘bedtools merge’ function. We then calculated causal variant enrichment statistics for GWAS within SCENT, ArchR and Signac enhancers by using statistically fine-mapped variants from FinnGen and UK Biobank (PIP >0.2) as we described above. As part of the AMP consortium, we generated an independent arthritis-tissue dataset with single-cell unimodal ATAC–seq data with genotype ( n = 17; one sample without genotype data was excluded) 58 to define caQTLs. We used two methods, binomial test for allele-specific chromatin accessibility and RASQUAL. Briefly, we genotyped donors in the AMP Network for rheumatoid arthritis and systemic lupus erythematosus including 17 donors in this study by using Illumina Multi-Ethnic Genotyping Array across three batches. We performed quality control of genotype by sample call rate >0.99, variant call rate >0.99, minor allele frequency >0.01, and P HWE > 1.0 × 10 −6 . We performed haplotype phasing with SHAPEIT2 software (2.727) 118 and performed whole-genome imputation by using minimac3 software (version 2.0.1) 119 with a reference panel of 1000 Genomes Project phase 3 (ref. 120 ). After imputation, we selected variants with imputation R 2 > 0.7 as post-imputation QC. We next created a merged bam file of ATAC–seq for each donor and each cell type by aggregating all the reads. Using the imputed genotype for each donor and aggregated bam files for each donor and cell type, we applied WASP 121 to correct any bias in read mapping toward reference alleles to accurately quantify allelic imbalance. We thus created a bias-corrected bam files for each donor and cell type. For binomial tests for allele-specific chromatin accessibility, we ran ASEReadCounter module in GATK software (version 4.1.9.0) 122 using the bias-corrected bam files as input to quantify allelic imbalance in heterozygous sites with read count >4 within ATAC peak counts. We first performed one-sided binomial tests in each donor, and meta-analyzed the statistics across donors by Fisher’s method if multiple donors shared the same heterozygous site. For RASQUAL, we created a VCF file containing both genotype dosage and allelic imbalance from ASEReadCounter. We quantified the read coverage for each peak and for each donor by ‘bedtools coverage’ function. We created a peak by donor matrix with read coverage. We QCed samples with log(total mapped fragments) fewer than mean − 2 × standard deviation across samples in each cell type. We QCed peaks so that at least two individuals have any fragments for the peak. We then ran RASQUAL software with the inter-individual differences in ATAC peak counts (in the peak by donor matrix) and intra-individual allelic imbalance (in the VCF), with accounting for chromatin accessibility PCs (the first N components whose explained variances are greater than those from permutation result), 3 genotype PCs, sample site and sex as covariates. RASQUAL output chi-squared statistics and P values. We computed FDR from these raw P values by Benjamini & Hochberg correction on local multiple test burden (that is, the number of cis -SNPs in the region). To correct for genome-wide multiple testing, we ran the RASQUAL with random permutation, in which the relationship between sample labels and the count matrix was broken. Thus, we derived q values for each candidate caQTL. We finally intersected these peaks with significant caQTL effect in each significance threshold with SCENT peaks and assessed causal variants enrichment within these peaks for GWAS as explained in the previous sections. In the example SMAD3 locus, we visualized the allele-specific effect by creating an aggregated bam files for each donor and using Integrative Genomics Viewer ( Extended Data Fig. 9c ). We also visualized the inter-individual effect by taking the residuals after regressing out the covariates from the logged count per million of the ATAC read coverage for the peak in each donor ( Extended Data Fig. 9d ). We downloaded the latest clinically reported variant list registered at ClinVar from ref. 123 . We then screened the variants to exclude (1) exonic variants and (2) variants categorized as ‘benign’. We defined the ClinVar variant density as the number of the noncoding and nonbenign variants within each annotation × 1,000 divided by the total length (bp) of each annotation. We used a list of somatic noncoding mutation hotspots for leukemia in Supplementary Table 13 of the original publication 105 . We lifted the genomic coordinates to GRCh38 by using LiftOver software. We then intersected the noncoding somatic mutation hotspots with our cell-type-specific SCENT peaks in immune cells (that is, T cells, B cells and myeloid cells). We compared the intersected elements’ target genes by SCENT with the ‘Annotate_Gene’ column from the original publication. To evaluate the effect of cell numbers on the statistical power in detecting significant SCENT enhancer–gene linkages, we performed downsampling experiments in fibroblast (the most abundant cell type in arthritis-tissue dataset, n cell = 9,905). We randomly samples cells ( n cell = 500, 1000, 2,500, 5,000 and 7,500). We then applied SCENT to each of the subset groups of cells and defined significant peak–gene links with FDR <10%. We counted the number of significant peak–gene links in each of the subset groups of cells, and annotated peaks based on the distance to the TSS to the target gene. No sample data were excluded from the single-cell multimodal analyses. Neither randomization, sample size predetermination nor blinding of investigators was applicable to this study.

Results

SCENT accurately identifies significant association between chromatin accessibility in regulatory regions and expression of individual genes across single cells ( Fig. 1a ). For each peak–gene pair, we tested association between binarized chromatin accessibility in an ATAC peak with RNA-seq gene counts in cis (<500 kb from gene body, Methods). We tested each cell type separately to capture cell-type-specific gene regulation and to avoid spurious peak–gene associations due to gene co-regulation across cell types. Those associations can be used for prioritizing (1) likely causal variants in regulatory regions associated with gene expression, (2) likely causal genes if they are associated with the identified regulatory region and (3) the critical cell types based on cell type the association is identified in. Since both RNA-seq and ATAC–seq data are generally sparse 50 , 52 – 55 , we used Poisson regression 53 , 56 : E i ~ Poisson ( λ i ) log ( λ i ) = β 0 + β peak X peak + β % mito X % mito + β n UMI X n UMI + β batch X batch , where E i is the observed expression count of i th gene, and λ i is the expected count under Poisson distribution. β peak is the effect of chromatin accessibility of a peak ( X peak ) on expression of the i th gene; its magnitude reflects the strength of the regulatory effect and its sign indicates enhancing versus silencing effect. We accounted for donor or batch effects ( X batch ) and cell-level technical factors such as percentage of mitochondrial reads ( X %mito ). However, Poisson regression might be suboptimal for highly expressed and dispersed genes ( Fig. 1b and Extended Data Fig. 1a ). Unsurprisingly, we observed uncontrolled type I error with Poisson regression in null dataset where we permuted cell barcodes to disrupt ATAC and RNA associations ( Extended Data Fig. 1b , c ). Moreover, commonly used single-cell analytical models (for example, negative binomial regression and linear regression) demonstrated inflated statistics ( Extended Data Fig. 1d , e and Methods ). To accurately estimate the error and significance of β peak , we implemented a two-tailed non-parametric bootstrapping framework 57 ( Methods and Extended Data Fig. 2 ) by resampling cells and deriving the empirical significance of β peak . Bootstrapping resulted in calibrated statistics with appropriate type I error ( Extended Data Fig. 1f ). Therefore we used this model to define statistically significant peak–gene associations. We considered and implemented two alternative models ( Extended Data Fig. 1g , h , Supplementary Fig. 1 and Supplementary Note 1 ). We obtained nine single-cell multimodal datasets from diverse human tissues representing 13 cell types (immune-related, hematopoietic, neuronal and pituitary). Since we are interested in autoimmune diseases, we generated an inflammatory tissue dataset from synovial tissues from 11 patients with rheumatoid arthritis and 1 patient with osteoarthritis (arthritis-tissue dataset; n donor = 12, n cells = 30,893) 58 . In addition, we obtained eight public datasets with 129,672 cells 46 , 59 – 63 ( Fig. 1c ). We analyzed 16,621 genes and 1,193,842 open chromatin peaks in cis after quality control (QC) (4,753,521 peak–gene pairs, 28 median peaks per gene; Supplementary Fig. 2 and Supplementary Table 1 ). In each dataset, we clustered and annotated cell types. Applying SCENT to each of the cell types with n cells > 500, we constructed 23 enhancer–gene maps with a total of 87,648 peak–gene links (false discovery rate (FDR) <10%, Fig. 2a and Extended Data Fig. 3 ). Genes had variable number of associated peaks (from 0 to 97, mean 4.13, Extended Data Fig. 4a ). After accounting for the number of cells, power to detect peak–gene links was associated with the number of ATAC–seq fragments ( P = 0.045) and unique RNA molecules ( P = 0.030; Extended Data Fig. 4b , c ). To assess replicability of SCENT peak–gene links, we compared the effects from the arthritis-tissue dataset (discovery) with those from other datasets in the same cell type (replication) in B cells, T/natural killer (NK) cells and myeloid cells ( Supplementary Table 2a ). Despite different tissue contexts, we observed high concordance in estimated effect of chromatin accessibility on gene expression for peak–gene pairs significant in both datasets (FDR <10%; mean Pearson’s r = 0.63 of effect sizes, 99% mean directional concordance across all the datasets: Extended Data Fig. 4d ). For comparison, we tested ArchR 55 or Signac 46 , 50 , two popular linear parametric single-cell multimodal methods. In contrast, we noted lower concordance (mean Pearson’s r = 0.19, 57% mean directional concordance in ArchR and r = 0.38, 99% mean directional concordance in Signac; Supplementary Table 2b , c ). SCENT detects enhancer–gene links more reproducibly than previous parametric methods. To assess if SCENT peaks were functional, we examined if (1) they co-localized with conventional cis -regulatory annotation, (2) their effect on expression was greater for closer peak–gene pairs, (3) they had high sequence conservation and (4) peak–gene connections had experimental support. First, we tested the overlap of SCENT peaks with an ENCODE cCRE 64 , a cis -regulatory annotation devised from bulk epigenomic datasets. We observed that 98.0% of SCENT peaks overlapped with cCRE on average, compared to 23.3% of random size-matched cis -regions and 89.0% of non-SCENT peaks ( Extended Data Fig. 4e ). We also annotated SCENT peaks in immune cell types with 18-state chromHMM results; 97.4% of the SCENT peaks overlapped with promoter or enhancer annotations in aggregate of 41 immune-related samples 37 . Second, we examined the strength of enhancer-gene links, hypothesizing that stronger links would be more proximal to the transcription start site (TSS) of target genes. The regression coefficient β peak (the effect size of peak accessibility on gene expression) became larger and more positive as the SCENT peaks got closer to the TSS ( Fig. 2b , c ), consistent with previous observations 55 , 65 . Third, we assessed whether SCENT peaks had larger phastCons score 66 , reflecting higher sequence conversation across species 67 ; evolutionary conserved regulatory regions are functionally active and enriched for complex trait heritability 67 . As expected, exonic regions were much more evolutionary conserved than all noncoding cis -region (mean ΔphastCons score 0.38, paired t -test P < 10 −323 ; Fig. 2d , purple). SCENT regulatory regions were also conserved relative to noncoding cis -regions (mean ΔphastCons score 0.13, paired t -test P = 4.2 × 10 −42 in arthritis-tissue dataset; Fig. 2d , teal). In contrast, the ΔphastCons score between all cis -ATAC peaks and all noncoding cis -region was more modest (mean ΔphastCons score 0.092, paired t -test P = 8.7 × 10 −27 in arthritis-tissue dataset; Fig. 2d , yellow). We tested the effect of promoters on the observed higher conservation in SCENT peaks. We confirmed that the ΔphastCons score for SCENT is still larger than for all cis -regulatory ATAC–seq peaks, even after excluding promoters. We note that the difference was consistent across multiple datasets with overlapping confidence intervals (CIs) suggesting non-statistically significant differences ( Extended Data Fig. 4f ). More generally, to test the effect of SCENT peaks’ proximity to TSS on the higher conservation ( Supplementary Fig. 3a ), we matched each of the SCENT peak–gene pairs to one non-SCENT peak–gene pair matching on TSS distance ( Supplementary Fig. 3b ). SCENT peaks had significantly higher conservation scores than distance-matched non-SCENT peaks (mean ΔphastCons score 0.034, P = 4.7 × 10 −4 in arthritis-tissue dataset; Extended Data Fig. 4g and Methods ), indicating the functional importance of SCENT regulatory regions not solely driven by TSS proximity. Finally, we tested whether the target genes from SCENT were enriched for experimentally confirmed enhancer–gene links. First, we used CRISPR-Flow FISH results 39 that included 278 positive and 5,470 negative enhancer–gene connections. The SCENT peaks from tissues relevant to the experiment were significantly enriched for positive connections relative to non-SCENT peaks (for example, Fisher’s exact odds ratio 4.5×, P = 1.8 × 10 −9 in arthritis-tissue dataset; Methods, Fig. 2e and Supplementary Table 3 ). Second, we used H3K27ac HiChIP data in naive T cells, Th17 T cells and regulatory T cells up to 1-kb resolution 68 . Across our six T cell datasets, the SCENT peak–gene links were 1.6-fold enriched within H3K27ac HiChIP enhancer–gene loops relative to non-SCENT peaks (Fisher’s exact test P = 2.3 × 10 −54 , Fig. 2f and Methods ). We anticipate that the genes with the largest number of SCENT peaks are likely to be the most constraint and least tolerant to loss of function mutations. These genes included FOSB ( n = 97), JUNB ( n = 95) and RUNX1 ( n = 77), highly conserved transcription factors. We assessed mutational constraint based on the absence of deleterious variants within human populations, including the probability of being loss-of-function intolerant (pLI) 69 and the loss-of-function observed/expected upper bound fraction (LOEUF) 70 . The normalized number of SCENT peaks per gene is strongly associated with mean constraint score for the gene ( β = 0.37, P = 4.9 × 10 −90 for pLI where higher score indicates more constraint, and β = −0.35, P = −0.35 × 10 −106 for LOEUF where lower score indicates more constraint; Extended Data Fig. 5a , b , respectively). Our results are consistent with prior reports that genes with many regulatory regions from bulk-epigenomic data are enriched for loss-of-function intolerant genes 71 . We examined whether the SCENT peaks harbor statistically fine-mapped putative causal variants for expression quantitative loci (eQTL). We used eQTL from GTEx across 49 tissues 72 and defined putative causal variants as those with posterior inclusion probability (PIP) >0.2. Unsurprisingly, all accessible regions defined by ATAC–seq in cis -regions were modestly enriched in fine-mapped variants by 2.7× (yellow, Fig. 3a ). Strikingly, SCENT peaks were more enriched in fine-mapped variants by 9.6× averaged across all datasets (teal, Fig. 3a ). More stringent PIP threshold cutoffs yielded stronger enrichments ( Supplementary Fig. 4 ). Since many SCENT peaks are close to TSS regions, we considered if the enrichment was driven by TSS proximity ( Supplementary Fig. 3a ). The TSS distance is one of the most important features for causal eQTL variants 73 – 78 . We confirmed the relationship between proximity to TSS and causal variant enrichment in GTEx ( Supplementary Fig. 5a , b ). We therefore tested if the higher eQTL causal variant enrichment remained (1) after excluding promoter peaks from SCENT peak–gene linkages or (2) after matching the TSS distance. Excluding promoters, SCENT still consistently had higher enrichment in all analyzed datasets than all cis -regulatory ATAC–seq peaks ( Extended Data Fig. 6a ). We observed higher enrichment in SCENT peaks compared to TSS-distance-matched non-SCENT peaks while the differences now became insignificant ( Extended Data Fig. 6b ). This suggests that SCENT has additional information in identifying functional cis -regulatory regions beyond TSS distance. We next compared eQTL variant enrichment in SCENT peaks to peaks identified by two published linear parametric methods using single-cell multimodal data, ArchR 55 and Signac 46 , 50 . Statistically significant peak–gene links defined by the threshold of FDR <10% in ArchR and Signac without filtering with correlation r had lower causal variant enrichment (4.9× and 19.7×, respectively) compared to SCENT peaks (74.1×) with the same FDR threshold ( Extended Data Fig. 6c ). Given the large differences in the number of peak–gene links with FDR <10% among methods (4,330 in SCENT, 817,000 in ArchR and 9,840 in Signac on average), we were concerned that performance differences may reflect recall differences. By varying the thresholds of correlation r in ArchR and Signac ( Methods ), we called peak–gene pairs with different levels of stringency and tested causal variant enrichment (that is, recall–precision tradeoff; Fig. 3b and Extended Data Fig. 6c ). SCENT peaks demonstrated higher causal variant enrichment than ArchR and Signac peaks across different recall values. Additional benchmarking with existing methods including Cicero 51 is described in Supplementary Note 2 ( Extended Data Fig. 6d – i ). We assessed whether the Poisson regression or the bootstrapping in SCENT was driving the improved performance over linear parametric methods. We observed lower causal variant enrichment in peaks identified with Poisson regression alone compared to SCENT (14.4× versus 74.1× at FDR <10%, respectively; Extended Data Fig. 6h ). This underscored the importance of accounting for variable gene count distribution by nonparametric bootstrapping. SCENT can detect cis -regulatory regions in a cell-type-specific manner. We created cell-type-specific enhancer–gene maps in four major cell types with >5,000 cells across datasets; for each cell type we took the union of SCENT enhancers across datasets. The cell-type-specific SCENT enhancers were most enriched in eQTL variants within relevant samples in GTEx (for example, B cell SCENT peaks and eQTLs in Epstein–Barr-virus-transformed lymphocytes; Extended Data Fig. 6j ). These results showcased SCENT’s prioritization of causal eQTL variants in a cell-type-specific manner with higher precision than the previous single-cell methods. SCENT can be used to build disease-specific enhancer–gene maps by applying it to multimodal data from disease-relevant tissues. We examined whether SCENT peaks can be used to prioritize disease causal variants. We obtained candidate causal with PIP >0.2 (ref. 28 ) from GWASs in two large-scale biobanks, including 1,046 disease traits from FinnGen 79 and 35 binary and 59 quantitative traits from UK Biobank 80 . The aggregated SCENT enhancers were strikingly enriched in causal GWAS variants in FinnGen (31.6× on average; Fig. 3c ) and UK Biobank (73.2× on average; Fig. 3d ) while cell-type-specific SCENT tracks had variable enrichment ( Extended Data Fig. 7a , b ). This enrichment was again much larger than all cis -ATAC peaks (12.8× in FinnGen and 38.8× in UK Biobank). The enrichment in SCENT peaks remained higher than all peaks even after removing promoter regions or conditioning on TSS distance; in some datasets, the difference was not significant with overlapping CIs ( Extended Data Fig. 7c , d ). The target genes of the likely causal variants for autoimmune diseases identified by SCENT peaks in immune cell types had higher fraction (10.8%) of know genes implicated in Mendelian disorders of immune dysregulation ( n gene = 550) 81 , 82 than SCENT peaks in fibroblasts (3.8%; Extended Data Fig. 7e ). We compared SCENT to alternative genome annotations and enhancer–gene maps from bulk tissues. Causal variant enrichment in FinnGen and UK Biobank was higher in SCENT (31.6× and 72.3×, respectively) than the conventional bulk-based annotations such as ENCODE cCREs (13.9× and 46.5×), ABC (16.3× and 53.3×) and EpiMap (12.9× and 40.6×) ( Fig. 3c , d and Extended Data Fig. 7a , b ). We again assessed the tradeoff between recall and enrichment (precision). We constructed SCENT from 9 datasets and 23 cell types with only 28 samples, substantially less than the 833 samples and tissues used to construct EpiMap and 131 samples and cell lines for the ABC model. Despite the smaller dataset, SCENT peaks demonstrated higher enrichment of GWAS variants at a given number of identified peak–gene linkages than ABC model and EpiMap ( Extended Data Fig. 8a ). More stringent PIP threshold increased the enrichment ( Extended Data Fig. 8b ). The target genes for autoimmune disease by SCENT in immune-related cell types had higher fraction (10.8%) of known Mendelian genes of immune dysregulation 81 , 82 than EpiMap (8.6%) and ABC model (4.4%) ( Extended Data Fig. 7e ). GWAS variants were also more enriched in SCENT enhancers than ArchR and Signac ( Extended Data Fig. 8c , d ). These results demonstrate the benefit of accurately modeling association between chromatin accessibility and gene expression at the single-cell resolution. We hypothesized that putative causal variants identified by SCENT modulate chromatin accessibility (for example, transcription factor binding affinity). If so, the intersection of the SCENT enhancers and chromatin accessibility quantitative trait loci (caQTL) may be further enriched for GWAS causal variants 83 – 86 . To test this, we used single-cell ATAC–seq samples with genotype data 58 ( n donor = 17; Methods) and performed caQTL mapping by leveraging allele-specific chromatin accessibility (binomial test followed by meta-analysis across donors) or by combining allele-specific with inter-individual differences (RASQUAL 87 ). We observed higher enrichment within intersected regions with SCENT and caQTL than those with SCENT alone. The enrichment increased as we used more stringent threshold for caQTL peaks, reaching as high as 333-fold ( Fig. 3e ). As an illustrative example, an asthma GWAS locus at 15q22.33 included a SCENT enhancer within an intron of SMAD3 gene that harbored a putative causal variant rs17293632 (PIP 0.34; Extended Data Fig. 9a ). This SCENT enhancer ( Extended Data Fig. 9b ) had a significant caQTL effect, from both (1) allele-specific effect (meta-analyzed binomial-test P = 2.7 × 10 −4 ; Extended Data Fig. 9c ) and (2) inter-individual differences in chromatin accessibility ( P = 6.0 × 10 −5 ; Extended Data Fig. 9d ). The alternative allele T reduced the chromatin accessibility and was reported to disrupt a conserved AP-1 consensus site 30 . The allele T also decreased SMAD3 expression ( β = −0.0687, P = 3.3 × 10 −13 from eQTL catalog 88 ). SMAD3 , the target gene identified by SCENT, is involved in TGF-β signaling, which remodels airways in asthma 89 . Together, SCENT demonstrated the potential to further enrich causal variants by integrating caQTLs. Finally, we sought to use SCENT to define disease causal mechanisms. We analyzed the fine-mapped variants from GWAS (FinnGen, UK Biobank and GWAS cohorts of rheumatoid arthritis 26 , inflammatory bowel disease 29 and type 1 diabetes 90 ). SCENT linked 4,124 putative causal variants (PIP >0.1) to their potential target genes across 1,143 traits ( Supplementary Table 4 ). These target genes were mostly close to the causal variant, with 20% of them being the closest gene to the causal variant ( Supplementary Fig. 6a , b ). However, 30.6% of SCENT-linked genes were more than 300 kb away from the causal variants. We first focused on autoimmune loci, since our SCENT tracks were largely derived from immune cell types. We prioritized a single fine-mapped variant rs72928038 (PIP >0.3) within the T-cell-specific SCENT enhancer at a locus in 6q15 for multiple autoimmune diseases (rheumatoid arthritis, type 1 diabetes, atopic dermatitis and hypothyroidism; Fig. 4a ). This enhancer was linked to BACH2 , the closest gene to this fine-mapped variant. Base-editing the protective allele to the risk allele in T cells has confirmed the effect of this variant on BACH2 expression 91 . Moreover, rs72928038 -deleted naive CD8 T cells were more prone to differentiate into effector T cells in mice 91 . A 4p15.2 locus for rheumatoid arthritis and type 1 diabetes harbored 21 candidate variants, each with low PIPs (<0.14). SCENT prioritized a single variant rs35944082 in T cells and fibroblasts only within the arthritis-tissue dataset from inflamed synovial tissue ( Fig. 4b ). SCENT linked this variant to RBPJ , which was the third closest gene located 235 kb away. This variant-gene link was supported by promoter-capture Hi-C data in hematopoietic cells 92 and by H3K27ac HiChIP data in T cells 68 . The RBPJ transcription factor is critical for NOTCH signaling, which has been implicated in rheumatoid arthritis tissue inflammation through functional studies 93 , 94 . Rbpj knockdown in mice resulted in abnormal T cell differentiation and disrupted regulatory T cell phenotype 95 , 96 , consistent with a plausible role in autoimmune diseases. Intriguingly, we did not observe this enhancer–gene link in T cells from peripheral blood mononuclear cells (PBMC), blood nor in EpiMap. ABC map prioritized another variant, rs7441808 , at this locus and linked it nonspecifically to 16 genes including RBPJ , making it difficult to define the causal gene. The Hi-C and H3K27ac HiChIP data nominated the gene RBPJ , but due to the limited resolution of the contact maps, they could not prioritize a single causal variant. As an example of the power of SCENT to build enhancer–gene maps from disease-critical tissues, we examined single-cell data from pituitary 63 . We assessed a 11p14.1 locus for multiple gynecological traits (endometriosis, menorrhagia, ovarian cyst and age at menopause). Our map connected rs11031006 to FSHB ( Fig. 4c ), which is specifically expressed in the pituitary 72 , 97 and enables ovarian folliculogenesis 98 . Rare FSHB variants cause autosomal recessive hypogonadotropic hypogonadism 99 . Multimodal data from other tissues and bulk-based methods were unable to prioritize this variant since they missed pituitary, the most disease-relevant tissue. These results underscored the importance of creating enhancer–gene links using causal cell types, in the instances where links exist only in disease-relevant tissues. Having established SCENT’s utility in defining causal variants and genes in complex diseases, we examined rare noncoding variants causing Mendelian diseases. Currently, causal mutations can be identified in only ~30–40% of patients with Mendelian diseases 100 – 102 . Consequently, many variants in cases are annotated as variants of uncertain significance. The variants of uncertain significance annotation is especially challenging for noncoding variants. We examined the overlap of clinically reported nonbenign noncoding variants by ClinVar 103 (400,300 variants in total) within SCENT enhancers. The SCENT enhancers harbored 2.0× ClinVar variants on average than the ATAC regions with the same genomic length ( Supplementary Fig. 7 ). This density of ClinVar variants was 3.2× and 12× larger than that in ENCODE cCREs and all noncoding regions, respectively. We defined 3,724 target genes for 33,618 noncoding ClinVar variants ( Supplementary Table 5 ). As illustrative examples, we found 40 noncoding variants linked to LDLR gene causing familial hypercholesterolemia 1 (ref. 103 ), 3 noncoding variants linked to IL10RA causing autosomal recessive early-onset inflammatory bowel disease 28 ( Fig. 4d ) 104 , and an intronic variant rs1591491477 linked to ATM gene causing hereditary cancer-predisposing syndrome 103 . We also used SCENT to connect noncoding somatic mutation hotspots to target genes. Recently, somatic mutation analyses across the entire cancer genome revealed possible driver noncoding events 105 . Among 17 noncoding mutation hotspots in leukemia, SCENT enhancers from blood-related cell types included 12 hotspots ( Supplementary Table 6 ). SCENT enhancer–gene linkage linked those hotspots to known driver genes (for example, BACH2 , BCL6 , BCR , CXCR4 ( Fig. 4e ) and IRF8 in leukemia). In some instances, SCENT nominated different target genes for these mutation hotspots from those based on ABC model used in the original study. For example, SCENT connected a somatic mutation hotspot in leukemia at chr14:105568663–106851785 to immunoglobulin heavy chain related genes such as IGHA1 , which might be more biologically relevant than ADAM6 nominated by ABC model. These results implicate broad applicability of SCENT for annotating human variations in noncoding regions. While the number for enhancer–gene links by SCENT was smaller than that by bulk-tissue-based methods, this might be a function of current limited sample sizes. By downsampling of our multimodal single cell dataset, we observed that the number of significant gene–peak pairs increased linearly to the number of cells per cell type in a given dataset, suggesting that SCENT will be even better powered as the size of multimodal datasets increases ( Supplementary Fig. 8 ). In SCENT association, we used cells from a specific cell type to identify cell-type-specific gene regulation. While association across cells from different cell types might increase the number of significant peak–gene linkages due to greater variance in chromatin accessibility and gene expression, this strategy could yield false-positive enhancer–gene associations by increasing the chances of connecting enhancers that are merely ‘correlated’ with gene expression ( Extended Data Fig. 10a ). By simulation and real data analyses, we confirmed that the cell-type-specific analysis was better calibrated to reject false enhancer–gene links by correlation and powered to detect experimentally validated enhancer–gene links, while it was less powered to detect promoter–gene links when compared with multiple-cell-type analysis ( Supplementary Note 3 and Extended Data Fig. 10b – e ). We anticipate that we might be able to obtain even higher signal-to-noise ratio with more fine-grained cell-state-specific analyses in the future as better powered datasets with more cells become available; we conjecture that in these datasets there will be fewer false negatives while maintaining the high precision of enhancer–gene calls.

Discussion

We presented a statistical method, SCENT, to create a cell-type-specific enhancer–gene map from single-cell multimodal data. Single-cell RNA-seq and ATAC–seq are both sparse and have variable count distributions, which requires nonparametric bootstrapping to connect chromatin accessibility with gene expression. The SCENT model demonstrated well-controlled type I error, outperforming commonly used statistical models. SCENT mapped enhancers that showed high enrichment for putative causal variants in eQTLs and GWAS and outperformed previous methods. Despite using substantially fewer samples, enhancers defined by SCENT had equivalent or higher enrichment for causal variants than bulk-tissue-based methods with many samples. SCENT benefits from modeling at the single-cell level instead of obscuring associations by aggregating cells into individual samples. As limitations, first, SCENT enhancer–gene maps had relatively fewer enhancers compared to other resources ( Fig. 2a ). However, a linear relationship between the number of cells and the number of significant SCENT peak–gene links ( Supplementary Fig. 8 ) indicates that application of SCENT to larger datasets will expand the current enhancer–gene map. In contrast, bulk-tissue-based enhancer–gene map might be more challenging to expand given the number of samples required. Second, SCENT focuses on gene cis -regulatory mechanisms to fine-map disease causal alleles. However, there could be other causal mechanisms, such as alleles that act through trans -regulatory effects, splicing effects or post-transcriptional effects 106 . Third, to prove the causality of the alleles prioritized by SCENT, experimental validation using gene editing technology 107 – 109 is necessary. Fourth, due to Poisson regression and bootstrapping, SCENT is more computationally intensive than the previous methods (for example, 1.5 × 10 7 CPU seconds in SCENT, 2.5 × 10 5 in Signac, and 2.2 × 10 2 in ArchR; Supplementary Table 7 ). We implemented multi-threading and parallelization options in SCENT, which lead to linearly faster computation but at the cost of additional computational resources. Algorithmic improvements, such as downsampling or aggregating cells, may be useful for extremely large datasets. We argue that the real utility of SCENT is that it enables the construction of disease-tissue-relevant enhancer–gene maps. Multimodal single-cell data can be obtained from a wide range of primary human tissues. It can be applied to those that are difficult to disaggregate since these multimodal data can be obtained from nuclear material without tissue disaggregation. Therefore, it is possible to build relevant tissue-specific enhancer–gene maps that are necessary to understand disease causal mechanisms. For example, understanding the FSHB locus in gynecological traits specifically required a pituitary map, and RBPJ locus in rheumatoid arthritis specifically required a synovial tissue map. In summary, SCENT is a robust, versatile method to define causal variants and genes in human diseases.

Extended Data

a . In an example dataset of arthritis-dataset, mean gene count was strongly correlated with standard deviation of the gene count. b . The correlation between max expression count per gene (x-axis) and the mean naïve association chi-square values ( χ 2) from Poisson regression between gene expression and chromatin accessibility under null simulation (y-axis). c . The quantile-quantile (QQ) plot of two-sided P values from the Poisson regression between gene expression count and chromatin accessibility under null simulation. d . The QQ plot of two-sided P values from the negative binomial regression between gene expression count and chromatin accessibility under null simulation. e . The QQ plot of two-sided P values from the linear regression between log-normalized and inverse-normal-transformed gene expression and chromatin accessibility under null simulation. f . The QQ plot of two-sided P values estimated from bootstrapping based on the statistics distributions from the Poisson regression between gene expression count and chromatin accessibility under null simulation. g . The QQ plot of two-sided P values estimated from bootstrapping based on the statistics distributions from the negative binomial regression between gene expression count and chromatin accessibility under null simulation. h . Computational runtime benchmarking for Poisson regression with binarized ATAC-seq peak (red), negative binomial regression with binarized ATAC-seq peak (teal), and Poisson regression with non-binarized ATAC-seq peak (blue). The values are relative to the computational time for Poisson regression, and bars are the mean across n =100 randomly selected peak-gene pairs. Horizontal lines (error bars) indicate one standard deviation from the mean. We first run Poisson regression associating the raw gene expression count (RNA-seq) with the peak accessibility (ATAC-seq) accounting for technical covariates across the entire cells in the multimodal data to estimate β peak . Then, we resampled cells with replacement from the full data in each of the bootstrapping round and re-estimated β ′ peak for N times. We compared this empirical distribution of β ′ peak against the null hypothesis ( β ′ peak = 0) to derive the significance of β peak (that is, two-sided bootstrapping-based P value = P bootstrap ). We applied SCENT to each of 23 broad cell types from 9 single-cell multimodal datasets. Each QQ plot represents two-sided P bootstrap values in each cell type in each dataset ( a . arthritis-tissue, b . public PBMC, c . NeurIPS, d . SHARE-seq, e . Dogma-seq (control), f . Dogma-seq (stimulated) g . NEAT-seq, h . Brain, i . Pituitary. a . The number of significant SCENT peaks per gene across genes we investigated in at least one dataset-cell type pair. b . The number of significant gene-peak pairs discovered by SCENT with FDR < 10% in each dataset (y-axis) as a function of the total number of ATAC-seq fragments in each dataset (x-axis), colored by the dataset. c . The number of significant gene-peak pairs discovered by SCENT with FDR < 10% in each dataset (y-axis) as a function of the total number of unique RNA molecules in each dataset (x-axis), colored by the dataset. d . The effect size correlation r by Pearson’s correlation between arthritis-tissue dataset and the other dataset for the same cell type (left) and the directional (sign) concordance between arthritis-tissue dataset and the other dataset for the same cell type (right). e . Fraction of overlap with ENCODE cCREs in SCENT (teal) or non-SCENT peaks (orange) in each dataset and random set of cis -non-coding regions (pink). f . The mean Δ phastCons score for SCENT with excluding promoter peaks (teal) and all cis -ATAC peaks with excluding promoter peaks (yellow) in each of the three example multimodal datasets. The bars indicate the 95% CI by bootstrapping genes ( n bootstrap =1000). g . The mean Δ phastCons score between SCENT peaks and TSS-distance-matched non-SCENT peaks across all the genes. The bars indicate the 95% CI by bootstrapping genes ( n bootstrap =1000). For each gene, the number of SCENT peaks were counted and binned as shown in the x-asis, and mutational constraint metric (pLI (the probability of being loss of function intolerant): a , LOEUF (the loss-of-function observed/expected upper bound fraction): b ) for genes within each bin are shown as a violin plot on the y-axis. The dots indicate the mean score in each bin, and the error bars indicate one standard deviation from the mean. Each bin consists of 555–4071 genes in a and 568–4265 genes in b . a . The mean causal variant enrichment for eQTL within SCENT peaks with excluding all promoters (teal) or cis -regulatory ATAC-seq peaks with excluding all promoters (yellow) in each dataset. b . The mean causal variant enrichment for eQTL within SCENT peaks (teal) or non-SCENT peaks with matching distance to TSS (pink). c . Comparison of the mean causal variant enrichment for eQTL (y-axis) among SCENT (teal), ArchR (pink), and Signac (purple) as a function of the number of significant peak-gene pairs at each threshold of significance by FDR in SCENT and correlation r in ArchR and Signac. d . Comparison of the mean causal variant enrichment for eQTL among SCENT, ArchR, and Signac as a function of the number of significant peak-gene pairs at each threshold of FDR in SCENT, ArchR and Signac. The ArchR results with > 180,000 peak-gene linkages are omitted. e . Comparison of the mean causal variant enrichment for eQTL among SCENT, ArchR, and ArchR filtered on RNA expression as a function of the number of significant peak-gene pairs. f . Comparison of the mean causal variant enrichment for eQTL among SCENT, Signac, and Signac filtered on RNA expression as a function of the number of significant peak-gene pairs. g . Comparison of the mean causal variant enrichment for eQTL among SCENT, the default Pearson’s correlation version of Signac, and the optional Spearman’s correlation version of Signac as a function of the number of significant peak-gene pairs. h . Comparison of the mean causal variant enrichment for eQTL among original SCENT (Poisson regression + non-parametric bootstrapping), Poisson-only strategy without bootstrapping, and Cicero (correlation method using sc-ATAC-seq alone) as a function of the number of significant peak-gene pairs up to 100,000 peak-gene linkages. i . Comparison of the mean causal variant enrichment for eQTL between SCENT and Cicero peaks with adding all accessible promoter regions (1 kb regions from TSS) to account for potential promoter bias. j . Tissue-specific causal variant enrichment within SCENT peaks. The dots and lines are colored by the eQTL source tissue in GTEx that we assessed. In all panels, the bars indicate 95% confidence intervals by bootstrapping genes ( n bootstrap =1000). a and b . The mean causal variant enrichment for GWAS within cell-type-specific and aggregated SCENT enhancers (teal), ENCODE cCREs (pink), group-specific and aggregated EpiMap enhancers (red) and sample-specific and aggregated ABC enhancers (blue). GWAS results were based on FinnGen ( a ) and UK Biobank ( b ). The bars indicate 95% confidence intervals by bootstrapping traits ( n bootstrap =1000). c . The mean causal variant enrichment for FinnGen GWAS (see Methods ) within SCENT peaks with excluding all promoters (teal) or cis -regulatory ATAC-seq peaks with excluding all promoters (yellow) in each of the 9 single-cell datasets. The bars indicate 95% confidence intervals by bootstrapping traits ( n bootstrap =1000). d . The mean causal variant enrichment for FinnGen GWAS (see Methods ) within SCENT peaks (teal) or non-SCENT peaks with matching distance to TSS (pink) in each of the 9 single-cell datasets. The bars indicate 95% confidence intervals by bootstrapping traits ( n bootstrap =1000). e . The fraction of known genes from Mendelian autoimmune diseases among all the genes identified by SCENT, EpiMap, and ABC model. The color of the bars indicates the cell types in each linking method. a . Comparison of the mean causal variant enrichment for FinnGen GWAS (y-axis) among SCENT (teal), EpiMap (red), and ABC model (blue) as a function of the number of significant peak-gene pairs (x-axis) at each threshold of significance. The bars indicate 95% confidence intervals by bootstrapping traits ( n bootstrap =1000). b . We calculated the causal variant enrichment for FinnGen GWAS among SCENT (teal), EpiMap (reds), and ABC model (blues) by changing the PIP thresholds in defining putative causal variants from fine-mapping. The bars indicate 95% confidence intervals by bootstrapping traits ( n bootstrap =1000). c and d . The mean causal variant enrichment for GWAS within SCENT enhancers (teal), ArchR (pink) and Signac enhancers (purple). GWAS results were based on FinnGen ( c ) and UK Biobank ( d ) using the FDR < 10% threshold in each software and eight benchmarking datasets (see Methods ). The bars indicate 95% confidence intervals by bootstrapping traits ( n bootstrap =1000). Rs17293632 in asthma GWAS ( a ) was prioritized and connected to SMAD3 gene by SCENT in myeloid cells ( b ). The panel a is a GWAS regional plot, with x-axis representing the position of each genetic variant and y-axis representing −log 10 ( P ) from GWAS (a two-sided P value). The rs17293632 has a significant caQTL effect, as shown in c and d . In panel c , the read coverage from single-cell ATAC-seq in each of donors with heterozygous genotype at this accessible region is presented, and at rs17293632, we observed allele-specific increased accessibility with C allele when compared T allele across donors. In panel d , normalized chromatin accessibility based on the read coverage for an individual after regressing out covariates is presented by the genotype of rs17293632 (CC, CT and TT). The horizontal bars within boxes indicate the median, and the lower and upper hinges represent 25% and 75% quantile. The upper whisker extends from the hinge to the largest value no further than 1.5 * inter-quartile range (IQR) from the hinge. The lower whisker extends from the hinge to the smallest value at most 1.5 * IQR of the hinge. All individual points are plotted as dots. a . An example situation of correlated gene expression without biological regulatory function. b . Benchmarking models for statistical power to define biologically plausible peak-gene linkage over false-associations due to correlated genes. c . Benchmarking results regarding cells and covariates included in the SCENT regression model. The x-axis represents the number of statistically significant peak-gene linkages among 5,000 randomly selected peak-gene linkages in cis , and the y-axis represents the number of statistically significant peak-gene linkages in cis divided by the number of statistically significant peak-gene linkages in trans among 5,000 randomly selected peak-gene linkages on different chromosomes, as a proxy metric for capability of identifying regulatory elements over ‘correlated’ elements. Red dots indicate the analyses conducted in all cells including different cell types ( n = 8,881), whereas blue dots indicate the analyses conducted in only T cells ( n = 8,881). d and e . False positive rate and precision for peak-gene linkages from analyses conducted in all cells (teal) or in only T cells (orange) by using experimentally validated enhancer-gene linkages (that is, CRISPR-Flow FISH data in d and H3K27ac data in e ). False negative rate and precision were defined as follows: false negative rate = # false negative /(# true positive + # false negative ) = 1 – recall

Text is read by the "Ask this paper" AI Q&A widget below. Extraction quality varies by source — PMC NXML preserves structure cleanly, OA-HTML may include some navigation residue, and OA-PDF can have broken hyphenation. The publisher copy (via DOI) is the canonical version.

My notes (saved in your browser only)

Ask this paper AI returns verbatim quotes from the full text · source: pmc-nxml

Answers must be backed by verbatim quotes from this paper's full text. Hallucinated quotes are dropped automatically; if no verbatim passage answers the question, we say so. How this works

Citation neighborhood (no data yet)

We don't have any in-corpus citations linked to this paper yet. This is a recent paper (2024) — citers typically take a year or two to land, and the OpenAlex reference graph may still be filling in.

Source provenance

europepmc
last seen: 2026-06-25T06:14:32.897245+00:00