baal-nf identifies motif-disrupting variants that decrease transcription factor binding affinity

bioRxiv, New Results. doi: https://doi.org/10.1101/2025.01.17.633399

Breeshey Roskams-Hieter (1, 2), Øyvind Almelid (1), Chris P Ponting (1)

1 Institute of Genetics and Cancer, MRC Human Genetics Unit, Western General Hospital, University of Edinburgh, Edinburgh, EH4 2XU, UK
2 Health Data Research UK, Edinburgh, UK

For correspondence: b.j.roskams-hieter{at}sms.ed.ac.uk chris.ponting{at}ed.ac.uk

Abstract

Human traits vary in part due to genetically-determined change of transcription factor (TF) binding affinity within gene regulatory regions. However, few trait-causal variants or mechanisms are known. Here we propose 1,960 variants as strong candidates for causally altering human traits. They were discovered by baal-nf, which uses ChIP-sequencing data to identify allelic imbalance at heterozygous sites (‘allele-specific binding sites’; ASBs) within affinity-concordant positions within TF- and/or co-factor binding motifs. These variants are evolutionarily conserved, and enriched for trait associations and gene expression QTLs. baal-nf and these high-quality ASBs now allow trait variation due to altered TF binding to be investigated.

Background

Revealing the molecular mechanisms by which DNA variants alter function is critical for understanding how genotypes contribute to physiological traits 1 .
DNA variants associated with complex traits and disease mostly lie within non-coding regions, suggesting that trait variation is often caused by genetically altered transcription factor (TF) binding affinity 2 , 3 , 4 , 5 , 6 , 7 . Attributing trait variation to a particular TF and its altered site-specific binding affinity, however, remains an unsolved problem in population genetics and functional genomics. Its solution will require accurate and comprehensive catalogues of both TF binding motifs and affinity-altering DNA variants mapped to these motifs 8 . DNA variants that significantly alter TF binding have been inferred from chromatin immunoprecipitation with sequencing (ChIP-Seq) studies. Rather than comparing across population samples, these experiments use the known genotypes of single cell lines to minimize false variant calls 9 , 10 , 11 . TF binding that favours one allele over the other (i.e., allele-specific TF binding sites; ASBs) is inferred at the cell line’s known heterozygous sites when there is a statistically significant imbalance of ChIP-Seq reads mapped to one of the two alleles. When inferring ASBs, however, care is required to discount alternative hypotheses such as PCR duplication of sequencing reads, reference-mapping bias, and copy number variants (CNVs), especially in immortalized cell lines that carry copy number aberrations 9 , 12 . Although many non-motif-based features modulate TF binding affinity 8 , 13 , 14 , TF binding motifs provide insight into molecular mechanism when an ASB maps to a well-defined motif for the experimentally-targeted TF, and when its lower affinity allele disrupts this motif. Altered binding can disrupt a motif that is represented in publicly available databases such as JASPAR 15 , or a previously unknown TF binding motif predicted de novo directly from the ChIP-Seq data 16 . 
To predict ASBs accurately, methods should call variants that meet four criteria: (i) they match known genotypes 9 , 11 , 12 , (ii) they meet stringent quality control and allele-dependent alignment of ChIP-Seq reads 17 , (iii) they lie within a ChIP-Seq peak, and (iv) they disrupt (or strengthen) a TF’s binding motif when they reduce (or increase) its DNA-binding affinity. Previous ASB prediction methods meet some but not all four criteria 9 , 10 , 18 . Furthermore, methods have not previously included a de novo motif discovery step, which limits the set of biologically-informative motifs to be investigated 17 . Prioritized ASBs that affect TF binding motifs can then facilitate genome-scale complex trait studies whose multiple testing burden is reduced due to the smaller number of prioritized variants. Here, we describe baal-nf, a new computational framework that infers ASBs by processing and quality controlling ChIP-Seq data from a large number of studies that used genotyped cell lines, while accounting for known biases. Mapping these ASBs to known and de novo motifs then permits baal-nf to infer the TF mechanism by which binding is disrupted at these loci. To achieve our goals, we use nextflow to build an end-to-end pipeline, integrating tools for read and alignment quality control, alignment of high-quality reads, de novo motif discovery using NoPeak 16 and inference of ASBs using BaalChIP 9 . BaalChIP is a fully-Bayesian approach that infers ASBs from ChIP-Seq data, and has been shown to accurately correct for biases due to copy number aberrations present in cancer cell lines. We showcase baal-nf by its prediction of 298,783 ASBs from 558 TFs, and map these to 374 known and de novo motifs using publicly available ENCODE and ChIP-ATLAS data from 46 genotyped human cell lines. Among these are 1,960 high-quality, mechanistically well-understood ASBs.
These are sites with strong evidence of binding, one of whose alleles both lowers a TF’s DNA-binding affinity and disrupts its binding motif. Consistent with their functionality, high-quality ASBs are more evolutionarily conserved across species, and are enriched for known trait associations and molecular quantitative trait loci (molQTL), supporting their prioritization in studies seeking causal trait-altering DNA variants. Results baal-nf workflow Our aim was to create a workflow that infers ASBs, while correcting for sources of biological and technical bias, and that assigns their confidence based on disruption of known or de novo inferred TF motifs. To apply this tool across a large set of studies, the workflow needed to be scalable, parallelizable and reproducible, with adaptability to new computational settings. To achieve this, we used nextflow, a workflow management system that allows implementation of complex data analysis workflows and that can be run easily across various computational platforms. Docker containers were implemented to obtain reproducible software environments at each step in the workflow. The baal-nf workflow implements and builds upon BaalChIP, a Bayesian statistical approach that calls ASBs in cancer genomes 9 , as detailed below. baal-nf requires three sets of input data: (1) ChIP-Seq FASTQ data from genotyped cell lines, (2) reference allele frequencies (RAFs) for all heterozygous SNPs in each cell line-of-interest, and (3) BED (Browser Extensible Data) files containing the ChIP-Seq peak calls for a given sequencing run [ Figure 1A ]. FASTQ files are first quality controlled by filtering out reads with low sequencing quality or reads harboring repetitive DNA, such as rDNA 19 , 20 [ Figure 1B , steps 1-2]. Remaining reads are then aligned to the human reference genome and duplicate reads marked as potential PCR duplications, before inferring ASBs with BaalChIP 9 [ Figure 1B , steps 3-5]. 
BaalChIP estimates the Corrected Allelic Ratio (CAR), the preference of TF binding to either the reference or alternate allele at heterozygous sites after correcting for biological and technical biases [ Figure 1A ] (see Methods). A CAR estimate of 0.5 indicates no preference for TF binding to either allele; an estimate between 0.5 and 1 indicates higher TF affinity for the reference (REF) allele; and an estimate between 0 and 0.5 indicates higher affinity for the alternate (ALT) allele.

Figure 1: baal-nf workflow. (A) Workflow for identifying high-quality ASBs through baal-nf. ChIP-sequencing data from genotyped cell lines is used to infer ASBs at heterozygous SNPs using BaalChIP, a beta-binomial model that takes into account biological bias due to copy number aberrations (through the reference allele frequency (RAF)) and technical aspects such as reference-mapping (RM) bias. Identified ASBs are further characterized by mapping them to JASPAR and NoPeak motifs for a given TF. High-quality ASBs show concordance in the corrected allelic ratio (CAR) and motif score difference (MSD) as shown by the blue points in the bottom-right scatter plot. Data points are coloured blue when a heterozygous SNP is (1) an ASB, and either (2a) CAR > 0.5 and MSD > 0 (i.e., REF allele strengthens binding), or (2b) CAR < 0.5 and MSD < 0 (i.e., ALT allele strengthens binding). (B) Bioinformatic workflow for processing and quality-controlling data, acquiring and identifying motifs, and how heterozygous SNPs are mapped to these motifs and further classified.

Inferred ASBs are mapped to two types of TF motifs: known motifs for the experimentally-relevant TF (e.g., MA0148.2 for FOXA1), and de novo motifs that are enriched in the ChIP-Seq read data (e.g., SRR1021800) [ Figure 1A ]. Known motifs for the relevant TF are downloaded from the JASPAR database 15 for use during the mapping procedure [ Figure 1B , step 6].
De novo motifs are called from ChIP-Seq reads using NoPeak, a k-mer-based motif discovery method that predicts motifs from global read distribution profiles, without requiring background ChIP-Sequencing samples for motif discovery 16 [ Figure 1B , step 7] (Methods). NoPeak motifs are further clustered, combining information across highly-similar motifs and removing redundancy within the NoPeak motif set. Only heterozygous SNPs that lie within ChIP-Seq peaks are considered at this stage, due to their strong evidence for TF binding, while we seek to derive a high-quality motif set. This set of heterozygous SNPs is mapped to all JASPAR and NoPeak motifs, resulting in motif scores for each of the REF and ALT alleles [ Figure 1B , step 9] (see Methods). The difference in scores (the Motif Score Difference, MSD) indicates whether the REF or ALT allele better matches the motif-of-interest. Motifs that best explain allelic imbalance of TF binding are defined as “high-quality motifs”: those in which mapped alleles with higher binding affinity (inferred from ChIP-Seq reads) show a significant tendency to be associated with a stronger motif instance (i.e., with more positive MSD). More precisely, to be “high-quality” a motif has a significant and positive correlation (Spearman’s correlation coefficient, SCC) between the CAR derived from BaalChIP and the calculated MSD, considering only heterozygous SNPs in peaks whose surrounding sequence matches this motif well [ Figure 1B , step 10] (see Methods). To characterize NoPeak motifs, we compare them against all known JASPAR motifs, defining a match between them based on motif similarity metrics. Motif similarity is assessed using the approach outlined in Grau et al. 21 and Kielbasa et al. 22 [ Figure 1B , step 11] (Methods).
Using this approach, high-quality NoPeak motifs are classified into one of three groups: (1) redundant motifs – those that are good matches to known motifs in the JASPAR database for the targeted TF, (2) accessory motifs – those that are good matches to known motifs in the JASPAR database that, however, are not linked to the ChIPped TF, and (3) de novo motifs – those that are poor matches to JASPAR motifs. After identifying NoPeak motifs similar to, and thus redundant for, JASPAR motifs, these are discarded from further analysis. Once the set of high-quality motifs is derived, baal-nf maps the entire set of heterozygous SNPs back onto these motifs [ Figure 1B , step 12]. The workflow first identifies “high quality ASBs”: those whose altered TF binding within a ChIP-Seq peak can be explained mechanistically by concordant disruption to a relevant, accessory or de novo TF binding motif. More specifically, these ASBs map to sequences matching a high-quality motif (for REF and ALT alleles) and that, additionally, show concordance between their direction of binding affinity (i.e., CAR) and their change in motif score (i.e., MSD) [ Figure 1A ; highlighted as blue data points]. Next, baal-nf identifies “low quality ASBs”: those that meet each of these criteria, except that they are located outside of a ChIP-Seq peak. We did not wish to discard all ASBs outside of peaks because peak-calling algorithms vary in performance across different binding mechanisms 23 . All remaining ASBs not meeting these criteria or that did not map to a high-quality motif were defined as “unclassified ASBs”. Further below, we justify naming ASBs as “high quality” based on evolutionary information and associations to quantitative trait loci (QTLs). Redundant, accessory and de novo motifs found for FOXA1 To exemplify these steps, we next describe how we applied baal-nf to a single TF, namely FOXA1. 
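The motif and ASB definitions above can be illustrated with a minimal Python sketch. It is not the baal-nf implementation: the `Snp` record, its field names and the category strings are hypothetical, and the significance test that baal-nf applies to the Spearman correlation is omitted. The sketch only shows (a) Spearman's correlation between CAR and MSD, computed as the Pearson correlation of average ranks, and (b) the CAR/MSD concordance and peak rules that separate high-quality, low-quality and unclassified ASBs.

```python
from dataclasses import dataclass


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's correlation: Pearson correlation of the ranks.

    Assumes both inputs contain at least two distinct values.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


@dataclass
class Snp:
    """Hypothetical per-SNP record; field names are illustrative only."""
    is_asb: bool                 # allelic imbalance called (e.g. by BaalChIP)
    car: float                   # Corrected Allelic Ratio, between 0 and 1
    msd: float                   # Motif Score Difference (REF minus ALT)
    in_peak: bool                # SNP lies within a ChIP-Seq peak
    motif_is_high_quality: bool  # SNP maps to a high-quality motif


def concordant(car, msd):
    """True when the favoured allele is also the better motif match."""
    return (car > 0.5 and msd > 0) or (car < 0.5 and msd < 0)


def classify(snp):
    """Assign the ASB category described in the text."""
    if not snp.is_asb:
        return "non-ASB"
    if snp.motif_is_high_quality and concordant(snp.car, snp.msd):
        return "high-quality ASB" if snp.in_peak else "low-quality ASB"
    return "unclassified ASB"
```

Under these assumptions, a motif would be a candidate "high-quality" motif when `spearman_rho` over the CAR and MSD values of its in-peak SNPs is positive and passes a significance test, which would need to be added separately.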
FOXA1 belongs to the FOXA subfamily of winged helix transcription factors that bind and open chromatin, thereby facilitating access for other transcription factors 24 . FOXA1 also binds cooperatively to other nuclear receptors 25 , 26 to enable transcriptional activation, and is associated with epigenetic regulation via DNA demethylation 27 , 28 . FOXA1 binds DNA with high specificity, with binding affinity being highly dependent on minor variation of nucleotides within binding sites 29 . To predict FOXA1 ASBs we used baal-nf and 156 ChIP-sequencing samples, 6 genotyped human cell lines and 270,587 heterozygous SNPs. Of 422,090 unique SNP-cell line pairs, 15,823 (3.7%) were predicted by BaalChIP to show allelic bias, with 420 of these replicated across two or more cell lines. Further, 9,214 (2.2%) lay within ChIP-Seq peaks whose sequences were good matches to JASPAR and/or NoPeak motifs for FOXA1. All 4 JASPAR motifs for FOXA1 were high-quality motifs (i.e., showed concordant change in motif score and allelic ratio, [ Figure 2A ]) and 11 of 135 NoPeak motifs were high-quality. High-quality JASPAR or NoPeak motifs tended to have higher information content than low-quality motifs [ Figure S1 ]. Of the 11 high-quality NoPeak motifs, 1 was redundant, 2 were accessory motifs, and 8 were de novo. The redundant NoPeak motif (Average_150) showed a higher similarity score (0.89) to the FOXA1 JASPAR motif MA0148.4 than other JASPAR motifs [vertical dashed red line, Figure 2B ].

Figure 2: FOXA1 allele-specific binding inferred using baal-nf. (A) All four JASPAR motifs for FOXA1 are high-quality, showing a significant and positive SCC (ρ) between the CAR (x-axis) and MSD (y-axis). ρ is labeled on the plot, with significance levels **** representing a p-value < 2×10⁻¹⁶. Each scatter plot represents a unique JASPAR motif for FOXA1.
(B) Redundant NoPeak motif identified for FOXA1 through baal-nf , where the NoPeak motif (“Average_150”) yielded a high similarity score of 0.89 (vertical dashed red line) to MA0148.4, a FOXA1 JASPAR motif, when compared to similarity scores for all JASPAR motifs, shown here as a histogram. (C) Accessory NoPeak motifs identified in FOXA1 ChIP-sequencing data, where the identified NoPeak motif (SRR1021800_motifs.0) yielded a high similarity score to FOS/JUN/BATF motifs (green cluster), and a low similarity score to FOXA1 JASPAR motifs (shown by dotted red line). One of the 2 accessory motifs (SRR1021800_motif_0) is a good match to FOS/JUN heterodimer and BATF motifs [ Figure 2C , Table S1], but not to FOXA1 motifs. FOS, JUN and BATF are all members of the AP-1 family of transcriptional activators, known to form heterodimers with one another and to be in TF-coordinated complexes with FOXA proteins 30 . FOS has also been proposed to activate FOXA1 through the ERBB2 signaling pathway 31 . The other accessory motif (ENCFF774LQB_motif_0) is a good match to FOXP2, another member of the FOX family of TFs. These results demonstrate baal-nf ’s ability to discover mechanistically-predictive TF binding motifs. A total of 1,246 ASB events (7.9%) mapped to the 14 high-quality, non-redundant motifs (10 NoPeak, 4 JASPAR), with 224 being high-quality ASBs and 633 being low-quality ASBs. Across these 14 motifs, a median of 60 low-quality ASBs and 15 high-quality ASBs were identified [Table S2]. High-quality ASBs were discovered across JASPAR and NoPeak motifs in approximately equal measure, with 58% (129/224) mapped to JASPAR motifs, and 42% mapped to NoPeak motifs (16 to accessory motifs and 79 to de novo motifs) [Table S3, Figure S2 , Figure S3 ]. This approximate doubling of high-quality ASB predictions for FOXA1 highlights the importance of including de novo motif discovery in this workflow. 
High-quality ASBs for FOXA1 were associated, on average, with 2.7 human physiological traits at a p-value threshold < 5 × 10⁻⁸ and 1.8 expression QTLs (eQTLs) [ Figure S4 ] (Methods).

baal-nf predicts 1,960 high-quality binding variants across 558 TFs

We applied baal-nf to 6,017 ChIP-Seq data sets for 558 TFs and 46 genotyped cell lines (Table S4), making up 1,056 TF-cell line groups. This resulted in 298,783 inferred ASBs, with 8,968 (3.00%) mapping to high-quality TF motifs. Of these, the change in binding affinity for 5,063 (1.69%) was concordant with disruption to a relevant motif, and 1,960 (0.66%) passed the criteria (defined above) necessary to be assigned as high-quality [ Figure 3A , Table S5, Table S6]. Among 3,014 JASPAR and NoPeak motifs investigated, 374 were high-quality [ Figure S5 ]. Compared to low-quality motifs, high-quality motifs yielded consistently larger, positive Spearman’s correlation coefficients between CAR and MSD, indicating that they better explain binding variation for investigated SNPs [ Figure S6 ].

Figure 3: ASB set derived across 558 TFs and 46 genotyped cell lines, resulting in a subset of high-quality ASBs. (A) From the complete set of ASBs predicted by BaalChIP, 3% map within high-quality motifs, 5,063 display concordant behaviour in direction of inferred binding from BaalChIP (CAR) and disruption to a relevant motif (MSD), and 1,960 are deemed to be high-quality. (B) Empirical cumulative distribution function (ECDF) of frequencies of the binding allele in the non-Finnish European population from gnomAD, split by whether the binding allele is the ancestral or derived allele. A Wilcoxon rank sum test comparing the two groups (FALSE and TRUE) shows a significantly higher binding allele frequency when the ancestral allele is also the binding allele (TRUE), with a p-value of 5.4 × 10⁻⁶⁶.
(C) High-quality ASB sites (turquoise dashed line) when compared to non-ASB sets (yellow histogram) are significantly better conserved, as indicated by a higher number of SNPs with PhastCons scores > 0.95. (D) We find the opposite effect for non-conserved bases, with a significantly lower number of non-conserved bases (PhastCons score < 0.05) present in the high-quality ASB set.

High-quality ASBs (n=1,960) were predicted by baal-nf for 86 out of 558 TFs in 46 genotyped cell lines. JASPAR motifs contributed most to this high-quality subset (average of 22 high-quality ASBs per TF), followed by NoPeak de novo motifs (average of 18 high-quality ASBs per TF) and NoPeak accessory motifs (average of 13 high-quality ASBs per TF) [ Figure S7 ]. As with the FOXA1 example (above), inclusion of NoPeak motifs substantially expanded the set of high-quality ASBs, as a result of both accessory and de novo motifs being discovered across TFs [ Figure S8 ]. Variation in TF-binding affinity may not alter downstream molecular, cellular or organismal traits, and thus may not be functional 32 . To assess functionality of the high-quality ASB subset, we considered whether they had been subject to evolutionary selection, have known associations with human traits or are QTLs for transcript abundance or splicing (i.e., eQTL and sQTL, respectively). For the 1,960 high-quality ASBs, we hypothesized that the allele associated with stronger TF binding affinity, i.e., the “binding allele”, would more often be the ancestral allele that has been retained in the population at higher frequency than the lower affinity, derived allele. Indeed, we discovered that when the binding allele is the derived allele then its population frequency is typically lower than when the binding allele is the ancestral allele (p-value=5.4 × 10⁻⁶⁶) [ Figure 3B ]. This is consistent with positive selection of the binding allele and/or negative selection of the lower affinity allele.
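The binding-allele comparison above can be sketched as follows. This is an illustrative Python sketch only: the record layout (`ref`, `alt`, `car`, `ancestral`, `allele_freq`) is a hypothetical stand-in for whatever tables hold the CAR, the ancestral allele and the population allele frequencies, and the rank-sum test itself is not reproduced. The sketch shows how the binding allele follows from the CAR and how its frequency is assigned to the ancestral (TRUE) or derived (FALSE) group.

```python
def binding_allele(ref, alt, car):
    """Return the allele with stronger inferred TF binding.

    CAR > 0.5 favours the REF allele and CAR < 0.5 favours ALT;
    a CAR of exactly 0.5 indicates no preference, so None is returned.
    """
    if car == 0.5:
        return None
    return ref if car > 0.5 else alt


def split_binding_allele_frequencies(records):
    """Group binding-allele frequencies by ancestral status.

    Returns a dict keyed by True (binding allele is ancestral) and
    False (binding allele is derived).
    """
    groups = {True: [], False: []}
    for rec in records:
        allele = binding_allele(rec["ref"], rec["alt"], rec["car"])
        if allele is None:
            continue  # no preferred allele, so nothing to compare
        freq = rec["allele_freq"][allele]
        groups[allele == rec["ancestral"]].append(freq)
    return groups


# Toy records with made-up values, for illustration only.
toy_records = [
    {"ref": "A", "alt": "G", "car": 0.8, "ancestral": "A",
     "allele_freq": {"A": 0.9, "G": 0.1}},
    {"ref": "C", "alt": "T", "car": 0.2, "ancestral": "C",
     "allele_freq": {"C": 0.7, "T": 0.3}},
]
toy_groups = split_binding_allele_frequencies(toy_records)
```

The two lists in `toy_groups` correspond to the TRUE and FALSE curves of an ECDF plot; a two-sample rank-based test would then be applied to them.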
Next, we showed that high-quality ASB sites tend to be evolutionarily well conserved. For this analysis, we generated 1000 comparator sets of “non-ASBs”, defined as randomly sampled high read coverage heterozygous SNPs that, when tested by baal-nf, failed to show significant allelic bias in any cell line, and for any TF (Methods) [Table S7]. We define high read coverage here as ≥ 100 reads mapping to the region containing the SNP. SNPs in these non-ASB sets were matched on minor allele population frequencies to those of high-quality ASBs, and had the same number of SNPs as the high-quality set (Methods). SNP positions in both groups were scored for evolutionary conservation across 30 eutherian mammals using PhastCons 33 . PhastCons scores range from zero (no conservation of that SNP across the species-of-interest) to one (complete conservation). The number of highly conserved bases (PhastCons score > 0.95) was significantly larger in the high-quality ASB set (empirical p-value = 8.0 × 10⁻³), and the number of non-conserved bases (PhastCons score < 0.05) was significantly lower in the high-quality ASB set (empirical p-value = 1.3 × 10⁻²) [ Figure 3C-D ]. Compared with non-ASBs, we found that high-quality ASBs also had: (i) a significantly higher number of known variant-human trait relationships (p-value = 2.75 × 10⁻³), (ii) more eQTLs (p-value = 3.0 × 10⁻⁹), and (iii) a higher maximum OpenTargets V2G (Variant-to-gene) score when associating each variant with a downstream gene (p-value = 1.9 × 10⁻⁹), but no significant difference in the number of sQTLs (p-value = 0.97) 34 . As expected, minor allele frequencies were not significantly different between our non-ASB set and high-quality set, mitigating ASB discovery biases [“count100” in Figure S9A ]. Findings were also robust to different count thresholds used to define high coverage SNPs for the random samples of non-ASBs [ Figure S9B-E , Figure S10 ].
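The empirical p-values above compare one statistic from the high-quality ASB set with the same statistic computed in each comparator set. A minimal Python sketch of that idea is given below. The function names are hypothetical, and the add-one correction in the estimator is an assumption: the text does not state which convention was used, and the uncorrected proportion (extreme sets divided by total sets) is an equally common choice.

```python
def count_above(scores, threshold):
    """Number of scores strictly above a threshold (e.g. PhastCons > 0.95)."""
    return sum(score > threshold for score in scores)


def empirical_p_value(observed, null_values, alternative="greater"):
    """One-sided empirical p-value against a set of comparator statistics.

    observed     -- statistic for the set of interest
    null_values  -- the same statistic for each comparator (null) set
    alternative  -- "greater" tests whether observed is unusually large,
                    "less" tests whether it is unusually small

    Uses the add-one estimator (extreme + 1) / (n + 1), which never
    returns exactly zero. This choice is an assumption of the sketch.
    """
    if alternative == "greater":
        extreme = sum(value >= observed for value in null_values)
    elif alternative == "less":
        extreme = sum(value <= observed for value in null_values)
    else:
        raise ValueError("alternative must be 'greater' or 'less'")
    return (extreme + 1) / (len(null_values) + 1)


# Toy example with made-up numbers: 1000 comparator statistics, 0..999.
toy_null = list(range(1000))
toy_p = empirical_p_value(992, toy_null, alternative="greater")
```

In the setting above, `observed` would be the count of highly conserved (or non-conserved) positions among the high-quality ASBs, and `null_values` the corresponding counts in the 1000 frequency-matched non-ASB sets.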
Both evolutionary conservation analyses and trait/QTL-association analyses were consistent when using a more lenient minimum read coverage threshold of 50 to derive our 1000 non-ASB comparator sets (results for “count50” are shown in Figures S9B-D and S10 ). With the same analyses, “low-quality ASBs” were found not to be more evolutionarily conserved compared to their non-ASB sets [ Figure S11 ]. They were, however, significantly associated (p-value = 2.0 × 10⁻³) with a higher number of human traits at a p-value threshold < 5 × 10⁻⁸, but not for the numbers of colocalized eQTLs or sQTLs, or the maximum V2G score [ Figure S12 , Table S8, Table S9]. In summary, by applying baal-nf across a large set of TFs, we have generated a new database of 298,783 ASBs, and have identified a subset of 1,960 high-quality ASBs. For these high-quality ASBs, the differential binding mechanism, inferred using high-quality motifs, provides high confidence for functionally-significant altered binding at these loci. When compared to non-ASB SNPs, this high-quality ASB set exhibited greater conservation across species, a higher number of known associations with traits, and greater colocalization with molecular QTLs.

Discussion

Identification of trait-causal variants would yield insights into fundamental biology that could aid development of therapeutic interventions. Nevertheless, such variants are challenging to identify due to linkage disequilibrium, whereby neighboring SNPs are often co-inherited, leading to significant variant-trait associations that mainly reflect correlation, not causation. Assigning molecular mechanisms to SNPs is critical for prioritization of variants that are truly causal 35 , 36 , 37 , 38 . Extensive cataloguing of molecular QTLs is required to refine the evidence for how finely-mapped SNPs can mechanistically explain trait variation.
Substantial efforts have been made to identify and annotate cis- and trans-expression QTLs and splicing QTLs through databases like eQTLGen 39 , GTEx 40 , eQTL Catalogue 41 , single-cell eQTLGen Consortium 42 , as well as integrated platforms such as QTLbase2 43 and OpenTargets Genetics 34 , 37 . Less effort has been expended on TF binding QTLs: such resources are scarce and are seldom integrated into these platforms. Databases for ASBs like AlleleDB 44 and ADASTRA 10 , with extensions such as ANANASTRA 45 , have tried to address this gap. Here, we sought to help narrow the information gap and provide a tool for large-scale identification of new and high-quality ASB datasets. Nevertheless, the ASBs reported in this study are far from being complete. This is because baal-nf can only infer ASBs for heterozygous SNPs in genotyped cell lines with ChIP-Seq data. ASB incompleteness is evident from the near disjoint sets of ASBs inferred by baal-nf (n=298,783 ASBs) and by ADASTRA (n=264,318 ASBs) 10 [ Figure S13A ] caused by minimal overlap in the TFs and cell lines investigated by the two methods, and by ADASTRA not requiring ASBs to have a priori known genotypes [ Figure S13B-C ]. This shortfall can be addressed by investigators applying baal-nf to new ChIP-Seq datasets. Even with application to new datasets, low coverage of ChIP-Seq reads at heterozygous sites will impede our ability to call ASBs, as there may be insufficient power to do so. Prediction incompleteness is also evident after comparing baal-nf ASBs to those predicted from SNP evaluation by Systematic Evolution of Ligands by EXponential enrichment (SNP-SELEX) data in GVATdb 46 . Compared with GVATdb, baal-nf tested an order of magnitude higher number of TF-SNP pairs (17.5-fold higher) and called an order of magnitude higher numbers of ASBs (13.5-fold), demonstrating similar ASB call rates across studies. 
A very small percentage of TF-SNP pairs were tested in both strategies (with the same REF/ALT alleles): 3,170 of 1,374,477 (0.2% of GVATdb) and 3,170 of 23,997,518 (0.01% of baal-nf). Within this set of 3,170 non-redundant TF-SNP pairs, there were 4,435 ASB tests in total due to multiple experiments and/or cell lines. Of the 3,170 pairs, 3,083 (97.3%) were called as non-ASBs in both; 83 (2.6%) were discordant, being called as a non-ASB by baal-nf and as an ASB in GVATdb, or vice versa; and 4 (0.1%) were called as an ASB by both baal-nf and in GVATdb. Note that the same SNP/TF pair could be called as both a non-ASB and ASB in both databases due to variation across experiments, biological contexts (i.e., different cell lines) and statistical coverage (i.e., read depth). The low number of TF-SNP pairs investigated in both these studies demonstrates how far genomics research has yet to progress before testing anywhere near the full complement of possible TF binding QTLs in the human genome. We note that baal-nf can easily be extended to predict allelic imbalance from any other sequencing-based method, including CUT&RUN and CUT&TAG, or RNA-protein based approaches such as Cross-linking and immunoprecipitation followed by sequencing (CLIP-Seq). baal-nf is open-source software, available on GitHub, which can be used to derive new ASB sets and high-quality ASBs from ChIP-Seq datasets. It supports end-to-end processing of raw sequencing data, inference of ASBs and characterization of these ASBs with respect to binding mechanism. For large-scale application, a high-performance compute (HPC) cluster is recommended for efficient parallelization of the workflow. A major challenge in genomic research is the ability to make research portable and reproducible 47 . To achieve this, we rely on nextflow and singularity, allowing large-scale implementation of this tool with stable and reproducible software environments 48 .
Limitations of baal-nf include BaalChIP’s reliance on the hg19 genome assembly, and its current dependence on ChIP-sequencing data. Furthermore, non-JASPAR reference motifs are not currently integrated into the workflow. This tool can be extended in the future for compatibility with additional genome assemblies, motif databases, and other sequence-based methods that measure allelic imbalance. We hope that others will add to this initial discovery set of high-quality ASBs by applying baal-nf to further data sets. All such high-quality ASBs will be useful when prioritizing non-coding regulatory variants as causal of complex trait variation and disease risk.

Conclusions

Of the 298,783 ASBs reported in this study, 0.66% (n=1,960) were “high-quality”. We propose this high-quality set as strong candidates for causal trait-altering DNA variants, as demonstrated by enrichment for known variant-trait relationships, colocalization with gene expression QTLs and strong evolutionary conservation. Although limited in size, this high-quality set’s variants are mechanistically well-understood and have evidence for altered binding across multiple datasets. This is because the method requires: (1) strong evidence for the called genotype at heterozygous SNPs, (2) strong evidence for allelic imbalance, (3) orthogonal evidence for altered binding by disruption of a TF binding motif, and (4) strong evidence for binding by residing within a ChIP-Seq peak. By proposing a high-quality set, we enable researchers to prioritize these variants in large-scale genomic studies.

Methods

Read QC, alignment and inferring ASBs

For baal-nf, read quality is assessed using FastQC (version v0.11.9) and FastQScreen (version v0.14.0). Reads are then trimmed using TrimGalore! (version v0.6.7), filtering out any reads lying below a minimum quality threshold (default parameters in TrimGalore!) [19, 20].
High-quality reads are aligned using bowtie2 [49] (version v2.3.5.1) to hg19, the human reference genome used in BaalChIP; duplicate reads are marked with Picard (version v2.23) [50]; and aligned BAM files are used to infer ASBs using BaalChIP [9] (custom version modified from v1.1.1; provided at https://github.com/BAAL-NF/BaalChIP). BaalChIP models allelic counts with a beta-binomial model, correcting for CNVs using the RAF as a prior, and correcting for Reference Mapping (RM) bias, which is estimated directly from the data. An ASB is called only if the highest posterior density interval for the fraction of reads mapping to the REF allele does not include a value of 0.5. Heterozygous SNPs predicted to be ASBs have a value of “True” set in the output column “isASB”. The Corrected Allelic Ratio (CAR) is also reported for each SNP; it describes preference in binding to either the REF (CAR > 0.5) or ALT (CAR < 0.5) allele, after correcting for the RAF and RM bias.

Predicting motifs with NoPeak

Aligned BAM files are converted to BED files using bedtools (version v2.29.2), and NoPeak [51] is used to compute score profiles for each k-mer of length 8, the NoPeak default setting. Briefly, a score profile is computed by estimating the density of reads which include a given k-mer, and computing the distance from the beginning of that read to the k-mer of interest. Profiles that are consistent with TF binding have an increased frequency of reads over the k-mer sequence. All k-mer profiles that indicate TF binding are combined based on sequence similarity, resulting in predicted motifs per ChIP-sequencing sample. Motifs derived from fewer than 10 k-mers are excluded from subsequent steps as they are generally low-complexity and not indicative of true binding motifs.

Identifying non-redundant NoPeak motifs

Predicted NoPeak motifs are defined per ChIP-sequencing sample, leading to likely redundant motifs across multiple samples ChIPped for the same TF.
To remove redundant motifs, all NoPeak motifs discovered for a given TF are clustered using GimmeMotifs (version v0.17.2) [52], using de Bruijn sequences to quantify motif similarity (cluster_motifs() function: trim_edges = True, metric = “seqcor”, threshold = 0.7, combine = “mean”). A minimum threshold of 0.7 is chosen based on the analysis using the “seqcor” metric in Grau et al. [21]. This scores each motif against every other motif, trimming low information content bases at the edge of the motif, and clustering together motifs that have a similarity score greater than 0.7. Each motif cluster is then averaged across the length of the motif to obtain a final, clustered motif. The resulting motifs from NoPeak are named after the ChIP-sequencing sample in which they were discovered or, if clustered, are named Average_n, where n is a number randomly generated by GimmeMotifs.

Mapping heterozygous SNPs to JASPAR and NoPeak motifs

Heterozygous SNPs are mapped to JASPAR and NoPeak motifs using GimmeMotifs (version v0.17.2) [52]. Sequence instances for the REF and ALT allele for each heterozygous SNP are derived by extracting +/- 25 bp around the position of the relevant SNP from the human reference genome assembly (here, hg19). Each instance is scored against a motif-of-interest (the motif score), which is represented as the log-odds of belonging to that motif compared to a random background sequence from the same reference genome, with a false positive rate less than 5 × 10⁻². The Motif Score Difference (MSD) is computed by taking the difference between the REF motif score and the ALT motif score. An MSD of zero indicates that the REF and ALT alleles match the motif-of-interest equally well; a positive MSD indicates that the REF allele is more likely to belong to the motif-of-interest; and a negative MSD indicates that the ALT allele is more likely to belong to the motif-of-interest.
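As a rough illustration of the MSD defined above, the following Python sketch scores a REF and an ALT sequence against a motif and takes the difference. It is not the GimmeMotifs implementation used by baal-nf: the motif `TOY_MOTIF`, the uniform background, the 8 bp sequences (rather than +/- 25 bp) and all function names are assumptions made for illustration, only the forward strand is scanned, and no false-positive-rate threshold is applied.

```python
import math

# Hypothetical 4 bp motif (not a JASPAR entry): one dict of base
# probabilities per motif position, preferring the site "AGCT".
TOY_MOTIF = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
# Assumption: uniform background. The paper instead uses random background
# sequence drawn from the reference genome.
BACKGROUND = 0.25


def window_score(window, motif):
    """Log-odds (base 2) of one window under the motif versus the background."""
    return sum(math.log2(col[base] / BACKGROUND) for base, col in zip(window, motif))


def best_score(seq, motif):
    """Highest log-odds score over all forward-strand windows of the sequence."""
    width = len(motif)
    return max(window_score(seq[i:i + width], motif)
               for i in range(len(seq) - width + 1))


def motif_score_difference(ref_seq, alt_seq, motif):
    """MSD = REF motif score minus ALT motif score."""
    return best_score(ref_seq, motif) - best_score(alt_seq, motif)


ref = "TTAGCTTT"  # REF allele G completes the toy site AGCT
alt = "TTAACTTT"  # ALT allele A disrupts it
msd = motif_score_difference(ref, alt, TOY_MOTIF)
```

Here `msd` is positive, which under the convention above means the REF allele is the better match to the motif; swapping the two sequences flips the sign.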
“High-quality” motifs are determined by computing Spearman’s Correlation Coefficient (SCC; rho) between the CAR and MSD across all heterozygous SNPs that map to a given motif, and then filtering for motifs with a significant, positive SCC (p-value < 0.05). Only such motifs are considered when identifying high-quality ASBs.

Characterizing NoPeak motifs

To assess similarity of NoPeak motifs to known JASPAR motifs, we compute a similarity score for each NoPeak motif compared to all known JASPAR motifs (organism: Homo sapiens) using GimmeMotifs (version v0.17.2). We follow the approach outlined in Grau et al. [21] and Kielbasa et al. [22], using de Bruijn sequences to compute a similarity score. Each step along the chosen de Bruijn sequence, representing every k-mer of length k (k=7, the default provided in GimmeMotifs), is scored against the first motif and second motif, resulting in two vectors of motif scores, termed score profiles. If the motifs are highly similar, then these score profiles are highly correlated, and have a strong, positive Pearson Correlation Coefficient (PCC). The PCC between score profiles is computed for every possible offset of the two motifs as well as the reverse complement, and the maximum score (termed “similarity score”) is taken. NoPeak motifs with a similarity score greater than 0.7 to a known JASPAR motif are classified as either (1) redundant: the NoPeak motif matches the canonical JASPAR motif for the TF-of-interest, in which case it is removed from further analysis, or (2) accessory: the NoPeak motif matches a known JASPAR motif that is not the canonical binding motif [22].

ASB subsets – defining high and low-quality ASBs

Concordant ASBs are derived by filtering for all ASBs that map to a high-quality motif, and for which either (1) the CAR > 0.5 and the MSD > 0 or (2) the CAR < 0.5 and the MSD < 0. This subset is further characterized by whether or not the SNP lies within a ChIP-Seq peak.
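The concordance rule above can be written as a short filter. This is a minimal sketch rather than code taken from baal-nf: the function name, the record layout and the example values are hypothetical, and the input is assumed to contain only ASBs that already map to a high-quality motif.

```python
def is_concordant(car, msd):
    """True when allelic imbalance (CAR) and motif disruption (MSD)
    point to the same allele."""
    favours_ref = car > 0.5 and msd > 0
    favours_alt = car < 0.5 and msd < 0
    return favours_ref or favours_alt


# Hypothetical ASB records; identifiers and values are made up.
asb_records = [
    {"snp": "snp_a", "car": 0.82, "msd": 3.1},   # REF favoured by both
    {"snp": "snp_b", "car": 0.79, "msd": -2.4},  # discordant
    {"snp": "snp_c", "car": 0.21, "msd": -1.7},  # ALT favoured by both
    {"snp": "snp_d", "car": 0.30, "msd": 0.0},   # motif not disrupted
]

concordant = [r["snp"] for r in asb_records if is_concordant(r["car"], r["msd"])]
```

Because both comparisons are strict, a SNP with an MSD of exactly zero is never concordant, which matches the requirement that the variant disrupts the motif.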
ASBs that lie within ChIP-Seq peaks are termed “high-quality” and those that lie outside of ChIP-Seq peaks are called “low-quality”. As the same SNP can map to multiple motifs, we define an order in which to call SNPs concordant by first prioritizing those that lie within (i) JASPAR motifs, then (ii) NoPeak accessory motifs and then (iii) NoPeak de novo motifs.

Deriving non-ASB sets

To compare sets of “non-ASBs” to the high-quality ASB set, all assessed heterozygous SNPs across all 46 genotyped cell lines were inspected, and any SNP (specifically rsID) for which an ASB was called by BaalChIP was removed from this set. This resulted in a final set of analysed SNPs for which an ASB was not called in any cell line assessed across all 558 TFs. Each heterozygous SNP in this set was then required to be “high coverage” (number of reads ≥ 100 at its site) to ensure sufficient coverage to confidently call this site as non-ASB. We also evaluated this workflow at a minimum read threshold of 50 [“count50” in Figures S9 - S11]. Every SNP in this non-ASB sample set was queried using the Ensembl REST API to pull the gnomAD minor allele frequency for the non-Finnish European population. Non-ASBs’ rsIDs were randomly sampled with replacement from sets matched on minor allele frequency (MAF) within a 5% relative threshold, and this process was repeated across all high-quality SNPs to form a sample size of 2,400 non-ASB SNPs, a number matching the count of ASBs found in the high-quality set, including SNPs that were replicated across cell lines for the same TF. This sampling procedure was repeated 1000 times to generate 1000 non-ASB sets. A single “median” non-ASB set was also derived by selecting the SNP with the median MAF across the 1000 sampled datasets for each SNP in the total set of MAF-matched SNPs (n=2,400). This results in a single “median” non-ASB dataset matching the SNP set size of 2,400.
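The MAF-matched sampling described in this section can be sketched as follows. This is an illustrative reading of the procedure, not the authors' code: the function name, the toy rsID labels and MAF values are hypothetical; the "5% relative threshold" is assumed to mean |MAF_non-ASB − MAF_ASB| ≤ 0.05 × MAF_ASB; and ASB SNPs with no matching candidate are simply skipped, a case the text does not specify.

```python
import random


def maf_matched_sample(asb_mafs, non_asb_mafs, rel_tol=0.05, rng=None):
    """Draw one non-ASB rsID per ASB SNP, matched on MAF.

    asb_mafs:     list of MAFs, one per high-quality ASB SNP.
    non_asb_mafs: dict mapping non-ASB rsID -> MAF.
    Sampling is with replacement: the same rsID may be drawn repeatedly.
    """
    if rng is None:
        rng = random.Random()
    sample = []
    for maf in asb_mafs:
        # Assumed interpretation of the 5% relative threshold.
        candidates = [
            rsid for rsid, candidate_maf in non_asb_mafs.items()
            if abs(candidate_maf - maf) <= rel_tol * maf
        ]
        if candidates:  # assumption: unmatched ASB SNPs are skipped
            sample.append(rng.choice(candidates))
    return sample


# Toy inputs (made-up labels and frequencies).
non_asb_mafs = {"rs_n1": 0.098, "rs_n2": 0.102, "rs_n3": 0.200,
                "rs_n4": 0.305, "rs_n5": 0.450}
asb_mafs = [0.100, 0.300]

# Repeat the draw 1000 times, as in the text, with a fixed seed.
rng = random.Random(0)
sampled_sets = [maf_matched_sample(asb_mafs, non_asb_mafs, rng=rng)
                for _ in range(1000)]
```

Each of the 1000 sets has one entry per ASB SNP, so downstream comparisons are made between sets of equal size and similar MAF distribution.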
This median set was used for querying statistics from OpenTargets, including the number of trait associations found at a p-value threshold of 5 × 10⁻⁸, the number of colocalized eQTLs and sQTLs, and the maximum V2G score for that variant. This same process was repeated for low-quality ASBs, with a set size of 3,473, to derive new non-ASB reference sets matched on MAF. See Figure S14 for a flow chart of this workflow.

Evolutionary and functional genomics analysis of SNP groups

PhastCons scores for all SNPs assessed in 30-way bigwig files (for GRCh38) were pulled from UCSC [53] and scores computed for SNPs in high-quality ASB and non-ASB groups using Ensembl’s Variant Effect Predictor, VEP (version 112) [54]. Variant-trait and variant-gene relationships were determined for each SNP group by querying OpenTargets Genetics [34, 37] using their GraphQL API. For each variant, every variant-trait and variant-gene relationship record was pulled, and the following values were computed for each variant: (1) number of traits that passed a significance threshold of p < 5 × 10⁻⁸, (2) number of colocalized eQTLs, (3) number of colocalized sQTLs, and (4) maximum V2G score. Additional information was queried using the Ensembl REST API, including information about which allele was ancestral, as well as minor/major alleles in a population-of-interest and their respective frequencies in that population. Here, we used the gnomAD non-Finnish European population and, in cases where these data were not available, the 1000 Genomes GBR population. See Figure S15 for a flow chart of this workflow.

Declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Availability of data and materials

All data generated or analysed during this study are included in this published article or are from publicly-available sources.
All supplementary tables can be downloaded from the Open Science Framework (OSF) at https://osf.io/rwjec/ under the project osf.io/rwjec (v0.1.0; DOI 10.17605/OSF.IO/RWJEC). Accession IDs for FASTQ and BED files from ENCODE (at https://www.encodeproject.org/) [55] and ChIP-Atlas (at https://chip-atlas.org/) [56] are described in Table S4. Additional ChIP-Seq FASTQ and BED files were used for the Vitamin D Receptor (VDR) from Gallone et al. [57]. Results from this study are contained within Tables S5, S6, and S10. Table S5 contains all 298,783 predicted ASBs across 558 TFs; Table S6 contains all high-quality ASBs and queried information from Ensembl and OpenTargets; and Table S10 contains all motifs explored in this study with associated metadata for each TF. All SNP-cell line mappings, including characterization with respect to high-quality motifs, are deposited on OSF under project osf.io/rwjec in the folder “Database”, separated by transcription factor. All code is deposited on GitHub at https://github.com/BAAL-NF/baal-nf and is publicly available. The baal-nf version implemented in this manuscript is 0.9.0. All figures were generated using BioRender [58]. For the purpose of open access, the author has applied a Creative Commons Attribution (CC BY) licence to any author accepted manuscript version arising.

Competing interests

The authors declare that they have no competing interests.

Funding

Breeshey Roskams-Hieter was supported by the Health Data Research UK & The Alan Turing Institute Wellcome PhD Programme in Health Data Science (Grant Ref: 218529/Z/19/Z). Øyvind Almelid was supported in part by an Alan Turing Institute award to Chris P Ponting (TU/ASG/R-SPEH-102). Chris P Ponting was supported in part by the Medical Research Council (MC_UU_00007/15).

Authors’ contributions

Breeshey Roskams-Hieter: Conceptualization, Methodology, Software, Validation, Formal analysis, Writing – Original draft, Visualization, and Funding Acquisition.
Øyvind Almelid: Conceptualization, Methodology, Software, Formal analysis, Data Curation, Writing – Review & Editing, and Supervision. Chris P Ponting: Conceptualization, Funding Acquisition, Methodology, Supervision, and Writing – Review & Editing.

Supplementary Figures

Supplementary Figure S1: Properties of high-quality motifs for FOXA1. Empirical cumulative distribution function (ECDF) plot showing information content of high- (blue) versus low- (red) quality motifs found for FOXA1. Information content (x-axis) is measured in bits.

Supplementary Figure S2: Redundant and accessory motifs discovered for FOXA1. Scatter plots for (A) redundant and (B) accessory motifs found for FOXA1 via NoPeak. High-quality ASBs are coloured in blue. CAR (x-axis) is the corrected allelic ratio calculated by BaalChIP for a heterozygous SNP lying within a ChIP-Seq peak; MSD (y-axis) is the motif score difference for each SNP mapped to the labeled motif.

Supplementary Figure S3: de novo motifs found for FOXA1 via NoPeak. High-quality ASBs are coloured in blue. CAR (x-axis) is the corrected allelic ratio calculated by BaalChIP for a heterozygous SNP lying within a ChIP-Seq peak; MSD (y-axis) is the motif score difference for each SNP mapped to the labeled motif.

Supplementary Figure S4: Number of associated traits & eQTLs for high-quality ASBs predicted for FOXA1. These values were obtained by querying for the variant in OpenTargets Genetics and summarizing statistics for each category: (A) number of trait associations that passed a p-value threshold of 5 × 10⁻², (B) number of trait associations that passed a p-value threshold of 5 × 10⁻⁸, and (C) number of colocalized eQTLs.

Supplementary Figure S5: Number of high and low-quality motifs detected across 558 TFs.
Supplementary Figure S6: Spearman’s correlation coefficient for high- (blue) versus low- (red) quality motifs in JASPAR and NoPeak motifs investigated across all 558 TFs. **** indicates a p-value < 2 × 10⁻¹⁶, with p-values computed using a Wilcoxon rank sum test.

Supplementary Figure S7: Breakdown by motif group for high-quality ASBs. Expanding our motif set to include accessory and de novo motifs from NoPeak substantially increases the number of high-quality ASBs.

Supplementary Figure S8: Numbers of NoPeak motifs detected across the 558 TF run. A range of NoPeak accessory and de novo motifs are discovered, with more de novo motifs than accessory motifs found overall.

Supplementary Figure S9: Trait/QTL associations for high-quality ASBs compared to non-ASB comparator sets. (A) As these non-ASB sets were matched on minor allele frequency (MAF), we see no significant difference between the MAFs for the high-quality ASB and non-ASB sets with p-values of 0.9 and 0.9 for the count 50 and count 100 thresholds, respectively. Non-ASB sets thresholded above 50 counts are enriched for (B) known variant-trait associations, and (C) colocalized eQTLs, but not for (D) sQTLs, although they tend to have (E) higher maximum V2G scores. Data is queried from OpenTargets for the median non-ASB set for counts threshold of 100 and 50 and the high-quality set. The median non-ASB set was defined by choosing the SNP with the median MAF across all 1000 sampled sets, for each SNP (see Methods). Note that the y-axis in plots (B-E) is log10-scaled.

Supplementary Figure S10: High-quality ASBs are more conserved than non-ASB sets across varying counts thresholds. This figure details the same analysis using PhastCons scores shown in Figure 3 but for non-ASB sets with a minimum counts threshold of 50.
(A) High-quality ASBs (dashed red line) harbour more conserved SNPs compared to count 50 non-ASB sets (green). (B) High-quality ASBs (dashed red line) have fewer non-conserved SNPs compared to count 50 non-ASB SNPs. PhastCons scores are pulled from the 30-way UCSC track for non-ASBs (green) and high-quality ASBs (red). The set of SNPs used to derive the sampled non-ASB sets was relaxed to a slightly lower read coverage minimum threshold of 50 counts. More specifically, the region containing the investigated SNP must not be called as an ASB in any cell line, for any TF, and have at least 50 reads mapping to it (see Methods). This is different to the analysis in the main figure which used a minimum count threshold of 100 reads.

Supplementary Figure S11: Low-quality ASBs are not better conserved than non-ASB sites. No significant difference is found in the number of (A) highly-conserved SNPs or (B) non-conserved SNPs when compared to the non-ASB set using a counts threshold > 100. Similar results are found for a counts threshold > 50 to determine non-ASB sets in (C) and (D).

Supplementary Figure S12: Trait/QTL associations for low-quality ASBs compared to non-ASB comparator sets. (A) As these non-ASB sets were matched on minor allele frequency (MAF), we see no significant difference between the MAFs for the low-quality ASB and non-ASB sets with p-values of 0.9 and 0.9 for the count 50 and count 100 thresholds, respectively. For low-quality ASBs, we find an enrichment for the (B) number of associated traits, but not for the number of colocalized (C) eQTLs, (D) sQTLs or (E) maximum V2G score. Data is queried from OpenTargets for the median non-ASB set for counts threshold of 100 and 50 and the low-quality set. The median non-ASB set was defined by choosing the SNP with the median MAF across all 1000 sampled sets, for each SNP (see Methods).
Note that the y-axis in plots (B-E) is log10-scaled.

Supplementary Figure S13: Overlap in features between ADASTRA ASB database and baal-nf ASB database. 295,990 ASBs (99%) predicted by baal-nf have not previously been reported by the ADASTRA resource [Abramov et al. (2021)] due largely to minimal overlap in the sets of TFs and cell lines investigated. Additional differences will be caused by ADASTRA not requiring heterozygous SNPs to be genotyped prior to its calling of ASBs. We find that (A) only 2,793 ASBs are found in both databases, meaning that 295,990 predicted by baal-nf (green) have not been previously reported by ADASTRA (orange). Notably, many of the features of tested SNPs were different across the two databases, including the (B) transcription factors investigated and (C) cell lines investigated.

Supplementary Figure S14: Flow chart detailing the sampling procedure for generating non-ASB sets from a reference set (high-quality ASB set as reference shown here).

Supplementary Figure S15: Pipeline for evaluating evolutionary, functional and trait relevance of non-ASB sets compared to reference ASB set (shown here as high-quality ASB set).

Acknowledgements

None to declare.

Footnotes

https://osf.io/rwjec/
https://github.com/BAAL-NF/baal-nf
https://www.encodeproject.org/
https://chip-atlas.org/

References

1. Rehm HL, Berg JS, Brooks LD, Bustamante CD, Evans JP, Landrum MJ, et al. ClinGen–the clinical genome resource. N Engl J Med. 2015;372(23):2235–2242.
2. Buniello A, MacArthur J, Cerezo M, Harris LW, Hayhurst J, Malangone C, et al. The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res. 2019;47(D1):D1005–D1012.
3. Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, Collins FS, et al.
Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci USA. 2009;106(23):9362–9367.
4. Frazer KA, Murray SS, Schork NJ, Topol EJ. Human genetic variation and its contribution to complex traits. Nat Rev Genet. 2009 Apr;10(4):241–251. doi: 10.1038/nrg2554.
5. Christmas MJ, Kaplow IM, Genereux DP, Dong MX, Hughes GM, Li X, et al. Evolutionary constraint and innovation across hundreds of placental mammals. Science. 2023;380(6643). doi: 10.1126/science.abn3943.
6. Cavalli M, Pan G, Nord H, Wallerman O, Wallén Arzt E, Berggren O, et al. Allele-specific transcription factor binding to common and rare variants associated with disease and gene expression. Hum Genet. 2016;135(4):485–497.
7. Maurano MT, Humbert R, Rynes E, Thurman RE, Haugen E, Wang H, et al. Systematic Localization of Common Disease-Associated Variation in Regulatory DNA. Science. 2012;337(6099):1190–1195. doi: 10.1126/science.1222794.
8. Inukai S, Kock KH, Bulyk ML. Transcription factor-DNA binding: beyond binding site motifs. Curr Opin Genet Dev. 2017 Apr;43:110–119. doi: 10.1016/j.gde.2017.02.007. Epub 2017 Mar 27. PMID: 28359978; PMCID: PMC5447501.
9. de Santiago I, Liu W, Yuan K, O’Reilly M, Chilamakuri CSR, Ponder BAJ, et al. BaalChIP: Bayesian analysis of allele-specific transcription factor binding in cancer genomes. Genome Biol. 2017;18:39. doi: 10.1186/s13059-017-1165-7.
10. Abramov S, Boytsov A, Bykova D, Penzar DD, Yevshin I, Kolmykov SK, et al. Landscape of allele-specific transcription factor binding in the human genome. Nat Commun. 2021;12:2751. doi: 10.1038/s41467-021-23007-0.
11. Zeng H, Hashimoto T, Kang DD, Gifford DK. GERV: a statistical method for generative evaluation of regulatory variants for transcription factor binding. Bioinformatics. 2016;32(4):490–496. doi: 10.1093/bioinformatics/btv565.
12. Bailey SD, Virtanen C, Haibe-Kains B, Lupien M. ABC: a tool to identify SNVs causing allele-specific transcription factor binding from ChIP-Seq experiments. Bioinformatics. 2015;31(18):3057–3059. doi: 10.1093/bioinformatics/btv321.
13. Worsley Hunt R, Wasserman WW. Non-targeted transcription factors motifs are a systemic component of ChIP-seq datasets. Genome Biol. 2014;15:412. doi: 10.1186/s13059-014-0412-4.
14. O’Dwyer MR, Azagury M, Furlong K, Alsheikh A, Hall-Ponsele E, Pinto H, et al. Nucleosome fibre topology guides transcription factor binding to enhancers. Nature. 2024. doi: 10.1038/s41586-024-08333-9.
15. Rauluseviciute I, Ruidavets-Puig R, Blanc-Mathieu R, Castro-Mondragon JA, Ferenc K, Kumar V, et al. JASPAR 2024: 20th anniversary of the open-access database of transcription factor binding profiles. Nucleic Acids Res. 2024;52(D1):D174–D182. doi: 10.1093/nar/gkad1059.
16. Menzel M, Hurka S, Glasenhardt S, Gogol-Döring A. NoPeak: k-mer-based motif discovery in ChIP-Seq data without peak calling. Bioinformatics. 2021;37(5):596–602. doi: 10.1093/bioinformatics/btaa845.
17. Li Y, Zhang X, Liu Y, Lu A. Allele-specific binding (ASB) analyzer for annotation of allele-specific binding SNPs. BMC Bioinformatics. 2023;24:464. doi: 10.1186/s12859-023-05604-6.
18. Xu S, Feng W, Lu Z, Yu CY, Shao W, Nakshatri H, et al. regSNPs-ASB: A Computational Framework for Identifying Allele-Specific Transcription Factor Binding From ATAC-seq Data. Front Bioeng Biotechnol. 2020;8:886. doi: 10.3389/fbioe.2020.00886. PMID: 32850739; PMCID: PMC7405637.
19. Andrews S. FastQC: a quality control tool for high throughput sequence data. Available from: http://www.bioinformatics.babraham.ac.uk/projects/fastqc.
20. Krueger F. Trimgalore. GitHub repository. 2021. Available from: https://github.com/FelixKrueger/TrimGalore.
21. Grau J, Grosse I, Posch S, Keilwagen J. Motif clustering with implications for transcription factor interactions. PeerJ PrePrints. 2015;e1302v1.
22. Kielbasa SM, Gonze D, Herzel H. Measuring similarities between transcription factor binding sites. BMC Bioinformatics. 2005;6:237. doi: 10.1186/1471-2105-6-237.
23. Jeon H, Lee H, Kang B, Jang I, Roh TY. Comparative analysis of commonly used peak calling programs for ChIP-Seq analysis. Genomics Inform. 2020;18(4):e42. doi: 10.5808/GI.2020.18.4.e42.
24. Cirillo LA, Lin FR, Cuesta I, Friedman D, Jarnik M, Zaret KS. Opening of compacted chromatin by early developmental transcription factors HNF3 (FoxA) and GATA-4. Mol Cell. 2002;9(2):279–289. doi: 10.1016/s1097-2765(02)00459-8.
25. Nitsch D, Boshart M, Schütz G. Activation of the tyrosine aminotransferase gene is dependent on synergy between liver-specific and hormone-responsive elements. Proc Natl Acad Sci U S A. 1993 Jun 15;90(12):5479–83. doi: 10.1073/pnas.90.12.5479. PMID: 8100067; PMCID: PMC46744.
26. Gao N, Zhang J, Rao MA, Case TC, Mirosevich J, Wang Y, et al. The role of hepatocyte nuclear factor-3 alpha (Forkhead Box A1) and androgen receptor in transcriptional regulation of prostatic genes. Mol Endocrinol. 2003 Aug;17(8):1484–507. doi: 10.1210/me.2003-0020. Epub 2003 May 15. PMID: 12750453.
27. Kaestner KH. The FoxA factors in organogenesis and differentiation. Curr Opin Genet Dev. 2010;20(5):527–532. doi: 10.1016/j.gde.2010.06.005.
28. Sérandour AA, Avner S, Percevault F, Demay F, Bizot M, Lucchetti-Miganeh C, et al. Epigenetic switch involved in activation of pioneer factor FOXA1-dependent enhancers. Genome Res. 2011 Apr;21(4):555–65. doi: 10.1101/gr.111534.110. Epub 2011 Jan 13. PMID: 21233399; PMCID: PMC3065703.
29. Overdier DG, Porcella A, Costa RH. The DNA-binding specificity of the hepatocyte nuclear factor 3/forkhead domain is influenced by amino-acid residues adjacent to the recognition helix. Mol Cell Biol. 1994;14(4):2755–2766. doi: 10.1128
30. Sharma NV, Pellegrini KL, Ouellet V, Giuste FO, Ramalingam S, Watanabe K, et al. Identification of the Transcription Factor Relationships Associated with Androgen Deprivation Therapy Response and Metastatic Progression in Prostate Cancer. Cancers (Basel). 2018 Oct 11;10(10):379. doi: 10.3390/cancers10100379. PMID: 30314329; PMCID: PMC6210624.
31. Naderi A, Meyer M, Dowhan DH. Cross-regulation between FOXA1 and ErbB2 signaling in estrogen receptor-negative breast cancer. Neoplasia. 2012 Apr;14(4):283–96. doi: 10.1593/neo.12294. PMID: 22577344; PMCID: PMC3349255.
32. Cusanovich DA, Pavlovic B, Pritchard JK, Gilad Y. The functional consequences of variation in transcription factor binding. PLoS Genet. 2014 Mar 6;10(3):e1004226. doi: 10.1371/journal.pgen.1004226. PMID: 24603674; PMCID: PMC3945204.
33. Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 2005;15(7):1034–50. doi: 10.1101/gr.4000105.
34. Ghoussaini M, Mountjoy E, Carmona M, Peat G, Schmidt EM, Hercules A, et al. Open Targets Genetics: systematic identification of trait-associated genes using large-scale genetics and functional genomics. Nucleic Acids Res. 2021;49(D1):D1311–D1320. doi: 10.1093/nar/gkaa840.
35. Wang J, Huang D, Zhou Y, Yao H, Liu H, Zhai S, et al. CAUSALdb: a database for disease/trait causal variants identified using summary statistics of genome-wide association studies. Nucleic Acids Res. 2020 Jan 8;48(D1):D807–16. doi: 10.1093/nar/gkz1026.
36. Wang J, Ouyang L, You T, Yang N, Xu X, Zhang W, et al. CAUSALdb2: an updated database for causal variants of complex traits. Nucleic Acids Res. 2025 Jan 6;53(D1):D1295–1301. doi: 10.1093/nar/gkae1096.
37. Mountjoy E, Schmidt EM, Carmona M, Schwartzentruber J, Peat G, Miranda A, et al. An open approach to systematically prioritize causal variants and genes at all published human GWAS trait-associated loci. Nat Genet. 2021;53(11):1527–1533. doi: 10.1038/s41588-021-00945-5.
38. Siraj L, Castro RI, Dewey H, Kales S, Nguyen TTL, Kanai M, et al. Functional dissection of complex and molecular trait variants at single nucleotide resolution. bioRxiv. 2024 May. doi: 10.1101/2024.05.05.592437.
39. Võsa U, Claringbould A, Westra HJ, Bonder MJ, Deelen P, Zeng B, et al. Large-scale cis- and trans-eQTL analyses identify thousands of genetic loci and polygenic scores that regulate blood gene expression. Nat Genet. 2021;53(11):1300–10. doi: 10.1038/s41588-021-00913-z.
40. Lonsdale J, Thomas J, Salvatore M, Philipps R, Lo E, Shad S, et al. The Genotype-Tissue Expression (GTEx) project. Nat Genet. 2013;45(6):580–5. doi: 10.1038/ng.2653.
41. Kerimov N, Hayhurst JD, Peikova K, Manning JR, Walter P, Kolberg L, et al. A compendium of uniformly processed human gene expression and splicing quantitative trait loci. Nat Genet. 2021;53(8):1290–9. doi: 10.1038/s41588-021-00924-w.
42. van der Wijst MGP, de Vries DH, Groot HE, Trynka G, Hon CC, Bonder MJ, et al. The single-cell eQTLGen consortium. Elife. 2020 Mar 9;9:e52155. doi: 10.7554/eLife.52155. PMID: 32149610; PMCID: PMC7077978.
43. Huang D, Feng X, Yang H, Wang J, Zhang W, Fan X, et al. QTLbase2: an enhanced catalog of human quantitative trait loci on extensive molecular phenotypes. Nucleic Acids Res. 2023 Jan 6;51(D1):D1122–8. doi: 10.1093/nar/gkac1020.
44. Chen J, Rozowsky J, Galeev TR, Harmanci A, Kitchen R, Bedford J, et al. A uniform survey of allele-specific binding and expression over 1000-Genomes-Project individuals. Nat Commun. 2016;7:11101. doi: 10.1038/ncomms11101.
45. Boytsov A, Abramov S, Aiusheeva AZ, Kasianova AM, Baulin E, Kuznetsov IA, et al. ANANASTRA: annotation and enrichment analysis of allele-specific transcription factor binding at SNPs. Nucleic Acids Res. 2022 Jul 5;50(W1):W51–6. doi: 10.1093/nar/gkac262.
46. Yan J, Qiu Y, Ribeiro dos Santos AM, Yin Y, Li YE, Vinckier N, et al. Systematic analysis of binding of transcription factors to noncoding variants. Nature. 2021;591(7850):147–51. doi: 10.1038/s41586-021-03211-0.
47. Ziemann M, Poulain P, Bora A. The five pillars of computational reproducibility: bioinformatics and beyond. Brief Bioinform. 2023 Nov;24(6):bbad375. doi: 10.1093/bib/bbad375.
48. Di Tommaso P, Chatzou M, Floden E, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017;35:316–319. doi: 10.1038/nbt.3820.
49. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9(4):357–359. doi: 10.1038/nmeth.1923.
50. Picard Toolkit. Broad Institute, GitHub Repository. Available from: https://broadinstitute.github.io/picard/.
51. Menzel M. GitHub repository. https://github.com/menzel/nopeak. October 28, 2022.
52. van Heeringen SJ, Veenstra GJC. GimmeMotifs: a de novo motif prediction pipeline for ChIP-sequencing experiments. Bioinformatics. 2011;27(2):270–271. doi: 10.1093/bioinformatics/btq636.
53. Perez G, Barber GP, Benet-Pages A, Casper J, Clawson H, Diekhans M, et al. The UCSC Genome Browser database: 2025 update. Nucleic Acids Res. 2025 Jan 6;53(D1):D1243–9. doi: 10.1093/nar/gkae974. PMID: 39460617; PMCID: PMC11701590.
54. McLaren W, Gil L, Hunt SE, Singh Riat H, Ritchie GRS, Thormann A, et al. The Ensembl Variant Effect Predictor. Genome Biol. 2016;17:122. doi: 10.1186/s13059-016-0974-4.
55. Luo Y, Hitz BC, Gabdank I, Hilton JA, Kagda MS, Lam B, et al. New developments on the Encyclopedia of DNA Elements (ENCODE) data portal. Nucleic Acids Res. 2020 Jan 8;48(D1):D882–9. doi: 10.1093/nar/gkz1062. PMID: 31713622; PMCID: PMC7061942.
56. Zou Z, Ohta T, Miura F, Oki S. ChIP-Atlas 2021 update: a data-mining suite for exploring epigenomic landscapes by fully integrating ChIP-seq, ATAC-seq and Bisulfite-seq data. Nucleic Acids Res. 2022;50(W1):W175–82. doi: 10.1093/nar/gkac199.
57. Gallone G, Haerty W, Disanto G, Ramagopalan SV, Ponting CP, Berlanga-Taylor AJ.
Identification of genetic variants affecting vitamin D receptor binding and associations with autoimmune disease . Hum Mol Genet . 2017 Jun 1; 26 ( 11 ): 2164 – 76 . doi: 10.1093/hmg/ddx092 . PMID: 28335003 ; PMCID: PMC5886188 . OpenUrl CrossRef PubMed 58. ↵ Roskams-hieter , B. Created in BioRender ( 2024 ) https://BioRender.com/l92t545 View the discussion thread. Back to top Previous Next Posted January 21, 2025. Download PDF Data/Code Email Thank you for your interest in spreading the word about bioRxiv. NOTE: Your email address is requested solely to identify you as the sender of this article. Your Email * Your Name * Send To * Enter multiple addresses on separate lines or separate them with commas. You are going to email the following baal-nf identifies motif-disrupting variants that decrease transcription factor binding affinity Message Subject (Your Name) has forwarded a page to you from bioRxiv Message Body (Your Name) thought you would like to see this page from the bioRxiv website. Your Personal Message CAPTCHA This question is for testing whether or not you are a human visitor and to prevent automated spam submissions. 