{"paper_id":"6cbd2d2c-fe89-467e-85ae-5b38ca523ee3","body_text":"Billion-Scale Deciphering of Human Gene Regulatory \nGrammar \nJoshua Mayne 1,2, Marino Exposito Rodriguez 3, Matthew Burrell 3, Francesca Ceroni 2,4,5*, Stephen \nGoldrick 1* \n \n1. Department of Biochemical Engineering, University College London, London, UK  \n2. Department of Chemical Engineering and Imperial College Centre for Synthetic Biology, Imperial College London, \nLondon, UK   \n3. Biologics Engineering R&D, AstraZeneca, Cambridge, UK  \n4. Bezos Centre for Sustainable Protein, London, UK  \n5. National Alternative Protein Innovation Centre (NAPIC, UK)  \n \n*Correspondence: f.ceroni@imperial.ac.uk, s.goldrick@ucl.ac.uk \n \nAbstract \nPredicting how DNA sequence specifies gene expression remains a core challenge across \nregulatory genomics. Most predictive assays and models depend on native genomic DNA, \nconstraining the full biochemical engineering space for assessing and designing new \nsequences. Here, we address this gap with a scalable experimental–computational platform \nthat rapidly generates million-scale sequence-to-expression datasets that directly link \ndegenerate sequences to their function in human cells.  \nWe built degenerate libraries of 200-bp promoter cassettes and performed pooled stable \nintegration of up to 10¹² unique constructs, enabling the curation of million-scale \nsequence-to-expression datasets by fluorescently sorting billions of human cells. Biophysical \nmodeling of transcription-factor occupancy on the data using position weight matrices reveals a \nbroad spectrum of correlations between factor abundance and expression levels, with some \nco-abundances reaching Pearson’s r ≈ 0.99, consistent with cooperative and probabilistic \nregulation. 
Leveraging the dataset, we trained sequence-to-expression deep learning models \nthat predict held-out expression with Pearson’s r ≈ 0.4, converge on shared sequence \ndeterminants, and agree strongly with each other (Pearson’s r = 0.93), indicating reproducible \nsequence-expression relationships. Finally, with minimal retraining the models generalize to an \nindependently generated dataset collected under distinct sorting conditions, transferring \nsequence rules across contexts. Our platform enables repeated, rapid studies and supports \ndeeper mechanistic insight while providing baseline models for forward design of human \nregulatory elements, advancing prediction beyond genomic-DNA-anchored methods. \n \nbioRxiv preprint doi: https://doi.org/10.1101/2025.11.10.687627; this version posted November 10, 2025. \nThe copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in \nperpetuity. It is made available under a CC-BY 4.0 International license. \n \nIntroduction \nGenetic engineering of human and mammalian cell lines underpins production of most biological \ntherapeutics (Zhang et al., 2023) (Singh et al., 2024) (Tihanyi and Nyitray, 2020). Despite this, \nthe design of genetic constructs still leans on a limited set of experimentally derived \ncomponents, with a limited choice of available promoters to guide expression. Traditional \ndesign-build-test-learn (DBTL) cycles for the generation of new synthetic promoters focus on \noptimisation through mutagenesis or assembly of existing sequences or genomically derived \nelements (Weingarten-Gabbay et al., 2019) (Zahm et al., 2024). As a result, these only tweak an \nexisting and constrained sequence pool, limiting their ability to rationally explore a wider design \nspace. 
\nTranscriptional regulation emerges from the interplay of protein-DNA and protein-protein \ninteractions, cis-regulatory sequence context and the biochemical environment of multi-protein \nassemblies. Thus, its outputs are context-dependent and non-linear, resisting traditional \nreductionist dissection. Accordingly, there is increasing demand for statistical and \nmachine-learning approaches to infer the governing principles of promoter function and to \ncapture the combinatorial interactions that underpin them. Assessment in silico, followed by \nde novo sequence generation based on the learnt ruleset, can potentially avoid the parameter \nbottlenecks of earlier DBTL cycles and create fully novel promoter parts. \nPrevious work in unicellular organisms has demonstrated the feasibility of understanding \nexpression biochemistry through machine learning. In bacteria, this has enabled gene regulation \nprediction and rational design of gene circuits (Parisutham et al., 2024) (Nguyen et al., 2024) \n(Palacios, Collins and Del Vecchio, 2025). Previous work in eukaryotes such as S. cerevisiae \nusing massively parallel reporter assays (MPRAs) shows that compact promoter designs can \ndrive precise and strong gene expression. In yeast, a short module containing minimal core \npromoter machinery plus a few upstream activating sequences is often sufficient to produce \nstrong expression (De Boer et al., 2020). Accordingly, yeast promoter assays show that, given \nsufficient reporter measurements, gene expression rules there are learnable (Vaishnav et al., \n2022). \nNo comparable platform has been achieved in human cells, owing to their greater regulatory \ncomplexity and the challenge of generating datasets of comparable scale. Unlike in yeast, \nhuman TFs bind comparatively promiscuously, assemble into context-dependent complexes, \nswitch between cooperative and antagonistic binding, and recruit a wide cast of cofactors \n(Gupta et al., 2009). 
Together, these behaviours create a highly plastic binding landscape in which the same \nmotif can support distinct regulatory outcomes depending on context. Most large human \ndatasets derive from endogenous genomes and proxy readouts (ATAC-seq, DNase-seq, or \nCAGE-seq), which report accessibility or transcriptional potential rather than the biochemical \nevents that determine reporter or protein expression (Avsec et al., 2021; Nguyen et al., 2024).  \nPredictors of regulatory activity span a spectrum from local grammar encoders to architectures \nthat integrate kilobase-scale context. Early convolutional networks functioned primarily as \nposition-tolerant motif detectors, capturing short-range syntax (DeepBind; DeepSEA) (Alipanahi \net al., 2015; Zhou and Troyanskaya, 2015). Subsequent dilated/residual convolutional stacks \nand attention-based architectures such as the Basenji and Enformer model families (Kelley et \nal., 2018; Avsec et al., 2021) further improved cross-locus generalization by modelling distal \ndependencies up to 100 kilobases. More recently, Borzoi models (Linder et al., 2025) \ndemonstrated strong predictive power from proxy readouts, yet these models remain inherently \nlimited by their reliance on the comparatively narrow sequence space of the endogenous genome and on \nmeasurements that capture regulatory potential rather than the full cascade leading to functional \nprotein output. \nIn humans, current approaches to expression profiling still rely predominantly on \ngenome-exclusive analyses, either through direct assays or through inference of features from \nsequence context. 
Such dependence risks propagating biases intrinsic to the repetitive and \ncompositionally constrained genome, thereby confounding causal mechanisms with correlation \nin downstream modelling (Avsec et al., 2021) (Rafi et al., 2025). Moreover, sampling only parts \nthat occur naturally in the genome does not explore the full space available to \nnovel synthetic promoters. There is therefore a clear need for experimental frameworks that \ncan reveal the causal rules governing regulatory activity: approaches that move beyond \ncorrelative inference to directly map sequence to function. \nOur work addresses this gap by leveraging a high-throughput platform capable of generating \nand characterising millions of sequences from billions of sorted cells, enabling advanced AI \nmodels to uncover these relationships. \nWe introduce a massively parallel degenerate-promoter reporter assay that measures \nexpression from a fully degenerate DNA sequence space, with millions of sequences assessed \nsimultaneously (Fig. 1A,B). These sequences are first analysed for protein-DNA and protein \ncomplex-DNA activity through in silico binding assays to survey their impacts on gene \nexpression (Fig. 3A). Then, by leveraging the results of our assay, we train two families of \npredictors, inspired by new developments in the fields of machine learning and computer vision. \nWe train attention models adapted to 1D genomics (including shifted-window/hierarchical \nattention with Multiscale Vision Transformer-style stage-wise downsampling) and U-Net hybrids that stack \nconvolutional stems with multiscale attention and interpretable scale-context aware attention.  \nBy curating datasets at the million scale, we uncover strong correlations between protein \navailability and gene expression levels through protein-DNA analyses. Our results reveal that \ngene expression strength reflects the combined influence of diverse, simultaneous \nprotein-protein interactions. 
Furthermore, gene expression was found \nto be accurately predictable from DNA sequence using the aforementioned deep learning architectures. \nWe anticipate that this scale of analysis will enable unprecedented exploration of the \nbiochemistry underlying gene expression, while bridging a key gap in current \nsequence-to-expression assays at million-scale resolution in humans. \nResults \nDesign of an experimental platform for sequence-to-expression \nmeasurements at scale \nWe reasoned that an effective functional promoter assay would need to (i) use sequences long \nenough to drive measurable expression; (ii) avoid lengths that render modeling intractable given \nwidespread TF promiscuity; and (iii) provide real, quantitative performance for each promoter in a \nsingle-cell context, measured so that resource competition would not interfere with readout. \nFinally, it would need to (iv) be able to produce the millions of individual sequences required to \ninterpret the sequence space (10⁶+ variants). Because no informative prior exists for human \npromoter grammars, we chose degenerate DNA libraries to sample the regulatory parameter \nspace directly rather than evolve from natural or hand-designed scaffolds, thereby escaping local \nminima imposed by genomic repetition and background motif structure. \n \n
To probe gene expression at the complex level, rather than at the level of individual transcription factors, \nsufficient sequence length and complexity were necessary to capture higher-order molecular \ninteractions. We therefore opted for a 200-bp promoter for our assessments of promoter function, \nreasoning that anything longer would make the parameter space prohibitively complex (Fig. 1A). This \npromoter contained a variable degenerate region of 164 bases with 18-bp flanking regions for \nscarless insertion using type IIS enzymes.  \nThe variable degenerate region was assembled upstream of a core promoter (Fig. 1A). We \nbelieved a minimal promoter would be ideal for minimising co-factor noise that may emerge \nfrom motifs inside the promoter. We selected the minimal promoter from the Tet-On 3G inducible \nsystem (Zhou, Lei and Zhu, 2020) for its minimal leakage of expression, so that any expression \nobserved could be expected to originate from the degenerate part rather than the core sequence. \nAdditionally, we considered that leaving the core to random chance would risk total intractability \nor too few positive cases to capture a sufficient design space. \n \nCo-expression of multiple synthetic constructs in a single human cell diverts shared \ntranscriptional and translational resources, producing non-linear artefacts that confound \nsequence-to-output interpretation at the single-cell level (Di Blasi et al., 2021, 2023). To \neliminate resource competition and enable direct DNA-to-protein measurements, we restricted \nour assay to one construct per cell. 
We implemented single-copy genomic integration at a \npredefined “landing pad,” which inserts a single expression cassette per cell and thereby \nstabilizes reporter output across assays (Matreyek, Stephany and Fowler, 2017; Zhang et al., \n2023; Noderer et al., 2014; Matreyek et al., 2018). Coupling this design with an eGFP reporter \nallows high-throughput, single-cell quantification by flow cytometry, preserving cell-to-cell \nheterogeneity while enabling robust, population-scale comparisons of promoter activity through \nFACS. This configuration therefore enables a single-copy, single-construct, single-cell readout, \nminimises dose and copy-number effects, reduces experimental noise, and provides a \nreproducible basis for ranking promoters and resolving even subtle sequence-dependent \ndifferences in reporter expression. \n \nTo minimise extrinsic genomic noise potentially introduced by inserting a cassette into the \nhighly complex expression environment of the genome, and to avoid interfering with host cell \nprocesses, we performed assays using genomic integration at a “safe harbour” locus. We selected \nthe TARGATT HEK293 Integration system as the cell line for our assessments. This cell line \ncontains a landing pad in the H11 safe-harbour locus, an intergenic site conserved across \nmammals and located in humans on chromosome 22 (Chi et al., 2019) (Zhu et al., 2014).  \nAs it cannot be guaranteed that safe harbours entirely avoid cryptic or undetected enhancer \nactivity, we designed the expression cassette to minimise influences on expression other than \nthe degenerate promoter by flanking it with the insulator sequences of the TCR alpha/delta locus BEAD-1 \nenhancer blocker (Zhong and Krangel, 1997) (Martella et al., 2017). By minimising external \ninfluences, we strengthen the attribution of observed interactions and transcription-factor \nbinding properties to the degenerate promoter itself.  
\nWe started by assembling a library of 10¹² (theoretical maximum) degenerate promoter \ncassettes through scarless insertion of the degenerate part into the gene expression cassette, \nfollowed by a large-scale bacterial transformation, as outlined in Fig. 1A. To recover as \nmany variants as possible, transformants were spread across large surface areas at low colony \ndensity, minimizing inter-colony competition and sequence bias (Mateyko and De Boer, 2024). \nThe plated area was estimated at 0.636 m², yielding colony counts of ~50 million by CFU calculation. \nPlasmids were collected through multiple maxiprep columns in an effort to maximise plasmid \nyield for subsequent plasmid-heavy transfection without losing diversity.    \n \nExperimental Platform Yields Millions of Unique Promoter \nSequences \nTo confirm the landing-pad system would work in our hands, we initially integrated an mCherry \nreporter under the control of the EF1a promoter and confirmed single-copy stability without \ndetectable silencing by flow cytometry over 14 days, the expression window selected \nfor our assay (Fig. 2A). Library transfection and integration using an eGFP fluorescent reporter \nwas performed in-flask with a modified transfection protocol (see Methods section), enabling \ntransfection over a much larger area for increased efficiency. Transfection was followed by \nenrichment to kill cells that did not contain an integrant. Enrichment continued at maximum strength up \nto FACS sorting after 14 days. 
This time gap between transfection and sorting additionally \nensured that fluorescence from unintegrated plasmid could not be misattributed to genomic integration, as residual \nplasmid would have been diluted out of cells by this point, as previously reported (Carreño et \nal., 2024). \nFlow cytometry of the variant library revealed broad fluorescence heterogeneity, ranging from as little as \n10 fluorescent units of expression to over 100,000 (Fig. 2A and 2B). Additionally, flow cytometry \ndid not detect any sign of bimodality or shouldering, typical signs of potential multicopy \nintegration (Fig. 2B). Despite selection for integrants, ~18% of cells (Fig. 2B) were \nnon-fluorescent, consistent with library members lacking functional promoter features or \ncontaining insulator or terminator-like elements that suppress transcription. \nBecause our goal was to sample the expression landscape broadly, gating and collection were \ntuned for a more uniform representation across the distribution rather than excluding \nnon-expressers. To quantify intrinsic and extrinsic variability and ensure repeated variants were \nrepresented, we sorted from pools comprising billions of cells from populations of the initially \nintegrated cells, in which identical cassettes recur. \nAs some of the highest-expression degenerate sequences were found to be comparable to \ncanonically strong promoters such as CMV (Supplementary Fig. 5), we concluded that a \ntop-expression bin of this level was appropriate. 
\nAn 18-bin plan maximized the dynamic range of sorted cells but yielded only hundreds of thousands of \ncells, insufficient for our predictive task, because the sorter permits only four-way splitting, \nconstraining collection to four intensity bins per run. We therefore established a five-bin scheme \nas the primary dataset (Fig. 2C), capturing millions of cells across runs and better reflecting the \nunderlying biochemical landscape. The 18-bin dataset was retained as an orthogonal, \nhigher-resolution out-of-sample set to test generalization. \nTogether, this strategy and the up-scaled transfection provided a scalable, cost-effective \nplatform that delivered a foundational dataset of 3 million+ and a validation set of 100k+ unique \ncell sequences (4 million+ when duplicates are included). \n \nTF Binding and TF-TF Interactions Demonstrate Rich Diversity in \nRoles and Abundances \nChromatin immunoprecipitation followed by sequencing (ChIP–Seq) remains a powerful method \nfor mapping protein-DNA interactions. By integrating independent assays across laboratories \nand experimental systems, the probabilistic (and often stochastic) nature of individual TF \nbinding site (TFBS) calls can be compensated for, resulting in consensus motifs with higher \nconfidence. These consensuses, distilled into curated motif libraries such as JASPAR and \nHOCOMOCO (Castro-Mondragon et al., 2022) (Kulakovskiy et al., 2018), have become the gold \nstandard in representing intrinsic sequence-specific binding preferences.  \n \n
We began with a descriptive analysis of the degenerate library, testing whether variability in \nexpression corresponds to variation in TF binding-site number and affinity. Confirmation of this \nlink would substantiate the contribution of TF binding to expression variation in our constructs. \nWe used position-specific scoring matrices (PSSMs) to compute the log-odds of each DNA sequence binding to \nall the proteins available in the HOCOMOCO and JASPAR datasets (Castro-Mondragon et al., \n2022) (Kulakovskiy et al., 2018), totalling 2824 proteins. We used a moderate stringency to \ncompensate for the promiscuity of DNA binding sequences while avoiding missing any \nsequence interactions that may exist. We anticipated that, with sufficient stringency, the diverse but \nsubtle trends of promoter abundance would be detectable through the noise of occasional \nmisattribution of TF binding. We additionally corrected for background base composition \nto account for the degenerate background. \nAcross the panel of position weight matrices (PWMs) analyzed, relative abundances were \nnormalised by the respective cell count per bin. We found that nearly all TFs demonstrated at \nleast some linear correlation between their relative abundance and sequence expression FACS \nbins (Fig. 3A). 
Strikingly, some TFs exhibited correlations with expression as strong as \n+0.85 and –0.99 when assessed by Pearson's r.  \nTo gauge how much expression is predictable from TF occupancy alone, we trained \ninterpretable shallow-learning models on PSSM-derived features. Linear regression performed \nbest, predicting reporter output with Pearson r ≈ 0.3 on held-out data. This establishes that \nprotein-occupancy assessment can carry significant signal for expression, but also that \nadditional information beyond PWM scores is required to reach higher accuracy. \n \nWe next examined TF-TF interactions themselves in a data-driven manner by jointly modeling \nTF abundance and occupancy.  \nUnlike in vitro high-throughput protein-protein interaction studies, which often rely on binary \nassays outside the cell environment (Berger and Bulyk, 2006) or are not performed in the context of gene \nexpression strength regulation (Mohammed et al., 2016), this strategy enables the \nsimultaneous assessment of thousands of TF-TF combinations within a physiological snapshot. \nMoreover, by analysing binding to degenerate DNA sequences devoid of pre-existing chromatin \nmarks, we were able to evaluate TF biochemistry in a minimally confounded context.  \nAs demonstrated by the interaction heatmap (Fig. 4A), the assessment captured a broad \ndiversity of cooperative and antagonistic relationships, many of which would be difficult to \nresolve by conventional pairwise approaches. 
Incorporating TF-TF co-abundance into \nexpression-correlation analyses revealed that combinations of TFs often explained more \nvariance than individual factors, with some combinations reaching near-perfect correlations (r ≈ \n0.99, Fig. 3C). This supports the long-standing hypothesis that higher-order TF complexes, \nrather than isolated factors, are primary drivers of transcriptional regulation in human cells. A full \nmap of all interaction pairs revealed a great diversity in co-abundance correlations with \ntranscription (Fig. 4A).  \n \nTo further dissect how TF abundance patterns shape gene expression, we applied a linear \nregression framework to model expression levels as a function of TF abundance for each of the \nfive bins in the foundational expression dataset. This approach provides a direct and interpretable \nparameterization of the relationship between TF levels and transcriptional output, enabling the \nidentification of both individual and cooperative effects among thousands of TFs (Fig. 4B). \nDespite the simplicity of the model, regression fits captured substantial variation in expression \nfor a large fraction of TF pairs, revealing that even shallow parameterizations can explain \ncomplex abundance–expression relationships. The resulting interaction coefficients (Fig. 4C) \nexpose a rich landscape of TF–TF dependencies. Some factors act additively to enhance \nexpression, while others exhibit antagonistic or mutually repressive dynamics. This diversity of \ninteraction signs and magnitudes highlights that expression control emerges not merely from the \n
presence of TFs, but from their relative balance and combinatorial availability within the cellular \nenvironment. \nBy projecting these interaction parameters back onto model performance (Fig. 4B), we show \nthat high explanatory power can arise from both synergistic and antagonistic TF interactions, \nemphasizing that both cooperative activation and competitive repression are central to \nexpression tuning. Although exhaustive modeling of higher-order interactions would rapidly \nscale into billions of combinations, our findings demonstrate that even pairwise linear models \ncapture a substantial portion of the explainable variance in expression. These results motivated \nour subsequent use of more parameter-efficient nonlinear methods, such as deep learning, to \ngeneralize this framework to higher-dimensional interaction spaces. \nAltogether, by systematically regressing in silico TF abundance data against observed \nexpression, we uncover a broad spectrum of abundance-expression couplings that reflect the \nunderlying logic of transcriptional control in human cells. The approach transforms large-scale \nbinding and abundance data into an interpretable map of regulatory dependencies, revealing \nthat expression predictability arises from structured and diverse TF-TF relationships rather than \nfrom any single dominant driver. \nCell-Specific Human Gene Expression is Predictable \nAs one of the primary goals of this assay is to predict the mapping from human DNA sequence to \nexpression, and to compensate for the aforementioned complexity of this system, we elected to \nuse deep learning for our primary predictive purposes. We adopted two major architectures for \nour sequence-to-expression prediction: a two-dimensional convolutional U-Net/attention hybrid \n(Fig. 5A) and a Hybrid Window (HWin)-style attention model (Fig. 5B).  
\nAdvances in attention-based computer vision offer useful design principles for sequence \nmodels. Vision transformers now combine multiscale representations with windowed \nself-attention to balance accuracy and efficiency. In the “shifted-window” scheme, inputs are \ndivided into non-overlapping windows and self-attention is computed within each window. \nShifting the window grid between layers allows neighbouring windows to exchange information, \npreserving locality while approximating global interactions at much lower cost (for example, the \nSwin Transformer (Liu et al., 2021)). Related multiscale architectures use hierarchical \ndown-sampling via token pooling or strided convolutions so that receptive fields expand with \ndepth even as the token count contracts, as in multiscale vision transformers (MViTs) (Fan et al., \n2021). Together, these ideas motivate DNA models that capture local context precisely, transmit \ninformation across larger genomic neighbourhoods efficiently, and do so within practical \ncompute budgets. In practice, we use the Swin elements for layer-wise detection of sequence features \nand employ MViT attention down-sampling so that receptive fields grow with depth while the token count \nshrinks.  \nThe 1D HWin lets us test, head-to-head, what multiscale attention can add beyond pure CNNs \non our degenerate promoters, offering an alternative method for detecting and predicting \nbiochemical activity from sequence. Given the rich diversity in protein binding detected in the \nprevious section, we designed the model to detect TF motifs using a short encoder, followed by \n
rows of Swin blocks, custom designed for sequence detection by enabling further hierarchical \nwindow-attention for motif detection (Fig. 5B). MViT-style down-sampling then collapses the \nsequence while retaining an informative representation of the motif layer before further Swin \nblocks detect the higher-order interactions inside each sequence. In addition, the Swin heads \nuse a novel hierarchical attention, intended to evaluate short-range and longer-range \nbase interactions simultaneously, so that the attention mechanism captures both scales of \ninteraction in a given DNA input. \nIn addition to ViTs, other convolution/attention hybrids, including some U-Net-style architectures that \nhave found great success in other fields such as biomedical imaging (Pan et al., 2025), have \nbeen effective in identifying key short regions.  \nThe ethos of using a U-Net is to use convolutional filters in the contracting path to act as motif \ndetectors, capturing fine-grained TF binding motifs and short-range syntax that underlie local \ngene-regulatory signals, and then map these across the local and global context of a sequence. \nWe engineer the encoder to detect the complex motif grammar of the promoter sequence; as \ndepth increases, receptive fields expand to encode longer sequence grammars and positional \ninterdependencies that influence promoter output. The core element consists of layers of \nmulti-head attention, intended to evaluate the detected motifs and their interdependencies’ \nimpacts on promoter strength. The decoder path should then reconstitute base- or window-level \npredictions by fusing these long-range representations with high-resolution features passed \nthrough skip connections, preserving precise nucleotide context while integrating broader \nregulatory cues. The same sequence is fed twice through the model. 
Encoding the same DNA \nsequence as a 2D input compels the model to reconcile multiple contextual views of each \nbase, capturing motif co-occurrence, spacing, and shape-coupled dependencies, thereby \nsuppressing rote k-mer memorization and encouraging generalizable, biochemically grounded \nrepresentations. \nFor both models, we used simple relative positional encodings suited to 1D distances, as we \nbelieved it was important to inform the model of the promoters’ inter-motif proximities while \nadditionally providing closeness to the core promoter. Both models additionally contain final \nsqueeze-and-excitation/convolutional layers to reduce channel sizes in an attention-conscious \nmanner, to avoid flooding the final flattening and fully connected layers with too much \ninformation and noise at once. The fully connected layers taper in a triangular channel fashion to \nminimise signal loss towards the final regression head. As we were modelling the mean occupancy \nof sequences in the FACS bins, a continuous target, we considered Huber loss or mean squared \nerror (MSE) the most suitable loss functions. 
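As a minimal sketch of the loss choice described above (not the authors’ training code), the following pure-Python functions compare Huber loss with MSE on a continuous target. The function names, the delta threshold of 1.0 and the toy values are illustrative assumptions and do not come from the paper.

```python
def huber_loss(y_true, y_pred, delta=1.0):
    # Mean Huber loss: quadratic for residuals up to delta, linear beyond it.
    total = 0.0
    for t, p in zip(y_true, y_pred):
        r = abs(p - t)
        if r <= delta:
            total += 0.5 * r * r
        else:
            total += delta * (r - 0.5 * delta)
    return total / len(y_true)


def mse_loss(y_true, y_pred):
    # Mean squared error: every residual is penalised quadratically.
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical log10 expression values; the last prediction is a large miss.
measured = [2.0, 2.5, 3.0, 3.5]
predicted = [2.1, 2.4, 3.2, 6.5]

print('Huber:', huber_loss(measured, predicted))
print('MSE:  ', mse_loss(measured, predicted))
```

In this toy case the single large residual contributes linearly to the Huber loss but quadratically to the MSE, which is one reason Huber loss is often considered when a small number of noisy bin assignments could otherwise dominate training.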
\nThe U-Net scaled well to large, high-dimensional biological data (millions of positions/genes) \nwithout additional assistance, whereas linear regression struggles with collinearity and high \ndimensionality. It achieved a Pearson’s r of 0.38 (Fig. 5E) in predicting expression from fully \ndegenerate DNA sequence on a randomly selected hold-out set, which to our knowledge is a first \nin probing regulatory biochemistry free from the influence and constraints of the human genome. \nIn real terms, this means ~80% of predictions are within 1.5 mean absolute error in log10 geomean \nexpression (Fig. 5D). With minimal retraining on the 18-bin dataset, we then achieved a Pearson’s r \nof 0.20 in transfer to a completely different dataset, supporting the view that learned human cell \nbiochemistry transfers from the foundational model to the new one. \nThe Hwin hybrid window-attention transformer yielded a Pearson’s r of 0.35, slightly lower than the \nU-Net architecture, but again demonstrates that gene expression in HEK cells can be predicted \nfrom sequence. The correlation between the two models’ predictions (Fig. 5D) was 0.93, indicating \nthat both extract the same underlying biochemical signal from the data rather than overfitting \nidiosyncrasies. Robustness analyses on this predictable regime further showed substantial \nsequence diversity, consistent with detection of human regulatory biochemistry rather than trivial \nsequence trends. \nTo interrogate how local and global sequence features are processed by the model, the U-Net \nincluded a modification replacing the standard skip concatenation/addition with a learned \ncross-attention gate. In this scheme, decoder queries attend over encoder keys/values, yielding \nper-base (and motif-level) attention maps while selectively passing only the most informative \nlocal features forward. This design exposes an interpretable proxy of the model’s internal \nattributions: Fig. 
5F illustrates that the network highlights specific nucleotides and short motifs \nwhen calling expression strength, linking decoder decisions to concrete sequence positions. \nNotably, the model’s attention concentrates on clustered bases rather than isolated positions, a \npattern consistent with transcription-factor binding sites and their spacing grammars. While \nattention does not by itself establish causality, these maps generate testable hypotheses. Future \nwork will pair attention readouts with wet-lab rational design to explore whether this approach can \nsupport in silico design of synthetic sequences. \nDiscussion \nIn this work, we have created and assessed, for the first time, million-scale sequence-to-expression \ndatasets in human cell lines. We assess the full scope of gene expression from DNA \nsequence to protein expression levels and have successfully trained models that can predict \nthese levels. We used safe-harbour, single-copy genomic integration to eliminate copy-number \nand replication artefacts and to minimise resource competition noted in prior studies. This \nchoice improves stability and interpretability of expression measurements and enables us to \nassess constructs efficiently. \nTF binding sites detected using PSSMs in the degenerate DNA sequences explain meaningful \nvariance in our reporter data, with individual TFs showing strong positive and negative \nassociations with expression, consistent with activator- and repressor-like behaviour in context. \nJoint modelling reveals that TF–TF interactions outperform single factors (some combinations \napproaching r≈0.99). 
This supports the well-established hypothesis that interaction \nthrough complexes is a major driver of human gene expression (Jolma et al., 2015). \n \nWe reasoned that, unlike sequence MPRAs that typically isolate one or two transcription-factor \n(TF) binding events per construct, resolving human regulatory logic requires assaying the \nconcurrent action of many TFs within the same sequence context. When leveraging a yeast TF \ndataset, our model achieved strong sequence-to-expression performance (Pearson’s r ≈ 0.99; \nSupplementary Fig. 6) (De Boer et al., 2020). Because the model performed so well on these \nsimpler cases, we hypothesized it could be pushed to learn combinatorial regulatory grammar, \nan expectation further inspired by prior deep sequence models capable of capturing distal and \ncombinatorial effects (e.g., Enformer, Avsec et al., 2021; Kelley et al., 2018). On this basis, we \nadopted a larger 200-bp design rather than smaller oligonucleotides. This choice trades locus \nprecision for combinatorial context, enabling our model to learn higher-order TF interplay and, in \nsome cases, to achieve expression levels comparable to strong promoters (see Supplementary \nInformation). \nOur platform directly measures the functional consequences of promoter sequences from \ntranscript to protein. By reading outputs end-to-end, it captures the combined effects of \nmultiple regulators and reveals a combinatorial “grammar,” in which motif spacing and \norientation encode cooperative and antagonistic interactions beyond simple PWM additivity. \nAlthough the human genome offers a richly layered regulatory environment, it is also highly \nrepetitive and correlated across scales (Haubold and Wiehe, 2006), risking misattribution of \nsequence patterns as biochemical drivers and train/test leakage. By assaying a degenerate \nDNA sequence space instead of native loci, we escape these redundancies and gain an \neffectively unbounded combinatorial landscape. 
Degenerate sequences naturally contain \nvariations of motifs, their counts, spacings and orientations, enabling clean, non-homologous \ntrain/validation/test splits. In turn, models are forced to learn the underlying biochemistry and \nregulatory grammar rather than memorise recurrent genomic patterns, yielding a more rigorous \nassessment of generalisation. \nWe adapted developments in computer vision (MViT, Swin, U-Net) to treat the \nDNA as a 1D “image”, capturing motif-level signals locally while enabling selective, longer-range \ncommunication. As a result, predictive models are compelled to learn the biochemistry of gene \nexpression (motif dosage, spacing/orientation rules, and higher-order behaviours) rather than \nmemorising sequences. \nOn a fair, non-homologous train/test split of our multi-million-member library, the model predicts \nreporter output with Pearson r ≈ 0.4 (P ≈ 1×10⁻⁹⁹⁹). To our knowledge, this is the first \ndemonstration of decisive sequence-to-expression prediction in human cells from a fully \nsynthetic, degenerate sequence space. Despite a substantial distribution shift from five FACS \nbins to a uniform 18-bin design, the model generalizes with r ≈ 0.2 (P ≈ 7×10⁻⁴⁸) with minimal \nadaptation, indicating learned sequence logic rather than dataset memorization. While an r ≈ 0.4 \nmay appear modest, it represents a meaningful advance given the combinatorial complexity of \nhuman promoter biochemistry and the unbiased nature of the sequence space being \ninterrogated. \nAlthough our 200-bp design is shorter than the inputs of methods that operate at the kilobase \nscale (Avsec et al., 2021), it nonetheless recovers a broad diversity of sequence activity, ranging \nfrom strongly active to inactive sequences. This compactness makes the approach appealing \nfor engineering short, potent promoters. More broadly, the model’s parameters encode \nbiochemical regularities (motif identity, dosage, and spacing/orientation grammars) that should \ngeneralise to longer inputs and help constrain long-range expression predictions, consistent \nwith prior demonstrations of grammar-based generalisation (e.g., De Boer et al., 2020; Taipale and \ncolleagues). \n \nThis platform establishes a direct, quantitative link between DNA sequence and expression, \nproviding a ground truth for modelling, interpretation and design. The inferred rules of motif \ndosage and grammar translate directly to compact, predictable expression cassettes suitable for \ntherapeutic and manufacturing applications. By design, we operate in a controlled, non-native \ncontext with single-copy integration at an insulated safe-harbour locus to minimise position \neffects and upstream or downstream interference. The same design principles and experimental \narchitecture are readily transferable to other genomic loci, cell types, and regulatory regions, \nproviding a generalizable route to dissect and engineer additional layers of gene expression \ncontrol beyond the core promoter. We anticipate that with broader adoption, accumulating data \nand community-driven refinement will further enhance model accuracy and interpretability, \naccelerating the convergence between empirical and predictive design. \n \nThis work establishes a powerful, generalizable platform for learning the biochemical \nrules of human gene expression. 
We view this as a blueprint for new experimental workflows; \nwith community-wide adoption across loci, promoters and cell types, it could improve overall \nmodel performance through transfer of information between models and better foundational \ndata sources. \nMethods \nCloning of Backbone \nAll PCRs for cloning of the backbone used NEB Phusion according to the manufacturer’s instructions. \nOligonucleotides used as primers were ordered from Integrated DNA Technologies (IDT). \nSequences for these can be found in supplementary table 1. \nIntegration sites and Blasticidin resistance were cloned by PCR from the TARGATT Cell line \nPromoter/Blasticidin Plasmid provided upon purchase of the TARGATT HEK293 H11 Kit. \nSequences of these cannot be shared due to an NDA with Applied StemCell. \nA pre-existing plasmid from the Ceroni Lab consisting of EMMA toolkit backbone, insulator P2, \npromoter CMV, cds eGFP, terminator SV40 and insulator P2 was linearised by PCR and BsmBI \nsites were included. Sequences of these can be found in supplementary table 2. \nThe Tet-On 3G system minimal core promoter was cloned out of the Tet-On 3G system using PCR, \nand BsmBI cloning sites were added. The sequence of this minimal promoter can be found in \nsupplementary table 2. \nParts of all prior plasmids were run for 1 hour on a 1% agarose gel in TAE buffer along with a \n1 kb+ ladder for fragment identification. SYBR Safe from Invitrogen was used to visualise the \nDNA for extraction at the manufacturer’s recommended concentration, as was New England Biolabs \nLoading Dye. 
\nThe linear DNA fragments were extracted from the gel using the Qiagen Gel Extraction Kit \nfollowing the manufacturer’s instructions. Sequences were assembled by NEB BsmBI digestion and \nligation in one-pot Golden Gate reactions as per the manufacturer’s instructions for scarless \nassembly of the expression cassette. Assembly was confirmed by sequencing by Full Circle Labs. \nBsmBI sites were cloned into the plasmid by PCR to host degenerate oligonucleotides, followed by \nblunt-ended ligation with T4 DNA ligase from New England Biolabs using the manufacturer’s \ninstructions. \n1 ml of competent cells transformed with backbone plasmid and cultured overnight at 37 °C in \nLB Agar made up per manufacturer (Formedium) instruction, in addition to 100 mg/ml Ampicillin, \nwas combined with 1 ml autoclaved glycerol from Sigma Aldrich to make stocks to be stored in a \n−80 °C freezer. \nPositive control plasmid (EF1a/mCherry/SV40) was provided by Applied StemCell as part of the \nTARGATT™ system. \nDuplex of Oligonucleotides \nOligonucleotides were ordered from IDT, as were the primers for their duplexing and subsequent \ncloning into the backbone. \nDuplexing of the oligonucleotides was carried out using Phusion PCR, used according to the \nmanufacturer’s instructions. 200 ng of oligonucleotide was subjected to 5 cycles of PCR to ensure \ncomplete duplexing without significant bias to the distributions of each sequence. This mix was \npurified using a Qiagen PCR purification kit according to the manufacturer’s instructions. \nGolden Gate Cloning of Parts \nGolden Gate was carried out at the maximum amounts recommended by NEB. 1 µl of NEB \nBsmBI and 0.5 µl of NEB T4 Ligase were added to 75 ng of backbone plasmid and 75 ng of \nduplexed oligonucleotide. Additionally, 4 µl of T4 Ligase Buffer was added and the solution was \nmade up to 20 µl with deionised water. \nThe reaction mixture was purified using a Qiagen PCR Purification Kit according to the \nmanufacturer’s instructions. 
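The Golden Gate reaction above mixes equal masses (75 ng each) of backbone and duplexed oligonucleotide, so the molar ratio between them depends only on fragment length. The arithmetic can be sketched roughly as follows. The 650 Da per base pair figure is a common approximation for double-stranded DNA; the 5,000-bp backbone and 200-bp insert lengths are hypothetical placeholders, since neither the backbone size nor the full duplexed oligonucleotide length is stated in this section.

```python
# Approximate average molar mass of one base pair of dsDNA (g/mol).
DA_PER_BP = 650.0


def dsdna_pmol(mass_ng, length_bp):
    # Convert a dsDNA mass in nanograms to picomoles of molecules.
    return mass_ng * 1000.0 / (length_bp * DA_PER_BP)


# Hypothetical fragment lengths, chosen only for illustration.
BACKBONE_BP = 5000
INSERT_BP = 200

backbone_pmol = dsdna_pmol(75.0, BACKBONE_BP)
insert_pmol = dsdna_pmol(75.0, INSERT_BP)
insert_to_backbone = insert_pmol / backbone_pmol

print(f'backbone: {backbone_pmol:.4f} pmol')
print(f'insert:   {insert_pmol:.4f} pmol')
print(f'insert:backbone molar ratio = {insert_to_backbone:.1f}')
```

Under these assumed lengths the insert is in large molar excess over the backbone; the real ratio would need the actual fragment lengths substituted in.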
\nLarge Scale - Bacterial Transformation \nMegaX DH10B cells were used during bacterial transformation. These are electrocompetent \ncells provided by Thermo Fisher. A Bio-Rad Gene Pulser X was used for transformation according to \nthe manufacturer’s recommendations. The electroporated cells were then placed in 1 ml recovery \nmedia for one hour before being plated on 100 ampicillin plates. \nBacterial Collection \nBacteria were scraped by first washing with deionised water and then collected by pipetting and \nL-shaped scrapers. Care was taken to recover as many bacteria as possible \nfrom every plate. Plasmid DNA was then extracted from these colonies using the Qiagen Maxiprep \nKit. 4 preps were required in total to extract the DNA. \nMammalian Transfection \nTARGATT modified HEK293 cells were thawed (P1) and grown in T25, then T75, flasks up to the \nrequired confluence. These cells were then seeded in T75 flasks for 80 percent confluence the next \nday. Transfection was carried out using Xtreme Gene HP according to the manufacturer’s \ninstructions, at the ratio of integrase to plasmid cargo given in the TARGATT instructions. \nBlasticidin Enrichment \nAccording to the TARGATT Enrichment Protocol, 24 hours after splitting into T175 flasks, \nBlasticidin was added at 10 µg/ml. Care was taken from this point to ensure that cells did \nnot become too confluent. Enrichment proceeded for 10 days until 14 days had passed since \ntransfection had begun. Visible death of the negative control occurred within 48 hours of \nBlasticidin introduction. 
\nCell Flow Cytometry \nCell plates had their media removed and were washed with Sigma Aldrich pH 7.4, liquid, \nsterile-filtered Thermo Fisher Phosphate Buffered Saline (PBS) to dislodge the cells. Cells were \nplaced in 1.5 ml Eppendorf tubes and centrifuged at 1500 RPM for 5 minutes in an Eppendorf \nTabletop Centrifuge 5424R. Supernatant was discarded and cells were resuspended to a \nconcentration of 10 million cells per millilitre in fresh PBS before being passed through a metal \nmesh into a flow tube. \nFlow analysis was carried out immediately after prep, using a MACSQuant® X Flow Cytometer \nfor all experiments. The apparatus was calibrated each time using MACSQuant® Calibration \nBeads. Flow was carried out at medium flow rate (~800 events per second) for all experiments. \n50 µl of each sample was analysed. \nAnalysis was carried out using the MACSQuant B1 Laser, suited for GFP analysis, at its \nminimum voltage (100 V). \nCells were gated by FSC-A and SSC-A to identify living cells. These living cells were further gated \nby FSC-A and FSC-H. Negative controls were used to gate GFP/FITC-A fluorescence read by \nthe B1 laser. \nFor cell FACS, cells were prepared in the same manner. Cells were collected in 1 ml PBS and \nimmediately prepped for genomic extraction. \n \nGenomic extraction was carried out using the NEB Monarch® Spin gDNA Extraction Kit \naccording to the manufacturer’s instructions. \n \nIllumina Indexing \nPCR was used to attach index adapters to Illumina reads. 
Illumina indexing was carried out \nusing NEBNext® Multiplex Oligos for Illumina® (Dual Index Primers Set 1) according to \nthe manufacturer’s instructions. \n \nPaired-end 150-bp Illumina multiplex sequencing and demultiplexing was carried out by Azenta \nLife Sciences. \nDry-Lab - Computation Resources \nGoogle Cloud resources consisted of two A100s, 24 N1 CPUs and 170 GB RAM, which were \nused for model training. Data processing and further data modelling were also carried out using \nGoogle Colab Pro+ (80 GB A100, 170 GB RAM). \nSee Supplementary for details on Illumina processing. \nAuthor contributions. JM, FC and SG conceived the research. JM performed the experiments \nand computational work. All authors wrote and edited the manuscript. \n \nAcknowledgements. This work was supported by the EPSRC Centre for Doctoral Training in \nBioDesign Engineering (EP/S022856/1) (to J.M., F.C. and S.G.) and co-funded by the \nUCL-AstraZeneca Centre of Excellence. F.C. was also partly funded by the Bezos Earth \nFund through the Bezos Centre for Sustainable Protein (BCSP/IC/001), the UK National \nAlternative Proteins Innovation Centre (NAPIC), which is an Innovation and Knowledge Centre \nfunded by the Biotechnology and Biological Sciences Research Council (BBSRC) and Innovate \nUK (BB/Z516119/1). F.C. was partly funded by the Engineering and Physical Sciences Research \nCouncil under the EEBio Programme Grant (EP/Y014073/1) and by the Chan Zuckerberg \nInitiative. \nDeclaration of interests. The authors declare no competing interests. \n \nReferences \nAlipanahi, B. et al. 
(2015) ‘Predicting the sequence specificities of DNA- and RNA-binding \nproteins by deep learning’, Nature Biotechnology, 33(8), pp. 831–838. Available at: \nhttps://doi.org/10.1038/nbt.3300. \nAvsec, Ž. et al. (2021) ‘Effective gene expression prediction from sequence by integrating \nlong-range interactions’, Nature Methods, 18(10), pp. 1196–1203. Available at: \nhttps://doi.org/10.1038/s41592-021-01252-x. \nBerger, M.F. and Bulyk, M.L. (2006) ‘Protein Binding Microarrays (PBMs) for Rapid, \nHigh-Throughput Characterization of the Sequence Specificities of DNA Binding Proteins’, in \nMinou, B., Gene Mapping, Discovery, and Expression. New Jersey: Humana Press, pp. \n245–260. Available at: https://doi.org/10.1385/1-59745-097-9:245. \nCarreño, A. et al. (2024) ‘Tuning plasmid DNA amounts for cost-effective transfections of \nmammalian cells: when less is more’, Applied Microbiology and Biotechnology, 108(1), p. 98. \nAvailable at: https://doi.org/10.1007/s00253-024-13003-x. \nChi, X. et al. (2019) ‘A system for site-specific integration of transgenes in mammalian cells’, \nPLOS ONE. Edited by Z. Cui, 14(7), p. e0219842. Available at: \nhttps://doi.org/10.1371/journal.pone.0219842. \nDe Boer, C.G. et al. (2020) ‘Deciphering eukaryotic gene-regulatory logic with 100 million \nrandom promoters’, Nature Biotechnology, 38(1), pp. 56–65. Available at: \nhttps://doi.org/10.1038/s41587-019-0315-8. \nDi Blasi, R. et al. (2021) ‘A call for caution in analysing mammalian co-transfection experiments \nand implications of resource competition in data misinterpretation’, Nature Communications, \n12(1), p. 2545. Available at: https://doi.org/10.1038/s41467-021-22795-9. \nDi Blasi, R. et al. (2023) ‘Resource-aware construct design in mammalian cells’, Nature \nCommunications, 14(1), p. 3576. Available at: https://doi.org/10.1038/s41467-023-39252-4. \nFan, H. et al. (2021) ‘Multiscale Vision Transformers’, in 2021 IEEE/CVF International \nConference on Computer Vision (ICCV). 
2021 IEEE/CVF International Conference on Computer \nVision (ICCV), Montreal, QC, Canada: IEEE, pp. 6804–6815. Available at: \nhttps://doi.org/10.1109/ICCV48922.2021.00675. \nGupta, P. et al. (2009) ‘PU.1 and partners: regulation of haematopoietic stem cell fate in normal \nand malignant haematopoiesis’, Journal of Cellular and Molecular Medicine, 13(11–12), pp. \n4349–4363. Available at: https://doi.org/10.1111/j.1582-4934.2009.00757.x. \nHaubold, B. and Wiehe, T. (2006) ‘How repetitive are genomes?’, BMC Bioinformatics, 7(1), p. \n541. Available at: https://doi.org/10.1186/1471-2105-7-541. \nJolma, A. et al. (2015) ‘DNA-dependent formation of transcription factor pairs alters their binding \nspecificity’, Nature, 527(7578), pp. 384–388. Available at: https://doi.org/10.1038/nature15518. \nKelley, D.R. et al. (2018) ‘Sequential regulatory activity prediction across chromosomes with \nconvolutional neural networks’, Genome Research, 28(5), pp. 739–750. Available at: \nhttps://doi.org/10.1101/gr.227819.117. \nLinder, J. et al. (2025) ‘Predicting RNA-seq coverage from DNA sequence as a unifying model \nof gene regulation’, Nature Genetics, 57(4), pp. 949–961. Available at: \nhttps://doi.org/10.1038/s41588-024-02053-6. \nLiu, X. et al. (2017) ‘In Situ Capture of Chromatin Interactions by Biotinylated dCas9’, Cell, \n170(5), pp. 1028-1043.e19. Available at: https://doi.org/10.1016/j.cell.2017.08.003. \nLiu, Z. et al. (2021) ‘Swin Transformer: Hierarchical Vision Transformer using Shifted Windows’. \narXiv. Available at: https://doi.org/10.48550/arXiv.2103.14030. \nMartella, A. et al. 
(2017) ‘EMMA: An Extensible Mammalian Modular Assembly Toolkit for the \nRapid Design and Production of Diverse Expression Vectors’, ACS Synthetic Biology, 6(7), pp. \n1380–1392. Available at: https://doi.org/10.1021/acssynbio.7b00016. \nMateyko, N. and De Boer, C.G. (2024) ‘Culture Wars: Empirically Determining the Best \nApproach for Plasmid Library Amplification’, ACS Synthetic Biology, 13(8), pp. 2328–2334. \nAvailable at: https://doi.org/10.1021/acssynbio.4c00377. \nMatreyek, K.A. et al. (2018) ‘Multiplex assessment of protein variant abundance by massively \nparallel sequencing’, Nature Genetics, 50(6), pp. 874–882. Available at: \nhttps://doi.org/10.1038/s41588-018-0122-z. \nMatreyek, K.A., Stephany, J.J. and Fowler, D.M. (2017) ‘A platform for functional assessment of \nlarge variant libraries in mammalian cells’, Nucleic Acids Research, 45(11), pp. e102–e102. \nAvailable at: https://doi.org/10.1093/nar/gkx183. \nMohammed, H. et al. (2016) ‘Rapid immunoprecipitation mass spectrometry of endogenous \nproteins (RIME) for analysis of chromatin complexes’, Nature Protocols, 11(2), pp. 316–326. \nAvailable at: https://doi.org/10.1038/nprot.2016.020. \nNguyen, E. et al. (2024) ‘Sequence modeling and design from molecular to genome scale with \nEvo’, Science, 386(6723), p. eado9336. Available at: https://doi.org/10.1126/science.ado9336. \nNoderer, W.L. et al. (2014) ‘Quantitative analysis of mammalian translation initiation sites by \nFACS-seq’, Molecular Systems Biology, 10(8), p. 748. Available at: \nhttps://doi.org/10.15252/msb.20145136. \nPalacios, S., Collins, J.J. and Del Vecchio, D. 
(2025) ‘Machine learning for synthetic gene circuit \nengineering’, Current Opinion in Biotechnology, 92, p. 103263. Available at: \nhttps://doi.org/10.1016/j.copbio.2025.103263. \nPan, P. et al. (2025) ‘Multi-scale conv-attention U-Net for medical image segmentation’, \nScientific Reports, 15(1), p. 12041. Available at: https://doi.org/10.1038/s41598-025-96101-8. \nParisutham, V. et al. (2024) ‘E. coli transcription factors regulate promoter activity by a universal, \nhomeostatic mechanism’. Available at: https://doi.org/10.1101/2024.12.09.627516. \nRafi, A.M. et al. (2025) ‘Detecting and avoiding homology-based data leakage in \ngenome-trained sequence models’. Available at: https://doi.org/10.1101/2025.01.22.634321. \nRauluseviciute, I. et al. (2024) ‘JASPAR 2024: 20th anniversary of the open-access database of \ntranscription factor binding profiles’, Nucleic Acids Research, 52(D1), pp. D174–D182. Available \nat: https://doi.org/10.1093/nar/gkad1059. \nSingh, R. et al. (2024) ‘Advancements in CHO metabolomics: techniques, current state and \nevolving methodologies’, Frontiers in Bioengineering and Biotechnology, 12, p. 1347138. \nAvailable at: https://doi.org/10.3389/fbioe.2024.1347138. \nTihanyi, B. and Nyitray, L. (2020) ‘Recent advances in CHO cell line development for \nrecombinant protein production’, Drug Discovery Today: Technologies, 38, pp. 25–34. Available \nat: https://doi.org/10.1016/j.ddtec.2021.02.003. \nVaishnav, E.D. et al. (2022) ‘The evolution, evolvability and engineering of gene regulatory \nDNA’, Nature, 603(7901), pp. 455–463. Available at: \nhttps://doi.org/10.1038/s41586-022-04506-6. \nVorontsov, I.E. et al. (2024) ‘HOCOMOCO in 2024: a rebuild of the curated collection of binding \nmodels for human and mouse transcription factors’, Nucleic Acids Research, 52(D1), pp. \nD154–D163. Available at: https://doi.org/10.1093/nar/gkad1077. \nWeingarten-Gabbay, S. et al. 
(2019) ‘Systematic interrogation of human promoters’, Genome \nResearch, 29(2), pp. 171–183. Available at: https://doi.org/10.1101/gr.236075.118. \nZahm, A.M. et al. (2024) ‘A massively parallel reporter assay library to screen short synthetic \npromoters in mammalian cells’, Nature Communications, 15(1), p. 10353. Available at: \nhttps://doi.org/10.1038/s41467-024-54502-9. \nZhang, C. et al. (2023) ‘Construction and application of a multifunctional CHO cell platform \nutilizing Cre/lox and Dre/rox site-specific recombination systems’, Frontiers in Bioengineering \nand Biotechnology, 11, p. 1320841. Available at: https://doi.org/10.3389/fbioe.2023.1320841. \nZhang, Z. et al. (2023) ‘Elucidation of E3 ubiquitin ligase specificity through proteome-wide \ninternal degron mapping’, Molecular Cell, 83(18), pp. 3377-3392.e6. Available at: \nhttps://doi.org/10.1016/j.molcel.2023.08.022. \nZhong, X.-P. and Krangel, M.S. (1997) ‘An enhancer-blocking element between α and δ gene \nsegments within the human T cell receptor α/δ locus’, Proceedings of the National Academy of \nSciences, 94(10), pp. 5219–5224. Available at: https://doi.org/10.1073/pnas.94.10.5219. \nZhou, J. and Troyanskaya, O.G. (2015) ‘Predicting effects of noncoding variants with deep \nlearning–based sequence model’, Nature Methods, 12(10), pp. 931–934. Available at: \nhttps://doi.org/10.1038/nmeth.3547. \nZhou, Y., Lei, C. and Zhu, Z. (2020) ‘A low-background Tet-On system based on \npost-transcriptional regulation using Csy4’, PLOS ONE. Edited by O.P. Perera, 15(12), p. \ne0244732. Available at: https://doi.org/10.1371/journal.pone.0244732. \nZhu, F. et al. 
(2014) ‘DICE, an efficient system for iterative genomic editing in human pluripotent \nstem cells’, Nucleic Acids Research, 42(5), pp. e34–e34. Available at: \nhttps://doi.org/10.1093/nar/gkt1290.","source_license":"CC-BY-4.0","license_restricted":false}