{"paper_id":"1ac69792-5465-4fe1-a052-c044a4aa5c53","body_text":"Designing Convergent Overlapping Genes with Transformer Encoder Models and Lightweight Structural Proxies\nJason K. Morgan 1,*\n1 Independent Researcher\n* Corresponding author at: jkm783@gmail.com\nbioRxiv preprint doi: https://doi.org/10.1101/2025.11.07.687268; this version posted November 10, 2025. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.\nAbstract\nOverlapping genes allow multiple proteins to be encoded from a single DNA sequence, including convergent (antisense; tail-to-tail) orientations across three reading frames (phases 0, 1, and 2), with phase 1 most frequently observed in nature. Designing such overlaps is challenging due to codon degeneracy, phase-specific biases, and the need to preserve structural integrity for both proteins. Here, a purpose-built transformer encoder is introduced, trained on a balanced synthetic dataset of convergent overlaps spanning diverse prokaryotic genomes and GC contents. Controlled amino acid substitutions were incorporated during training to enhance model generalization, particularly for phase 1 overlaps. At inference, Monte Carlo dropout enabled uncertainty-aware sampling of synonymous codon solutions, which were iteratively refined using a windowed, multi-objective optimization framework. Candidate overlaps were scored using composite weighting across secondary structure preservation, substitution similarity, alignment identity, and ESM-2 contact map similarity, with SSIM applied as a rapid proxy for structural fidelity. This approach generated convergent overlaps across all phases, with phase 1 showing the highest success rates. 
Optimization trajectories revealed distinct dynamics, with secondary structure preservation steadily increasing despite its lower weight. External validation using SwissProt proteins stratified by AlphaFold2 pLDDT confidence supported generalization to proteins with differing rigidity, yielding high secondary structure preservation in silico. These results demonstrate that transformer models trained directly at the nucleotide level, when coupled with uncertainty-aware inference and lightweight structural proxies, can support the computational design of synthetic overlapping genes without requiring full structural prediction. This framework offers a scalable path for phase-specific, codon-aware overlap design under realistic constraints.\nIntroduction\nThe study of overlapping genes is an important area of interest in prokaryotic genetics, owing to their ability to encode multiple functional products from a single DNA sequence (1–4). For example, a single gene sequence might code for two distinct proteins, each playing a different role in cellular function. Advances in identifying and designing these overlapping sequences promise the creation of genetic constructs that are both more efficient and more compact, potentially boosting the functionality and stability of engineered organisms (5,6). This phenomenon not only poses challenges but also opens new avenues in genetic engineering and synthetic biology (3).\nOverlapping genes have been described using different approaches, but can be broadly grouped into unidirectional (sense; encoded on the same strand, → →) or convergent (antisense; encoded on opposite strands) orientations (Fig 1A); convergent overlaps can be further divided into “head-to-head” (← →) or “tail-to-tail” (→ ←) orientations (7). 
Convergent overlapping genes occur in the same reading frame (phase 0), or frameshifted by one (phase 1) or two (phase 2) nucleotides. For example, in phase 2, convergent overlaps share their third (degenerate) codon positions. Research into convergent (tail-to-tail; antisense) overlapping gene sequences in prokaryotes has included analyses of the length distribution among overlapping and near-overlapping genes (8), investigations into their prevalence (and misannotations) within prokaryotic genomes (9,10), as well as the characterization of convergent gene pairs (11–18). Notably, both convergent and unidirectional overlaps exhibit a non-uniform distribution with a significant phase bias, predominantly favoring phase 1 overlaps (8,19,20). While the majority of convergent overlaps fall into phase 2, when excluding 4-nucleotide overlaps, nearly half are instead found in phase 1 (8). These studies collectively highlight the phase preference and distribution patterns of naturally occurring gene overlaps.\nThe successful prediction of overlapping nucleotide sequences encoding two amino acid sequences poses unique computational challenges, and has been previously explored (6,21–25); the landscape of available methods has been summarized (22), with additional information pertaining to more recently released programs provided in Supplementary Table 1. 
An early solution for the design of unidirectional overlapping protein sequences was an algorithm to identify the shortest DNA sequence that could encode two amino acid sequences (26). More recently, examples included a dynamic programming algorithm (6) and a subsequent method that combines dynamic programming with a hidden Markov model (HMM) (5,27). The strategy was further extended to arbitrary pairs of natural protein domains (28). These approaches have collectively shown that it is possible to design fully embedded overlapping genes that encode functional proteins with high homology to starting sequences, underscoring the potential for engineered overlapping genes in prokaryotic genetics. Another approach, termed overlapping, alternate-frame insertion (OAFI), demonstrated the ability to design unidirectional overlaps through insertion of one gene into an alternate reading frame of another by targeting sites that may tolerate insertions (29). Recently, Byeon et al. (2025) demonstrated that deep generative models can be used to design synthetic overlapping genes spanning distinct protein families, achieving high in silico and experimental success rates even under the constraints of the standard genetic code (30). Their results indicate that sequence space may contain many viable overlapping solutions, suggesting broader applicability of computational approaches to both unidirectional and convergent overlaps. 
\nTransformer models have demonstrated superior performance in various sequence modeling tasks due to their ability to capture long-range dependencies and contextual information (31–33). This study leverages a novel application of the transformer encoder-based model architecture to predict synthetic convergent overlapping genes in prokaryotes. Unlike dynamic programming approaches, transformers utilize self-attention mechanisms to process the entire input sequence simultaneously (32), enabling the model to capture patterns and dependencies that span long sequences. In this study, Monte Carlo (MC) dropout is applied during inference to enhance prediction performance. Dropout, more generally, is a method of reducing overfitting in deep neural networks during training (34). MC dropout retains dropout activity during inference, effectively sampling from a distribution of possible model predictions (34–36). This provides a practical Bayesian approximation of model uncertainty (37), which is especially advantageous for predicting overlapping genes where the degeneracy of the genetic code permits many distinct but functionally viable nucleotide encodings. By sampling multiple plausible solutions, MC dropout enables exploration of the many synonymous coding possibilities inherent to the genetic code, increasing the likelihood of identifying overlaps that preserve the structural and functional integrity of both proteins. 
\nThe objectives of this study are to (i) evaluate the ability of transformer encoder models to recover convergent overlaps across all three phases, (ii) quantify the impact of overlap length and sequence variability on prediction success, and (iii) assess multi-objective optimization as an alternative to full structural prediction in improving overlap fidelity.\nMethods\nSynthetic Overlapping Gene Dataset Construction for Model Training\nWhile naturally occurring convergent overlaps have been systematically assessed and documented (9), there is no readily available database of convergent overlapping genes with equally distributed overlap lengths. Given this limitation, a database of synthetic convergent overlaps was generated using a diverse set of open reading frames (ORFs) from prokaryotic genomes. Synthetic convergent overlapping genes were created using ORF sequences from 24 different prokaryotic genomes (Supplementary Table 2), using the R (version 4.3.1) and Python (version 3.8.8) programming languages. Genomes were selected to have a wide range of genomic GC content (ranging from 23.3% to 74.9%, with a mean of 52.41%), with the aim of controlling for potential amino acid compositional bias driven by GC content (38). The between-strain coding GC content (calculated using the set of ORF sequences from each strain) ranged from 20.7% to 72.7%, with a mean of 50.5%.\nBriefly, for each strain, sequences were downloaded from the NCBI database (https://www.ncbi.nlm.nih.gov/genome/) and each FASTA file was processed to extract ORFs, which were stored in a separate list for each strain (Fig 1B). ORF sequences shorter than 450 nucleotides were removed, to exclude sequences with insufficient opportunity for varied random sampling during overlap construction. 
A function was defined to randomly select in-frame segments from the ORF sequence lists, to include a primary and secondary sequence either with or without a convergent overlapping region. First, a contiguous in-frame 312-nucleotide section from each ORF sequence was extracted and used as the basis for the primary sequence. The primary sequence segments were modified to ensure they contained start and randomly selected stop codons, as these were removed during initial sequence processing.\nThe reverse complement of this synthetic gene was selected to generate the secondary sequence with a defined convergent overlap length. A uniform set of overlaps was generated for each overlap length by extracting the reverse complement, adding a stop codon, and replacing any in-frame stop codons with a random codon. Both the added stop codon and in-frame replacements were selected such that stop codons in the forward direction were not introduced. However, owing to the methodology employed, a portion of generated secondary sequences with overlapping sequences contained more than one stop codon and were subsequently filtered from the dataset to ensure all sequences contained only one stop codon and encoded a single amino acid sequence.\nThe amino acid sequence pairs with known overlaps were modified to incorporate up to four BLOSUM62-weighted amino acid changes. 
The changes included altering amino acid sequences such that 1) the internal reverse stop codon was removed from both amino acid sequences (primary and secondary), while keeping the known overlap for the original amino acid sequences, or 2) the internal reverse stop codon was removed from both amino acid sequences (primary and secondary), while also introducing additional amino acid modifications per sequence. This yielded modified primary and secondary amino acid sequences, with an unmodified known convergent overlap nucleotide sequence capable of encoding amino acid sequences with high amino acid sequence identity to the primary and secondary amino acid sequences.\nThe length of the input amino acid sequences was limited to 104 amino acids to manage computational complexity during training and inference. Each amino acid was separated by a space, the overall sequence included a terminal asterisk (*), and the primary and secondary sequences were concatenated to generate a single input sequence for training.\nTo balance the representation of different overlap lengths, data were stratified based on the length of the overlap. The overlapping gene training data were balanced, ensuring equal representation for each strain and each length of nucleotide overlap (specifically, convergent overlaps of length 199 to 312 nucleotides). The final dataset was composed of synthetic gene overlaps, each represented by a pair of modified amino acid sequences with a known target convergent overlap sequence. 
This dataset served as the training set for the transformer encoder-based models.\nInitial Model Description and Training\nThe synthetic overlapping gene dataset was used to train transformer encoder-based models with the aim of generating convergent (tail-to-tail) overlaps for any two specified amino acid sequences. A total of 8,892,000 combined pairs were used, corresponding to 78,000 pairs per convergent overlap length.\nTransformers have demonstrated superior performance in various sequence modeling tasks, particularly where understanding long-range dependencies is important (31–33). As such, this study utilizes a PyTorch-based transformer encoder model for overlap prediction (39). Tokenization and text vectorization were performed using the TensorFlow Keras API (40), and data were split into training and validation sets. During training, the models each utilized 156 embedding dimensions, 13 attention heads, 192 feed-forward network dimensions, 5 blocks, a dropout rate of 0.1, and a total of 802,983 trainable parameters (Fig. 2D). Models were compiled using the Adam optimizer with cross-entropy as the loss function.\nEach model was trained for 3 epochs with a batch size of 32 on a single NVIDIA GeForce RTX 5070 Ti GPU with 16 GB of memory, and reached a validation accuracy of ≥90% by the third and final epoch. 
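
The dataset size and attention-head geometry quoted in this section can be cross-checked with simple arithmetic. The following is a minimal sketch: the variable names are illustrative rather than taken from the study's code, and it assumes that every integer overlap length from 199 to 312 nucleotides is represented, as the balanced design described above suggests.

```python
# Illustrative consistency check; names are hypothetical, values come from the text.
overlap_lengths = range(199, 313)   # assumed: every length from 199 to 312 nt
pairs_per_length = 78_000
total_pairs = len(overlap_lengths) * pairs_per_length

# Multi-head attention requires the embedding size to split evenly across heads.
embedding_dim, attention_heads = 156, 13
head_dim, remainder = divmod(embedding_dim, attention_heads)

print(len(overlap_lengths), total_pairs, head_dim, remainder)
```

Under these assumptions, 114 overlap lengths at 78,000 pairs each give 8,892,000 pairs, matching the reported total, and the 156 embedding dimensions divide evenly into 13 heads of 12 dimensions each.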
\nTest Dataset Generation and Model Inference\nAmino acid pairs were generated as described above for preparation of the training dataset, but without the final modification step; however, a set of strains (n = 12) different from those used for training was selected, with a wide range of genomic GC content (ranging from 23.6% to 74.1%, with a mean of 50.2%) (Supplementary Table 3). Given that inference was performed using only coding sequences, the coding sequence GC content was calculated for each strain, and the corresponding mean GC value was used for subsequent analyses (varying from 24.2% to 72.1%, with a mean of 50.0%).\nInference was performed using the trained models along with the same tokenizer to vectorize the input sequence. Using a “greedy decoder” (41), the token with the highest probability at each step of the sequence generation process was selected.\nMC dropout was implemented during inference to allow exploration of the uncertainty in model predictions (eg, variation in predicted overlap sequences), leveraging the benefits of Bayesian inference, as previously described (34,36,42). During inference, depending on the objective, feedforward dropout was set to 0.275 for initial predictions.\nAn algorithm was developed to generate overlapping gene sequences from input amino acid sequences using the trained models. The algorithm first back-translated the input amino acid sequences into a nucleotide sequence, leveraging E. 
coli codon usage frequencies to reflect the probabilistic distribution of codons for each amino acid (43). After back-translation, the models were used to predict potential overlaps within these nucleotide sequences. Predicted overlaps were then integrated into the original nucleotide sequences, resulting in a collection of modified sequences that incorporated the predicted overlapping regions downstream of the remaining back-translated nucleotide sequence. These modified sequences (ie, the back-translated and predicted regions) were then translated into amino acid sequences to assess the accuracy of the predictions. In instances where the correct amino acid sequence was not identified, additional rounds of prediction and back-translation were performed. This iterative process continued until either an overlap yielding amino acid sequences meeting the pre-specified criteria was identified, or the specified number of inference rounds was completed without a predicted overlap (defined as an unsuccessful prediction).\nAn alignment score was calculated for each sequence prediction. The predicted sequence was aligned against the known overlap using the Needleman-Wunsch algorithm (implemented via pairwise2 from Biopython (44)). For sequences with a known overlap and a predicted overlap sequence, the calculated alignment score was based on the length of the overlap, normalizing scores to a scale of 0 to 1 to facilitate comparison across predictions. 
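
The length-normalized alignment score described above can be illustrated with a small, dependency-free sketch. The scoring parameters are not stated in the text, so this example assumes the simplest Needleman-Wunsch scheme (match = 1, mismatch = 0, no gap penalty, the same scheme as Biopython's `pairwise2.align.globalxx`) and divides by the length of the known overlap. The function names are hypothetical and this is not the study's implementation.

```python
def global_match_count(seq_a, seq_b):
    '''Needleman-Wunsch score assuming match = 1, mismatch = 0, gap = 0.'''
    prev = [0] * (len(seq_b) + 1)          # DP row for the previous character of seq_a
    for a in seq_a:
        cur = [0]
        for j, b in enumerate(seq_b, start=1):
            diag = prev[j - 1] + (1 if a == b else 0)   # align a with b
            cur.append(max(diag, prev[j], cur[j - 1]))  # or open a free gap
        prev = cur
    return prev[-1]


def normalized_alignment_score(predicted, known):
    '''Scale the global alignment score to 0-1 by the known overlap length.'''
    if not known:
        raise ValueError('known overlap must be non-empty')
    return global_match_count(predicted, known) / len(known)


print(normalized_alignment_score('ATGAAACTG', 'ATGAAGCTG'))
```

Because the match count can never exceed the length of the known overlap, the score stays within 0 to 1 under this assumed scheme; a different match, mismatch, or gap parameterization would need a different normalization.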
\nSecondary Structure Prediction, Substitution Score, and Structure Prediction\nThe prediction of secondary structures was performed using S4PRED, which enabled predictions based on an input amino acid sequence alone without the need for multiple sequence alignments or known homologous sequences (45). S4PRED outputs were formatted as “ss2” files, from which the predicted secondary structure sequence was extracted for downstream analyses. The model employs a simplified three-state classification of α-helix (H), β-sheet (E), and coil/loop (C), which corresponds to a coarse-grained version of the DSSP (Dictionary of Secondary Structure of Proteins) annotations (46). Minor code modifications were implemented to enable efficient batch processing, substantially reducing runtime. These predicted structures were used to evaluate the structural compatibility of convergent overlapping gene candidates. Substitution matrices utilized were BLOSUM62 (47,48) and ProtSub (49). Protein structures were predicted with either AlphaFold3 or ColabFold (50–52).\nOptimization of predicted sequences\nAn iterative, windowed optimization procedure was used to convert paired amino-acid inputs into convergent (tail-to-tail) overlapping nucleotide sequences while preserving fold-relevant features of both proteins and maintaining sequence similarity. Input amino-acid sequences were first back-translated to an initial nucleotide scaffold using E. coli codon usage frequencies. The region targeted for overlap design was tiled into partially overlapping amino-acid windows; in the experiments reported here, windows were 10 amino acids with an 8-amino-acid stride (model receptive field used = 105 AA). 
During optimization, only nucleotides mapping to the current window were permitted to change; flanking nucleotides were held fixed except at user-designated residues enforced by token-level biases (for example, to preserve active-site residues or motifs). Bracketed constraints on the forward strand were mirrored on the reverse-complement strand so that enforced bases remained consistent with the convergent orientation.\nWithin each window, a transformer encoder was executed in an uncertainty-aware mode: Monte Carlo (MC) dropout was retained at inference to generate diverse synonymous codon proposals. Token-level fixed-logit (base) biases were applied at specified nucleotide positions to enforce required bases during sampling; combined forward/reverse fixed-logit maps ensured constraints were respected on both strands. Each sampled nucleotide pair was translated to amino-acid sequences and duplicate amino-acid solutions were removed prior to scoring.\nStructural compatibility was assessed using single-sequence secondary structure prediction (S4PRED; three-state H/E/C). Secondary structure preservation was reported as the percentage of residues for which a candidate’s predicted state matched the original sequence prediction. Substitution similarity was computed using BLOSUM62 and normalized to the original sequence’s self-score to yield a percent similarity; global alignment identity was also measured. ESM-2 (53) (model: esm2_t12_35M_UR50D) was used to generate residue-residue contact probability matrices, having demonstrated utility as a fast and effective single-sequence prediction feature (54). 
Contact map similarity was assessed using the structural similarity index (SSIM) (55), computed between ESM-2-predicted contact maps of candidate overlap sequences and their respective originals. SSIM quantifies local and global structural pattern agreement (range 0-1, with 1 indicating identical maps). Raw values were linearly scaled to a percentage (0-100) to yield the ESM-2 contact map score. This metric served as a rapid, computationally efficient proxy for structural preservation, enabling large-scale overlap design iterations while retaining sensitivity to perturbations in contact map topology.\nCandidates were ranked by a fixed weighted composite score across all experiments: 0.15 (secondary structure preservation), 0.15 (substitution similarity), 0.1 (alignment identity), and 0.6 (ESM-2 contact map similarity). Although this weighting scheme was not exhaustively optimized, empirical testing with randomly selected SwissProt proteins indicated that it yielded stable rankings across experiments and supported overlap generation while balancing structural fidelity with sequence similarity (Figs. S3, S4, and S5). The top-ranked candidate for the window was integrated, and optimization proceeded to the next window. Multiple full passes across windows were permitted for refinement. After the first structurally successful candidate, global feedforward and attention dropout levels were reduced to bias sampling from exploratory search toward fine-tuning, though the weighting scheme itself remained unchanged. 
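
The fixed weighting described above amounts to a weighted sum followed by a sort. The sketch below is illustrative only: the dictionary keys, function names, and candidate values are hypothetical, and it assumes all four metrics are expressed on the same 0-100 percentage scale before weighting.

```python
# Weights as stated in the text; key names are illustrative.
WEIGHTS = {
    'secondary_structure': 0.15,
    'substitution_similarity': 0.15,
    'alignment_identity': 0.10,
    'esm2_contact_ssim': 0.60,
}


def composite_score(metrics):
    '''Weighted sum of metrics, each assumed to lie on a 0-100 scale.'''
    return sum(WEIGHTS[name] * metrics[name] for name in WEIGHTS)


def rank_candidates(candidates):
    '''Sort candidates from highest to lowest composite score.'''
    return sorted(candidates, key=lambda c: composite_score(c['metrics']), reverse=True)


# Two made-up candidates for a single window (values are not from the study).
candidates = [
    {'id': 'cand_a', 'metrics': {'secondary_structure': 90, 'substitution_similarity': 80,
                                 'alignment_identity': 70, 'esm2_contact_ssim': 60}},
    {'id': 'cand_b', 'metrics': {'secondary_structure': 60, 'substitution_similarity': 70,
                                 'alignment_identity': 80, 'esm2_contact_ssim': 90}},
]
best = rank_candidates(candidates)[0]
print(best['id'], round(composite_score(best['metrics']), 2))
```

With these made-up values the second candidate ranks first (81.5 versus 68.5), showing how the 0.6 weight lets contact map similarity outweigh a higher secondary structure match.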
\nEfficiency measures implemented for reproducibility and speed included: (i) caching of secondary structure predictions so identical amino-acid sequences were scored only once, and (ii) caching of ESM-2 embeddings and contact maps. All candidate metadata (attempt, window, per-candidate metrics, and secondary structure predictions) were retained in memory and exported at the end of each input row to permit analysis.\nMutual information, PCA, and robustness testing\nTerminal regions (104 amino acids) from 250 randomly selected SwissProt-derived proteins were used as seeds to generate ensembles of mutated sequences. For each seed, trajectories were created by stepwise amino acid substitutions sampled from the 20-letter alphabet, introducing one mutation per step for up to 75 iterations. This yielded a progressive set of variant sequences per seed protein. Each sequence in the ensemble was then scored for global alignment identity, ESM-2 contact map similarity, and secondary structure match relative to the reference. These metric vectors formed the input for subsequent mutual information (MI) and principal component analyses.\nMI was estimated after quantile discretization of continuous scores into 20 bins to evaluate pairwise dependence among objectives. To reduce sensitivity to binning, MI was also estimated using a k-nearest-neighbor approach. Statistical significance of discrete MI values was assessed using permutation testing with 1,000 random shuffles. 
Principal component analysis (PCA) was applied to z-scored metric vectors (alignment, ESM-2 contact map similarity, secondary structure match) to examine the dimensional structure of variation. Robustness checks included varying the number of bins (5-40) for MI and bootstrap resampling of PCA explained variance.\nStatistical Analyses and Figure Preparation\nThe R programming language (version 4.3.1) and the ggplot2 package (56,57) were used to perform statistical analyses and generate figures. Prediction success rates are reported as percentages with exact 95% confidence intervals calculated using the Clopper–Pearson method. Correlations between similarity metrics were assessed using Pearson’s correlation coefficient with associated p-values. Relationships between secondary structure composition (coil, helix, sheet) and prediction performance were examined using LOESS smoothing with 95–99% confidence bands. For external validation experiments, performance metrics were stratified by AlphaFold2 pLDDT confidence brackets and overlap phase. Group differences were evaluated using nonparametric pairwise contrasts with adjusted p-values (Bonferroni correction). Effect sizes are reported as absolute percentage-point differences between groups, alongside confidence intervals. Data are presented as mean values with ranges or confidence intervals where appropriate. Figures were generated with ggplot2, with additional overlays and schematics prepared in Inkscape (version 1.3.2). 
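
For readers who want to reproduce the exact binomial intervals outside R, the Clopper–Pearson bounds can be obtained by numerically inverting the binomial tail probabilities. The sketch below uses only the Python standard library and bisection; the function names are hypothetical, and it is an illustration of the method rather than the code used for the reported analyses.

```python
import math


def upper_tail(x, n, p):
    '''P(X >= x) for X ~ Binomial(n, p), summed in log space to avoid overflow.'''
    total = 0.0
    for k in range(x, n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                   + k * math.log(p) + (n - k) * math.log1p(-p))
        total += math.exp(log_pmf)
    return total


def bisect_root(p_is_too_small, iterations=50):
    '''Bisection on (0, 1); the predicate says whether the trial p lies below the root.'''
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if p_is_too_small(mid):
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(successes, n, alpha=0.05):
    '''Exact two-sided (1 - alpha) confidence interval for a binomial proportion.'''
    # Lower bound: the p at which P(X >= successes) equals alpha / 2.
    lower = 0.0 if successes == 0 else bisect_root(
        lambda p: upper_tail(successes, n, p) < alpha / 2)
    # Upper bound: the p at which P(X <= successes) equals alpha / 2.
    upper = 1.0 if successes == n else bisect_root(
        lambda p: 1.0 - upper_tail(successes + 1, n, p) > alpha / 2)
    return lower, upper


print(clopper_pearson(4765, 5000))
```

The example call assumes that a success rate of 95.3% on a 5,000-pair test set corresponds to 4,765 successes; the resulting bounds can then be compared against an interval computed in R.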
\nDeclaration of Generative AI and AI-assisted technologies\nDuring the preparation of this work, ChatGPT (5.0; OpenAI) was used to assess readability and language, and as a tool to generate code snippets and assist in debugging. After using this tool/service, the author reviewed and edited the content as needed and takes full responsibility for the content of the publication.\nCode and data availability\nThe code used for synthetic overlap dataset construction, model training, and inference is available at: https://github.com/protosome/convergent_overlaps_aa_change\nResults\nPrediction of convergent overlapping genes from amino acid sequence pairs from a dedicated transformer-encoder model set\nWhile prior work has established that it is technically feasible to generate overlapping genes from paired amino acid sequences, the performance of dedicated transformer-encoder models for this task has not been systematically evaluated. In this study, transformer-based models were trained to generate convergent overlapping nucleotide sequences from amino acid pairs, and their ability to recover overlaps across all three phases was assessed. These models were designed to identify convergent overlaps capable of encoding both input proteins, using recent advances in transformer-based sequence modeling to address the large solution space introduced by the degeneracy of the genetic code (31,58). 
Prediction quality was evaluated using alignment identity scores and secondary structure preservation metrics, allowing comparison across overlap lengths from 199 to 312 nucleotides.\nInitial filtering was performed using a conservative alignment score cutoff of 34% (28,59). Without Monte Carlo dropout (ie, dropout kept enabled during inference), successful predictions were substantially reduced, and repeated inference runs converged to identical codon solutions. This indicated that deterministic decoding restricted the search space to narrow, low-diversity outcomes. By contrast, applying Monte Carlo dropout during inference expanded sequence diversity, producing multiple distinct codon assignments that encoded the same amino acid sequences. This stochastic sampling increased the likelihood of recovering overlaps that satisfied both similarity and structural thresholds, underscoring the importance of uncertainty-aware inference in navigating the combinatorial complexity of convergent overlap prediction.\nTo further assess training requirements, the effect of imperfect overlaps and amino acid substitutions in the training data was examined. Models trained only on unaltered overlaps exhibited markedly reduced success rates, particularly for phase 1 (Figs. 3D, 3E). In a shared test set of 5,000 amino acid sequence pairs without a known convergent overlap, training with modified sequences (stop codon removed plus 1–4 substitutions per sequence) substantially improved prediction rates: phase 0, 95.3% (94.7-95.9%); phase 2, 86.7% (85.7-87.6%); and phase 1, 81.6% (80.5-82.6%). In contrast, models trained on unaltered overlaps performed significantly worse: phase 0, 73.7% (72.4-74.9%); phase 2, 49.1% (47.7-50.5%); and phase 1, 9.7% (8.9-10.6%). These findings are consistent with the hypothesis that incorporating controlled sequence variation during training increased model flexibility, and they suggest improved generalization, particularly when coupled with Monte Carlo dropout during inference. On this basis, models trained with amino acid substitutions were selected for all subsequent analyses.\nThe outputs of these models were further evaluated using BLOSUM62 and ProtSub substitution matrices. For overlap lengths of 310-312 nucleotides, similarity scores were strongly correlated with amino acid identity (r = 0.88; p < 0.0001), and the matrices themselves were highly correlated with each other (r = 0.98; p < 0.0001).\nIn addition to substitution metrics, alignment scores and secondary structure scores were compared across phases. Phase 1 overlaps exhibited the highest average alignment and secondary structure scores, phase 0 displayed intermediate values, and phase 2 was generally the lowest. However, score distributions overlapped considerably, suggesting that phase-dependent differences were not absolute. The near-identical secondary structure score distributions for phase 0 and phase 2 further indicated that structural outcomes may be more sensitive to sequence-specific context than to phase alone.\nTo broaden the analysis, overlap lengths from 199 to 312 nucleotides were examined across a larger set of amino acid pairs. 
Phase-dependent patterns persisted: phase 1 overlaps were increasingly recoverable at shorter lengths, and the relative rank order of secondary structure and alignment scores across phases remained consistent. These findings suggest that phase 1 overlaps may be more favorable in terms of structural and codon compatibility, and that performance of the transformer encoder model set is influenced by overlap length and the diversity of sequences available for training.\nHigh Secondary Structure Scores Are Linked to Coil-Dominated Precursor Sequences\nTo assess whether the secondary structure composition of precursor amino acid sequences influences the predicted quality of convergent overlaps, the proportion of coil (C), β-sheet (E), and α-helix (H) structures in each input was compared against the model’s predicted secondary structure score. For each amino acid pair, structural proportions were calculated separately and then averaged across both sequences. This approach reflects the biological premise that overlap feasibility arises not from a single sequence in isolation, but from the combined structural constraints and flexibilities of the two proteins that must be co-encoded (60,61).\nAcross all three overlap phases, sequences with higher coil content exhibited elevated secondary structure scores (Fig. 3C). 
This trend was most apparent in phase 0 and phase 1, where LOESS-smoothed curves showed a modest increase in secondary structure score with increasing coil content, followed by a sharper rise beyond approximately 70% coil. In phase 2, a similar increase was observed at high coil levels, but the trend was less consistent, with greater variability at intermediate values. By contrast, α-helix content was inversely associated with secondary structure score. The most pronounced effect occurred in phase 0, where increasing helix content corresponded to a marked decline in secondary structure scores. A similar, though less steep, trend was observed in the other phases. For β-sheet content, no consistent relationship with secondary structure score was observed; across all three phases, smoothed curves were largely flat, indicating that sheet content neither strongly promoted nor hindered predicted structural quality.\nTo evaluate whether these relationships extended to sequence-level similarity, mean alignment scores were also plotted against precursor structural compositions. No meaningful trends were observed for any secondary structure class (Fig. S1). Alignment scores remained relatively constant regardless of coil, β-sheet, or α-helix content, suggesting that precursor structure did not systematically influence the model's ability to recover the reference overlap sequence. 
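The pair-level composition used in this analysis (proportions calculated separately for each sequence, then averaged across the pair) can be sketched as follows. The sketch assumes each input has already been assigned a three-state string (C, E, H) by a predictor such as S4PRED; the function names are hypothetical and the snippet is not taken from the study code.

```python
from collections import Counter

STATES = 'CEH'  # coil, beta-sheet, alpha-helix


def ss_fractions(ss):
    # Fraction of each state in one three-state secondary structure string.
    if not ss:
        raise ValueError('empty secondary structure string')
    counts = Counter(ss)
    return {s: counts[s] / len(ss) for s in STATES}


def pair_composition(ss_primary, ss_secondary):
    # Per-sequence fractions are computed first and then averaged, so each
    # sequence contributes equally regardless of its length.
    a = ss_fractions(ss_primary)
    b = ss_fractions(ss_secondary)
    return {s: (a[s] + b[s]) / 2 for s in STATES}
```

Averaging the per-sequence fractions, instead of pooling residues from both sequences, keeps a longer partner from dominating the pair-level value.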
\nMulti-objective optimization and computational design of EGFP/AmpR convergent overlaps\nAlthough the transformer-encoder models were capable of generating convergent overlaps for paired amino acid sequences, the secondary structure preservation achieved in initial predictions was variable and generally lower than values reported in prior work (60). To address this limitation, an iterative sequence optimization algorithm was implemented to refine model outputs using a multi-objective framework that did not rely on MSA availability (Fig. 4A).\nThe optimization procedure employed a windowed feedforward Monte Carlo (MC) dropout strategy to generate multiple candidate solutions for each sequence window (Fig. 4B). Candidate overlaps were then evaluated against two complementary structural criteria: 1) predicted secondary structure states derived from S4PRED (62), providing local fold-relevant constraints, and 2) ESM-2 contact map similarity (63), which encodes sequence-level context that may reflect long-range residue dependencies. A key feature of the framework was the ability to enforce selective preservation of amino acid residues when required. This was achieved by applying a feedforward mask to specific codon positions, constraining the model to maintain defined residues (for example, active-site motifs) while permitting synonymous variation elsewhere. In practice, this allowed critical sequence features to be retained while enabling exploration of codon-level degeneracy to optimize overlap formation. 
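The residue-preservation constraint can be illustrated with a simplified sketch. In the study, the mask is applied inside the model at codon positions; the code below instead checks candidates after generation, at the amino acid level, which is an assumption made purely for illustration. The bracket notation follows the convention of marking preserved input residues in brackets (Fig. 4A); all function names are hypothetical.

```python
def parse_brackets(annotated):
    # 'MK[HG]LV' -> ('MKHGLV', [False, False, True, True, False, False])
    residues, keep, inside = [], [], False
    for ch in annotated:
        if ch == '[':
            if inside:
                raise ValueError('nested brackets are not supported')
            inside = True
        elif ch == ']':
            if not inside:
                raise ValueError('closing bracket without opening bracket')
            inside = False
        else:
            residues.append(ch)
            keep.append(inside)
    if inside:
        raise ValueError('unclosed bracket')
    return ''.join(residues), keep


def violated_positions(reference, candidate, keep):
    # Indices of preserved residues that the candidate has changed.
    if not (len(reference) == len(candidate) == len(keep)):
        raise ValueError('reference, candidate, and mask must have equal length')
    return [i for i, k in enumerate(keep) if k and reference[i] != candidate[i]]


def filter_candidates(annotated_reference, candidates):
    # Keep only candidates that leave every bracketed residue unchanged.
    reference, keep = parse_brackets(annotated_reference)
    return [c for c in candidates if not violated_positions(reference, c, keep)]
```

A post hoc filter of this kind discards violating candidates, whereas masking at generation time avoids producing them at all, which is likely the more efficient choice when many residues are fixed.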
\nRelationships among the chosen objectives were next quantified using mutual information (MI) and principal component analysis (PCA). Both approaches indicated that alignment identity, ESM-2 contact map similarity, and predicted secondary structure capture overlapping but distinct constraints. Discrete MI (20 quantile bins) showed that alignment carried moderate information about both ESM-2 contact map similarity (MI = 0.393) and secondary structure match (MI = 0.381), whereas the MI between ESM-2 contact map similarity and secondary structure similarity was lower (MI = 0.240), consistent with partial independence. Continuous MI estimates (kNN regression) produced similar results (0.509, 0.508, and 0.340, respectively). All associations were highly significant in permutation tests (p = 0.001). PCA of z-scored metrics supported this interpretation (Fig. S2). A single axis (PC1) explained 72.4% of variance (95% CI: 71.8-73.0) and loaded positively on all three objectives, reflecting a general similarity dimension. PC2 (18.4%; 95% CI: 18.0-18.9) contrasted ESM-2 contact map similarity with secondary structure similarity, while PC3 (9.2%) primarily captured sequence-only variance. Thus, although alignment dominates the shared signal, secondary structure preservation and ESM-2 contact map-based similarity provide orthogonal information that cannot be reduced to alignment alone. This justifies their inclusion as separate objectives during overlap optimization.\nMetric behavior under progressive sequence divergence was also assessed by subjecting 250 randomly sampled SwissProt proteins to iterative single-residue substitution trajectories (75 steps per sequence). Across trajectories, all four similarity measures declined with increasing mutational distance, but the rate and pattern of decay differed by metric (Fig. S3). Alignment identity and BLOSUM62 similarity decreased rapidly and near-linearly, consistent with their direct dependence on residue-level identity. By contrast, ESM-2 contact map similarity and secondary structure similarity showed more gradual declines. These distinct decay profiles support the interpretation that the objectives capture complementary aspects of overlap fidelity.\nWeighting experiments further clarified the contributions of each objective. When the relative weights assigned to secondary structure similarity, ESM-2 contact map similarity, alignment identity, and BLOSUM62 substitution similarity were varied across randomly selected SwissProt pairs with very high AlphaFold2 pLDDT values (≥90), distinct trade-off dynamics were observed (Fig. S4). Increased weighting on secondary structure similarity consistently improved secondary structure preservation but reduced alignment and substitution similarity, whereas embedding-focused weightings enhanced ESM-2 contact map similarity at the expense of secondary structure similarity. Intermediate combinations (eg, 0.15-0.4 secondary structure with 0.6 ESM-2 contact map similarity) yielded more balanced outcomes across the four metrics. Variance patterns indicated that secondary structure weighting stabilized optimization trajectories, while ESM-2 contact map similarity remained comparatively variable across candidate sequences. 
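The weighted scoring behind these experiments can be sketched as a simple linear combination of the four metrics. The example weights below take 0.15 (secondary structure) and 0.6 (ESM-2 contact map similarity) from the intermediate combinations mentioned above; the equal split of the remaining 0.25 between alignment identity and BLOSUM62 similarity is an assumption, as are the function and key names.

```python
# Hypothetical weights: 'ss' and 'esm' follow one intermediate combination
# mentioned in the text; the 'align' / 'blosum' split is an assumption.
EXAMPLE_WEIGHTS = {'ss': 0.15, 'esm': 0.60, 'align': 0.125, 'blosum': 0.125}


def composite_score(metrics, weights):
    # Weighted sum of similarity metrics, each expected on a 0-1 scale.
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError('weights must sum to 1')
    return sum(w * metrics[name] for name, w in weights.items())


def rank_candidates(candidates, weights):
    # Sort candidate records from highest to lowest composite score.
    return sorted(candidates,
                  key=lambda c: composite_score(c, weights),
                  reverse=True)
```

Because the score is a fixed linear combination, changing one weight shifts the ranking toward candidates that excel on that metric, which is the trade-off behavior the weighting experiments describe.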
\nA convergent overlap between enhanced green fluorescent protein (EGFP) and a TEM-1 β-lactamase (ampicillin resistance marker from pCVD004; hereafter AmpR) was designed and refined under the same objectives as a biologically relevant test case (64,65). The combination of MC dropout sampling, secondary structure prediction, and embedding-based scoring consistently improved fold preservation of computationally designed EGFP/AmpR 311-nucleotide (phase 1) overlaps (Fig. 4C; Fig. 5). Compared to unoptimized outputs (ie, sequences generated from the first window), optimized sequences displayed higher secondary structure similarity rates and increased ESM-2 contact map similarity, demonstrating that overlap designs can be improved using lightweight, MSA-independent objectives. During windowed optimization, alignment and BLOSUM62 metrics remained largely constant or decreased across windows, reflecting their role in constraining candidate sequences to a biologically plausible subset of sequence space. In contrast, secondary structure similarity generally increased throughout optimization, despite being weighted at only 0.15 in the composite score. ESM-2 contact map similarity increased modestly in the early windows before plateauing. These findings suggest that fold-relevant secondary structure metrics may act as the principal discriminative driver of optimization progress, with alignment- and embedding-based metrics providing a stabilizing baseline. 
\nFollowing optimization, the pair of amino acid sequences with the highest combined score was selected for structural comparison using predictions from AlphaFold3 (50), with residues in both sequences selected for preservation based on predicted importance (eg, active site). The AlphaFold3-predicted models of designed overlaps were highly similar to reference predictions (Fig. 4C), with TM-scores of 0.98 (EGFP) and 0.90 (AmpR), the latter with deviations generally localized to a disordered coil not located in the overlapping terminal region (66).\nPerformance of this algorithm in predicting overlapping genes using pLDDT as an orthogonal proxy for intrinsic structure rigidity\nAnalyses on prokaryote-derived amino acid sequence pairs indicated that secondary structure preservation increased with coil content and decreased with helix content, with sheet content showing no consistent effect. To determine whether this pattern reflected a generalizable property of overlap design for convergent overlaps derived from terminal amino acid sequences, an external validation set was assembled from SwissProt proteins stratified by AlphaFold2 (AF2) per-residue confidence (pLDDT), an orthogonal proxy for intrinsic structural rigidity (52,67–69). SwissProt sequences were filtered to 105–200 amino acids to align with the 105-AA receptive field and to bound computational cost, then binned into Low-intermediate (50-69), Confident (70-89), and Very high (90-100) pLDDT groups. 
For each bracket, 100 non-redundant amino-acid pairs were formed, and convergent overlaps were designed across phases 0, 1, and 2 at 310, 311, and 312 nt using the same transformer-encoder, windowed Monte Carlo (MC) dropout inference, objective weights, and constraint handling described previously.\nQuantitative analysis of secondary structure composition (determined using S4PRED) across pLDDT confidence bins revealed clear trends. Looking specifically at the terminal AA regions, as pLDDT confidence increased from the Low-intermediate to the Very high bins, coil content declined substantially (from ~64% to ~41%), while both helix and sheet content rose (helix from ~29% to ~39%; sheet from ~6% to ~20%) (Figs. 6A, 6B). These shifts suggest that AlphaFold’s higher-confidence regions correspond generally to more ordered structural states, and reinforce the interpretation of pLDDT as an approximate proxy for structural order and rigidity (50,70,71).\nThese AF2-stratified findings were consistent with the earlier composition-based analysis using prokaryotic ORFs in this study. Sequence pairs with lower pLDDT (coil-enriched) were associated with slightly higher secondary structure preservation, whereas higher-confidence inputs tended to show reduced variation (Fig. 6C). Nevertheless, the overall high secondary structure scores across all three brackets emphasize that the algorithm was able to recover structurally faithful overlaps even for proteins predicted by AlphaFold2 to adopt stable, well-defined folds. 
Alignment identity and substitution similarity remained generally stable across brackets, indicating that secondary structure preservation was the primary dimension along which flexibility exerted its effect.\nBracket- and phase-specific effects were evident across all three metrics, consistent with the secondary structure composition of the pLDDT bins. Across overlap phases, sequences in the Low-intermediate (50-69) bracket generally outperformed those in the Confident (70-89) and Very high (90-100) brackets, with the gap being most pronounced in phase 2 (Fig. 6D). In phase 1, the bracket effect was clear for secondary structure preservation and for the combined score relative to Very high (90-100), whereas pairwise differences were not evident for ESM-2 contact map similarity; Low-intermediate (50-69) also did not differ from Confident (70-89) for the combined score. Overall, phase 1 maintained the highest preservation, while phase 2 exhibited the largest bracket separations. Although group differences were statistically significant, effect sizes varied by metric: within a phase, combined and ESM-2 contact map similarity differences were typically ~1-4 percentage points, whereas secondary structure differences reached ~5-8 points in some phase 2 contrasts. All groups maintained mean secondary structure preservation above 85%. Full pairwise estimates with confidence intervals and adjusted p-values are provided in Supplementary Tables 4 and 5.\nWhile statistically significant differences were observed between pLDDT brackets, these differences were relatively small, and overall secondary structure preservation remained consistently high across all conditions (Fig. 6E). 
These results reinforce the generalizability of the earlier coil-associated trend, but also highlight the robustness of the multi-objective, uncertainty-aware optimization framework in achieving high predicted structural preservation regardless of precursor rigidity.\nDiscussion\nOverlapping gene architectures have long been recognized as both a source of evolutionary constraint and an avenue for novel functionality (72–80). In addition to their evolutionary importance, overlapping structures can be purposefully engineered under realistic codon and structural constraints. The present study shows that phase-specific convergent overlaps can be computationally generated and optimized using a transformer encoder-based multi-objective framework without requiring full 3D prediction at the optimization stage; targeted 3D checks were applied post hoc. This framework offers a complementary tool that could expand the range of strategies available for synthetic overlap design, particularly where phase, length, and nucleotide-level control are critical.\nRecent work has demonstrated that deep generative protein models can discover viable overlapping solutions under genetic code constraints. Byeon et al. (2025) showed that pretrained amino acid-space generative models can design synthetic overlapping genes across multiple reading frames using Gibbs sampling with codon-compatibility constraints, and validated expression experimentally (30). Xu et al. 
(2025) further demonstrated that pretrained generative protein language models (ESM-3) can yield viable entangled protein pairs through structure-conditioned inverse folding and CAMEOS-based entanglements, filtered by cross-entropy and Potts energy scores (81). Importantly, Xu et al. established that pretrained generative models are capable of identifying functional overlap-compatible solutions, albeit in a limited experimental context (the InfA/AroB system) and without explicit codon-level optimization.\nThe framework presented here builds on this momentum but takes a different route. Rather than relying on pretrained amino acid-space sampling, it employs a purpose-built, phase-generalizable transformer encoder model set trained on a balanced synthetic dataset of convergent overlaps spanning diverse GC contexts and overlap lengths. Multi-objective optimization is integrated directly into the inference loop, with candidate overlaps iteratively refined through a windowed, uncertainty-aware (MC dropout) search. Evaluation criteria span predicted secondary structure preservation (S4PRED), amino acid substitution similarity (BLOSUM62), pairwise alignment identity, and ESM-2 contact map similarity. This approach emphasizes codon-level control while maintaining structural fidelity, enabling overlap recovery without requiring full 3D structural prediction at each iteration.\nMC dropout was retained during inference to enable stochastic sampling of synonymous codon solutions, thereby broadening the effective search space (35). 
Candidate overlaps were ranked using a fixed composite score that balanced structural preservation, substitution similarity, alignment identity, and embedding-based metrics. This weighting scheme served as a practical approach for navigating competing design constraints without requiring full structural prediction at each iteration, and was most closely aligned with prior work arguing that synthetic overlap design requires explicit balancing of competing biological constraints and objectives (5,60). While other multi-objective optimization strategies could be considered in future work (such as ε-constraint methods (82,83), adaptive scalarization (84,85), or evolutionary multi-objective optimizers (86)), the fixed weighted framework proved sufficient to achieve high overlap fidelity under the conditions evaluated.\nThe synthetic convergent overlapping gene dataset construction method developed as part of this study provides a structured framework for exploring overlapping gene design in silico, as it permits generation of diverse and uniformly distributed overlaps. However, one potential limitation of this approach is the reliance on synthetic, rather than natural, overlapping genes for model training, which were generated using ORFs from a diverse set of prokaryotic strains. This may introduce a bias affecting the predictive ability and variability in sequence output. 
Despite this concern, the method employed to generate synthetic overlapping genes is in principle analogous to the formation of new overlaps through the process of gene extension following the loss of a stop codon (8). This results in synthetic overlapping genes where the second (extended) strand's encoded amino acid composition is derived from the reverse complement of the first strand, without undergoing further selection. As recently reported, there is extensive evidence suggesting functionality in prokaryotic convergent (antisense) proteins, indicating that non-coding RNA regions could also encode functional proteins (87). For same-strand overlaps, a significant difference in overall composition compared to non-overlapping genes has been previously observed (88), along with a bias towards disorder-promoting amino acids (89). The extent to which this affects convergent overlapping genes remains unclear. Future work will therefore aim to (i) characterize compositional differences between natural and synthetic overlaps, (ii) extend design to longer proteins beyond the 199-312 nucleotide range considered here, and (iii) experimentally validate whether in silico-designed overlaps from this framework maintain expression, folding, and function in vivo.\nIn summary, this study establishes that phase-aware transformer models trained directly at the nucleotide level, when coupled with uncertainty-aware inference and multi-objective optimization, can recover computationally designed convergent overlapping genes under realistic genomic constraints. By integrating codon-level sampling with structural and substitution-based objectives, this framework complements recent amino acid-space generative approaches while offering length- and phase-specific control. These findings provide a computational foundation for experimentally testable designs and point toward a scalable strategy for compact, synthetic gene architectures.\nFig 1. Depiction of convergent overlapping genes in three reading frames and the process developed to computationally design synthetic convergent overlaps from coding sequences.\n(A) Sequences are represented for convergent (tail-to-tail) and unidirectional (same-strand) overlapping genes in each of the three reading frames (phase 0, phase 1, and phase 2). Note, phase 0 unidirectional overlaps are not represented as these are considered as an alternative start site for the same gene (90). In each phase, the reading frame was shifted by one nucleotide. Start and stop codons are indicated with green and red shading, respectively. (B) The general process for creating synthetic convergent overlapping genes is presented, starting with a pool of ORF sequences filtered by length (≥450 base pair [bp]). The primary sequence and in-frame 105-bp fragment were randomly selected. 
A stop codon was randomly selected from the reverse complement (rc), and the fragment reverse complement was extended to that stop codon to serve as the downstream sequence of the secondary sequence. Given the downstream secondary sequence length, another sequence was randomly selected from the same gene list. A size-appropriate fragment was selected from that gene with length equal to the difference between the desired gene length and the upstream overlap fragment (ie, the reverse complement of the primary sequence to the selected stop codon). The upstream fragment was then combined with the downstream fragment to generate the complete secondary sequence. The resultant primary and secondary nucleotide sequences, sharing a known convergent overlapping region, were then translated to amino acid (aa) sequences (either modified or unmodified) for subsequent use in model training or inference.\nFig 2. Illustration of the overlap transformer-encoder architecture and general structure of the data flow\n(A) Depiction of the general structure of the data flow, for two input concatenated amino acid sequences (primary and secondary) used during inference. The primary and secondary amino acid sequences are represented with green and blue text, respectively, and the flow proceeds from top to bottom. (B) Illustration of the transformer encoder model architecture. The model input is tokenized concatenated paired amino acid sequences, and the model outputs raw logits. 
The output logits are processed using a greedy decoder to generate a predicted overlap sequence. \n\nFig 3. Training data evaluation of secondary structure preservation and convergent overlap-phase performance. \n(A, D) Results for unmodified amino acid sequences. (A) Scatter density plot of secondary structure (SS) score versus alignment score, with contours and marginal density plots showing the distribution by overlap phase (0 = red, 1 = green, 2 = blue). (D) Fraction of successful predictions across overlap lengths (310, 311, and 312 nucleotides), stratified by phase. Success rates are reported with exact 95% Clopper–Pearson confidence intervals. (B, E) Results for modified amino acid sequences. (B) Scatter density plot of SS score versus alignment score with the same phase-coloring scheme. (E) Fraction of successful predictions across overlap lengths for modified sequences, stratified by phase. Success rates are reported with exact 95% Clopper–Pearson confidence intervals. (C) Relationship between sequence-level SS composition (coil [C], sheet [E], helix [H]) and mean SS score across phases 0, 1, and 2. Dashed lines indicate LOESS regression fits with 99% confidence bands. \n\nFig 4. In silico design of convergent overlap with EGFP and AmpR. \nOverview of the multi-objective optimization process for EGFP and AmpR proteins using the transformer encoder framework. (A) Schematic of the transformer encoder–based workflow. The terminal 104 amino acids of EGFP and AmpR (across all tested overlap lengths) are concatenated and passed through the model. Outputs are scored by a composite metric integrating alignment, substitution, secondary structure preservation, and long-range interactions. Bracketed amino acids from the input are preserved to enforce positional constraints. The final sequence pair is selected from the output table by the highest average combined score. (B) Multiple-sequence alignment of all output pairs (n=3,624) for EGFP and AmpR across both optimization passes and windows (top to bottom). Diagonal dropout patterns correspond to localized MC dropout applied to the overlap region. Variability reflects bracket-constrained codon degeneracy introduced on the non-fixed partner sequence. (C) Predicted tertiary structures generated with AlphaFold3 for EGFP (left) and AmpR (right). Native and overlapping terminal regions relevant to the 311-nt overlap are highlighted (native in red, overlapped in blue); non-overlapping regions are shown in gray. \n\nFig 5. 
Trajectories of composite and component metrics during computational optimization of EGFP/AmpR convergent overlaps. \nOptimization performance for EGFP and AmpR across two passes, each consisting of 13 windows. Metrics shown include the combined score (top left), which integrates secondary structure (SS), ESM-2 contact map similarity, and alignment components, as well as the individual SS (top right), ESM-2 (bottom left), and amino acid alignment (bottom right) scores. Solid red (EGFP) and blue (AmpR) lines denote mean values, and shaded regions represent the minimum-maximum range across sequences within each window. \n\nFig 6. Comparison of computational overlap predictions across pLDDT brackets, with structural composition and optimization outcomes. \nAnalyses are based on random sampling of SwissProt protein pairs (n=100 for each bracket) from the AlphaFold2 Protein Structure Database (AF2). (A) Distribution of average pLDDT values for the terminal 104 residues across sampled sequences, with red dashed lines marking the confidence thresholds. (B) Counts of sequences assigned to pLDDT confidence brackets, shown for both total and terminal regions (internal white bars denote terminal regions). (C) Ternary plots of amino acid composition (helix [H], sheet [E], and coil [C]) for the primary and secondary sequence sets, colored by pLDDT bracket (low-intermediate, confident, and very high). 
Circles with black centers denote the centroid (mean position of all sequences within a given pLDDT bracket) in ternary space. (D) Optimization performance across convergent overlap phases (0, 1, 2) for Pass 1 (P1) and Pass 2 (P2). Plots show average secondary structure (SS Avg) and ESM-2 contact map similarity (ESM Avg) scores by optimization window, with shaded regions representing the range of values stratified by pLDDT bracket. (E) Density distributions of individual metric scores (SS, ESM, and Alignment) for the highest combined score sequences (primary and secondary) across phases 0, 1, and 2. \n\nReferences \n\n1. Barrell BG, Air GM, Hutchison CA. Overlapping genes in bacteriophage φX174. Nature. 1976 Nov;264(5581):34–41. \n2. Keese PK, Gibbs A. Origins of genes: “big bang” or continuous creation? Proc Natl Acad Sci USA. 1992 Oct 15;89(20):9489–93. \n3. Wright BW, Molloy MP, Jaschke PR. Overlapping genes in natural and engineered genomes. Nat Rev Genet. 2022 Mar;23(3):154–68. \n4. Grainger DC. The unexpected complexity of bacterial genomes. Microbiology. 2016 July 1;162(7):1167–72. \n5. Blazejewski T, Ho HI, Wang HH. Synthetic sequence entanglement augments stability and containment of genetic information in cells. Science. 2019 Aug 9;365(6453):595–8. \n6. Opuu V, Silvert M, Simonson T. Computational design of fully overlapping coding schemes for protein pairs and triplets. Sci Rep. 2017 Nov 20;7(1):15873. \n7. Sabath N, Graur D, Landan G. 
Same-strand overlapping genes in bacteria: compositional determinants of phase bias. Biology Direct. 2008;3(1):36. \n8. Kingsford C, Delcher AL, Salzberg SL. A Unified Model Explaining the Offsets of Overlapping and Near-Overlapping Prokaryotic Genes. Molecular Biology and Evolution. 2007 Sept;24(9):2091–8. \n9. Pallejà A, Harrington ED, Bork P. Large gene overlaps in prokaryotic genomes: result of functional constraints or mispredictions? BMC Genomics. 2008;9(1):335. \n10. Zehentner B, Ardern Z, Kreitmeier M, Scherer S, Neuhaus K. Evidence for Numerous Embedded Antisense Overlapping Genes in Diverse E. coli Strains [Internet]. 2020 [cited 2024 July 12]. Available from: http://biorxiv.org/lookup/doi/10.1101/2020.11.18.388249 \n11. Hücker SM, Vanderhaeghen S, Abellan-Schneyder I, Scherer S, Neuhaus K. The Novel Anaerobiosis-Responsive Overlapping Gene ano Is Overlapping Antisense to the Annotated Gene ECs2385 of Escherichia coli O157:H7 Sakai. Front Microbiol. 2018 May 14;9:931. \n12. Vanderhaeghen S, Zehentner B, Scherer S, Neuhaus K, Ardern Z. The novel EHEC gene asa overlaps the TEGT transporter gene in antisense and is regulated by NaCl and growth phase. Sci Rep. 2018 Dec 14;8(1):17875. \n13. Zehentner B, Ardern Z, Kreitmeier M, Scherer S, Neuhaus K. A Novel pH-Regulated, Unusual 603 bp Overlapping Protein Coding Gene pop Is Encoded Antisense to ompA in Escherichia coli O157:H7 (EHEC). Front Microbiol. 2020 Mar 20;11:377. \n14. Behrens M, Sheikh J, Nataro JP. 
Regulation of the Overlapping pic/set Locus in Shigella flexneri and Enteroaggregative Escherichia coli. Infect Immun. 2002 June;70(6):2915–25. \n15. Balabanov VP, Kotova VYu, Kholodii GY, Mindlin SZ, Zavilgelsky GB. A novel gene, ardD, determines antirestriction activity of the non-conjugative transposon Tn5053 and is located antisense within the tniA gene. FEMS Microbiol Lett. 2012 Dec;337(1):55–60. \n16. Delaye L, DeLuna A, Lazcano A, Becerra A. The origin of a novel gene through overprinting in Escherichia coli. BMC Evol Biol. 2008;8(1):31. \n17. Fellner L, Bechtel N, Witting MA, Simon S, Schmitt-Kopplin P, Keim D, et al. Phenotype of htgA (mbiA), a recently evolved orphan gene of Escherichia coli and Shigella, completely overlapping in antisense to yaaW. FEMS Microbiol Lett. 2014 Jan;350(1):57–64. \n18. Graf F, Zehentner B, Fellner L, Scherer S, Neuhaus K. Three Novel Antisense Overlapping Genes in E. coli O157:H7 EDL933. Gilk SD, editor. Microbiol Spectr. 2023 Feb 14;11(1):e02351-22. \n19. Cock PJA, Whitworth DE. Evolution of Gene Overlaps: Relative Reading Frame Bias in Prokaryotic Two-Component System Genes. J Mol Evol. 2007 Apr;64(4):457–62. \n20. Cock PJA, Whitworth DE. Evolution of Relative Reading Frame Bias in Unidirectional Prokaryotic Gene Overlaps. Molecular Biology and Evolution. 2010 Apr 1;27(4):753–6. \n21. Pavesi A. Prediction of two novel overlapping ORFs in the genome of SARS-CoV-2. Virology. 2021 Oct;562:149–57. \n22. Nelson CW, Ardern Z, Wei X. OLGenie: Estimating Natural Selection to Predict Functional Overlapping Genes. Tamura K, editor. Molecular Biology and Evolution. 2020 Apr 3;msaa087. \n23. Schlub TE, Buchmann JP, Holmes EC. A Simple Method to Detect Candidate Overlapping Genes in Viruses Using Single Genome Sequences. Malik H, editor. Molecular Biology and Evolution. 2018 Oct 1;35(10):2572–81. 
\n24. Chlebek JL, Leonard SP, Kang-Yun C, Yung MC, Ricci DP, Jiao Y, et al. Prolonging genetic circuit stability through adaptive evolution of overlapping genes. Nucleic Acids Research. 2023 July 21;51(13):7094–108. \n25. Decrulle AL, Frénoy A, Meiller-Legrand TA, Bernheim A, Lotton C, Gutierrez A, et al. Engineering gene overlaps to sustain genetic constructs in vivo. Braun EL, editor. PLoS Comput Biol. 2021 Oct 8;17(10):e1009475. \n26. Wang B, Papamichail D, Mueller S, Skiena S. Two Proteins for the Price of One: The Design of Maximally Compressed Coding Sequences. In: Carbone A, Pierce NA, editors. DNA Computing [Internet]. Berlin, Heidelberg: Springer Berlin Heidelberg; 2006 [cited 2024 July 10]. p. 387–98. (Lecture Notes in Computer Science; vol. 3892). Available from: http://link.springer.com/10.1007/11753681_31 \n27. Manuel Martí J, Hsu C, Rochereau C, Xu C, Blazejewski T, Nisonoff H, et al. GENTANGLE: integrated computational design of gene entanglements. Elofsson A, editor. Bioinformatics. 2024 June 21;btae380. \n28. Wichmann S, Scherer S, Ardern Z. Biological factors in the synthetic construction of overlapping genes. BMC Genomics. 2021 Dec;22(1):888. \n29. Leonard SP, Halvorsen T, Lim B, Park DM, Jiao Y, Yung M, et al. Creating overlapping genes by alternate-frame insertion [Internet]. 2024 [cited 2024 Nov 29]. Available from: http://biorxiv.org/lookup/doi/10.1101/2024.11.07.622342 \n30. Byeon GW, Expòsit M, Baker D, Seelig G. 
Design of overlapping genes using deep generative models of protein sequences [Internet]. Synthetic Biology; 2025 [cited 2025 Aug 8]. Available from: http://biorxiv.org/lookup/doi/10.1101/2025.05.06.652464 \n31. Choi SR, Lee M. Transformer Architecture and Attention Mechanisms in Genome Data Analysis: A Comprehensive Review. Biology. 2023 July 22;12(7):1033. \n32. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention Is All You Need [Internet]. arXiv; 2023 [cited 2023 Oct 18]. Available from: http://arxiv.org/abs/1706.03762 \n33. Lin T, Wang Y, Liu X, Qiu X. A survey of transformers. AI Open. 2022;3:111–32. \n34. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR. 2014;15(56):1929–1958. \n35. Gal Y, Ghahramani Z. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. Proceedings of The 33rd International Conference on Machine Learning. 2016;48:1050–1059. \n36. Lemay A, Hoebel K, Bridge CP, Befano B, De Sanjosé S, Egemen D, et al. Improving the repeatability of deep learning models with Monte Carlo dropout. npj Digit Med. 2022 Nov 18;5(1):174. \n37. Seoh R. Qualitative Analysis of Monte Carlo Dropout [Internet]. arXiv; 2020 [cited 2024 July 13]. Available from: https://arxiv.org/abs/2007.01720 \n38. Singer GAC, Hickey DA. Nucleotide Bias Causes a Genomewide Bias in the Amino Acid Composition of Proteins. Molecular Biology and Evolution. 
2000 Nov 1;17(11):1581–8. \n39. Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library [Internet]. arXiv; 2019 [cited 2024 July 13]. Available from: https://arxiv.org/abs/1912.01703 \n40. TensorFlow Developers. TensorFlow [Internet]. Zenodo; 2023 [cited 2024 Feb 24]. Available from: https://zenodo.org/doi/10.5281/zenodo.10126399 \n41. Zarrieß S, Voigt H, Schüz S. Decoding Methods in Neural Language Generation: A Survey. Information. 2021 Aug 30;12(9):355. \n42. Graczyk KM, Pawłowski J, Majchrowska S, Golan T. Self-normalized density map (SNDM) for counting microbiological objects. Sci Rep. 2022 June 22;12(1):10583. \n43. Subramanian K, Payne B, Feyertag F, Alvarez-Ponce D. The Codon Statistics Database: A Database of Codon Usage Bias. Saitou N, editor. Molecular Biology and Evolution. 2022 Aug 3;39(8):msac157. \n44. Cock PJA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009 June 1;25(11):1422–3. \n45. Moffat L, Jones DT. Increasing the accuracy of single sequence prediction methods using a deep semi-supervised learning framework. Xu J, editor. Bioinformatics. 2021 Nov 5;37(21):3744–51. \n46. Kabsch W, Sander C. Dictionary of protein secondary structure: Pattern recognition of hydrogen‐bonded and geometrical features. Biopolymers. 1983 Dec;22(12):2577–637. \n47. Henikoff S, Henikoff JG. Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA. 1992 Nov 15;89(22):10915–9. \n48. Eddy SR. Where did the BLOSUM62 alignment score matrix come from? Nat Biotechnol. 2004 Aug;22(8):1035–6. \n49. Jia K, Jernigan RL. New amino acid substitution matrix brings sequence alignments into agreement with structure matches. 
Proteins. 2021 June;89(6):671–82. \n50. Abramson J, Adler J, Dunger J, Evans R, Green T, Pritzel A, et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature. 2024 June 13;630(8016):493–500. \n51. Mirdita M, Schütze K, Moriwaki Y, Heo L, Ovchinnikov S, Steinegger M. ColabFold: making protein folding accessible to all. Nat Methods. 2022 June;19(6):679–82. \n52. Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021 Aug 26;596(7873):583–9. \n53. Lin Z, Akin H, Rao R, Hie B, Zhu Z, Lu W, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023 Mar 17;379(6637):1123–30. \n54. Singh J, Litfin T, Singh J, Paliwal K, Zhou Y. SPOT-Contact-LM: improving single-sequence-based prediction of protein contact map using a transformer language model. Martelli PL, editor. Bioinformatics. 2022 Mar 28;38(7):1888–94. \n55. Wang Z, Bovik AC, Sheikh HR, Simoncelli EP. Image quality assessment: from error visibility to structural similarity. IEEE Trans on Image Process. 2004 Apr;13(4):600–12. \n56. Wickham H. ggplot2: Elegant Graphics for Data Analysis [Internet]. New York, NY: Springer New York; 2009 [cited 2024 Mar 22]. Available from: https://link.springer.com/10.1007/978-0-387-98141-3 \n57. Wickham H, Averick M, Bryan J, Chang W, McGowan L, François R, et al. Welcome to the Tidyverse. JOSS. 2019 Nov 21;4(43):1686. \n58. 
Zhang S, Fan R, Liu Y, Chen S, Liu Q, Zeng W. Applications of transformer-based language models in bioinformatics: a survey. Bateman A, editor. Bioinformatics Advances. 2023 Jan 5;3(1):vbad001. \n59. Rost B. Twilight zone of protein sequence alignments. Protein Engineering, Design and Selection. 1999 Feb;12(2):85–94. \n60. Wichmann S, Scherer S, Ardern Z. Biological factors in the synthetic construction of overlapping genes. BMC Genomics. 2021 Dec;22(1):888. \n61. Lebre S, Gascuel O. The combinatorics of overlapping genes. 2016 [cited 2023 Dec 29]; Available from: https://arxiv.org/abs/1602.04971 \n62. Moffat L, Jones DT. Increasing the accuracy of single sequence prediction methods using a deep semi-supervised learning framework. Xu J, editor. Bioinformatics. 2021 Nov 5;37(21):3744–51. \n63. Lin Z, Akin H, Rao R, Hie B, Zhu Z, Lu W, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023 Mar 17;379(6637):1123–30. \n64. Zhang G, Gurtu V, Kain SR. An Enhanced Green Fluorescent Protein Allows Sensitive Detection of Gene Transfer in Mammalian Cells. Biochemical and Biophysical Research Communications. 1996 Oct;227(3):707–11. \n65. Taton A, Unglaub F, Wright NE, Zeng WY, Paz-Yepes J, Brahamsha B, et al. Broad-host-range vector system for synthetic biology and biotechnology in cyanobacteria. Nucleic Acids Research. 2014 Sept 29;42(17):e136–e136. \n66. Zhang Y, Skolnick J. Scoring function for automated assessment of protein structure template quality. 
Proteins. 2004 Dec;57(4):702–10. \n67. Alderson TR, Pritišanac I, Kolarić Đ, Moses AM, Forman-Kay JD. Systematic identification of conditionally folded intrinsically disordered regions by AlphaFold2. Proc Natl Acad Sci USA. 2023 Oct 31;120(44):e2304302120. \n68. Tunyasuvunakool K, Adler J, Wu Z, Green T, Zielinski M, Žídek A, et al. Highly accurate protein structure prediction for the human proteome. Nature. 2021 Aug 26;596(7873):590–6. \n69. Binder JL, Berendzen J, Stevens AO, He Y, Wang J, Dokholyan NV, et al. AlphaFold illuminates half of the dark human proteins. Current Opinion in Structural Biology. 2022 June;74:102372. \n70. Abbas U, Chen J, Shao Q. Assessing Fairness of AlphaFold2 Prediction of Protein 3D Structures [Internet]. Bioinformatics; 2023 [cited 2025 Sept 14]. Available from: http://biorxiv.org/lookup/doi/10.1101/2023.05.23.542006 \n71. Guo HB, Perminov A, Bekele S, Kedziora G, Farajollahi S, Varaljay V, et al. AlphaFold2 models indicate that protein sequence determines both structure and dynamics. Sci Rep. 2022 June 23;12(1):10696. \n72. Ardern Z. Alternative Reading Frames are an Underappreciated Source of Protein Sequence Novelty. J Mol Evol. 2023 Oct;91(5):570–80. \n73. Cassan E, Arigon-Chifolleau AM, Mesnard JM, Gross A, Gascuel O. Concomitant emergence of the antisense protein gene of HIV-1 and of the pandemic. Proc Natl Acad Sci USA. 2016 Oct 11;113(41):11537–42. \n74. Chirico N, Vianelli A, Belshaw R. Why genes overlap in viruses. Proc R Soc B. 2010 Dec 22;277(1701):3809–17. \n75. Pavesi A. Origin, Evolution and Stability of Overlapping Genes in Viruses: A Systematic Review. Genes. 2021 May 26;12(6):809. \n76. Pavesi A. Asymmetric evolution in viral overlapping genes is a source of selective protein adaptation. Virology. 2019 June;532:39–47. \n77. Pavesi A, Romerio F. 
Extending the Coding Potential of Viral Genomes with Overlapping Antisense ORFs: A Case for the De Novo Creation of the Gene Encoding the Antisense Protein ASP of HIV-1. Viruses. 2022 Jan 14;14(1):146. \n78. Safari M, Jayaraman B, Yang S, Smith C, Fernandes JD, Frankel AD. Functional and structural segregation of overlapping helices in HIV-1. eLife. 2022 May 5;11:e72482. \n79. Fernandes JD, Faust TB, Strauli NB, Smith C, Crosby DC, Nakamura RL, et al. Functional Segregation of Overlapping Genes in HIV. Cell. 2016 Dec;167(7):1762-1773.e12. \n80. Krakauer DC. Stability and Evolution of Overlapping Genes. Evolution. 2000 June;54(3):731–9. \n81. Xu C, Chlebek JL, Allen JE, Nisonoff H, Park DM. Generative Protein Design for Overlapping Genes. In OpenReview; 2025. Available from: https://openreview.net/forum?id=35 \n82. On a Bicriterion Formulation of the Problems of Integrated System Identification and System Optimization. IEEE Trans Syst, Man, Cybern. 1971 July;SMC-1(3):296–7. \n83. Miettinen K. Nonlinear Multiobjective Optimization [Internet]. Boston, MA: Springer US; 1998 [cited 2025 Sept 9]. (Hillier FS, editor. International Series in Operations Research & Management Science; vol. 12). Available from: http://link.springer.com/10.1007/978-1-4615-5563-6 \n84. Marler RT, Arora JS. Survey of multi-objective optimization methods for engineering. Structural and Multidisciplinary Optimization. 2004 Apr 1;26(6):369–95. \n85. Das I, Dennis JE. 
Normal-Boundary Intersection: A New Method for Generating the Pareto Surface in Nonlinear Multicriteria Optimization Problems. SIAM J Optim. 1998 Aug;8(3):631–57. \n86. Deb K, Pratap A, Agarwal S, Meyarivan T. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans Evol Computat. 2002 Apr;6(2):182–97. \n87. Ardern Z, Neuhaus K, Scherer S. Are Antisense Proteins in Prokaryotes Functional? Front Mol Biosci. 2020 Aug 14;7:187. \n88. Pavesi A, Vianelli A, Chirico N, Bao Y, Blinkova O, Belshaw R, et al. Overlapping genes and the proteins they encode differ significantly in their sequence composition from non-overlapping genes. Jan E, editor. PLoS ONE. 2018 Oct 19;13(10):e0202513. \n89. Rancurel C, Khosravi M, Dunker AK, Romero PR, Karlin D. Overlapping Genes Produce Proteins with Unusual Sequence Properties and Offer Insight into De Novo Protein Creation. J Virol. 2009 Oct 15;83(20):10719–36. \n90. Fonseca MM, Harris DJ, Posada D. Origin and Length Distribution of Unidirectional Prokaryotic Overlapping Genes. G3 Genes|Genomes|Genetics. 2014 Jan 1;4(1):19–27.","source_license":"CC-BY-4.0","license_restricted":false}