bioRxiv preprint. doi: https://doi.org/10.1101/2025.11.07.687268; this version posted November 10, 2025. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
Abstract
Overlapping genes allow multiple proteins to be encoded from a single DNA sequence, including convergent (antisense; tail-to-tail) orientations across three reading frames (phases 0, 1, and 2), with phase 1 most frequently observed in nature. Designing such overlaps is challenging due to codon degeneracy, phase-specific biases, and the need to preserve structural integrity for both proteins. Here, a purpose-built transformer encoder is introduced, trained on a balanced synthetic dataset of convergent overlaps spanning diverse prokaryotic genomes and GC contents. Controlled amino acid substitutions were incorporated during training to enhance model generalization, particularly for phase 1 overlaps. At inference, Monte Carlo dropout enabled uncertainty-aware sampling of synonymous codon solutions, which were iteratively refined using a windowed, multi-objective optimization framework. Candidate overlaps were scored using composite weighting across secondary structure preservation, substitution similarity, alignment identity, and ESM-2 contact map similarity, with the structural similarity index (SSIM) applied as a rapid proxy for structural fidelity. This approach generated convergent overlaps across all phases, with phase 1 showing the highest success rates. Optimization trajectories revealed distinct dynamics, with secondary structure preservation steadily increasing despite its lower weight. External validation using SwissProt proteins stratified by AlphaFold2 pLDDT confidence supported generalization to proteins with differing rigidity, yielding high secondary structure preservation in silico. These results demonstrate that transformer models trained directly at the nucleotide level, when coupled with uncertainty-aware inference and lightweight structural proxies, can support the computational design of synthetic overlapping genes without requiring full structural prediction. This framework offers a scalable path for phase-specific, codon-aware overlap design under realistic constraints.
Introduction
Overlapping genes are an important area of interest in prokaryotic genetics because they encode multiple functional products from a single DNA sequence (1–4). For example, a single gene sequence might code for two distinct proteins, each playing a different role in cellular function. Advances in identifying and designing these overlapping sequences promise genetic constructs that are both more efficient and more compact, potentially improving the functionality and stability of engineered organisms (5,6). This phenomenon poses challenges but also opens new avenues in genetic engineering and synthetic biology (3).
Overlapping genes have been described using different approaches, but can be broadly grouped into unidirectional (sense; encoded on the same strand, → →) or convergent (antisense; encoded on opposite strands) orientations (Fig 1A); convergent overlaps can be further refined into “head-to-head” (← →) or “tail-to-tail” (→ ←) orientations (7). Convergent overlapping genes occur in the same reading frame (phase 0), or frameshifted by one (phase 1) or two (phase 2) nucleotides. For example, in phase 2, convergent overlaps share their third (degenerate) codon positions. Research into convergent (tail-to-tail; antisense) overlapping gene sequences in prokaryotes has included analyses of the length distribution among overlapping and near-overlapping genes (8), investigations into their prevalence (and misannotations) within prokaryotic genomes (9,10), and the characterization of convergent gene pairs (11–18). Notably, both convergent and unidirectional overlaps exhibit a non-uniform distribution with a significant phase bias, predominantly favoring phase 1 overlaps (8,19,20). Although the majority of convergent overlaps fall into phase 2, nearly half are instead found in phase 1 when 4-nucleotide overlaps are excluded (8). These studies collectively highlight the phase preference and distribution patterns of naturally occurring gene overlaps.
The successful prediction of overlapping nucleotide sequences encoding two amino acid sequences poses unique computational challenges, and has been previously explored (6,21–25); the landscape of available methods has been summarized (22), with additional information on more recently released programs provided in Supplementary Table 1. An early solution for the design of unidirectional overlapping protein sequences was an algorithm to identify the shortest DNA sequence that could encode two amino acid sequences (26). More recent examples include a dynamic programming algorithm (6) and a subsequent method that combines dynamic programming with a hidden Markov model (HMM) (5,27). The strategy was further extended to arbitrary pairs of natural protein domains (28). These approaches have collectively shown that it is possible to design fully embedded overlapping genes that encode functional proteins with high homology to starting sequences, underscoring the potential for engineered overlapping genes in prokaryotic genetics. Another approach, termed overlapping, alternate-frame insertion (OAFI), demonstrated the ability to design unidirectional overlaps through insertion of one gene into an alternate reading frame of another by targeting sites that may tolerate insertions (29). Recently, Byeon et al. (2025) demonstrated that deep generative models can be used to design synthetic overlapping genes spanning distinct protein families, achieving high in silico and experimental success rates even under the constraints of the standard genetic code (30). Their results indicate that sequence space may contain many viable overlapping solutions, suggesting broader applicability of computational approaches to both unidirectional and convergent overlaps.
Transformer models have demonstrated superior performance in various sequence modeling tasks due to their ability to capture long-range dependencies and contextual information (31–33). This study leverages a novel application of the transformer encoder-based model architecture to predict synthetic convergent overlapping genes in prokaryotes. Unlike dynamic programming approaches, transformers utilize self-attention mechanisms to process the entire input sequence simultaneously (32), enabling the model to capture patterns and dependencies that span long sequences. In this study, Monte Carlo (MC) dropout is applied during inference to enhance prediction performance. Dropout, more generally, is a method of reducing overfitting in deep neural networks during training (34). MC dropout retains dropout activity during inference, effectively sampling from a distribution of possible model predictions (34–36). This provides a practical Bayesian approximation of model uncertainty (37), which is especially advantageous for predicting overlapping genes where the degeneracy of the genetic code permits many distinct but functionally viable nucleotide encodings. By sampling multiple plausible solutions, MC dropout enables exploration of the many synonymous coding possibilities inherent to the genetic code, increasing the likelihood of identifying overlaps that preserve the structural and functional integrity of both proteins.
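The idea of keeping dropout active at inference can be illustrated with a toy, pure-Python sketch. It is not the study's model: the single linear layer, the `hidden` vector, and the `weights` matrix are hypothetical stand-ins, and only the dropout rate of 0.275 is taken from the Methods. In a PyTorch model the equivalent step would be leaving the dropout layers in training mode during prediction.

```python
import random

NUCLEOTIDES = "ACGT"


def forward_with_dropout(hidden, weights, rate, rng):
    """One stochastic forward pass with dropout left active."""
    keep = 1.0 - rate
    # Inverted dropout: zero each unit with probability `rate`, rescale the rest.
    dropped = [h / keep if rng.random() < keep else 0.0 for h in hidden]
    # Toy linear layer (hypothetical): one logit per nucleotide.
    return [sum(w * h for w, h in zip(row, dropped)) for row in weights]


def mc_dropout_samples(hidden, weights, rate=0.275, passes=50, seed=0):
    """Count which nucleotide wins the argmax across repeated stochastic passes."""
    rng = random.Random(seed)
    counts = {n: 0 for n in NUCLEOTIDES}
    for _ in range(passes):
        logits = forward_with_dropout(hidden, weights, rate, rng)
        best = max(range(len(logits)), key=logits.__getitem__)
        counts[NUCLEOTIDES[best]] += 1
    return counts
```

With `rate=0.0` every pass returns the same nucleotide, which mirrors deterministic decoding; with a non-zero rate the counts can spread over several nucleotides, and that spread is what is used as a source of alternative codon proposals.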
The objectives of this study are to (i) evaluate the ability of transformer encoder models to recover convergent overlaps across all three phases, (ii) quantify the impact of overlap length and sequence variability on prediction success, and (iii) assess multi-objective optimization as an alternative to full structural prediction in improving overlap fidelity.
Methods
Synthetic Overlapping Gene Dataset Construction for Model Training
While naturally occurring convergent overlaps have been systematically assessed and documented (9), there is no readily available database of convergent overlapping genes with equally distributed overlap lengths. Given this limitation, a database of synthetic convergent overlaps was generated using a diverse set of open reading frames (ORFs) from prokaryotic genomes. Synthetic convergent overlapping genes were created using ORF sequences from 24 different prokaryotic genomes (Supplementary Table 2), using R (version 4.3.1) and Python (version 3.8.8). Genomes were selected to have a wide range of genomic GC content (ranging from 23.3% to 74.9%, with a mean of 52.41%), with the aim of controlling for potential amino acid compositional bias driven by GC content (38). The between-strain coding GC content (calculated using the set of ORF sequences from each strain) ranged from 20.7% to 72.7%, with a mean of 50.5%.
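As an illustration of the coding GC calculation, the following minimal sketch pools all ORFs of a strain before computing the percentage. Whether the original analysis pooled ORFs or averaged per-ORF values is not stated, so pooling is an assumption here, and both function names are hypothetical.

```python
def gc_content(seq):
    """GC percentage of a nucleotide sequence, ignoring non-ACGT symbols."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return 100.0 * gc / acgt if acgt else 0.0


def coding_gc(orfs):
    """Pooled GC percentage over all ORFs of one strain (assumed definition)."""
    return gc_content("".join(orfs))
```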
Briefly, for each strain, sequences were downloaded from the NCBI database (https://www.ncbi.nlm.nih.gov/genome/) and each FASTA file was processed to extract ORFs, which were stored in a separate list for each strain (Fig 1B). ORF sequences shorter than 450 nucleotides were removed to exclude sequences with insufficient opportunity for varied random sampling during overlap construction. A function was defined to randomly select in-frame segments from the ORF sequence lists, to include a primary and secondary sequence either with or without a convergent overlapping region. First, a contiguous in-frame 312-nucleotide section from each ORF sequence was extracted and used as the basis for the primary sequence. The primary sequence segments were modified to ensure they contained start and randomly selected stop codons, as these were removed during initial sequence processing.
The reverse complement of this synthetic gene was selected to generate the secondary sequence with a defined convergent overlap length. A uniform set of overlap lengths was generated by extracting the reverse complement for each length, adding a stop codon, and replacing any in-frame stop codons with a random codon. Both the added stop codon and in-frame replacements were selected such that stop codons in the forward direction were not introduced. However, owing to the methodology employed, a portion of generated secondary sequences with overlapping sequences contained more than one stop codon and were subsequently filtered from the dataset to ensure all sequences contained only one stop codon and encoded a single amino acid sequence.
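The reverse-complement and stop-codon replacement steps can be sketched as follows. This is a simplified illustration rather than the published pipeline: the function names are hypothetical, and the forward-strand check only covers the same-frame (phase 0) case, where each reverse-strand codon is the exact reverse complement of one forward-strand codon.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOPS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def codons(seq):
    """Split a sequence into complete in-frame codons."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def in_frame_stops(seq):
    """Indices of in-frame stop codons."""
    return [i for i, c in enumerate(codons(seq)) if c in STOPS]


def replace_internal_stops(secondary, rng):
    """Replace every in-frame stop except the final codon with a random codon.

    Assumption: a replacement is rejected if it is a stop itself or if its
    reverse complement is a stop (phase 0 forward-strand check only).
    """
    out = codons(secondary)
    for i in range(len(out) - 1):
        while out[i] in STOPS:
            candidate = "".join(rng.choice("ACGT") for _ in range(3))
            if candidate not in STOPS and reverse_complement(candidate) not in STOPS:
                out[i] = candidate
    # Keep any trailing bases that do not form a full codon.
    return "".join(out) + secondary[len(out) * 3:]
```

For phases 1 and 2 a replacement codon spans two forward-strand codons, so a complete check would need to inspect both neighbors; that extension is omitted here.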
The amino acid sequence pairs with known overlaps were modified to incorporate up to four BLOSUM62-weighted amino acid changes. The changes altered the amino acid sequences such that either 1) the internal reverse stop codon was removed from both amino acid sequences (primary and secondary), while keeping the known overlap for the original amino acid sequences, or 2) the internal reverse stop codon was removed from both amino acid sequences (primary and secondary), while also introducing additional amino acid modifications per sequence. This yielded modified primary and secondary amino acid sequences, with an unmodified known convergent overlap nucleotide sequence capable of encoding amino acid sequences with high sequence identity to the primary and secondary amino acid sequences.
The length of the input amino acid sequences was limited to 104 amino acids to manage computational complexity during the training and inference phases. Each amino acid was separated by a space, each sequence included a terminal asterisk (*), and the primary and secondary sequences were concatenated to generate a single input sequence for training.
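A minimal sketch of this input formatting is given below. The exact layout is an assumption: here each protein gets its own terminal asterisk and the two formatted sequences are joined by a single space, and both function names are hypothetical.

```python
def format_protein(aa_seq, max_len=104):
    """Space-separate residues and append a terminal asterisk."""
    if len(aa_seq) > max_len:
        raise ValueError("sequence exceeds the %d-residue limit" % max_len)
    return " ".join(aa_seq) + " *"


def build_model_input(primary, secondary):
    """Concatenate the formatted primary and secondary sequences."""
    return format_protein(primary) + " " + format_protein(secondary)
```

Under these assumptions, `build_model_input("MK", "MA")` gives `"M K * M A *"`.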
To balance the representation of different overlap lengths, data were stratified by overlap length. The overlapping gene training data were balanced to ensure equal representation for each strain and each length of nucleotide overlap (specifically, convergent overlaps of 199 to 312 nucleotides). The final dataset was composed of synthetic gene overlaps, each represented by a pair of modified amino acid sequences with a known target convergent overlap sequence. This dataset served as the training set for the transformer encoder-based models.
Initial Model Description and Training
The synthetic overlapping gene dataset was used to train transformer encoder-based models with the aim of generating convergent (tail-to-tail) overlaps for any two specified amino acid sequences. A total of 8,892,000 combined pairs were used, corresponding to 78,000 pairs for each of the 114 convergent overlap lengths (199 to 312 nucleotides).
Transformers have demonstrated superior performance in various sequence modeling tasks, particularly where understanding long-range dependencies is important (31–33). As such, this study utilizes a PyTorch-based transformer encoder model for overlap prediction (39). Tokenization and text vectorization were performed using the TensorFlow Keras API (40), and data were split into training and validation sets. During training, each model used an embedding dimension of 156, 13 attention heads, a feed-forward network dimension of 192, 5 blocks, a dropout rate of 0.1, and a total of 802,983 trainable parameters (Fig. 2D). Models were compiled using the Adam optimizer with cross-entropy as the loss function.
Each model was trained for 3 epochs with a batch size of 32 on a single NVIDIA GeForce RTX 5070 Ti GPU with 16 GB of memory, and reached a validation accuracy of ≥90% by the third and final epoch.
Test Dataset Generation and Model Inference
Amino acid pairs were generated as described above for preparation of the training dataset, but without the final modification step. A different set of strains (n = 12) from those used for training was selected, again with a wide range of genomic GC content (ranging from 23.6% to 74.1%, with a mean of 50.2%) (Supplementary Table 3). Given that inference was performed using only coding sequences, the coding sequence GC content was calculated for each strain, and the corresponding mean GC value was used for subsequent analyses (varying from 24.2% to 72.1%, with a mean of 50.0%).
Inference was performed using the trained models along with the same tokenizer to vectorize the input sequence. Using a “greedy decoder” (41), the token with the highest probability at each step of the sequence generation process was selected.
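Greedy decoding reduces to an argmax at each position. The sketch below assumes the model output is available as one row of probabilities per position over a nucleotide vocabulary; the function name and the `ACGT` vocabulary order are hypothetical.

```python
def greedy_decode(prob_rows, vocab="ACGT"):
    """Pick the highest-probability token at each position."""
    return "".join(
        vocab[max(range(len(row)), key=row.__getitem__)] for row in prob_rows
    )
```

For example, `greedy_decode([[0.1, 0.7, 0.1, 0.1], [0.6, 0.2, 0.1, 0.1]])` returns `"CA"`.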
MC dropout was implemented during inference to allow exploration of the uncertainty in model predictions (eg, variation in predicted overlap sequences), leveraging the benefits of Bayesian inference, as previously described (34,36,42). During inference, depending on the objective, feedforward dropout was set to 0.275 for initial predictions.
An algorithm was developed to generate overlapping gene sequences from input amino acid sequences using the trained models. The algorithm first back-translated the input amino acid sequences into nucleotide sequences, sampling codons according to E. coli codon usage frequencies to reflect the probabilistic distribution of codons for each amino acid (43). After back-translation, the models were used to predict potential overlaps within these nucleotide sequences. Predicted overlaps were then integrated into the original nucleotide sequences, resulting in a collection of modified sequences that incorporated the predicted overlapping regions downstream of the remaining back-translated nucleotide sequence. These modified sequences (ie, the back-translated and predicted regions) were then translated into amino acid sequences to assess the accuracy of the predictions. In instances where the correct amino acid sequence was not identified, additional rounds of prediction and back-translation were performed. This iterative process continued until either an overlap yielding amino acid sequences meeting the pre-specified criteria was identified, or the specified number of inference rounds was completed without a predicted overlap (defined as an unsuccessful prediction).
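The back-translate, predict, translate, and retry loop can be sketched as below. Several pieces are placeholders: the codon weights are illustrative values for three amino acids only and are not real E. coli usage frequencies, `propose` stands in for the trained model, and `is_acceptable` stands in for the pre-specified acceptance criteria.

```python
import random

# Illustrative weights only (assumption); a real table would cover all codons.
CODON_WEIGHTS = {
    "M": {"ATG": 1.0},
    "K": {"AAA": 0.75, "AAG": 0.25},
    "L": {"CTG": 0.5, "CTT": 0.2, "TTA": 0.15, "TTG": 0.15},
}
CODON_TO_AA = {c: aa for aa, table in CODON_WEIGHTS.items() for c in table}


def back_translate(protein, rng):
    """Sample one codon per residue in proportion to its usage weight."""
    return "".join(
        rng.choices(list(CODON_WEIGHTS[aa]), weights=list(CODON_WEIGHTS[aa].values()))[0]
        for aa in protein
    )


def translate(dna):
    return "".join(CODON_TO_AA[dna[i:i + 3]] for i in range(0, len(dna), 3))


def iterate_rounds(protein, propose, is_acceptable, max_rounds, rng):
    """Repeat back-translation and prediction until a candidate is accepted.

    Returns (candidate, round_index), or (None, max_rounds) when every round
    fails, which corresponds to an unsuccessful prediction.
    """
    for round_index in range(1, max_rounds + 1):
        scaffold = back_translate(protein, rng)
        candidate = propose(scaffold)  # hypothetical model call
        if is_acceptable(translate(candidate)):
            return candidate, round_index
    return None, max_rounds
```

Because each round draws a fresh back-translation, repeated rounds start the model from different synonymous scaffolds rather than retrying the same input.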
An alignment score was calculated for each sequence prediction. The predicted sequence was aligned against the known overlap using the Needleman-Wunsch algorithm (implemented via pairwise2 from Biopython (44)). For sequences with a known overlap and a predicted overlap sequence, the alignment score was normalized by the length of the overlap, giving a scale of 0 to 1 to facilitate comparison across predictions.
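A length-normalized Needleman-Wunsch score can be sketched in pure Python as follows. The scoring scheme is an assumption (match = 1, mismatch = 0, gap = 0, comparable to Biopython's `pairwise2.align.globalxx`), since the parameters used in the study are not stated here; with this scheme the raw score never exceeds the known overlap length, so the ratio stays between 0 and 1.

```python
def needleman_wunsch_score(a, b, match=1.0, mismatch=0.0, gap=0.0):
    """Global alignment score via dynamic programming (score only)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def normalized_alignment_score(predicted, known):
    """Alignment score divided by the known overlap length."""
    return needleman_wunsch_score(predicted, known) / len(known)
```

For example, `normalized_alignment_score("ACGA", "ACGT")` is 0.75, because three of the four known positions can be matched.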
Secondary Structure Prediction, Substitution Score, and Structure Prediction
Secondary structures were predicted using S4PRED, which requires only an input amino acid sequence, without the need for multiple sequence alignments or known homologous sequences (45). S4PRED outputs were formatted as “ss2” files, from which the predicted secondary structure sequence was extracted for downstream analyses. The model employs a simplified three-state classification: α-helix (H), β-sheet (E), and coil/loop (C), which corresponds to a coarse-grained version of the DSSP (Dictionary of Secondary Structure of Proteins) annotations (46). Minor code modifications were implemented to enable efficient batch processing, substantially reducing runtime. These predicted structures were used to evaluate the structural compatibility of convergent overlapping gene candidates. The substitution matrices used were BLOSUM62 (47,48) and ProtSub (49). Protein structures were predicted with either AlphaFold3 or ColabFold (50–52).
Optimization of predicted sequences
An iterative, windowed optimization procedure was used to convert paired amino-acid inputs into convergent (tail-to-tail) overlapping nucleotide sequences while preserving fold-relevant features of both proteins and maintaining sequence similarity. Input amino-acid sequences were first back-translated to an initial nucleotide scaffold using E. coli codon usage frequencies. The region targeted for overlap design was tiled into partially overlapping amino-acid windows; in the experiments reported here, windows were 10 amino acids with an 8-amino-acid stride (model receptive field used = 105 AA). During optimization, only nucleotides mapping to the current window were permitted to change; flanking nucleotides were held fixed except at user-designated residues enforced by token-level biases (for example, to preserve active-site residues or motifs). Bracketed constraints on the forward strand were mirrored on the reverse-complement strand so that enforced bases remained consistent with the convergent orientation.
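The tiling of the target region into 10-residue windows with an 8-residue stride can be sketched as below. Clipping the final window at the end of the region is an assumption, as is the helper that maps a window to forward-strand nucleotide coordinates; both function names are hypothetical.

```python
def tile_windows(n_residues, size=10, stride=8):
    """Return (start, end) residue windows, end exclusive, covering the region."""
    windows = []
    start = 0
    while start < n_residues:
        end = min(start + size, n_residues)
        windows.append((start, end))
        if end == n_residues:
            break
        start += stride
    return windows


def window_to_nucleotides(window):
    """Forward-strand nucleotide span of a residue window (3 bases per residue)."""
    start, end = window
    return 3 * start, 3 * end
```

For a 26-residue region, `tile_windows(26)` gives `[(0, 10), (8, 18), (16, 26)]`, so adjacent windows share two residues.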
Within each window a transformer-encoder was executed in an uncertainty-aware mode: Monte Carlo (MC) dropout was retained at inference to generate diverse synonymous codon proposals. Token-level fixed-logit (base) biases were applied at specified nucleotide positions to enforce required bases during sampling; combined forward/reverse fixed-logit maps ensured constraints were respected on both strands. Each sampled nucleotide pair was translated to amino-acid sequences and duplicate amino-acid solutions were removed prior to scoring.
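One way to picture fixed-logit biases and their mirroring onto the opposite strand is the sketch below. It assumes constraints are stored as a position-to-base mapping and that a large additive bias is enough to force the argmax; the names and the bias magnitude are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def apply_fixed_logits(logits, fixed, vocab="ACGT", bias=1e4):
    """Add a large bias to the required base at each constrained position."""
    out = [row[:] for row in logits]
    for pos, base in fixed.items():
        out[pos][vocab.index(base)] += bias
    return out


def mirror_constraints(fixed, length):
    """Map forward-strand constraints onto the reverse-complement strand."""
    return {length - 1 - pos: COMPLEMENT[base] for pos, base in fixed.items()}
```

For a 6-nucleotide region, a forward constraint of `A` at position 0 mirrors to `T` at position 5 on the reverse-complement strand.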
Structural compatibility was assessed using single-sequence secondary structure prediction (S4PRED; three-state H/E/C). Secondary structure preservation was reported as the percentage of residues for which a candidate’s predicted state matched the original sequence prediction. Substitution similarity was computed using BLOSUM62 and normalized to the original sequence’s self-score to yield a percent similarity; global alignment identity was also measured.
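The two percentages described here can be sketched as follows. Position-by-position comparison of equal-length, ungapped sequences is an assumption, and the matrix passed in is only a small BLOSUM62 excerpt for illustration, not the full matrix.

```python
# Partial BLOSUM62 excerpt for illustration; a full matrix is needed in practice.
BLOSUM62_EXCERPT = {
    ("L", "L"): 4, ("I", "I"): 4, ("L", "I"): 2,
    ("K", "K"): 5, ("R", "R"): 5, ("K", "R"): 2,
}


def ss_preservation(original_ss, candidate_ss):
    """Percentage of positions with the same predicted H/E/C state."""
    if len(original_ss) != len(candidate_ss):
        raise ValueError("state strings must have equal length")
    matches = sum(o == c for o, c in zip(original_ss, candidate_ss))
    return 100.0 * matches / len(original_ss)


def substitution_similarity(original, candidate, matrix):
    """Summed substitution score as a percentage of the original's self-score."""
    def score(a, b):
        return matrix[(a, b)] if (a, b) in matrix else matrix[(b, a)]

    raw = sum(score(o, c) for o, c in zip(original, candidate))
    self_score = sum(score(o, o) for o in original)
    return 100.0 * raw / self_score
```

For example, `ss_preservation("HHHHCCEE", "HHHCCCEE")` is 87.5, since seven of eight states agree.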
ESM-2 (53) (model: esm2_t12_35M_UR50D) was used to generate residue-residue contact probability matrices, having demonstrated utility as a fast and effective single-sequence prediction feature (54). Contact map similarity was assessed using the structural similarity index (SSIM) (55), computed between ESM-2-predicted contact maps of candidate overlap sequences and their respective originals. SSIM quantifies local and global structural pattern agreement (range 0-1, with 1 indicating identical maps). Raw values were linearly scaled to a percentage (0-100) to yield the ESM-2 contact map score. This metric served as a rapid, computationally efficient proxy for structural preservation, enabling large-scale overlap design iterations while retaining sensitivity to perturbations in contact map topology.
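To make the metric concrete, the sketch below computes a simplified SSIM over two contact maps given as nested lists. It is a single-window (global) variant, which is an assumption: standard SSIM implementations average the same statistic over local sliding windows, and the implementation used in the study is not specified here. The constants follow the usual choice of (0.01 L)^2 and (0.03 L)^2 for data range L.

```python
def global_ssim(x, y, data_range=1.0):
    """Single-window SSIM between two equally sized 2D maps (simplified)."""
    xs = [v for row in x for v in row]
    ys = [v for row in y for v in row]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    vx = sum((v - mx) ** 2 for v in xs) / n
    vy = sum((v - my) ** 2 for v in ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / n
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    )


def contact_map_score(x, y):
    """Linear rescaling of the SSIM value to a percentage."""
    return 100.0 * global_ssim(x, y)
```

Identical maps give a value of 1 (a score of 100), while maps whose contact patterns are anti-correlated fall below that and can become negative in this simplified form.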
Candidates were ranked by a fixed weighted composite score across all experiments: 0.15 (secondary structure preservation), 0.15 (substitution similarity), 0.1 (alignment identity), and 0.6 (ESM-2 contact map similarity). Although this weighting scheme was not exhaustively optimized, empirical testing with randomly selected SwissProt proteins indicated that it yielded stable rankings across experiments and supported overlap generation while balancing structural fidelity with sequence similarity (Figs. S3, S4, and S5). The top-ranked candidate for the window was integrated, and optimization proceeded to the next window. Multiple full passes across windows were permitted for refinement. After the first structurally successful candidate, global feedforward and attention dropout levels were reduced to shift sampling from exploratory search toward fine-tuning, though the weighting scheme itself remained unchanged.
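The composite ranking can be written as a weighted sum. The weights below are the ones stated in the text; the metric key names and the candidate dictionary layout are hypothetical, and all four metrics are assumed to be on the same 0 to 100 scale.

```python
WEIGHTS = {
    "secondary_structure": 0.15,
    "substitution": 0.15,
    "alignment": 0.10,
    "contact_map": 0.60,
}


def composite_score(metrics):
    """Weighted sum of the four percentage metrics."""
    return sum(WEIGHTS[name] * metrics[name] for name in WEIGHTS)


def rank_candidates(candidates):
    """Sort candidates from best to worst composite score."""
    return sorted(candidates, key=lambda c: composite_score(c["metrics"]), reverse=True)
```

Because the contact map term carries 0.6 of the weight, a candidate scoring 60/60/60/80 (composite 72) outranks one scoring 90/90/90/50 (composite 66).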
Efficiency measures implemented for reproducibility and speed included: (i) caching of secondary structure predictions so identical amino-acid sequences were scored only once, and (ii) caching of ESM-2 embeddings and contact maps. All candidate metadata (attempt, window, per-candidate metrics, and secondary structure predictions) were retained in memory and exported at the end of each input row to permit analysis.
Mutual information, PCA, and robustness testing
Terminal regions (104 amino acids) from 250 randomly selected SwissProt-derived proteins were used as seeds to generate ensembles of mutated sequences. For each seed, trajectories were created by stepwise amino acid substitutions sampled from the 20-letter alphabet, introducing one mutation per step for up to 75 iterations. This yielded a progressive set of variant sequences per seed protein. Each sequence in the ensemble was then scored for global alignment identity, ESM-2 contact map similarity, and secondary structure match relative to the reference. These metric vectors formed the input for subsequent mutual information (MI) and principal component analyses.
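A mutation trajectory of this kind can be sketched as below. Two details are assumptions: positions are drawn uniformly and may be hit more than once, and the replacement residue is drawn uniformly from the other 19 amino acids so that every step changes the sequence.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mutation_trajectory(seed_seq, steps=75, rng=None):
    """Return the sequence after each of `steps` single-residue substitutions."""
    rng = rng or random.Random()
    current = list(seed_seq)
    trajectory = []
    for _ in range(steps):
        pos = rng.randrange(len(current))
        choices = [aa for aa in AMINO_ACIDS if aa != current[pos]]
        current[pos] = rng.choice(choices)
        trajectory.append("".join(current))
    return trajectory
```

Each element of the returned list differs from its predecessor at exactly one position, giving a progressive series of variants to score against the seed.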
MI was estimated after quantile discretization of continuous scores into 20 bins to evaluate pairwise dependence among objectives. To reduce sensitivity to binning, MI was also estimated using a k-nearest-neighbor approach. Statistical significance of discrete MI values was assessed using permutation testing with 1,000 random shuffles. Principal component analysis (PCA) was applied to z-scored metric vectors (alignment, ESM-2 contact map similarity, secondary structure match) to examine the dimensional structure of variation. Robustness checks included varying the number of bins (5-40) for MI and bootstrap resampling of PCA explained variance.
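The discrete MI estimate and its permutation test can be sketched in pure Python. The rank-based binning (ties broken by input order), the natural-log units, and the add-one p-value correction are assumptions, not details taken from the study.

```python
import math
import random
from collections import Counter


def quantile_bins(values, n_bins=20):
    """Assign each value to one of n_bins equal-count bins by rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // len(values)
    return bins


def mutual_information(xb, yb):
    """Plug-in MI (in nats) between two discrete label sequences."""
    n = len(xb)
    pxy, px, py = Counter(zip(xb, yb)), Counter(xb), Counter(yb)
    return sum(
        (c / n) * math.log(c * n / (px[a] * py[b])) for (a, b), c in pxy.items()
    )


def permutation_p_value(xb, yb, n_perm=1000, seed=0):
    """Observed MI and the fraction of shuffles reaching at least that MI."""
    rng = random.Random(seed)
    observed = mutual_information(xb, yb)
    shuffled = list(yb)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if mutual_information(xb, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)
```

Two perfectly dependent metrics binned into 20 equal-count bins reach the maximum of ln(20) nats, which gives a reference point when reading the estimated values.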
Statistical Analyses and Figure Preparation
The R programming language (version 4.3.1) and the ggplot2 package (56,57) were used to perform statistical analyses and generate figures. Prediction success rates are reported as percentages with exact 95% confidence intervals calculated using the Clopper–Pearson method. Correlations between similarity metrics were assessed using Pearson’s correlation coefficient with associated p-values. Relationships between secondary structure composition (coil, helix, sheet) and prediction performance were examined using LOESS smoothing with 95–99% confidence bands. For external validation experiments, performance metrics were stratified by AlphaFold2 pLDDT confidence brackets and overlap phase. Group differences were evaluated using nonparametric pairwise contrasts with adjusted p-values (Bonferroni correction). Effect sizes are reported as absolute percentage-point differences between groups, alongside confidence intervals. Data are presented as mean values with ranges or confidence intervals where appropriate. Figures were generated with ggplot2, with additional overlays and schematics prepared in Inkscape (version 1.3.2).
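For readers who want to reproduce the exact intervals outside R, the Clopper–Pearson bounds can be obtained by inverting the binomial tail probabilities. The sketch below does this by bisection using only the standard library; it is an illustrative reimplementation, not the code used in the study, and the function names are hypothetical.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), with terms evaluated in log space."""
    if k < 0:
        return 0.0
    if p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 1.0 if k >= n else 0.0
    total = 0.0
    for i in range(k + 1):
        log_term = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log1p(-p)
        )
        total += math.exp(log_term)
    return min(total, 1.0)


def _solve_increasing(g, target, iterations=60):
    """Bisection for g(p) = target, where g increases with p on [0, 1]."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if g(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided confidence interval for k successes in n trials."""
    lower = 0.0 if k == 0 else _solve_increasing(
        lambda p: 1.0 - binom_cdf(k - 1, n, p), alpha / 2.0)
    upper = 1.0 if k == n else _solve_increasing(
        lambda p: 1.0 - binom_cdf(k, n, p), 1.0 - alpha / 2.0)
    return lower, upper
```

As a usage example, a success rate of 95.3% in a 5,000-pair test set corresponds to the call `clopper_pearson(4765, 5000)`.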
Declaration of Generative AI and AI-assisted technologies
During the preparation of this work, ChatGPT (5.0; OpenAI) was used to assess readability and language, and as a tool to generate code snippets and assist in debugging. After using this tool/service, the author reviewed and edited the content as needed and takes full responsibility for the content of the publication.
Code and data availability
The code used for synthetic overlap dataset construction, model training, and inference is available at: https://github.com/protosome/convergent_overlaps_aa_change
Results
Prediction of convergent overlapping genes from amino acid sequence pairs from a dedicated transformer-encoder model set
While prior work has established that it is technically feasible to generate overlapping genes from paired amino acid sequences, the performance of dedicated transformer-encoder models for this task has not been systematically evaluated. In this study, transformer-based models were trained to generate convergent overlapping nucleotide sequences from amino acid pairs, and their ability to recover overlaps across all three phases was assessed. These models were designed to identify convergent overlaps capable of encoding both input proteins, using recent advances in transformer-based sequence modeling to address the large solution space introduced by the degeneracy of the genetic code (31,58). Prediction quality was evaluated using alignment identity scores and secondary structure preservation metrics, allowing comparison across overlap lengths from 199 to 312 nucleotides.
Initial filtering was performed using a conservative alignment score cutoff of 34% (28,59).
Without Monte Carlo dropout (i.e., dropout kept enabled during inference), the number of
successful predictions was substantially reduced, and repeated inference runs converged to
identical codon solutions. This indicated that deterministic decoding restricted the search space
to narrow, low-diversity outcomes. By contrast, applying Monte Carlo dropout during inference
expanded sequence diversity, producing multiple distinct codon assignments that encoded the same
amino acid sequences. This stochastic sampling increased the likelihood of recovering overlaps
that satisfied both similarity and structural thresholds, underscoring the importance of
uncertainty-aware inference in navigating the combinatorial complexity of convergent overlap
prediction.
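The contrast between deterministic decoding and Monte Carlo dropout can be illustrated with a toy example. The sketch below is not the transformer encoder used in this study: it is a hypothetical two-layer network in plain Python with hand-picked weights, intended only to show that keeping dropout active at inference turns repeated forward passes into distinct greedy decodes.

```python
import random


def forward(features, w_hidden, w_out, drop_p=0.0, rng=None):
    """One forward pass of a toy two-layer network; dropout is active if drop_p > 0."""
    hidden = [max(0.0, sum(f * w for f, w in zip(features, row))) for row in w_hidden]
    if drop_p > 0.0:
        keep = 1.0 - drop_p
        # Inverted dropout: zero each unit with probability drop_p, rescale survivors.
        hidden = [h / keep if rng.random() < keep else 0.0 for h in hidden]
    return [sum(h * w for h, w in zip(hidden, row)) for row in w_out]


def greedy_decode(logits):
    """Index of the largest logit (first index wins ties)."""
    return max(range(len(logits)), key=logits.__getitem__)


def mc_dropout_samples(features, w_hidden, w_out, n_passes, drop_p, seed=0):
    """Greedy decodes from repeated stochastic forward passes."""
    rng = random.Random(seed)
    return [
        greedy_decode(forward(features, w_hidden, w_out, drop_p, rng))
        for _ in range(n_passes)
    ]


# Hand-picked toy weights (assumptions for illustration only).
FEATURES = [1.0, 1.0]
W_HIDDEN = [[1.0, 0.0], [0.0, 1.0]]
W_OUT = [[1.0, 0.0], [0.0, 0.9]]

deterministic = [greedy_decode(forward(FEATURES, W_HIDDEN, W_OUT)) for _ in range(5)]
stochastic = mc_dropout_samples(FEATURES, W_HIDDEN, W_OUT, n_passes=200, drop_p=0.5)
```

In this toy setting the deterministic passes always return the same class, whereas the stochastic passes return either class depending on which hidden unit is dropped.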
To further assess training requirements, the effect of imperfect overlaps and amino acid
substitutions in the training data was examined. Models trained only on unaltered overlaps
exhibited markedly reduced success rates, particularly for phase 1 (Figs. 3D, 3E). In a shared test
set of 5,000 amino acid sequence pairs without a known convergent overlap, training with
modified sequences (stop codon removed plus 1–4 substitutions per sequence) substantially
improved prediction rates: phase 0, 95.3% (94.7-95.9%); phase 2, 86.7% (85.7-87.6%); and
phase 1, 81.6% (80.5-82.6%). In contrast, models trained on unaltered overlaps performed
significantly worse: phase 0, 73.7% (72.4-74.9%); phase 2, 49.1% (47.7-50.5%); and phase 1,
9.7% (8.9-10.6%). These findings are consistent with the hypothesis that incorporating
controlled sequence variation during training increased model flexibility and suggest improved
generalization, particularly when coupled with Monte Carlo dropout during inference. On this
basis, models trained with amino acid substitutions were selected for all subsequent analyses.
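The Fig 3 legend describes these intervals as exact 95% Clopper–Pearson intervals. A minimal sketch of that calculation is given below, using bisection on the binomial tail probabilities rather than a beta-quantile routine. The count of 4,765 successes out of 5,000 is back-calculated from the reported 95.3% under the assumption that all 5,000 pairs form the denominator for each phase; it is not a reported value.

```python
from math import exp, lgamma, log


def _log_pmf(k, n, p):
    """Log of the Binomial(n, p) probability mass at k, for 0 < p < 1."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1.0 - p))


def _cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return min(1.0, sum(exp(_log_pmf(i, n, p)) for i in range(k + 1)))


def _bisect(func, target, increasing):
    """Solve func(p) = target on (0, 1) for a monotone func."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if (func(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) confidence interval for k successes in n trials."""
    # Lower bound: P(X >= k | p) = alpha / 2 (increasing in p).
    lower = 0.0 if k == 0 else _bisect(
        lambda p: 1.0 - _cdf(k - 1, n, p), alpha / 2, increasing=True)
    # Upper bound: P(X <= k | p) = alpha / 2 (decreasing in p).
    upper = 1.0 if k == n else _bisect(
        lambda p: _cdf(k, n, p), alpha / 2, increasing=False)
    return lower, upper


# Assumed count: 95.3% of 5,000 pairs.
low, high = clopper_pearson(4765, 5000)
```

Under that assumption, the resulting interval brackets the observed proportion and is of the same width as the interval reported for phase 0.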
The outputs of these models were further evaluated using BLOSUM62 and ProtSub substitution
matrices. For overlap lengths of 310-312 nucleotides, similarity scores were strongly correlated
with amino acid identity (r = 0.88; p < 0.0001), and the matrices themselves were highly
correlated with each other (r = 0.98; p < 0.0001).
In addition to substitution metrics, alignment scores and secondary structure scores were
compared across phases. Phase 1 overlaps exhibited the highest average alignment and
secondary structure scores, phase 0 displayed intermediate values, and phase 2 was generally the
lowest. However, score distributions overlapped considerably, suggesting that phase-dependent
differences were not absolute. The near-identical secondary structure score distributions for
phase 0 and phase 2 further indicated that structural outcomes may be more sensitive to
sequence-specific context than to phase alone.
To broaden the analysis, overlap lengths from 199 to 312 nucleotides were examined across a
larger set of amino acid pairs. Phase-dependent patterns persisted: phase 1 overlaps were
increasingly recoverable at shorter lengths, and the relative rank order of secondary structure and
alignment scores across phases remained consistent. These findings suggest that phase 1 overlaps
may be more favorable in terms of both structure and codon compatibility, and that the
performance of the transformer encoder model set is influenced by overlap length and the
diversity of sequences available for training.
High Secondary Structure Scores Are Linked to Coil-Dominated Precursor Sequences
To assess whether the secondary structure composition of precursor amino acid sequences
influences the predicted quality of convergent overlaps, the proportion of coil (C), β-sheet (E),
and α-helix (H) structures in each input was compared against the model’s predicted secondary
structure score. For each amino acid pair, structural proportions were calculated separately and
then averaged across both sequences. This approach reflects the biological premise that overlap
feasibility arises not from a single sequence in isolation, but from the combined structural
constraints and flexibilities of the two proteins that must be co-encoded (60,61).
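The pair-level composition described above can be sketched as follows, assuming that each input already has a per-residue three-state string (C, E, H) such as those produced by a secondary structure predictor. The function names are illustrative.

```python
def ss_fractions(ss_string):
    """Fraction of coil (C), sheet (E), and helix (H) states in one sequence."""
    n = len(ss_string)
    return {state: ss_string.count(state) / n for state in "CEH"}


def pair_composition(ss_a, ss_b):
    """Per-state fractions computed separately, then averaged across both sequences."""
    frac_a, frac_b = ss_fractions(ss_a), ss_fractions(ss_b)
    return {state: (frac_a[state] + frac_b[state]) / 2 for state in "CEH"}
```

Averaging the two per-sequence fractions, rather than pooling residues, gives each protein equal weight even when the two inputs differ in length.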
Across all three overlap phases, sequences with higher coil content exhibited elevated secondary
structure scores (Fig. 3C). This trend was most apparent in phase 0 and phase 1, where
LOESS-smoothed curves showed a modest increase in secondary structure score with increasing coil
content, followed by a sharper rise beyond approximately 70% coil. In phase 2, a similar increase
was observed at high coil levels, but the trend was less consistent, with greater variability at
intermediate values. By contrast, α-helix content was inversely associated with secondary
structure score. The most pronounced effect occurred in phase 0, where increasing helix content
corresponded to a marked decline in secondary structure scores. A similar, though less steep,
trend was observed in the other phases. For β-sheet content, no consistent relationship with
secondary structure score was observed; across all three phases, smoothed curves were largely
flat, indicating that sheet content neither strongly promoted nor hindered predicted structural
quality.
To evaluate whether these relationships extended to sequence-level similarity, mean alignment
scores were also plotted against precursor structural compositions. No meaningful trends were
observed for any secondary structure class (Fig. S1). Alignment scores remained relatively
constant regardless of coil, β-sheet, or α-helix content, suggesting that precursor structure did not
systematically influence the model's ability to recover the reference overlap sequence.
Multi-objective optimization and computational design of EGFP/AmpR convergent
overlaps
Although the transformer-encoder models were capable of generating convergent overlaps for
paired amino acid sequences, the secondary structure preservation achieved in initial predictions
was variable and generally lower than values reported in prior work (60). To address this
limitation, an iterative sequence optimization algorithm was implemented to refine model
outputs using a multi-objective framework that did not rely on MSA availability (Fig. 4A).
The optimization procedure employed a windowed feedforward Monte Carlo (MC) dropout
strategy to generate multiple candidate solutions for each sequence window (Fig. 4B). Candidate
overlaps were then evaluated against two complementary structural criteria: 1) predicted
secondary structure states derived from S4PRED (62), providing local fold-relevant constraints,
and 2) ESM-2 contact map similarity (63), which encodes sequence-level context that may
reflect long-range residue dependencies. A key feature of the framework was the ability to
enforce selective preservation of amino acid residues when required. This was achieved by
applying a feedforward mask to specific codon positions, constraining the model to maintain
defined residues (for example, active-site motifs) while permitting synonymous variation
elsewhere. In practice, this allowed critical sequence features to be retained while enabling
exploration of codon-level degeneracy to optimize overlap formation.
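One plausible reading of residue preservation at the codon level is sketched below. This is an assumption about how such a constraint could be expressed, not the masking code used in this study: per-codon logits at a preserved position are restricted to the codons that encode the required residue (standard genetic code assumed), so that only synonymous variation remains. The names `mask_codon_logits` and `greedy_codon` are hypothetical.

```python
# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def mask_codon_logits(logits, required_residue=None):
    """Copy of per-codon logits with codons for other residues set to -inf.

    `logits` maps codon -> score for one position; `required_residue` is the
    one-letter amino acid to preserve, or None for an unconstrained position.
    """
    if required_residue is None:
        return dict(logits)
    return {
        codon: (score if CODON_TABLE[codon] == required_residue else float("-inf"))
        for codon, score in logits.items()
    }


def greedy_codon(logits):
    """Codon with the highest (masked) logit."""
    return max(logits, key=logits.get)
```

With this kind of mask, a preserved methionine can only be decoded as ATG, whereas a preserved leucine still leaves six synonymous codons from which the sampler may choose.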
Relationships among the chosen objectives were next quantified using mutual information (MI)
and principal component analysis (PCA). Both approaches indicated that alignment identity,
ESM-2 contact map similarity, and predicted secondary structure capture overlapping but distinct
constraints. Discrete MI (20 quantile bins) showed that alignment carried moderate information
about both ESM-2 contact map similarity (MI = 0.393) and secondary structure match (MI =
0.381), whereas the MI between ESM-2 contact map similarity and secondary structure
similarity was lower (MI = 0.240), consistent with partial independence. Continuous MI
estimates (kNN regression) produced similar results (0.509, 0.508, and 0.340, respectively). All
associations were highly significant in permutation tests (p = 0.001). PCA of z-scored metrics
supported this interpretation (Fig. S2). A single axis (PC1) explained 72.4% of variance (95%
CI: 71.8-73.0) and loaded positively on all three objectives, reflecting a general similarity
dimension. PC2 (18.4%; 95% CI: 18.0-18.9) contrasted ESM-2 contact map similarity with
secondary structure similarity, while PC3 (9.2%) primarily captured sequence-only variance.
Thus, although alignment dominates the shared signal, secondary structure preservation and
ESM-2 contact map-based similarity provide orthogonal information that cannot be reduced to
alignment alone. This justifies their inclusion as separate objectives during overlap optimization.
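A discrete MI estimate with quantile binning can be sketched as follows. The rank-based binning, the tie handling (ties broken by input order), and the use of natural logarithms (nats) are assumptions made for illustration; the text does not specify these details.

```python
from collections import Counter
from math import log


def quantile_bins(values, n_bins=20):
    """Assign each value to one of n_bins equal-frequency bins by rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    bins = [0] * len(values)
    for rank, index in enumerate(order):
        bins[index] = rank * n_bins // len(values)
    return bins


def mutual_information(x, y, n_bins=20):
    """Plug-in mutual information (in nats) between two quantile-binned variables."""
    bx, by = quantile_bins(x, n_bins), quantile_bins(y, n_bins)
    n = len(x)
    joint = Counter(zip(bx, by))
    px, py = Counter(bx), Counter(by)
    # Sum of p(a, b) * log(p(a, b) / (p(a) * p(b))) over observed bin pairs.
    return sum(
        (count / n) * log(count * n / (px[a] * py[b]))
        for (a, b), count in joint.items()
    )
```

With 20 equal-frequency bins, this estimator ranges from 0 for independent metrics to ln 20 ≈ 3.0 nats for a metric compared with itself.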
Metric behavior under progressive sequence divergence was also assessed by subjecting 250
randomly sampled SwissProt proteins to iterative single-residue substitution trajectories (75 steps
per sequence). Across trajectories, all four similarity measures declined with increasing
mutational distance, but the rate and pattern of decay differed by metric (Fig. S3). Alignment
identity and BLOSUM62 similarity decreased rapidly and near-linearly, consistent with their
direct dependence on residue-level identity. By contrast, ESM-2 contact map similarity and
secondary structure similarity showed more gradual declines. These distinct decay profiles
support the interpretation that the objectives capture complementary aspects of overlap fidelity.
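A substitution trajectory of this kind can be sketched for the identity metric alone. The sampling scheme below is an assumption: one random position is changed per step, and the new residue is drawn so that it differs from both the current and the original residue, which keeps identity from recovering by chance. The structural metrics would require external predictors and are not reproduced here.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def substitution_trajectory(sequence, steps=75, seed=0):
    """Identity to the original sequence after each single-residue substitution."""
    rng = random.Random(seed)
    current = list(sequence)
    identities = [1.0]
    for _ in range(steps):
        pos = rng.randrange(len(current))
        # Exclude the current and the original residue at this position.
        choices = [a for a in AMINO_ACIDS if a != current[pos] and a != sequence[pos]]
        current[pos] = rng.choice(choices)
        matches = sum(a == b for a, b in zip(current, sequence))
        identities.append(matches / len(sequence))
    return identities
```

Because positions are drawn with replacement, identity falls by at most one residue per step and the decline flattens slightly as previously mutated positions are revisited.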
Weighting experiments further clarified the contributions of each objective. When the relative
weights assigned to secondary structure similarity, ESM-2 contact map similarity, alignment
identity, and BLOSUM62 substitution similarity were varied across randomly selected SwissProt
pairs with very high AlphaFold2 pLDDT values (≥90), distinct trade-off dynamics were
observed (Fig. S4). Increased weighting on secondary structure similarity consistently improved
secondary structure preservation but reduced alignment and substitution similarity, whereas
embedding-focused weightings enhanced ESM-2 contact map similarity at the expense of
secondary structure similarity. Intermediate combinations (e.g., 0.15-0.4 secondary structure with
0.6 ESM-2 contact map similarity) yielded more balanced outcomes across the four metrics.
Variance patterns indicated that secondary structure weighting stabilized optimization
trajectories, while ESM-2 contact map similarity remained comparatively variable across
candidate sequences.
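A fixed weighted composite score over the four metrics can be sketched as follows. The weights 0.15 (secondary structure) and 0.6 (ESM-2 contact map similarity) are taken from the intermediate combinations mentioned above; the values for alignment identity and BLOSUM62 similarity are placeholders chosen so that the weights sum to 1 and are not reported values. All metrics are assumed to be scaled to the range 0 to 1.

```python
# Weights for alignment and BLOSUM62 are placeholders (assumptions).
WEIGHTS = {
    "secondary_structure": 0.15,
    "contact_map": 0.60,
    "alignment": 0.125,
    "blosum62": 0.125,
}


def composite_score(metrics, weights=WEIGHTS):
    """Weighted mean of the metric values, normalized by the total weight."""
    total = sum(weights.values())
    return sum(weights[name] * metrics[name] for name in weights) / total


def best_candidate(candidates, weights=WEIGHTS):
    """Candidate (a dict of metric values) with the highest composite score."""
    return max(candidates, key=lambda metrics: composite_score(metrics, weights))
```

Normalizing by the total weight keeps scores comparable when the weights are varied, as in the weighting experiments described above.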
A convergent overlap between enhanced green fluorescent protein (EGFP) and a TEM-1
β-lactamase (ampicillin resistance marker from pCVD004; hereafter AmpR) was designed and
refined under the same objectives as a biologically relevant test case (64,65). The combination of
MC dropout sampling, secondary structure prediction, and embedding-based scoring consistently
improved fold preservation of computationally designed EGFP/AmpR 311-nucleotide (phase 1)
overlaps (Fig. 4C; Fig. 5). Compared to unoptimized outputs (i.e., sequences generated from the
first window), optimized sequences displayed higher secondary structure similarity rates and
increased ESM-2 contact map similarity, demonstrating that overlap designs can be improved
using lightweight, MSA-independent objectives. During windowed optimization, alignment and
BLOSUM62 metrics remained largely constant or decreased across windows, reflecting their
role in constraining candidate sequences to a biologically plausible subset of sequence space. In
contrast, secondary structure similarity generally increased throughout optimization, despite
being weighted at only 0.15 in the composite score. ESM-2 contact map similarity increased
modestly in the early windows before plateauing. These findings suggest that fold-relevant
secondary structure metrics may act as the principal discriminative driver of optimization
progress, with alignment- and embedding-based metrics providing a stabilizing baseline.
Following optimization, the amino acid sequence pair with the highest combined score was
selected for structural comparison using predictions from AlphaFold3 (50), with residues in both
sequences selected for preservation based on predicted importance (e.g., active site). The
AlphaFold3-predicted models of designed overlaps were highly similar to reference predictions
(Fig. 4C), with TM-scores of 0.98 (EGFP) and 0.90 (AmpR), the latter with deviations generally
localized to a disordered coil not located in the overlapping terminal region (66).
Performance of this algorithm in predicting overlapping genes using pLDDT as an
orthogonal proxy for intrinsic structural rigidity
Analyses on prokaryote-derived amino acid sequence pairs indicated that secondary structure
preservation increased with coil content and decreased with helix content, with sheet content
showing no consistent effect. To determine whether this pattern reflected a generalizable
property of overlap design for convergent overlaps derived from terminal amino acid sequences,
an external validation set was assembled from SwissProt proteins stratified by AlphaFold2 (AF2)
per-residue confidence (pLDDT), an orthogonal proxy for intrinsic structural rigidity (52,67–69).
SwissProt sequences were filtered to 105–200 amino acids to align with the 105-AA receptive
field and to bound computational cost, then binned into Low-intermediate (50-69), Confident
(70-89), and Very high (90-100) pLDDT groups. For each bracket, 100 non-redundant amino acid
pairs were formed, and convergent overlaps were designed across phases 0, 1, and 2 at 310,
311, and 312 nt using the same transformer-encoder, windowed Monte Carlo (MC) dropout
inference, objective weights, and constraint handling described previously.
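The filtering and binning steps can be sketched as follows. The use of a single mean pLDDT value per protein and the handling of non-integer values at bracket boundaries (for example, 69.5 is assigned to Low-intermediate) are assumptions; the text gives only the integer ranges.

```python
def length_eligible(sequence, min_len=105, max_len=200):
    """True if the sequence length falls within the 105-200 amino acid window."""
    return min_len <= len(sequence) <= max_len


def plddt_bracket(mean_plddt):
    """Map a mean pLDDT value to its bracket, or None if below 50."""
    if mean_plddt >= 90:
        return "Very high"
    if mean_plddt >= 70:
        return "Confident"
    if mean_plddt >= 50:
        return "Low-intermediate"
    return None


def stratify(proteins):
    """Group (sequence, mean_plddt) records by bracket after length filtering."""
    groups = {"Low-intermediate": [], "Confident": [], "Very high": []}
    for sequence, mean_plddt in proteins:
        bracket = plddt_bracket(mean_plddt)
        if bracket is not None and length_eligible(sequence):
            groups[bracket].append(sequence)
    return groups
```

Pair formation and redundancy filtering within each bracket are not shown, as the text does not describe how they were carried out.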
Quantitative analysis of secondary structure composition (determined using S4PRED) across
pLDDT confidence bins revealed clear trends. Looking specifically at the terminal AA regions,
as pLDDT confidence increased from the Low-intermediate to the Very high bins, coil content
declined substantially (from ~64% to ~41%), while both helix and sheet content rose (helix from
~29% to ~39%; sheet from ~6% to ~20%) (Figs. 6A, 6B). These shifts suggest that AlphaFold’s
higher-confidence regions correspond generally to more ordered structural states, and reinforce
the interpretation of pLDDT as an approximate proxy for structural order and rigidity (50,70,71).
These AF2-stratified findings were consistent with the earlier composition-based analysis using
prokaryotic ORFs in this study. Sequence pairs with lower pLDDT (coil-enriched) were
associated with slightly higher secondary structure preservation, whereas higher-confidence
inputs tended to show reduced variation (Fig. 6C). Nevertheless, the overall high secondary
structure scores across all three brackets emphasize that the algorithm was able to recover
structurally faithful overlaps even for proteins predicted by AlphaFold2 to adopt stable,
well-defined folds. Alignment identity and substitution similarity remained generally stable across
brackets, indicating that secondary structure preservation was the primary dimension along
which flexibility exerted its effect.
Bracket- and phase-specific effects were evident across all three metrics, consistent with the
secondary structure composition of the pLDDT bins. Across overlap phases, sequences in the
Low-intermediate (50-69) bracket generally outperformed those in the Confident (70-89) and
Very high (90-100) brackets, with the gap being most pronounced in phase 2 (Fig. 6D). In phase
1, the bracket effect was clear for secondary structure preservation and for the combined score
relative to Very high (90-100), whereas pairwise differences were not evident for ESM-2 contact
map similarity; Low-intermediate (50-69) also did not differ from Confident (70-89) for the
combined score. Overall, phase 1 maintained the highest preservation, while phase 2 exhibited
the largest bracket separations. Although group differences were statistically significant, effect
sizes varied by metric: within a phase, combined and ESM-2 contact map similarity differences
were typically ~1-4 percentage points, whereas secondary structure differences reached ~5-8
points in some phase 2 contrasts. All groups maintained mean secondary structure preservation
above 85%. Full pairwise estimates with confidence intervals and adjusted p-values are provided
in Supplementary Tables 4 and 5.
While statistically significant differences were observed between pLDDT brackets, these
differences were relatively small, and overall secondary structure preservation remained
consistently high across all conditions (Fig. 6E). These results reinforce the generalizability of
the earlier coil-associated trend, but also highlight the robustness of the multi-objective,
uncertainty-aware optimization framework in achieving high predicted structural preservation
regardless of precursor rigidity.
Discussion
Overlapping gene architectures have long been recognized as both a source of evolutionary
constraint and an avenue for novel functionality (72–80). In addition to their evolutionary
importance, overlapping structures can be purposefully engineered under realistic codon and
structural constraints. The present study shows that phase-specific convergent overlaps can be
computationally generated and optimized using a transformer encoder-based multi-objective
framework without requiring full 3D prediction at the optimization stage; targeted 3D checks
were applied post hoc. This framework offers a complementary tool that could expand the range
of strategies available for synthetic overlap design, particularly where phase, length, and
nucleotide-level control are critical.
Recent work has demonstrated that deep generative protein models can discover viable
overlapping solutions under genetic code constraints. Byeon et al. (2025) showed that pretrained
amino acid-space generative models can design synthetic overlapping genes across multiple
reading frames using Gibbs sampling with codon-compatibility constraints, and validated
expression experimentally (30). Xu et al. (2025) further demonstrated that pretrained generative
protein language models (ESM-3) can yield viable entangled protein pairs through
structure-conditioned inverse folding and CAMEOS-based entanglements, filtered by cross-entropy
and Potts energy scores (81). Importantly, Xu et al. established that pretrained generative models
are capable of identifying functional overlap-compatible solutions, albeit in a limited
experimental context (the InfA/AroB system) and without explicit codon-level optimization.
The framework presented here builds on this momentum but takes a different route. Rather than
relying on pretrained amino acid-space sampling, it employs a purpose-built, phase-generalizable
transformer encoder model set trained on a balanced synthetic dataset of convergent overlaps
spanning diverse GC contexts and overlap lengths. Multi-objective optimization is integrated
directly into the inference loop, with candidate overlaps iteratively refined through a windowed,
uncertainty-aware (MC dropout) search. Evaluation criteria span predicted secondary structure
preservation (S4PRED), amino acid substitution similarity (BLOSUM62), pairwise alignment
identity, and ESM-2 contact map similarity. This approach emphasizes codon-level control while
maintaining structural fidelity, enabling overlap recovery without requiring full 3D structural
prediction at each iteration.
MC dropout was retained during inference to enable stochastic sampling of synonymous codon
solutions, thereby broadening the effective search space (35). Candidate overlaps were ranked
using a fixed composite score that balanced structural preservation, substitution similarity,
alignment identity, and embedding-based metrics. This weighting scheme served as a practical
approach for navigating competing design constraints without requiring full structural prediction
at each iteration, and was most closely aligned with prior work arguing that synthetic overlap
design requires explicit balancing of competing biological constraints and objectives (5,60).
While other multi-objective optimization strategies could be considered in future work (such as
ε-constraint methods (82,83), adaptive scalarization (84,85), or evolutionary multi-objective
optimizers (86)), the fixed weighted framework proved sufficient to achieve high overlap fidelity
under the conditions evaluated.
The synthetic convergent overlapping gene dataset construction method developed as part of this
study provides a structured framework for exploring overlapping gene design in silico, as it
permits generation of diverse and uniformly distributed overlaps. However, one potential
limitation of this approach is the reliance on synthetic, rather than natural, overlapping genes for
model training; these were generated using ORFs from a diverse set of prokaryotic strains. This
may introduce a bias affecting the predictive ability and variability in sequence output. Despite
this concern, the method employed to generate synthetic overlapping genes is in principle
analogous to the formation of new overlaps through the process of gene extension following the
loss of a stop codon (8). This results in synthetic overlapping genes where the second (extended)
strand's encoded amino acid composition is derived from the reverse complement of the first
strand, without undergoing further selection. As recently reported, there is extensive evidence
suggesting functionality in prokaryotic convergent (antisense) proteins, indicating that
non-coding RNA regions could also encode functional proteins (87). For same-strand overlaps, a
significant difference in overall composition compared to non-overlapping genes has been
previously observed (88), along with a bias towards disorder-promoting amino acids (89). The
extent to which this affects convergent overlapping genes remains unclear. Future work will
therefore aim to (i) characterize compositional differences between natural and synthetic
overlaps, (ii) extend design to longer proteins beyond the 199-312 nucleotide range considered
here, and (iii) experimentally validate whether in silico-designed overlaps from this framework
maintain expression, folding, and function in vivo.
In summary, this study establishes that phase-aware transformer models trained directly at the
nucleotide level, when coupled with uncertainty-aware inference and multi-objective
optimization, can recover computationally designed convergent overlapping genes under realistic
genomic constraints. By integrating codon-level sampling with structural and substitution-based
objectives, this framework complements recent amino acid-space generative approaches while
offering length- and phase-specific control. These findings provide a computational foundation
for experimentally testable designs and point toward a scalable strategy for compact, synthetic
gene architectures.
Fig 1. Depiction of convergent overlapping genes in three reading frames and the process
developed to computationally design synthetic convergent overlaps from coding sequences.
(A) Sequences are represented for convergent (tail-to-tail) and unidirectional (same-strand)
overlapping genes in each of the three reading frames (phase 0, phase 1, and phase 2). Note that
phase 0 unidirectional overlaps are not represented, as these are considered an alternative start
site for the same gene (90). In each phase, the reading frame was shifted by one nucleotide. Start
and stop codons are indicated with green and red shading, respectively. (B) The general process
for creating synthetic convergent overlapping genes is presented, starting with a pool of ORF
sequences filtered by length (≥450 base pairs [bp]). The primary sequence and an in-frame 105-bp
fragment were randomly selected. A stop codon was randomly selected from the reverse
complement (rc), and the fragment's reverse complement was extended to that stop codon to serve
as the downstream sequence of the secondary sequence. Given the downstream secondary sequence
length, another sequence was randomly selected from the same gene list. A size-appropriate
fragment was selected from that gene with length equal to the difference between the desired
gene length and the upstream overlap fragment (i.e., the reverse complement of the primary
sequence to the selected stop codon). The upstream fragment was then combined with the
downstream fragment to generate the complete secondary sequence. The resultant primary and
secondary nucleotide sequences, sharing a known convergent overlapping region, were then
translated to amino acid (aa) sequences (either modified or unmodified) for subsequent use in
model training or inference.
Fig 2. Illustration of the overlap transformer-encoder architecture and general structure of
the data flow.
(A) Depiction of the general structure of the data flow, for two input concatenated amino acid
sequences (primary and secondary) used during inference. The primary and secondary amino
acid sequences are represented with green and blue text, respectively, and the flow proceeds
from top to bottom. (B) Illustration of the transformer encoder model architecture. The model
input is tokenized concatenated paired amino acid sequences, and the model outputs raw logits.
The output logits are processed using a greedy decoder to generate a predicted overlap sequence.
Fig 3. Training data evaluation of secondary structure preservation and convergent
overlap-phase performance.
(A, D) Results for unmodified amino acid sequences. (A) Scatter density plot of secondary
structure (SS) score versus alignment score, with contours and marginal density plots showing
the distribution by overlap phase (0 = red, 1 = green, 2 = blue). (D) Fraction of successful
predictions across overlap lengths (310, 311, and 312 nucleotides), stratified by phase. Success
rates are reported with exact 95% Clopper–Pearson confidence intervals. (B, E) Results for
modified amino acid sequences. (B) Scatter density plot of SS score versus alignment score with
the same phase-coloring scheme. (E) Fraction of successful predictions across overlap lengths
for modified sequences, stratified by phase. Success rates are reported with exact 95%
Clopper–Pearson confidence intervals. (C) Relationship between sequence-level SS composition
(coil [C], sheet [E], helix [H]) and mean SS score across phases 0, 1, and 2. Dashed lines indicate
LOESS regression fits with 99% confidence bands.
615
616
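The exact Clopper–Pearson interval used for the success rates in Fig 3 can be computed by inverting the binomial tail probabilities. The sketch below does this with plain bisection using only the standard library. The function names and the example counts are illustrative assumptions, not values from the study, and the approach is only intended for moderate sample sizes.

```python
from math import comb
from typing import Callable, Tuple


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_increasing(func: Callable[[float], float], target: float) -> float:
    """Bisection on [0, 1] for a function that increases with p."""
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if func(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k: int, n: int, alpha: float = 0.05) -> Tuple[float, float]:
    """Exact two-sided (1 - alpha) confidence interval for k successes in n trials."""
    if n <= 0 or not 0 <= k <= n:
        raise ValueError("require n > 0 and 0 <= k <= n")
    # Lower bound: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else _solve_increasing(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2
    )
    # Upper bound: the p at which P(X <= k) equals alpha / 2,
    # written as P(X > k) = 1 - alpha / 2 so the function increases with p.
    upper = 1.0 if k == n else _solve_increasing(
        lambda p: 1 - binom_cdf(k, n, p), 1 - alpha / 2
    )
    return lower, upper


# Hypothetical counts: 45 successful predictions out of 100 attempts.
low, high = clopper_pearson(45, 100)
```

The interval is conservative by construction, which suits success fractions near 0 or 1, where normal-approximation intervals can extend outside the valid range.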
Fig 4. In silico design of convergent overlap with EGFP and AmpR
Overview of the multi-objective optimization process for EGFP and AmpR proteins using the transformer encoder framework. (A) Schematic of the transformer encoder–based workflow. The terminal 104 amino acids of EGFP and AmpR (across all tested overlap lengths) are concatenated and passed through the model. Outputs are scored by a composite metric integrating alignment, substitution, secondary structure preservation, and long-range interactions. Bracketed amino acids from the input are preserved to enforce positional constraints. The final sequence pair is selected from the output table by the highest average combined score. (B) Multiple-sequence alignment of all output pairs (n=3,624) for EGFP and AmpR across both optimization passes and windows (top to bottom). Diagonal dropout patterns correspond to localized MC dropout applied to the overlap region. Variability reflects bracket-constrained codon degeneracy introduced on the non-fixed partner sequence. (C) Predicted tertiary structures generated with AlphaFold3 for EGFP (left) and AmpR (right). Native and overlapping terminal regions relevant to the 311-nt overlap are highlighted (native in red, overlapped in blue); non-overlapping regions are shown in gray.

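The scoring and selection step in panel A of Fig 4 can be sketched as a weighted combination of component scores followed by picking the pair with the highest average. The equal weights, the component key names, and the `CandidatePair` class below are placeholders chosen for illustration. The caption names the four components but not their weights, and each component is assumed here to be scaled to [0, 1].

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical equal weights; the actual weighting is not stated in the caption.
WEIGHTS: Dict[str, float] = {
    "alignment": 0.25,
    "substitution": 0.25,
    "secondary_structure": 0.25,
    "contact_map": 0.25,
}


def combined_score(components: Dict[str, float],
                   weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted average of the component scores."""
    total_weight = sum(weights.values())
    return sum(weights[name] * components[name] for name in weights) / total_weight


@dataclass
class CandidatePair:
    label: str
    primary: Dict[str, float]    # component scores for the primary protein
    secondary: Dict[str, float]  # component scores for the secondary protein

    def average_combined(self) -> float:
        """Mean of the two proteins' combined scores."""
        return (combined_score(self.primary) + combined_score(self.secondary)) / 2


def select_best(pairs: List[CandidatePair]) -> CandidatePair:
    """Return the pair with the highest average combined score."""
    return max(pairs, key=lambda pair: pair.average_combined())


def _flat(value: float) -> Dict[str, float]:
    """Helper for the toy example: same score for every component."""
    return {name: value for name in WEIGHTS}


# Toy candidates with made-up scores.
candidates = [
    CandidatePair("candidate_a", primary=_flat(0.8), secondary=_flat(0.6)),
    CandidatePair("candidate_b", primary=_flat(0.9), secondary=_flat(0.4)),
]
best = select_best(candidates)
```

Averaging over both proteins favors balanced solutions: in the toy data, `candidate_b` has the better primary score but loses because its secondary partner is degraded more strongly.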
Fig 5. Trajectories of composite and component metrics during computational optimization of EGFP/AmpR convergent overlaps
Optimization performance for EGFP and AmpR across two passes, each consisting of 13 windows. Metrics shown include the combined score (top left), which integrates secondary structure (SS), ESM-2 contact map similarity, and alignment components, as well as the individual SS (top right), ESM-2 (bottom left), and amino acid alignment (bottom right) scores. Solid red (EGFP) and blue (AmpR) lines denote mean values, and shaded regions represent the minimum-maximum range across sequences within each window.

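The per-window summary plotted in Fig 5 (a mean line with a minimum-maximum band) amounts to a simple grouped aggregation. The sketch below assumes the scores arrive as `(window_index, score)` records. The function name and the toy values are illustrative only.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def summarize_by_window(
    records: Iterable[Tuple[int, float]]
) -> Dict[int, Tuple[float, float, float]]:
    """Map each window index to (mean, minimum, maximum) of its scores."""
    grouped: Dict[int, List[float]] = defaultdict(list)
    for window, score in records:
        grouped[window].append(score)
    # Sort by window index so the trajectory reads in optimization order.
    return {
        window: (mean(scores), min(scores), max(scores))
        for window, scores in sorted(grouped.items())
    }


# Toy scores for two windows (made-up values).
records = [(2, 0.6), (1, 0.5), (1, 0.7), (2, 0.8), (2, 1.0)]
summary = summarize_by_window(records)
```

Each entry gives the value for the solid line (mean) and the two edges of the shaded band (minimum and maximum) for one window. In practice the same aggregation would be repeated for each protein, pass, and metric.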
Fig 6. Comparison of computational overlap predictions across pLDDT brackets, with structural composition and optimization outcomes
Analyses are based on random sampling of SwissProt protein pairs (n=100 for each bracket) from the AlphaFold2 Protein Structure Database (AF2). (A) Distribution of average pLDDT values for the terminal 104 residues across sampled sequences, with red dashed lines marking the confidence thresholds. (B) Counts of sequences assigned to pLDDT confidence brackets, shown for both total and terminal regions (internal white bars denote terminal regions). (C) Ternary plots of amino acid composition (helix [H], sheet [E], and coil [C]) for the primary and secondary sequence sets, colored by pLDDT bracket (low-intermediate, confident, and very high). Circles with black centers denote the centroid (mean position of all sequences within a given pLDDT bracket) in ternary space. (D) Optimization performance across convergent overlap phases (0, 1, 2) for Pass 1 (P1) and Pass 2 (P2). Plots show average secondary structure (SS Avg) and ESM-2 contact map similarity (ESM Avg) scores by optimization window, with shaded regions representing the range of values stratified by pLDDT bracket. (E) Density distributions of individual metric scores (SS, ESM, and Alignment) for the highest combined score sequences (primary and secondary) across phases 0, 1, and 2.

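The bracket assignment in panels A and B of Fig 6 and the ternary centroid in panel C can be sketched as follows. The cut-offs of 70 and 90 are an assumption based on the pLDDT bands commonly used for AlphaFold models, since the caption does not state the thresholds. The function names and the three-state strings are illustrative.

```python
from typing import Iterable, List, Tuple

Composition = Tuple[float, float, float]  # fractions of (helix, sheet, coil)


def plddt_bracket(mean_plddt: float) -> str:
    """Assign a confidence bracket; the 70 and 90 cut-offs are assumed."""
    if mean_plddt >= 90:
        return "very high"
    if mean_plddt >= 70:
        return "confident"
    return "low-intermediate"


def ss_fractions(ss_string: str) -> Composition:
    """Fractions of helix (H), sheet (E), and coil (C) in a three-state string."""
    if not ss_string:
        raise ValueError("secondary structure string must not be empty")
    n = len(ss_string)
    return (ss_string.count("H") / n,
            ss_string.count("E") / n,
            ss_string.count("C") / n)


def centroid(points: Iterable[Composition]) -> Composition:
    """Component-wise mean of composition points in ternary space."""
    pts: List[Composition] = list(points)
    if not pts:
        raise ValueError("need at least one composition point")
    count = len(pts)
    return (sum(p[0] for p in pts) / count,
            sum(p[1] for p in pts) / count,
            sum(p[2] for p in pts) / count)


# Toy three-state assignments for two sequences (made-up strings).
compositions = [ss_fractions("HHHHEECCCC"), ss_fractions("HHEEEECCCC")]
bracket_centroid = centroid(compositions)
```

Because every composition point sums to 1, the component-wise mean also sums to 1 and therefore lies inside the ternary triangle, which is why it can be drawn as a marker on the same plot.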