Benchmarking Generative Models for COI DNA Barcoding

Research Square
Cho-I Moon, Dae Kwon Song, Jie Eun Park, Jun Yang Jeong, Chan Eui Hong, and 5 more
This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-8982107/v1
This work is licensed under a CC BY 4.0 License
Status: Under Review
Version 1 posted 7

Abstract
Cytochrome c oxidase subunit I (COI) DNA barcoding is widely used for species identification and biodiversity studies. However, COI datasets exhibit high intra-species similarity and significant inter-species imbalance, which limits sequence analyses. To address data scarcity, deep learning based generative models have been explored for sequence generation. We implemented six generative models incorporating gated recurrent unit (GRU) layers, Transformer blocks, and convolutional layers to generate species-specific COI sequences across four taxonomic groups: Cypraeidae, Drosophila, Bats, and Birds.
The generated sequences were evaluated in terms of plausibility, phylogenetic consistency, and diversity. The GRU-based autoregressive language model achieved the best performance. It preserved codon-level structures relative to real data, with GC₃ content differences (Δ) ≤ 0.004, codon bias JSD ≤ 0.013, and ORF mean length differences (Δ) < 0.05. It also reproduced genetic structures with intra-species K2P mean differences (Δ) ≤ 0.13, real–synthetic K2P mean ≤ 0.09, and barcode gap rate differences (Δ) ≤ −0.6. Additionally, it generated sequences with minimal redundancy, indicated by JSD-kmer ≤ 0.03, Self-BLEU differences (Δ) ≤ 0.001, and AA values between 0.54 and 0.75. These results suggest that GRU-based COI sequence generation can serve as a robust simulation strategy for addressing data scarcity and imbalance in bioinformatics applications.

Subject areas: Biological sciences/Computational biology and bioinformatics; Biological sciences/Ecology; Earth and environmental sciences/Ecology; Biological sciences/Evolution; Biological sciences/Genetics
Keywords: COI DNA barcoding; deep generative models; synthetic DNA generation; phylogenetic fidelity; biodiversity representation

1. Introduction
In diverse animal taxa, species identification based solely on traditional morphology has limitations due to morphological variation, trait changes across life-history stages, and physical damage to diagnostic characters during distribution and processing [ 1 , 2 ]. This issue is particularly pronounced in marine invertebrates that are frequently traded and processed; their external features are often unavailable or obscured, making species-level identification difficult and creating downstream challenges for fisheries resource management and food safety assurance [ 3 ].
Such constraints of morphology-based classification are not restricted to a single taxonomic group and are repeatedly observed in situations where morphological information from specimens is limited [ 1 ]. To address these limitations, DNA barcoding based on the mitochondrial cytochrome c oxidase subunit I (COI) gene has been widely adopted as a standardized tool for species identification. DNA barcoding is a molecular method that distinguishes interspecific genetic variation using a short standardized DNA sequence, enabling broad taxonomic coverage and relatively rapid, consistent species identification even when morphological information is limited [ 2 ]. Accordingly, COI-based DNA barcoding has been validated across a wide range of animal groups (including birds, fishes, mollusks, and arthropods) and has been broadly applied in species identification and biodiversity research [ 4 , 5 ]. Recently, species classification methods based on COI sequences have expanded beyond traditional approaches, such as tree-based, similarity-based, and feature-based classifications, to include machine learning and deep learning methodologies [ 6 , 7 ]. DeepBarcoding is a deep neural network composed of convolutional layers; evaluated on six COI datasets, it outperformed five conventional machine learning classifiers [ 6 ]. BarcodeBERT was introduced as a transformer-based foundation model for species classification across large-scale and diverse DNA barcoding datasets, achieving excellent classification performance [ 7 ]. To date, deep learning–based analyses of DNA barcoding data have focused on classification tasks. However, DNA barcoding datasets used in existing studies suffer from severe class imbalance due to the difficulties associated with real-world data acquisition.
While some species are represented by hundreds of samples, others have only one or two samples, resulting in a pronounced long-tail distribution. Such imbalance is a major factor contributing to overfitting and degraded generalization performance in deep learning models. Moreover, COI sequences exhibit limited intra-species diversity and narrow boundaries between related species, which further increase the risk of misclassification. Therefore, there is a need for research on synthetic data generation techniques that can expand species-specific data distributions while preserving the biological characteristics and variation structures of real COI sequences. Deep learning technologies have been widely applied not only to species classification but also to various fields of biological data analysis and design [ 8 , 9 ]. In addition, with the introduction of various generative deep learning models, it has become possible to explore vast sequence spaces and directly generate novel sequences that do not exist in current databases (de novo design) [ 10 , 11 ]. Generative modeling approaches have emerged as a new paradigm for genomic data simulation in bioinformatics, as well as for applications in drug discovery and protein design [ 12 , 13 ]. In the field of biological data generation, MB-GAN was proposed as a simulation approach for microbiome data using generative adversarial networks (GANs) [ 14 ]. In addition, high-quality artificial genomes have been generated from haplotype data using Wasserstein GANs (WGANs) [ 15 ]. Other studies have explored synthetic data generation and performance evaluation for genotype data incorporating SNP variation using WGANs, variational autoencoders (VAEs), and diffusion models [ 16 ]. Beyond numerical and discrete-valued data, DiscDiff was proposed as a generative model for promoter sequences composed of nucleotide characters.
DiscDiff employed a latent diffusion model to generate discrete DNA sequences and subsequently refined the generated sequences using an Absorb–Escape post-processing algorithm [ 17 ]. In addition, a study proposed a graph-based deep generative neural network with a generative–adversarial module to denoise high-dimensional data and extract biologically meaningful latent features, enabling accurate classification of molecular subtypes of gastrointestinal cancers [ 18 ]. Despite extensive research on generative models across diverse types of biological data, studies targeting COI DNA barcoding sequences have not yet been conducted. Existing studies have utilized diverse datasets, evaluation metrics, and model architectures, thereby complicating direct performance comparisons and limiting rigorous validation of the effectiveness of recent models relative to traditional methods [ 19 ]. Furthermore, COI sequences are subject to essential evolutionary constraints, including codon-level translational structures and nucleotide sequence continuity. Therefore, a comparative analysis is required to determine which architectural approach, local pattern learning based on convolutional neural networks (CNNs) or sequential pattern learning based on recurrent neural networks (RNNs), is more effective in generating synthetic COI sequences that possess both biological validity and diversity. Accordingly, we conducted a benchmarking study to explore generative deep architectures optimized for COI barcode sequence generation, with the aim of overcoming data scarcity and class imbalance problems. We selected four generative architectures that encompass distinct generation mechanisms: autoregressive networks that learn sequential probabilistic dependencies, VAEs and GANs that approximate data distributions in latent space, and the recent Latent Diffusion Model, which generates high-quality data through iterative denoising processes.
We constructed a total of six generative models incorporating gated recurrent unit (GRU) layers, Transformer blocks, and convolutional layers, and conducted comparative analyses against a traditional statistical baseline, the N-gram probabilistic model. The main contributions of this study are as follows:
- Through a comparative analysis of six generative models with distinct structural characteristics, we propose an optimal model capable of generating high-quality data while preserving the biological characteristics of COI sequences.
- We propose a multidimensional evaluation framework for generative models that assesses biological plausibility, intra- and inter-species structural preservation, and the diversity of the generated sequences.
- Using COI sequences of vertebrates and invertebrates collected from the BOLD (Barcode of Life Data Systems) database, we validate that the synthetic sequences generated in this study are placed in phylogenetically consistent positions relative to real sequences.

2. Methods
2.1 Data acquisition and preprocessing
We aim to generate new species-specific COI gene sequences using deep generative models. The COI gene, a protein-coding segment of mitochondrial DNA (mtDNA), is characterized by inherent biological, structural, and evolutionary constraints. Its sequences comprise both conserved and variable regions. The conserved regions are located at the N-terminus and C-terminus, exhibiting high conservation across diverse taxa. These conserved amino acid sequences facilitate the design of primers. Conversely, the central region of the COI gene encompasses variable segments with elevated mutation rates. These highly variable regions have been employed for species identification, population differentiation, and the assessment of evolutionary relationships. We developed a preprocessing pipeline to enforce the functional, evolutionary, and molecular biological constraints associated with the COI gene at the data level.
We utilized publicly accessible DNA barcode sequence datasets (Cypraeidae, Drosophila, Birds, Bats) obtained from http://dmb.iasi.cnr.it/supbarcodes.php [ 6 ]. These open-access datasets have been utilized for species classification tasks. Notably, these datasets exhibit high phylogenetic diversity alongside minimal sequence divergence among species, thereby presenting significant challenges for accurate identification. The final training and testing datasets were constructed through the following preprocessing steps (Fig. 1).
1. Character Normalization: We converted IUPAC ambiguous nucleotide codes (Y, R, K, S, M, W, D, B, H, V), as well as gap characters introduced during sequence alignment (i.e., '-' and '_'), into the character 'N'. This step was employed to minimize uncertainty within the sequences and to prevent the model from learning erroneous pattern information.
2. Verification of Minimum Barcode Sequence Length (< 300 bp Filter): COI barcode sequences typically span between 600 and 658 base pairs. Both conserved and variable regions must be retained within the overall sequence length to generate reliable barcode sequences. Therefore, sequences shorter than 300 base pairs, which may lack critical barcoding information, were removed.
3. Verification of the Number of Data per Species (< 3 Species Data Filter): To generate new barcode data that reflect the characteristics of each species, it is important to train the model on a sufficiently diverse dataset within each species. Training with only one or two sequences per species results in generated outputs that closely replicate the training data. Therefore, we set a minimum of 3 sequences per species.
4. Orientation Correction based on Open Reading Frame (ORF): Given that COI sequences represent protein-coding genes, the presence of internal stop codons and frameshift mutations is biologically implausible. To evaluate sequence orientation, each sequence was analyzed in all three reading frames on both the forward and reverse-complement strands. Following translation in each reading frame and strand, we selected the orientation with the fewest internal stop codons and the longest ORF. This step normalized all COI sequences to a uniform orientation.
5. Quality Control: To improve data quality, COI sequences were validated against established criteria. Sequences exhibiting one or more internal stop codons, an ambiguous nucleotide (N) ratio of 5% or higher, translated protein lengths of 100 amino acids or fewer, or redundant entries were excluded from the dataset.
6. Application of Multiple Alignment using Fast Fourier Transform (MAFFT) Algorithms: The MAFFT software was utilized to align unaligned and heterogeneous sequence data [ 20 ]. Following multiple sequence alignment, the location of the ATA/ATT start codon was identified to determine the starting point of the barcode region. From this defined location, the core barcode segment, approximately 600–658 base pairs in length, was consistently extracted. This process mitigated length discrepancies between full-length COI sequences and partial barcode sequences, ensuring the alignment of conserved, variable, and terminal regions in a structurally coherent manner. Consequently, generative models were able to reliably capture biologically meaningful patterns. This approach enhanced the biological validity of the COI sequences, facilitating the accurate representation of species-specific patterns in COI barcode sequences throughout the training process.
7. Data Rebalancing: Following the previous steps, the quantity of data per species could differ; therefore, Step 2 was repeated to select the final data.
Ultimately, we utilized preprocessed data from invertebrate taxa (Cypraeidae and Drosophila) and vertebrate taxa (Birds and Bats). Detailed information regarding the four datasets is presented in Table 1. The entire dataset was split 8:2 into training and test sets to perform model training and validation of species-specific COI sequence generation. For the synthetic data, sequences were generated to correspond in length to those within each dataset.

Table 1 Details of the barcode datasets. We constructed four final COI datasets through a preprocessing pipeline. The data were divided into training and test sets for model validation.
Type          | Dataset    | Num. of Sequences (Train / Test) | Length (bp) | Num. of Species
Invertebrates | Cypraeidae | 994 / 249                        | 614         | 119
Invertebrates | Drosophila | 336 / 85                         | 663         | 16
Vertebrates   | Bats       | 324 / 81                         | 659         | 54
Vertebrates   | Birds      | 249 / 63                         | 690         | 54

2.2 Conditional Deep Generative Models
We compared and analyzed six generative models to explore the optimal generative deep learning model for simulating COI barcoding sequences. We constructed generative deep learning models designed to address two questions. Firstly, generative models can be generally categorized into RNN and CNN architectures. RNNs are proficient in capturing sequential information and patterns, whereas CNNs are more adept at recognizing localized features and patterns. Therefore, it is necessary to investigate which model architecture performs optimally for the characteristics of the data. Generative models were developed using GRU layers and convolutional layers, followed by a comparative evaluation of their performance. Secondly, previous studies on biological data generation have not demonstrated the application of the proposed models to diverse data types.
Therefore, we analyzed the performance of applying the basic structure of the models employed in previous studies to COI sequence data with distinct biological characteristics. Considering these two factors, we constructed an autoregressive language model (ARLM), a variational autoencoder (VAE), a generative adversarial network (GAN), and a latent diffusion model (LDM) (Fig. 2 ). Additionally, we employed an N-gram probabilistic model as a baseline for comparison against the deep learning models. For models based on GRU layers, start and end tokens were appended to both ends of each sequence, after which the sequence characters were encoded as integer labels (e.g., [START, END, A, C, G, T, N] = [0, 1, 2, 3, 4, 5, 6]). This encoding was adopted to preserve sequential and positional information by defining the beginning and end of each input sequence. In contrast, convolution-based models used one-hot encoded representations of the nucleotide characters (A, C, G, T, N), which were processed as one-dimensional feature patterns. The N-gram probabilistic model is a classical statistical approach that generates sequences based on the probabilities of local nucleotide patterns observed within species-level data [ 21 ]. It accumulates the frequencies of the next nucleotide conditioned on a context of length n-1. In other words, it estimates the probability distribution of the next nucleotide based on the frequency statistics of consecutive n-length subsequences (n-grams). While the N-gram method offers advantages such as simplicity, intuitive interpretation of species specificity, and rapid generation speed, it struggles to capture long-term dependencies and is limited in generating novel sequences that reflect species-level characteristics. To ensure the preservation of the reading frame in the COI data, we set n = 3. The ARLM is a sequential generative approach that predicts the next token conditioned on the tokens up to the previous time step [ 22 ].
Given the COI barcode sequence, the model predicts the subsequent token by comparing the input tokens up to position t with the target tokens up to position t + 1. ARLMs are characterized by their simple architecture, training stability, and robust performance even with small datasets [ 23 ]. We constructed two conditional ARLMs (Fig. 2 (a): GRU-based ARLM and Fig. 2 (b): Transformer-based ARLM). In the GRU-based ARLM, inputs that combine embedded nucleotide sequences and species embeddings are processed through three GRU layers (hidden dimension = 256, dropout = 0.3), followed by a fully connected layer to predict the probability distribution of the next nucleotide token. The GRU mechanism accumulates sequential dependencies, allowing the model to capture local patterns within the nucleotide sequences. For the Transformer-based ARLM, inputs formed by combining nucleotide token, positional, and species embeddings are processed through three Transformer decoder layers employing a pre-layer normalization structure (heads = 4, feedforward dimension = 512, dropout = 0.3), followed by a fully connected layer to predict the probability distribution of the next nucleotide token. The Transformer model is capable of capturing long-range dependencies within the sequences through its self-attention mechanism.
The VAE is a probabilistic generative model that approximates a multivariate normal distribution to represent the intrinsic characteristics of the training data. It consists of an encoder, which infers the mean and variance \((\mu, \sigma^{2})\), the parameters of the latent vector distribution, from the input, and a decoder that generates new data from these latent vectors. By incorporating species embeddings during training, the model establishes a latent distribution conditioned on the species. Consequently, random sampling from this distribution facilitates the generation of diverse intraspecific data.
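As a rough illustration of the sampling step described above, the sketch below draws a latent vector from the encoder's Gaussian and attaches a species embedding before decoding. It is a minimal sketch under stated assumptions, not the models used in this study: the function names, the log-variance parameterization, and the use of plain concatenation for conditioning are all hypothetical choices made for illustration.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, 1).

    Assumption: the encoder outputs the log-variance rather than the variance.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def condition_on_species(z, species_embedding):
    """Assumed conditioning scheme: concatenate latent and species vectors."""
    return list(z) + list(species_embedding)


rng = random.Random(0)
z = sample_latent([0.2, -0.1], [0.0, 0.0], rng)      # one latent draw
decoder_input = condition_on_species(z, [1.0, 0.0, 0.0])
```

Repeated calls to `sample_latent` with the same species embedding give different decoder inputs, which is the mechanism behind the intraspecific diversity mentioned above.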
We conducted a comparative analysis of two VAE model architectures in this study. First, the Chemical VAE was proposed as a model that represents discrete molecular structures in a continuous latent space and derives optimal molecular structures through optimization in that continuous representation [ 24 ]. Adopting the fundamental structure of the Chemical VAE, we constructed the Conv-based VAE with an encoder composed of three 1D convolutional layers (kernel sizes = [9, 9, 11]) and a decoder composed of three GRU layers (hidden dimension = 256, dropout = 0.3) (Fig. 2 (c), Conv-based VAE). The one-hot encoded nucleotide sequences were encoded, concatenated with species embeddings, and used to sample latent variables z. The decoder then generated sequences from latent samples conditioned on species embeddings. The GRU-based VAE was constructed using GRU layers for both the encoder and decoder, each comprising three GRU layers (hidden dimension = 256, dropout = 0.3). The embedded nucleotide sequences were concatenated with species embeddings to form the latent representation, which was then used to generate nucleotide sequences (Fig. 2 (d), GRU-based VAE). While convolutional architectures capture local patterns in sequence data, GRU layers are effective in preserving sequential order and modeling long-range dependencies.
The Wasserstein GAN is a probabilistic generative model that learns to approximate and minimize the Wasserstein (Earth-Mover) distance between the real and generated data through adversarial training between a generator and a critic. To improve training stability and mitigate mode collapse, we applied a gradient penalty term (Fig. 2 (e), WGAN). The generator is conditioned on species labels and random noise to synthesize barcode sequences, while the critic evaluates real and generated sequences under the same species condition to calculate a Wasserstein score.
Adversarial training allows the model to directly approximate the underlying COI distribution. Following previous methods on SNP generation using WGAN-GP [ 15 ], we constructed a 1D convolutional generator–critic architecture. The generator concatenates noise and species embeddings, processes the input through a fully connected layer, and passes it through two convolutional blocks (Conv1D, BatchNorm, LeakyReLU with \(\alpha = 0.01\)) to output nucleotide probabilities. The critic receives one-hot encoded sequences concatenated with species embeddings along the sequence dimension, processes them via three convolutional blocks, and projects the result to a scalar Wasserstein score.
The LDM is a generative network that encodes high-dimensional data into a low-dimensional continuous latent space and learns the diffusion process within that space [ 25 ]. Building upon the core architecture of DiscDiff, a diffusion-based model originally proposed for promoter DNA sequence generation [ 17 ], we constructed a conditional LDM for COI barcode sequence generation (Fig. 2 (f), LDM). Nucleotide token embeddings were concatenated with species embeddings and transformed into a continuous latent representation \(z_{0} \in \mathbb{R}^{L \times 256}\) using an encoder composed of one-dimensional convolutional layers. A diffusion process was then applied to the latent representation, progressively adding Gaussian noise over time steps t to obtain noisy latent variables \(z_{t}\). These noisy representations were denoised using a 1D U-Net to recover clean latent representations. The denoised latent variables were subsequently decoded into nucleotide sequences using a decoder consisting of three GRU layers (hidden dimension = 256, dropout = 0.3). For the diffusion process, a linear noise schedule was employed with \(\beta_{t} \in [10^{-4}, 0.02]\), and the total number of diffusion timesteps was set to 1,000.
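To make the noise schedule above concrete, the following sketch builds a linear schedule with the stated endpoints and 1,000 timesteps, and applies the closed-form forward noising commonly used in DDPM-style models, \(z_{t} = \sqrt{\bar{\alpha}_{t}}\, z_{0} + \sqrt{1 - \bar{\alpha}_{t}}\, \epsilon\). The closed form and all function names are assumptions for illustration; the text specifies only the schedule range and the number of timesteps.

```python
import math
import random

T = 1000                          # number of diffusion timesteps (from the text)
BETA_START, BETA_END = 1e-4, 0.02  # linear schedule endpoints (from the text)


def linear_betas(num_steps=T, start=BETA_START, end=BETA_END):
    """Linearly spaced noise variances beta_1 ... beta_T."""
    return [start + (end - start) * t / (num_steps - 1) for t in range(num_steps)]


def alpha_bars(betas):
    """Cumulative products abar_t = prod_{s <= t} (1 - beta_s)."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def q_sample(z0, t, abar, rng):
    """Assumed DDPM-style forward step: noise a clean latent z0 to timestep t."""
    signal, noise = math.sqrt(abar[t]), math.sqrt(1.0 - abar[t])
    return [signal * x + noise * rng.gauss(0.0, 1.0) for x in z0]


betas = linear_betas()
abar = alpha_bars(betas)
z_noisy = q_sample([0.5, -1.0, 0.25], t=500, abar=abar, rng=random.Random(0))
```

With this schedule the cumulative signal coefficient decays to nearly zero by the final timestep, so the last latent is close to pure Gaussian noise, which is what allows generation to start from random noise.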
We performed temperature-based sampling using six different settings to generate 30 synthetic sequences per species. The generated sequences were filtered to ensure that they did not overlap with the real data and that the proportion of ambiguous nucleotides (N) was less than 0.05. The synthetic datasets were then used for comparative analyses against the real data.

2.3 Evaluation Metrics for Synthetic Data
To assess the biological and statistical quality of synthetic COI data generated for each species by various generative models, we conducted evaluations based on the following three criteria.
1. Biological Plausibility of the Synthetic COI Sequences
To evaluate whether each generative model generates biologically valid protein-coding sequences, we measured the GC content at the third codon position (GC₃ content), the Jensen-Shannon divergence (JSD) of codon bias, and the average length of ORFs. The GC₃ content is the proportion of G or C nucleotides at the third base of codons within COI sequences; because substitutions at this position are often synonymous, it has little effect on the encoded protein but varies across evolutionary lineages. Moreover, species-specific GC₃ patterns arise from diverse biological factors, enabling an evaluation of how accurately synthetic sequences replicate these species-specific codon usage patterns. Consequently, a smaller difference in GC₃ content (Δ) between the real and synthetic data indicates that the generative model effectively reproduces the species-specific nucleotide composition. The Kullback-Leibler divergence (KLD) is a metric that measures the extent to which one probability distribution diverges from a reference distribution. The Jensen-Shannon divergence (JSD) is a symmetric measure obtained by averaging the KLD from each of two probability distributions to their mean distribution. Codon bias JSD is the JSD between the codon frequency distributions of the real and synthetic data.
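The two composition metrics just defined can be sketched as follows. This is an illustrative sketch under assumptions rather than the evaluation code of this study: sequences are assumed to be in frame starting at position 0, codons containing non-ACGT characters are skipped, and base-2 logarithms are used so that the JSD lies in [0, 1]. All function names are hypothetical.

```python
import math
from collections import Counter


def codons(seq):
    """Split an in-frame sequence into complete codons, skipping ambiguous ones."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    triplets = (seq[i:i + 3] for i in range(0, usable, 3))
    return [c for c in triplets if set(c) <= set("ACGT")]


def gc3_content(seqs):
    """Proportion of G or C at the third codon position across all sequences."""
    third = [c[2] for s in seqs for c in codons(s)]
    return sum(b in "GC" for b in third) / len(third) if third else 0.0


def codon_distribution(seqs):
    """Relative codon frequencies pooled over a set of sequences."""
    counts = Counter(c for s in seqs for c in codons(s))
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()} if total else {}


def jsd(p, q):
    """Jensen-Shannon divergence: mean KLD of p and q to their mixture m."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kld(a):
        return sum(a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0)

    return 0.5 * kld(p) + 0.5 * kld(q)


real = ["ATGGCCATC", "ATGGCTATT"]        # toy sequences, not study data
synthetic = ["ATGGCCATT", "ATGGCAATC"]
delta_gc3 = abs(gc3_content(real) - gc3_content(synthetic))
codon_jsd = jsd(codon_distribution(real), codon_distribution(synthetic))
```

In a per-species evaluation, both quantities would be computed separately for each species and then summarized across species.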
Despite encoding identical amino acids, different species exhibit unique codon usage preferences. Consequently, a JSD value closer to 0 indicates that the codon usage patterns of the synthetic data are similar to those of the real data. This similarity indicates that the synthetic sequences are likely to be biologically compatible with natural translational processes. Biologically normal COI sequences are characterized by a single extended ORF that initiates with a start codon and terminates with a stop codon. The average ORF length represents the average length of continuous coding regions without stop codons, thereby indicating the maintenance of a normal single reading frame. Consequently, a difference in average ORF length (Δ) between real and synthetic data close to 0 implies that the synthetic sequences accurately preserve the reading frame structure and stop codon patterns.
2. Phylogenetic & Population level Fidelity of the Synthetic COI Sequences
To evaluate whether the species-specific synthetic COI sequences preserve inter-specific discrimination and intra-specific diversity, we calculated the mean intraspecific K2P distance, the mean K2P distance between real and synthetic sequences, and the barcode gap rate using the Kimura 2-Parameter (K2P) model. Through an integrative analysis of these metrics, we evaluated the phylogenetic validity of the synthetic data. The K2P genetic distance model serves as a representative method for estimating evolutionary divergence between nucleotide sequences. It distinguishes between transitions (A↔G, C↔T) and transversions (A↔C, A↔T, G↔C, G↔T), thereby accounting for the different occurrence probabilities of these two substitution types. After aligning two sequences, the proportions of observed transitions and transversions are calculated and subsequently employed to estimate the expected number of substitutions per nucleotide site, referred to as genetic distance.
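The distance computation described above can be sketched with the standard K2P formula, \(d = -\tfrac{1}{2}\ln(1 - 2P - Q) - \tfrac{1}{4}\ln(1 - 2Q)\), where P and Q are the observed proportions of transitions and transversions. The sketch assumes the two sequences are already aligned to equal length and skips sites containing non-ACGT characters; the function names and these handling choices are illustrative assumptions.

```python
import math
from itertools import combinations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(a, b):
    """Kimura 2-parameter distance between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    sites = transitions = transversions = 0
    for x, y in zip(a.upper(), b.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue  # skip gaps and ambiguous bases (assumed handling)
        sites += 1
        if x == y:
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine
    if sites == 0:
        return float("nan")
    p, q = transitions / sites, transversions / sites
    term1, term2 = 1.0 - 2.0 * p - q, 1.0 - 2.0 * q
    if term1 <= 0.0 or term2 <= 0.0:
        return float("inf")       # saturated: distance is undefined
    return -0.5 * math.log(term1) - 0.25 * math.log(term2)


def mean_pairwise_k2p(seqs):
    """Mean K2P distance over all unordered pairs, e.g. within one species."""
    dists = [k2p_distance(a, b) for a, b in combinations(seqs, 2)]
    return sum(dists) / len(dists) if dists else 0.0
```

For example, two 10-site sequences differing by a single transition give P = 0.1 and Q = 0, so d = -0.5 ln(0.8), slightly larger than the raw mismatch rate of 0.1.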
In contrast to basic mismatch rate calculations, the K2P model accounts for the asymmetry between transitions and transversions observed in mtDNA; therefore, it is widely used for COI sequence analysis. The mean intra-species K2P distance is an indicator that evaluates the extent to which synthetic data capture the within-species genetic diversity observed in the real data. A lower intra-specific K2P value indicates higher sequence similarity among individuals within the same species. We calculated the mean intra-specific K2P distances for both real and synthetic data and determined the difference between them (Δ). Thus, a Δ value approaching zero indicates that the generative model has effectively reproduced the patterns of intra-specific variation. The mean real–synthetic K2P distance is the average K2P distance calculated between all real and synthetic sequences. It indicates whether the synthetic data are located within the actual genetic space; lower values indicate that the synthetic data are positioned closer to or within the real data distribution. When the intra-species K2P distances of the real and synthetic data are similar and the mean real–synthetic K2P distance is minimal, the synthetic data not only lie within the real data space but also preserve an intra-species structure similar to that of the real data. A barcode gap indicates distinct species separation, occurring when the maximum intra-specific genetic distance is smaller than the minimum inter-specific genetic distance. The barcode gap rate represents the proportion of species exhibiting this gap; a higher rate suggests enhanced potential for species discrimination based on COI sequences. We calculated the barcode gap rates for both real and synthetic datasets and evaluated their difference (Δ). A Δ value close to zero indicates that the generative model has effectively preserved the inter-specific boundaries.
3. Biodiversity Representation of the Synthetic COI Sequences
We evaluated whether the synthetic data preserve the statistical properties of the real data while maintaining intra-species diversity. For quantitative evaluation, the JSD-based k-mer distribution divergence, Self-BLEU (Bilingual Evaluation Understudy), and Nearest Neighbour Adversarial Accuracy (AA) were utilized. The k-mer distribution reflects species-specific statistical characteristics, such as repetitive elements and GC/AT compositional patterns. We calculated frequency vectors for 3-mers, corresponding to the codon-length unit (3 nucleotides), from the COI sequences. To assess the similarity between the real and synthetic sequences, we calculated the JSD between their respective 3-mer frequency distributions. The JSD value ranges from 0 to 1, where a value closer to 0 indicates a higher similarity in k-mer distributions between the real and synthetic data. In other words, this implies that the synthetic data effectively reproduce the local patterns and complexity of the real data. Self-BLEU is a metric that evaluates the similarity among generated sequences, where a lower value indicates higher diversity. This metric facilitates the evaluation of pattern similarity and redundancy within the synthetic data. A difference between the Self-BLEU scores (Δ) of real and synthetic data close to 0 implies that the synthetic data exhibit a diversity level comparable to that of the real data. AA is a generative quality evaluation metric based on the 1-nearest neighbor distance that quantifies the overlap between real and synthetic data [ 15 ]. This metric evaluates whether the synthetic data demonstrate excessive similarity to the real data (overfitting) or excessive dissimilarity (underfitting). The metric was computed using 3-mer distributions extracted from both real and synthetic data.
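A minimal sketch of the nearest-neighbour adversarial accuracy on 3-mer frequency vectors is given below. It follows the usual leave-one-out formulation, in which a sample counts as correctly classified when its nearest neighbour in its own set is closer than its nearest neighbour in the other set, and AA is the mean of that fraction over the real and synthetic sets. The Euclidean distance, the overlapping 3-mer counting, and the function names are assumptions; the text specifies only that 3-mer distributions were used.

```python
import math
from collections import Counter
from itertools import product

KMERS = ["".join(p) for p in product("ACGT", repeat=3)]  # 64 possible 3-mers


def kmer_vector(seq, k=3):
    """Relative frequencies of overlapping ACGT-only k-mers in one sequence."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in KMERS)
    return [counts[m] / total if total else 0.0 for m in KMERS]


def _fraction_own_closer(own, other):
    """Share of points whose nearest own-set neighbour beats the other set."""
    hits = 0
    for i, v in enumerate(own):
        d_own = min(math.dist(v, w) for j, w in enumerate(own) if j != i)
        d_other = min(math.dist(v, w) for w in other)
        hits += d_other > d_own
    return hits / len(own)


def adversarial_accuracy(real_vecs, synth_vecs):
    """AA in [0, 1]; each set needs at least two vectors (leave-one-out)."""
    return 0.5 * (_fraction_own_closer(real_vecs, synth_vecs)
                  + _fraction_own_closer(synth_vecs, real_vecs))


real_vecs = [kmer_vector(s) for s in ["ATGGCCATCGGA", "ATGGCTATTGGA"]]
synth_vecs = [kmer_vector(s) for s in ["ATGGCCATTGGA", "ATGGCAATCGGT"]]
aa = adversarial_accuracy(real_vecs, synth_vecs)
```

The toy sequences are for illustration only; in practice the two sets would contain the real and synthetic sequences of a dataset.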
AA values closer to 1 indicate that the synthetic data substantially diverge from the real data (underfitting), whereas values closer to 0 indicate that the synthetic data closely resemble the real data (overfitting). Therefore, an AA value near 0.5 denotes that the synthetic data effectively capture the characteristics of the real data while preserving diversity. Specifically, values above 0.5 imply a lack of realism in the synthetic data, whereas values below 0.5 indicate a propensity for the synthetic data to closely replicate the real data.
2.4 Visualization
We analyzed conserved and non-conserved regions within COI sequences using position-wise Shannon entropy profiles. Shannon entropy is a measure of uncertainty and randomness within a probability distribution. Here, it reflects the nucleotide distribution (A, T, C, G) at each position across all sequences. Conserved regions, characterized by patterns shared among species, correspond to low entropy values, whereas non-conserved or variable regions exhibit high entropy values. We validated the biological plausibility of the synthetic sequences by comparing the Shannon entropy profiles of the real and synthetic data. To evaluate the overall similarity between real and synthetic data, we utilized principal component analysis (PCA) and uniform manifold approximation and projection (UMAP) [26]. The sequences were transformed into 3-mer frequency vectors, followed by dimensionality reduction via PCA to extract global features. Subsequently, UMAP was applied to project these features into a two-dimensional embedding space. We visually evaluated the genetic consistency of the data by examining whether the real and synthetic data occupied a common genetic manifold. Additionally, we evaluated the reproducibility of local base pattern distributions using a 3-mer abundance scatter plot. We performed a comparative analysis of 3-mer frequency distributions between real and synthetic data.
Data points that closely align with the diagonal line (y = x) indicate a more accurate simulation of the real sequence characteristics. Conversely, substantial deviations from this diagonal indicate that the model generates unrealistic sequences. We also assessed the reproducibility of infrequent patterns by applying a log-scale to the frequency values. To qualitatively assess phylogenetic coherence between real and synthetic sequences, we reconstructed maximum-likelihood (ML) phylogenetic trees. For each dataset, real COI sequences and synthetic COI sequences generated by each model were concatenated, and multiple sequence alignment was performed using MAFFT [ 20 , 27 ]. The aligned sequences were then analyzed in IQ-TREE 3 to infer an ML tree, and branch support was estimated using 1,000 ultrafast bootstrap replicates, producing a Newick-format tree file [ 28 ]. Based on the resulting tree, we defined, for each taxon (species level for Bats, Cypraeidae, and Drosophila; group ID level for Birds), the smallest clade defined by the most recent common ancestor (MRCA) shared by the real sequences of that taxon as the MRCA-defined real clade. We then calculated the in-clade rate, defined as the proportion of synthetic sequences placed within the MRCA-defined real clade, to quantify the extent to which synthetic sequences were positioned within the taxon-specific phylogenetic structure of the real sequences. To additionally account for clade boundary specificity, we computed the non-focal real-tip count within the MRCA-defined real clade (i.e., the number of real tips belonging to taxa other than the focal taxon) and interpreted a value of zero as indicating that taxon-specificity of the real clade was maintained. Taxon-level summary statistics (including the in-clade rate and related metrics) are provided in the Supplementary Data. 
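The in-clade rate and non-focal real-tip count defined above can be sketched on a toy tree as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the tree is assumed to be rooted and is written as nested tuples instead of being parsed from the Newick file, and the tip labels of the form `real|taxon|id` and `syn|taxon|id` are a hypothetical naming scheme.

```python
def tips(node):
    """All tip labels under a node (a tip is a str, an internal node a tuple)."""
    if isinstance(node, str):
        return [node]
    return [t for child in node for t in tips(child)]

def mrca_clade(node, targets):
    """Smallest subtree that contains every label in targets, or None."""
    if not targets <= set(tips(node)):
        return None
    if not isinstance(node, str):
        for child in node:
            sub = mrca_clade(child, targets)
            if sub is not None:
                return sub  # a child already holds all targets, so go deeper
    return node

def in_clade_stats(tree, taxon):
    """Return (in-clade rate, non-focal real-tip count) for one taxon."""
    all_tips = tips(tree)
    real = {t for t in all_tips if t.startswith(f"real|{taxon}|")}
    synth = [t for t in all_tips if t.startswith(f"syn|{taxon}|")]
    clade = set(tips(mrca_clade(tree, real)))  # MRCA-defined real clade
    rate = sum(t in clade for t in synth) / len(synth) if synth else float("nan")
    non_focal = sum(t.startswith("real|") and t not in real for t in clade)
    return rate, non_focal

# Toy tree: two synthetic A sequences fall inside the real A clade, one falls outside.
toy_tree = (
    (("real|A|1", "syn|A|1"), ("real|A|2", "syn|A|2")),
    (("real|B|1", "real|B|2"), "syn|A|3"),
)
rate_a, non_focal_a = in_clade_stats(toy_tree, "A")
```

In this toy case the real clade of taxon A contains no real tips from taxon B, so the non-focal real-tip count is zero, and two of the three synthetic A sequences are placed inside it.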
2.5 Phylogenetic validation using BOLD reference sequences
For external validation using independently collected real-world data, COI barcode reference sequences of Chiroptera and Drosophilidae were downloaded from the BOLD (Barcode of Life Data Systems) data portal [29]. From the retrieved records, we retained only sequences with collection countries recorded as the Republic of Korea, Japan, or China, and selected entries annotated with the marker code COI-5P (BOLD’s designation for the standard 5′ region of the COI barcode). We further filtered records to those containing both a process identifier (processid) and nucleotide sequence information (nuc). Sequences were cleaned by removing whitespace and normalizing all characters to uppercase, and were exported in FASTA format. The filtered BOLD reference sequences were combined with synthetic sequences generated in this study (GRU-based ARLM outputs), and multiple sequence alignment (MSA) was performed using MAFFT [20, 27]. Using the aligned sequences as input, maximum-likelihood phylogenetic trees were inferred with IQ-TREE 3, and branch support was evaluated with 1,000 ultrafast bootstrap replicates. The resulting trees were inspected to determine whether the synthetic sequences were placed in phylogenetically coherent positions relative to the BOLD reference sequences [30]. In addition, taxonomic composition and patterns of potential mixing were examined using Krona to generate interactive HTML-based visualizations [31].
2.6 Implementation Details
All training and testing were performed using the PyTorch framework on four GeForce RTX 3090 Ti GPUs (16GB × 4; Z-202410106770) provided by the Bio-Bigdata Analysis and Utilization of Biological Resources at Soonchunhyang University. Common configurations applied across all generative models included a warm-up cosine learning rate decay and the Adam optimizer to improve training efficiency and model accuracy.
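A warm-up cosine learning rate schedule of the kind mentioned above can be sketched as a plain function of the epoch index. The base learning rate of 0.001 and the 100 epochs come from the text; the warm-up length of 5 epochs, the final rate of 0, the per-epoch (not per-step) update, and the function name are assumptions for illustration, since the paper does not state them.

```python
import math

def warmup_cosine_lr(epoch, base_lr=1e-3, total_epochs=100, warmup_epochs=5, min_lr=0.0):
    """Learning rate for a given 0-indexed epoch.

    Linear warm-up to base_lr over warmup_epochs (assumed value),
    then cosine decay from base_lr towards min_lr (assumed value).
    """
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Learning rate for every epoch of a 100-epoch run.
schedule = [warmup_cosine_lr(e) for e in range(100)]
```

In a PyTorch training loop, a function like this could be wrapped in `torch.optim.lr_scheduler.LambdaLR` by dividing its output by the base learning rate.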
The initial learning rate was set to 0.001, with training conducted over 100 epochs and a batch size of 16. The loss functions for each model are as follows: the GRU- and Transformer-based ARLMs used cross-entropy loss; the Conv- and GRU-based VAEs used a loss combining the cross-entropy reconstruction loss with the KL divergence loss. WGAN is designed to minimize the difference between the real data distribution and the synthetic data distribution by leveraging the Wasserstein-1 distance. The critic outputs a real-valued score, and the generator is trained to maximize this score, thereby minimizing the Wasserstein distance. To enhance training stability, the gradient penalty coefficient was set to 10, and the update ratio between the critic and the generator was maintained at 5:1. LDM was optimized by defining the overall loss function as a weighted sum of the cross-entropy loss for reconstruction and the mean squared error (MSE) loss for noise prediction.
3. Results
3.1. Evaluating Biological Plausibility in Deep Generative DNA Models
Table 2 presents the biological validity results of synthetic COI data generated by seven models across four datasets. For all three evaluation metrics (GC₃ content (Δ), codon bias JSD, and ORF mean length (Δ)), values closer to zero indicate that the synthetic data more accurately replicate the actual biological structures. Overall, the N-gram models showed low biological plausibility across all four datasets. While they partially replicated the statistical distribution of nucleotide sequences, they did not accurately capture species-specific codon usage patterns and exhibited significant deviations in ORF frame preservation. This indicates that traditional statistical methods are inadequate for capturing the structural constraints intrinsic to COI sequences. In contrast, the GRU-based ARLM showed the smallest or near-smallest deviations across all three metrics in all datasets.
This demonstrates that the GRU-based ARLM generates synthetic data with high protein translatability by reproducing codon structures and patterns similar to real data and stably preserving ORF frames. Therefore, the GRU-based ARLM was validated as the most effective model for generating biologically plausible synthetic COI sequences. Additionally, the GRU-based VAE ranked second in overall performance, demonstrating that GRU layers can effectively capture nucleotide continuity and preserve the codon structure of COI data across different data types (invertebrate and vertebrate). In contrast, although the Transformer-based ARLM showed relatively good performance in replicating GC₃ content, it exhibited inferior results in the codon bias JSD and ORF mean length metrics. While the Transformer architecture is effective for learning global patterns, it has limitations in replicating local and sequential patterns at the codon level. Convolution-based feature learning models, including the Conv-based VAE, WGAN, and LDM, exhibited low biological validity across all three metrics. In particular, WGAN and the Conv-based VAE did not preserve ORF frames, leading to poor translatability at the codon level. For LDM, codon bias deviation and ORF collapse were observed. These results indicate that distribution learning within the latent space does not sufficiently capture the intrinsic biological constraints of COI sequences. Furthermore, we confirmed that in the generation of COI data, learning the sequential dependencies of nucleotide sequences using GRU layers is more effective in preserving biological properties in synthetic COI data than focusing on local patterns. Figure 3 shows the position-wise Shannon entropy profiles between real COI sequences and synthetic data generated by seven models across four datasets. The results are arranged from top to bottom for Cypraeidae, Drosophila, Bats, and Birds, respectively. 
The left-hand graphs show the position-wise Shannon entropy profiles, whereas the right-hand graphs represent the entropy deviation between the synthetic and real data. COI sequences comprise both functionally conserved regions and non-conserved (variable) regions. Therefore, we analyzed position-wise Shannon entropy to quantitatively evaluate how accurately each generative model replicates these biological conservation-variation patterns. We identified highly conserved regions with low variation (indicated by grey shading) by setting a threshold at the bottom 30% of values in the smoothed entropy distribution. When compared to the real data (black line), the synthetic data generated by both the GRU-based ARLM and GRU-based VAE exhibited entropy profiles similar to the real sequences, replicating the relative positions of conserved and variable regions. In particular, the GRU-based ARLM consistently maintained entropy deviations close to zero, demonstrating a superior capability to preserve nucleotide-level patterns. The Transformer-based ARLM exhibited global entropy profiles similar to the real data; however, it presented lower position-wise entropy values relative to the real sequences. Considering that high-quality synthetic data should maintain a balance between reflecting real characteristics and ensuring an adequate level of diversity, these results indicate that the Transformer-based ARLM over-converges to the real distribution in terms of entropy. WGAN also exhibited entropy profiles similar to the real data; however, it showed substantial entropy deviation, indicating high variability within the synthetic data. Thus, while the relative patterns of the position-wise profiles were preserved, the entropy scale differed from that of the real data. In contrast, the N-gram, Conv-based VAE, and LDM exhibited overall entropy values higher than those of real data, characterized by linear or irregular profiles.
These models maintained high entropy values even within the grey-shaded regions, indicating the generation of unnatural sequences with low biological plausibility. In addition, the comparison of datasets reveals that entropy deviations are generally larger in the Birds dataset (longest sequences) than in the Cypraeidae dataset (shortest sequences). Notably, the entropy deviation in the Birds dataset exhibits a significant increase beyond 500 bp, indicating a failure to accurately replicate the conserved terminal region near 600 bp. Within this region, the WGAN outperformed the GRU-based ARLM by replicating patterns closer to the real data. This indicates a potential limitation of GRU-based models in maintaining long-term dependencies as the sequence lengthens. In other words, while GRU-based models are effective in learning local nucleotide continuity, they encounter difficulties in accurately replicating the global structure of lengthy and complex sequences. To validate the biological plausibility of the synthetic COI data, we performed a comparative analysis of codon structure and usage patterns, frame preservation length, and the position-wise profiles of conserved and variable regions. The results demonstrate that GRU-based models are superior in capturing sequential information and continuous patterns at the nucleotide level. These models not only replicate the statistical distribution of nucleotide composition but also accurately reproduce the complex biological patterns of conserved and variable regions across the entire sequence. Table 2 Results of biological plausibility evaluation metrics for synthetic COI data generated by six generative models across the four datasets. 
| Datasets | Models | GC₃ content (Δ) | Codon bias JSD | ORF mean length (Δ) |
|---|---|---|---|---|
| Cypraeidae | N-gram | 0.1843 | 0.0227 | 8.1172 |
| Cypraeidae | GRU-based ARLM | 0.0037 | 0.0123 | -0.0190 |
| Cypraeidae | Transformer-based ARLM | -0.0067 | 0.0342 | -3.2274 |
| Cypraeidae | Conv-based VAE | -0.0106 | 0.0542 | -10.2542 |
| Cypraeidae | GRU-based VAE | 0.0128 | 0.0132 | 2.1256 |
| Cypraeidae | WGAN | 0.0225 | 0.0435 | 4.9894 |
| Cypraeidae | LDM | 0.2322 | 0.1904 | 28.2121 |
| Drosophila | N-gram | 0.2297 | 0.4262 | 40.4645 |
| Drosophila | GRU-based ARLM | -0.0015 | 0.0487 | 7.3208 |
| Drosophila | Transformer-based ARLM | -0.0031 | 0.4284 | -6.2480 |
| Drosophila | Conv-based VAE | 0.2140 | 0.4433 | 63.9333 |
| Drosophila | GRU-based VAE | 0.0081 | 0.0727 | 5.8333 |
| Drosophila | WGAN | 0.0134 | 0.1216 | 8.3333 |
| Drosophila | LDM | 0.0814 | 0.4295 | 20.7083 |
| Bats | N-gram | 0.1292 | 0.0978 | -2.3815 |
| Bats | GRU-based ARLM | 0.0143 | 0.0215 | -2.3333 |
| Bats | Transformer-based ARLM | -0.0046 | 0.0608 | -6.6870 |
| Bats | Conv-based VAE | 0.1155 | 0.1623 | 28.5204 |
| Bats | GRU-based VAE | 0.0167 | 0.0272 | -2.5074 |
| Bats | WGAN | 0.0101 | 0.0446 | 12.5259 |
| Bats | LDM | 0.0730 | 0.3255 | -52.7167 |
| Birds | N-gram | -0.0983 | 0.3627 | 36.9028 |
| Birds | GRU-based ARLM | 0.0016 | 0.0293 | 0.4066 |
| Birds | Transformer-based ARLM | 0.0031 | 0.3675 | 6.3973 |
| Birds | Conv-based VAE | -0.0884 | 0.3775 | 55.5732 |
| Birds | GRU-based VAE | 0.0024 | 0.4437 | -0.3953 |
| Birds | WGAN | -0.0070 | 0.0601 | 1.3232 |
| Birds | LDM | 0.0621 | 0.4922 | -20.1564 |

3.2. Maintaining Phylogenetic and Population-Level Fidelity
Table 3 presents the species discrimination capabilities of synthetic COI data generated by seven models across four datasets. Generally, a K2P distance close to 0 indicates intraspecific genetic similarity, while a value of 0.05 or higher is considered as a threshold for interspecific distinction. Therefore, an intra-species K2P mean (Δ) close to 0 suggests that the synthetic data preserves intraspecific diversity similar to that of the real data. Similarly, a real–synthetic K2P mean near 0 indicates that the genetic structure of the synthetic data is highly similar to that of the real data. Finally, a barcode gap rate (Δ) near 0 indicates that the synthetic data accurately reflects the interspecific distance structure observed in the real data.
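For orientation, the K2P distance and the barcode gap criterion behind these metrics can be sketched as follows. This is an illustrative sketch, not the authors' code: it uses the standard K2P formula d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q), where P and Q are the proportions of transitions and transversions. It assumes the sequences are already aligned to equal length, skips gap and ambiguous positions, and uses hypothetical function names.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def k2p(a, b):
    """Kimura 2-parameter distance between two aligned sequences."""
    n = ts = tv = 0
    for x, y in zip(a.upper(), b.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue  # skip gaps and ambiguous bases (assumption)
        n += 1
        if x == y:
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            ts += 1  # transition: A<->G or C<->T
        else:
            tv += 1  # transversion: purine <-> pyrimidine
    if n == 0:
        return float("nan")
    p, q = ts / n, tv / n
    a1, a2 = 1 - 2 * p - q, 1 - 2 * q
    if a1 <= 0 or a2 <= 0:
        return float("inf")  # saturated: the distance is undefined
    return -0.5 * math.log(a1) - 0.25 * math.log(a2)

def has_barcode_gap(seqs_by_species, focal):
    """True if max intra-specific distance < min inter-specific distance."""
    own = seqs_by_species[focal]
    intra = [k2p(own[i], own[j]) for i in range(len(own)) for j in range(i + 1, len(own))]
    inter = [k2p(s, t) for sp, others in seqs_by_species.items() if sp != focal
             for s in own for t in others]
    return max(intra) < min(inter)

def barcode_gap_rate(seqs_by_species):
    """Proportion of species (with at least two sequences) showing a barcode gap."""
    eligible = [sp for sp, seqs in seqs_by_species.items() if len(seqs) >= 2]
    return sum(has_barcode_gap(seqs_by_species, sp) for sp in eligible) / len(eligible)

toy = {
    "X": ["AAAAAAAAAA", "AAAAAAAAAG"],
    "Y": ["CCAAAAAAAA", "CCAAAAAAAT"],
}
```

Computing `barcode_gap_rate` separately on the real and the synthetic sequences and subtracting the two would give a Δ of the kind reported in Table 3. The sketch needs at least two species, each with at least two sequences.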
An analysis of the overall results showed that both the N-gram and LDM exhibited intra-species K2P mean (Δ) and real–synthetic K2P mean values that mostly exceeded 1. This indicates that the synthetic data occupied a genetic space markedly distinct from that of the real data, exhibiting a divergent pattern of variation. Therefore, the intraspecific and interspecific distance structures in these models collapsed, indicating a failure to generate synthetic data that preserve phylogenetic consistency. Additionally, the barcode gap rate (Δ) exhibited substantially negative values, confirming the collapse of species boundaries within the synthetic data. Conversely, the GRU-based ARLM exhibited values at or near the lowest in magnitude across all three metrics for all four datasets, indicating superior performance in species discrimination. This indicates that the synthetic data generated by the GRU-based ARLM occupied a genetic space closely aligned with that of the real data, effectively replicating both intraspecific variation patterns and interspecific boundaries. Furthermore, the barcode gap rate (Δ) was closer to 0, indicating that the genetic boundaries between species were effectively maintained within the synthetic data. The GRU-based VAE also effectively captured genetic structures at both the intraspecific and interspecific levels. In contrast, while the Transformer-based ARLM preserved consistent intraspecific and interspecific distances, it exhibited reduced performance in maintaining species boundaries, as evidenced by significant deviations in the barcode gap rate (Δ). The WGAN and Conv-based VAE exhibited substantial deviations in K2P distances and significant negative values of the barcode gap rate (Δ), indicating difficulties in distinguishing species within the synthetic data.
Across all generative models examined, the barcode gap rate (Δ) was negative, indicating that species discrimination boundaries in the synthetic data were less distinct compared to those in the real data. This result is likely due to the expansion or overlap of sequence distributions for each species during the generation process. Figure 4 shows the PCA-UMAP comparison between the real data with synthetic data generated by seven models across four datasets. The results are arranged with datasets represented in rows (Cypraeidae, Drosophila, Bats, and Birds) and models in columns. Overall, the synthetic data generated by the GRU-based ARLM and the GRU-based VAE most effectively replicated the cluster distributions and the interspecies cluster distances observed in the real data. These models consistently preserved the compactness within species-specific clusters as well as the separation between species clusters, thereby reflecting the intra-species diversity and phylogenetic structures observed in the real data. This fidelity in distribution aligns with the low values of intra-species K2P mean (Δ) and barcode gap rates (Δ) reported in Table 3 . While synthetic data from the Transformer-based ARLM and WGAN closely aligned with the global distribution of the real data, they exhibited increased scattering around clusters and denser aggregation within particular clusters compared to GRU-based models. This indicates potential instability in preserving patterns of intraspecific variation and maintaining clear interspecific boundaries. In contrast, synthetic data generated by N-gram, Conv-based VAE, and LDM were broadly scattered around the clusters of the real data. In particular, the distribution of data generated by the LDM markedly diverges from that of the real data, indicating that the synthetic data exhibited characteristics that are non-biological in nature. 
Furthermore, the negative values of the barcode gap rate (Δ) presented in Table 3, corroborated by the PCA-UMAP analysis, indicate that species-specific distributions in the synthetic data were more expanded than those in the real data, leading to partial overlaps between adjacent species distributions. Therefore, GRU-based models demonstrated stable performance in replicating the species-specific structures of the real data.

Table 3 Results of species discrimination performance metrics for synthetic COI data generated by six generative models across the four datasets.

| Datasets | Models | Intra-species K2P mean (Δ) | Real–Synth K2P mean | Barcode gap rate (Δ) |
|---|---|---|---|---|
| Cypraeidae | N-gram | 2.6059 | 2.6340 | -0.7143 |
| Cypraeidae | GRU-based ARLM | 0.1269 | 0.0843 | -0.4286 |
| Cypraeidae | Transformer-based ARLM | 0.0621 | 0.0775 | -0.6134 |
| Cypraeidae | Conv-based VAE | 0.2588 | 0.2197 | -0.7143 |
| Cypraeidae | GRU-based VAE | 0.1915 | 0.1272 | -0.5462 |
| Cypraeidae | WGAN | 0.3106 | 0.2945 | -0.7143 |
| Cypraeidae | LDM | 2.2510 | 2.2934 | -0.7143 |
| Drosophila | N-gram | 2.3128 | 2.3167 | -0.6875 |
| Drosophila | GRU-based ARLM | 0.0982 | 0.0629 | -0.5625 |
| Drosophila | Transformer-based ARLM | 0.0713 | 0.0882 | -0.6250 |
| Drosophila | Conv-based VAE | 1.4862 | 1.4562 | -0.6875 |
| Drosophila | GRU-based VAE | 0.1550 | 0.0974 | -0.4375 |
| Drosophila | WGAN | 0.1585 | 0.2043 | -0.5625 |
| Drosophila | LDM | 1.1384 | 1.4618 | -0.6875 |
| Bats | N-gram | 2.8969 | 2.9842 | -0.9630 |
| Bats | GRU-based ARLM | 0.0951 | 0.0635 | -0.6481 |
| Bats | Transformer-based ARLM | 0.1667 | 0.1796 | -0.9630 |
| Bats | Conv-based VAE | 1.5104 | 1.4697 | -0.9630 |
| Bats | GRU-based VAE | 0.0996 | 0.0539 | -0.6667 |
| Bats | WGAN | 0.2357 | 0.2357 | -0.9630 |
| Bats | LDM | 1.4538 | 1.7793 | -0.9630 |
| Birds | N-gram | 2.7890 | 2.8641 | -0.7593 |
| Birds | GRU-based ARLM | 0.0986 | 0.0919 | -0.6296 |
| Birds | Transformer-based ARLM | 0.1124 | 0.1634 | -0.7593 |
| Birds | Conv-based VAE | 1.4611 | 1.4636 | -0.7593 |
| Birds | GRU-based VAE | 0.2160 | 0.1640 | -0.7222 |
| Birds | WGAN | 0.1357 | 0.1869 | -0.7593 |
| Birds | LDM | 0.5852 | 1.3440 | -0.7593 |

3.3. Modeling Biodiversity Through Synthetic Sequence Generation
Table 4 presents the biodiversity of synthetic COI data generated by seven models across four datasets. Ideally, generative models should generate novel data that reproduce the statistical and biological attributes of the real data, rather than merely replicating the training data.
To validate this, we utilized three metrics: JSD-kmer, Self-BLEU (Δ), and AA. For both JSD-kmer and Self-BLEU (Δ), values closer to zero indicate that the synthetic data accurately reflect the structural characteristics and diversity inherent in the real data. For AA, a value close to 0.5 signifies that the synthetic data maintain appropriate diversity and adhere to the structural distribution of the real data, without exhibiting excessive self-replication. The GRU-based ARLM and GRU-based VAE consistently exhibited low JSD-kmer and Self-BLEU (Δ) values near 0 across all datasets, effectively replicating 3-mer nucleotide patterns and sequence variation structures similar to the real data. Additionally, with AA scores ranging from about 0.53 to 0.76 (Table 4), the synthetic data generated by these models maintained intra-specific diversity while reducing data redundancy. Therefore, GRU-based models achieved a balanced reproduction of nucleotide composition diversity, sequence variation, and inter-cluster overlap, indicating preservation of both ecological and genetic diversity. In contrast, the N-gram, Transformer-based ARLM, and WGAN exhibited local pattern distributions (JSD-kmer) similar to the real data. However, their AA values ranged between 0.9 and 1.0, implying that the generated sequences inhabited a markedly different genetic space and failed to reproduce the variational structure of the real data. In other words, although these models captured nucleotide-level frequencies, they did not sufficiently reflect the fine-grained structural properties underlying natural sequences. The Conv-based VAE and LDM models exhibited substantially higher JSD-kmer values, demonstrating their inability to learn the local compositional patterns of real sequences.
Moreover, their AA value of 1 suggested that these models generated non-biological artifacts rather than biologically plausible sequences, further confirming their failure to reproduce the intrinsic characteristics of the real data. Figure 5 shows log-scaled scatter plots comparing the 3-mer frequencies between the real and synthetic datasets. From top to bottom, the results correspond to Cypraeidae, Drosophila, Bats, and Birds, and from left to right the results of the seven models are shown in order. When the points are uniformly distributed along the diagonal, it indicates that the model successfully reflected the 3-mer distributions of the real data, ranging from highly frequent patterns (upper-right region) to rare patterns (lower-left region). Overall, across all datasets, the N-gram, GRU-based ARLM, and GRU-based VAE models exhibited high correlations with R-squared values above 0.98, showing that their generated 3-mer distributions closely matched those of the real data. For the GRU-based ARLM and GRU-based VAE, points were distributed around the diagonal, indicating that these models reproduce the overall real sequence distribution rather than engaging in simple frequency replication. In contrast, since the N-gram model learns the 3-mer occurrence probabilities, its statistical distribution appeared highly similar to that of the real data. However, when considered alongside the results presented in Table 4 , it becomes evident that the N-gram model fails to reflect the biological characteristics of the real sequences and performs only a frequency-level replication. The Transformer-based ARLM and WGAN models exhibited high correlations with the real data in terms of overall 3-mer distributions (R-squared values ≈ 0.96–0.99). However, both models showed increased dispersion in the scatter plots when generating rare structures (lower-left region). 
For the Conv-based VAE, points were skewed toward the upper right, confirming poor performance in reproducing rare nucleotide patterns that are observed in the real sequences. The LDM exhibited the lowest R-squared values among the models, with the majority of points distributed below the diagonal. This indicates a distributional distortion where certain 3-mer patterns were excessively generated in the synthetic data. Furthermore, among the four datasets, the scatter plots for Cypraeidae exhibited the most uniform distributions along the diagonal, suggesting that diverse 3-mer structures were successfully reflected. In contrast, the scatter plots for Birds showed a stronger concentration of points in the upper-right region. These results indicate that as sequence length increases, the fidelity of generating rare patterns tends to decrease.

Table 4 Results of biological diversity evaluation metrics for synthetic COI data generated by six generative models across the four datasets.

| Datasets | Models | JSD-kmer | Self-BLEU (Δ) | AA |
|---|---|---|---|---|
| Cypraeidae | N-gram | 0.0175 | -0.0011 | 0.9993 |
| Cypraeidae | GRU-based ARLM | 0.0098 | -0.0001 | 0.6405 |
| Cypraeidae | Transformer-based ARLM | 0.0315 | -0.0000 | 0.9023 |
| Cypraeidae | Conv-based VAE | 0.0532 | -0.0001 | 0.9857 |
| Cypraeidae | GRU-based VAE | 0.0109 | -0.0001 | 0.6124 |
| Cypraeidae | WGAN | 0.0416 | -0.0002 | 0.9677 |
| Cypraeidae | LDM | 0.1899 | -0.0002 | 1.0000 |
| Drosophila | N-gram | 0.0301 | -0.0004 | 1.0000 |
| Drosophila | GRU-based ARLM | 0.0314 | 0.0001 | 0.6108 |
| Drosophila | Transformer-based ARLM | 0.0436 | 0.0001 | 0.9560 |
| Drosophila | Conv-based VAE | 0.1787 | -0.0001 | 1.0000 |
| Drosophila | GRU-based VAE | 0.0320 | -0.0001 | 0.5902 |
| Drosophila | WGAN | 0.0513 | -0.0001 | 0.9094 |
| Drosophila | LDM | 0.2836 | 0.0000 | 1.0000 |
| Bats | N-gram | 0.0935 | -0.0006 | 1.0000 |
| Bats | GRU-based ARLM | 0.0197 | 0.0000 | 0.5478 |
| Bats | Transformer-based ARLM | 0.0486 | 0.0000 | 0.9750 |
| Bats | Conv-based VAE | 0.1605 | -0.0002 | 1.0000 |
| Bats | GRU-based VAE | 0.0263 | -0.0012 | 0.5281 |
| Bats | WGAN | 0.0426 | 0.0000 | 0.9904 |
| Bats | LDM | 0.3236 | -0.0003 | 1.0000 |
| Birds | N-gram | 0.0204 | -0.0004 | 1.0000 |
| Birds | GRU-based ARLM | 0.0204 | -0.0001 | 0.7417 |
| Birds | Transformer-based ARLM | 0.0480 | -0.0002 | 0.9431 |
| Birds | Conv-based VAE | 0.2027 | -0.0003 | 1.0000 |
| Birds | GRU-based VAE | 0.0311 | -0.0010 | 0.7558 |
| Birds | WGAN | 0.0473 | -0.0002 | 0.9586 |
| Birds | LDM | 0.4699 | -0.0004 | 1.0000 |

3.4. Phylogenetic coherence between real data and synthetic sequences
Using alignments that combined real COI sequences (Real) with synthetic COI sequences generated by the GRU-based ARLM, we inferred maximum-likelihood (ML) phylogenetic trees in IQ-TREE 3. For each taxon, we defined the MRCA-defined real clade as the minimal clade subtended by the most recent common ancestor (MRCA) of that taxon’s real sequences, and calculated the in-clade rate as the proportion of synthetic sequences placed within the corresponding real clade. In addition, we computed the non-focal real-tip count (the number of real tips belonging to taxa other than the focal taxon within the MRCA-defined real clade) to jointly assess potential mixing during synthesis and the clarity of species boundaries in the real data. Across the four datasets (240 taxa in total), the synthetic in-clade rate had a synthetic-count–weighted mean of 0.793, and 81.7% of taxa had a non-focal real-tip count of 0. Within this subset, the weighted mean in-clade rate was 0.764. A non-focal real-tip count of 0 indicates that the MRCA-defined real clade is composed exclusively of real sequences from the focal taxon. In the bats dataset (51 taxa), 90.2% of taxa had a non-focal real-tip count of 0. The in-clade rate showed a weighted mean of 0.769 (count = 0 subset: 0.756) and a median of 0.8667 (IQR 0.6333–0.9500). Taxa with an in-clade rate = 1.0 accounted for 11.8%, and under the count = 0 condition, taxa with an in-clade rate < 0.5 accounted for 9.8%. For a subset of taxa, real_monophyly was FALSE or the non-focal real-tip count was large, indicating that other real sequences were included within the MRCA-defined real clade. Under the count = 0 condition, low in-clade rates were predominantly observed in cases with small n_real. In the birds dataset (54 taxa), 83.3% of taxa had a non-focal real-tip count of 0.
The in-clade rate had a weighted mean of 0.735 (count = 0 subset: 0.699) and a median of 0.9000 (IQR 0.5667–1.0000). Taxa with an in-clade rate = 1.0 accounted for 29.6%, whereas taxa with an in-clade rate = 0.0 accounted for 7.4%, and taxa with an in-clade rate < 0.5 accounted for 20.4%. Cases with an in-clade rate = 0.0 were observed in groups with small n_real. In the Cypraeidae dataset (119 taxa), 78.2% of taxa had a non-focal real-tip count of 0. The in-clade rate had a weighted mean of 0.813 (count = 0 subset: 0.781) and a median of 0.9000 (IQR 0.7000–1.0000). Taxa with an in-clade rate = 1.0 accounted for 28.6%, and no taxa showed an in-clade rate of 0.0. Under the count = 0 condition, taxa with an in-clade rate < 0.5 accounted for 10.1%. In this dataset, increases in non-focal real-tip count were observed for some taxa, and very low in-clade rates were observed under conditions with small n_real. In the Drosophila dataset (16 taxa), 75.0% of taxa had a non-focal real-tip count of 0. The in-clade rate showed a weighted mean of 0.919 (count = 0 subset: 0.903) and a median of 0.9667 (IQR 0.8917–1.0000). Taxa with an in-clade rate = 1.0 accounted for 37.5%, and no taxa had an in-clade rate < 0.5. Overall, dataset-level weighted mean in-clade rates ranged from 0.735 to 0.919, the proportion of taxa with non-focal real-tip count = 0 ranged from 75.0% to 90.2%, and the proportion of taxa with an in-clade rate = 1.0 ranged from 11.8% to 37.5% (Supplementary Data S1 and S2). Taxonomic composition and patterns of potential mixing were additionally examined using Krona-based interactive visualizations (Supplementary Data S3). 3.5. External phylogenetic validation with BOLD reference sequences For the Chiroptera external reference set downloaded from BOLD (Supplementary Data S4A), alignment of the BOLD sequences with the GRU-based ARLM synthetic sequences comprised 2,780 sequences and 2,933 nucleotide sites. 
The alignment contained 1,648 invariant sites (56.19%) and 898 parsimony-informative sites. IQ-TREE 3 analyses were performed with ModelFinder-based model selection and maximum-likelihood estimation (MLE); the best-fitting model under the Bayesian Information Criterion (BIC) was GTR+F+R6. The final ML tree had a log-likelihood of −79,788.8898 (s.e. 3,752.4036) and a total sum of branch lengths of 43.9630 (Supplementary Data S5A). For the Drosophilidae BOLD reference set (Supplementary Data S4B), the combined alignment comprised 3,672 sequences and 2,316 nucleotide sites, with 1,388 invariant sites (59.93%) and 691 parsimony-informative sites. Using the same IQ-TREE 3 workflow, the best-fitting model under BIC was GTR+F+I+R9. The final ML tree had a log-likelihood of −101,183.1110 (s.e. 5,706.3479) and a total sum of branch lengths of 48.1559 (Supplementary Data S5B). Taxonomic composition and patterns of potential mixing were additionally examined using Krona-based interactive visualizations (Supplementary Data S6).
4. Discussion
We performed species-specific COI sequence generation using generative models to address data scarcity and class imbalance in species-level COI datasets. To explore generative models optimized for COI data, we constructed six deep generative models with distinct architectures (GRU-based ARLM, Transformer-based ARLM, Conv-based VAE, GRU-based VAE, WGAN, and LDM) and conducted a comparative evaluation across four datasets (Cypraeidae, Drosophila, Bats, and Birds). These datasets differ in sequence length, number of species, and sequence complexity, enabling assessment of the models’ biological generalization capabilities. We evaluated the generated synthetic data focusing on biological plausibility, intra- and inter-species discriminability, and biological diversity. Across datasets, GRU-based models effectively reproduced biological characteristics and structural properties of COI sequences.
Based on position-wise Shannon entropy and PCA-UMAP analyses, the GRU-based ARLM and GRU-based VAE generated synthetic data that reflected sequence patterns of conserved and variable regions consistent with real data. These models also preserved intra-specific clustering and inter-specific boundaries. Evaluations using JSD-kmer, Self-BLEU, and AA metrics confirmed that synthetic sequences maintained intra-species diversity similar to real data while capturing nucleotide patterns and sequence variation structures.

For the Transformer-based ARLM, synthetic sequences reflected overall k-mer distributions and Shannon entropy patterns similar to real data but showed limited inter-species discriminability and failed to capture key characteristics of the real data. This suggests that although the Transformer architecture learns global patterns effectively, it is less effective at capturing codon-level constraints and subtle intra-specific variations of COI sequences.

In contrast, convolution-based models, including Conv-based VAE, WGAN, and LDM, showed limited capability in reproducing rare patterns and codon-level structures. Conv-based VAE and LDM exhibited high entropy, k-mer JSD, and AA values close to 1, indicating generation of unnatural sequences that do not reflect real data characteristics. These results suggest that, for COI sequence modeling, capturing nucleotide order and sequential dependency patterns matters more than learning local patterns.

Finally, COI sequence generation performance ranked as follows: GRU-based ARLM > GRU-based VAE > Transformer-based ARLM ≈ WGAN > Conv-based VAE > LDM. Models such as WGAN, Conv-based VAE, and LDM, previously effective for SMILES, microbiome, haplotype, genotype, and promoter data, showed poor performance for COI sequence generation, highlighting the importance of domain-specific model design.
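As an illustration of two of the metrics discussed above, the following minimal sketch computes position-wise Shannon entropy over aligned sequences and the Jensen-Shannon divergence (JSD) between k-mer distributions of a real and a synthetic set. It is not the authors' implementation: the function names, the choice of base-2 logarithms, k = 3, the toy sequences, and the decision to skip any character other than A, C, G, or T are all assumptions made for this example.

```python
import math
from collections import Counter

BASES = set("ACGT")  # assumption: gaps, N, and ambiguity codes are ignored


def positionwise_entropy(seqs):
    """Shannon entropy in bits for each column of equal-length aligned sequences."""
    entropies = []
    for i in range(len(seqs[0])):
        counts = Counter(s[i] for s in seqs if s[i] in BASES)
        total = sum(counts.values())
        # H = sum_b p_b * log2(1 / p_b); an empty column contributes 0.0
        h = sum((c / total) * math.log2(total / c) for c in counts.values())
        entropies.append(h)
    return entropies


def kmer_distribution(seqs, k):
    """Relative frequency of every k-mer made only of A/C/G/T."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            kmer = s[i:i + k]
            if set(kmer) <= BASES:
                counts[kmer] += 1
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}


def jensen_shannon_divergence(p, q):
    """JSD with base-2 logs, so 0 means identical and 1 means disjoint support."""
    jsd = 0.0
    for key in set(p) | set(q):
        pi, qi = p.get(key, 0.0), q.get(key, 0.0)
        mi = 0.5 * (pi + qi)  # mixture distribution
        if pi > 0:
            jsd += 0.5 * pi * math.log2(pi / mi)
        if qi > 0:
            jsd += 0.5 * qi * math.log2(qi / mi)
    return jsd


# Hypothetical toy data standing in for one species' real and synthetic sequences.
real = ["ACGTAC", "ACGTAC", "ACGAAC"]
synthetic = ["ACGTAC", "ACGTTC", "ACGAAC"]

# Per-position absolute entropy difference between the two sets.
entropy_gap = [
    abs(r - s)
    for r, s in zip(positionwise_entropy(real), positionwise_entropy(synthetic))
]
# Divergence between the 3-mer distributions of the two sets.
jsd_3mer = jensen_shannon_divergence(
    kmer_distribution(real, 3), kmer_distribution(synthetic, 3)
)
print(entropy_gap, jsd_3mer)
```

With base-2 logarithms the JSD is bounded by 0 and 1, which is consistent with reading values "close to 1" as a near-complete mismatch between real and synthetic k-mer usage; a different log base would change that upper bound.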
Comparisons with the statistical N-gram method confirmed that deep generative models can generate diverse synthetic sequences reflecting intra-species variation and evolutionary constraints. Alignment and phylogenetic tree estimation using external reference sequences from the BOLD COI-5P database showed that generated sequences exhibited phylogenetic alignment patterns consistent with real genetic information. In conclusion, GRU-based generative models most robustly reproduced evolutionary diversity and biological characteristics of COI sequences.

5. Conclusion

DNA barcoding sequences are widely used as biomarkers for species identification and biodiversity research. Recently, machine learning– and deep learning–based approaches have been increasingly applied to COI data analysis. However, COI data present inherent limitations, characterized by high intra-specific similarity and significant data imbalances across species. In addition, public COI databases often contain extremely limited numbers of sequences per species or high levels of sequence redundancy, making reliable analysis challenging. To address the scarcity of usable data, synthetic data generation methods have been introduced. Traditional methods primarily focused on statistical properties such as nucleotide frequency. However, these approaches have limitations in reflecting the biological characteristics of COI sequences. Generative model–based approaches have recently been introduced to generate high-quality synthetic data that effectively capture these biological characteristics. By comparatively analyzing species-specific COI sequence generation performance across generative models with diverse architectures, we confirmed that GRU-based generative models effectively reproduce both the evolutionary diversity and biological characteristics of COI sequences.
In particular, the GRU-based ARLM and GRU-based VAE generated diverse synthetic sequences with minimized redundancy while preserving intra-species diversity, inter-species discriminative structures, and codon-level patterns of real COI data. These findings indicate that such models can serve as a novel simulation approach to mitigate data scarcity, overcoming the limitations of existing statistical simulation methods. Moreover, the proposed method can be applied to various biological fields where acquiring real data is challenging, including eDNA-based biodiversity prediction and the simulation of rare or endangered species. Future research aims to improve the accuracy and biological fidelity of synthetic COI data by incorporating a broader range of taxa and strengthening biologically informed constraints in generative model design.

Declarations

Acknowledgements
Not applicable.

Funding
This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education (RS-2021-NR060121). This research was also supported by a Korea Basic Science Institute (National Research Facilities and Equipment Center) grant funded by the Ministry of Education (RS-2022-NF000922).

Author Contributions
CM and DS contributed equally to this work. CM and DS developed the methodology, conceived the experiments, and wrote the manuscript. JJ, CH, HS, HL, and KL performed data curation and reviewed the manuscript. JP and HH validated the results from a biological perspective. YL supervised this work. All authors discussed the results and contributed to the final version of the manuscript.

Data availability
The datasets analysed during the current study are publicly available from the SupBarcodes repository (http://dmb.iasi.cnr.it/supbarcodes.php). The processed datasets generated during the current study are available from the corresponding author on reasonable request.

Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.

Competing interests
The authors declare no competing interests.

References

1. Hebert, P. D., Cywinska, A., Ball, S. L. & deWaard, J. R. Biological identifications through DNA barcodes. Proc. Biol. Sci. 270 (1512), 313–321 (2003).
2. Hajibabaei, M., Singer, G. A., Hebert, P. D. & Hickey, D. A. DNA barcoding: how it complements taxonomy, molecular phylogenetics and population genetics. Trends Genet. 23 (4), 167–172 (2007).
3. Siti-Azizah, M. N. DNA Barcoding: the Molecular Detective. 3rd Syiah Kuala University Annual International Conference (Syiah Kuala University, 2013).
4. Hebert, P. D., Stoeckle, M. Y., Zemlak, T. S. & Francis, C. M. Identification of Birds through DNA Barcodes. PLoS Biol. 2 (10), e312 (2004).
5. Ward, R. D., Zemlak, T. S., Innes, B. H., Last, P. R. & Hebert, P. D. DNA barcoding Australia's fish species. Philos. Trans. R. Soc. Lond. B Biol. Sci. 360 (1462), 1847–1857 (2005).
6. Yang, C-H., Wu, K-C., Chuang, L-Y. & Chang, H-W. DeepBarcoding: deep learning for species classification using DNA barcoding. IEEE/ACM Trans. Comput. Biol. Bioinf. 19 (4), 2158–2165 (2021).
7. Arias, P. M. et al. BarcodeBERT: Transformers for biodiversity analysis. arXiv preprint arXiv:2311.02401 (2023).
8. Taccaliti, E. & Aguilar-Ruiz, J. S. Improving classification on imbalanced genomic data via KDE-based synthetic sampling. BioData Min. 18 (1), 60 (2025).
9. Gómez-Martínez, V., Chushig-Muzo, D., Veierød, M. B., Granja, C. & Soguero-Ruiz, C. Ensemble feature selection and tabular data augmentation with generative adversarial networks to enhance cutaneous melanoma identification and interpretability. BioData Min. 17 (1), 46 (2024).
10. Zhu, Y. et al. Generative AI for controllable protein sequence design: A survey. arXiv preprint arXiv:2402.10516 (2024).
11. Xia, J., Zhou, J., Chen, S., Ling, T. & Li, S. Z. A comprehensive and systematic review for deep learning-based de novo peptide sequencing. Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence (2025).
12. Lan, L. et al. Generative adversarial networks and its applications in biomedical informatics. Front. Public Health 8, 164 (2020).
13. Alvi, R. et al. Generative Artificial Intelligence in Bioinformatics: A Systematic Review of Models, Applications, and Methodological Advances. arXiv preprint arXiv:2511.03354 (2025).
14. Rong, R. et al. MB-GAN: Microbiome Simulation via Generative Adversarial Network. Gigascience 10 (2), giab005 (2021).
15. Yelmen, B. et al. Deep convolutional and conditional neural networks for large-scale genomic data generation. PLoS Comput. Biol. 19 (10), e1011584 (2023).
16. Xie, S. et al. Deep Generative Models for Discrete Genotype Simulation. bioRxiv 2025.08.08.669289 (2025).
17. Li, Z. et al. Discdiff: Latent diffusion model for DNA sequence generation. arXiv preprint arXiv:2402.06079 (2024).
18. Xu, L. et al. A generative deep neural network for pan-digestive tract cancer survival analysis. BioData Min. 18 (1), 9 (2025).
19. Zhou, J. et al. Novobench: Benchmarking deep learning-based De Novo sequencing methods in proteomics. Adv. Neural. Inf. Process. Syst. 37, 104776–104791 (2024).
20. Katoh, K., Misawa, K., Kuma, K. & Miyata, T. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30 (14), 3059–3066 (2002).
21. Brown, P. F., Della Pietra, V. J., Desouza, P. V., Lai, J. C. & Mercer, R. L. Class-based n-gram models of natural language. Comput. Linguistics 18 (4), 467–480 (1992).
22. Chen, X., Mishra, N., Rohaninejad, M. & Abbeel, P. Pixelsnail: An improved autoregressive generative model. International Conference on Machine Learning (PMLR, 2018).
23. Dou, L. et al. Unisar: A unified structure-aware autoregressive language model for text-to-sql. arXiv preprint arXiv:2203.07781 (2022).
24. Gómez-Bombarelli, R. et al. Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci. 4 (2), 268–276 (2018).
25. Rombach, R., Blattmann, A., Lorenz, D., Esser, P. & Ommer, B. High-resolution image synthesis with latent diffusion models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2022).
26. McInnes, L., Healy, J. & Melville, J. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018).
27. Katoh, K. & Standley, D. M. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 30 (4), 772–780 (2013).
28. Wong, T. et al. IQ-TREE 3: Phylogenomic Inference Software using Complex Evolutionary Models (2025).
29. Ratnasingham, S. & Hebert, P. D. BOLD: The Barcode of Life Data System. Mol. Ecol. Notes 7 (3), 355–364 (2007). http://www.barcodinglife.org
30. Letunic, I. & Bork, P. Interactive Tree of Life (iTOL) v6: recent updates to the phylogenetic tree display and annotation tool. Nucleic Acids Res. 52 (W1), W78–W82 (2024).
31. Ondov, B. D., Bergman, N. H. & Phillippy, A. M. Interactive metagenomic visualization in a Web browser. BMC Bioinform. 12, 385 (2011).

Additional Declarations
No competing interests reported.

Supplementary Files
SupplementaryDataS1iqtree3qualitysummary.xlsx
S2.PhylogeneticcoherenceiqtreeKrona.zip

Status: Under Review (Version 1)
Reviewers agreed at journal: 11 May, 2026
Reviewers agreed at journal: 10 May, 2026
Reviewers invited by journal: 06 May, 2026
Editor assigned by journal: 06 Mar, 2026
Editor invited by journal: 05 Mar, 2026
Submission checks completed at journal: 03 Mar, 2026
First submitted to journal: 03 Mar, 2026
Authors: Cho-I Moon, Dae Kwon Song, Jie Eun Park, Jun Yang Jeong, Chan Eui Hong, Hyeon Jun Shin, Hyeok Lee, Kyoung Won Lee, Hee-ju Hwang, and Yong Seok Lee (corresponding author). All authors are affiliated with Soonchunhyang University.

Figure 1. COI barcoding data preprocessing pipeline. To improve the biological and structural characteristics of the synthetic data, as well as accuracy and precision, we construct a final dataset by applying the proposed preprocessing pipeline to four publicly available COI datasets.

Figure 2. Architectures of the six generative models. We compare and analyze the performance of models for the generation of COI data, including (a) a GRU-based ARLM, (b) a Transformer-based ARLM, (c) a Conv-based VAE, (d) a GRU-based VAE, (e) a WGAN, and (f) an LDM.

Figure 3. Position-wise Shannon entropy profiles of real and synthetic COI data across four datasets: (a) Cypraeidae, (b) Drosophila, (c) Bats, (d) Birds. The graph on the left shows the distributions of position-wise absolute entropy, whereas the graph on the right shows the differences in entropy between the real and synthetic datasets.

Figure 4. PCA-UMAP results of real and synthetic COI data across four datasets: (a) Cypraeidae, (b) Drosophila, (c) Bats, (d) Birds. We employ PCA-UMAP graphs to assess whether synthetic data clusters correspond to those of the real data and to evaluate the preservation of species boundaries.

Figure 5. Scatter plots of k-mer frequencies for real and synthetic COI data across four datasets: (a) Cypraeidae, (b) Drosophila, (c) Bats, (d) Birds. We evaluate the similarity of local sequence patterns between real and synthetic data using log-scale scatter plots. Data points positioned closer to the diagonal line indicate higher similarity between the two distributions.
07:21:09","extension":"xlsx","order_by":0,"title":"","display":"","copyAsset":false,"role":"supplement","size":22095,"visible":true,"origin":"","legend":"","description":"","filename":"SupplementaryDataS1iqtree3qualitysummary.xlsx","url":"https://assets-eu.researchsquare.com/files/rs-8982107/v1/e388fab77d0a6d5a5bb9f6d6.xlsx"},{"id":109296601,"identity":"4b03903f-ecad-422f-a3e2-54b8426c69c5","added_by":"auto","created_at":"2026-05-15 08:48:28","extension":"zip","order_by":1,"title":"","display":"","copyAsset":false,"role":"supplement","size":748562,"visible":true,"origin":"","legend":"","description":"","filename":"S2.PhylogeneticcoherenceiqtreeKrona.zip","url":"https://assets-eu.researchsquare.com/files/rs-8982107/v1/ebf9067a96ba4efce8cd058d.zip"}],"financialInterests":"No competing interests reported.","formattedTitle":"Benchmarking Generative Models for COI DNA Barcoding","fulltext":[{"header":"1. Introduction","content":"\u003cp\u003eIn diverse animal taxa, species identification based solely on traditional morphology has limitations due to morphological variation, trait changes across life-history stages, and physical damage to diagnostic characters during distribution and processing [\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e, \u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e]. This issue is particularly pronounced in marine invertebrates frequently traded and processed; their external features are often unavailable or obscured, making species-level identification difficult and creating downstream challenges for fisheries resource management and food safety assurance [\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e]. 
Such constraints of morphology-based classification are not restricted to a single taxonomic group and are repeatedly observed across situations in which morphological information from specimens is limited [\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e]. To address these limitations, DNA barcoding based on the mitochondrial cytochrome c oxidase subunit I (COI) gene has been widely adopted as standardized tool for species identification. DNA barcoding is molecular method that distinguishes interspecific genetic variation using short standardized DNA sequence, enabling broad taxonomic coverage and relatively rapid, consistent species identification even when morphological information is limited [\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e]. Accordingly, COI-based DNA barcoding has been validated across wide range of animal groups\u0026mdash;including birds, fishes, mollusks, and arthropods\u0026mdash;and has been broadly applied in species identification and biodiversity research [\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e, \u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eRecently, species classification methods based on COI sequences have expanded beyond traditional approaches, such as tree-based, similarity-based, and feature-based classifications, to include machine learning and deep learning methodologies [\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e, \u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e]. DeepBarcoding proposed a deep neural network composed of convolutional layers and evaluated its classification performance on six types of COI datasets, demonstrating superior performance compared with five conventional machine learning classifiers [\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e]. 
BarcodeBERT was introduced as a transformer-based foundation model for species classification across large-scale and diverse DNA barcoding datasets, achieving excellent classification performance [\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e]. To date, deep learning\u0026ndash;based analyses of DNA barcoding data have focused on classification tasks. However, DNA barcoding datasets used in existing studies suffer from severe class imbalance due to the difficulties associated with real-world data acquisition. While some species are represented by hundreds of samples, others have only one or two samples, resulting in a pronounced long-tail distribution. Such imbalance is a major factor contributing to overfitting and degraded generalization performance in deep learning models. Moreover, COI sequences exhibit limited intra-species diversity and narrow boundaries between related species, which further increase the risk of misclassification. Therefore, there is a need for research on synthetic data generation techniques that can expand species-specific data distributions while preserving the biological characteristics and variation structures of real COI sequences.\u003c/p\u003e \u003cp\u003eDeep learning technologies have been widely applied not only to species classification but also to various fields of biological data analysis and design [\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e, \u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e]. In addition, with the introduction of various generative deep learning models, it has become possible to explore vast sequence spaces and directly generate novel sequences that do not exist in current databases (de novo design) [\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e, \u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e]. 
Generative modeling approaches have emerged as a new paradigm for genomic data simulation in bioinformatics, as well as for applications drug discovery and protein design [\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e, \u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e]. In the field of biological data generation, MB-GAN was proposed as a simulation approach for microbiome data using generative adversarial networks (GANs) [\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e]. In addition, high-quality artificial genomes have been generated from haplotype data using Wasserstein GANs (WGANs) [\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e]. Other studies have explored synthetic data generation and performance evaluation for genotype data incorporating SNP variation using WGANs, variational autoencoders (VAEs), and diffusion models [\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e]. Beyond numerical and discrete-valued data, DiscDiff was proposed as a generative model for promoter sequences composed of nucleotide characters. DiscDiff employed a latent diffusion model to generate discrete DNA sequences and subsequently refined the generated sequences using an Absorb\u0026ndash;Escape post-processing algorithm [\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e]. In addition, a study proposed a graph-based deep generative neural network with a generative\u0026ndash;adversarial module to denoise high-dimensional data and extract biologically meaningful latent features, enabling accurate classification of molecular subtypes of gastrointestinal cancers [\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e]. Despite extensive research on generative models across diverse types of biological data, studies targeting COI DNA barcoding sequences have not yet been conducted. 
Existing studies have utilized diverse datasets, evaluation metrics, and model architectures, thereby complicating direct performance comparisons and limiting rigorous validation of the effectiveness of recent models relative to traditional methods [\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eFurthermore, COI sequences contain essential evolutionary constraints, including codon-level translational structures and nucleotide sequence continuity. Therefore, a comparative analysis is required to determine which architectural approach\u0026mdash; local pattern learning approaches based on convolutional neural networks (CNNs) or sequential pattern learning approaches based on recurrent neural networks (RNNs)\u0026mdash;is more effective in generating synthetic COI sequences that possess both biological validity and diversity. Accordingly, we conducted a benchmarking study to explore generative deep architectures optimized for COI barcode sequence generation, with the aim of overcoming data scarcity and class imbalance problems. We selected four generative architectures that encompass distinct generation mechanisms: autoregressive networks that learn sequential probabilistic dependencies, VAEs and GANs that approximate data distributions in latent space, and the recent Latent Diffusion Model, which generates high-quality data through iterative denoising processes. 
We constructed a total of six generative models incorporating gated recurrent unit (GRU) layers, Transformer blocks, and convolutional layers, and conducted comparative analyses against a traditional statistical baseline, the N-gram probabilistic model.\u003c/p\u003e \u003cp\u003eThe main contributions of this study are as follows:\u003c/p\u003e \u003cp\u003e \u003col\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003eThrough a comparative analysis of six generative models with distinct structural characteristics, we propose an optimal model capable of generating high-quality data while preserving the biological characteristics of COI sequences.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003eWe propose a multidimensional evaluation framework for generative models that assesses biological plausibility, intra- and inter-species structural preservation, and the diversity of the generated sequences.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003eUsing COI sequences of vertebrates and invertebrates collected from the BOLD (Barcode of Life Data Systems) database, we validate that the synthetic sequences generated in this study are placed in phylogenetically consistent positions relative to real sequences.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003c/ol\u003e \u003c/p\u003e"},{"header":"2. Methods","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e\n \u003ch2\u003e2.1 Data acquisition and preprocessing\u003c/h2\u003e\n \u003cp\u003eWe aim to generate new species-specific COI gene sequences using deep generative models. The COI gene, a protein-coding segment of mitochondrial DNA (mtDNA), is characterized by inherent biological, structural, and evolutionary constraints. Its sequences comprise both conserved and variable regions. The conserved regions are located at the N-terminus and C-terminus, exhibiting high conservation across diverse taxa. 
These conserved amino acid sequences facilitate the design of primers. Conversely, the central region of the COI gene encompasses variable segments with elevated mutation rates. These highly variable regions have been employed for species identification, population differentiation, and the assessment of evolutionary relationships.\u003c/p\u003e\n \u003cp\u003eWe developed a preprocessing pipeline to ensure the functional, evolutionary, and molecular biological constraints associated with the COI gene at the data level. We utilized publicly accessible DNA barcode sequence datasets (Cypraeidae, Drosophila, Birds, Bats) obtained from \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttp://dmb.iasi.cnr.it/supbarcodes.php\u003c/span\u003e\u003c/span\u003e [\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e]. These open-access datasets have been utilized for species classification tasks. Notably, these datasets exhibit high phylogenetic diversity alongside minimal sequence divergence among species, thereby presenting significant challenges for accurate identification. The final training and testing datasets were constructed through preprocessing steps. The preprocessing procedure is detailed as follows (Fig. \u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e).\u003c/p\u003e\u003cspan\u003e\n \u003cp\u003e\u003cstrong\u003e1. Character Normalization\u003c/strong\u003e: We performed character normalization by converting IUPAC ambiguous nucleotide codes (Y, R, K, S, M, W, D, B, H, V) as well as gap characters introduced during sequence alignment (i.e., \u0026apos;-\u0026apos; and \u0026apos;_\u0026apos;) into the character \u0026lsquo;N\u0026rsquo;. This step was employed to minimize uncertainty within the sequences and to prevent the model from learning erroneous pattern information.\u003c/p\u003e\n \u003c/span\u003e \u003cspan\u003e\n \u003cp\u003e\u003cstrong\u003e2. 
Verification of Minimum Barcode Sequence Length (\u0026lt;\u0026thinsp;300bp Filter)\u003c/strong\u003e: COI barcode sequences typically span between 600 and 658 base pairs. Both conserved and variable regions must be retained within the overall sequence length to generate reliable barcode sequences. Therefore, sequences shorter than 300 base pairs, which may lack critical barcoding information, were removed.\u003c/p\u003e\n \u003c/span\u003e \u003cspan\u003e\n \u003cp\u003e\u003cstrong\u003e3. Verification of the Number of Data per Species (\u0026le;\u0026thinsp;3 Species Data Filter)\u003c/strong\u003e: To generate new barcode data that reflects the characteristics of each species, it is important to train the model on a sufficiently diverse dataset within each species. Training with only one or two sequences per species results in generated outputs that closely replicate the training data. Therefore, we required a minimum of three sequences per species.\u003c/p\u003e\n \u003c/span\u003e \u003cspan\u003e\n \u003cp\u003e\u003cstrong\u003e4. Orientation Correction based on Open Reading Frame (ORF)\u003c/strong\u003e: Given that COI sequences represent protein-coding genes, the presence of internal stop codons and frameshift mutations is biologically implausible. To evaluate sequence orientation, each sequence was analyzed in all three reading frames on both the forward and reverse-complement strands. Following translation in each reading frame and strand, we selected the orientation with the fewest internal stop codons and the longest ORF. This step normalized all COI sequences to a uniform orientation.\u003c/p\u003e\n \u003c/span\u003e \u003cspan\u003e\n \u003cp\u003e\u003cstrong\u003e5. Quality Control\u003c/strong\u003e: To improve data quality, validation of COI sequences was performed based on established criteria. 
Sequences with one or more internal stop codons, an ambiguous nucleotide (N) ratio of 5% or higher, or a translated protein length of 100 amino acids or fewer, as well as redundant entries, were excluded from the dataset.\u003c/p\u003e\n \u003c/span\u003e \u003cspan\u003e\n \u003cp\u003e\u003cstrong\u003e6. Application of Multiple Alignment using Fast Fourier Transform (MAFFT) Algorithms\u003c/strong\u003e: The MAFFT software was used to align the unaligned and heterogeneous sequence data [\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e]. Following multiple sequence alignment, the location of the ATA/ATT start codon was identified to determine the starting point of the barcode region. From this defined location, the core barcode segment, approximately 600\u0026ndash;658 base pairs in length, was consistently extracted. This process mitigated length discrepancies between full-length COI sequences and partial barcode sequences, ensuring the alignment of conserved, variable, and terminal regions in a structurally coherent manner. Consequently, generative models were able to reliably capture biologically meaningful patterns. This approach enhanced the biological validity of the COI sequences, facilitating the accurate representation of species-specific patterns in COI barcode sequences throughout the training process.\u003c/p\u003e\n \u003c/span\u003e \u003cspan\u003e\n \u003cp\u003e\u003cstrong\u003e7. Data Rebalancing\u003c/strong\u003e: Because the preceding steps could change the number of sequences per species, the per-species minimum filter (Step 3) was repeated to select the final data.\u003c/p\u003e\n \u003c/span\u003e\n \u003cp\u003eUltimately, we utilized preprocessed data from invertebrate taxa (Cypraeidae and Drosophila) and vertebrate taxa (Birds and Bats). Detailed information regarding the four datasets is presented in Table \u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e. 
The entire dataset was split 8:2 into training and test sets to perform model training and validation of species-specific COI sequence generation. For the synthetic data, sequences were generated to correspond in length to those within each dataset.\u003c/p\u003e\n \u003cdiv class=\"gridtable\"\u003e\u0026nbsp;\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e\n \u003ccaption language=\"En\"\u003e\n \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e\n \u003cdiv class=\"CaptionContent\"\u003e\n \u003cp\u003eDetails of the barcode datasets. We constructed four final COI datasets through a preprocessing pipeline. The data were divided into training and test sets for model validation.\u003c/p\u003e\n \u003c/div\u003e\n \u003c/caption\u003e\n \u003ccolgroup cols=\"5\"\u003e\u003c/colgroup\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\" colname=\"c1\"\u003e\n \u003cp\u003eType\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\" colname=\"c2\"\u003e\n \u003cp\u003eDatasets\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\" colname=\"c3\"\u003e\n \u003cp\u003eNum. of Sequences\u003c/p\u003e\n \u003cp\u003e(Train / Test)\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\" colname=\"c4\"\u003e\n \u003cp\u003eLength\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\" colname=\"c5\"\u003e\n \u003cp\u003eNum. 
of Species\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e\n \u003cp\u003eInvertebrates\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\" colname=\"c2\"\u003e\n \u003cp\u003eCypraeidae\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\" colname=\"c3\"\u003e\n \u003cp\u003e994/249\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\n \u003cp\u003e614\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\n \u003cp\u003e119\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\" colname=\"c2\"\u003e\n \u003cp\u003eDrosophila\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\" colname=\"c3\"\u003e\n \u003cp\u003e336/85\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\n \u003cp\u003e663\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\n \u003cp\u003e16\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e\n \u003cp\u003eVertebrates\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\" colname=\"c2\"\u003e\n \u003cp\u003eBats\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\" colname=\"c3\"\u003e\n \u003cp\u003e324/81\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\n \u003cp\u003e659\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\n \u003cp\u003e54\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\" colname=\"c2\"\u003e\n \u003cp\u003eBirds\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\" colname=\"c3\"\u003e\n \u003cp\u003e249/63\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\" char=\".\" 
colname=\"c4\"\u003e\n \u003cp\u003e690\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\n \u003cp\u003e54\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n \u003c/div\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec4\" class=\"Section2\"\u003e\n \u003ch2\u003e2.2 Conditional Deep Generative Models\u003c/h2\u003e\n \u003cp\u003eWe compared and analyzed six generative models to explore the optimal generative deep learning model for simulating COI barcoding sequences. We constructed generative deep learning models designed to address two questions. Firstly, generative models can be generally categorized into RNN and CNN architectures. RNNs are proficient in capturing sequential information and patterns, whereas CNNs are more adept at recognizing localized features and patterns. Therefore, it is necessary to investigate which model architecture performs optimally for the characteristics of the data. Generative models were developed using GRU layers and convolutional layers, followed by a comparative evaluation of their performance. Secondly, previous studies on biological data generation have not demonstrated the application of the proposed models to diverse data types. Therefore, we analyzed the performance of applying the basic structure of the models employed in previous studies to COI sequence data with distinct biological characteristics. Considering these two factors, we constructed an autoregressive language model (ARLM), a variational autoencoder model (VAE), generative adversarial networks (GAN), and a latent diffusion model (LDM) (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e). 
Additionally, we employed an N-gram probabilistic model as a baseline for comparison against the deep learning models.\u003c/p\u003e\n \u003cp\u003eFor models based on GRU layers, a start token was prepended and an end token was appended to each sequence, after which the sequence characters were encoded as integer labels (e.g., [START, END, A, C, G, T, N] = [0, 1, 2, 3, 4, 5, 6]). This encoding was adopted to preserve sequential and positional information by defining the beginning and end of each input sequence. In contrast, convolution-based models took one-hot encoded representations of the nucleotide characters (A, C, G, T, N) as input, which were processed as one-dimensional feature patterns.\u003c/p\u003e\n \u003cp\u003eThe N-gram probabilistic model is a classical statistical approach that generates sequences based on the probabilities of local nucleotide patterns observed within species-level data [\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e]. It accumulates the frequencies of the next nucleotide conditioned on a context of length n-1. In other words, it estimates the probability distribution of the next nucleotide based on the frequency statistics of consecutive n-length subsequences (n-grams). While the N-gram method offers advantages such as simplicity, intuitive interpretation of species specificity, and rapid generation speed, it struggles to capture long-term dependencies and is limited in generating novel sequences that reflect species-level characteristics. To preserve the reading frame in the COI data, we set n\u0026thinsp;=\u0026thinsp;3.\u003c/p\u003e\n \u003cp\u003eThe ARLM is a sequential generative approach that predicts the next token conditioned on the tokens up to the previous time step [\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e]. 
Given the COI barcode sequence, the model predicts the subsequent token by comparing the input tokens up to position t with the target tokens up to position t\u0026thinsp;+\u0026thinsp;1. ARLMs are characterized by their simple architecture, training stability, and robust performance even with small datasets [\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e]. We constructed two conditional ARLMs (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e(a): GRU-based ARLM and Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e(b): Transformer-based ARLM). In the GRU-based ARLM, inputs that combine embedded nucleotide sequences and species embeddings are processed through three GRU layers (hidden dimension\u0026thinsp;=\u0026thinsp;256, dropout\u0026thinsp;=\u0026thinsp;0.3), followed by a fully connected layer to predict the probability distribution of the next nucleotide token. The GRU mechanism accumulates sequential dependencies, allowing the model to capture local patterns within the nucleotide sequences. For the Transformer-based ARLM, inputs formed by combining nucleotide token, positional, and species embeddings are processed through three Transformer decoder layers employing a pre-layer normalization structure (heads\u0026thinsp;=\u0026thinsp;4, feedforward dimension\u0026thinsp;=\u0026thinsp;512, dropout\u0026thinsp;=\u0026thinsp;0.3), followed by a fully connected layer to predict the probability distribution of the next nucleotide token. The Transformer model is capable of capturing long-range dependencies within the sequences through its self-attention mechanism.\u003c/p\u003e\n \u003cp\u003eThe VAE is a probabilistic generative model that approximates a multivariate normal distribution to represent the intrinsic characteristics of the training data. 
It consists of an encoder, which infers from the input the mean and variance (\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\mu,{\\sigma}^{2}\\)\u003c/span\u003e\u003c/span\u003e) that parameterize the latent vector distribution, and a decoder that generates new data from these latent vectors. By incorporating species embeddings during training, the model establishes a latent distribution conditioned on the species. Consequently, random sampling from this distribution facilitates the generation of diverse intraspecific data. We conducted a comparative analysis of two VAE model architectures in this study. First, the Chemical VAE was proposed as a model that represents discrete molecular structures in a continuous latent space and derives optimal molecular structures through optimization in that continuous representation [\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e]. Adopting the fundamental structure of the Chemical VAE, we constructed the Conv-based VAE with an encoder composed of three 1D convolutional layers (kernel sizes = [9, 9, 11]) and a decoder composed of three GRU layers (hidden dimension\u0026thinsp;=\u0026thinsp;256, dropout\u0026thinsp;=\u0026thinsp;0.3) (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e(c), Conv-based VAE). The one-hot encoded nucleotide sequences were encoded, concatenated with species embeddings, and used to sample latent variables z. The decoder then generated sequences from latent samples conditioned on species embeddings. 
The GRU-based VAE used GRU layers for both the encoder and the decoder, each comprising three layers (hidden dimension\u0026thinsp;=\u0026thinsp;256, dropout\u0026thinsp;=\u0026thinsp;0.3). The embedded nucleotide sequences were concatenated with species embeddings to form the latent representation, which was then used to generate nucleotide sequences (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e(d), GRU-based VAE). While convolutional architectures capture local patterns in sequence data, GRU layers are effective in preserving sequential order and modeling long-range dependencies.\u003c/p\u003e\n \u003cp\u003eThe Wasserstein GAN is a probabilistic generative model that learns to approximate and minimize the Wasserstein (Earth-Mover) distance between the real and generated data distributions through adversarial training between a generator and a critic. To improve training stability and mitigate mode collapse, we applied a gradient penalty term (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e(e), WGAN). The generator is conditioned on species labels and random noise to synthesize barcode sequences, while the critic evaluates real and generated sequences under the same species condition to calculate a Wasserstein score. Adversarial training allows the model to directly approximate the underlying COI distribution. Following previous methods on SNP generation using WGAN-GP [\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e], we constructed a 1D convolutional generator\u0026ndash;critic architecture. The generator concatenates noise and species embeddings, processes the input through a fully connected layer, and passes it through two convolutional blocks (Conv1D, BatchNorm, LeakyReLU(\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\alpha=0.01\\)\u003c/span\u003e\u003c/span\u003e)) to output nucleotide probabilities. 
The critic receives one-hot encoded sequences concatenated with species embeddings along the sequence dimension, processes them via three convolutional blocks, and projects the result to a scalar Wasserstein score.\u003c/p\u003e\n \u003cp\u003eThe LDM is a generative network that encodes high-dimensional data into a low-dimensional continuous latent space and learns the diffusion process within that space [\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e]. Building upon the core architecture of DiscDiff, a diffusion-based model originally proposed for promoter DNA sequence generation [\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e], we constructed a conditional LDM for COI barcode sequence generation (Fig. \u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e(f), LDM). Nucleotide token embeddings were concatenated with species embeddings and transformed into a continuous latent representation (\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\({z}_{0}\\in{\\mathbb{R}}^{L\\times 256}\\)\u003c/span\u003e\u003c/span\u003e) using an encoder composed of one-dimensional convolutional layers. A diffusion process was then applied to the latent representation, progressively adding Gaussian noise over time steps t to obtain noisy latent variables \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\({z}_{t}\\)\u003c/span\u003e\u003c/span\u003e. These noisy representations were denoised using a 1D U-Net to recover clean latent representations. The denoised latent variables were subsequently decoded into nucleotide sequences using a decoder consisting of three GRU layers (hidden dimension\u0026thinsp;=\u0026thinsp;256, dropout\u0026thinsp;=\u0026thinsp;0.3). 
For the diffusion process, a linear noise schedule was employed with \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\({\\beta}_{t}\\in[{10}^{-4},0.02]\\)\u003c/span\u003e\u003c/span\u003e, and the total number of diffusion timesteps was set to 1,000.\u003c/p\u003e\n \u003cp\u003eWe performed temperature-based sampling using six different settings to generate 30 synthetic sequences per species. The generated sequences were filtered to ensure that they did not overlap with the real data and that the proportion of ambiguous nucleotides (N) was less than 0.05. The synthetic datasets were then used for comparative analyses against the real data.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec5\" class=\"Section2\"\u003e\n \u003ch2\u003e2.3 Evaluation Metrics for Synthetic Data\u003c/h2\u003e\n \u003cp\u003eTo assess the biological and statistical quality of synthetic COI data generated for each species by various generative models, we conducted evaluations based on the following three criteria.\u003c/p\u003e\u003cspan\u003e\n \u003cp\u003e\u003cstrong\u003e1. Biological Plausibility of the Synthetic COI Sequences\u003c/strong\u003e\u003c/p\u003e\n \u003c/span\u003e\n \u003cp\u003eTo evaluate whether each generative model generates biologically valid protein-coding sequences, we measured the GC content at the third codon position (GC\u003csub\u003e3\u003c/sub\u003e content), the Jensen-Shannon divergence (JSD) of codon bias, and the average length of ORFs.\u003c/p\u003e\n \u003cp\u003eThe GC\u003csub\u003e3\u003c/sub\u003e content, defined as the proportion of G or C nucleotides at the third base of codons within COI sequences, does not influence the encoded protein but is subject to variance due to evolutionary diversity. 
Moreover, species-specific GC\u003csub\u003e3\u003c/sub\u003e patterns arise from diverse biological factors, enabling an evaluation of how accurately synthetic sequences replicate these species-specific codon usage patterns. Consequently, a smaller difference in GC\u003csub\u003e3\u003c/sub\u003e content (\u0026Delta;) between the real and synthetic data indicates that the generative model effectively reproduces the species-specific nucleotide composition.\u003c/p\u003e\n \u003cp\u003eThe Kullback-Leibler divergence (KLD) is a metric that measures the extent to which one probability distribution diverges from a reference distribution. The Jensen-Shannon Divergence (JSD) measures the average distance between two probability distributions by calculating the KLD between each distribution and their mean. Codon bias JSD represents a value calculated by using the JSD to measure the similarity between codon frequency distributions in real and synthetic data. Despite encoding identical amino acids, different species exhibit unique codon usage preferences. Consequently, a JSD value closer to 0 indicates that the codon usage patterns of the synthetic data are similar to those of the real data. This similarity indicates that the synthetic sequences are likely to be biologically compatible with natural translational processes.\u003c/p\u003e\n \u003cp\u003eBiologically normal COI sequences are characterized by a single extended ORF that initiates with a start codon and terminates with a stop codon. The average ORF length represents the average length of continuous coding regions without stop codons, thereby indicating the maintenance of a normal single reading frame. Consequently, a difference in average ORF length (\u0026Delta;) between real and synthetic data close to 0 implies that the synthetic sequences accurately preserve the reading frame structure and stop codon patterns.\u003c/p\u003e\u003cspan\u003e\n \u003cp\u003e\u003cstrong\u003e2. 
Phylogenetic \u0026amp; Population level Fidelity of the Synthetic COI Sequences\u003c/strong\u003e\u003c/p\u003e\n \u003c/span\u003e\n \u003cp\u003eTo evaluate whether the species-specific synthetic COI sequences preserve inter-specific discrimination and intra-specific diversity, we calculated the mean intraspecific K2P distance, the mean K2P distance between real and synthetic sequences, and the barcode gap rate using the Kimura 2-Parameter (K2P) model. Through an integrative analysis of these metrics, we evaluated the phylogenetic validity of the synthetic data.\u003c/p\u003e\n \u003cp\u003eThe K2P genetic distance model serves as a representative method for estimating evolutionary divergence between nucleotide sequences. It distinguishes between transitions (A\u0026harr;G, C\u0026harr;T) and transversions (A\u0026harr;C, A\u0026harr;T, G\u0026harr;C, G\u0026harr;T, etc.), thereby accounting for the different occurrence probabilities of these two substitution types. After aligning two sequences, the proportions of observed transitions and transversions are calculated and subsequently employed to estimate the expected number of substitutions per nucleotide site, referred to as genetic distance. In contrast to basic mismatch rate calculations, the K2P model accounts for the asymmetry between transitions and transversions observed in mtDNA; therefore, it is widely used for COI sequence analysis.\u003c/p\u003e\n \u003cp\u003eThe mean intra-species K2P distance is an indicator that evaluates the extent to which synthetic data captures the within-species genetic diversity observed in the real data. A lower intra-specific K2P value indicates higher sequence similarity among individuals within the same species. We calculated the mean intra-specific K2P distances for both real and synthetic data and determined the difference between them (\u0026Delta;). 
Thus, a \u0026Delta; value approaching zero indicates that the generative model has effectively reproduced the patterns of intra-specific variation.\u003c/p\u003e\n \u003cp\u003eThe mean real\u0026ndash;synthetic K2P distance is the average K2P distance calculated between all real and synthetic sequences. It indicates whether the synthetic data are located within the actual genetic space; lower values indicate that the synthetic data are positioned closer to or within the real data distribution. When the intra-species K2P distances of the real and synthetic data are similar and the mean real\u0026ndash;synthetic K2P distance is small, the synthetic data not only lie within the real data space but also preserve an intra-species structure similar to that of the real data.\u003c/p\u003e\n \u003cp\u003eA barcode gap indicates distinct species separation, occurring when the maximum intra-specific genetic distance is smaller than the minimum inter-specific genetic distance. The barcode gap rate represents the proportion of species exhibiting this gap; a higher rate suggests enhanced potential for species discrimination based on COI sequence. We calculated the barcode gap rates for both real and synthetic datasets and evaluated their difference (\u0026Delta;). A \u0026Delta; value close to zero indicates that the generative model has effectively preserved the inter-specific boundaries.\u003c/p\u003e\u003cspan\u003e\n \u003cp\u003e\u003cstrong\u003e3. Biodiversity Representation of the Synthetic COI Sequences\u003c/strong\u003e\u003c/p\u003e\n \u003c/span\u003e\n \u003cp\u003eWe evaluated whether the synthetic data preserve the statistical properties of the real data while maintaining intra-species diversity. 
For quantitative evaluation, the JSD-based k-mer distribution divergence, Self-BLEU (Bilingual Evaluation Understudy), and Nearest Neighbour Adversarial Accuracy (\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(AA\\)\u003c/span\u003e\u003c/span\u003e) were utilized.\u003c/p\u003e\n \u003cp\u003eThe k-mer distribution reflects species-specific statistical characteristics, such as repetitive elements and GC/AT compositional patterns. We calculated frequency vectors for 3-mers, corresponding to the codon-length unit (3 nucleotides), from the COI sequences. To assess the similarity between the real and synthetic sequences, we calculated the JSD between their respective 3-mer frequency distributions. The JSD value ranges from 0 to 1, where a value closer to 0 indicates a higher similarity in k-mer distributions between the real and synthetic data. In other words, this implies that the synthetic data effectively reproduces the local patterns and complexity of the real data.\u003c/p\u003e\n \u003cp\u003eSelf-BLEU is a metric that evaluates the similarity among generated sequences, where a lower value indicates higher diversity. This metric facilitates the evaluation of pattern similarity and redundancy within the synthetic data. A difference between the Self-BLEU scores (\u0026Delta;) of real and synthetic data close to 0 implies that the synthetic data exhibits a diversity level comparable to that of the real data.\u003c/p\u003e\n \u003cp\u003eAA is a generative quality evaluation metric based on the 1-nearest neighbor distance that quantifies the overlap between real and synthetic data [\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e]. This metric evaluates whether the synthetic data demonstrates excessive similarity to the real data (overfitting) or excessive dissimilarity (underfitting). The metric was computed using 3-mer distributions extracted from both real and synthetic data. 
Values closer to 1 indicate that the synthetic data substantially diverges from the real data (underfitting), whereas values closer to 0 indicate that the synthetic data closely resembles the real data (overfitting). Therefore, an AA value near 0.5 denotes that the synthetic data effectively captures the characteristics of the real data while preserving diversity; values above 0.5 imply a lack of realism in the synthetic data, whereas values below 0.5 indicate a tendency to closely replicate the real data.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec6\" class=\"Section2\"\u003e\n \u003ch2\u003e2.4 Visualization\u003c/h2\u003e\n \u003cp\u003eWe analyzed conserved and non-conserved regions within COI sequences using position-wise Shannon entropy profiles. Shannon entropy is a measure of uncertainty and randomness within a probability distribution. Here, it reflects the nucleotide distribution (A, T, C, G) at each position across all sequences. Conserved regions, characterized by patterns shared among species, correspond to low entropy values, whereas non-conserved or variable regions exhibit high entropy values. We validated the biological plausibility of the synthetic sequences by comparing the Shannon entropy profiles of the real and synthetic data.\u003c/p\u003e\n \u003cp\u003eTo evaluate the overall similarity between real and synthetic data, we utilized principal component analysis (PCA) and uniform manifold approximation and projection (UMAP) [\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e]. The sequences were transformed into 3-mer frequency vectors, followed by dimensionality reduction via PCA to extract global features. Subsequently, UMAP was applied to project these features into a two-dimensional embedding space. 
We visually evaluated the genetic consistency of the data by examining whether the real and synthetic data occupied a common genetic manifold.\u003c/p\u003e\n \u003cp\u003eAdditionally, we evaluated the reproducibility of local base pattern distributions using a 3-mer abundance scatter plot. We performed a comparative analysis of 3-mer frequency distributions between real and synthetic data. Data points that closely align with the diagonal line (y\u0026thinsp;=\u0026thinsp;x) indicate a more accurate simulation of the real sequence characteristics. Conversely, substantial deviations from this diagonal indicate that the model generates unrealistic sequences. We also assessed the reproducibility of infrequent patterns by applying a log-scale to the frequency values.\u003c/p\u003e\n \u003cp\u003eTo qualitatively assess phylogenetic coherence between real and synthetic sequences, we reconstructed maximum-likelihood (ML) phylogenetic trees. For each dataset, real COI sequences and synthetic COI sequences generated by each model were concatenated, and multiple sequence alignment was performed using MAFFT [\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e, \u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e]. The aligned sequences were then analyzed in IQ-TREE 3 to infer an ML tree, and branch support was estimated using 1,000 ultrafast bootstrap replicates, producing a Newick-format tree file [\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e].\u003c/p\u003e\n \u003cp\u003eBased on the resulting tree, we defined, for each taxon (species level for Bats, Cypraeidae, and Drosophila; group ID level for Birds), the smallest clade defined by the most recent common ancestor (MRCA) shared by the real sequences of that taxon as the MRCA-defined real clade. 
We then calculated the in-clade rate, defined as the proportion of synthetic sequences placed within the MRCA-defined real clade, to quantify the extent to which synthetic sequences were positioned within the taxon-specific phylogenetic structure of the real sequences. To additionally account for clade boundary specificity, we computed the non-focal real-tip count within the MRCA-defined real clade (i.e., the number of real tips belonging to taxa other than the focal taxon) and interpreted a value of zero as indicating that taxon-specificity of the real clade was maintained. Taxon-level summary statistics (including the in-clade rate and related metrics) are provided in the Supplementary Data.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec7\" class=\"Section2\"\u003e\n \u003ch2\u003e2.5 Phylogenetic validation using BOLD reference sequences\u003c/h2\u003e\n \u003cp\u003eFor external validation using independently collected real-world data, COI barcode reference sequences of Chiroptera and Drosophilidae were downloaded from the BOLD (Barcode of Life Data Systems) data portal [\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e]. From the retrieved records, we retained only sequences with collection countries recorded as the Republic of Korea, Japan, or China, and selected entries annotated with the marker code COI-5P (BOLD\u0026rsquo;s designation for the standard 5\u0026prime; region of the COI barcode). We further filtered records to those containing both a process identifier (processid) and nucleotide sequence information (nuc). 
Sequences were cleaned by removing whitespace and normalizing all characters to uppercase, and were exported in FASTA format.\u003c/p\u003e\n \u003cp\u003eThe filtered BOLD reference sequences were combined with synthetic sequences generated in this study (GRU-based ARLM outputs), and multiple sequence alignment (MSA) was performed using MAFFT [\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e, \u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e]. Using the aligned sequences as input, maximum-likelihood phylogenetic trees were inferred with IQ-TREE 3, and branch support was evaluated with 1,000 ultrafast bootstrap replicates. The resulting trees were inspected to determine whether the synthetic sequences were placed in phylogenetically coherent positions relative to the BOLD reference sequences [\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e]. In addition, taxonomic composition and patterns of potential mixing were examined using Krona to generate interactive HTML-based visualizations [\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e].\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec8\" class=\"Section2\"\u003e\n \u003ch2\u003e2.6 Implementation Details\u003c/h2\u003e\n \u003cp\u003eAll training and testing were performed with the PyTorch framework on four GeForce RTX 3090 Ti 16GB GPUs (Z-202410106770) provided by the Bio-Bigdata Analysis and Utilization of Biological Resources at Soonchunhyang University. All generative models shared a common configuration, namely a cosine learning rate decay schedule with warm-up and the Adam optimizer, chosen to improve training efficiency and model accuracy. The initial learning rate was set to 0.001, and training ran for 100 epochs with a batch size of 16. 
The loss functions for each model were as follows. The GRU- and Transformer-based ARLMs used cross-entropy loss. The Conv- and GRU-based VAEs used a loss combining a cross-entropy reconstruction term with a Kullback\u0026ndash;Leibler (KL) divergence term. WGAN is designed to minimize the difference between the real data distribution and the synthetic data distribution by leveraging the Wasserstein-1 distance. The critic outputs a real-valued score, and the generator is trained to maximize this score for synthetic samples, thereby minimizing the Wasserstein distance. To enhance training stability, the gradient penalty coefficient was set to 10, and the update ratio between the critic and the generator was maintained at 5:1. LDM was optimized by defining the overall loss function as a weighted sum of the cross-entropy loss for reconstruction and the mean squared error (MSE) loss for noise prediction.\u003c/p\u003e\n\u003c/div\u003e"},{"header":"3. Results","content":"\u003cdiv id=\"Sec10\" class=\"Section2\"\u003e \u003ch2\u003e3.1. Evaluating Biological Plausibility in Deep Generative DNA Models\u003c/h2\u003e \u003cp\u003eTable\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e presents the biological validity results of synthetic COI data generated by seven models across four datasets. For all three evaluation metrics, namely GC₃ content (Δ), codon bias JSD, and ORF mean length (Δ), values closer to zero indicate that the synthetic data more accurately replicate the actual biological structures.\u003c/p\u003e \u003cp\u003eOverall, the N-gram models showed low biological plausibility across all four datasets. While they partially replicated the statistical distribution of nucleotide sequences, they did not accurately capture species-specific codon usage patterns and exhibited significant deviations in ORF frame preservation. This suggests that traditional statistical methods are inadequate for capturing the structural constraints intrinsic to COI sequences. 
In contrast, the GRU-based ARLM showed the smallest deviation in most metric and dataset combinations and stayed close to the best-performing model in the remainder. This indicates that the GRU-based ARLM generates synthetic data with high protein translatability by reproducing codon structures and patterns similar to real data and stably preserving ORF frames. Overall, the GRU-based ARLM was the most effective model for generating biologically plausible synthetic COI sequences. Additionally, the GRU-based VAE ranked second in overall performance, suggesting that GRU layers can effectively capture nucleotide continuity and preserve the codon structure of COI data across different data types (invertebrate and vertebrate). Although the Transformer-based ARLM performed relatively well in replicating GC₃ content, it was inferior on the codon bias JSD and ORF mean length metrics. While the Transformer architecture is effective for learning global patterns, it appears limited in replicating local and sequential patterns at the codon level. Convolution-based feature learning models, including the Conv-based VAE, WGAN, and LDM, exhibited low biological validity across all three metrics. In particular, WGAN and the Conv-based VAE did not preserve ORF frames, leading to poor translatability at the codon level. For LDM, codon bias deviation and ORF collapse were observed. These results indicate that distribution learning within the latent space does not sufficiently capture the intrinsic biological constraints of COI sequences. 
Furthermore, these results suggest that, for COI data generation, learning the sequential dependencies of nucleotide sequences with GRU layers preserves biological properties in synthetic COI data more effectively than focusing on local patterns.\u003c/p\u003e \u003cp\u003eFigure\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e shows the position-wise Shannon entropy profiles of real COI sequences and of synthetic data generated by seven models across four datasets. The results are arranged from top to bottom for Cypraeidae, Drosophila, Bats, and Birds, respectively. The left-hand graphs show the position-wise Shannon entropy profiles, whereas the right-hand graphs show the entropy deviation between the synthetic and real data. COI sequences comprise both functionally conserved regions and non-conserved (variable) regions. Therefore, we analyzed position-wise Shannon entropy to quantitatively evaluate how accurately each generative model replicates these biological conservation-variation patterns. We identified highly conserved regions with low variation (indicated by grey shading) by setting a threshold at the bottom 30% of values in the smoothed entropy distribution. Compared with the real data (black line), the synthetic data generated by both the GRU-based ARLM and GRU-based VAE exhibited entropy profiles similar to the real sequences, replicating the relative positions of conserved and variable regions. In particular, the GRU-based ARLM consistently maintained entropy deviations close to zero, demonstrating a superior capability to preserve nucleotide-level patterns. The Transformer-based ARLM exhibited global entropy profiles similar to the real data; however, its position-wise entropy values were lower than those of the real sequences. 
Considering that high-quality synthetic data should balance fidelity to real characteristics against an adequate level of diversity, these results indicate that the Transformer-based ARLM over-converges, producing less per-position diversity than the real sequences. WGAN also exhibited entropy profiles similar in shape to the real data; however, it showed a large entropy deviation, indicating high variability within the synthetic data. Thus, although the relative pattern of the position-wise profile was preserved, the entropy scale differed from that of the real data. By contrast, the N-gram, Conv-based VAE, and LDM exhibited overall entropy values higher than those of the real data, characterized by linear or irregular profiles. These models maintained high entropy values even within the grey-shaded regions, indicating the generation of unnatural sequences with low biological plausibility. In addition, a comparison across datasets shows that entropy deviations are generally larger in the Birds dataset (longest sequences) than in the Cypraeidae dataset (shortest sequences). Notably, the entropy deviation in the Birds dataset increases markedly beyond 500 bp, indicating a failure to accurately replicate the conserved terminal region near 600 bp. Within this region, the WGAN outperformed the GRU-based ARLM by replicating patterns closer to the real data. This points to a potential limitation of GRU-based models in maintaining long-term dependencies as the sequence lengthens. 
In other words, while GRU-based models are effective in learning local nucleotide continuity, they encounter difficulties in accurately replicating the global structure of lengthy and complex sequences.\u003c/p\u003e \u003cp\u003eTo validate the biological plausibility of the synthetic COI data, we performed a comparative analysis of codon structure and usage patterns, frame preservation length, and the position-wise profiles of conserved and variable regions. The results demonstrate that GRU-based models are superior in capturing sequential information and continuous patterns at the nucleotide level. These models not only replicate the statistical distribution of nucleotide composition but also accurately reproduce the complex biological patterns of conserved and variable regions across the entire sequence.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eResults of biological plausibility evaluation metrics for synthetic COI data generated by six generative models across the four datasets.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"5\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eDatasets\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e 
\u003cp\u003eModels\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eGC₃ content (Δ)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eCodon bias JSD\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eORF mean length (Δ)\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"6\" rowspan=\"7\"\u003e \u003cp\u003eCypraeidae\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eN-gram\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.1843\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.0227\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e8.1172\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eGRU-based ARLM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e\u003cb\u003e0.0037\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e\u003cb\u003e0.0123\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e\u003cb\u003e-0.0190\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eTransformer-based ARLM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e-0.0067\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.0342\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e-3.2274\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e 
\u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eConv-based VAE\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e-0.0106\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.0542\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e-10.2542\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eGRU-based VAE\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.0128\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.0132\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e2.1256\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eWGAN\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.0225\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.0435\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e4.9894\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eLDM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.2322\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.1904\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e28.2121\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"6\" rowspan=\"7\"\u003e \u003cp\u003eDrosophila\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e 
\u003cp\u003eN-gram\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.2297\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.4262\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e40.4645\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eGRU-based ARLM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e\u003cb\u003e-0.0015\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e\u003cb\u003e0.0487\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e7.3208\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eTransformer-based ARLM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e-0.0031\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.4284\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e-6.2480\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eConv-based VAE\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.2140\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.4433\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e63.9333\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eGRU-based VAE\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e 
\u003cp\u003e0.0081\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.0727\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e\u003cb\u003e5.8333\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eWGAN\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.0134\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.1216\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e8.3333\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eLDM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.0814\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.4295\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e20.7083\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"6\" rowspan=\"7\"\u003e \u003cp\u003eBats\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eN-gram\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.1292\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.0978\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e-2.3815\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eGRU-based ARLM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.0143\u003c/p\u003e \u003c/td\u003e \u003ctd 
align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e\u003cb\u003e0.0215\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e\u003cb\u003e-2.3333\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eTransformer-based ARLM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e\u003cb\u003e-0.0046\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.0608\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e-6.6870\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eConv-based VAE\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.1155\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.1623\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e28.5204\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eGRU-based VAE\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.0167\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.0272\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e-2.5074\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eWGAN\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.0101\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.0446\u003c/p\u003e \u003c/td\u003e \u003ctd 
align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e12.5259\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eLDM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.0730\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.3255\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e-52.7167\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"6\" rowspan=\"7\"\u003e \u003cp\u003eBirds\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eN-gram\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e-0.0983\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.3627\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e36.9028\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eGRU-based ARLM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e\u003cb\u003e0.0016\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e\u003cb\u003e0.0293\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.4066\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eTransformer-based ARLM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.0031\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.3675\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" 
char=\".\" colname=\"c5\"\u003e \u003cp\u003e6.3973\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eConv-based VAE\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e-0.0884\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.3775\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e55.5732\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eGRU-based VAE\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.0024\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.4437\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e\u003cb\u003e-0.3953\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eWGAN\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e-0.0070\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.0601\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e1.3232\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eLDM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.0621\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.4922\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e-20.1564\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e 
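The columns of Table 2 compare synthetic and real sequences at the codon level. As a minimal sketch of how two of them could be computed, the Python below estimates a GC₃ difference and a codon-usage Jensen–Shannon divergence (JSD). It assumes that sequences are already in reading frame 0, that Δ means synthetic minus real, and that the JSD uses base-2 logarithms. These conventions, and the helper names `codons`, `gc3`, `codon_freqs`, and `jsd`, are illustrative assumptions rather than the definitions used in this study.

```python
from collections import Counter
from math import log2

VALID = set("ACGT")


def codons(seq, frame=0):
    """Split a nucleotide string into complete codons (assumed frame, default 0)."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def clean_codons(seqs, frame=0):
    """All unambiguous codons (A/C/G/T only) pooled over a list of sequences."""
    return [c for s in seqs for c in codons(s, frame) if set(c) <= VALID]


def gc3(seqs, frame=0):
    """Fraction of G or C at third codon positions (raises on empty input)."""
    thirds = [c[2] for c in clean_codons(seqs, frame)]
    return sum(b in "GC" for b in thirds) / len(thirds)


def codon_freqs(seqs, frame=0):
    """Relative codon frequencies as a dict {codon: probability}."""
    counts = Counter(clean_codons(seqs, frame))
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}


def jsd(p, q):
    """Jensen-Shannon divergence (base 2, range 0..1) between two distributions."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        # KL(a || m); terms with a[k] == 0 contribute nothing
        return sum(a[k] * log2(a[k] / m[k]) for k in a if a[k] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


if __name__ == "__main__":
    # Toy sequences, not data from the study
    real = ["ATGGCC", "ATGGCA"]
    synth = ["ATGGCG", "ATGGCC"]
    print("GC3 delta (synthetic - real):", gc3(synth) - gc3(real))
    print("codon bias JSD:", jsd(codon_freqs(real), codon_freqs(synth)))
```

With base-2 logarithms the JSD is bounded between 0 (identical codon usage) and 1 (no shared codons), which makes values comparable across datasets of different size.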
\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec11\" class=\"Section2\"\u003e \u003ch2\u003e3.2. Maintaining Phylogenetic and Population-Level Fidelity\u003c/h2\u003e \u003cp\u003eTable\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e presents the species discrimination capabilities of synthetic COI data generated by seven models across four datasets. Generally, a K2P distance close to 0 indicates intraspecific genetic similarity, while values of 0.05 or higher are generally treated as indicating interspecific distinction. Therefore, an intra-species K2P mean (Δ) close to 0 suggests that the synthetic data preserve intraspecific diversity similar to that of the real data. Similarly, a real\u0026ndash;synthetic K2P mean near 0 indicates that the genetic structure of the synthetic data is highly similar to that of the real data. Finally, a barcode gap rate (Δ) near 0 indicates that the synthetic data accurately reflect the interspecific distance structure observed in the real data.\u003c/p\u003e \u003cp\u003eAn analysis of the overall results showed that both the N-gram and LDM exhibited intra-species K2P mean (Δ) and real\u0026ndash;synthetic K2P mean values substantially exceeding 1\u0026ndash;2. This indicates that the synthetic data occupied a genetic space markedly distinct from that of the real data, exhibiting a divergent pattern of variation. The intraspecific and interspecific distance structures therefore collapsed in these models, indicating a failure to generate synthetic data that preserve phylogenetic consistency. Additionally, the barcode gap rate (Δ) exhibited substantially negative values, confirming the collapse of species boundaries within the synthetic data. Conversely, the GRU-based ARLM exhibited the lowest or near-lowest values across the three metrics for all four datasets, indicating superior performance in species discrimination. 
This indicates that the synthetic data generated by the GRU-based ARLM occupied a genetic space closely aligned with that of the real data, effectively replicating both intraspecific variation patterns and interspecific boundaries. Furthermore, the barcode gap rate (Δ) was closer to 0, indicating that the genetic boundaries between species were effectively maintained within the synthetic data. The GRU-based VAE also effectively captured genetic structures at both the intraspecific and interspecific levels. In contrast, while the Transformer-based ARLM preserved consistent intraspecific and interspecific distances, it exhibited reduced performance in maintaining species boundaries, as evidenced by significant deviations in the barcode gap rate (Δ). The WGAN and Conv-based VAE exhibited substantial deviations in K2P distances and strongly negative values of the barcode gap rate (Δ), indicating difficulties in distinguishing species within the synthetic data. Across all generative models examined, the barcode gap rate (Δ) was negative, indicating that species discrimination boundaries in the synthetic data were less distinct than those in the real data. This result is likely due to the expansion or overlap of sequence distributions for each species during the generation process.\u003c/p\u003e \u003cp\u003eFigure\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e shows the PCA-UMAP comparison between the real data and the synthetic data generated by seven models across four datasets. The results are arranged with datasets represented in rows (Cypraeidae, Drosophila, Bats, and Birds) and models in columns. Overall, the synthetic data generated by the GRU-based ARLM and the GRU-based VAE most effectively replicated the cluster distributions and the interspecies cluster distances observed in the real data. 
These models consistently preserved the compactness within species-specific clusters as well as the separation between species clusters, thereby reflecting the intra-species diversity and phylogenetic structures observed in the real data. This fidelity in distribution aligns with the low values of intra-species K2P mean (Δ) and barcode gap rates (Δ) reported in Table\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e. While synthetic data from the Transformer-based ARLM and WGAN closely aligned with the global distribution of the real data, they exhibited increased scattering around clusters and denser aggregation within particular clusters compared to GRU-based models. This indicates potential instability in preserving patterns of intraspecific variation and maintaining clear interspecific boundaries. In contrast, synthetic data generated by N-gram, Conv-based VAE, and LDM were broadly scattered around the clusters of the real data. In particular, the distribution of data generated by the LDM markedly diverges from that of the real data, indicating that the synthetic data exhibited characteristics that are non-biological in nature. Furthermore, the negative values of the barcode gap rate (Δ) presented in Table\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e, corroborated by the PCA-UMAP analysis, indicate that species-specific distributions in the synthetic data were more expanded than those in the real data, leading to partial overlaps between adjacent species distributions. 
Therefore, GRU-based models demonstrated stable performance in replicating the species-specific structures of the real data.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab3\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eResults of species discrimination performance metrics for synthetic COI data generated by six generative models across the four datasets.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"5\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eDatasets\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eModels\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eIntra-species\u003c/p\u003e \u003cp\u003eK2P mean (Δ)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eReal\u0026ndash;Synth\u003c/p\u003e \u003cp\u003eK2P mean\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eBarcode gap rate (Δ)\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"6\" rowspan=\"7\"\u003e \u003cp\u003eCypraeidae\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e 
\u003cp\u003eN-gram\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e2.6059\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e2.6340\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e-0.7143\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eGRU-based ARLM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.1269\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.0843\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e\u003cb\u003e-0.4286\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eTransformer-based ARLM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e\u003cb\u003e0.0621\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e\u003cb\u003e0.0775\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e-0.6134\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eConv-based VAE\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.2588\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.2197\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e-0.7143\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eGRU-based VAE\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" 
colname=\"c3\"\u003e \u003cp\u003e0.1915\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.1272\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e-0.5462\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eWGAN\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.3106\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.2945\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e-0.7143\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eLDM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e2.2510\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e2.2934\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e-0.7143\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"6\" rowspan=\"7\"\u003e \u003cp\u003eDrosophila\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eN-gram\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e2.3128\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e2.3167\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e-0.6875\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eGRU-based ARLM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.0982\u003c/p\u003e \u003c/td\u003e \u003ctd 
align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e\u003cb\u003e0.0629\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e\u003cb\u003e-0.5625\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eTransformer-based ARLM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e\u003cb\u003e0.0713\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.0882\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e-0.6250\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eConv-based VAE\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e1.4862\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1.4562\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e-0.6875\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eGRU-based VAE\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.1550\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.0974\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e-0.4375\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eWGAN\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.1585\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.2043\u003c/p\u003e \u003c/td\u003e \u003ctd 
align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e-0.5625\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eLDM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e1.1384\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1.4618\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e-0.6875\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"6\" rowspan=\"7\"\u003e \u003cp\u003eBats\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eN-gram\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e2.8969\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e2.9842\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e-0.9630\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eGRU-based ARLM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e\u003cb\u003e0.0951\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.0635\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e\u003cb\u003e-0.6481\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eTransformer-based ARLM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.1667\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.1796\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" 
char=\".\" colname=\"c5\"\u003e \u003cp\u003e-0.9630\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eConv-based VAE\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e1.5104\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1.4697\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e-0.9630\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eGRU-based VAE\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.0996\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e\u003cb\u003e0.0539\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e-0.6667\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eWGAN\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.2357\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.2357\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e-0.9630\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eLDM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e1.4538\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1.7793\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e-0.9630\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"6\" 
rowspan=\"7\"\u003e \u003cp\u003eBirds\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eN-gram\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e2.7890\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e2.8641\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e-0.7593\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eGRU-based ARLM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e\u003cb\u003e0.0986\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e\u003cb\u003e0.0919\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e\u003cb\u003e-0.6296\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eTransformer-based ARLM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.1124\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.1634\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e-0.7593\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eConv-based VAE\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e1.4611\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1.4636\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e-0.7593\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" 
colname=\"c2\"\u003e \u003cp\u003eGRU-based VAE\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.2160\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.1640\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e-0.7222\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eWGAN\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.1357\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.1869\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e-0.7593\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eLDM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.5852\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1.3440\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e-0.7593\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec12\" class=\"Section2\"\u003e \u003ch2\u003e3.3. Modeling Biodiversity Through Synthetic Sequence Generation\u003c/h2\u003e \u003cp\u003eTable\u0026nbsp;\u003cspan refid=\"Tab4\" class=\"InternalRef\"\u003e4\u003c/span\u003e presents the biodiversity of synthetic COI data generated by seven models across four datasets. Ideally, generative models should generate novel data that reproduce the statistical and biological attributes of the real data, rather than merely replicating the training data. 
To validate this, we used three metrics: JSD-kmer, Self-BLEU (Δ), and AA. For both JSD-kmer and Self-BLEU (Δ), values closer to 0 indicate that the synthetic data accurately reflect the structural characteristics and diversity inherent in the real data. For AA, a value close to 0.5 signifies that the synthetic data maintain appropriate diversity and adhere to the structural distribution of the real data, without exhibiting excessive self-replication.\u003c/p\u003e \u003cp\u003eThe GRU-based ARLM and GRU-based VAE consistently exhibited low JSD-kmer values and Self-BLEU (Δ) values near 0 across all datasets, effectively replicating 3-mer nucleotide patterns and sequence variation structures similar to the real data. Additionally, with AA scores ranging from approximately 0.53 to 0.76, the synthetic data generated by these models maintained intra-specific diversity while reducing data redundancy. These results indicate that the GRU-based models achieved a balanced reproduction of nucleotide composition diversity, sequence variation, and inter-cluster overlap, preserving both ecological and genetic diversity. In contrast, the N-gram, Transformer-based ARLM, and WGAN exhibited local pattern distributions (JSD-kmer) similar to the real data. However, their AA values ranged between 0.9 and 1.0, implying that the generated sequences occupied a markedly different genetic space and failed to reproduce the variational structure of the real data. In other words, although these models captured nucleotide-level frequencies, they did not sufficiently reflect the fine-grained structural properties underlying natural sequences. The Conv-based VAE and LDM models exhibited substantially higher JSD-kmer values, demonstrating their inability to learn the local compositional patterns of real sequences. 
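\u003c/p\u003e \u003cp\u003eTo make the JSD-kmer metric concrete, the sketch below computes the Jensen-Shannon divergence between the pooled 3-mer frequency distributions of two sequence sets. It is a hedged illustration only: the function names, the base-2 logarithm (which bounds the value between 0 and 1), the pooling across sequences, and the exclusion of k-mers containing non-ACGT characters are assumptions rather than details confirmed by the text.\u003c/p\u003e

```python
import math
from collections import Counter
from itertools import product


def kmer_distribution(sequences, k=3):
    """Relative frequency of every possible DNA k-mer, pooled over sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):  # assumption: skip gaps and ambiguity codes
                counts[kmer] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no valid k-mers found")
    all_kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return {kmer: counts[kmer] / total for kmer in all_kmers}


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two distributions on the same keys."""
    mixture = {x: 0.5 * (p[x] + q[x]) for x in p}

    def kl(a, m):
        # Terms with a[x] == 0 contribute nothing to the sum.
        return sum(a[x] * math.log2(a[x] / m[x]) for x in a if a[x] > 0)

    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)


def jsd_kmer(real_sequences, synthetic_sequences, k=3):
    """JSD between the k-mer distributions of real and synthetic sequence sets."""
    return jensen_shannon_divergence(
        kmer_distribution(real_sequences, k),
        kmer_distribution(synthetic_sequences, k),
    )
```

With this definition, two sequence sets with identical 3-mer frequencies give 0 and two sets sharing no 3-mers give 1, which matches the reading that smaller values indicate closer agreement with the real data.

\u003cp\u003e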
Moreover, AA values at or near 1 for the Conv-based VAE and LDM (0.9857\u0026ndash;1.0000) suggested that these models generated non-biological artifacts rather than biologically plausible sequences, further confirming their failure to reproduce the intrinsic characteristics of the real data.\u003c/p\u003e \u003cp\u003eFigure\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e shows log-scaled scatter plots comparing the 3-mer frequencies between the real and synthetic datasets. From top to bottom, the rows correspond to Cypraeidae, Drosophila, Bats, and Birds, and from left to right the columns show the seven models in order. Points distributed uniformly along the diagonal indicate that the model successfully reflected the 3-mer distributions of the real data, ranging from highly frequent patterns (upper-right region) to rare patterns (lower-left region). Overall, across all datasets, the N-gram, GRU-based ARLM, and GRU-based VAE models exhibited high correlations with R-squared values above 0.98, showing that their generated 3-mer distributions closely matched those of the real data. For the GRU-based ARLM and GRU-based VAE, points were distributed around the diagonal, indicating that these models reproduced the overall real sequence distribution rather than engaging in simple frequency replication. In contrast, because the N-gram model learns the 3-mer occurrence probabilities, its statistical distribution appeared highly similar to that of the real data. However, when considered alongside the results presented in Table\u0026nbsp;\u003cspan refid=\"Tab4\" class=\"InternalRef\"\u003e4\u003c/span\u003e, it becomes evident that the N-gram model fails to reflect the biological characteristics of the real sequences and performs only a frequency-level replication. The Transformer-based ARLM and WGAN models exhibited high correlations with the real data in terms of overall 3-mer distributions (R-squared values\u0026thinsp;\u0026asymp;\u0026thinsp;0.96\u0026ndash;0.99). 
However, both models showed increased dispersion in the scatter plots when generating rare structures (lower-left region). For the Conv-based VAE, points were skewed toward the upper right, confirming poor performance in reproducing rare nucleotide patterns that are observed in the real sequences. The LDM exhibited the lowest R-squared values among the models, with the majority of points distributed below the diagonal. This indicates a distributional distortion where certain 3-mer patterns were excessively generated in the synthetic data. Furthermore, among the four datasets, the scatter plots for Cypraeidae exhibited the most uniform distributions along the diagonal, suggesting that diverse 3-mer structures were successfully reflected. In contrast, the scatter plots for Birds showed a stronger concentration of points in the upper-right region. These results indicate that as sequence length increases, the fidelity of generating rare patterns tends to decrease.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab4\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 4\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eResults of biological diversity evaluation metrics for synthetic COI data generated by six generative models across the four datasets.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"5\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e 
\u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eDatasets\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eModels\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eJSD-kmer\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eSelf-BLEU (Δ)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eAA\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"6\" rowspan=\"7\"\u003e \u003cp\u003eCypraeidae\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eN-gram\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.0175\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e-0.0011\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9993\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eGRU-based ARLM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e\u003cb\u003e0.0098\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e-0.0001\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.6405\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eTransformer-based ARLM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.0315\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e-0.0000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e 
\u003cp\u003e0.9023\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eConv-based VAE\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.0532\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e-0.0001\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9857\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eGRU-based VAE\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.0109\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e-0.0001\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e\u003cb\u003e0.6124\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eWGAN\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.0416\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e-0.0002\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9677\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eLDM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.1899\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e-0.0002\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e1.0000\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"6\" rowspan=\"7\"\u003e 
\u003cp\u003eDrosophila\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eN-gram\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e\u003cb\u003e0.0301\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e-0.0004\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e1.0000\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eGRU-based ARLM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.0314\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.0001\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.6108\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eTransformer-based ARLM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.0436\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.0001\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9560\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eConv-based VAE\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.1787\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e-0.0001\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e1.0000\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eGRU-based VAE\u003c/p\u003e \u003c/td\u003e 
\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.0320\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e-0.0001\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e\u003cb\u003e0.5902\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eWGAN\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.0513\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e-0.0001\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9094\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eLDM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.2836\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.0000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e1.0000\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"6\" rowspan=\"7\"\u003e \u003cp\u003eBats\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eN-gram\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.0935\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e-0.0006\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e1.0000\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eGRU-based ARLM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e 
\u003cp\u003e\u003cb\u003e0.0197\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.0000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.5478\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eTransformer-based ARLM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.0486\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.0000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9750\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eConv-based VAE\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.1605\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e-0.0002\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e1.0000\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eGRU-based VAE\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.0263\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e-0.0012\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e\u003cb\u003e0.5281\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eWGAN\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.0426\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e 
\u003cp\u003e0.0000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9904\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eLDM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.3236\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e-0.0003\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e1.0000\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"6\" rowspan=\"7\"\u003e \u003cp\u003eBirds\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eN-gram\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.0204\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e-0.0004\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e1.0000\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eGRU-based ARLM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e\u003cb\u003e0.0204\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e-0.0001\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e\u003cb\u003e0.7417\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eTransformer-based ARLM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.0480\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e 
\u003cp\u003e-0.0002\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9431\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eConv-based VAE\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.2027\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e-0.0003\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e1.0000\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eGRU-based VAE\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.0311\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e-0.0010\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.7558\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eWGAN\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.0473\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e-0.0002\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.9586\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eLDM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.4699\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e-0.0004\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e1.0000\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e 
\u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec13\" class=\"Section2\"\u003e \u003ch2\u003e3.4. Phylogenetic coherence between real data and synthetic sequences\u003c/h2\u003e \u003cp\u003eUsing alignments that combined real COI sequences (Real) with synthetic COI sequences generated by the GRU-based ARLM, we inferred maximum-likelihood (ML) phylogenetic trees in IQ-TREE 3. For each taxon, we defined the MRCA-defined real clade as the minimal clade subtended by the most recent common ancestor (MRCA) of that taxon\u0026rsquo;s real sequences, and calculated the in-clade rate as the proportion of synthetic sequences placed within the corresponding real clade. In addition, we computed the non-focal real-tip count (the number of real tips belonging to taxa other than the focal taxon within the MRCA-defined real clade) to jointly assess potential mixing during synthesis and the clarity of species boundaries in the real data.\u003c/p\u003e \u003cp\u003eAcross the four datasets (240 taxa in total), the synthetic in-clade rate had a synthetic-count\u0026ndash;weighted mean of 0.793, and 81.7% of taxa had a non-focal real-tip count of 0. Within this subset, the weighted mean in-clade rate was 0.764. A non-focal real-tip count of 0 indicates that the MRCA-defined real clade is composed exclusively of real sequences from the focal taxon.\u003c/p\u003e \u003cp\u003eIn the bats dataset (51 taxa), 90.2% of taxa had a non-focal real-tip count of 0. The in-clade rate showed a weighted mean of 0.769 (count\u0026thinsp;=\u0026thinsp;0 subset: 0.756) and a median of 0.8667 (IQR 0.6333\u0026ndash;0.9500). Taxa with an in-clade rate\u0026thinsp;=\u0026thinsp;1.0 accounted for 11.8%, and under the count\u0026thinsp;=\u0026thinsp;0 condition, taxa with an in-clade rate\u0026thinsp;\u0026lt;\u0026thinsp;0.5 accounted for 9.8%. 
For a subset of taxa, real_monophyly was FALSE or the non-focal real-tip count was large, indicating that other real sequences were included within the MRCA-defined real clade. Under the count\u0026thinsp;=\u0026thinsp;0 condition, low in-clade rates were predominantly observed in cases with small n_real.\u003c/p\u003e \u003cp\u003eIn the birds dataset (54 taxa), 83.3% of taxa had a non-focal real-tip count of 0. The in-clade rate had a weighted mean of 0.735 (count\u0026thinsp;=\u0026thinsp;0 subset: 0.699) and a median of 0.9000 (IQR 0.5667\u0026ndash;1.0000). Taxa with an in-clade rate\u0026thinsp;=\u0026thinsp;1.0 accounted for 29.6%, whereas taxa with an in-clade rate\u0026thinsp;=\u0026thinsp;0.0 accounted for 7.4%, and taxa with an in-clade rate\u0026thinsp;\u0026lt;\u0026thinsp;0.5 accounted for 20.4%. Cases with an in-clade rate\u0026thinsp;=\u0026thinsp;0.0 were observed in groups with small n_real.\u003c/p\u003e \u003cp\u003eIn the Cypraeidae dataset (119 taxa), 78.2% of taxa had a non-focal real-tip count of 0. The in-clade rate had a weighted mean of 0.813 (count\u0026thinsp;=\u0026thinsp;0 subset: 0.781) and a median of 0.9000 (IQR 0.7000\u0026ndash;1.0000). Taxa with an in-clade rate\u0026thinsp;=\u0026thinsp;1.0 accounted for 28.6%, and no taxa showed an in-clade rate of 0.0. Under the count\u0026thinsp;=\u0026thinsp;0 condition, taxa with an in-clade rate\u0026thinsp;\u0026lt;\u0026thinsp;0.5 accounted for 10.1%. In this dataset, increases in non-focal real-tip count were observed for some taxa, and very low in-clade rates were observed under conditions with small n_real.\u003c/p\u003e \u003cp\u003eIn the Drosophila dataset (16 taxa), 75.0% of taxa had a non-focal real-tip count of 0. The in-clade rate showed a weighted mean of 0.919 (count\u0026thinsp;=\u0026thinsp;0 subset: 0.903) and a median of 0.9667 (IQR 0.8917\u0026ndash;1.0000). 
Taxa with an in-clade rate\u0026thinsp;=\u0026thinsp;1.0 accounted for 37.5%, and no taxa had an in-clade rate\u0026thinsp;\u0026lt;\u0026thinsp;0.5.\u003c/p\u003e \u003cp\u003eOverall, dataset-level weighted mean in-clade rates ranged from 0.735 to 0.919, the proportion of taxa with non-focal real-tip count\u0026thinsp;=\u0026thinsp;0 ranged from 75.0% to 90.2%, and the proportion of taxa with an in-clade rate\u0026thinsp;=\u0026thinsp;1.0 ranged from 11.8% to 37.5% (Supplementary Data S1 and S2). Taxonomic composition and patterns of potential mixing were additionally examined using Krona-based interactive visualizations (Supplementary Data S3).\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec14\" class=\"Section2\"\u003e \u003ch2\u003e3.5. External phylogenetic validation with BOLD reference sequences\u003c/h2\u003e \u003cp\u003eFor the Chiroptera external reference set downloaded from BOLD (Supplementary Data S4A), the alignment combining the BOLD sequences with the GRU-based ARLM synthetic sequences comprised 2,780 sequences and 2,933 nucleotide sites. The alignment contained 1,648 invariant sites (56.19%) and 898 parsimony-informative sites. IQ-TREE 3 analyses were performed with ModelFinder-based model selection and maximum-likelihood (ML) estimation; the best-fitting model under the Bayesian Information Criterion (BIC) was GTR+F+R6. The final ML tree had a log-likelihood of \u0026minus;79,788.8898 (s.e. 3,752.4036) and a total sum of branch lengths of 43.9630 (Supplementary Data S5A).\u003c/p\u003e \u003cp\u003eFor the Drosophilidae BOLD reference set (Supplementary Data S4B), the combined alignment comprised 3,672 sequences and 2,316 nucleotide sites, with 1,388 invariant sites (59.93%) and 691 parsimony-informative sites. Using the same IQ-TREE 3 workflow, the best-fitting model under BIC was GTR+F+I+R9. The final ML tree had a log-likelihood of \u0026minus;101,183.1110 (s.e. 
5,706.3479) and a total sum of branch lengths of 48.1559 (Supplementary Data S5B). Taxonomic composition and patterns of potential mixing were additionally examined using Krona-based interactive visualizations (Supplementary Data S6).\u003c/p\u003e \u003c/div\u003e"},{"header":"4. Discussion","content":"\u003cp\u003eWe performed species-specific COI sequence generation using generative models to address data scarcity and class imbalance in species-level COI datasets. To explore generative models optimized for COI data, we constructed six deep generative models with distinct architectures\u0026mdash;GRU-based ARLM, Transformer-based ARLM, Conv-based VAE, GRU-based VAE, WGAN, and LDM\u0026mdash;and conducted a comparative evaluation across four datasets (Cypraeidae, Drosophila, Bats, and Birds). These datasets differ in sequence length, number of species, and sequence complexity, enabling assessment of the models\u0026rsquo; biological generalization capabilities. We evaluated the generated synthetic data focusing on biological plausibility, intra- and inter-species discriminability, and biological diversity.\u003c/p\u003e \u003cp\u003eAcross datasets, GRU-based models effectively reproduced biological characteristics and structural properties of COI sequences. Based on position-wise Shannon entropy and PCA-UMAP analyses, the GRU-based ARLM and GRU-based VAE generated synthetic data that reflected sequence patterns of conserved and variable regions consistent with real data. These models also preserved intra-specific clustering and inter-specific boundaries. Evaluations using JSD-kmer, Self-BLEU, and AA metrics confirmed that synthetic sequences maintained intra-species diversity similar to real data while capturing nucleotide patterns and sequence variation structures. 
For the Transformer-based ARLM, synthetic sequences reflected overall k-mer distributions and Shannon entropy patterns similar to real data but showed limited inter-species discriminability and failed to capture key characteristics of the real data. This suggests that although the Transformer architecture learns global patterns effectively, it is less effective at capturing codon-level constraints and subtle intra-specific variations of COI sequences. The convolution-based models, including Conv-based VAE, WGAN, and LDM, also showed limited capability in reproducing rare patterns and codon-level structures. Conv-based VAE and LDM exhibited high entropy, high k-mer JSD, and AA values close to 1, indicating generation of unnatural sequences that do not reflect real data characteristics. These results suggest that, for COI sequence modeling, capturing nucleotide order and sequential dependencies matters more than learning local patterns.\u003c/p\u003e \u003cp\u003eOverall, COI sequence generation performance ranked as follows: GRU-based ARLM\u0026thinsp;\u0026gt;\u0026thinsp;GRU-based VAE\u0026thinsp;\u0026gt;\u0026thinsp;Transformer-based ARLM\u0026thinsp;\u0026asymp;\u0026thinsp;WGAN\u0026thinsp;\u0026gt;\u0026thinsp;Conv-based VAE\u0026thinsp;\u0026gt;\u0026thinsp;LDM. Models such as WGAN, Conv-based VAE, and LDM, previously effective for SMILES, microbiome, haplotype, genotype, and promoter data, showed poor performance for COI sequence generation, highlighting the importance of domain-specific model design. Comparisons with the statistical N-gram method indicated that deep generative models can generate diverse synthetic sequences reflecting intra-species variation and evolutionary constraints. Alignment and phylogenetic tree estimation using external reference sequences from the BOLD COI-5P database showed that generated sequences exhibited alignment and phylogenetic patterns consistent with real sequence data. 
In conclusion, GRU-based generative models most robustly reproduced evolutionary diversity and biological characteristics of COI sequences.\u003c/p\u003e"},{"header":"5. Conclusion","content":"\u003cp\u003eDNA barcoding sequences are widely used as biomarkers for species identification and biodiversity research. Recently, machine learning\u0026ndash; and deep learning\u0026ndash;based approaches have been increasingly applied to COI data analysis. However, COI data present inherent limitations, characterized by high intra-specific similarity and significant data imbalances across species. In addition, public COI databases often contain extremely limited numbers of sequences per species or high levels of sequence redundancy, making reliable analysis challenging. To address the scarcity of usable data, synthetic data generation methods have been introduced. Traditional methods primarily focused on statistical properties such as nucleotide frequency. However, these approaches have limitations in reflecting the biological characteristics of COI sequences. Generative model\u0026ndash;based approaches have recently been introduced to generate high-quality synthetic data that effectively capture these biological characteristics.\u003c/p\u003e \u003cp\u003eBy comparatively analyzing species-specific COI sequence generation performance across generative models with diverse architectures, we confirmed that GRU-based generative models effectively reproduce both the evolutionary diversity and biological characteristics of COI sequences. In particular, the GRU-based ARLM and GRU-based VAE generated diverse synthetic sequences with minimized redundancy while preserving intra-species diversity, inter-species discriminative structures, and codon-level patterns of real COI data. These findings indicate that such models can serve as a novel simulation approach to mitigate data scarcity, overcoming the limitations of existing statistical simulation methods. 
Moreover, the proposed method can be applied to various biological fields where acquiring real data is challenging, including eDNA-based biodiversity prediction and the simulation of rare or endangered species. Future work will aim to improve the accuracy and biological fidelity of synthetic COI data by incorporating a broader range of taxa and strengthening biologically informed constraints in generative model design.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eAcknowledgements\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFunding\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (RS-2021-NR060121).\u003c/p\u003e\n\u003cp\u003eThis research was supported by a Korea Basic Science Institute (National Research Facilities and Equipment Center) grant funded by the Ministry of Education (RS-2022-NF000922).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthor Contributions\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eCM and DS contributed equally to this work. CM and DS developed the methodology, conceived the experiments, and wrote the manuscript. JJ, CH, HS, HL, and KL performed data curation and reviewed the manuscript. JP and HH validated the results from a biological perspective. YL supervised this work. All authors discussed the results and contributed to the final version of the manuscript.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eData availability\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe datasets analysed during the current study are publicly available from the SupBarcodes repository (http://dmb.iasi.cnr.it/supbarcodes.php). 
The processed datasets generated during the current study are available from the corresponding author on reasonable request.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eEthics approval and consent to participate\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConsent for publication\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCompeting interests\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors declare no competing interests.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eHebert, P. D., Cywinska, A., Ball, S. L. \u0026amp; deWaard, J. R. Biological identifications through DNA barcodes. \u003cem\u003eProc. Biol. Sci.\u003c/em\u003e \u003cb\u003e270\u003c/b\u003e (1512), 313\u0026ndash;321 (2003).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHajibabaei, M., Singer, G. A., Hebert, P. D. \u0026amp; Hickey, D. A. DNA barcoding: how it complements taxonomy, molecular phylogenetics and population genetics. \u003cem\u003eTrends Genet.\u003c/em\u003e \u003cb\u003e23\u003c/b\u003e (4), 167\u0026ndash;172 (2007).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSiti-Azizah, M. N. DNA Barcoding: the Molecular Detective. 3rd Syiah Kuala University Annual International Conference; Syiah Kuala University (2013).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHebert, P. D., Stoeckle, M. Y., Zemlak, T. S. \u0026amp; Francis, C. M. Identification of Birds through DNA Barcodes. \u003cem\u003ePLoS Biol.\u003c/em\u003e \u003cb\u003e2\u003c/b\u003e (10), e312 (2004).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWard, R. D., Zemlak, T. S., Innes, B. H., Last, P. R. \u0026amp; Hebert, P. D. DNA barcoding Australia's fish species. \u003cem\u003ePhilos. Trans. R. Soc. Lond. B Biol. 
Sci.\u003c/em\u003e \u003cb\u003e360\u003c/b\u003e (1462), 1847\u0026ndash;1857 (2005).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eYang, C-H., Wu, K-C., Chuang, L-Y. \u0026amp; Chang, H-W. DeepBarcoding: deep learning for species classification using DNA barcoding. \u003cem\u003eIEEE/ACM Trans. Comput. Biol. Bioinf.\u003c/em\u003e \u003cb\u003e19\u003c/b\u003e (4), 2158\u0026ndash;2165 (2021).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eArias, P. M. et al. BarcodeBERT: Transformers for biodiversity analysis. \u003cem\u003earXiv preprint\u003c/em\u003e arXiv:2311.02401 (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eTaccaliti, E. \u0026amp; Aguilar\u0026ndash;Ruiz, J. S. Improving classification on imbalanced genomic data via KDE\u0026ndash;based synthetic sampling. \u003cem\u003eBioData Min.\u003c/em\u003e \u003cb\u003e18\u003c/b\u003e (1), 60 (2025).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eG\u0026oacute;mez-Mart\u0026iacute;nez, V., Chushig-Muzo, D., Veier\u0026oslash;d, M. B., Granja, C. \u0026amp; Soguero-Ruiz, C. Ensemble feature selection and tabular data augmentation with generative adversarial networks to enhance cutaneous melanoma identification and interpretability. \u003cem\u003eBioData Min.\u003c/em\u003e \u003cb\u003e17\u003c/b\u003e (1), 46 (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhu, Y. et al. Generative AI for controllable protein sequence design: A survey. \u003cem\u003earXiv preprint\u003c/em\u003e arXiv:2402.10516 (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eXia, J., Zhou, J., Chen, S., Ling, T. \u0026amp; Li, S. Z. A comprehensive and systematic review for deep learning-based de novo peptide sequencing. Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence (2025).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLan, L. et al. 
Generative adversarial networks and its applications in biomedical informatics. \u003cem\u003eFront. Public Health\u003c/em\u003e \u003cb\u003e8\u003c/b\u003e, 164 (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAlvi, R. et al. Generative Artificial Intelligence in Bioinformatics: A Systematic Review of Models, Applications, and Methodological Advances. \u003cem\u003earXiv preprint\u003c/em\u003e arXiv:2511.03354 (2025).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRong, R. et al. MB-GAN: Microbiome Simulation via Generative Adversarial Network. \u003cem\u003eGigaScience\u003c/em\u003e \u003cb\u003e10\u003c/b\u003e (2), giab005 (2021).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eYelmen, B. et al. Deep convolutional and conditional neural networks for large-scale genomic data generation. \u003cem\u003ePLoS Comput. Biol.\u003c/em\u003e \u003cb\u003e19\u003c/b\u003e (10), e1011584 (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eXie, S. et al. Deep Generative Models for Discrete Genotype Simulation. \u003cem\u003ebioRxiv\u003c/em\u003e 2025.08.08.669289 (2025).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLi, Z. et al. DiscDiff: Latent diffusion model for DNA sequence generation. \u003cem\u003earXiv preprint\u003c/em\u003e arXiv:2402.06079 (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eXu, L. et al. A generative deep neural network for pan-digestive tract cancer survival analysis. \u003cem\u003eBioData Min.\u003c/em\u003e \u003cb\u003e18\u003c/b\u003e (1), 9 (2025).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhou, J. et al. Novobench: Benchmarking deep learning-based \u003cem\u003eDe Novo\u003c/em\u003e sequencing methods in proteomics. \u003cem\u003eAdv. Neural Inf. Process. Syst.\u003c/em\u003e \u003cb\u003e37\u003c/b\u003e, 104776\u0026ndash;104791 (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKatoh, K., Misawa, K., Kuma, K. \u0026amp; Miyata, T. 
MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. \u003cem\u003eNucleic Acids Res.\u003c/em\u003e \u003cb\u003e30\u003c/b\u003e (14), 3059\u0026ndash;3066 (2002).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBrown, P. F., Della Pietra, V. J., Desouza, P. V., Lai, J. C. \u0026amp; Mercer, R. L. Class-based n-gram models of natural language. \u003cem\u003eComput. Linguistics\u003c/em\u003e \u003cb\u003e18\u003c/b\u003e (4), 467\u0026ndash;480 (1992).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChen, X., Mishra, N., Rohaninejad, M. \u0026amp; Abbeel, P. PixelSNAIL: An improved autoregressive generative model. International Conference on Machine Learning; PMLR (2018).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDou, L. et al. UniSAr: A unified structure-aware autoregressive language model for text-to-SQL. \u003cem\u003earXiv preprint\u003c/em\u003e arXiv:2203.07781 (2022).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eG\u0026oacute;mez-Bombarelli, R. et al. Automatic chemical design using a data-driven continuous representation of molecules. \u003cem\u003eACS Cent. Sci.\u003c/em\u003e \u003cb\u003e4\u003c/b\u003e (2), 268\u0026ndash;276 (2018).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRombach, R., Blattmann, A., Lorenz, D., Esser, P. \u0026amp; Ommer, B. High-resolution image synthesis with latent diffusion models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2022).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMcInnes, L., Healy, J. \u0026amp; Melville, J. UMAP: Uniform manifold approximation and projection for dimension reduction. \u003cem\u003earXiv preprint\u003c/em\u003e arXiv:1802.03426 (2018).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKatoh, K. \u0026amp; Standley, D. M. 
MAFFT multiple sequence alignment software version 7: improvements in performance and usability. \u003cem\u003eMol. Biol. Evol.\u003c/em\u003e \u003cb\u003e30\u003c/b\u003e (4), 772\u0026ndash;780 (2013).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWong, T. et al. IQ-TREE 3: Phylogenomic Inference Software using Complex Evolutionary Models (2025).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRatnasingham, S. \u0026amp; Hebert, P. D. bold: The Barcode of Life Data System. \u003cem\u003eMol. Ecol. Notes\u003c/em\u003e. \u003cb\u003e7\u003c/b\u003e (3), 355\u0026ndash;364 (2007). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttp://www.barcodinglife.org\u003c/span\u003e\u003cspan address=\"http://www.barcodinglife.org\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLetunic, I. \u0026amp; Bork, P. Interactive Tree of Life (iTOL) v6: recent updates to the phylogenetic tree display and annotation tool. \u003cem\u003eNucleic Acids Res.\u003c/em\u003e \u003cb\u003e52\u003c/b\u003e (W1), W78\u0026ndash;W82 (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eOndov, B. D., Bergman, N. H. \u0026amp; Phillippy, A. M. Interactive metagenomic visualization in a Web browser. 
\u003cem\u003eBMC Bioinform.\u003c/em\u003e \u003cb\u003e12\u003c/b\u003e, 385 (2011).\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"scientific-reports","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"scirep","sideBox":"Learn more about [Scientific Reports](http://www.nature.com/srep/)","snPcode":"","submissionUrl":"","title":"Scientific Reports","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"Scientific Reports","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"COI DNA barcoding, deep generative models, synthetic DNA generation, phylogenetic fidelity, biodiversity representation","lastPublishedDoi":"10.21203/rs.3.rs-8982107/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-8982107/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eCytochrome c oxidase subunit I (COI) DNA barcoding is widely used for species identification and biodiversity studies. However, COI datasets exhibit high intra-species similarity and significant inter-species imbalance, which limits sequence analyses. To address data scarcity, deep learning based generative models have been explored for sequence generation. 
We implemented six generative models incorporating gated recurrent unit (GRU) layers, Transformer blocks, and convolutional layers to generate species-specific COI sequences across four taxonomic groups: Cypraeidae, Drosophila, Bats, and Birds. The generated sequences were evaluated in terms of plausibility, phylogenetic consistency, and diversity. Overall, the GRU-based autoregressive language model achieved the best performance. It preserved codon-level structures close to those of real data, with GC₃ content differences (Δ)\u0026thinsp;\u0026le;\u0026thinsp;0.004, codon bias JSD\u0026thinsp;\u0026le;\u0026thinsp;0.013, and ORF mean length differences (Δ)\u0026thinsp;\u0026lt;\u0026thinsp;0.05. It also reproduced genetic structures with intra-species K2P mean differences (Δ)\u0026thinsp;\u0026le;\u0026thinsp;0.13, real\u0026ndash;synthetic K2P mean\u0026thinsp;\u0026le;\u0026thinsp;0.09, and barcode gap rate differences (Δ)\u0026thinsp;\u0026le;\u0026thinsp;\u0026minus;0.6. Additionally, it generated sequences with minimal redundancy, indicated by JSD-kmer\u0026thinsp;\u0026le;\u0026thinsp;0.03, Self-BLEU differences (Δ)\u0026thinsp;\u0026le;\u0026thinsp;0.001, and AA values between 0.54 and 0.75. 
These results suggest that GRU-based COI sequence generation can serve as a robust simulation strategy for addressing data scarcity and imbalance in bioinformatics applications.\u003c/p\u003e","manuscriptTitle":"Benchmarking Generative Models for COI DNA Barcoding","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2026-05-15 07:21:05","doi":"10.21203/rs.3.rs-8982107/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"reviewerAgreed","content":"58600849853451860450563251667846202447","date":"2026-05-11T21:31:03+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"34162840315904301597117639561037819505","date":"2026-05-10T20:55:35+00:00","index":"hide","fulltext":""},{"type":"reviewersInvited","content":"","date":"2026-05-06T10:04:23+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2026-03-06T16:38:27+00:00","index":"","fulltext":""},{"type":"editorInvited","content":"","date":"2026-03-05T15:18:33+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2026-03-04T01:14:33+00:00","index":"","fulltext":""},{"type":"submitted","content":"Scientific Reports","date":"2026-03-04T01:08:58+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"scientific-reports","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"scirep","sideBox":"Learn more about [Scientific Reports](http://www.nature.com/srep/)","snPcode":"","submissionUrl":"","title":"Scientific Reports","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"Scientific Reports","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"e271365d-232a-4139-975f-402a73aeb45f","owner":[],"postedDate":"May 15th, 
2026","published":true,"recentEditorialEvents":[{"type":"reviewerAgreed","content":"58600849853451860450563251667846202447","date":"2026-05-11T21:31:03+00:00","index":98,"fulltext":""},{"type":"reviewerAgreed","content":"34162840315904301597117639561037819505","date":"2026-05-10T20:55:35+00:00","index":95,"fulltext":""},{"type":"reviewersInvited","content":"16","date":"2026-05-06T10:04:23+00:00","index":"","fulltext":""}],"rejectedJournal":[],"revision":"","amendment":"","status":"under-review","subjectAreas":[{"id":68042227,"name":"Biological sciences/Computational biology and bioinformatics"},{"id":68042228,"name":"Biological sciences/Ecology"},{"id":68042229,"name":"Earth and environmental sciences/Ecology"},{"id":68042230,"name":"Biological sciences/Evolution"},{"id":68042231,"name":"Biological sciences/Genetics"}],"tags":[],"updatedAt":"2026-05-15T07:21:05+00:00","versionOfRecord":[],"versionCreatedAt":"2026-05-15 07:21:05","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-8982107","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-8982107","identity":"rs-8982107","version":["v1"]},"buildId":"XKTyCvWXoU3ODBz1xrDgd","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
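
As a rough illustration of the quantities reported in Section 3.4, the sketch below computes the in-clade rate, the non-focal real-tip count, and the synthetic-count-weighted mean on a toy tree. It is not the authors' pipeline: the nested-tuple tree encoding, the `Tip` labels, and the function names are all assumptions made for this example, and it presumes a rooted tree (a tree inferred with IQ-TREE would first need to be rooted and parsed with a phylogenetics library).

```python
from typing import NamedTuple


class Tip(NamedTuple):
    source: str  # "real" or "syn" (hypothetical labels for this sketch)
    taxon: str


def tips(node):
    """Return every Tip below a node; internal nodes are tuples of children."""
    if isinstance(node, Tip):
        return [node]
    return [t for child in node for t in tips(child)]


def mrca_clade(tree, taxon):
    """Smallest subtree that contains all real tips of `taxon`."""
    def n_focal(node):
        return sum(t.source == "real" and t.taxon == taxon for t in tips(node))

    total = n_focal(tree)
    if total == 0:
        raise ValueError(f"no real tips for taxon {taxon!r}")
    node = tree
    while not isinstance(node, Tip):
        # Descend while a single child still holds every focal real tip.
        holder = next((c for c in node if n_focal(c) == total), None)
        if holder is None:
            break
        node = holder
    return node


def clade_stats(tree, taxon):
    """Return (in-clade rate, non-focal real-tip count, n synthetic) for a taxon."""
    inside = tips(mrca_clade(tree, taxon))
    syn_all = sum(t.source == "syn" and t.taxon == taxon for t in tips(tree))
    syn_in = sum(t.source == "syn" and t.taxon == taxon for t in inside)
    non_focal = sum(t.source == "real" and t.taxon != taxon for t in inside)
    rate = syn_in / syn_all if syn_all else float("nan")
    return rate, non_focal, syn_all


def weighted_mean(rates, counts):
    """Mean of per-taxon rates weighted by each taxon's synthetic-sequence count."""
    return sum(r * n for r, n in zip(rates, counts)) / sum(counts)


R, S = "real", "syn"
toy = (
    # Taxon A: two real tips, each paired with a synthetic A sequence.
    ((Tip(R, "A"), Tip(S, "A")), (Tip(R, "A"), Tip(S, "A"))),
    # Taxon B: two real tips, one synthetic B, and one stray synthetic A.
    ((Tip(R, "B"), Tip(S, "A")), (Tip(R, "B"), Tip(S, "B"))),
)

rate_a, nonfocal_a, n_a = clade_stats(toy, "A")
rate_b, nonfocal_b, n_b = clade_stats(toy, "B")
overall = weighted_mean([rate_a, rate_b], [n_a, n_b])
print(rate_a, nonfocal_a, rate_b, nonfocal_b, overall)
```

In the toy tree, one of the three synthetic A sequences falls outside the MRCA-defined real clade of A, so A's in-clade rate drops below 1 while its non-focal real-tip count stays at 0; this mirrors the "count = 0" subset discussed in the text.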

