Tabular Transformer Generative Adversarial Network for Heterogeneous distribution in healthcare

Research Article, Research Square

Ha Ye Jin Kang, Minsam Ko, Kwang Sun Ryu

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-4134206/v1
This work is licensed under a CC BY 4.0 License.
Status: Posted. Version 1 posted.

Abstract

In healthcare, the most common type of data is tabular data, which hold high significance and potential in the field of medical AI. However, privacy concerns have hindered their widespread use. Despite the emergence of synthetic data as a viable solution, the generation of healthcare tabular data (HTD) is complex owing to the extensive interdependencies between the variables within each record that incorporate diverse clinical characteristics, including sensitive information.
To overcome these issues, this study proposes a tabular transformer generative adversarial network (TT-GAN) that generates synthetic data while considering the relationships between the variables in an HTD dataset. Transformers can consider the relationships between the columns in each record using a multi-head attention mechanism. In addition, to address the potential risk of restoring sensitive patient information, the Transformer was employed within a generative adversarial network (GAN) architecture so that the algorithm remains implicit-density-based. To handle the heterogeneous characteristics of the continuous variables in the HTD dataset, a discretization and converter methodology was applied. The experimental results confirmed that the TT-GAN outperformed the Conditional Tabular GAN (CTGAN) and copula GAN. Discretization and converters proved effective with our proposed Transformer algorithm, whereas the same Transformer-based model without discretization and converters performed significantly worse. The CTGAN and copula GAN showed minimal benefit from the discretization and converter methodology. Thus, the TT-GAN exhibited considerable potential in healthcare, demonstrating its ability to generate artificial data that closely resemble real healthcare datasets. The ability of the algorithm to handle mixed variable types efficiently, including polynomial, discrete, and continuous variables, demonstrates its versatility and practicality in healthcare research and data synthesis.

Keywords: tabular Transformer generative adversarial network (TT-GAN), heterogeneous distribution, healthcare tabular data (HTD)

Introduction

Tabular data, which are organized in rows and columns, are the most common type of data across various real-world applications.
Their prevalence in various domains underlines their importance in practical machine learning applications and research environments [1]. In healthcare in particular, where structured information such as patient demographics, diagnoses, and treatments is critical, tabular data play an important role. Healthcare tabular data have tremendous potential for artificial intelligence (AI) research, yet they are less utilized than data in other areas owing to issues related to privacy and data sharing [2], [3]. To overcome these issues, synthetic tabular data have been proposed and have shown reliable research results [4], [5]. However, minimal research has been conducted on generating synthetic data that considers both the relationships between columns and the handling of continuous variables with various distributions.

The complexity of tabular data, particularly in healthcare, is characterized by the relationships between columns, and a single distribution rarely describes the data. This complexity presents challenges in generating synthetic tabular data (STD) from healthcare tabular data (HTD). Addressing these challenges requires an approach that accurately captures and replicates the inherent relationships in tabular healthcare datasets. Recently, the Transformer has been shown to be a good method for generating STD, using multi-head attention to consider the relationships between columns. However, it inherently has difficulty handling continuous attributes appropriately. In addition, most Transformer-based models proposed thus far are intended for prediction tasks. Only a few Transformer algorithms have been used to generate STD; they are not suited to tabular data in healthcare and, in particular, give little consideration to the handling of continuous variables.
To overcome these challenges, we propose the tabular transformer generative adversarial network (TT-GAN) algorithm. The TT-GAN comprises three stages. The first is the discretization stage, in which continuous variables are converted to discrete variables so that the Transformer can be applied. The second is the generation stage, in which synthetic data are generated using a generator and discriminator based on a GAN architecture with a Transformer. The third is the converter stage, in which the discretized columns in the STD are translated back into continuous variables.

The contributions of this study are as follows:

1) TT-GAN for HTD generation: We propose the TT-GAN, which aims to capture the intricate dependencies, irregular patterns, and varied distributions present in real-world healthcare datasets while ensuring that the synthesized data closely align with the statistical characteristics of authentic healthcare information. The TT-GAN demonstrated superior performance on HTD compared with previously proposed GAN algorithms designed for STD synthesis. The performance improvement is attributed to an architecture designed to expose the column-to-column relationships characteristic of healthcare data to the Transformer algorithm. Moreover, our architecture avoids explicit density-based methods, thereby addressing the issue of data privacy, and provides a reasonable method for effectively handling continuous variables.

2) Synth Health Discovery Network: Sharing the constructed generative model and its corresponding code on shared platforms provides opportunities for transparency and reproducibility. This contribution is expected to help advance healthcare research. Sharing our generative model facilitates fair evaluation and contributes to the advancement of the field. Further, sharing code and models for educational purposes is also valuable.
Students, researchers, and practitioners can gain a deeper understanding of techniques and methodologies in the healthcare generative modeling domain. The Synth Health Discovery Network will play an important role in enabling the research ecosystem to promote the effective use of healthcare data and contribute to future healthcare breakthroughs.

The remainder of this paper is organized as follows. The "Related works" section presents the relevant studies that lay the foundation for our contributions. The "Method" section provides detailed information about the TT-GAN model. The "Result" section presents the performance analysis of the model and highlights the findings. The "Discussion" section offers an analysis and explores the implications, comparisons, and future avenues. Finally, the "Conclusion" section summarizes our core findings.

Related works

Conditional tabular GAN (CTGAN) and copula GAN were developed by Xu et al. [6]. Copula GAN is a variant of the CTGAN, a type of GAN designed to generate synthetic tabular data; it incorporates copula functions into the CTGAN to enhance the learning process. Juan Carlos Quirz et al. [7] proposed a machine-learning framework for automating the severity assessment of COVID-19 using clinical and imaging data. To address the imbalance problem in tabular clinical data, they employed the CTGAN model along with various oversampling techniques. Ultimately, logistic regression models trained on balanced synthetic data effectively distinguished between mild and severe cases. Syde et al. [8] developed a fundamental tumor type classification model for decision support. A CTGAN was used to address imbalances in the clinical data, and all evaluation metrics improved when the sample size was increased through the CTGAN. Kang et al. [9] demonstrated the preservation of data with logical relationships while generating STD using a CTGAN. They implemented a divide-and-conquer (DC) approach to mitigate the risk of information loss caused by dependence on condition columns. This DC-based strategy facilitated the creation of an STD that accurately reflects the inherent patterns and relationships within each subset of the original data.

Although much research has been conducted on STD, its implementation in real-world healthcare datasets remains challenging because HTD has the following unique characteristics.

(1) Different types of columns: In general, HTD, which are non-standardized patient records, are heterogeneous and voluminous; that is, the data contain different column types such as numeric, float, integer, and character.

(2) Non-Gaussian distributions: In healthcare, data may exhibit skewness and kurtosis values that deviate significantly from the normal distribution, contain outliers, combine multiple distributions (bimodal or multimodal), or contain insufficient data.

Therefore, generating synthetic HTD requires addressing the complex dependencies, irregularities, and diverse distributions found in real-world datasets. Moreover, privacy must be protected carefully. The inconsistent and diverse distributions and characteristics of such tabular data, together with privacy concerns, hinder the application and diffusion of effective AI algorithms. To address these distributional issues, we applied Transformers to consider the complex relationships between the data columns and validated our proposed research framework.

Method

Tabular Transformer Generative Adversarial Network (TT-GAN)

The proposed TT-GAN should efficiently process HTD with different distributions and generate synthetic data that closely approximate the real data. The relationships between the columns are modeled with the Transformer, and the method follows three steps, as shown in Fig. 1.
In the discretization stage, a clustering algorithm was used to preprocess continuous variables, thereby transforming them into categorical variables. In the generation stage, the categorical features, including the discretized ones, were transformed using ordinal encoding. We incorporated a Transformer encoder that used a multi-head attention mechanism to capture the relational nature of each column and learn categorical features, which were fed into the generator. The generator approximated the probability distribution of real data within a high-dimensional latent space and used multi-head attention from the Transformer encoder to exploit contextual embeddings that capture the relational properties of each column. The discriminator evaluated the data and output probabilities. It minimized binary cross-entropy by aiming for low probabilities for fake data and high probabilities for real data. Moreover, it was verified that the generated data satisfied the specified conditions from the condition vector. In the converter stage, we used a prediction model to convert categorical data that were originally continuous back into continuous data. The TT-GAN successfully generated mixed variable types (multinomial, discrete, and continuous) similar to real tabular data (Algorithm 1).

Algorithm 1. TT-GAN algorithm
Input: real mixed-type (continuous and categorical) data D; list of continuous variables V_continuous; list of categorical variables V_categorical; number of synthetic records N
Output: synthetic mixed-type (continuous and categorical) data D'
    D_categorical ← select V_categorical from D
    D_continuous ← select V_continuous from D
    D_discretized ← Discretization(D_continuous)
    D_all_categorical ← D_discretized + D_categorical
    for each i in V_continuous do
        R_i ← Train Regressor(D_all_categorical, i)
    end for
    G ← Train generator(D_all_categorical)
    D' ← sample(G, N)
    for each i in D_discretized do
        D'_i ← Converter(D', R_i)
    end for

Discretization stage

The first stage was the preprocessing stage. Herein, discretization is the process of transforming continuous properties into discrete properties by forming a group of adjacent intervals that span the range of the property (Algorithm 2). Data discretization with k-means clustering [10] was performed for the numerical variables before applying the Transformer module. K-means clustering is a popular method that groups data points using a distance-based similarity measure, rendering it suitable for the discretization of continuous-valued variables. The k-means clustering algorithm divides the input data into clusters by first assigning k random data points as centroids. Each data point is then assigned to its nearest center to form the initial cluster distribution. This process discretized the data using min-max values, calculated clusters, and distances between clusters. The algorithm iteratively refines the clusters by recalculating the cluster centers as the averages of the values in each cluster and reassigning the data points to the nearest center [11].

Algorithm 2. Discretization
function Discretization(D_continuous):
    D_discretized = []
    for each i in V_continuous do
        Discretize i from D_continuous
        Append i_discretized into D_discretized
    end for
    return D_discretized

Generation Stage

The second stage was the generation stage. A combination of the CTGAN architecture, comprising a generator and discriminator, and a Transformer encoder was used to learn the real data distribution and produce an optimal generative model (Algorithm 3).

The Transformer encoder operates as follows. Its input is an embedded vector from the column embedding. Embedding techniques represent the data as dense vectors so that highly similar data are placed at similar positions in the vector space, which allows similarity to be calculated. Let the categorical features of a record be \(x^{cat}=\{x_{1}^{cat},\ldots,x_{m}^{cat}\}\), with \(i\in \{1,\ldots,m\}\). Mapping each categorical feature \(x_{i}^{cat}\) into a parametric embedding of dimension \(d\) by column embedding yields \(e_{\varphi_{i}}\left(x_{i}^{cat}\right)\in \mathbb{R}^{d}\). The column embedding shares a set of parameters \(c_{i}\) within column \(i\).

The input to the encoder first passes through a self-attention layer, which examines the relationships between the vectors of all columns in the input while encoding one particular column. The output of the self-attention layer is then passed to a feed-forward neural network. The same feed-forward network is applied independently to the vector of each column to create the output. The multi-head attention layer provides multiple "representation spaces": it maintains several sets of projection weights, and after training, each set is multiplied by the input vectors to project them for a different purpose. Because there are several such sets, each vector is represented in several different spaces. The final encoder outputs a representation of the input columns.
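As a rough illustration of the column embedding and self-attention steps described above, the following standard-library sketch embeds one ordinal-encoded record and applies a single attention head over its columns. It is a minimal sketch under stated assumptions, not the TT-GAN implementation: the helper names (`embed_row`, `self_attention`), the random weights, the embedding size, and the column cardinalities are all hypothetical, and a real encoder would use several heads, a feed-forward layer, and learned parameters.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(mat, vec):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * v for w, v in zip(row, vec)) for row in mat]

def random_matrix(rows, cols, rng):
    """Small random weights standing in for learned parameters."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def embed_row(row, tables):
    """Column embedding: look up a d-dimensional vector per column code."""
    return [tables[i][code] for i, code in enumerate(row)]

def self_attention(embs, wq, wk, wv):
    """Single-head scaled dot-product attention across the columns of one row."""
    d = len(embs[0])
    q = [matvec(wq, e) for e in embs]
    k = [matvec(wk, e) for e in embs]
    v = [matvec(wv, e) for e in embs]
    weights, out = [], []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        w = softmax(scores)  # how much this column attends to every column
        weights.append(w)
        out.append([sum(w[j] * v[j][t] for j in range(len(v))) for t in range(d)])
    return out, weights

rng = random.Random(0)
d = 4                          # embedding dimension (assumed)
cardinalities = [3, 2, 5]      # categories per column (assumed)
tables = [random_matrix(c, d, rng) for c in cardinalities]
wq, wk, wv = (random_matrix(d, d, rng) for _ in range(3))

row = [2, 0, 4]                # one ordinal-encoded record
contextual, attn = self_attention(embed_row(row, tables), wq, wk, wv)
```

Each row of `attn` is a probability distribution over the columns of the record, which is the sense in which attention lets the representation of one column depend on all the others.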
To capture all possible correlations between columns, fully connected networks were used in both the generator and critic because the columns in a row do not exhibit a local structure. Both the generator and critic used two fully connected hidden layers. Batch normalization and ReLU activation functions were used in the generator. A synthetic row representation was generated using a mix of activation functions after the two hidden layers: the scalar value \(\alpha_{i}\) was generated by tanh, while the mode indicator \(\beta_{i}\) and discrete values \(d_{i}\) were generated by Gumbel-softmax. The critic used the LeakyReLU function and applied dropout to each hidden layer. Through training based on these steps, the synthetic data generation model generated synthetic data (Algorithm 4).

Algorithm 3. Model Generation
function train_generator(D_all_categorical):
    Initialize generative model G
    Initialize Transformer encoder and its parameters
    for number of training iterations do
        for k steps do
            Sample mini-batch of noise samples from noise prior
            Sample mini-batch of examples from data-generating distribution
            Update the discriminator by ascending its stochastic gradient
        end for
        Sample mini-batch of noise samples from noise prior
        Update G model parameters by descending its stochastic gradient
    end for
    return G

Algorithm 4. Synthetic Generation
function sample(G, N):
    Sample N noise vectors from noise prior distribution
    Generate samples from the generator model G with the noise vectors
    return D'

Converter Stage

In the converter stage, we developed models to predict the original value of each discretized continuous variable based on the original data (Algorithm 5). These models were then applied to the synthetic data to convert the categorical variables back into continuous variables (Algorithm 6).
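The round trip between the discretization and converter stages can be sketched as follows, using only the Python standard library. This is an illustrative sketch under stated assumptions, not the authors' code: the one-dimensional k-means, the function names, and the toy `ages` and `sex` values are hypothetical, and the tree-based ensemble regressors used in the paper are replaced here by a much simpler stand-in (the mean of the original values per categorical context and bin, falling back to the bin mean and then to the cluster center).

```python
import random
from collections import defaultdict
from statistics import mean

def kmeans_1d(values, k, iters=50, seed=0):
    """Plain Lloyd's k-means on one continuous column; returns sorted centers."""
    rng = random.Random(seed)
    centers = sorted(rng.sample(sorted(set(values)), k))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            idx = min(range(k), key=lambda c: abs(v - centers[c]))
            groups[idx].append(v)
        # Keep the old center if a cluster happens to be empty.
        new_centers = [mean(g) if g else centers[i] for i, g in enumerate(groups)]
        if new_centers == centers:
            break
        centers = new_centers
    return sorted(centers)

def discretize(values, centers):
    """Replace each value with the index of its nearest cluster center."""
    return [min(range(len(centers)), key=lambda c: abs(v - centers[c]))
            for v in values]

def train_converter(cat_rows, bins, targets, centers):
    """Stand-in regressor (assumption): mean target per (context, bin)."""
    by_ctx, by_bin = defaultdict(list), defaultdict(list)
    for ctx, b, y in zip(cat_rows, bins, targets):
        by_ctx[(tuple(ctx), b)].append(y)
        by_bin[b].append(y)
    ctx_mean = {key: mean(ys) for key, ys in by_ctx.items()}
    bin_mean = {key: mean(ys) for key, ys in by_bin.items()}
    return ctx_mean, bin_mean, centers

def convert(cat_rows, bins, model):
    """Map discretized synthetic values back to continuous values."""
    ctx_mean, bin_mean, centers = model
    return [ctx_mean.get((tuple(ctx), b), bin_mean.get(b, centers[b]))
            for ctx, b in zip(cat_rows, bins)]

# Toy "real" data: one continuous column and one categorical column.
ages = [31, 35, 38, 52, 55, 58, 71, 74, 79]
sex = [["F"], ["M"], ["F"], ["M"], ["F"], ["M"], ["F"], ["M"], ["F"]]

centers = kmeans_1d(ages, k=3)
bins = discretize(ages, centers)
model = train_converter(sex, bins, ages, centers)

# Toy "synthetic" rows as a generator might emit them (all categorical).
synth_sex = [["F"], ["M"], ["F"]]
synth_bins = [0, 2, 1]
recovered = convert(synth_sex, synth_bins, model)
```

The design point this illustrates is that the generator only ever sees categorical codes, while the mapping back to continuous values is learned separately from the original data.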
We applied various tree-based ensemble models, including Random Forest (RF), Categorical Boosting (CatBoost), Extreme Gradient Boosting (XGBoost), and Light Gradient Boosting Machine (LightGBM). RF, which was originally developed in 1995 by Tin Kam Ho [12], utilizes an ensemble of decision trees, randomly selects subsets of features and samples during training, and provides robustness against overfitting. CatBoost [13] is designed to seamlessly support categorical features and automatically handle their encoding complexity. Known for its speed, performance, and scalability, XGBoost [14] supports regularization, efficiently handles missing data, and facilitates parallel processing. LightGBM [15] is optimized for large and high-dimensional datasets. It efficiently handles categorical features without the need for one-hot encoding and uses a histogram-based approach for faster training.

Algorithm 5. Train Regressor
function train_regressor(D_all_categorical, i):
    Define ensemble regressors (R_i1, R_i2, ..., R_iM)
    for number of regressors do
        Initialize a new regressor
        Train R_i regressors
    end for
    The M ensemble regressors (R_i1, R_i2, ..., R_iM) together predict the final value
    return R_i

Algorithm 6. Converter
function Converter(D', R_i):
    Set empty list of predicted values
    for number of regressors do
        Predict using each regressor in R_i
        Append predicted value to the list
    end for
    Combine predictions from all regressors in the ensemble into D'_i
    return D'_i

Result

Experimental setting

To objectively evaluate the performance of the proposed algorithm, we conducted the following experiments. First, synthetic data were generated using CTGAN, copula GAN, and TT-GAN, each with and without the discretization and converter methodology. Second, we developed prediction models for liver and lung cancer mortality based on the synthetic data.
Third, the models trained on the synthetic data were evaluated by measuring the AUC on the original test data, and the model test was repeated five times.

Dataset

Study Population

Lung and liver cancer data from the Korea Central Cancer Registry at the National Cancer Center (https://kccrsurvey.cancer.go.kr/index.do) were used in this study, and the data were reviewed by the institutional review board [16]. Our study used nonduplicate lung and liver cancer data after excluding missing variables. Using stratified random sampling, the lung cancer data were divided into development (n = 1,616), validation (n = 228), and test (n = 460) groups, and the liver cancer data were divided into development (n = 4,767), validation (n = 681), and test (n = 1,363) groups. The basic characteristics of the datasets showed similar distributions (Appendix 1).

Generation and validation of STDs

The CTGAN, copula GAN, and TT-GAN models were trained for comparison (Appendix 2). Subsequently, we employed RF, CatBoost, XGBoost, and LightGBM as regressors to predict continuous variables and implemented the discretization and converter methodology. To assess prediction performance, we conducted evaluations using the RF, CatBoost, XGBoost, and LightGBM models individually for each of the generated GAN models (Appendix 3). We generated 1,616 lung cancer STD and 4,766 liver cancer STD.

As shown in Table 1, for the lung cancer dataset, the following AUC values were obtained for the original dataset: RF: 85.02%, CatBoost: 86.02%, XGBoost: 84.24%, and LightGBM: 84.49%. Model performance was first observed on the STD generated by each GAN model without the preprocessing stage. For the STD generated by CTGAN, the AUC was 84.00 ± 0.55 for RF, 83.80 ± 0.45 for CatBoost, 81.20 ± 0.48 for XGBoost, and 82.88 ± 0.72 for LightGBM.
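The evaluation protocol from the experimental setting (train on synthetic data, score on the original test data, repeat five times, and report the mean and spread of the AUC) can be sketched as below. This is a minimal standard-library sketch: the pairwise AUC computation is the standard rank-based definition, while the function names and the five AUC values in the usage example are made up purely for illustration, and the use of the sample standard deviation is an assumption because the text does not state which estimator was used.

```python
from statistics import mean, stdev

def auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def summarize(aucs):
    """Format repeated AUC measurements as 'mean ± std' in percent."""
    return f"{100 * mean(aucs):.2f} ± {100 * stdev(aucs):.2f}"

# Scores a model trained on synthetic data might assign to four real test records.
example_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])

# Hypothetical AUCs from five repeated runs (illustrative values, not results).
report = summarize([0.84, 0.85, 0.83, 0.84, 0.84])
```

In practice each of the five AUC values would come from retraining the prediction model on synthetic data and scoring the held-out real test set.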
When the STD was produced by copula GAN, the values were 84.45 ± 0.26 for RF, 81.58 ± 0.77 for CatBoost, 79.40 ± 0.71 for XGBoost, and 84.07 ± 0.58 for LightGBM. The STD generated by TT-GAN yielded an AUC of 81.53 ± 0.46 for RF, 82.64 ± 0.44 for CatBoost, 84.45 ± 0.51 for XGBoost, and 84.32 ± 0.18 for LightGBM.

Model performance was then assessed on the STD generated by each GAN model after the preprocessing stage. The STD generated by CTGAN with the RF classifier yielded an AUC of 82.31 ± 0.50 for RF, 83.68 ± 0.74 for CatBoost, 81.33 ± 0.55 for XGBoost, and 80.59 ± 0.32 for LightGBM. With the CatBoost classifier, the AUC was 83.39 ± 0.33 for RF, 83.93 ± 0.67 for CatBoost, 83.07 ± 1.13 for XGBoost, and 82.38 ± 0.31 for LightGBM. When the XGBoost classifier was used, the AUC was 82.95 ± 0.55 for RF, 83.58 ± 0.24 for CatBoost, 82.09 ± 1.02 for XGBoost, and 82.67 ± 0.54 for LightGBM. With the LightGBM classifier, the AUC was 83.32 ± 0.67 for RF, 82.99 ± 0.28 for CatBoost, 80.76 ± 0.58 for XGBoost, and 82.97 ± 1.71 for LightGBM.

For the STD produced by copula GAN with the RF classifier, the AUC was 81.18 ± 0.83 for RF, 80.50 ± 0.80 for CatBoost, 78.30 ± 0.92 for XGBoost, and 77.78 ± 1.61 for LightGBM. When the CatBoost classifier was utilized, the AUC was 81.90 ± 0.21 for RF, 82.17 ± 0.50 for CatBoost, 79.81 ± 1.01 for XGBoost, and 81.65 ± 1.16 for LightGBM. With the XGBoost classifier, the AUC was 81.50 ± 0.76 for RF, 82.30 ± 0.77 for CatBoost, 80.35 ± 0.60 for XGBoost, and 82.50 ± 0.87 for LightGBM. The LightGBM classifier yielded AUC values of 81.18 ± 0.78 for RF, 80.04 ± 0.92 for CatBoost, 81.36 ± 0.44 for XGBoost, and 81.27 ± 1.07 for LightGBM.

Further, the TT-GAN-derived lung STD, when used with the RF classifier, yielded AUC values of 83.24 ± 0.26 for RF, 83.83 ± 0.13 for CatBoost, 82.99 ± 0.26 for XGBoost, and 82.76 ± 0.19 for LightGBM.
When the CatBoost classifier was used, the AUC was 83.32 ± 0.24 for RF, 83.96 ± 0.19 for CatBoost, 83.10 ± 0.31 for XGBoost, and 82.37 ± 0.18 for LightGBM. With the XGBoost classifier, the AUC was 83.32 ± 0.18 for RF, 84.06 ± 0.15 for CatBoost, 83.29 ± 0.15 for XGBoost, and 84.04 ± 0.20 for LightGBM. When the LightGBM classifier was used, the AUC was 82.32 ± 0.37 for RF, 84.13 ± 0.12 for CatBoost, 83.28 ± 0.46 for XGBoost, and 83.16 ± 0.48 for LightGBM.

Table 1. Performance evaluation of prediction models using lung cancer SSD test dataset (AUC, %; the last four columns are the prediction models)

| Data | Generator | Classifier | RF | CatBoost | XGBoost | LightGBM |
|---|---|---|---|---|---|---|
| Original | - | - | 85.02% | 86.02% | 84.24% | 84.49% |
| Without discretization and converter | CTGAN | - | 84.00 ± 0.55 | 83.80 ± 0.45 | 81.20 ± 0.48 | 82.88 ± 0.72 |
| | Copula GAN | - | 84.45 ± 0.26 | 81.58 ± 0.77 | 79.40 ± 0.71 | 84.07 ± 0.58 |
| | TT-GAN | - | 81.53 ± 0.46 | 82.64 ± 0.44 | 84.45 ± 0.51 | 84.32 ± 0.18 |
| Discretization and converter | CTGAN | RF | 82.31 ± 0.50 | 83.68 ± 0.74 | 81.33 ± 0.55 | 80.59 ± 0.32 |
| | | CatBoost | 83.39 ± 0.33 | 83.93 ± 0.67 | 83.07 ± 1.13 | 82.38 ± 0.31 |
| | | XGBoost | 82.95 ± 0.55 | 83.58 ± 0.24 | 82.09 ± 1.02 | 82.67 ± 0.54 |
| | | LightGBM | 83.32 ± 0.67 | 82.99 ± 0.28 | 80.76 ± 0.58 | 82.97 ± 1.71 |
| | Copula GAN | RF | 81.18 ± 0.83 | 80.50 ± 0.80 | 78.30 ± 0.92 | 77.78 ± 1.61 |
| | | CatBoost | 81.90 ± 0.21 | 82.17 ± 0.50 | 79.81 ± 1.01 | 81.65 ± 1.16 |
| | | XGBoost | 81.50 ± 0.76 | 82.30 ± 0.77 | 80.35 ± 0.60 | 82.50 ± 0.87 |
| | | LightGBM | 81.18 ± 0.78 | 80.04 ± 0.92 | 81.36 ± 0.44 | 81.27 ± 1.07 |
| | TT-GAN | RF | 83.53 ± 0.22 | 83.92 ± 0.44 | 83.19 ± 0.77 | 82.58 ± 0.19 |
| | | CatBoost | 84.69 ± 0.55 | 85.86 ± 0.30 | 85.94 ± 0.51 | 84.55 ± 0.56 |
| | | XGBoost | 84.84 ± 0.47 | 85.91 ± 0.14 | 85.44 ± 0.19 | 85.34 ± 0.60 |
| | | LightGBM | 84.69 ± 0.37 | 85.69 ± 0.09 | 82.97 ± 0.36 | 85.42 ± 0.71 |

In Table 2, the performances of the RF, CatBoost, XGBoost, and LightGBM prediction models for the liver cancer dataset were evaluated using the AUC metric on the test sets. The original dataset showed AUC values of 85.96% for RF, 86.69% for CatBoost, 85.14% for XGBoost, and 85.91% for LightGBM.
Without the preprocessing stage, the STD from CTGAN, exhibited AUC values of 83.31 ± 0.17 for RF, 83.81 ± 0.23 for CatBoost, 81.20 ± 0.50 for XGBoost, and 82.69 ± 0.19 for LightGBM. The STD from copula GAN exhibited AUC values of 82.46 ± 0.07 for RF, 83.61 ± 0.24 for CatBoost, 80.93 ± 0.62 for XGBoost, and 82.53 ± 0.42 for LightGBM. The STD from TT-GAN exhibited AUC values of 80.29 ± 0.14 for RF, 81.98 ± 0.35 for CatBoost, 80.43 ± 0.31 for XGBoost, and 80.33 ± 0.37 for LightGBM. When evaluating the impact of pre-processing, the STD generated from CTGAN, in conjunction with the RF classifier, yielded AUC values of 81.77 ± 0.21 for RF, 82.78 ± 0.52 for CatBoost, 79.60 ± 0.62 for XGBoost, and 80.94 ± 0.34 for LightGBM. The implementation of the CatBoost classifier results in an AUC of 82.65 ± 0.24 for RF, 81.00 ± 0.33 for CatBoost, 77.60 ± 0.27 for XGBoost, and 80.34 ± 0.60 for LightGBM. Employing the XGBoost classifier yielded AUC values of 82.96 ± 0.21 for RF, 82.44 ± 0.50 for CatBoost, 80.43 ± 0.50 for XGBoost, and 81.81 ± 0.48 for LightGBM. Finally, using the LightGBM classifier, AUC values of 82.47 ± 0.43 for RF, 81.47 ± 0.29 for CatBoost, 78.76 ± 0.41 for XGBoost, and 80.34 ± 0.19 for LightGBM were obtained. The STD generated by copula GAN exhibited AUC values of 78.95 ± 0.28 for RF, 71.70 ± 0.70 for CatBoost, 65.62 ± 2.14 for XGBoost, and 74.54 ± 1.42 for LightGBM when utilized by the RF classifier. The CatBoost classifier yielded AUC values of 80.95 ± 0.38 for RF, 79.41 ± 0.69 for CatBoost, 75.10 ± 1.09 for XGBoost, and 78.69 ± 1.41 for LightGBM. The XGBoost classifier yielded AUC values of 78.96 ± 0.45 for RF, 79.75 ± 1.13 for CatBoost, 74.95 ± 1.24 for XGBoost, and 74.30 ± 1.42 for LightGBM. Whereas the LightGBM classifier yielded AUC values of 77.67 ± 1.01 for RF, 70.46 ± 1.15 for CatBoost, 68.46 ± 1.43 for XGBoost, and 71.32 ± 0.56 for LightGBM. The STD obtained using the TT-GAN yielded various AUC. 
When employing the RF classifier, the AUC values were 83.24 ± 0.26 for RF, 83.83 ± 0.13 for CatBoost, 82.99 ± 0.26 for XGBoost, and 82.76 ± 0.19 for LightGBM. The CatBoost classifier yielded AUC values of 83.32 ± 0.24 for RF, 83.96 ± 0.19 for CatBoost, 83.10 ± 0.31 for XGBoost, and 82.37 ± 0.18 for LightGBM. The XGBoost classifier yielded AUC values of 83.32 ± 0.18 for RF, 84.06 ± 0.15 for CatBoost, 83.29 ± 0.15 for XGBoost, and 84.04 ± 0.20 for LightGBM. Finally, the AUC with the LightGBM classifier was 82.32 ± 0.37 for RF, 84.13 ± 0.12 for CatBoost, 83.28 ± 0.46 for XGBoost, and 83.16 ± 0.48 for LightGBM.

Table 2. Performance evaluation of prediction models using liver cancer SSD test dataset (AUC, %; the last four columns are the prediction models)

| Data | Generator | Classifier | RF | CatBoost | XGBoost | LightGBM |
|---|---|---|---|---|---|---|
| Original | - | - | 85.96% | 86.69% | 85.14% | 85.91% |
| Without discretization and converter | CTGAN | - | 83.31 ± 0.17 | 83.81 ± 0.23 | 81.20 ± 0.50 | 82.69 ± 0.19 |
| | Copula GAN | - | 82.46 ± 0.07 | 83.61 ± 0.24 | 80.93 ± 0.62 | 82.53 ± 0.42 |
| | TT-GAN | - | 80.29 ± 0.14 | 81.98 ± 0.35 | 80.43 ± 0.31 | 80.33 ± 0.37 |
| Discretization and converter | CTGAN | RF | 81.77 ± 0.21 | 82.78 ± 0.52 | 79.60 ± 0.62 | 80.94 ± 0.34 |
| | | CatBoost | 82.65 ± 0.24 | 81.00 ± 0.33 | 77.60 ± 0.27 | 80.34 ± 0.60 |
| | | XGBoost | 82.96 ± 0.21 | 82.44 ± 0.50 | 80.43 ± 0.50 | 81.81 ± 0.48 |
| | | LightGBM | 82.47 ± 0.43 | 81.47 ± 0.29 | 78.76 ± 0.41 | 80.34 ± 0.19 |
| | Copula GAN | RF | 78.95 ± 0.28 | 71.70 ± 0.70 | 65.62 ± 2.14 | 74.54 ± 1.42 |
| | | CatBoost | 80.95 ± 0.38 | 79.41 ± 0.69 | 75.10 ± 1.09 | 78.69 ± 1.41 |
| | | XGBoost | 78.96 ± 0.45 | 79.75 ± 1.13 | 74.95 ± 1.24 | 74.30 ± 1.42 |
| | | LightGBM | 77.67 ± 1.01 | 70.46 ± 1.15 | 68.46 ± 1.43 | 71.32 ± 0.56 |
| | TT-GAN | RF | 83.24 ± 0.26 | 83.83 ± 0.13 | 82.99 ± 0.26 | 82.76 ± 0.19 |
| | | CatBoost | 83.32 ± 0.24 | 83.96 ± 0.19 | 83.10 ± 0.31 | 82.37 ± 0.18 |
| | | XGBoost | 83.32 ± 0.18 | 84.06 ± 0.15 | 83.29 ± 0.15 | 84.04 ± 0.20 |
| | | LightGBM | 82.32 ± 0.37 | 84.13 ± 0.12 | 83.28 ± 0.46 | 83.16 ± 0.48 |

The TT-GAN preserved the attributes of the original data and the relationships between variables, thereby maintaining connections between continuous and categorical
values during the generation of the STD. It exhibited good efficacy in preserving real-world patterns and commendable performance in terms of model efficiency.

Discussion

Synthetic data are commonly perceived as irreversibly generated in traditional practice [17], [18]. However, certain techniques that estimate explicit distributions during the generation of synthetic data can, together with the corresponding model, be used to reconstruct the original data. In cases involving sensitive information such as healthcare data, synthetic data must therefore be generated based on implicit density rather than explicit density. This ensures that the generation process does not disclose explicit distributions, thereby mitigating the risks associated with reconstructing the original data. In cases where sensitive information is not included, synthetic data based on explicit density may have higher quality and performance, and the use of explicit density to generate such datasets offers advantages. However, certain studies overlook the distinct differences between explicit and implicit density methods. Consequently, the performance of algorithms is compared and evaluated while disregarding the disparities between explicit and non-explicit density methods [19]-[21]. Such an experimental design is difficult to justify. For sensitive data, evaluating algorithms based on implicit density is the appropriate and objective approach.

There is a high interdependence between the variables in healthcare datasets because clinical datasets often contain multiple individual clinical characteristics in a single record. Therefore, synthesizing data that accurately reflect the relationships between different columns is a critical task.
Implicit models, such as GANs, generate realistic data without explicitly learning or representing the underlying probability distribution. This characteristic mitigates the risk of unintentional disclosure of sensitive information, rendering implicit models a more suitable choice for preserving privacy in healthcare data. However, generative models such as CTGAN and copula GAN encounter challenges when generating realistic HTD. These challenges arise from the intricate nature of real-world healthcare data, in which capturing and replicating complex patterns is a formidable task. Moreover, nonstandard distribution patterns are difficult to learn and reproduce accurately, so the generated samples may fail to represent the complexities inherent in the original data.

Recent advancements in deep learning, particularly those centered on Transformer architectures, have demonstrated promising applications in handling tabular datasets [22]–[24]. A notable development is the use of Transformer-based GANs to generate synthetic data in the text and sequence domains [25], [26]. This study aimed to address these critical issues by generating synthetic data based on Transformers. The Transformer can effectively handle the relationships between columns within each dataset through multi-head attention mechanisms and can therefore be considered an appropriate algorithm for generating synthetic healthcare data. However, one major challenge must be overcome before it can be applied in healthcare: handling the diverse distributions of the continuous variables present in healthcare datasets. Ideally, all the continuous variables would follow a normalized Gaussian distribution during the learning process.
However, cases in which the actual data follow a Gaussian distribution are rare. To address this, we developed TT-GAN as follows. First, all continuous variables were discretized before training the model that generates synthetic data. Next, a model was built to predict the continuous values from the discretized variables. Finally, after the synthetic data were generated, this model was used to convert the discretized variables back into continuous values.

Based on our methodology, TT-GAN was found to be remarkably simple, user-friendly, and powerful. In our experimental results, the Transformer model applying this methodology exhibited outstanding performance. Although the same methodology was applied to CTGAN and copula GAN, the performance improvement was not as pronounced as in the case of the Transformer-based model. This is attributed to the inherent ability of CTGAN and copula GAN to handle continuous variables to a certain extent. In contrast, the traditional Transformer model, which is large language model (LLM)-based, lacked the ability to handle these continuous variables effectively. As observed in our experimental results, synthetic data generated by the Transformer model without the discretization and converter methodology exhibited significantly worse performance. Thus, although Transformer-based synthetic data generation models exhibit significant potential in the healthcare domain, which is characterized by high inter-column interdependence, their capabilities cannot be fully realized without effective handling of continuous variables.

However, the application of discretization and transformers to all healthcare datasets may not be necessary. When a dataset contains few continuous variables, or when such variables have a minor impact on the dependent variables of predictive models, disregarding them may not result in significant differences in performance.

Conclusion

This study proposed TT-GAN as a specialized GAN algorithm for healthcare within the practical constraints of clinical settings. The TT-GAN operated on a three-stage framework comprising discretization, generation, and conversion stages. The discretization and converter methodology was the primary process applied to transform continuous variables into categorical data, thereby facilitating the subsequent vectorization process for the Transformer of the generator. The entire dataset was cast in a categorical format, thereby enabling the Transformer to capture the unique attributes associated with each value. Subsequently, the columns of the generated dataset that were originally continuous were converted back into continuous data by applying a prediction model.
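As a rough illustration of the discretization and conversion stages described above, the following minimal sketch may help; it is not the authors' implementation. The helper names (`kmeans_1d`, `to_bin`, `fit_converter`), the toy columns `age` and `stage`, and the choice of four clusters are all hypothetical. The generation stage is replaced by simple resampling of the discretized rows rather than a Transformer-based GAN, and a group-mean lookup stands in for the prediction models (RF, CatBoost, XGBoost, LightGBM) used in the study.

```python
import random
from collections import defaultdict
from statistics import mean


def kmeans_1d(values, k, iters=50, seed=0):
    """Lloyd's k-means on scalars; returns the centroids in ascending order."""
    rng = random.Random(seed)
    centroids = sorted(rng.sample(sorted(set(values)), k))
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centroids[j]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous centroid.
        updated = sorted(mean(c) if c else centroids[j]
                         for j, c in enumerate(clusters))
        if updated == centroids:
            break
        centroids = updated
    return centroids


def to_bin(value, centroids):
    """Discretization: index of the nearest centroid."""
    return min(range(len(centroids)), key=lambda j: abs(value - centroids[j]))


def fit_converter(rows, target, bin_col, context_cols):
    """Stand-in prediction model (assumption): mean of the original continuous
    value per (bin, context categories), with the per-bin mean as fallback."""
    by_key, by_bin = defaultdict(list), defaultdict(list)
    for r in rows:
        by_key[(r[bin_col], *(r[c] for c in context_cols))].append(r[target])
        by_bin[r[bin_col]].append(r[target])
    key_mean = {key: mean(vals) for key, vals in by_key.items()}
    bin_mean = {b: mean(vals) for b, vals in by_bin.items()}

    def predict(row):
        key = (row[bin_col], *(row[c] for c in context_cols))
        return key_mean.get(key, bin_mean[row[bin_col]])

    return predict


# Toy records (hypothetical): a bimodal "age" column and a categorical "stage".
rng = random.Random(42)
real = []
for _ in range(300):
    stage = rng.choice(["I", "II", "III"])
    centre = 45 if stage == "I" else 70
    real.append({"age": rng.gauss(centre, 4), "stage": stage})

# Stage 1: discretize the continuous column into cluster indices.
centroids = kmeans_1d([r["age"] for r in real], k=4)
for r in real:
    r["age_bin"] = to_bin(r["age"], centroids)

# The converter is fitted on the real data before any synthetic data exist.
convert_age = fit_converter(real, target="age", bin_col="age_bin",
                            context_cols=["stage"])

# Stage 2 (placeholder only): resample all-categorical rows instead of
# training a Transformer-based generator and discriminator.
synthetic = [{"age_bin": r["age_bin"], "stage": r["stage"]}
             for r in rng.choices(real, k=100)]

# Stage 3: convert the discretized column back into continuous values.
for s in synthetic:
    s["age"] = convert_age(s)
```

Because the generator only ever sees categorical codes in this sketch, the bimodal shape of `age` never has to be modeled directly; the continuous detail is restored afterwards by the converter. In practice the group-mean lookup would be swapped for a learned prediction model.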
The integration of the Transformer encoder into the GAN framework ensured that the relational characteristics between the columns were preserved during the generation process. In particular, the TT-GAN exhibited better performance than the representative algorithms CTGAN and copula GAN. Finally, the TT-GAN effectively produced mixed variable types, including multinomial, discrete, and continuous, that closely resemble the characteristics of the original HTD. Moreover, the discretization and converter methodology can be interpreted as a demonstration that existing LLMs have the potential to be used effectively with a wide variety of data.

Abbreviations

HTD: Healthcare tabular data
STD: Synthetic tabular data
GAN: Generative adversarial network
CTGAN: Conditional tabular GAN
RF: Random forest
CatBoost: Category boosting
XGBoost: Extreme gradient boosting
LightGBM: Light gradient boosting machine
AUC: Area under the curve
LLM: Large language model

Declarations

Acknowledgements

This study was supported by a grant (no: 2310440-3) from the National Cancer Center of Korea and by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (no: NRF-2022R1F1A107504).

Availability of data and materials

Anyone can use the original data after registering as a member on the Korea Central Cancer Registry (KCCR) portal [16] and passing the data application and review. Users need to fill out an application form, including a research proposal describing how they will use the data; the data access request will then be assessed by the KCCR and the National Statistics Office. All synthetic data can be shared for research purposes by contacting the authors. Please note that this is a domestic service available only to Koreans. All code for data generation and validation associated with the current submission is available in a GitHub repository [27].
Authors' Contributions

Conceptualization, HYJK, MSK, and KSR; methodology, HYJK, MSK, and KSR; validation, HYJK, MSK, and KSR; investigation, HYJK; data curation, HYJK and KSR; writing (original draft preparation), HYJK and KSR. All the authors assisted in drafting and editing the manuscript.

Funding

No Funding.

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

References

1. Borisov V, Leemann T, Seßler K, Haug J, Pawelczyk M, Kasneci G. Deep Neural Networks and Tabular Data: A Survey. IEEE Trans Neural Netw Learn Syst. 2022;1–21. doi:10.1109/TNNLS.2022.3229161.
2. de Kok JWTM, de la Hoz MÁA, de Jong Y, Brokke V, Elbers PWG, Thoral P, et al. A guide to sharing open healthcare data under the General Data Protection Regulation. Sci Data. 2023;10:404. doi:10.1038/s41597-023-02256-2.
3. Hernandez M, Epelde G, Alberdi A, Cilla R, Rankin D. Synthetic data generation for tabular health records: A systematic review. Neurocomputing. 2022;493:28–45. doi:10.1016/j.neucom.2022.04.053.
4. Giuffrè M, Shung DL. Harnessing the power of synthetic data in healthcare: innovation, application, and privacy. npj Digit Med. 2023;6:186. doi:10.1038/s41746-023-00927-3.
5. Rankin D, Black M, Bond R, Wallace J, Mulvenna M, Epelde G. Reliability of Supervised Machine Learning Using Synthetic Data in Health Care: Model to Preserve Privacy for Data Sharing. JMIR Med Inf. 2020;8:e18910. doi:10.2196/18910.
6. Xu L, Skoularidou M, Cuesta-Infante A, Veeramachaneni K. Modeling tabular data using conditional GAN. Adv Neural Inf Process Syst. 2019;32.
7. Quiroz JC, Feng Y, Cheng Z, Rezazadegan D, Chen P, Lin Q, et al. Development and validation of a machine learning approach for automated severity assessment of COVID-19 based on clinical and imaging data: retrospective study. JMIR Med Inf. 2021;9:e24572. doi:10.2196/24572.
8. Syed ARP, Anbalagan R, Setlur AS, Karunakaran C, Shetty J, Kumar J, et al. Implementation of ensemble machine learning algorithms on exome datasets for predicting early diagnosis of cancers. BMC Bioinform. 2022;23:496. doi:10.1186/s12859-022-05050-w.
9. Kang HYJ, Batbaatar E, Choi DW, Choi KS, Ko M, Ryu KS. Synthetic tabular data based on generative adversarial networks in health care: Generation and validation using the divide-and-conquer strategy. JMIR Med Inf. 2023;24:e47859. doi:10.2196/47859.
10. Khan A, Swaleha Z. Expansion of regularized k means discretization machine learning approach in prognosis of dementia progression. 2020 11th Int Conf Comp Commun Netw Technol (ICCCNT). 2020.
11. Garcia S, Luengo J, Sáez JA, Lopez V, Herrera F. A survey of discretization techniques: Taxonomy and empirical analysis in supervised learning. IEEE Trans Knowl Data Eng. 2012;25:734–50.
12. Ho TK. Random decision forests. Proc 3rd Int Conf Doc Anal Recog. 1995.
13. Dorogush AV, Ershov V, Gulin A. CatBoost: gradient boosting with categorical features support. arXiv preprint 2018; arXiv:1810.11363.
14. Chen T, Guestrin C. XGBoost: A scalable tree boosting system. Proc 22nd ACM SIGKDD Int Conf Knowl Discov Data Min. 2016.
15. Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W, Ye Q, Liu T-Y. LightGBM: A highly efficient gradient boosting decision tree. Adv Neural Inf Process Syst. 2017;30.
16. Home page. Korea Central Cancer Registry. URL: https://kccrsurvey.cancer.go.kr/index.do [accessed 2024-3-08].
17. Ansari AF, Scarlett J, Soh H. A characteristic function approach to deep implicit generative modeling. Proc IEEE/CVF Conf Comp Vis Pattern Recog. 2020.
18. Subakan C, Koyejo O, Smaragdis P. Learning the base distribution in implicit generative models. arXiv preprint 2018; arXiv:1803.04357.
19. Zhang Y, Zaidi NA, Zhou J, Li G. GANBLR: A tabular data generation model. IEEE Int Conf Data Min (ICDM). 2021:181. doi:10.1109/ICDM51629.2021.00103.
20. Zhang Y, Zaidi N, Zhou J, Li G. GANBLR++: Incorporating capacity to generate numeric attributes and leveraging unrestricted Bayesian networks. Proc 2022 SIAM Int Conf Data Mining (SDM), Society for Industrial and Applied Mathematics. 2022.
21. Han P, Xu W, Lin W, Cao J, Liu C, Duan S, et al. C3-TGAN-controllable tabular data synthesis with explicit correlations and property constraints. Authorea Preprints. 2023.
22. Huang X, Khetan A, Cvitkovic M, Karnin Z. TabTransformer: Tabular data modeling using contextual embeddings. arXiv preprint 2020; arXiv:2012.06678.
23. Gorishniy Y, Rubachev I, Khrulkov V, Babenko A. Revisiting deep learning models for tabular data. Adv Neural Inf Process Syst. 2021;34:18932–43.
24. Solatorio AV, Dupriez O. REaLTabFormer: Generating realistic relational and tabular data using transformers. arXiv preprint 2023; arXiv:2302.02041.
25. Diao S, Shen X, Shum K, Song Y, Zhang T. TILGAN: Transformer-based implicit latent GAN for diverse and coherent text generation. Find Ass Comput Linguist ACL-IJCNLP. 2021:4844–58.
26. Li X, Metsis V, Wang H, Ngu AHH. TTS-GAN: A transformer-based time-series generative adversarial network. Int Conf Artif Intell Med. 2022:133–43.
27. Kwang SR. Sally/ttgan. GitHub. URL: https://github.com/KwangSun-Ryu/Sally.git.

Additional Declarations

No competing interests reported.

Supplementary Files

Additionalfile1.Datasetcharacteristics.docx
Additionalfile2.HyperparametersforGANmodelstraining.docx
Additionalfile3.HyperparametersforMLmodels.docx
Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-4134206","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":281656521,"identity":"94fb0874-873f-4aa0-80e2-a75fa3f4cafc","order_by":0,"name":"Ha Ye Jin Kang","email":"","orcid":"","institution":"Hanyang University","correspondingAuthor":false,"prefix":"","firstName":"Ha","middleName":"Ye Jin","lastName":"Kang","suffix":""},{"id":281656522,"identity":"999f4095-90d7-4f88-a8fa-a28ecb2da77d","order_by":1,"name":"Minsam Ko","email":"","orcid":"","institution":"Hanyang University","correspondingAuthor":false,"prefix":"","firstName":"Minsam","middleName":"","lastName":"Ko","suffix":""},{"id":281656524,"identity":"d4cd1a2a-98e5-4ce8-8379-00e9451f1d0a","order_by":2,"name":"Kwang Sun Ryu","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAA10lEQVRIiWNgGAWjYDACZhBRwMDADxOQIE6LAQODZAPRWhigWgwOEKvFnJ354eMCA7vEzbebH35g3GPDIDn7AH4tls1sxsYzDJITt905ZizB8CyNQZovgYCTDvOwSfMYMCduu5HDxsBw4DCDHA8hXxzmYf/NY1CfuHkGWMt/orSwMfMYHE7cIAHWcoBBmrAWNmPpGQbHjWfcSDOWSDiQzCPZQ0jL+cMPPxdUVMv2z0h++OHDATs5iTMEtIAAKDYdG0CsBAYGQs5CaLEnSuUoGAWjYBSMTAAAj2c5IEn+kq4AAAAASUVORK5CYII=","orcid":"","institution":"Graduate School of Cancer Science and Policy, National Cancer Center","correspondingAuthor":true,"prefix":"","firstName":"Kwang","middleName":"Sun","lastName":"Ryu","suffix":""}],"badges":[],"createdAt":"2024-03-20 
05:14:17","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-4134206/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-4134206/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":53418878,"identity":"cede3897-ff74-4bc9-84c1-91bc0cc8fb27","added_by":"auto","created_at":"2024-03-25 18:10:12","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":230413,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eArchitecture of tabular Transformer generative adversarial network (TT-GAN)\u003c/strong\u003e\u003c/p\u003e","description":"","filename":"Figure1.ArchitectureofTTGAN.png","url":"https://assets-eu.researchsquare.com/files/rs-4134206/v1/9c5ceb6f9e5f9d84e5908344.png"},{"id":53417067,"identity":"7add6e9e-df47-4839-933f-9d6e42e7d648","added_by":"auto","created_at":"2024-03-25 18:02:11","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":184234,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eExperiment setting for TT-GAN\u003c/strong\u003e\u003c/p\u003e","description":"","filename":"Figure2.ExperimentalsettingforTTGAN.png","url":"https://assets-eu.researchsquare.com/files/rs-4134206/v1/6cb5534f73ced3dcdef0578c.png"},{"id":63310858,"identity":"d0e26751-ef6f-4ed1-84dc-e05e20ee7772","added_by":"auto","created_at":"2024-08-26 19:51:42","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":1191440,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-4134206/v1/f52280dd-8a26-430a-8f19-5147b5fb70fb.pdf"},{"id":53417070,"identity":"c6abf35c-3277-4025-b82e-519a47c22aea","added_by":"auto","created_at":"2024-03-25 
18:02:12","extension":"docx","order_by":6,"title":"","display":"","copyAsset":false,"role":"supplement","size":34871,"visible":true,"origin":"","legend":"","description":"","filename":"Additionalfile1.Datasetcharacteristics.docx","url":"https://assets-eu.researchsquare.com/files/rs-4134206/v1/8908a080f9588c8c5f056bee.docx"},{"id":53417068,"identity":"8b3081c5-bbbc-405f-8f23-a18545cd4b0b","added_by":"auto","created_at":"2024-03-25 18:02:12","extension":"docx","order_by":7,"title":"","display":"","copyAsset":false,"role":"supplement","size":17973,"visible":true,"origin":"","legend":"","description":"","filename":"Additionalfile2.HyperparametersforGANmodelstraining.docx","url":"https://assets-eu.researchsquare.com/files/rs-4134206/v1/7c57275878b5830744065d60.docx"},{"id":53417071,"identity":"c2898d02-b186-40c5-a2bb-e1d3957bec37","added_by":"auto","created_at":"2024-03-25 18:02:12","extension":"docx","order_by":8,"title":"","display":"","copyAsset":false,"role":"supplement","size":20650,"visible":true,"origin":"","legend":"","description":"","filename":"Additionalfile3.HyperparametersforMLmodels.docx","url":"https://assets-eu.researchsquare.com/files/rs-4134206/v1/2a1ace7b1427a93705175a34.docx"}],"financialInterests":"No competing interests reported.","formattedTitle":"Tabular Transformer Generative Adversarial Network for Heterogeneous distribution in healthcare","fulltext":[{"header":"Introduction","content":"\u003cp\u003eTabular data, which are organized in rows and columns, are the most common type of data across various real-world applications. Its prevalence in various domains underlines its importance in practical machine learning applications and research environments [\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e]. In particular, in healthcare, wherein structured information such as patient demographics, diagnoses, and treatments is critical, tabular data play an important role. 
Tabular data in healthcare have tremendous potential for artificial intelligence (AI) research, which is less utilized in this field than in other areas owing to various issues related to privacy and data sharing [\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e],[\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e]. To overcome these issues, tabular synthetic data have been proposed, which have shown reliable research results [\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e], [\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e]. However, minimal research has been conducted on the generation of synthetic data that considers the relationships between columns and the handling of continuous variables with various distributions. The complexity of tabular data, particularly in healthcare, is characterized by the relationships between columns, rendering the existence of a single distribution rare. This complexity presents challenges in generating synthetic tabular data (STD) in the context of healthcare tabular data (HTD). The addressal of these challenges necessitates an approach that accurately captures and replicates the inherent relationships in tabular healthcare datasets. Recently, the Transformer method has been demonstrated to be a good method for generating STDs using multi-attention to consider the relationships between columns. However, it inherently encounters difficulty in handling continuous attributes on the right side. In addition, the models proposed thus far based on Transformer algorithms are mostly intended for prediction tasks. Only a few transformer algorithms have been used to generate STDs; however, they are not suited to tabular data in healthcare and, in particular, lack deep thought about handling continuous variables. To overcome these challenges, we propose the tabular transformer generative adversarial network (TT-GAN) algorithm. 
The TT-GAN comprises three stages. The first stage was the discretization stage, where continuous variables were converted to discrete variables in order to apply the Transformer method. The second stage is the generation stage, wherein synthetic data were generated using a generator and discriminator based on a GAN architecture with a transformer. The third stage was the converter stage, wherein the discretized columns were translated into continuous variables in the STD.\u003c/p\u003e \u003cp\u003eThe contributions of this study are as follows:\u003c/p\u003e \u003cp\u003e1) TT-GAN for HTD generation: We propose the TT-GAN, which aimed to capture the intricate dependencies, irregular patterns, and varied distributions present in real-world healthcare datasets, while ensuring that the synthesized data closely aligned with the statistical characteristics of authentic healthcare information. TT-GAN demonstrated superior performance with HTD compared with previously proposed generative adversarial network (GAN) algorithms designed for the synthesis of STD. The performance improvement was attributed to the efficient architecture design to apply column-to-column relationships, which are characteristic of healthcare data, to the Transformer algorithm. Moreover, our architecture was designed to avoid explicit density-based methods, thereby overcoming the issue of data privacy, and consequently propose a reasonable method for effectively handling continuous variables.\u003c/p\u003e \u003cp\u003e2) Synth Health Discovery Network: The constructed generative model and its corresponding code on shared platforms provided opportunities for transparency and reproducibility. This active contribution is expected to help advance overall progress in healthcare research. Sharing our generative model facilitates a fair evaluation and contributes to field advancement. Further, sharing codes and models for educational purposes would also be valuable. 
Students, researchers, and practitioners can gain a deeper understanding of optimal techniques and methodologies in the healthcare generative modeling domain. The Synth Health Discovery Network will play an important role in enabling the research ecosystem to promote the effective use of healthcare data and contribute to future healthcare breakthroughs.\u003c/p\u003e \u003cp\u003eThe remainder of this paper is organized as follows. The \"Related Works\" section presents the relevant studies explored to lay the foundation for our contributions. The \"Method\" section provides detailed information about the TT-GAN model. The \"Results\" presents the model's performance analysis and highlights the findings. Further, the \"Discussion\" section offers an analysis and explores the implications, comparisons, and future avenues. Finally, the \u0026ldquo;Conclusion\" section summarizes our core findings.\u003c/p\u003e\n\u003ch3\u003eRelated works\u003c/h3\u003e\n\u003cp\u003eConditional tabular GAN (CTGAN) and copula GAN were developed by Xu et al. [\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e]. Copula GAN is an extension or variant of the CTGAN and is a type of GAN designed to generate synthetic tabular data. copula GAN incorporates copula functions based on the CTGAN to enhance the learning process. Juan Carlos Quirz et al. [\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e] proposed a machine-learning framework for automating the severity assessment of COVID-19 using clinical and imaging data. To address the imbalance problem in tabular clinical data, they employed the CTGAN model, along with various oversampling techniques. Ultimately, logistic regression models with balanced synthetic data effectively distinguished between mild and severe cases. Syde et al. [\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e] developed a fundamental tumor type classification model for decision support. 
An approach involving the utilization of a CTGAN was implemented to address imbalances in clinical data. All evaluation metrics demonstrated an improvement as when increasing the sample size through the application of the CTGAN. Kang et al. [\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e] demonstrated the preservation of data with logical relationships while generating STD using a CTGAN. They implemented a divide-and-conquer approach to mitigate the risk of information loss caused by dependence on condition columns. This DC-based strategy facilitated the creation of an STD that accurately reflects the inherent patterns and relationships within each subset of the Original Data.\u003c/p\u003e \u003cp\u003eAlthough much research has been conducted on STD, its implementation in real-world healthcare datasets remains challenging. This is because the HTD has the following unique characteristics. (1) Different types of columns: In general, HTD, which are non-standardized patient records, are heterogeneous and voluminous, that is, the data contain different column types such as numeric, float, integer, and character. (2) Non-Gaussian distributions: In healthcare, data may exhibit skewness and kurtosis values that deviate significantly from the normal distribution, contain outliers, combine multiple distributions (bimodal or multimodal), or contain insufficient data. Therefore, generating synthetic HTD involves the addressing of the complex dependencies, irregularities, and diverse distributions found in real-world datasets. Moreover, privacy must be protected carefully. The inconsistent and diverse distribution and characteristics of such tabular data, and privacy concerns hinder the application and diffusion of effective algorithms representative of AI. 
To address the distributional issues of these data, we applied transformers to consider the complex relationships between the data columns and validate our proposed research framework.\u003c/p\u003e"},{"header":"Method","content":"\u003cdiv id=\"Sec4\" class=\"Section2\"\u003e \u003ch2\u003eTabular Transformer Generative Adversarial Network (TT-GAN)\u003c/h2\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eThe proposed TT-GAN should efficiently process HTD with different distributions and generate approximately good synthetic data. The relationship between the columns is based on the Transformer and follows three steps, as shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e. In the discretization stage, a clustering algorithm was used to preprocess continuous variables, thereby transforming them into categorical variables. In the generation stage, categorical features were transformed into discretized data using ordinal encoding. We incorporated a Transformer encoder that used a multi-headed attention mechanism to capture the relational nature of each column and learn categorical features, to be fed into the generator. The generator approximated the probability distribution of real data within a high-dimensional latent space and used multi-head attention from the Transformer encoder to exploit contextual embeddings that capture the relational properties of each column and learn categorical features. The discriminator evaluated the data and output probabilities. It minimized binary cross-entropy by aiming for low probabilities for fake data and high probabilities for real data. Moreover, it was verified that the generated data satisfied the specified conditions from the condition vector. In the converter stage, we used a prediction model to convert categorical data that were originally continuous into continuous data. 
The TT-GAN successfully generated mixed variable types (multinomial, discrete, and continuous) similar to real tabular data (Algorithm 1).\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"No\" id=\"Taba\" border=\"1\"\u003e \u003ccolgroup cols=\"1\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eAlgorithm 1. TT-GAN algorithms\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eInput: real mixed type (continuous and categorical) data \u003cem\u003eD\u003c/em\u003e; list of continuous variables \u003cem\u003eVContinuous\u003c/em\u003e; list of categorical variables \u003cem\u003eVCategorical\u003c/em\u003e; number of synthetic data \u003cem\u003eN\u003c/em\u003e\u003c/p\u003e \u003cp\u003eOutput: synthetic mixed type (continuous and categorical) data \u003cem\u003eD\u0026rsquo;\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cem\u003eD\u003c/em\u003e\u003csub\u003e\u003cem\u003ecategorical\u003c/em\u003e\u003c/sub\u003e \u0026larr; select \u003cem\u003eV\u003c/em\u003e\u003csub\u003e\u003cem\u003ecategorical\u003c/em\u003e\u003c/sub\u003e from \u003cem\u003eD\u003c/em\u003e\u003c/p\u003e \u003cp\u003e\u003cem\u003eD\u003c/em\u003e\u003csub\u003e\u003cem\u003econtinuous\u003c/em\u003e\u003c/sub\u003e \u0026larr; select \u003cem\u003eV\u003c/em\u003e\u003csub\u003e\u003cem\u003econtinuous\u003c/em\u003e\u003c/sub\u003e from \u003cem\u003eD\u003c/em\u003e\u003c/p\u003e \u003cp\u003e\u003cem\u003eD\u003c/em\u003e\u003csub\u003e\u003cem\u003ediscretized\u003c/em\u003e\u003c/sub\u003e \u0026larr; \u003cem\u003eDiscretization\u003c/em\u003e 
(\u003cem\u003eD\u003c/em\u003e\u003csub\u003e\u003cem\u003econtinuous\u003c/em\u003e\u003c/sub\u003e)\u003c/p\u003e \u003cp\u003e\u003cem\u003eD\u003c/em\u003e\u003csub\u003e\u003cem\u003eall_categorical\u003c/em\u003e\u003c/sub\u003e \u0026larr; \u003cem\u003eD\u003c/em\u003e\u003csub\u003e\u003cem\u003ediscretized\u003c/em\u003e\u003c/sub\u003e + \u003cem\u003eD\u003c/em\u003e\u003csub\u003e\u003cem\u003ecategorical\u003c/em\u003e\u003c/sub\u003e\u003c/p\u003e \u003cp\u003efor each \u003cem\u003ei\u003c/em\u003e in \u003cem\u003eV\u003c/em\u003e\u003csub\u003e\u003cem\u003econtinuous\u003c/em\u003e\u003c/sub\u003e \u003cem\u003edo\u003c/em\u003e\u003c/p\u003e \u003cp\u003e\u003cem\u003eR\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e \u0026larr; \u003cem\u003eTrain Regressor\u003c/em\u003e (\u003cem\u003eD\u003c/em\u003e\u003csub\u003e\u003cem\u003eall_categorical\u003c/em\u003e\u003c/sub\u003e, \u003cem\u003ei\u003c/em\u003e)\u003c/p\u003e \u003cp\u003eend for\u003c/p\u003e \u003cp\u003e\u003cem\u003eG\u003c/em\u003e \u0026larr; \u003cem\u003eTrain generator\u003c/em\u003e (\u003cem\u003eD\u003c/em\u003e\u003csub\u003e\u003cem\u003eall_categorical\u003c/em\u003e\u003c/sub\u003e)\u003c/p\u003e \u003cp\u003e\u003cem\u003eD\u0026rsquo;\u003c/em\u003e\u0026larr; \u003cem\u003esample (G, N)\u003c/em\u003e\u003c/p\u003e \u003cp\u003efor each \u003cem\u003ei\u003c/em\u003e in \u003cem\u003eDdiscretized\u003c/em\u003e \u003cem\u003edo\u003c/em\u003e\u003c/p\u003e \u003cp\u003e\u003cem\u003eD\u0026rsquo;\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e \u0026larr; \u003cem\u003eConverter\u003c/em\u003e (\u003cem\u003eD\u0026rsquo;\u003c/em\u003e, \u003cem\u003eR\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e)\u003c/p\u003e \u003cp\u003eend for\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e 
\u003cdiv id=\"Sec5\" class=\"Section3\"\u003e \u003ch2\u003eDiscretization stage\u003c/h2\u003e \u003cp\u003eThe first stage was preprocessing. Discretization transforms a continuous attribute into a discrete one by partitioning its range into a set of adjacent intervals (Algorithm 2). The numerical variables were discretized with k-means clustering [\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e] before the Transformer module was applied. K-means clustering groups data points using a distance-based similarity measure, which makes it suitable for discretizing continuous-valued variables. The algorithm first selects k random data points as centroids and assigns each data point to its nearest centroid to form the initial clusters. It then iteratively recalculates each centroid as the mean of the values in its cluster and reassigns the data points to the nearest centroid [\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e]. The data were discretized using the min-max values, the calculated clusters, and the distances between clusters.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"No\" id=\"Tabb\" border=\"1\"\u003e \u003ccolgroup cols=\"1\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eAlgorithm 2. 
Discretization\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003efunction \u003cem\u003eDiscretization (D\u003c/em\u003e\u003csub\u003e\u003cem\u003econtinuous\u003c/em\u003e\u003c/sub\u003e\u003cem\u003e)\u003c/em\u003e:\u003c/p\u003e \u003cp\u003e\u003cem\u003eD\u003c/em\u003e\u003csub\u003e\u003cem\u003ediscretized\u003c/em\u003e\u003c/sub\u003e = \u003cem\u003e[]\u003c/em\u003e\u003c/p\u003e \u003cp\u003e\u003cb\u003efor\u003c/b\u003e each \u003cem\u003ei\u003c/em\u003e in \u003cem\u003eV\u003c/em\u003e\u003csub\u003e\u003cem\u003econtinuous\u003c/em\u003e\u003c/sub\u003e \u003cb\u003edo\u003c/b\u003e\u003c/p\u003e \u003cp\u003eDiscretize \u003cem\u003ei\u003c/em\u003e from \u003cem\u003eD\u003c/em\u003e\u003csub\u003e\u003cem\u003econtinuous\u003c/em\u003e\u003c/sub\u003e\u003c/p\u003e \u003cp\u003eAppend \u003cem\u003ei\u003c/em\u003e\u003csub\u003e\u003cem\u003ediscretized\u003c/em\u003e\u003c/sub\u003e to \u003cem\u003eD\u003c/em\u003e\u003csub\u003e\u003cem\u003ediscretized\u003c/em\u003e\u003c/sub\u003e\u003c/p\u003e \u003cp\u003e\u003cb\u003eend for\u003c/b\u003e\u003c/p\u003e \u003cp\u003e\u003cem\u003ereturn D\u003c/em\u003e\u003csub\u003e\u003cem\u003ediscretized\u003c/em\u003e\u003c/sub\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec6\" class=\"Section3\"\u003e \u003ch2\u003eGeneration Stage\u003c/h2\u003e \u003cp\u003eThe second stage was the generation stage. The CTGAN architecture, comprising a generator and a discriminator, was combined with a Transformer encoder to learn the real data distribution and produce the generative model (Algorithm 3). The Transformer encoder operated as follows. Its input was the embedded vector produced by the column embedding. 
Embedding techniques were used to represent the data as dense vectors, placing highly similar data at nearby positions in the vector space so that similarity could be calculated. We denote the categorical variables as \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\({x}^{cat}=\\{{x}_{1}^{cat},...,{x}_{m}^{cat}\\}\\)\u003c/span\u003e\u003c/span\u003e, where each \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\({x}_{i}^{cat}\\)\u003c/span\u003e\u003c/span\u003e, for \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(i\\in \\{1,...,m\\}\\)\u003c/span\u003e\u003c/span\u003e, is a categorical feature. Column embedding maps each categorical feature \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\({x}_{i}^{cat}\\)\u003c/span\u003e\u003c/span\u003e to a parametric embedding of dimension \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(d\\)\u003c/span\u003e\u003c/span\u003e, yielding \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\({e}_{{\\varphi }_{i}}\\left({x}_{i}^{cat}\\right)\\in {\\mathbb{R}}^{d}\\)\u003c/span\u003e\u003c/span\u003e. The column embedding used shared parameters \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\({c}_{i}\\)\u003c/span\u003e\u003c/span\u003e, of an optimally chosen dimension, for column \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(i\\)\u003c/span\u003e\u003c/span\u003e. The input to the encoder first passed through a self-attention layer, which examined the relationships among all the column vectors in the input while encoding one particular column. The output of the self-attention layer was then passed to a feed-forward neural network. The same feed-forward network was applied independently to the column vector at each position to create the output. The attention layer facilitates the creation of multiple \u0026ldquo;representation spaces.\u0026rdquo; After training, each set of attention weight matrices was multiplied by the input vectors to project them for its own purpose. 
Because there are several such sets, each vector was represented in several different spaces. The final encoder layer output a representation of the input columns. Because the columns in a row do not exhibit a local structure, fully connected networks were used in both the generator and critic to capture all possible correlations between columns. Both the generator and critic used two fully connected hidden layers. The generator used batch normalization and ReLU activation functions, and the synthetic row representation was generated using a mix of activation functions after the two hidden layers: the scalar value \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\({\\alpha }_{i}\\)\u003c/span\u003e\u003c/span\u003e was generated by tanh, while the mode indicator \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\({\\beta }_{i}\\)\u003c/span\u003e\u003c/span\u003e and discrete values \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\({d}_{i}\\)\u003c/span\u003e\u003c/span\u003e were generated by Gumbel-softmax. The critic applied the LeakyReLU function and dropout to each hidden layer. After training based on these steps, the synthetic data generation model generated synthetic data (Algorithm 4).\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"No\" id=\"Tabc\" border=\"1\"\u003e \u003ccolgroup cols=\"1\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eAlgorithm 3. 
Model Generation\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003efunction \u003cem\u003etrain_generator (D\u003c/em\u003e\u003csub\u003e\u003cem\u003eall_categorical\u003c/em\u003e\u003c/sub\u003e\u003cem\u003e)\u003c/em\u003e:\u003c/p\u003e \u003cp\u003eInitialize generative model \u003cem\u003eG\u003c/em\u003e\u003c/p\u003e \u003cp\u003eInitialize Transformer Encoder\u003c/p\u003e \u003cp\u003eInitialize transformer encoder parameters\u003c/p\u003e \u003cp\u003e\u003cb\u003efor\u003c/b\u003e number of training iterations \u003cb\u003edo\u003c/b\u003e\u003c/p\u003e \u003cp\u003e\u003cb\u003efor\u003c/b\u003e \u003cem\u003ek\u003c/em\u003e steps \u003cb\u003edo\u003c/b\u003e\u003c/p\u003e \u003cp\u003eSample mini-batch of noise samples from noise prior\u003c/p\u003e \u003cp\u003eSample mini-batch of examples from data generating distribution\u003c/p\u003e \u003cp\u003eUpdate the discriminator by ascending its stochastic gradient\u003c/p\u003e \u003cp\u003e\u003cb\u003eend for\u003c/b\u003e\u003c/p\u003e \u003cp\u003eSample mini-batch of noise samples from noise prior\u003c/p\u003e \u003cp\u003eUpdate \u003cem\u003eG\u003c/em\u003e model parameters by descending its stochastic gradient\u003c/p\u003e \u003cp\u003e\u003cb\u003eend for\u003c/b\u003e\u003c/p\u003e \u003cp\u003ereturn \u003cem\u003eG\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"No\" id=\"Tabd\" border=\"1\"\u003e \u003ccolgroup cols=\"1\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eAlgorithm 4. 
Synthetic Generation\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003efunction \u003cem\u003esample (G\u003c/em\u003e, \u003cem\u003eN)\u003c/em\u003e:\u003c/p\u003e \u003cp\u003eSample \u003cem\u003eN\u003c/em\u003e noise vectors from noise prior distribution\u003c/p\u003e \u003cp\u003eGenerate samples from the generator model \u003cem\u003eG\u003c/em\u003e with the noise vectors\u003c/p\u003e \u003cp\u003ereturn \u003cem\u003eD\u0026rsquo;\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv id=\"Sec7\" class=\"Section2\"\u003e \u003ch2\u003eConverter Stage\u003c/h2\u003e \u003cp\u003eIn the converter stage, we developed models to predict the original value of the discretized continuous variable based on the original data (Algorithm 5). This model was applied to the synthetic data to convert the categorical variables into continuous variables (Algorithm 6). We applied various tree-based ensemble models, including Random Forest (RF), Categorical Boosting (CatBoost), Extreme Gradient Boosting (XGBoost), and light gradient boosting machine (LightGBM). RF, which was originally developed in 1995 by Tin Kam Ho [\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e], utilizes an ensemble of decision trees, randomly selects subsets of features and samples during training, and provides robustness against overfitting. CatBoost [\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e] is designed to seamlessly support categorical features and automatically handle their encoding complexity. 
Known for its speed, performance, and scalability, XGBoost [\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e] supports regularization, efficiently handles missing data, and facilitates parallel processing. LightGBM [\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e] was optimized for large and high-dimensional datasets. It efficiently handles categorical features without the need for one-hot encoding, and uses a histogram-based approach for faster training.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"No\" id=\"Tabe\" border=\"1\"\u003e \u003ccolgroup cols=\"1\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eAlgorithm 5. Train Regressor\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003efunction \u003cem\u003etrain_regressor (D\u003c/em\u003e\u003csub\u003e\u003cem\u003eall_categorical\u003c/em\u003e\u003c/sub\u003e, \u003cem\u003ei)\u003c/em\u003e:\u003c/p\u003e \u003cp\u003eDefine ensemble regressors (\u003cem\u003eR\u003c/em\u003e\u003csub\u003e\u003cem\u003ei1\u003c/em\u003e\u003c/sub\u003e, \u003cem\u003eR\u003c/em\u003e\u003csub\u003e\u003cem\u003ei2\u003c/em\u003e\u003c/sub\u003e,\u003cem\u003e..., R\u003c/em\u003e\u003csub\u003e\u003cem\u003eiM\u003c/em\u003e\u003c/sub\u003e\u003cem\u003e)\u003c/em\u003e\u003c/p\u003e \u003cp\u003e\u003cb\u003efor\u003c/b\u003e number of regressors \u003cb\u003edo\u003c/b\u003e\u003c/p\u003e \u003cp\u003eInitialize a new regressor\u003c/p\u003e \u003cp\u003eTrain \u003cem\u003eR\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e regressors\u003c/p\u003e \u003cp\u003e\u003cb\u003eend for\u003c/b\u003e\u003c/p\u003e \u003cp\u003eFor a given 
\u003cem\u003eM\u003c/em\u003e ensemble regressor (\u003cem\u003eR\u003c/em\u003e\u003csub\u003e\u003cem\u003ei1\u003c/em\u003e\u003c/sub\u003e, \u003cem\u003eR\u003c/em\u003e\u003csub\u003e\u003cem\u003ei2\u003c/em\u003e\u003c/sub\u003e,\u003cem\u003e..., R\u003c/em\u003e\u003csub\u003e\u003cem\u003eiM\u003c/em\u003e\u003c/sub\u003e\u003cem\u003e)\u003c/em\u003e predicts the final value\u003c/p\u003e \u003cp\u003ereturn \u003cem\u003eR\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"No\" id=\"Tabf\" border=\"1\"\u003e \u003ccolgroup cols=\"1\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eAlgorithm 6. Converter\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003efunction \u003cem\u003eConverter (D\u0026rsquo;\u003c/em\u003e, \u003cem\u003eR\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e\u003cem\u003e)\u003c/em\u003e:\u003c/p\u003e \u003cp\u003eSet empty list of predicted values\u003c/p\u003e \u003cp\u003e\u003cb\u003efor\u003c/b\u003e number of regressors \u003cb\u003edo\u003c/b\u003e\u003c/p\u003e \u003cp\u003ePredict using each regressor in \u003cem\u003eR\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e\u003c/p\u003e \u003cp\u003eAppend predicted value to the list\u003c/p\u003e \u003cp\u003e\u003cb\u003eend for\u003c/b\u003e\u003c/p\u003e \u003cp\u003eCombine predictions from all regressors in the ensemble into \u003cem\u003eD\u0026rsquo;\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e\u003c/p\u003e 
\u003cp\u003ereturn \u003cem\u003eD\u0026rsquo;\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e"},{"header":"Result","content":"\u003cdiv id=\"Sec9\" class=\"Section2\"\u003e\n \u003ch2\u003eExperimental setting\u003c/h2\u003e\n \u003cp\u003eTo objectively evaluate the performance of the proposed algorithm, we conducted the following experiments. First, synthetic data were generated using CTGAN, copula GAN, and TT-GAN, in each case with and without the discretization and converter methodology. Second, we developed prediction models for liver and lung cancer mortality based on the synthetic data. Third, the models trained on the synthetic data were evaluated by measuring the AUC on the original test data, and each model test was repeated five times.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec10\" class=\"Section2\"\u003e\n \u003ch2\u003eDataset\u003c/h2\u003e\n \u003cdiv id=\"Sec11\" class=\"Section3\"\u003e\n \u003ch2\u003eStudy Population\u003c/h2\u003e\n \u003cp\u003eLung and liver cancer data from the Korea Central Cancer Registry at the National Cancer Center (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://kccrsurvey.cancer.go.kr/index.do\u003c/span\u003e\u003c/span\u003e) were used in this study, and the data were reviewed by the institutional review board [\u003cspan class=\"CitationRef\"\u003e16\u003c/span\u003e]. Our study used nonduplicate lung and liver cancer data after excluding records with missing values. 
Using stratified random sampling, the lung cancer data were divided into development (n\u0026thinsp;=\u0026thinsp;1,616), validation (n\u0026thinsp;=\u0026thinsp;228), and test (n\u0026thinsp;=\u0026thinsp;460) groups, and the liver cancer data were divided into development (n\u0026thinsp;=\u0026thinsp;4,767), validation (n\u0026thinsp;=\u0026thinsp;681), and test (n\u0026thinsp;=\u0026thinsp;1,363) groups. The basic characteristics of the datasets showed similar distributions (Appendix 1).\u003c/p\u003e\n \u003c/div\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec12\" class=\"Section2\"\u003e\n \u003ch2\u003eGeneration and validation of STDs\u003c/h2\u003e\n \u003cp\u003eThe CTGAN, copula GAN, and TT-GAN models were trained for comparison (Appendix 2). We then implemented the discretization and converter methodology, employing RF, CatBoost, XGBoost, and LightGBM as regression models to predict the continuous variables. To assess prediction performance, we conducted evaluations using the RF, CatBoost, XGBoost, and LightGBM models individually for each of the generated GAN models (Appendix 3). We generated 1,616 lung cancer and 4,766 liver cancer STD records.\u003c/p\u003e\n \u003cp\u003eAs shown in Table \u003cspan class=\"InternalRef\"\u003e1\u003c/span\u003e, for the lung cancer dataset, the following AUC values were obtained for the original dataset: RF: 85.02%, CatBoost: 86.02%, XGBoost: 84.24%, and LightGBM: 84.49%. Model performance was first examined using the STD generated by each GAN model without the preprocessing stage. For the STD generated by CTGAN, the AUC was 84.00\u0026thinsp;\u0026plusmn;\u0026thinsp;0.55 for RF, 83.80\u0026thinsp;\u0026plusmn;\u0026thinsp;0.45 for CatBoost, 81.20\u0026thinsp;\u0026plusmn;\u0026thinsp;0.48 for XGBoost, and 82.88\u0026thinsp;\u0026plusmn;\u0026thinsp;0.72 for LightGBM. 
When the STD was produced by copula GAN, the AUC was 84.45\u0026thinsp;\u0026plusmn;\u0026thinsp;0.26 for RF, 81.58\u0026thinsp;\u0026plusmn;\u0026thinsp;0.77 for CatBoost, 79.40\u0026thinsp;\u0026plusmn;\u0026thinsp;0.71 for XGBoost, and 84.07\u0026thinsp;\u0026plusmn;\u0026thinsp;0.58 for LightGBM. The STD generated by TT-GAN yielded AUC values of 81.53\u0026thinsp;\u0026plusmn;\u0026thinsp;0.46 for RF, 82.64\u0026thinsp;\u0026plusmn;\u0026thinsp;0.44 for CatBoost, 84.45\u0026thinsp;\u0026plusmn;\u0026thinsp;0.51 for XGBoost, and 84.32\u0026thinsp;\u0026plusmn;\u0026thinsp;0.18 for LightGBM.\u003c/p\u003e\n \u003cp\u003eThe model performance was then assessed using the STD generated by each GAN model after the preprocessing stage. The STD generated by CTGAN with the RF classifier yielded AUC values of 82.31\u0026thinsp;\u0026plusmn;\u0026thinsp;0.50 for RF, 83.68\u0026thinsp;\u0026plusmn;\u0026thinsp;0.74 for CatBoost, 81.33\u0026thinsp;\u0026plusmn;\u0026thinsp;0.55 for XGBoost, and 80.59\u0026thinsp;\u0026plusmn;\u0026thinsp;0.32 for LightGBM. With the CatBoost classifier, the AUC was 83.39\u0026thinsp;\u0026plusmn;\u0026thinsp;0.33 for RF, 83.93\u0026thinsp;\u0026plusmn;\u0026thinsp;0.67 for CatBoost, 83.07\u0026thinsp;\u0026plusmn;\u0026thinsp;1.13 for XGBoost, and 82.38\u0026thinsp;\u0026plusmn;\u0026thinsp;0.31 for LightGBM. When the XGBoost classifier was used, the AUC was 82.95\u0026thinsp;\u0026plusmn;\u0026thinsp;0.55 for RF, 83.58\u0026thinsp;\u0026plusmn;\u0026thinsp;0.24 for CatBoost, 82.09\u0026thinsp;\u0026plusmn;\u0026thinsp;1.02 for XGBoost, and 82.67\u0026thinsp;\u0026plusmn;\u0026thinsp;0.54 for LightGBM. 
With the LightGBM classifier, the AUC was 83.32\u0026thinsp;\u0026plusmn;\u0026thinsp;0.67 for RF, 82.99\u0026thinsp;\u0026plusmn;\u0026thinsp;0.28 for CatBoost, 80.76\u0026thinsp;\u0026plusmn;\u0026thinsp;0.58 for XGBoost, and 82.97\u0026thinsp;\u0026plusmn;\u0026thinsp;1.71 for LightGBM.\u003c/p\u003e\n \u003cp\u003eFor the STD produced by copula GAN with the RF classifier, the AUC was 81.18\u0026thinsp;\u0026plusmn;\u0026thinsp;0.83 for RF, 80.50\u0026thinsp;\u0026plusmn;\u0026thinsp;0.80 for CatBoost, 78.30\u0026thinsp;\u0026plusmn;\u0026thinsp;0.92 for XGBoost, and 77.78\u0026thinsp;\u0026plusmn;\u0026thinsp;1.61 for LightGBM. When the CatBoost classifier was used, the AUC was 81.90\u0026thinsp;\u0026plusmn;\u0026thinsp;0.21 for RF, 82.17\u0026thinsp;\u0026plusmn;\u0026thinsp;0.50 for CatBoost, 79.81\u0026thinsp;\u0026plusmn;\u0026thinsp;1.01 for XGBoost, and 81.65\u0026thinsp;\u0026plusmn;\u0026thinsp;1.16 for LightGBM. With the XGBoost classifier, the AUC was 81.50\u0026thinsp;\u0026plusmn;\u0026thinsp;0.76 for RF, 82.30\u0026thinsp;\u0026plusmn;\u0026thinsp;0.77 for CatBoost, 80.35\u0026thinsp;\u0026plusmn;\u0026thinsp;0.60 for XGBoost, and 82.50\u0026thinsp;\u0026plusmn;\u0026thinsp;0.87 for LightGBM. The LightGBM classifier yielded AUC values of 81.18\u0026thinsp;\u0026plusmn;\u0026thinsp;0.78 for RF, 80.04\u0026thinsp;\u0026plusmn;\u0026thinsp;0.92 for CatBoost, 81.36\u0026thinsp;\u0026plusmn;\u0026thinsp;0.44 for XGBoost, and 81.27\u0026thinsp;\u0026plusmn;\u0026thinsp;1.07 for LightGBM.\u003c/p\u003e\n \u003cp\u003eThe TT-GAN-derived lung STD with the RF classifier yielded AUC values of 83.24\u0026thinsp;\u0026plusmn;\u0026thinsp;0.26 for RF, 83.83\u0026thinsp;\u0026plusmn;\u0026thinsp;0.13 for CatBoost, 82.99\u0026thinsp;\u0026plusmn;\u0026thinsp;0.26 for XGBoost, and 82.76\u0026thinsp;\u0026plusmn;\u0026thinsp;0.19 for LightGBM. 
When the CatBoost classifier was used, the AUC was 83.32\u0026thinsp;\u0026plusmn;\u0026thinsp;0.24 for RF, 83.96\u0026thinsp;\u0026plusmn;\u0026thinsp;0.19 for CatBoost, 83.10\u0026thinsp;\u0026plusmn;\u0026thinsp;0.31 for XGBoost, and 82.37\u0026thinsp;\u0026plusmn;\u0026thinsp;0.18 for LightGBM. With the XGBoost classifier, the AUC was 83.32\u0026thinsp;\u0026plusmn;\u0026thinsp;0.18 for RF, 84.06\u0026thinsp;\u0026plusmn;\u0026thinsp;0.15 for CatBoost, 83.29\u0026thinsp;\u0026plusmn;\u0026thinsp;0.15 for XGBoost, and 84.04\u0026thinsp;\u0026plusmn;\u0026thinsp;0.20 for LightGBM. When the LightGBM classifier was used, the AUC was 82.32\u0026thinsp;\u0026plusmn;\u0026thinsp;0.37 for RF, 84.13\u0026thinsp;\u0026plusmn;\u0026thinsp;0.12 for CatBoost, 83.28\u0026thinsp;\u0026plusmn;\u0026thinsp;0.46 for XGBoost, and 83.16\u0026thinsp;\u0026plusmn;\u0026thinsp;0.48 for LightGBM.\u003c/p\u003e\n \u003cdiv class=\"gridtable\"\u003e\u0026nbsp;\u003ctable id=\"Tab1\" border=\"1\"\u003e\n \u003ccaption language=\"En\"\u003e\n \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e\n \u003cdiv class=\"CaptionContent\"\u003e\n \u003cp\u003ePerformance evaluation of prediction models using lung cancer SSD test dataset\u003c/p\u003e\n \u003c/div\u003e\n \u003c/caption\u003e\n \u003ccolgroup cols=\"8\"\u003e\u003c/colgroup\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\" rowspan=\"2\"\u003e\n \u003cp\u003eData\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\" rowspan=\"2\"\u003e\n \u003cp\u003eGenerator\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\" rowspan=\"2\"\u003e\n \u003cp\u003eClassifier\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\" colspan=\"4\"\u003e\n \u003cp\u003ePrediction model\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\" colspan=\"1\"\u003e\u0026nbsp;\u003c/th\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003cth align=\"left\"\u003e\n 
\u003cp\u003e\u003cstrong\u003eRF\u003c/strong\u003e\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003eCatBoost\u003c/strong\u003e\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003eXGBoost\u003c/strong\u003e\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003eLightGBM\u003c/strong\u003e\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\" colspan=\"1\"\u003e\u0026nbsp;\u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003eOriginal\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e-\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e-\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e85.02%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e86.02%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e84.24%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e84.49%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\" colspan=\"1\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\" rowspan=\"3\"\u003e\n \u003cp\u003e\u003cstrong\u003eWithout Discretization and converter\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCTGAN\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e-\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e84.00\u0026thinsp;\u0026plusmn;\u0026thinsp;0.55\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e83.80\u0026thinsp;\u0026plusmn;\u0026thinsp;0.45\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n 
\u003cp\u003e81.20\u0026thinsp;\u0026plusmn;\u0026thinsp;0.48\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e82.88\u0026thinsp;\u0026plusmn;\u0026thinsp;0.72\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\" colspan=\"1\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCopula GAN\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e-\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e84.45\u0026thinsp;\u0026plusmn;\u0026thinsp;0.26\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e81.58\u0026thinsp;\u0026plusmn;\u0026thinsp;0.77\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e79.40\u0026thinsp;\u0026plusmn;\u0026thinsp;0.71\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e84.07\u0026thinsp;\u0026plusmn;\u0026thinsp;0.58\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\" colspan=\"1\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eTT-GAN\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e-\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e81.53\u0026thinsp;\u0026plusmn;\u0026thinsp;0.46\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e82.64\u0026thinsp;\u0026plusmn;\u0026thinsp;0.44\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e84.45\u0026thinsp;\u0026plusmn;\u0026thinsp;0.51\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e84.32\u0026thinsp;\u0026plusmn;\u0026thinsp;0.18\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\" colspan=\"1\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\" rowspan=\"12\"\u003e\n \u003cp\u003e\u003cstrong\u003eDiscretization 
and converter\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\" rowspan=\"4\"\u003e\n \u003cp\u003eCTGAN\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eRF\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e82.31\u0026thinsp;\u0026plusmn;\u0026thinsp;0.50\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e83.68\u0026thinsp;\u0026plusmn;\u0026thinsp;0.74\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e81.33\u0026thinsp;\u0026plusmn;\u0026thinsp;0.55\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e80.59\u0026thinsp;\u0026plusmn;\u0026thinsp;0.32\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\" colspan=\"1\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCatBoost\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e83.39\u0026thinsp;\u0026plusmn;\u0026thinsp;0.33\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e83.93\u0026thinsp;\u0026plusmn;\u0026thinsp;0.67\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e83.07\u0026thinsp;\u0026plusmn;\u0026thinsp;1.13\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e82.38\u0026thinsp;\u0026plusmn;\u0026thinsp;0.31\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\" colspan=\"1\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eXGBoost\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e82.95\u0026thinsp;\u0026plusmn;\u0026thinsp;0.55\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e83.58\u0026thinsp;\u0026plusmn;\u0026thinsp;0.24\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n 
\u003cp\u003e82.09\u0026thinsp;\u0026plusmn;\u0026thinsp;1.02\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e82.67\u0026thinsp;\u0026plusmn;\u0026thinsp;0.54\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\" colspan=\"1\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eLightGBM\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e83.32\u0026thinsp;\u0026plusmn;\u0026thinsp;0.67\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e82.99\u0026thinsp;\u0026plusmn;\u0026thinsp;0.28\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e80.76\u0026thinsp;\u0026plusmn;\u0026thinsp;0.58\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e82.97\u0026thinsp;\u0026plusmn;\u0026thinsp;1.71\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\" colspan=\"1\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\" rowspan=\"4\"\u003e\n \u003cp\u003eCopula GAN\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eRF\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e81.18\u0026thinsp;\u0026plusmn;\u0026thinsp;0.83\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e80.50\u0026thinsp;\u0026plusmn;\u0026thinsp;0.80\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e78.30\u0026thinsp;\u0026plusmn;\u0026thinsp;0.92\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e77.78\u0026thinsp;\u0026plusmn;\u0026thinsp;1.61\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\" colspan=\"1\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCatBoost\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n 
\u003cp\u003e81.90\u0026thinsp;\u0026plusmn;\u0026thinsp;0.21\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e82.17\u0026thinsp;\u0026plusmn;\u0026thinsp;0.50\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e79.81\u0026thinsp;\u0026plusmn;\u0026thinsp;1.01\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e81.65\u0026thinsp;\u0026plusmn;\u0026thinsp;1.16\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\" colspan=\"1\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eXGBoost\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e81.50\u0026thinsp;\u0026plusmn;\u0026thinsp;0.76\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e82.30\u0026thinsp;\u0026plusmn;\u0026thinsp;0.77\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e80.35\u0026thinsp;\u0026plusmn;\u0026thinsp;0.60\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e82.50\u0026thinsp;\u0026plusmn;\u0026thinsp;0.87\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\" colspan=\"1\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eLightGBM\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e81.18\u0026thinsp;\u0026plusmn;\u0026thinsp;0.78\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e80.04\u0026thinsp;\u0026plusmn;\u0026thinsp;0.92\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e81.36\u0026thinsp;\u0026plusmn;\u0026thinsp;0.44\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e81.27\u0026thinsp;\u0026plusmn;\u0026thinsp;1.07\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\" colspan=\"1\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n 
\u003ctd align=\"left\" rowspan=\"4\"\u003e\n \u003cp\u003eTT-GAN\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eRF\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e83.53\u0026thinsp;\u0026plusmn;\u0026thinsp;0.22\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e83.92\u0026thinsp;\u0026plusmn;\u0026thinsp;0.44\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e83.19\u0026thinsp;\u0026plusmn;\u0026thinsp;0.77\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e82.58\u0026thinsp;\u0026plusmn;\u0026thinsp;0.19\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\" colspan=\"1\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCB\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e84.69\u0026thinsp;\u0026plusmn;\u0026thinsp;0.55\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e85.86\u0026thinsp;\u0026plusmn;\u0026thinsp;0.30\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e85.94\u0026thinsp;\u0026plusmn;\u0026thinsp;0.51\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e84.55\u0026thinsp;\u0026plusmn;\u0026thinsp;0.56\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\" colspan=\"1\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eXGB\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e84.84\u0026thinsp;\u0026plusmn;\u0026thinsp;0.47\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e85.91\u0026thinsp;\u0026plusmn;\u0026thinsp;0.14\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e85.44\u0026thinsp;\u0026plusmn;\u0026thinsp;0.19\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd 
align=\"char\"\u003e\n \u003cp\u003e85.34\u0026thinsp;\u0026plusmn;\u0026thinsp;0.60\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\" colspan=\"1\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eLGBM\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e84.69\u0026thinsp;\u0026plusmn;\u0026thinsp;0.37\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e85.69\u0026thinsp;\u0026plusmn;\u0026thinsp;0.09\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e82.97\u0026thinsp;\u0026plusmn;\u0026thinsp;0.36\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e85.42\u0026thinsp;\u0026plusmn;\u0026thinsp;0.71\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\" colspan=\"1\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n \u003c/div\u003e\n \u003cp\u003eIn Table \u003cspan class=\"InternalRef\"\u003e2\u003c/span\u003e, the performance of the RF, CatBoost, XGBoost, and LightGBM prediction models on the liver cancer dataset was evaluated using the AUC metric on the test sets. The original dataset showed AUC values of 85.96% for RF, 86.69% for CatBoost, 85.14% for XGBoost, and 85.91% for LightGBM. Without the preprocessing stage, the STD from CTGAN exhibited AUC values of 83.31\u0026thinsp;\u0026plusmn;\u0026thinsp;0.17 for RF, 83.81\u0026thinsp;\u0026plusmn;\u0026thinsp;0.23 for CatBoost, 81.20\u0026thinsp;\u0026plusmn;\u0026thinsp;0.50 for XGBoost, and 82.69\u0026thinsp;\u0026plusmn;\u0026thinsp;0.19 for LightGBM. The STD from copula GAN exhibited AUC values of 82.46\u0026thinsp;\u0026plusmn;\u0026thinsp;0.07 for RF, 83.61\u0026thinsp;\u0026plusmn;\u0026thinsp;0.24 for CatBoost, 80.93\u0026thinsp;\u0026plusmn;\u0026thinsp;0.62 for XGBoost, and 82.53\u0026thinsp;\u0026plusmn;\u0026thinsp;0.42 for LightGBM. 
The STD from TT-GAN exhibited AUC values of 80.29\u0026thinsp;\u0026plusmn;\u0026thinsp;0.14 for RF, 81.98\u0026thinsp;\u0026plusmn;\u0026thinsp;0.35 for CatBoost, 80.43\u0026thinsp;\u0026plusmn;\u0026thinsp;0.31 for XGBoost, and 80.33\u0026thinsp;\u0026plusmn;\u0026thinsp;0.37 for LightGBM.\u003c/p\u003e\n \u003cp\u003eWhen evaluating the impact of preprocessing, the STD generated from CTGAN, in conjunction with the RF classifier, yielded AUC values of 81.77\u0026thinsp;\u0026plusmn;\u0026thinsp;0.21 for RF, 82.78\u0026thinsp;\u0026plusmn;\u0026thinsp;0.52 for CatBoost, 79.60\u0026thinsp;\u0026plusmn;\u0026thinsp;0.62 for XGBoost, and 80.94\u0026thinsp;\u0026plusmn;\u0026thinsp;0.34 for LightGBM. The CatBoost classifier yielded AUC values of 82.65\u0026thinsp;\u0026plusmn;\u0026thinsp;0.24 for RF, 81.00\u0026thinsp;\u0026plusmn;\u0026thinsp;0.33 for CatBoost, 77.60\u0026thinsp;\u0026plusmn;\u0026thinsp;0.27 for XGBoost, and 80.34\u0026thinsp;\u0026plusmn;\u0026thinsp;0.60 for LightGBM. The XGBoost classifier yielded AUC values of 82.96\u0026thinsp;\u0026plusmn;\u0026thinsp;0.21 for RF, 82.44\u0026thinsp;\u0026plusmn;\u0026thinsp;0.50 for CatBoost, 80.43\u0026thinsp;\u0026plusmn;\u0026thinsp;0.50 for XGBoost, and 81.81\u0026thinsp;\u0026plusmn;\u0026thinsp;0.48 for LightGBM. 
Finally, using the LightGBM classifier, AUC values of 82.47\u0026thinsp;\u0026plusmn;\u0026thinsp;0.43 for RF, 81.47\u0026thinsp;\u0026plusmn;\u0026thinsp;0.29 for CatBoost, 78.76\u0026thinsp;\u0026plusmn;\u0026thinsp;0.41 for XGBoost, and 80.34\u0026thinsp;\u0026plusmn;\u0026thinsp;0.19 for LightGBM were obtained.\u003c/p\u003e\n \u003cp\u003eThe STD generated by copula GAN exhibited AUC values of 78.95\u0026thinsp;\u0026plusmn;\u0026thinsp;0.28 for RF, 71.70\u0026thinsp;\u0026plusmn;\u0026thinsp;0.70 for CatBoost, 65.62\u0026thinsp;\u0026plusmn;\u0026thinsp;2.14 for XGBoost, and 74.54\u0026thinsp;\u0026plusmn;\u0026thinsp;1.42 for LightGBM when paired with the RF classifier. The CatBoost classifier yielded AUC values of 80.95\u0026thinsp;\u0026plusmn;\u0026thinsp;0.38 for RF, 79.41\u0026thinsp;\u0026plusmn;\u0026thinsp;0.69 for CatBoost, 75.10\u0026thinsp;\u0026plusmn;\u0026thinsp;1.09 for XGBoost, and 78.69\u0026thinsp;\u0026plusmn;\u0026thinsp;1.41 for LightGBM. The XGBoost classifier yielded AUC values of 78.96\u0026thinsp;\u0026plusmn;\u0026thinsp;0.45 for RF, 79.75\u0026thinsp;\u0026plusmn;\u0026thinsp;1.13 for CatBoost, 74.95\u0026thinsp;\u0026plusmn;\u0026thinsp;1.24 for XGBoost, and 74.30\u0026thinsp;\u0026plusmn;\u0026thinsp;1.42 for LightGBM. The LightGBM classifier yielded AUC values of 77.67\u0026thinsp;\u0026plusmn;\u0026thinsp;1.01 for RF, 70.46\u0026thinsp;\u0026plusmn;\u0026thinsp;1.15 for CatBoost, 68.46\u0026thinsp;\u0026plusmn;\u0026thinsp;1.43 for XGBoost, and 71.32\u0026thinsp;\u0026plusmn;\u0026thinsp;0.56 for LightGBM.\u003c/p\u003e\n \u003cp\u003eThe AUC values of the STD generated by TT-GAN varied with the classifier. With the RF classifier, the AUC values were 83.24\u0026thinsp;\u0026plusmn;\u0026thinsp;0.26 for RF, 83.83\u0026thinsp;\u0026plusmn;\u0026thinsp;0.13 for CatBoost, 82.99\u0026thinsp;\u0026plusmn;\u0026thinsp;0.26 for XGBoost, and 82.76\u0026thinsp;\u0026plusmn;\u0026thinsp;0.19 for LightGBM. 
The CatBoost classifier yielded AUC values of 83.32\u0026thinsp;\u0026plusmn;\u0026thinsp;0.24 for RF, 83.96\u0026thinsp;\u0026plusmn;\u0026thinsp;0.19 for CatBoost, 83.10\u0026thinsp;\u0026plusmn;\u0026thinsp;0.31 for XGBoost, and 82.37\u0026thinsp;\u0026plusmn;\u0026thinsp;0.18 for LightGBM. The XGBoost classifier yielded AUC values of 83.32\u0026thinsp;\u0026plusmn;\u0026thinsp;0.18 for RF, 84.06\u0026thinsp;\u0026plusmn;\u0026thinsp;0.15 for CatBoost, 83.29\u0026thinsp;\u0026plusmn;\u0026thinsp;0.15 for XGBoost, and 84.04\u0026thinsp;\u0026plusmn;\u0026thinsp;0.20 for LightGBM. Finally, the AUC with the LightGBM classifier was 82.32\u0026thinsp;\u0026plusmn;\u0026thinsp;0.37 for RF, 84.13\u0026thinsp;\u0026plusmn;\u0026thinsp;0.12 for CatBoost, 83.28\u0026thinsp;\u0026plusmn;\u0026thinsp;0.46 for XGBoost, and 83.16\u0026thinsp;\u0026plusmn;\u0026thinsp;0.48 for LightGBM.\u003c/p\u003e\n \u003cdiv class=\"gridtable\"\u003e\n \u003cdiv align=\"left\" class=\"colspec\"\u003e\u003cbr\u003e\u003c/div\u003e\u0026nbsp;\u003ctable id=\"Tabg\" border=\"1\"\u003e\n \u003ccolgroup cols=\"8\"\u003e\u003c/colgroup\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\" colspan=\"7\"\u003e\n \u003cp\u003eTable \u003cspan class=\"InternalRef\"\u003e2\u003c/span\u003e. 
Performance evaluation of prediction models using liver cancer SSD test dataset\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\" colspan=\"1\"\u003e\u0026nbsp;\u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\" rowspan=\"2\"\u003e\n \u003cp\u003eData\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\" rowspan=\"2\"\u003e\n \u003cp\u003eGenerator\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\" rowspan=\"2\"\u003e\n \u003cp\u003eClassifier\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\" colspan=\"5\"\u003e\n \u003cp\u003ePrediction model\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003eRF\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003eCatBoost\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003eXGBoost\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\" colspan=\"2\"\u003e\n \u003cp\u003e\u003cstrong\u003eLightGBM\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003eOriginal\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e-\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e-\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e85.96%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e86.69%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e85.14%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\" colspan=\"2\"\u003e\n \u003cp\u003e85.91%\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\" rowspan=\"3\"\u003e\n 
\u003cp\u003e\u003cstrong\u003eWithout Discretization and converter\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCTGAN\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e-\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e83.31\u0026thinsp;\u0026plusmn;\u0026thinsp;0.17\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e83.81\u0026thinsp;\u0026plusmn;\u0026thinsp;0.23\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e81.20\u0026thinsp;\u0026plusmn;\u0026thinsp;0.50\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\" colspan=\"2\"\u003e\n \u003cp\u003e82.69\u0026thinsp;\u0026plusmn;\u0026thinsp;0.19\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCopula GAN\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e-\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e82.46\u0026thinsp;\u0026plusmn;\u0026thinsp;0.07\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e83.61\u0026thinsp;\u0026plusmn;\u0026thinsp;0.24\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e80.93\u0026thinsp;\u0026plusmn;\u0026thinsp;0.62\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\" colspan=\"2\"\u003e\n \u003cp\u003e82.53\u0026thinsp;\u0026plusmn;\u0026thinsp;0.42\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eTT-GAN\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e-\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e80.29\u0026thinsp;\u0026plusmn;\u0026thinsp;0.14\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e81.98\u0026thinsp;\u0026plusmn;\u0026thinsp;0.35\u003c/p\u003e\n 
\u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e80.43\u0026thinsp;\u0026plusmn;\u0026thinsp;0.31\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\" colspan=\"2\"\u003e\n \u003cp\u003e80.33\u0026thinsp;\u0026plusmn;\u0026thinsp;0.37\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\" rowspan=\"12\"\u003e\n \u003cp\u003e\u003cstrong\u003eDiscretization and converter\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\" rowspan=\"4\"\u003e\n \u003cp\u003eCTGAN\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eRF\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e81.77\u0026thinsp;\u0026plusmn;\u0026thinsp;0.21\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e82.78\u0026thinsp;\u0026plusmn;\u0026thinsp;0.52\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e79.60\u0026thinsp;\u0026plusmn;\u0026thinsp;0.62\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\" colspan=\"2\"\u003e\n \u003cp\u003e80.94\u0026thinsp;\u0026plusmn;\u0026thinsp;0.34\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCatBoost\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e82.65\u0026thinsp;\u0026plusmn;\u0026thinsp;0.24\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e81.00\u0026thinsp;\u0026plusmn;\u0026thinsp;0.33\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e77.60\u0026thinsp;\u0026plusmn;\u0026thinsp;0.27\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\" colspan=\"2\"\u003e\n \u003cp\u003e80.34\u0026thinsp;\u0026plusmn;\u0026thinsp;0.60\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eXGBoost\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n 
\u003cp\u003e82.96\u0026thinsp;\u0026plusmn;\u0026thinsp;0.21\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e82.44\u0026thinsp;\u0026plusmn;\u0026thinsp;0.50\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e80.43\u0026thinsp;\u0026plusmn;\u0026thinsp;0.50\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\" colspan=\"2\"\u003e\n \u003cp\u003e81.81\u0026thinsp;\u0026plusmn;\u0026thinsp;0.48\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eLightGBM\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e82.47\u0026thinsp;\u0026plusmn;\u0026thinsp;0.43\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e81.47\u0026thinsp;\u0026plusmn;\u0026thinsp;0.29\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e78.76\u0026thinsp;\u0026plusmn;\u0026thinsp;0.41\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\" colspan=\"2\"\u003e\n \u003cp\u003e80.34\u0026thinsp;\u0026plusmn;\u0026thinsp;0.19\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\" rowspan=\"4\"\u003e\n \u003cp\u003eCopula GAN\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eRF\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e78.95\u0026thinsp;\u0026plusmn;\u0026thinsp;0.28\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e71.70\u0026thinsp;\u0026plusmn;\u0026thinsp;0.70\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e65.62\u0026thinsp;\u0026plusmn;\u0026thinsp;2.14\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\" colspan=\"2\"\u003e\n \u003cp\u003e74.54\u0026thinsp;\u0026plusmn;\u0026thinsp;1.42\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCatBoost\u003c/p\u003e\n 
\u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e80.95\u0026thinsp;\u0026plusmn;\u0026thinsp;0.38\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e79.41\u0026thinsp;\u0026plusmn;\u0026thinsp;0.69\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e75.10\u0026thinsp;\u0026plusmn;\u0026thinsp;1.09\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\" colspan=\"2\"\u003e\n \u003cp\u003e78.69\u0026thinsp;\u0026plusmn;\u0026thinsp;1.41\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eXGBoost\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e78.96\u0026thinsp;\u0026plusmn;\u0026thinsp;0.45\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e79.75\u0026thinsp;\u0026plusmn;\u0026thinsp;1.13\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e74.95\u0026thinsp;\u0026plusmn;\u0026thinsp;1.24\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\" colspan=\"2\"\u003e\n \u003cp\u003e74.30\u0026thinsp;\u0026plusmn;\u0026thinsp;1.42\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eLightGBM\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e77.67\u0026thinsp;\u0026plusmn;\u0026thinsp;1.01\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e70.46\u0026thinsp;\u0026plusmn;\u0026thinsp;1.15\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e68.46\u0026thinsp;\u0026plusmn;\u0026thinsp;1.43\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\" colspan=\"2\"\u003e\n \u003cp\u003e71.32\u0026thinsp;\u0026plusmn;\u0026thinsp;0.56\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\" rowspan=\"4\"\u003e\n \u003cp\u003eTT-GAN\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd 
align=\"left\"\u003e\n \u003cp\u003eRF\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e83.24\u0026thinsp;\u0026plusmn;\u0026thinsp;0.26\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e83.83\u0026thinsp;\u0026plusmn;\u0026thinsp;0.13\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e82.99\u0026thinsp;\u0026plusmn;\u0026thinsp;0.26\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\" colspan=\"2\"\u003e\n \u003cp\u003e82.76\u0026thinsp;\u0026plusmn;\u0026thinsp;0.19\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCatBoost\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e83.32\u0026thinsp;\u0026plusmn;\u0026thinsp;0.24\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e83.96\u0026thinsp;\u0026plusmn;\u0026thinsp;0.19\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e83.10\u0026thinsp;\u0026plusmn;\u0026thinsp;0.31\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\" colspan=\"2\"\u003e\n \u003cp\u003e82.37\u0026thinsp;\u0026plusmn;\u0026thinsp;0.18\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eXGBoost\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e83.32\u0026thinsp;\u0026plusmn;\u0026thinsp;0.18\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e84.06\u0026thinsp;\u0026plusmn;\u0026thinsp;0.15\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e83.29\u0026thinsp;\u0026plusmn;\u0026thinsp;0.15\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\" colspan=\"2\"\u003e\n \u003cp\u003e84.04\u0026thinsp;\u0026plusmn;\u0026thinsp;0.20\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n 
\u003cp\u003eLightGBM\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e82.32\u0026thinsp;\u0026plusmn;\u0026thinsp;0.37\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e84.13\u0026thinsp;\u0026plusmn;\u0026thinsp;0.12\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e83.28\u0026thinsp;\u0026plusmn;\u0026thinsp;0.46\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\" colspan=\"2\"\u003e\n \u003cp\u003e83.16\u0026thinsp;\u0026plusmn;\u0026thinsp;0.48\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n \u003c/div\u003e\n \u003cp\u003eThe TT-GAN preserved the attributes of the original data and the relationships between variables, thereby maintaining connections between continuous and categorical values during the generation of the STD. It exhibited good efficacy in safeguarding real-world patterns and commendable performance in terms of model efficiency.\u003c/p\u003e\n\u003c/div\u003e"},{"header":"Discussion","content":"\u003cp\u003eSynthetic data are commonly perceived as irreversibly generated in traditional practice [\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e], [\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e]. However, certain techniques that involve the estimation of explicit distributions during the generation of synthetic data, coupled with the corresponding model, can reconstruct original data. In cases involving sensitive information such as healthcare data, synthetic data must be generated based on implicit density rather than explicit density. This ensures that the generation process adheres to the non-disclosure of explicit distributions, thereby mitigating the risks associated with reconstructing the original data. 
\u003c/p\u003e \u003cp\u003eIn cases where sensitive information is not included, synthetic data based on explicit density may have higher quality and performance. Therefore, the use of explicit density to generate these datasets offers advantages. However, in certain studies, the distinct differences between explicit and implicit density methods are often overlooked. Consequently, the performance of algorithms is compared and evaluated while disregarding the disparities between explicit and implicit density methods [\u003cspan additionalcitationids=\"CR20\" citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e]\u0026ndash;[\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e]. Such an experimental design can be considered unsound. For sensitive data, evaluating algorithms based on implicit density is the more appropriate and objective approach.\u003c/p\u003e \u003cp\u003eThe variables in healthcare datasets are highly interdependent. This is because clinical datasets often contain multiple individual clinical characteristics in a single record. Therefore, synthesizing data that accurately reflect the relationships between different columns is a critical task. Implicit models, such as GANs, generate realistic data without explicitly learning or representing the underlying probability distribution. This inherent characteristic of implicit models mitigates the risk of the unintentional disclosure of sensitive information, thus rendering them a more suitable choice for preserving privacy in healthcare data.\u003c/p\u003e \u003cp\u003eHowever, generative models, such as CTGAN and copula GAN, encounter challenges when tasked with generating realistic HTD. 
These challenges arise from the intricate nature of real-world healthcare data, wherein capturing and replicating complex patterns is a formidable task. Moreover, accurately learning and reproducing nonstandard distribution patterns is difficult and may yield generated samples that cannot appropriately represent the complexities inherent in the original data. Recent advancements in deep learning, particularly those centered on Transformer architectures, have demonstrated promising applications in handling tabular datasets [\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e]\u0026ndash;[\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e]. A notable development involves the implementation of a Transformer-based GAN for the generation of synthetic data in the text and sequence areas [\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e], [\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eThis study aimed to address these critical issues by generating synthetic data based on Transformers. The Transformer can effectively handle the relationships between columns within each dataset through multi-head attention mechanisms. It can therefore be considered an appropriate algorithm for generating synthetic healthcare data. However, one major challenge that must be overcome before its application in healthcare is the processing of continuous variables.\u003c/p\u003e \u003cp\u003eWhen generating synthetic data for healthcare, dealing with the diverse distributions of continuous variables present in healthcare datasets is a major challenge. Ideally, all the continuous variables would follow a normalized Gaussian distribution during the learning process. However, cases wherein the actual data follow a Gaussian distribution are rare. To address this, we developed TT-GAN around a discretization and converter methodology. 
First, all continuous variables were discretized before training a model to generate synthetic data. Next, a model was built to predict the continuous values from these discretized variables. Finally, after the synthetic data were generated, this model was used to convert the discretized variables back into continuous values.\u003c/p\u003e \u003cp\u003eBased on our methodology, TT-GAN was found to be simple, user-friendly, and powerful. In our experimental results, the Transformer model applying this methodology exhibited outstanding performance. Although the same methodology was applied to CTGAN and copula GAN, their performance improvement was not as pronounced as that of the Transformer-based model. 
This is attributed to the inherent ability of CTGAN and copula GAN to handle continuous variables to a certain extent. In contrast, the traditional Transformer model, which is based on large language models (LLMs), lacked the ability to handle these continuous variables effectively. As observed in our experimental results, synthetic data generated by the Transformer model without the discretization and converter methodology exhibited significantly worse performance. Thus, although Transformer-based synthetic data generation models exhibit significant potential in the healthcare domain characterized by high inter-column interdependence, their capabilities cannot be fully realized without effective handling of continuous variables. However, the application of discretization and transformers to all healthcare datasets may not be necessary. In cases involving few continuous variables, or wherein such variables have a minor impact on the dependent variables of predictive models, disregarding them may not result in significant differences in performance.\u003c/p\u003e"},{"header":"Conclusion","content":"\u003cp\u003eThis study proposed TT-GAN as a specialized GAN algorithm for healthcare within the practical constraints of clinical settings. TT-GAN operates on a three-stage framework: discretization, generation, and conversion. The discretization and converter methodology was the primary process applied to transform continuous variables into categorical data, thereby facilitating the subsequent vectorization process for the Transformer of the generator. The entire dataset was cast in a categorical format, thereby enabling the Transformer to capture the unique attributes associated with each value. Subsequently, the discretized variables of the generated dataset were converted back into continuous data by applying a prediction model. 
The integration of the Transformer encoder into the GAN framework ensured that the relational characteristics between the columns were preserved during the generation process. In particular, the TT-GAN exhibited better performance than the representative algorithms CTGAN and copula GAN.\u003c/p\u003e \u003cp\u003eFinally, the TT-GAN effectively produced mixed variable types, including multinomial, discrete, and continuous, which closely resemble the characteristics of the original HTD. In particular, the discretization and converter methodology could be interpreted as a demonstration of the potential of existing LLMs to be used effectively with a wide variety of data.\u003c/p\u003e"},{"header":"Abbreviations","content":"\u003cp\u003eHTD\u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp;\u0026nbsp;Healthcare tabular data\u003c/p\u003e\n\u003cp\u003eSTD\u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp;\u0026nbsp;Synthetic tabular data\u003c/p\u003e\n\u003cp\u003eGAN\u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp;Generative adversarial network\u003c/p\u003e\n\u003cp\u003eCTGAN\u0026nbsp;\u0026nbsp;\u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp;\u0026nbsp;Conditional tabular GAN\u003c/p\u003e\n\u003cp\u003eRF\u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp;Random forest\u003c/p\u003e\n\u003cp\u003eCatBoost\u0026nbsp;\u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp;\u0026nbsp;Category boosting\u003c/p\u003e\n\u003cp\u003eXGBoost\u0026nbsp;\u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp;\u0026nbsp;Extreme gradient boosting\u003c/p\u003e\n\u003cp\u003eLightGBM\u0026nbsp; \u0026nbsp; \u0026nbsp; 
\u0026nbsp; \u0026nbsp;\u0026nbsp;Light gradient boosting machine\u003c/p\u003e\n\u003cp\u003eAUC\u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp;Area under the curve\u003c/p\u003e\n\u003cp\u003eLLM \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp;Large language model\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eAcknowledgements\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis study was supported by a grant (no: 2310440-3) from the National Cancer Center of Korea and by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (no: NRF-2022R1F1A107504).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAvailability of data and materials\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe original data are available to anyone who registers as a member on the Korea Central Cancer Registry (KCCR) portal [16] and completes the data application and review process. Users must submit an application form that includes a research proposal describing how they will use the data; the access request is then assessed by the KCCR and the National Statistics Office. 
All synthetic data can be shared for research purposes by contacting the authors. Please note that this is a domestic service available only to Koreans.\u003c/p\u003e\n\u003cp\u003eAll code for data generation and validation associated with the current submission is available in a GitHub repository [27].\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthors\u0026rsquo; Contributions\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eConceptualization: HYJK, MSK, and KSR; methodology: HYJK, MSK, and KSR; validation: HYJK, MSK, and KSR; investigation: HYJK; data curation: HYJK and KSR; writing (original draft preparation): HYJK and KSR. All the authors assisted in drafting and editing the manuscript.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFunding\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNo Funding.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eEthics approval and consent to participate\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConsent for publication\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCompeting interests\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors declare that they have no competing interests.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eBorisov V, Leemann T, Se\u0026szlig;ler K, Haug J, Pawelczyk M, Kasneci G. Deep Neural Networks and Tabular Data: A Survey. IEEE Trans Neural Netw Learn Syst. 2022;1\u0026ndash;21. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1109/TNNLS.2022.3229161\u003c/span\u003e\u003cspan address=\"10.1109/TNNLS.2022.3229161\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ede Kok JWTM, de la Hoz M\u0026Aacute;A, de Jong Y, Brokke V, Elbers PWG, Thoral P, et al. A guide to sharing open healthcare data under the General Data Protection Regulation. Sci Data. 2023;10:404. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1038/s41597-023-02256-2\u003c/span\u003e\u003cspan address=\"10.1038/s41597-023-02256-2\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHernandez M, Epelde G, Alberdi A, Cilla R, Rankin D. Synthetic data generation for tabular health records: A systematic review. Neurocomputing. 2022;493:28\u0026ndash;45. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1016/j.neucom.2022.04.053\u003c/span\u003e\u003cspan address=\"10.1016/j.neucom.2022.04.053\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGiuffr\u0026egrave; M, Shung DL. Harnessing the power of synthetic data in healthcare: innovation, application, and privacy. npj Digit Med. 2023;6:186. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1038/s41746-023-00927-3\u003c/span\u003e\u003cspan address=\"10.1038/s41746-023-00927-3\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRankin D, Black M, Bond R, Wallace J, Mulvenna M, Epelde G. Reliability of Supervised Machine Learning Using Synthetic Data in Health Care: Model to Preserve Privacy for Data Sharing. JMIR Med Inf. 2020;8:e18910. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.2196/18910\u003c/span\u003e\u003cspan address=\"10.2196/18910\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eXu L, Skoularidou M, Cuesta-Infante A, Veeramachaneni K. Modeling tabular data using conditional GAN. Adv Neural Inf Process Syst 2019;32.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eQuiroz JC, Feng Y, Cheng Z, Rezazadegan D, Chen P, Lin Q, et al. Development and validation of a machine learning approach for automated severity assessment of COVID-19 based on clinical and imaging data: retrospective study. JMIR Med Inf. 2021;9:e24572. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.2196/24572\u003c/span\u003e\u003cspan address=\"10.2196/24572\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSyed ARP, Anbalagan R, Setlur AS, Karunakaran C, Shetty J, Kumar J, et al. Implementation of ensemble machine learning algorithms on exome datasets for predicting early diagnosis of cancers. BMC Bioinform. 2022;23:496. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1186/s12859-022-05050-w\u003c/span\u003e\u003cspan address=\"10.1186/s12859-022-05050-w\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKang HYJ, Batbaatar E, Choi DW, Choi KS, Ko M, Ryu KS. Synthetic tabular data based on generative adversarial networks in health care: Generation and validation using the divide-and-conquer strategy. JMIR Med Inf. 2023;24:e47859. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.2196/47859\u003c/span\u003e\u003cspan address=\"10.2196/47859\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKhan A, Zubair S. Expansion of regularized k means discretization machine learning approach in prognosis of dementia progression. 2020 11th Int Conf Comp Commun Netw Technol (ICCCNT) 2020.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGarcia S, Luengo J, S\u0026aacute;ez JA, Lopez V, Herrera F. A survey of discretization techniques: Taxonomy and empirical analysis in supervised learning. IEEE Trans Knowl Data Eng. 2012;25:734\u0026ndash;50.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHo TK. Random decision forests. Proc 3rd Int Conf Doc Anal Recog 1995.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDorogush AV, Ershov V, Gulin A. CatBoost: gradient boosting with categorical features support. arXiv preprint 2018; arXiv:1810.11363.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChen T, Guestrin C. XGBoost: A scalable tree boosting system. Proc 22nd ACM SIGKDD Int Conf Knowl Discov Data Min 2016.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKe G, Meng Q, Finley T, Wang T, Chen W, Ma W, Ye Q, Liu TY. LightGBM: A highly efficient gradient boosting decision tree. Adv Neural Inf Process Syst 2017;30.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHome page. Korea Central Cancer Registry. URL: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://kccrsurvey.cancer.go.kr/index.do\u003c/span\u003e\u003cspan address=\"https://kccrsurvey.cancer.go.kr/index.do\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e [accessed 2024-03-08].\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAnsari AF, Scarlett J, Soh H. 
A characteristic function approach to deep implicit generative modeling. Proc IEEE/CVF Conf Comp Vis Pattern Recog; 2020.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSubakan C, Koyejo O, Smaragdis P. Learning the base distribution in implicit generative models. arXiv preprint 2018; arXiv:1803.04357.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhang Y, Zaidi NA, Zhou J, Li G. GANBLR: A tabular data generation model. IEEE Int Conf Data Min (ICDM) 2021:181. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1109/ICDM51629.2021.00103\u003c/span\u003e\u003cspan address=\"10.1109/ICDM51629.2021.00103\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhang Y, Zaidi N, Zhou J, Li G. GANBLR++: Incorporating capacity to generate numeric attributes and leveraging unrestricted Bayesian networks. Proc 2022 SIAM Int Conf Data Mining (SDM), Society for Industrial and Applied Mathematics 2022.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHan P, Xu W, Lin W, Cao J, Liu C, Duan S, et al. C3-TGAN-controllable tabular data synthesis with explicit correlations and property constraints. Authorea Preprints; 2023.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHuang X, Khetan A, Cvitkovic M, Karnin Z. TabTransformer: Tabular data modeling using contextual embeddings. arXiv preprint 2020; arXiv:2012.06678.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGorishniy Y, Rubachev I, Khrulkov V, Babenko A. Revisiting deep learning models for tabular data. Adv Neural Inf Process Syst. 2021;34:18932\u0026ndash;43.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSolatorio AV, Dupriez O. REaLTabFormer: Generating realistic relational and tabular data using transformers. 
arXiv preprint 2023; arXiv:2302.02041.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDiao S, Shen X, Shum K, Song Y, Zhang T. TILGAN: Transformer-based implicit latent GAN for diverse and coherent text generation. Find Assoc Comput Linguist ACL-IJCNLP 2021:4844\u0026ndash;58.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLi X, Metsis V, Wang H, Ngu AHH. TTS-GAN: A transformer-based time-series generative adversarial network. Int Conf Artif Intell Med 2022:133\u0026ndash;43.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRyu KS. Sally/ttgan. GitHub. URL: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://github.com/KwangSun-Ryu/Sally.git\u003c/span\u003e\u003cspan address=\"https://github.com/KwangSun-Ryu/Sally.git\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"tabular Transformer generative adversarial network (TT-GAN), heterogeneous distribution, healthcare tabular data 
(HTD)","lastPublishedDoi":"10.21203/rs.3.rs-4134206/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-4134206/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eIn healthcare, the most common type of data is tabular data, which hold high significance and potential in the field of medical AI. However, privacy concerns have hindered their widespread use. Despite the emergence of synthetic data as a viable solution, the generation of healthcare tabular data (HTD) is complex owing to the extensive interdependencies between the variables within each record that incorporate diverse clinical characteristics, including sensitive information. To overcome these issues, this study proposed a tabular transformer generative adversarial network (TT-GAN) to generate synthetic data that can effectively consider the relationships between variables potentially present in the HTD dataset. Transformers can consider the relationships between the columns in each record using a multi-attention mechanism. In addition, to address the potential risk of restoring sensitive data in patient information, a Transformer was employed in a generative adversarial network (GAN) architecture, to ensure an implicit-based algorithm. To consider the heterogeneous characteristics of the continuous variables in the HTD dataset, the discretization and converter methodology were applied. The experimental results confirmed the superior performance of the TT-GAN over the Conditional Tabular GAN (CTGAN) and copula GAN. Discretization and converters were proven to be effective using our proposed Transformer algorithm. However, the application of the same methodology to Transformer-based models without discretization and converters exhibited a significantly inferior performance. The CTGAN and copula GAN indicated minimal effectiveness with discretization and converter methodologies. 
Thus, the TT-GAN exhibited considerable potential in healthcare, demonstrating its ability to generate artificial data that closely resembled real healthcare datasets. The ability of the algorithm to handle different types of mixed variables efficiently, including polynomial, discrete, and continuous variables, demonstrated its versatility and practicality in health care research and data synthesis.\u003c/p\u003e","manuscriptTitle":"Tabular Transformer Generative Adversarial Network for Heterogeneous distribution in healthcare","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2024-03-25 18:02:07","doi":"10.21203/rs.3.rs-4134206/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"6d6faecd-a623-406c-8160-637f8eced7b6","owner":[],"postedDate":"March 25th, 2024","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[],"tags":[],"updatedAt":"2024-11-13T05:08:39+00:00","versionOfRecord":[],"versionCreatedAt":"2024-03-25 18:02:07","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-4134206","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-4134206","identity":"rs-4134206","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
