On the use of Synthetic Data for Machine Learning prediction of Self-Healing Capacity of Concrete

doi:10.21203/rs.3.rs-4668609/v1

2024 · doi:10.21203/rs.3.rs-4668609/v1

preprint OA: closed

Research Article

Franciana Sokoloski de Oliveira, Ricardo Stefani

This is a preprint; it has not been peer reviewed by a journal. This work is licensed under a CC BY 4.0 License. Status: Version 1 posted.

Abstract

This work investigated the use of synthetic data to overcome the limitations of scarce experimental data in predicting the self-healing capacity of bacteria-driven concrete. We generated a synthetic dataset based on real-world data, significantly expanding the original dataset and then trained and compared machine learning models, including probabilistic and ensemble methods, to predict the concrete self-healing capacity.
The results demonstrate that the ensemble methods, particularly the random forest (RF) method (accuracy = 0.863 and F1-score = 0.863), outperformed the probabilistic models and achieved high accuracy in predicting self-healing capacity. The trained models were further applied to real-world data examples, showing high accuracy. This research validates the utility of synthetic data for improving the accuracy and reliability of predictive modelling in civil engineering, particularly in areas with limited experimental data. The findings contribute to the growing use of ML and AI in concrete research and demonstrate the transformative potential of synthetic data in addressing challenges in civil engineering.

Keywords: self-healing concrete, bacteria, synthetic data, machine learning

1. Introduction

Concrete is the most widely used material in civil engineering. It is inexpensive, durable and resistant. Despite its advantages, concrete has several disadvantages, such as cracking, which causes high permeability in concrete structures (Wiktor and Jonkers 2011). Thus, the life span of concrete decreases as cracking increases. There is therefore a concern with fixing cracks before they compromise the concrete or even the entire building, and many restoration methods exist for doing so. Crack fixing often occurs by adding new cementitious materials to the cracks. One innovative approach to mitigate these issues is the integration of self-healing mechanisms, particularly those involving bacteria (Talaiekhozan et al. 2014; Karthiga Shenbagam and Praveena 2022; Hossain et al. 2022). Self-healing concrete is a type of smart construction material that aims to autonomously repair cracks that develop over time. Among various strategies, incorporating bacteria into concrete formulations has garnered significant attention (Gupta et al. 2018; Feng et al. 2019).
Bacteria have shown promise in enhancing the self-healing properties of concrete by metabolizing nutrients within the concrete matrix to produce calcite, thereby sealing cracks (Su et al. 2021). Concrete formulations, including those of self-healing concrete, are often developed and modelled using experimental design and statistical regressions (Li and Yang 2007; Ochi et al. 2007; Talaiekhozan et al. 2014; Hossain et al. 2022). Despite the importance of classical experimental design, the advent of machine learning and artificial intelligence in recent years has driven the development of new concrete with improved properties (Ziolkowski and Niedostatkiewicz 2019; Feng et al. 2020; Chaabene et al. 2020; Dehestani et al. 2022). Consequently, there has been a rapid shift from classical experimental design to machine learning-assisted concrete development, especially for materials whose development is not straightforward, such as self-healing concrete (Althoey et al. 2022; Huang et al. 2022).

Despite the growing importance of machine learning in the development of self-healing concrete, experimental data on this subject are limited (Alabduljabbar et al. 2023; Onyelowe et al. 2024). Machine learning algorithm performance is very sensitive to dataset size and availability (Janiesch et al. 2021; Zhou 2021; Pilania 2021): a limited dataset can lead to underfitted and inaccurate models. Data scarcity could therefore prevent the design and implementation of accurate machine learning models to predict the self-healing capacity of concrete with bacteria. For this reason, techniques such as data augmentation (Shorten and Khoshgoftaar 2019; Mumuni and Mumuni 2022) and synthetic data generation (Hong et al. 2021) have been used to obtain adequate and representative training data for ML when data on the subject are limited or of poor quality.
Synthetic data are statistically generated from real sample data to increase the volume of data available for training and developing ML models (Wendland et al. 2022). They have been used in many fields of knowledge, such as chemistry, medicine, finance, marketing and engineering, when the availability of real data is limited (Alghamdi 2023; Gelernter et al. 1990; Hittmeir et al. 2019; Krüger et al. 2024). Because experimental data for developing machine learning models that predict the self-healing capacity of concrete with bacteria are still limited, collecting a large and diverse dataset may be impractical or expensive, which limits the development of robust models. Moreover, a small and nondiverse dataset can lead to a data imbalance problem, causing the model to be biased and to perform poorly.

Therefore, given the limited data available on self-healing concrete with bacteria, this work aimed to (i) generate synthetic data; (ii) develop and compare machine learning models designed explicitly to classify a given concrete with a bacterial formulation as self-healing or not; (iii) select the best ML algorithm for this prediction; and (iv) evaluate whether models trained with synthetic data can be successfully applied to real data.

2. Materials and Methods

All the necessary code was written in Python 3 using the following packages: Pandas (v. 1.4.2), NumPy (v. 1.22.4), Scikit-Learn (v. 1.1.2), Matplotlib (v. 3.5.1), Seaborn (v. 0.2), SciPy (v. 1.7.3), Synthetic Data Vault (v. 1.2.0) and SDMetrics (v. 0.10.1).

2.1 Dataset

Data on self-healing concrete containing bacteria in its formulation were collected from the literature.
A total of 38 occurrences involving six Bacillus species, Bacillus cohnii, Bacillus subtilis, Bacillus megaterium, Bacillus sphaericus, Bacillus mucilaginosus and Bacillus pseudofirmus, were identified (Wiktor and Jonkers 2011; Luo et al. 2015; Gupta et al. 2018; Feng et al. 2019; Kumar Jogi and Vara Lakshmi 2020; Su et al. 2021; Huang et al. 2022; Karthiga Shenbagam and Praveena 2022; Nodehi et al. 2022). The following information was extracted and tabulated to prepare the dataset: quantity of sand, aggregate, calcium lactate and plasticizer, all measured in kg/m³; cement/water ratio; fissure size (mm); time to self-healing (days); and self-healing percentage. All the data were tabulated and recorded in a CSV table.

2.2 Data preprocessing

The data were preprocessed, and the six bacterial species were encoded with integer labels: Bacillus cohnii (1), Bacillus subtilis (2), Bacillus megaterium (3), Bacillus sphaericus (4), Bacillus mucilaginosus (5) and Bacillus pseudofirmus (6). Because the ML models were developed to classify each concrete formulation as self-healing or not, whereas the original dataset expressed self-healing capacity as a percentage, it was necessary to encode each formulation as capable or not capable of self-healing. We adopted 70% as the cut-off: self-healing percentages of 70% and above were encoded as 1 (can self-heal), and percentages under 70% as 0 (cannot self-heal). This encoding allows the development of classification machine learning models. All the data were recorded in a CSV table. Furthermore, of the 38 occurrences, 9 (23%) were removed from the CSV table and recorded in a new spreadsheet for later testing of the trained models with real data.
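The thresholding and hold-out steps described above can be sketched in a few lines of pandas. This is only an illustration: the column names, the example values and the random selection of held-out rows are assumptions made here, since the text does not say how the nine real rows were chosen.

```python
import pandas as pd

# Hypothetical rows; the real table has 38 occurrences and more columns.
records = pd.DataFrame({
    "bacteria": [1, 2, 3, 4, 5, 6, 1, 2],          # integer species codes
    "healing_pct": [95.0, 70.0, 69.9, 40.0, 88.0, 10.0, 72.5, 55.0],
})

CUTOFF = 70.0  # from the text: 70% and above counts as self-healing

# Binary target: 1 = can self-heal, 0 = cannot self-heal.
records["self_heals"] = (records["healing_pct"] >= CUTOFF).astype(int)

# Set aside some real rows for the final real-data check.
# Assumption: rows are picked at random with a fixed seed (2 of 8 here,
# standing in for the 9 of 38 reported in the text).
holdout = records.sample(n=2, random_state=10)
remaining = records.drop(holdout.index)
```

Note that a value of exactly 70.0 is labelled 1, matching the "70% and above" rule.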
2.3 Synthetic Data Generation

As machine learning algorithms often require a large amount of data, and because the data collected from the literature are insufficient for developing machine learning models, the data were subjected to synthetic data generation via the Synthetic Data Vault (Patki et al. 2016). Synthetic Data Vault (SDV) is a Python framework for generating synthetic data from real data; that is, the tool generates new data that follow the statistical properties of the given real data. Synthetic data generation involves a range of techniques that can be employed to expand datasets with limited information (Hittmeir et al. 2019), and the approach has been gaining popularity in ML. To create a dataset about 10 times larger than the original set, 350 rows of data were generated based on the original dataset. To ensure the quality of the generated data and its agreement with the original dataset, the synthetic dataset was further analysed using methods and tools provided by the SDV framework. Finally, the generated dataset was recorded in a CSV table.

2.4 Development of the Machine Learning Models

To classify each concrete formulation as self-healing or not, ML classification models were developed and trained. Five ML methods were selected to develop the classification models: naïve Bayes (NB), logistic regression (LR), K-nearest neighbors (KNN), support vector classification (SVC) and random forest classification (RF). The dataset generated by SDV was split into training (80%) and test (20%) sets and subsequently used to train and test each ML model.

2.4.1 Naïve Bayes

Naïve Bayes is a machine learning classification technique based on probabilistic statistical modelling (Shields et al. 2021). That is, it estimates the probability that a sample belongs or does not belong to a given class.
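As a rough illustration of an 80/20 split followed by a probabilistic classifier of this kind, the sketch below fits scikit-learn's GaussianNB on randomly generated stand-in data. The array shapes, the rule that generates the labels and the seed passed to train_test_split are assumptions made for the example, not the authors' data or settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(10)

# Stand-in for 350 synthetic rows with 8 numeric features (hypothetical).
X = rng.normal(size=(350, 8))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=350) > 0).astype(int)

# 80% training, 20% test; the seed is an assumption for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10
)

nb = GaussianNB()            # library default parameters
nb.fit(X_train, y_train)

# Each row holds the estimated probability of class 0 and of class 1.
proba = nb.predict_proba(X_test)
nb_test_accuracy = nb.score(X_test, y_test)
```

The predict_proba output makes the probabilistic nature of the model explicit: the predicted class is simply the one with the larger estimated probability.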
In this work, the naïve Bayes model was trained and tested using the Gaussian naïve Bayes implementation of the Scikit-learn framework with the framework default parameters.

2.4.2 Logistic Regression

Although this ML technique is named a regression, it is in fact a classification technique. LR is based on estimating the probability of an outcome being true or false, the details of which are well described in the literature (Bisong 2019). Like NB, LR is a kind of statistical modelling that weighs the probability of a true value against the probability of a false value (Janiesch et al. 2021; Zhou 2021), which makes the technique well suited to predictive classification. In this work, the LR model was trained and tested using the logistic regression implementation in the Scikit-Learn framework with the solver parameter set to ‘liblinear’ and the random state set to 10. The ‘liblinear’ solver was chosen because it is adequate for binary classification.

2.4.3 K-nearest neighbors

The KNN algorithm is a nonparametric learning method for classification and regression. It is one of the ML algorithms known as voting algorithms; that is, each data point is assigned to a class by a plurality vote of its nearest neighbours, as measured by Euclidean distance (Zheng and Tropsha 2000; dos Santos Freitas et al. 2022). As the performance of KNN depends upon the number of neighbours (k), an optimization was conducted to find the optimal value. Models were trained using the KNN implementation in Scikit-learn with the parameter n_neighbors varied from 1 to 50 under 10-fold cross-validation, and the accuracy score of each training and cross-validation run was measured. The optimal k value was found to be 26 (Fig. 1). Hence, the final KNN model was trained with k = 26.
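A search over k of this kind might look like the following sketch, which scores each n_neighbors value from 1 to 50 with 10-fold cross-validated accuracy. The data are random stand-ins (an assumption), so the selected k will not match the value of 26 reported above.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(10)

# Hypothetical training data: 280 rows, 8 features, binary label.
X = rng.normal(size=(280, 8))
y = (X[:, 0] - X[:, 1] > 0).astype(int)

k_values = range(1, 51)          # n_neighbors from 1 to 50 inclusive
mean_accuracy = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)   # Euclidean distance by default
    scores = cross_val_score(knn, X, y, cv=10, scoring="accuracy")
    mean_accuracy.append(scores.mean())

# k with the highest mean cross-validated accuracy.
best_k = k_values[int(np.argmax(mean_accuracy))]
```

Plotting mean_accuracy against k_values would give a curve of the type the text refers to as Fig. 1.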
2.4.4 Support Vector Classification

This model implements a support vector machine (SVM) algorithm. SVC provides a binary output (e.g., determining whether an object belongs to a specific class). Thus, SVC is appropriate for addressing classification problems and works by generating a hyperplane that separates the dataset, nonlinearly remapping the data into a higher-dimensional space where needed to achieve linear separability (Burges 1998; Mammone et al. 2009). To train the SVC model, it was first necessary to establish the required parameters. The following parameters were defined as described in the literature: kernel, the mathematical function responsible for data transformations in SVM; C (regularization parameter), which controls the trade-off between maximizing the hyperplane margin and minimizing the training error term; gamma, the kernel coefficient, which controls how far the influence of each training example reaches; and degree, the degree of the polynomial kernel, which controls the complexity of the transformed space (Cervantes et al. 2020). This work assessed the impact of different values of these parameters on model performance using the GridSearchCV method. The tested values are presented in Table 1.

Table 1 Values of the support vector classification (SVC) parameters.

Parameter | Values
Kernel | Linear; Polynomial; RBF; Sigmoid
C | 1; 5; 10; 100
Degree | 1; 2; 3
Gamma | 0.01; 0.05; 0.1; 0.5; 1.0
Max_iter | 1000

Note: RBF = radial basis function kernel.

GridSearchCV automatically determines the optimal values for the required parameters. The tool evaluates every combination of the listed parameter values, calculates a cross-validated score for each, and identifies the combination with the highest score as the best one. In this work, GridSearchCV was run with n_splits = 5, test_size = 0.2 and random_state = 10.
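A grid search over the values in Table 1 could be set up roughly as follows. Because n_splits and test_size are arguments of scikit-learn's shuffle-split style splitters rather than of GridSearchCV itself, this sketch assumes a ShuffleSplit object was passed as the cv argument; that choice, the accuracy scoring and the stand-in data are assumptions, not details confirmed by the text.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.svm import SVC

# max_iter=1000 can stop the solver early; silence the resulting warnings.
warnings.filterwarnings("ignore", category=ConvergenceWarning)

rng = np.random.default_rng(10)
X = rng.normal(size=(120, 8))               # hypothetical stand-in features
y = (X[:, 0] + X[:, 1] > 0).astype(int)     # hypothetical binary label

# Candidate values taken from Table 1.
param_grid = {
    "kernel": ["linear", "poly", "rbf", "sigmoid"],
    "C": [1, 5, 10, 100],
    "degree": [1, 2, 3],                     # only used by the "poly" kernel
    "gamma": [0.01, 0.05, 0.1, 0.5, 1.0],
}

# Assumed splitter: 5 random 80/20 splits with a fixed seed.
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=10)

search = GridSearchCV(SVC(max_iter=1000), param_grid, cv=cv, scoring="accuracy")
search.fit(X, y)

best_params = search.best_params_            # highest mean validation score
```

The grid contains 4 × 4 × 3 × 5 = 240 combinations, each scored on five splits, so the search fits 1200 models before refitting the best one on all the data.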
These settings mean a 5-fold cross-validation with the data split into 80% training and 20% testing, whereas random_state fixes the random seed to ensure reproducibility. The combination of parameter values that yielded the highest score for SVC was C = 10, gamma = 1 and a linear kernel with a maximum of 1000 iterations. The final SVC model was trained using those values.

2.4.5 Random forest classification

Random forest classification is an ensemble machine learning technique that combines decision trees to predict the classification of a target response (Ehrman et al. 2007). The predictive ability and range of random forest models can be tuned by adjusting certain parameters. In this study, two parameters were set, as discussed below. The first parameter is "n_estimators", which represents the number of trees in the forest; a final value of n_estimators = 100 was selected. The second parameter is "random_state", which controls the bootstrap randomness; to ensure reproducibility, an integer value of 10 was adopted.

2.5 Performance comparison of ML models

The performance of each model was then compared. The quality of the model fit was assessed using several common metrics for classification models: accuracy, precision, recall, F1-score and the confusion matrix. Accuracy measures the overall correctness of the model, that is, the proportion of all predictions that are correct. Precision measures the proportion of formulations predicted as self-healing that are actually self-healing. Recall measures the proportion of actual self-healing formulations that were correctly identified. The F1-score is the harmonic mean of precision and recall, combining true positives, false positives and false negatives in a single metric.
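For reference, the standard textbook definitions of these metrics can be written in confusion-matrix notation. The symbols TP, TN, FP and FN (true positives, true negatives, false positives and false negatives) are introduced here for illustration and do not appear in the original text.

```latex
\begin{align}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall}    &= \frac{TP}{TP + FN} \\
F_1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
     = \frac{2\,TP}{2\,TP + FP + FN}
\end{align}
```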
The F1-score can vary from 0 to 1, and higher values indicate better model performance. The confusion matrix is not a metric itself but rather a way to graphically show and interpret model performance. Moreover, to determine the influence of each concrete component (variable) on the self-healing capacity, we used the permutation_importance function in the sklearn.inspection module. Details on the methodology are described elsewhere (Pessoa et al. 2024). Finally, each model was tested with real data to check whether it correctly predicted the concrete self-healing capacity.

3. Results and Discussion

3.1 Synthetic data quality analysis

An overall quality score of 79.43% was achieved. The column pair trends analysis, which measures how well the correlations between pairs of columns in the real data are reproduced in the synthetic data, yielded 88.18% (Fig. 2); a higher percentage indicates higher quality of the generated data. Figure 2 shows that the correlation between the synthetic and real data is high. Therefore, the synthetic data can be taken to represent the variety of formulations and self-healing capacities of the original data. Furthermore, a final data analysis was conducted to assess the coverage of the synthetic data and to determine whether new data or copies of real data had been generated. The results were satisfactory, as the synthetic data covered over 90% of the range of possible values, and over 90% of the synthetic data comprised new data.

3.2 Machine Learning Modelling

The results of ML modelling are summarized in Table 2. The results show that the best ML model is the RF model, while the worst are the NB and KNN models. The NB model displayed only moderate performance on the test set (accuracy = 0.712 and F1-score = 0.713) compared to the other models, indicating its limitation in handling the complexity of the underlying relationships.
Compared with more advanced ML algorithms such as ensemble models, probabilistic models such as naïve Bayes are known to perform worse and to have limitations for nonlinear data relations (Cloutier and Sirois 2008; Kumar Tipu et al. 2022). Hence, this result shows that such a probabilistic model is not adequate for solving this class of ML problems. The LR model (accuracy = 0.765 and F1-score = 0.763 for the test set) and the SVC model have similar overall performances on both the training and test sets, with SVC performing slightly better (accuracy = 0.787 and F1-score = 0.788 for the test set). As logistic regression classifies on the basis of a linear model of the log-odds, this result can be explained by the type of kernel selected for SVC (linear). The linear-kernel SVC is very similar in function to the logistic regression algorithm, although the two still differ in that logistic regression, like NB, is considered a probabilistic algorithm. From a practical point of view, LR and SVC show comparable performance.

Table 2 Comparison of the performances of the ML models on the training and test sets

ML model | Train Accuracy | Train Precision | Train Recall | Train F1-Score | Test Accuracy | Test Precision | Test Recall | Test F1-Score
NB | 0.828 | 0.828 | 0.870 | 0.828 | 0.712 | 0.704 | 0.756 | 0.713
LR | 0.843 | 0.843 | 0.881 | 0.844 | 0.765 | 0.750 | 0.805 | 0.763
KNN | 0.790 | 0.798 | 0.830 | 0.791 | 0.712 | 0.695 | 0.780 | 0.714
SVC | 0.865 | 0.868 | 0.892 | 0.865 | 0.787 | 0.773 | 0.830 | 0.788
RF | 1.000 | 1.000 | 1.000 | 1.000 | 0.863 | 0.895 | 0.830 | 0.863

KNN shows the worst performance during training (accuracy = 0.790 and F1-score = 0.791), while its performance on the test set is equal to that of NB (accuracy = 0.712 and F1-score = 0.714). As KNN performance depends upon the similarity of the data, predictions can be difficult if the data differ widely or are nonlinear. RF outperforms every other model in both training and testing.
RF shows a value of 1.000 on all metrics during training, which would normally be indicative of overfitting (Janiesch et al. 2021), but the test performance (accuracy = 0.863 and F1-score = 0.863) suggests that the model still generalizes well. Thus, RF can predict the self-healing capacity of a concrete formulation with a high degree of confidence. The superior performance of RF in both training and testing arises because RF is an ensemble method built from multiple decision trees. Its predictive performance exceeds that of a single decision tree because it combines several independent learners through a voting process, yielding an overall prediction that is more accurate than that of an individual model (Cook et al. 2019; Dehestani et al. 2022).

3.3 Performance of the ML models on real data

Table 3 shows the performance of the trained ML models when applied to real data. This can also be considered a kind of validation of how reliable ML models trained on synthetic data are when applied to real data.

Table 3 Comparison of the performances of the ML models on real data

ML model | Accuracy | Precision | F1-Score | Recall (positive) | Recall (negative)
NB | 0.666 | 0.625 | 0.605 | 1.000 | 0.250
LR | 0.777 | 0.714 | 0.759 | 1.000 | 0.500
KNN | 0.666 | 0.625 | 0.605 | 1.000 | 0.250
SVC | 1.000 | 1.000 | 1.000 | 1.000 | 1.000
RF | 1.000 | 1.000 | 1.000 | 1.000 | 1.000

The results show that nonprobabilistic models (SVC and RF) perform better than probabilistic models (NB and LR) and that KNN has poor performance, comparable to NB. A series of confusion matrices is shown in Fig. 3 to compare how each model performs on real data. Figure 3 shows that all the ML models correctly classified the 5 real self-healing concrete samples as self-healing, while only the SVC and RF models correctly classified all 4 non-self-healing concrete samples.
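The entries of Table 3 can be reproduced from the confusion-matrix counts implied by the text: 5 positive and 4 negative real samples, with NB and KNN recovering 1 of the 4 negatives and LR recovering 2. The sketch below is plain Python. The helper name binary_report is hypothetical, and the use of a support-weighted F1 is an inference from the reported numbers rather than something the text states.

```python
def binary_report(tp, fn, tn, fp):
    """Metrics for a 2-class confusion matrix; class 1 = self-healing."""
    n_pos, n_neg = tp + fn, tn + fp
    total = n_pos + n_neg
    precision_pos = tp / (tp + fp)
    recall_pos = tp / n_pos
    precision_neg = tn / (tn + fn)
    recall_neg = tn / n_neg
    f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)
    f1_neg = 2 * precision_neg * recall_neg / (precision_neg + recall_neg)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision_pos,
        "recall_pos": recall_pos,
        "recall_neg": recall_neg,
        # Per-class F1 averaged with class sizes as weights (assumed).
        "f1_weighted": (n_pos * f1_pos + n_neg * f1_neg) / total,
    }

# NB and KNN rows: all 5 positives found, 1 of 4 negatives found.
nb_like = binary_report(tp=5, fn=0, tn=1, fp=3)
# LR row: all 5 positives found, 2 of 4 negatives found.
lr_like = binary_report(tp=5, fn=0, tn=2, fp=2)
```

With these counts, nb_like gives accuracy 6/9, precision 0.625 and a weighted F1 of about 0.605, and lr_like gives accuracy 7/9, precision 5/7 and a weighted F1 of about 0.759, in line with Table 3.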
Hence, the KNN, LR and NB models cannot reliably classify non-self-healing concrete. Table 3 and Fig. 3 show that NB and KNN correctly classified only 25% of the non-self-healing concrete (negative values), while LR performed better, correctly classifying 50% of the negative values. In other words, 75% of the non-self-healing concrete was incorrectly classified by the NB and KNN models, and 50% by the LR model. Support vector machines and random forests are ML algorithms that excel in dealing with nonlinear associations within the data and typically outperform probabilistic and linear algorithms (Fan et al. 2008; Niazi et al. 2008). These algorithms capture nonlinear data associations and interactions and provide good or superior performance even with little parameter tuning. Moreover, they are less susceptible to overfitting, yielding accurate predictions even when confronted with limited information. The superior performance of both SVC and RF is evident in the correct classification of all instances of both non-self-healing concrete (negative values) and self-healing concrete (positive values). Similar performances for ensemble methods, including formulations with bacteria, have been reported in the literature for different kinds of self-healing concrete, with accuracies greater than 0.90 (Zhuang and Zhou 2019; Huang et al. 2022; Alabduljabbar et al. 2023). In this work, the SVC and RF models went further on the real-data test, correctly classifying all nine instances.

3.4 Influence of the independent variable sensitivity

A graph with the relative importance of each variable (Fig. 4) shows that the cement/water ratio (FACTOR), lactate content, aggregate content and time are features that have a direct influence on the self-healing capacity of concrete.
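A permutation-based ranking of the kind plotted in Fig. 4 can be sketched with scikit-learn's permutation_importance. Everything about the data here is a toy assumption: the label is constructed to depend mostly on the first column, the column names other than FACTOR and LACTATE are hypothetical, and the text does not say which model or data split the authors used for this step.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(10)
feature_names = ["FACTOR", "LACTATE", "AGGREGATE", "TIME", "NOISE"]

# Toy data: by construction only the first two columns drive the label.
X = rng.normal(size=(350, 5))
y = (3.0 * X[:, 0] + 1.0 * X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10
)

rf = RandomForestClassifier(n_estimators=100, random_state=10)
rf.fit(X_train, y_train)

# Shuffle one column at a time and record the drop in test accuracy.
result = permutation_importance(
    rf, X_test, y_test, n_repeats=10, random_state=10
)

# Features sorted from most to least important.
ranking = sorted(
    zip(feature_names, result.importances_mean),
    key=lambda pair: pair[1],
    reverse=True,
)
```

A feature whose shuffling barely changes the score receives an importance near zero, which is how a variable such as the bacterial species can rank low even though it is present in every formulation.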
Regardless of the type of bacteria present in the concrete formulation, self-healing capacity is induced by the bacterial production of calcium lactate, which is further converted to calcium carbonate (Luo et al. 2015; Rong et al. 2020). Thus, it is not surprising that LACTATE has a greater influence on machine learning modelling than the type of bacteria: the importance of the bacterial species decreases because all of the bacteria can produce lactate, which is the factor ultimately responsible for self-healing. The cement/water ratio also has great importance for self-healing, since the rate of bacterial activation depends upon the presence of water (Luo et al. 2015), and even concrete without any bacteria can show a small self-healing capacity (Zhuang and Zhou 2019). The time in days and the fissure size in mm are also of utmost importance in self-healing modelling and prediction.

3.5 Limitations and contributions of this work

The main contribution of this work is to propose the use of synthetic data and to compare the performances of ML models built on synthetic data. Thus, this work contributes to the development of AI and ML tools for concrete design and other fields of industrial engineering. In addition, the synthetic data generation and ML design methodology in this paper can be adopted for wider application in civil engineering. The omission of other variables that influence crack healing, such as the carrier of bacteria and the nutrient medium of bacteria, is a limitation not only of the present work but also of previous works, since experimental data on the subject are scarce (Zhuang and Zhou 2019; Feng et al. 2019; Huang et al. 2022; Pessoa et al. 2024).

4. Conclusions

The generation and utilization of synthetic data have proven useful in overcoming limitations due to the scarcity of experimental data in the field of concrete self-healing driven by bacteria.
Synthetic data generation substantially expanded the initial dataset, enabling the development and testing of machine learning models with enhanced accuracy and reliability and helping those models capture a wider range of possible scenarios and variations in concrete formulations. Analysis and comparison of multiple ML algorithms demonstrated that ensemble methods, particularly random forest, outperformed probabilistic models in predicting concrete self-healing capacity. This approach not only validated the utility of synthetic data but also showed its potential to advance predictive modelling in civil engineering, particularly where experimental data availability is limited. Furthermore, the key factors influencing concrete self-healing, including the cement-to-water ratio, calcium lactate concentration, aggregate composition, and curing time, were identified. These findings contribute to the increasing use of machine learning and AI in the field of concrete research and show the transformative impact of synthetic data in driving innovation and addressing critical challenges in civil engineering.

Declarations

Disclosure statement
The authors report there are no competing interests to declare.

Author Contribution
F.S.O.: Experimental design and development. Data analysis. Wrote the first draft of the manuscript. R.S.: Data curation. Data analysis. Review and improvement of the manuscript. All authors read and approved the final manuscript.

Data Availability
All data and code are available on the corresponding author's GitHub page.

References

Alabduljabbar H, Khan K, Awan HH et al (2023) Modeling the capacity of engineered cementitious composites for self-healing using AI-based ensemble techniques. Case Stud Constr Mater 18. https://doi.org/10.1016/j.cscm.2022.e01805
Alghamdi SJ (2023) Prediction of Concrete’s Compressive Strength via Artificial Neural Network Trained on Synthetic Data. Eng Technol Appl Sci Res 13:12404–12408.
https://doi.org/10.48084/etasr.6560
Althoey F, Amin MN, Khan K et al (2022) Machine learning based computational approach for crack width detection of self-healing concrete. Case Stud Constr Mater 17:e01610. https://doi.org/10.1016/j.cscm.2022.e01610
Bisong E (2019) Logistic Regression. In: Bisong E (ed) Building Machine Learning and Deep Learning Models on Google Cloud Platform: A Comprehensive Guide for Beginners. Apress, Berkeley, CA, pp 243–250
Burges CJC (1998) A tutorial on support vector machines for pattern recognition. Data Min Knowl Discov 2:121–167. https://doi.org/10.1023/A:1009715923555
Cervantes J, Garcia-Lamont F, Rodríguez-Mazahua L, Lopez A (2020) A comprehensive survey on support vector machine classification: Applications, challenges and trends. Neurocomputing 408:189–215. https://doi.org/10.1016/j.neucom.2019.10.118
Chaabene WB, Flah M, Nehdi ML (2020) Machine learning prediction of mechanical properties of concrete: Critical review. Constr Build Mater 260:1–18. https://doi.org/10.1016/j.conbuildmat.2020.119889
Cloutier LM, Sirois S (2008) Bayesian versus Frequentist statistical modeling: a debate for hit selection from HTS campaigns. Drug Discov Today 13:536–542. https://doi.org/10.1016/j.drudis.2008.03.022
Cook R, Lapeyre J, Ma H et al (2019) Prediction of Compressive Strength of Concrete: Critical Comparison of Performance of a Hybrid Machine Learning Model with Standalone Models. https://doi.org/10.1061/(ASCE)
Dehestani A, Kazemi F, Abdi R, Nitka M (2022) Prediction of fracture toughness in fibre-reinforced concrete, mortar, and rocks using various machine learning techniques. Eng Fract Mech 276. https://doi.org/10.1016/j.engfracmech.2022.108914
dos Santos Freitas MM, Barbosa JR, dos Santos Martins EM et al (2022) KNN algorithm and multivariate analysis to select and classify starch films. Food Packag Shelf Life 34. https://doi.org/10.1016/j.fpsl.2022.100976
Ehrman TM, Barlow DJ, Hylands PJ (2007) Virtual screening of Chinese herbs with Random Forest. J Chem Inf Model 47:264–278. https://doi.org/10.1021/ci600289v
Fan R-E, Chang K-W, Hsieh C-J et al (2008) LIBLINEAR: A Library for Large Linear Classification
Feng DC, Liu ZT, Wang XD et al (2020) Machine learning-based compressive strength prediction for concrete: An adaptive boosting approach. Constr Build Mater 230. https://doi.org/10.1016/j.conbuildmat.2019.117000
Feng J, Su Y, Qian C (2019) Coupled effect of PP fiber, PVA fiber and bacteria on self-healing efficiency of early-age cracks in concrete. Constr Build Mater 228:116810. https://doi.org/10.1016/J.CONBUILDMAT.2019.116810
Gelernter H, Rose JR, Chen C (1990) Building and Refining a Knowledge Base for Synthetic Organic Chemistry via the Methodology of Inductive and Deductive Machine Learning
Gupta S, Kua HW, Pang SD (2018) Healing cement mortar by immobilization of bacteria in biochar: An integrated approach of self-healing and carbon sequestration. Cem Concr Compos 86:238–254. https://doi.org/10.1016/j.cemconcomp.2017.11.015
Hittmeir M, Ekelhart A, Mayer R (2019) On the Utility of Synthetic Data: An Empirical Evaluation on Machine Learning Tasks. In: Proceedings of the 14th International Conference on Availability, Reliability and Security. Association for Computing Machinery, pp 1–6
Hong Y, Park S, Kim H, Kim H (2021) Synthetic data generation using building information models. Autom Constr 130:103871. https://doi.org/10.1016/j.autcon.2021.103871
Hossain MR, Sultana R, Patwary MM et al (2022) Self-healing concrete for sustainable buildings. A review. Environ Chem Lett 20:1265–1273
Huang X, Sresakoolchai J, Qin X et al (2022) Self-Healing Performance Assessment of Bacterial-Based Concrete Using Machine Learning Approaches. Materials 15. https://doi.org/10.3390/ma15134436
Janiesch C, Zschech P, Heinrich K (2021) Machine learning and deep learning. Electron Markets 31:685–695. https://doi.org/10.1007/s12525-021-00475-2
Karthiga Shenbagam N, Praveena R (2022) Performance of bacteria on self-healing concrete and its effects as carrier. Mater Today Proc 65:1987–1989. https://doi.org/10.1016/j.matpr.2022.05.322
Krüger M, Vogel-Heuser B et al (2024) Synthetic Data Generation for the Enrichment of Civil Engineering Machine Data. In: Fottner J, Nübel K et al (eds) Construction Logistics, Equipment, and Robotics. Springer Nature Switzerland, Cham, pp 166–175
Kumar Jogi P, Vara Lakshmi TVS (2020) Self healing concrete based on different bacteria: A review. In: Materials Today: Proceedings. Elsevier Ltd, pp 1246–1252
Kumar Tipu R, Panchal VR, Pandya KS (2022) An ensemble approach to improve BPNN model precision for predicting compressive strength of high-performance concrete. Structures 45:500–508. https://doi.org/10.1016/j.istruc.2022.09.046
Li VC, Yang E-H (2007) Self Healing in Concrete Materials. In: van der Zwaag S (ed) Self healing materials: an alternative approach to 20 centuries of materials science. Springer, pp 161–193
Luo M, Qian CX, Li RY (2015) Factors affecting crack repairing capacity of bacteria-based self-healing concrete. Constr Build Mater 87:1–7. https://doi.org/10.1016/j.conbuildmat.2015.03.117
Mammone A, Turchi M, Cristianini N (2009) Support vector machines. Wiley Interdiscip Rev Comput Stat 1:283–289
Mumuni A, Mumuni F (2022) Data augmentation: A comprehensive survey of modern approaches. Array 16
Niazi A, Jameh-Bozorghi S, Nori-Shargh D (2008) Prediction of toxicity of nitrobenzenes using ab initio and least squares support vector machines. J Hazard Mater 151:603–609. https://doi.org/10.1016/j.jhazmat.2007.06.030
Nodehi M, Ozbakkaloglu T, Gholampour A (2022) A systematic review of bacteria-based self-healing concrete: Biomineralization, mechanical, and durability properties. J Build Eng 49
Ochi T, Okubo S, Fukui K (2007) Development of recycled PET fiber and its application as concrete-reinforcing fiber. Cem Concr Compos 29:448–455. https://doi.org/10.1016/j.cemconcomp.2007.02.002
Onyelowe KC, Adam AFH, Ulloa N et al (2024) Modeling the influence of bacteria concentration on the mechanical properties of self-healing concrete (SHC) for sustainable bio-concrete structures. Sci Rep 14. https://doi.org/10.1038/s41598-024-58666-8
Patki N, Wedge R, Veeramachaneni K (2016) The synthetic data vault. In: Proceedings – 3rd IEEE International Conference on Data Science and Advanced Analytics, DSAA 2016. Institute of Electrical and Electronics Engineers Inc., pp 399–410
Pessoa CLE, Peres Silva VH, Stefani R (2024) Prediction of the self-healing properties of concrete modified with bacteria and fibers using machine learning. Asian J Civil Eng 25:1801–1810. https://doi.org/10.1007/s42107-023-00878-w
Pilania G (2021) Machine learning in materials science: From explainable predictions to autonomous design. Comput Mater Sci 193:110360. https://doi.org/10.1016/J.COMMATSCI.2021.110360
Rong H, Wei G, Ma G et al (2020) Influence of bacterial concentration on crack self-healing of cement-based materials. Constr Build Mater 244:118372. https://doi.org/10.1016/j.conbuildmat.2020.118372
Shields BJ, Stevens J, Li J et al (2021) Bayesian reaction optimization as a tool for chemical synthesis. Nature 590:89–96. https://doi.org/10.1038/s41586-021-03213-y
Shorten C, Khoshgoftaar TM (2019) A survey on Image Data Augmentation for Deep Learning. J Big Data. https://doi.org/10.1186/s40537-019-0197-0
6: Su Y, Qian C, Rui Y, Feng J (2021) Exploring the coupled mechanism of fibers and bacteria on self-healing concrete from bacterial extracellular polymeric substances (EPS). Cem Concr Compos 116:103896. https://doi.org/10.1016/J.CEMCONCOMP.2020.103896 Talaiekhozan A, Keyvanfar A, Shafaghat A et al (2014) A Review of Self-healing Concrete Research Development. J Environ Treat Techniques 2:1–11 Wendland P, Birkenbihl C, Gomez-Freixa M et al (2022) Generation of realistic synthetic data using Multimodal Neural Ordinary Differential Equations. https://doi.org/10.1038/s41746-022-00666-x . NPJ Digit Med 5: Wiktor V, Jonkers HM (2011) Quantification of crack-healing in novel bacteria-based self-healing concrete. Cem Concr Compos 33:763–770. https://doi.org/10.1016/j.cemconcomp.2011.03.012 Zheng W, Tropsha a (2000) Novel variable selection quantitative structure–property relationship approach based on the k-nearest-neighbor principle. J Chem Inf Comput Sci 40:185–194 Zhou Z-H (2021) Machine Learning, 1st edn. Springer Singapore Zhuang X, Zhou S (2019) The prediction of self-healing capacity of bacteria-based concrete using machine learning approaches. Computers Mater Continua 59:57–77. https://doi.org/10.32604/cmc.2019.04589 Ziolkowski P, Niedostatkiewicz M (2019) Machine learning techniques in concrete mix design. Materials 12. https://doi.org/10.3390/ma12081256 Additional Declarations No competing interests reported. Cite Share Download PDF Status: Posted Version 1 posted You are reading this latest preprint version Research Square lets you share your work early, gain feedback from the community, and start making changes to your manuscript prior to peer review in a journal. As a division of Research Square Company, we’re committed to making research communication faster, fairer, and more useful. We do this by developing innovative software and high quality services for the global research community. 
© Research Square 2026 | ISSN 2693-5015 (online)
{"props":{"pageProps":{"initialData":{"identity":"rs-4668609","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":333023605,"identity":"d703b36e-8ca3-4ff1-b4d2-a2870e878eb9","order_by":0,"name":"Franciana Sokoloski de Oliveira","email":"","orcid":"","institution":"Universidade Federal de Mato Grosso","correspondingAuthor":false,"prefix":"","firstName":"Franciana","middleName":"Sokoloski","lastName":"de Oliveira","suffix":""},{"id":333023607,"identity":"24166638-a0c3-45b4-ba0a-e32c0b937bee","order_by":1,"name":"Ricardo Stefani","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAABFklEQVRIie3SPUvDQBjA8SccPNNp1ytNm69wJdAq7YdJltzkWjqIXAici8VVJ79CR0fl4FxCdydTnIWKUwfFi1VROBrdRO4/hFy4H/dCAHy+vxipH7h5XwGwHpCr9++siSAEZ3ZaDJg0EPhOIJVNZHhMlk/TyRiii+Jari/3xNwQU8F0lMrOrHKRUGPcLhcZcINpPivZwdyg4FCKVIY33EUYodDOlQaOdFjtKEvuTgYsUDqVLHNuzBKyrkmkWo/5s2KCG2rJy1aCb6uAoUFhV0k2RG4jONiXi4xyk/WLrmL98/osiRGxCo2btPT9rZyMe1Ghl/mDOop26xtbHY66px3lJB/RrwOEBD5/iZ/1q8k+n8/3/3sFiidXHzdL474AAAAASUVORK5CYII=","orcid":"","institution":"Universidade Federal de Ouro Preto","correspondingAuthor":true,"prefix":"","firstName":"Ricardo","middleName":"","lastName":"Stefani","suffix":""}],"badges":[],"createdAt":"2024-07-01 
14:04:18","currentVersionCode":1,"declarations":{"humanSubjects":false,"vertebrateSubjects":false,"conflictsOfInterestStatement":false,"humanSubjectEthicalGuidelines":false,"humanSubjectConsent":false,"humanSubjectClinicalTrial":false,"humanSubjectCaseReport":false,"vertebrateSubjectEthicalGuidelines":false},"doi":"10.21203/rs.3.rs-4668609/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-4668609/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":61496767,"identity":"b14b1dd0-8e72-485f-a3b8-34eb593d87d8","added_by":"auto","created_at":"2024-07-31 11:54:32","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":32447,"visible":true,"origin":"","legend":"\u003cp\u003eAccuracy scores for KNN optimization.\u003c/p\u003e","description":"","filename":"floatimage1.png","url":"https://assets-eu.researchsquare.com/files/rs-4668609/v1/ea2a4662258ac725d4ff760b.png"},{"id":61496766,"identity":"4773fc59-16ad-4f02-9dd1-d0efe4ba007b","added_by":"auto","created_at":"2024-07-31 11:54:32","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":104295,"visible":true,"origin":"","legend":"\u003cp\u003eCorrelation analysis of synthetic vs. 
real data similarity\u003c/p\u003e","description":"","filename":"floatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-4668609/v1/1d77c0c57493d6ead55ed6af.png"},{"id":61498522,"identity":"2d0bdb30-60dd-4af2-a4c8-427881965189","added_by":"auto","created_at":"2024-07-31 12:10:32","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":82842,"visible":true,"origin":"","legend":"\u003cp\u003eConfusion matrix of real data prediction for each ML model\u003c/p\u003e","description":"","filename":"floatimage316.png","url":"https://assets-eu.researchsquare.com/files/rs-4668609/v1/512f4e9b8f1af45dbe7caaf5.png"},{"id":61497392,"identity":"2d93dce8-a3cb-4c91-8ba7-35a29a7c93c2","added_by":"auto","created_at":"2024-07-31 12:02:32","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":15844,"visible":true,"origin":"","legend":"\u003cp\u003eRelative importance of the independent variables in the final ML models\u003c/p\u003e","description":"","filename":"floatimage4.png","url":"https://assets-eu.researchsquare.com/files/rs-4668609/v1/ad6be5146ef8b354a80ec8dd.png"},{"id":68617298,"identity":"0b97dc81-5222-4399-bdff-c386a4515835","added_by":"auto","created_at":"2024-11-09 12:46:53","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":732245,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-4668609/v1/77feb075-3aa5-4085-be9a-4737e5430761.pdf"}],"financialInterests":"No competing interests reported.","formattedTitle":" On the use of Synthetic Data for Machine Learning prediction of Self-Healing Capacity of Concrete","fulltext":[{"header":"1. Introduction","content":"\u003cp\u003eConcrete is the most widely used material in civil engineering. It is inexpensive, durable and resistant. 
Despite its advantages, concrete has several disadvantages, such as cracking, which increases the permeability of concrete structures (Wiktor and Jonkers \u003cspan citationid=\"CR44\" class=\"CitationRef\"\u003e2011\u003c/span\u003e). Thus, the life span of concrete decreases as cracking increases. Therefore, cracks should be repaired before they compromise the concrete or even the entire building, and many restoration methods exist to repair cracks before the damage becomes irreversible. Cracks are often repaired by adding new cementitious materials to them. One innovative approach to mitigate these issues is the integration of self-healing mechanisms, particularly those involving bacteria (Talaiekhozan et al. \u003cspan citationid=\"CR42\" class=\"CitationRef\"\u003e2014\u003c/span\u003e; Karthiga Shenbagam and Praveena \u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e2022\u003c/span\u003e; Hossain et al. \u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e2022\u003c/span\u003e). Self-healing concrete is a type of smart construction material that aims to autonomously repair cracks that develop over time. Among various strategies, incorporating bacteria into concrete formulations has garnered significant attention (Gupta et al. \u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e2018\u003c/span\u003e; Feng et al. \u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e2019\u003c/span\u003e). Bacteria have shown promise in enhancing the self-healing properties of concrete by metabolizing nutrients within the concrete matrix to produce calcite, thereby sealing cracks (Su et al. 
\u003cspan citationid=\"CR41\" class=\"CitationRef\"\u003e2021\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eConcrete formulations, including those of self-healing concrete, are often developed and modelled using experimental design and statistical regressions (Li and Yang \u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e2007\u003c/span\u003e; Ochi et al. \u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e2007\u003c/span\u003e; Talaiekhozan et al. \u003cspan citationid=\"CR42\" class=\"CitationRef\"\u003e2014\u003c/span\u003e; Hossain et al. \u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e2022\u003c/span\u003e). Despite the importance of classical experimental design, the advent of machine learning and artificial intelligence in recent years has driven the development of new concrete with improved properties (Ziolkowski and Niedostatkiewicz \u003cspan citationid=\"CR48\" class=\"CitationRef\"\u003e2019\u003c/span\u003e; Feng et al. \u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e2020\u003c/span\u003e; Chaabene et al. \u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e2020\u003c/span\u003e; Dehestani et al. \u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e2022\u003c/span\u003e). Therefore, there has been a rapid shift from classical experimental design to machine learning-assisted concrete development, especially for materials whose development is not straightforward, such as self-healing concrete (Althoey et al. \u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e2022\u003c/span\u003e; Huang et al. \u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e2022\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eDespite the growing importance of machine learning in the development of self-healing concrete, experimental data on this subject are limited (Alabduljabbar et al. \u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e2023\u003c/span\u003e; Onyelowe et al. 
\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e2024\u003c/span\u003e). Machine learning algorithm performance is very sensitive to dataset size and availability (Janiesch et al. \u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e2021\u003c/span\u003e; Zhou \u003cspan citationid=\"CR46\" class=\"CitationRef\"\u003e2021\u003c/span\u003e; Pilania \u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e2021\u003c/span\u003e): a limited dataset can lead to underfit and inaccurate models. Data scarcity could therefore prevent the design and implementation of accurate machine learning models to predict the self-healing capacity of concrete with bacteria. For this reason, techniques such as data augmentation (Shorten and Khoshgoftaar \u003cspan citationid=\"CR40\" class=\"CitationRef\"\u003e2019\u003c/span\u003e; Mumuni and Mumuni \u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e2022\u003c/span\u003e) and synthetic data generation (Hong et al. \u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e2021\u003c/span\u003e) have been used to improve data quality and to obtain adequate and representative training data for ML when data on the subject are limited or of poor quality. Synthetic data are statistically generated from real sample data to increase the volume of data available for ML model training and development (Wendland et al. 
\u003cspan citationid=\"CR43\" class=\"CitationRef\"\u003e2022\u003c/span\u003e) and have been used to generate data in many fields of knowledge, such as chemistry, medicine, finance, marketing and engineering, when the availability of real data is limited (Alghamdi \u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2023\u003c/span\u003e; Gelernter et al. \u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e1990\u003c/span\u003e; Hittmeir et al. \u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e2019\u003c/span\u003e; Kr\u0026uuml;ger et al. 2024).\u003c/p\u003e\u003cp\u003eMoreover, because experimental data for developing machine learning models that predict the self-healing capacity of concrete with bacteria are still limited, collecting a large and diverse dataset may be impractical or expensive, which limits the development of robust models. In addition, a small and nondiverse dataset can lead to a data imbalance problem, causing the model to be biased and to perform poorly.\u003c/p\u003e \u003cp\u003eTherefore, given the limited data available on self-healing concrete with bacteria, this work aimed to (i) generate synthetic data, (ii) develop and compare machine learning models designed explicitly to classify a given concrete with a bacterial formulation as self-healing or not, (iii) select the best model or ML algorithm for this prediction and (iv) evaluate whether models trained with synthetic data can be successfully applied to real data.\u003c/p\u003e"},{"header":"2. Materials and Methods","content":"\u003cp\u003eAll the necessary code was written in Python 3 using the following packages: Pandas (v. 1.4.2), NumPy (v. 1.22.4), Scikit-Learn (v. 1.1.2), Matplotlib (v. 3.5.1), Seaborn (v. 0.2), SciPy (v. 
1.7.3), Synthetic Data Vault (v. 1.2.0) and SDMetrics (v. 0.10.1).\u003c/p\u003e \u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003e2.1 Dataset\u003c/h2\u003e \u003cp\u003eData on self-healing concrete containing bacteria in its formulation were collected from the literature. A total of 38 occurrences involving six \u003cem\u003eBacillus\u003c/em\u003e species, \u003cem\u003eBacillus cohnii\u003c/em\u003e, \u003cem\u003eBacillus subtilis\u003c/em\u003e, \u003cem\u003eBacillus megaterium\u003c/em\u003e, \u003cem\u003eBacillus sphaericus\u003c/em\u003e, \u003cem\u003eBacillus mucilaginosus\u003c/em\u003e and \u003cem\u003eBacillus pseudofirmus\u003c/em\u003e, were identified (Wiktor and Jonkers \u003cspan citationid=\"CR44\" class=\"CitationRef\"\u003e2011\u003c/span\u003e; Luo et al. \u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e2015\u003c/span\u003e; Gupta et al. \u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e2018\u003c/span\u003e; Feng et al. \u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e2019\u003c/span\u003e; Kumar Jogi and Vara Lakshmi \u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e2020\u003c/span\u003e; Su et al. \u003cspan citationid=\"CR41\" class=\"CitationRef\"\u003e2021\u003c/span\u003e; Huang et al. \u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e2022\u003c/span\u003e; Karthiga Shenbagam and Praveena \u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e2022\u003c/span\u003e; Nodehi et al. \u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e2022\u003c/span\u003e). The following information was extracted and tabulated to prepare the dataset: quantity of sand, aggregate, calcium lactate and plasticizer, all measured in kg/m\u003csup\u003e3\u003c/sup\u003e; ratio of cement/water; fissure size (mm); time to self-healing (days); and self-healing percentage. 
All the data were tabulated and recorded in a CSV table.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec4\" class=\"Section2\"\u003e \u003ch2\u003e2.2 Data preprocessing\u003c/h2\u003e \u003cp\u003eThe data were preprocessed, and the six bacterial species were encoded as integers: \u003cem\u003eBacillus cohnii\u003c/em\u003e (1), \u003cem\u003eBacillus subtilis\u003c/em\u003e (2), \u003cem\u003eBacillus megaterium\u003c/em\u003e (3), \u003cem\u003eBacillus sphaericus\u003c/em\u003e (4), \u003cem\u003eBacillus mucilaginosus\u003c/em\u003e (5) and \u003cem\u003eBacillus pseudofirmus\u003c/em\u003e (6). Moreover, because the ML models were developed to classify each concrete formulation as self-healing or not, whereas the original dataset expressed self-healing capacity as a percentage, it was necessary to encode each concrete formulation as capable or not capable of self-healing. We therefore adopted 70% as the cut-off: self-healing percentages of 70% and above were encoded as 1 (can self-heal) and percentages under 70% as 0 (cannot self-heal). This binary encoding allows the development of classification machine learning models. The encoded data were recorded in a CSV table. Furthermore, 9 of the 38 occurrences (approximately 24%) were removed from the CSV table and recorded in a separate spreadsheet for later testing of the trained models with real data.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec5\" class=\"Section2\"\u003e \u003ch2\u003e2.3 Synthetic Data Generation\u003c/h2\u003e \u003cp\u003eAs machine learning algorithms often require a large amount of data and because the data collected from the literature are insufficient for developing machine learning models, the data were subjected to synthetic data generation via the Synthetic Data Vault (Patki et al. \u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e2016\u003c/span\u003e). 
Synthetic Data Vault (SDV) is a Python framework for generating synthetic data from real data; that is, this tool can generate new data that statistically correlate to the given real data.\u003c/p\u003e \u003cp\u003eSynthetic data generation involves a range of techniques that can be employed to expand datasets with limited information (Hittmeir et al. \u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e2019\u003c/span\u003e). Therefore, this approach has been gaining popularity in ML. To create a dataset that is 10 times larger than the original set, 350 lines of data were generated based on the original dataset. To ensure the quality of the generated data and that it correlates with the original dataset, the synthetic dataset was further analysed using methods and tools provided by the SDV framework. Finally, the generated dataset was recorded into a CSV table.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec6\" class=\"Section2\"\u003e \u003ch2\u003e2.4 Development of the Machine Learning Models\u003c/h2\u003e \u003cp\u003eTo classify each concrete formulation as self-healing or not, ML classification models were developed and trained. Five ML methods were selected to develop the classification models: na\u0026iuml;ve Bayes (NB), logistic regression (LR), K-nearest neighbors (KNN), support vector classification (SVC) and random forest classification (RF). The dataset generated by SDV was split into training (80%) and test sets (20%) and subsequently used to train and test each ML model.\u003c/p\u003e \u003cdiv id=\"Sec7\" class=\"Section3\"\u003e \u003ch2\u003e2.4.1 Na\u0026iuml;ve Bayes\u003c/h2\u003e \u003cp\u003eNa\u0026iuml;ve Bayes is a machine learning classification technique that is based on probabilistic statistical modelling (Shields et al. \u003cspan citationid=\"CR39\" class=\"CitationRef\"\u003e2021\u003c/span\u003e). That is, it is based on the statistical probability of a certain target variable belonging or not belonging to a given class. 
In this work, the Na\u0026iuml;ve Bayes ML model was trained and tested using the Gaussian Na\u0026iuml;ve Bayes implementation of the Scikit-learn framework with the framework default parameters.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec8\" class=\"Section3\"\u003e \u003ch2\u003e2.4.2 Logistic Regression\u003c/h2\u003e \u003cp\u003eAlthough its name contains the word \u0026lsquo;regression\u0026rsquo;, this ML technique is in fact a classification technique. LR is based on estimating the probability of something being true or false, the details of which are well described in the literature (Bisong \u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e2019\u003c/span\u003e). Like NB, LR is a kind of statistical modelling that considers the probability of a true value versus the probability of a false value (Janiesch et al. \u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e2021\u003c/span\u003e; Zhou \u003cspan citationid=\"CR46\" class=\"CitationRef\"\u003e2021\u003c/span\u003e). This feature makes the technique well suited to predictive classification. In this work, a logistic regression (LR) ML model was trained and tested using the logistic regression implementation in the Scikit-Learn framework with the solver parameter set to \u0026lsquo;\u003cem\u003eliblinear\u003c/em\u003e\u0026rsquo; and the random state set to 10. The \u0026lsquo;\u003cem\u003eliblinear\u003c/em\u003e\u0026rsquo; solver was chosen because it is adequate for binary classification.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec9\" class=\"Section3\"\u003e \u003ch2\u003e2.4.3 K-nearest neighbors\u003c/h2\u003e \u003cp\u003eThe KNN algorithm is a nonparametric learning method for classification and regression. 
It is one of the many ML algorithms known as voting algorithms; that is, each data point is assigned to the class most common among its \u003cem\u003ek\u003c/em\u003e nearest neighbours, with nearness measured by Euclidean distance (Zheng and Tropsha \u003cspan citationid=\"CR45\" class=\"CitationRef\"\u003e2000\u003c/span\u003e; dos Santos Freitas et al. \u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e2022\u003c/span\u003e). As the performance of KNN depends upon the number of neighbours (\u003cem\u003ek\u003c/em\u003e), an optimization was conducted to find the optimal number of neighbours. This optimization was performed by training models using the KNN implementation present in Scikit-learn, varying the parameter \u003cem\u003en_neighbors\u003c/em\u003e from 1 to 50 with 10-fold cross-validation and measuring the accuracy score of each training and cross-validation run. During this process, the optimal \u003cem\u003ek\u003c/em\u003e value was found to be 26 (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e). Hence, the final KNN model was trained with a \u003cem\u003ek\u003c/em\u003e value equal to 26.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec10\" class=\"Section3\"\u003e \u003ch2\u003e2.4.4 Support Vector Classification\u003c/h2\u003e \u003cp\u003eThis model implements a support vector machine (SVM) algorithm. SVC provides a binary output (e.g., determining whether an object belongs to a specific class). Thus, SVC is appropriate for addressing classification problems and works by finding a hyperplane that separates the classes, nonlinearly remapping the data into a higher-dimensional space when needed so that the classes become linearly separable (Burges \u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e1998\u003c/span\u003e; Mammone et al. 
\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e2009\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eTo train the SVC model, it was first necessary to establish the required parameters for the calculations. Hence, the following parameters and their functions were defined as described in the literature: kernel, which is a mathematical function responsible for data transformations in SVM; C (regularization parameter), which controls the trade-off between maximizing the hyperplane margin and minimizing the training error term; gamma, the kernel coefficient of the polynomial, RBF and sigmoid kernels, which sets how far the influence of a single training example reaches; and degree, the degree of the polynomial kernel, which controls the complexity of the transformed space (Cervantes et al. \u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e2020\u003c/span\u003e). Thus, this work assessed the impact of different values assigned to these parameters on the model performance using the GridSearchCV method. The tested values for each of the necessary parameters are presented in Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eValues of the support vector classification (SVC) parameters.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"2\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eParameters\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eValues\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" 
colname=\"c1\"\u003e \u003cp\u003eKernel\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eLinear; Polynomial; RBF; Sigmoid\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eC\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e1; 5; 10; 100\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eDegree\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e1; 2; 3\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGamma\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0.01; 0.05; 0.1; 0.5; 1.0\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMax_iter\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e1000\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003ctfoot\u003e \u003ctr\u003e\u003ctd colspan=\"2\"\u003eNote: RBF\u0026thinsp;=\u0026thinsp;radial basis function kernel.\u003c/td\u003e\u003c/tr\u003e \u003c/tfoot\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eGridSearchCV automatically determines the optimal values for the required parameters. This tool operates through an iterative process, where each iteration defines a combination of parameter values and calculates a score, identifying the combination with the highest score as the best one. In this work, GridSearchCV was run with n_splits\u0026thinsp;=\u0026thinsp;5, test_size\u0026thinsp;=\u0026thinsp;0.2 and random_state\u0026thinsp;=\u0026thinsp;10. 
This means five cross-validation splits, each using 80% of the data for training and 20% for validation, whereas random_state fixes the random seed to ensure reproducibility. After running, the combination of parameter values that yielded the highest score for SVC was C\u0026thinsp;=\u0026thinsp;10, gamma\u0026thinsp;=\u0026thinsp;1 and a linear kernel with a maximum of 1000 iterations; note that gamma has no effect when the kernel is linear. Once the values were determined, the final SVC model was trained using those values.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec11\" class=\"Section3\"\u003e \u003ch2\u003e2.4.5 Random forest classification\u003c/h2\u003e \u003cp\u003eRandom forest classification is an ensemble machine learning technique that combines decision trees to predict the classification of a target response (Ehrman et al. \u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e2007\u003c/span\u003e). The predictive ability of random forest models can be tuned by adjusting certain parameters. In this study, two parameters were set. The first parameter is \"n_estimators\", which represents the number of trees in the forest. In this work, n_estimators\u0026thinsp;=\u0026thinsp;100 was selected. The second parameter is \"random_state\", which controls the bootstrap randomness. To ensure reproducibility, an integer value of 10 was adopted in this work.\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv id=\"Sec12\" class=\"Section2\"\u003e \u003ch2\u003e2.5 Performance comparison of ML models\u003c/h2\u003e \u003cp\u003eThe performance of the models was then compared. The quality of the model fit was assessed using several common measures of classification model performance: accuracy, precision, recall, F1-score and the confusion matrix. Accuracy measures the overall correctness of the model, that is, the fraction of all predictions that are correct. 
Precision is the fraction of the formulations predicted as self-healing that are actually self-healing. Recall is the fraction of the actual self-healing formulations that the model correctly identified. The F1-score is the harmonic mean of precision and recall and therefore accounts for true positives, false positives and false negatives in a single metric. The F1-Score ranges from 0 to 1, and higher values indicate better model performance. The confusion matrix is not a metric itself but rather a way to graphically show and interpret model performance. Moreover, to determine the influence of each concrete component (variable) on the self-healing capacity, we used the permutation_importance function in the sklearn.inspection package. Details on the methodology are described elsewhere (Pessoa et al. \u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e2024\u003c/span\u003e). Finally, each model was tested with real data to verify whether it correctly predicted the concrete self-healing capacity.\u003c/p\u003e \u003c/div\u003e"},{"header":"3. Results and Discussion","content":"\u003cdiv id=\"Sec14\" class=\"Section2\"\u003e \u003ch2\u003e3.1 Synthetic data quality analysis\u003c/h2\u003e \u003cp\u003eAn overall quality score of 79.43% was achieved. The column pair trends score, which measures how closely the synthetic data follow the correlations in the real data, was 88.18% (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e); a higher percentage indicates a higher quality of generated data. Figure\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e shows that the correlation between the synthetic and real data is high. Therefore, the synthetic data can reasonably represent the variety of formulations and self-healing capacity of the original data. 
Furthermore, a final data analysis was conducted to assess the coverage of the synthetic data and determine whether new data or copies of real data had been generated. The results were satisfactory, as the synthetic data covered over 90% of the range of possible values, and over 90% of the synthetic data comprised new data.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec15\" class=\"Section2\"\u003e \u003ch2\u003e3.2 Machine Learning Modelling\u003c/h2\u003e \u003cp\u003eThe results of ML modelling are summarized in Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e. The results show that RF is the best ML model, while NB and KNN are the worst. The NB model displayed only moderate performance in the test set (accuracy\u0026thinsp;=\u0026thinsp;0.712 and F1-Score\u0026thinsp;=\u0026thinsp;0.713) compared to the other models, indicating its limitation in handling the complexity of the underlying relationships. Compared with more advanced ML algorithms such as ensemble models, probabilistic models, such as na\u0026iuml;ve Bayes, are known to have poorer performance and limitations for nonlinear data relations (Cloutier and Sirois \u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e2008\u003c/span\u003e; Kumar Tipu et al. \u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e2022\u003c/span\u003e). Hence, this result suggests that such a probabilistic model is not adequate for this class of ML problems. The SVC and LR models have similar overall performances on both the training and test sets (LR: accuracy\u0026thinsp;=\u0026thinsp;0.765 and F1-Score\u0026thinsp;=\u0026thinsp;0.763 for the test set), with SVC performing slightly better (accuracy\u0026thinsp;=\u0026thinsp;0.787 and F1-Score\u0026thinsp;=\u0026thinsp;0.788 for the test set). 
As logistic regression models the log-odds of the class as a linear function of the features, this result can be explained by the type of kernel selected for SVC (linear). An SVC with a linear kernel behaves very similarly to logistic regression, although the two still differ since, similar to NB, logistic regression is considered a probabilistic algorithm. From a practical point of view, LR and SVC show comparable performance.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eComparison of the performances of the ML models on the training and test sets\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"10\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c8\" colnum=\"8\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c9\" colnum=\"9\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c10\" colnum=\"10\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eML Model\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colspan=\"4\" 
nameend=\"c5\" namest=\"c2\"\u003e \u003cp\u003eTrain\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colspan=\"5\" nameend=\"c10\" namest=\"c6\"\u003e \u003cp\u003eTest\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cem\u003eAccuracy\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e\u003cem\u003ePrecision\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u003cem\u003eRecall\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c6\" namest=\"c5\"\u003e \u003cp\u003e\u003cem\u003eF1-Score\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e\u003cem\u003eAccuracy\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e\u003cem\u003ePrecision\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e\u003cem\u003eRecall\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c10\"\u003e \u003cp\u003e\u003cem\u003eF1-Score\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eNB\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0.828\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.828\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.870\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c6\" namest=\"c5\"\u003e \u003cp\u003e0.828\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e 
\u003cp\u003e0.712\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e0.704\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e0.756\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c10\"\u003e \u003cp\u003e0.713\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eLR\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0.843\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.843\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.881\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c6\" namest=\"c5\"\u003e \u003cp\u003e0.844\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e0.765\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e0.750\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e0.805\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c10\"\u003e \u003cp\u003e0.763\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eKNN\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0.790\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.798\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.830\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c6\" namest=\"c5\"\u003e \u003cp\u003e0.791\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e0.712\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e0.695\u003c/p\u003e 
\u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e0.780\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c10\"\u003e \u003cp\u003e0.714\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eSVC\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0.865\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.868\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.892\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c6\" namest=\"c5\"\u003e \u003cp\u003e0.865\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e0.787\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e0.773\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e0.830\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c10\"\u003e \u003cp\u003e0.788\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eRF\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colspan=\"2\" nameend=\"c6\" namest=\"c5\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e0.863\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e0.895\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e0.830\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" 
colname=\"c10\"\u003e \u003cp\u003e0.863\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eKNN shows the worst performance during training (accuracy\u0026thinsp;=\u0026thinsp;0.790 and F1-Score\u0026thinsp;=\u0026thinsp;0.791), while its performance for the test set is essentially equal to that of NB (accuracy\u0026thinsp;=\u0026thinsp;0.712 and F1-Score\u0026thinsp;=\u0026thinsp;0.714). As KNN performance depends upon the similarity of the data, predictions can be difficult if many differences in the data exist, including nonlinearity of the data.\u003c/p\u003e \u003cp\u003eRF outperforms every other model in both training and testing. It scores 1.000 on all the metrics during training, which would normally be indicative of overfitting (Janiesch et al. \u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e2021\u003c/span\u003e), but the test performance (accuracy\u0026thinsp;=\u0026thinsp;0.863 and F1-Score\u0026thinsp;=\u0026thinsp;0.863) does not corroborate this hypothesis. Thus, RF can predict the self-healing capacity of a concrete formulation with a high degree of confidence. The superior performance of RF in both training and testing arises because RF is an ensemble method that combines multiple decision trees. Consequently, its predictive performance improves on that of a single decision tree through a process known in ML as voting, which combines several independent learners to make an overall more accurate prediction than that of an individual model (Cook et al. \u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e2019\u003c/span\u003e; Dehestani et al. 
\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e2022\u003c/span\u003e).\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec16\" class=\"Section2\"\u003e \u003ch2\u003e3.3 Performance of the ML models on real data\u003c/h2\u003e \u003cp\u003eTable\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e shows the performance of the trained ML models when applied to real data. This can also be considered a kind of validation to explore and understand how much ML models trained on synthetic data are reliable when applied to real data.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab3\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eComparison of the performances of the ML models on real data\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"6\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eML model\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cem\u003eAccuracy\u003c/em\u003e\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003e\u003cem\u003ePrecision\u003c/em\u003e\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e 
\u003cp\u003e\u003cem\u003eF1-Score\u003c/em\u003e\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003e\u003cem\u003eRecall (positive)\u003c/em\u003e\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003e\u003cem\u003eRecall\u003c/em\u003e\u003c/p\u003e \u003cp\u003e\u003cem\u003e(negative)\u003c/em\u003e\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eNB\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0.666\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.625\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.605\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.250\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eLR\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0.777\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.714\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.759\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.500\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eKNN\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0.666\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e 
\u003cp\u003e0.625\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.605\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.250\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eSVC\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eRF\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e1.000\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eThe results show that nonprobabilistic models (SVC and RF) perform better than probabilistic models (NB and LR) and that KNN has poor performance, comparable to NB. 
A series of confusion matrices is shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e to compare how each model performs on real data. Figure\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e shows that all the ML models correctly classified the 5 real self-healing concrete samples as self-healing, whereas only the SVC and RF models correctly classified all 4 non-self-healing concrete samples. Hence, the KNN, LR and NB models cannot reliably classify non-self-healing concrete. Table\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e and Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e show that NB and KNN correctly classified only 25% (1 of 4) of the non-self-healing concrete samples (negative values), while LR performed better, correctly classifying 50% (2 of 4) of the negative values. In other words, 75% of the non-self-healing concrete was incorrectly classified by the NB and KNN models, and 50% of the non-self-healing concrete was incorrectly classified by the LR model. Support vector machines and random forests are ML algorithms that excel in dealing with nonlinear associations within the data and typically outperform probabilistic and linear algorithms (Fan et al. \u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e2008\u003c/span\u003e; Niazi et al. \u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e2008\u003c/span\u003e). These algorithms can capture nonlinear data associations and interactions and provide good or superior performance even with little parameter tuning. 
Moreover, they are less susceptible to overfitting, yielding accurate predictions even when confronted with limited information.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eThe superior performance of both SVC and RF is evident from the correct classification of all instances of both non-self-healing concrete (negative values) and self-healing concrete (positive values). Similar performances for ensemble methods, with accuracies greater than 0.90, have been reported in the literature for different kinds of self-healing concrete, including formulations with bacteria (Zhuang and Zhou \u003cspan citationid=\"CR47\" class=\"CitationRef\"\u003e2019\u003c/span\u003e; Huang et al. \u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e2022\u003c/span\u003e; Alabduljabbar et al. \u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e2023\u003c/span\u003e). However, this work outperformed previous works by correctly classifying all nine instances in a real-data test.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec17\" class=\"Section2\"\u003e \u003ch2\u003e3.4 Influence of the independent variable sensitivity\u003c/h2\u003e \u003cp\u003eThe relative importance of each variable (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e) shows that the cement/water ratio (FACTOR), lactate content, aggregate content and time have a direct influence on the self-healing capacity of concrete.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eRegardless of the type of bacteria present in the concrete formulation, self-healing capacity is induced by the bacterial production of calcium lactate, which is further converted to calcium carbonate (Luo et al. \u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e2015\u003c/span\u003e; Rong et al. \u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e2020\u003c/span\u003e). 
Thus, it is not surprising that LACTATE has a greater influence on machine learning modelling than does the type of bacteria: the importance of the bacterial type decreases because the bacteria produce lactate, which is the final factor responsible for self-healing. The cement/water ratio also has great importance for self-healing since the rate of bacterial activation depends upon the presence of water (Luo et al. \u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e2015\u003c/span\u003e), and even concrete without any bacteria can show a small self-healing capacity (Zhuang and Zhou \u003cspan citationid=\"CR47\" class=\"CitationRef\"\u003e2019\u003c/span\u003e). The time in days and the fissure size in mm are also of utmost importance in self-healing modelling and prediction.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec18\" class=\"Section2\"\u003e \u003ch2\u003e3.5 Limitations and contributions of this work\u003c/h2\u003e \u003cp\u003eThe main contribution of this work is to propose the use of synthetic data and to compare the performances of ML models trained on synthetic data. Thus, this work contributes to the development of AI and ML tools for concrete design and other fields of industrial engineering. In addition, the synthetic data generation and ML design methodology in this paper can be adopted for wider application in civil engineering. The omission of other variables that influence crack healing, such as the carrier of bacteria and the nutrient medium of bacteria, is a limitation not only of the present work but also of previous works since experimental data on the subject are scarce (Zhuang and Zhou \u003cspan citationid=\"CR47\" class=\"CitationRef\"\u003e2019\u003c/span\u003e; Feng et al. \u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e2019\u003c/span\u003e; Huang et al. \u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e2022\u003c/span\u003e; Pessoa et al. 
\u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e2024\u003c/span\u003e).\u003c/p\u003e \u003c/div\u003e"},{"header":"4. Conclusions","content":"\u003cp\u003eThe generation and utilization of synthetic data have proven useful in overcoming limitations due to the scarcity of experimental data in the field of concrete self-healing driven by bacteria. Synthetic data generation substantially expanded the initial dataset, enabling the development and testing of machine learning models with enhanced accuracy and reliability, ensuring that those models captured a wider range of possible scenarios and variations in concrete formulations. Analysis and comparison of multiple ML algorithms demonstrated that ensemble methods, particularly random forest, outperformed probabilistic models in predicting concrete self-healing capacity. This approach not only validated the utility of synthetic data but also showed its potential to advance predictive modelling in civil engineering, particularly where experimental data availability is limited. Furthermore, the key factors influencing concrete self-healing, including the cement-to-water ratio, calcium lactate concentration, aggregate composition, and curing time, were identified. These findings contribute to the growing use of machine learning and AI in the field of concrete research and show the transformative impact of synthetic data in driving innovation and addressing critical challenges in civil engineering.\u003c/p\u003e"},{"header":"Declarations","content":"\u003ch2\u003eDisclosure statement\u003c/h2\u003e\n\u003cp\u003eThe authors report there are no competing interests to declare.\u003c/p\u003e\u003ch2\u003eAuthor Contribution\u003c/h2\u003e\u003cp\u003eF.S.O. Experimental Design and Development. Data Analysis. Wrote the first draft of the manuscript. R.S. Data Curation. Data Analysis. Review and improvement of the manuscript. 
All authors read and approved the final manuscript.\u003c/p\u003e\u003ch2\u003eData Availability\u003c/h2\u003e\u003cp\u003eAll data and code are available on the corresponding author's GitHub page.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eAlabduljabbar H, Khan K, Awan HH et al (2023) Modeling the capacity of engineered cementitious composites for self-healing using AI-based ensemble techniques. Case Stud Constr Mater 18. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.cscm.2022.e01805\u003c/span\u003e\u003cspan address=\"10.1016/j.cscm.2022.e01805\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAlghamdi SJ (2023) Prediction of Concrete\u0026rsquo;s Compressive Strength via Artificial Neural Network Trained on Synthetic Data. Eng Technol Appl Sci Res 13:12404\u0026ndash;12408. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.48084/etasr.6560\u003c/span\u003e\u003cspan address=\"10.48084/etasr.6560\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAlthoey F, Amin MN, Khan K et al (2022) Machine learning based computational approach for crack width detection of self-healing concrete. Case Stud Constr Mater 17:e01610. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.cscm.2022.e01610\u003c/span\u003e\u003cspan address=\"10.1016/j.cscm.2022.e01610\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBisong E (2019) Logistic Regression. In: Bisong E (ed) Building Machine Learning and Deep Learning Models on Google Cloud Platform: A Comprehensive Guide for Beginners. 
A, Berkeley, CA, pp 243\u0026ndash;250\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBurges CJC (1998) A tutorial on support vector machines for pattern recognition. Data Min Knowl Discov 2:121\u0026ndash;167. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1023/A:1009715923555\u003c/span\u003e\u003cspan address=\"10.1023/A:1009715923555\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCervantes J, Garcia-Lamont F, Rodr\u0026iacute;guez-Mazahua L, Lopez A (2020) A comprehensive survey on support vector machine classification: Applications, challenges and trends. Neurocomputing 408:189\u0026ndash;215. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.neucom.2019.10.118\u003c/span\u003e\u003cspan address=\"10.1016/j.neucom.2019.10.118\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBen Chaabene W, Flah M, Nehdi ML (2020) Machine learning prediction of mechanical properties of concrete: Critical review. Constr Build Mater 260:1\u0026ndash;18. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.conbuildmat.2020.119889\u003c/span\u003e\u003cspan address=\"10.1016/j.conbuildmat.2020.119889\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCloutier LM, Sirois S (2008) Bayesian versus Frequentist statistical modeling: a debate for hit selection from HTS campaigns. Drug Discov Today 13:536\u0026ndash;542. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.drudis.2008.03.022\u003c/span\u003e\u003cspan address=\"10.1016/j.drudis.2008.03.022\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCook R, Lapeyre J, Ma H et al (2019) Prediction of Compressive Strength of Concrete: Critical Comparison of Performance of a Hybrid Machine Learning Model with Standalone Models. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1061/(ASCE)\u003c/span\u003e\u003cspan address=\"10.1061/(ASCE)\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDehestani A, Kazemi F, Abdi R, Nitka M (2022) Prediction of fracture toughness in fibre-reinforced concrete, mortar, and rocks using various machine learning techniques. Eng Fract Mech 276. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.engfracmech.2022.108914\u003c/span\u003e\u003cspan address=\"10.1016/j.engfracmech.2022.108914\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003edos Santos Freitas MM, Barbosa JR, dos Santos Martins EM et al (2022) KNN algorithm and multivariate analysis to select and classify starch films. Food Packag Shelf Life 34. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.fpsl.2022.100976\u003c/span\u003e\u003cspan address=\"10.1016/j.fpsl.2022.100976\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eEhrman TM, Barlow DJ, Hylands PJ (2007) Virtual screening of Chinese herbs with Random Forest. J Chem Inf Model 47:264\u0026ndash;278. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1021/ci600289v\u003c/span\u003e\u003cspan address=\"10.1021/ci600289v\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFan R-E, Chang K-W, Hsieh C-J et al (2008) LIBLINEAR: A Library for Large Linear Classification\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFeng DC, Liu ZT, Wang XD et al (2020) Machine learning-based compressive strength prediction for concrete: An adaptive boosting approach. Constr Build Mater 230. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.conbuildmat.2019.117000\u003c/span\u003e\u003cspan address=\"10.1016/j.conbuildmat.2019.117000\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFeng J, Su Y, Qian C (2019) Coupled effect of PP fiber, PVA fiber and bacteria on self-healing efficiency of early-age cracks in concrete. Constr Build Mater 228:116810. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/J.CONBUILDMAT.2019.116810\u003c/span\u003e\u003cspan address=\"10.1016/J.CONBUILDMAT.2019.116810\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGelernter H, Rose JR, Chen C (1990) Building and Refining a Knowledge Base for Synthetic Organic Chemistry via the Methodology of Inductive and Deductive Machine Learning EXTRACTING REACTION SCHEMATA FROM A DATABASE VIA INDUCTIVE AND DEDUCTIVE GENERALIZATION Building a Synthetic Chemistry Knowledge Base\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGupta S, Kua HW, Pang SD (2018) Healing cement mortar by immobilization of bacteria in biochar: An integrated approach of self-healing and carbon sequestration. 
Cem Concr Compos 86:238\u0026ndash;254. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.cemconcomp.2017.11.015\u003c/span\u003e\u003cspan address=\"10.1016/j.cemconcomp.2017.11.015\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHittmeir M, Ekelhart A, Mayer R (2019) On the Utility of Synthetic Data: An Empirical Evaluation on Machine Learning Tasks. In: Proceedings of the 14th International Conference on Availability, Reliability and Security. Association for Computing Machinery, pp 1\u0026ndash;6\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHong Y, Park S, Kim H, Kim H (2021) Synthetic data generation using building information models. Autom Constr 130:103871. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.autcon.2021.103871\u003c/span\u003e\u003cspan address=\"10.1016/j.autcon.2021.103871\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHossain MR, Sultana R, Patwary MM et al (2022) Self-healing concrete for sustainable buildings. A review. Environ Chem Lett 20:1265\u0026ndash;1273\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHuang X, Sresakoolchai J, Qin X et al (2022) Self-Healing Performance Assessment of Bacterial-Based Concrete Using Machine Learning Approaches. Materials 15. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.3390/ma15134436\u003c/span\u003e\u003cspan address=\"10.3390/ma15134436\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eJaniesch C, Zschech P, Heinrich K (2021) Machine learning and deep learning. Electron Markets 31:685\u0026ndash;695.
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1007/s12525-021-00475-2/Published\u003c/span\u003e\u003cspan address=\"10.1007/s12525-021-00475-2/Published\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKarthiga Shenbagam N, Praveena R (2022) Performance of bacteria on self-healing concrete and its effects as carrier. Mater Today Proc 65:1987\u0026ndash;1989. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.matpr.2022.05.322\u003c/span\u003e\u003cspan address=\"10.1016/j.matpr.2022.05.322\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKr\u0026uuml;ger, Marius, Vogel-Heuser B, HD, WJ, PT, PD, CS and KC (2024) Synthetic Data Generation for the Enrichment of Civil Engineering Machine Data. Fottner Johannes and N\u0026uuml;bel K and MD (ed) Construction Logistics, Equipment, and Robotics. Springer Nature Switzerland, Cham, pp 166\u0026ndash;175\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKumar Jogi P, Vara Lakshmi TVS (2020) Self healing concrete based on different bacteria: A review. In: Materials Today: Proceedings. Elsevier Ltd, pp 1246\u0026ndash;1252\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKumar Tipu R, Panchal VR, Pandya KS (2022) An ensemble approach to improve BPNN model precision for predicting compressive strength of high-performance concrete. Structures 45:500\u0026ndash;508. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.istruc.2022.09.046\u003c/span\u003e\u003cspan address=\"10.1016/j.istruc.2022.09.046\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLi VC, Yang E-H (2007) Self Healing in Concrete Materials. In: van der Zwaag S (ed) Self healing materials: an alternative approach to 20 centuries of materials science. Springer, pp 161\u0026ndash;193\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLuo M, Qian CX, Li RY (2015) Factors affecting crack repairing capacity of bacteria-based self-healing concrete. Constr Build Mater 87:1\u0026ndash;7. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.conbuildmat.2015.03.117\u003c/span\u003e\u003cspan address=\"10.1016/j.conbuildmat.2015.03.117\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMammone A, Turchi M, Cristianini N (2009) Support vector machines. Wiley Interdiscip Rev Comput Stat 1:283\u0026ndash;289\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMumuni A, Mumuni F (2022) Data augmentation: A comprehensive survey of modern approaches. Array 16\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNiazi A, Jameh-Bozorghi S, Nori-Shargh D (2008) Prediction of toxicity of nitrobenzenes using ab initio and least squares support vector machines. J Hazard Mater 151:603\u0026ndash;609. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.jhazmat.2007.06.030\u003c/span\u003e\u003cspan address=\"10.1016/j.jhazmat.2007.06.030\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNodehi M, Ozbakkaloglu T, Gholampour A (2022) A systematic review of bacteria-based self-healing concrete: Biomineralization, mechanical, and durability properties. J Building Eng 49\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eOchi T, Okubo S, Fukui K (2007) Development of recycled PET fiber and its application as concrete-reinforcing fiber. Cem Concr Compos 29:448\u0026ndash;455. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.cemconcomp.2007.02.002\u003c/span\u003e\u003cspan address=\"10.1016/j.cemconcomp.2007.02.002\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eOnyelowe KC, Adam AFH, Ulloa N et al (2024) Modeling the influence of bacteria concentration on the mechanical properties of self-healing concrete (SHC) for sustainable bio-concrete structures. Sci Rep 14. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1038/s41598-024-58666-8\u003c/span\u003e\u003cspan address=\"10.1038/s41598-024-58666-8\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePatki N, Wedge R, Veeramachaneni K (2016) The synthetic data vault. In: Proceedings \u0026ndash;\u0026thinsp;3rd IEEE International Conference on Data Science and Advanced Analytics, DSAA 2016. 
Institute of Electrical and Electronics Engineers Inc., pp 399\u0026ndash;410\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePessoa CLE, Peres Silva VH, Stefani R (2024) Prediction of the self-healing properties of concrete modified with bacteria and fibers using machine learning. Asian J Civil Eng 25:1801\u0026ndash;1810. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1007/s42107-023-00878-w\u003c/span\u003e\u003cspan address=\"10.1007/s42107-023-00878-w\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePilania G (2021) Machine learning in materials science: From explainable predictions to autonomous design. Comput Mater Sci 193:110360. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/J.COMMATSCI.2021.110360\u003c/span\u003e\u003cspan address=\"10.1016/J.COMMATSCI.2021.110360\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRong H, Wei G, Ma G et al (2020) Influence of bacterial concentration on crack self-healing of cement-based materials. Constr Build Mater 244:118372. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.conbuildmat.2020.118372\u003c/span\u003e\u003cspan address=\"10.1016/j.conbuildmat.2020.118372\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eShields BJ, Stevens J, Li J et al (2021) Bayesian reaction optimization as a tool for chemical synthesis. Nature 590:89\u0026ndash;96. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1038/s41586-021-03213-y\u003c/span\u003e\u003cspan address=\"10.1038/s41586-021-03213-y\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eShorten C, Khoshgoftaar TM (2019) A survey on Image Data Augmentation for Deep Learning. J Big Data. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1186/s40537-019-0197-0\u003c/span\u003e\u003cspan address=\"10.1186/s40537-019-0197-0\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. 6:\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSu Y, Qian C, Rui Y, Feng J (2021) Exploring the coupled mechanism of fibers and bacteria on self-healing concrete from bacterial extracellular polymeric substances (EPS). Cem Concr Compos 116:103896. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/J.CEMCONCOMP.2020.103896\u003c/span\u003e\u003cspan address=\"10.1016/J.CEMCONCOMP.2020.103896\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eTalaiekhozan A, Keyvanfar A, Shafaghat A et al (2014) A Review of Self-healing Concrete Research Development. J Environ Treat Techniques 2:1\u0026ndash;11\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWendland P, Birkenbihl C, Gomez-Freixa M et al (2022) Generation of realistic synthetic data using Multimodal Neural Ordinary Differential Equations. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1038/s41746-022-00666-x\u003c/span\u003e\u003cspan address=\"10.1038/s41746-022-00666-x\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. 
NPJ Digit Med 5:\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWiktor V, Jonkers HM (2011) Quantification of crack-healing in novel bacteria-based self-healing concrete. Cem Concr Compos 33:763\u0026ndash;770. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.cemconcomp.2011.03.012\u003c/span\u003e\u003cspan address=\"10.1016/j.cemconcomp.2011.03.012\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZheng W, Tropsha A (2000) Novel variable selection quantitative structure\u0026ndash;property relationship approach based on the k-nearest-neighbor principle. J Chem Inf Comput Sci 40:185\u0026ndash;194\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhou Z-H (2021) Machine Learning, 1st edn. Springer Singapore\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhuang X, Zhou S (2019) The prediction of self-healing capacity of bacteria-based concrete using machine learning approaches. Computers Mater Continua 59:57\u0026ndash;77. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.32604/cmc.2019.04589\u003c/span\u003e\u003cspan address=\"10.32604/cmc.2019.04589\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZiolkowski P, Niedostatkiewicz M (2019) Machine learning techniques in concrete mix design. Materials 12.
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.3390/ma12081256\u003c/span\u003e\u003cspan address=\"10.3390/ma12081256\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":true,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"self-healing, concrete, bacteria, synthetic data, machine learning","lastPublishedDoi":"10.21203/rs.3.rs-4668609/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-4668609/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eThis work investigated the use of synthetic data to overcome the limitations of scarce experimental data in predicting the self-healing capacity of bacteria-driven concrete. We generated a synthetic dataset based on real-world data, significantly expanding the original dataset and then trained and compared machine learning models, including probabilistic and ensemble methods, to predict the concrete self-healing capacity. 
The results demonstrate that the ensemble methods, particularly the random forest (RF) method (accuracy\u0026thinsp;=\u0026thinsp;0.863 and F1-score\u0026thinsp;=\u0026thinsp;0.863), outperformed the probabilistic models and achieved high accuracy in predicting self-healing capacity. The trained models were further applied to real-world data examples, showing high accuracy. This research validates the utility of synthetic data in predicting modelling accuracy and reliability in civil engineering, particularly in areas with limited experimental data. The findings contribute to the growing use of ML and AI in concrete research and demonstrate the transformative potential of synthetic data in addressing challenges in civil engineering.\u003c/p\u003e","manuscriptTitle":" On the use of Synthetic Data for Machine Learning prediction of Self-Healing Capacity of Concrete","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2024-07-31 11:54:28","doi":"10.21203/rs.3.rs-4668609/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"bf1e2be7-e4ca-41f8-ba8e-53ddb1cde112","owner":[],"postedDate":"July 31st, 2024","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[],"tags":[],"updatedAt":"2024-11-09T12:38:39+00:00","versionOfRecord":[],"versionCreatedAt":"2024-07-31
11:54:28","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-4668609","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-4668609","identity":"rs-4668609","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
