Machine Learning-Driven Models to Predict the optimum Genotype and Planting Date on yield and phytochemical Traits in Roselle (Hibiscus sabdariffa L.)

doi:10.21203/rs.3.rs-6917293/v1


2025 · doi:10.21203/rs.3.rs-6917293/v1

preprint OA: closed CC-BY-4.0



Warqaa Muhammed Shariff Al-Sheikh, Fazilat Fakhrzad, Mohammed M. Mohammed, and 1 more

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-6917293/v1
This work is licensed under a CC BY 4.0 License
Status: Under Review
Version 1 posted 12

Abstract

Accurate prediction and optimization of morphological traits in Roselle are essential for enhancing crop productivity and adaptability to diverse environments.
In the present study, a machine learning framework was developed using Random Forest (RF) and Multi-layer Perceptron (MLP) algorithms to model and predict four key morphological traits (branch number, growth period, boll number, and seed number per plant) based on genotype and planting date. The dataset was generated from a field experiment involving ten Roselle genotypes and five planting dates. Both RF and MLP exhibited robust predictive capabilities; however, RF (R² = 0.84) outperformed MLP (R² = 0.80), underscoring its efficacy in capturing the nonlinear genotype-by-environment interactions. Permutation-based feature importance analysis further revealed that planting date had a greater impact on trait variation than genotype. To identify optimal combinations of genotype and planting date for maximizing morphological traits, the RF model was integrated with the Non-dominated Sorting Genetic Algorithm II (NSGA-II). According to the RF–NSGA-II optimization results, the optimal values for branch number (26), growth period (176 days), boll number (116), and seed number per plant (1517) were achieved with the Qaleganj population planted on May 5. Collectively, these findings highlight the potential of integrating machine learning and evolutionary optimization algorithms as computational tools for crop improvement and agronomic decision-making.

Subject areas: Biological sciences/Ecology; Biological sciences/Plant sciences
Keywords: Machine Learning techniques; prediction; optimization algorithm; Multi-layer Perceptron; Random Forest

Introduction

Roselle (Hibiscus sabdariffa L.) is an annual herbaceous plant belonging to the Malvaceae family that is widely cultivated in tropical and subtropical regions due to its high adaptability and drought tolerance. Various parts of the plant, including the flowers, stem fiber, leaves, seeds, fruits, and roots, serve multiple purposes in the food and medicinal industries.
Among the various parts of the plant, the bright red fleshy calyx holds the greatest economic value; it contains vitamin C, iron, beta-carotene, anthocyanins, and phenolic compounds (Ali et al., 2005). Additionally, roselle seeds contain high levels of protein, vitamin E, and unsaturated fatty acids such as oleic and linoleic acids (Zand-Silakhoor et al., 2022). In recent years, global demand for roselle and its processed products has increased steadily, particularly in the health food and natural product markets (Mahunu, 2021).

Plant growth and development are influenced by a variety of environmental and genetic factors. Among these, the selection of suitable genotypes and the determination of optimal planting dates are critical agronomic decisions that directly impact plant performance (Mahunu, 2021). Genotype selection allows breeders to identify plant varieties with favorable characteristics such as early maturity, high yield, and resistance to environmental stress (Mulyaningsih et al., 2025). Research on different genotypes of roselle indicates that these traits have high heritability and are largely influenced by genetic factors (Ibrahim and Hussein, 2006). In this regard, Richardson and Arlotta (2021) reported significant differences among genotypes in flower and leaf production and in nutrient density. Ullah (2025) also showed that genetic differences among genotypes can significantly affect traits such as plant height, number of branches, fruit weight, and number of fruits per plant. Simultaneously, the planting date must align with regional climatic conditions to ensure proper germination, growth, and flowering, thereby enhancing productivity. Multiple studies have shown that delayed sowing can significantly reduce crop performance, including plant height, number of bolls, biomass, and calyx and seed yield.
Early planting also leads to a longer growth period and more desirable growth traits (El-Sagher et al., 2022). Aung and Uape (2022) demonstrated that planting date significantly affects the growth and yield of Roselle. They reported that planting in July produced the best results in terms of early flowering, increased flower bud count, and longer calyces. El-Sagher et al. (2022) reported that planting on May 15 led to the best performance in terms of vegetative growth (plant height, stem diameter, number of branches and fruits) and yield components (fresh and dry weights of shoots, calyces, and seeds). However, the interaction between genotype and planting date is complex and non-linear, requiring advanced modeling approaches to accurately predict outcomes and make informed recommendations.

Traditional statistical methods, while useful, often fall short in capturing the non-linear interactions and high-dimensional relationships among multiple plant traits and environmental variables. In contrast, machine learning (ML) techniques offer powerful alternatives by learning patterns from historical data without relying on predefined models (Fakhrzad et al., 2022). Among these, Random Forest (RF) and Multi-layer Perceptron (MLP) are popular ML algorithms used in agricultural modeling due to their flexibility and robustness. MLP, a type of feed-forward artificial neural network (ANN), is particularly known for its ability to model highly nonlinear functions due to its architecture of interconnected neurons and hidden layers (Fakhrzad et al., 2022). RF is a widely used machine learning algorithm known for its simplicity, robust performance, and reliable predictive capabilities. Its flexibility in handling both classification and regression problems has made it a valuable tool in numerous fields, including biological and agricultural research.
RF models are particularly effective in processing complex datasets characterized by noise, imbalance, and high dimensionality, and they are also recognized for their ability to mitigate overfitting (Yoosefzadeh-Najafabadi et al., 2021a).

In biosystems, prediction alone is insufficient, and decision-making often involves balancing multiple conflicting objectives. Among the various optimization approaches, evolutionary algorithms, particularly the Non-dominated Sorting Genetic Algorithm II (NSGA-II), address this challenge. Unlike single-objective genetic algorithms, NSGA-II handles multiple objectives simultaneously and identifies a set of Pareto-optimal solutions (the Pareto front), representing the best trade-offs between conflicting goals. These solutions offer decision-makers a spectrum of options, enabling tailored strategies based on specific priorities and constraints. The integration of ML models with NSGA-II provides a synergistic framework that leverages the strengths of both approaches (Fakhrzad et al., 2022; Zarbakhsh and Shahsavar, 2022). Recently, an ANN–NSGA-II model was used to predict and optimize pomegranate morphological traits under salinity and drought stress as influenced by γ-aminobutyric acid. This approach helped identify optimal treatment conditions and provided insights into improving stress tolerance (Zarbakhsh and Shahsavar, 2022). Moreover, Yoosefzadeh-Najafabadi et al. (2021b) combined machine learning algorithms and genetic optimization to model and optimize soybean yield based on its component traits. This integrated approach provided a better understanding of the relationships between yield and its morphological components, and can be used in selecting parental lines and designing crosses aimed at improving the genetic yield potential of soybean cultivars.
The present study aims to develop an integrated framework for predicting and optimizing the main agronomic traits of Roselle, focusing on four key traits: the number of branches, growth period, number of bolls, and seeds per plant. Using a comprehensive dataset collected from a field experiment, we trained and compared the performance of multiple ML models, followed by the application of NSGA-II for multi-objective optimization. While ML models predict morphological traits based on genotype and planting date, NSGA-II enables the optimization of these traits under a multi-objective setting, assisting in the selection of the best genotype–planting date combinations that meet multiple morphological targets simultaneously. The novelty of this study lies in its data-driven approach to simultaneously address trait prediction and optimal genotype-date selection. Specifically, the objectives are to: (1) examine how different planting dates and genotype combinations influence roselle's morphological performance, (2) identify stable and high-performing genotypes across diverse environmental conditions, (3) develop predictive machine learning models that can accurately forecast key traits based on genotype and planting date, and (4) employ multi-objective optimization algorithms to recommend the most effective genotype–planting date combinations that maximize crop yield. The insights derived from this model can inform breeding programs and cultivation planning, contributing to sustainable and high-yielding Roselle production systems.

Materials and methods

2.1. Plant materials and experimental design

The field experiment was conducted in Dalgan (27° 28′ N, 59° 27′ E; 389 m a.s.l.), located in Sistan and Baluchestan province, southeast Iran. The region has a hot-arid climate, with minimal rainfall and high summer temperatures. The soil texture was loam with a pH of 7.6 and EC of 2.12 dS m⁻¹.
The study employed a factorial experimental design based on a randomized complete block design (RCBD) with three replications. The experimental treatments consisted of ten roselle genotypes: eight native accessions (Jiroft, Dalgan, Bampoor, Iranshahr, Nikshahr, Roodbar, Saravan, and Qaleganj) collected from different agro-ecological zones of Iran, along with two exotic landraces, HA (originating from Ghana) and HS-24 (from Bangladesh), both supplied by the Jiroft Agricultural Research Station. The ten genotypes were sown on five different planting dates: March 6, April 6, May 5, June 5, and July 1.

2.2. Morphological trait measurement

Morphological trait measurements were performed at the physiological maturity stage, defined as the point when approximately 70% of the seeds within each flower capsule had browned and the sepals had reached their maximum development. The following traits were recorded for each genotype and planting date treatment: plant height, number of branches per plant, fruit length, number of bolls per plant, fresh sepal weight per plant, number of seeds per capsule, seed weight per plant, 1000-seed weight, biomass yield, harvest index, calyx yield, and growth period (life cycle duration). Sepals were harvested at maturity and dried at ambient room temperature (~30°C) for further chemical analysis. Pearson correlation analysis was conducted using Python (version 3.11.12) to evaluate the relationships between input variables (genotype and planting date) and output traits, as well as their interactions.

Data pre-processing and statistical analyses

The dataset included ten genotypes and five planting dates, with four primary output traits: number of branches, growth period, number of bolls, and seed yield per plant. Input features were encoded using one-hot encoding, and the output variables were normalized using z-score standardization. Outlier removal was performed using IQR and Z-score techniques to enhance data quality.
The dataset was split into 80% for training and 20% for testing. Prior to model training, a two-step statistical analysis was conducted to evaluate the relevance of each phenotypic trait with respect to the input variables (genotype and planting date). First, Pearson correlation coefficients were computed to assess the linear relationship between inputs and each target. Subsequently, a two-way ANOVA was performed for each target to test the statistical significance of three effects: genotype, planting date, and their interaction.

Hyperparameter Optimization in ML Models

In machine learning, tuning of model hyperparameters is a critical step that significantly affects predictive performance and generalization ability. In this study, a structured grid search algorithm, Grid Search Cross Validation (GridSearchCV), was used for hyperparameter tuning of the two machine learning models, MLP and RF, combined with 10-fold cross-validation. During the K-fold cross-validation process (K = 10), the dataset was partitioned into 10 subsets. Each subset served once as a validation set while the remaining subsets were used to train the model. This approach ensured that all data points contributed to both training and validation, thereby reducing the risk of overfitting or underfitting. The hyperparameter combinations yielding the highest cross-validated R² scores and lowest generalization errors on the test set were selected as the optimal configurations.

2.3. Description of ML models and optimization algorithm

Model Development

In this study, a supervised machine learning framework was developed to predict agronomic traits of roselle as influenced by genotype and planting date. Two models, Multilayer Perceptron (MLP) and Random Forest (RF), were implemented within a unified scikit-learn (Pedregosa et al., 2011) pipeline structure.
These models were selected due to their proven effectiveness in handling nonlinear relationships and complex feature interactions in plant trait prediction tasks. Each model was trained and evaluated based on multi-output regression performance (Fig. 1).

RF Model

The RF model was implemented as a robust ensemble-based learning method for multi-output regression to predict agronomic traits of roselle plants. RF operates by constructing an ensemble of decision trees, each trained on a bootstrapped subset of the original dataset. For a single decision tree with R terminal leaves, the input feature space is partitioned into R non-overlapping regions Rₘ such that the prediction function is defined as:

$$f(x)=\sum_{m=1}^{R} c_{m}\,\Pi(x,R_{m})$$

where cₘ represents the predicted constant value in region Rₘ, and Π(x, Rₘ) is an indicator function defined by:

$$\Pi(x,R_{m})=\begin{cases}1, & x\in R_{m}\\ 0, & \text{otherwise}\end{cases}$$

The final RF prediction ŷ for an input sample x is obtained by averaging the outputs of all T individual trees:

$$\hat{y}=\frac{1}{T}\sum_{t=1}^{T} f_{t}(x)$$

where fₜ(x) is the output of the t-th tree. This ensemble averaging mechanism enhances the predictive performance by reducing variance, improving robustness across different training subsets, and minimizing the risk of overfitting. In this study, the RF model was configured with 50 estimators (n_estimators = 50) and a maximum depth of 5 per tree (max_depth = 5). At each node, a random subset of predictors was selected to determine the best split, which encourages diversity among the trees and improves generalization. The model's hyperparameters were optimized using GridSearchCV with 10-fold cross-validation.
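The RF workflow described above (one-hot encoding, an 80/20 split, and a 10-fold grid search) could look roughly like the following. This is a minimal sketch and not the authors' code: the genotype labels, the synthetic trait values, and the candidate grid values are assumptions made only so the example runs; only n_estimators = 50 and max_depth = 5 are taken from the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)

# Hypothetical stand-in for the field layout: 10 genotypes x 5 dates x 3 replicates.
genotypes = [f"G{i}" for i in range(10)]  # placeholder labels, not the real accessions
dates = ["Mar-06", "Apr-06", "May-05", "Jun-05", "Jul-01"]
X = np.array([[g, d] for g in genotypes for d in dates for _ in range(3)], dtype=object)

# Synthetic targets (NOT the study's measurements): four traits driven mostly by date.
date_idx = np.array([dates.index(d) for d in X[:, 1]], dtype=float)
geno_idx = np.array([genotypes.index(g) for g in X[:, 0]], dtype=float)
Y = np.column_stack([
    20 - 2 * date_idx + 0.3 * geno_idx,     # "branch"
    200 - 15 * date_idx,                    # "growth_period"
    100 - 8 * date_idx + geno_idx,          # "boll"
    1400 - 90 * date_idx + 10 * geno_idx,   # "seedperplant"
]) + rng.normal(scale=1.0, size=(len(X), 4))

# 80% training, 20% testing, as stated in the text.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

# One-hot encode both categorical inputs, then fit a multi-output random forest.
pipeline = Pipeline([
    ("encode", OneHotEncoder(handle_unknown="ignore")),
    ("rf", RandomForestRegressor(random_state=0)),
])

# Assumed candidate grid; it merely contains the reported final configuration.
param_grid = {"rf__n_estimators": [25, 50], "rf__max_depth": [3, 5]}
search = GridSearchCV(pipeline, param_grid, cv=10, scoring="r2")
search.fit(X_train, Y_train)

best_params = search.best_params_
test_r2 = search.score(X_test, Y_test)  # R² averaged uniformly over the four traits
```

RandomForestRegressor handles several targets natively, so no wrapper is needed for this step; the reported score is the unweighted mean R² over the four outputs.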
MLP Model

In this study, the MLP model was implemented as a supervised learning algorithm within a feedforward neural network architecture to predict agronomic traits of H. sabdariffa. The final MLP model consisted of an input layer, two hidden layers with five neurons each (5, 5), and a single output layer. The input vector x = [x₁, x₂, ..., xₙ], representing genotype and planting date encoded via one-hot encoding, was passed through the hidden layers, each employing the hyperbolic tangent activation function (tanh):

$$h_{j}=\tanh\left(\sum_{i} w_{ji}\,x_{i}+b_{j}\right)$$

where wⱼᵢ denotes the weight between the i-th input and the j-th hidden neuron, bⱼ is the bias for the j-th neuron, and tanh() is the hyperbolic tangent activation function. The final output ŷ, representing predicted plant traits (branch number, growth period, boll number, and seed yield), was computed through a linear activation function:

$$\hat{y}=\sum_{j} w_{j}\,h_{j}+b_{0}$$

where wⱼ is the weight connecting the j-th hidden neuron to the output, and b₀ is the output bias. The model was trained using the Adam optimizer with a maximum of 1000 iterations. To improve generalization and prevent overfitting, L2 regularization (weight decay) was incorporated via the alpha parameter, set to 1. Accordingly, the loss function minimized during training was defined as the regularized mean squared error (MSE):

$$MSE=\frac{1}{K}\sum_{k=1}^{K}\left(y_{k}-\hat{y}_{k}\right)^{2}+\alpha\sum_{j} w_{j}^{2}$$

where K is the number of training samples, yₖ is the observed value and ŷₖ the predicted output for the k-th sample, and α = 1 is the regularization parameter controlling the penalty applied to large weights.
This formulation penalizes large weights and enhances the model's robustness against noise and overfitting. The architecture and number of hidden neurons were selected based on a combination of empirical trial-and-error and structured grid search optimization, ensuring the best trade-off between complexity and predictive accuracy.

Model Performance Evaluation

To evaluate the performance of the developed machine learning models, several statistical indicators were employed, including the coefficient of determination (R²), root mean square error (RMSE), mean absolute percentage error (MAPE), and mean bias error (MBE). These metrics provide a comprehensive assessment of the model's accuracy, precision, and bias, and are defined in Table 1. They were computed for both training and testing subsets to assess the generalization capability of the models on unseen data. The prediction results for each target trait were subsequently compared and visualized through scatter plots of observed vs. predicted values (Fig. 3a-h). The two implemented models were then compared based on these performance metrics for the investigated targets.

Table 1 Description of regression evaluation metrics

Metric | Formula | Description
R² (Coefficient of Determination) | R² = 1 - (Σ (yᵢ - ŷᵢ)²) / (Σ (yᵢ - ȳ)²) | R² indicates the proportion of the variance in the observed values explained by the model. A higher R² value (closer to 1) reflects better model performance and fit.
RMSE (Root Mean Squared Error) | RMSE = √[(1/n) Σ (yᵢ - ŷᵢ)²] | RMSE measures the standard deviation of prediction errors. It quantifies the model's ability to predict values close to the observed ones. Lower values indicate higher accuracy.
MAPE (Mean Absolute Percentage Error) | MAPE = (100/n) Σ |(yᵢ - ŷᵢ) / yᵢ| | MAPE expresses the average absolute error as a percentage of actual values. It allows for easy interpretation and comparison across different scales or units.
MBE (Mean Bias Error) | MBE = (1/n) Σ (ŷᵢ - yᵢ) | MBE quantifies the average bias in predictions, indicating whether a model tends to overestimate (positive MBE) or underestimate (negative MBE) values.

where yᵢ: observed value, ŷᵢ: predicted value, ȳ: mean of observed values, n: number of samples.

Permutation-based feature importance

To assess the relative contribution of each input feature to the model's predictive performance, permutation-based feature importance analysis was performed for the RF and MLP models. For each model, a MultiOutputRegressor was trained and evaluated, and permutation importance was computed separately for each target trait: number of branches, growth period, number of bolls, and seed yield per plant. The permutation importance method involves randomly shuffling the values of each feature and observing the resulting change in prediction error. This process was repeated 50 times to obtain a stable estimate of feature relevance. The procedure was applied to the test set after transforming it using the model's preprocessing pipeline (including standardization and one-hot encoding). The importance scores were averaged across all repetitions, visualized using bar plots for each target trait to highlight the most influential genotypes and planting dates, and also aggregated to assess global feature influence. The implementation was carried out using the permutation_importance function of the Scikit-learn library (version 1.3.2) (Pedregosa et al., 2011).

Optimization of ML Model via Non-Dominated Sorting Genetic Algorithm II (NSGA-II)

To identify optimal genotype and planting date combinations for maximizing agronomic performance, the best-performing machine learning model, RF, was employed as a surrogate fitness function and integrated with the NSGA-II for multi-objective optimization (Fig. 1).
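As a concrete reference for the metrics of Table 1 and the shuffling procedure used for permutation importance, a minimal dependency-free sketch follows. It is illustrative only: the study used scikit-learn's permutation_importance, whereas the function `permutation_importance_rmse`, the choice of RMSE as the error measure, and the toy predictor and data below are assumptions introduced for this example.

```python
import math
import random


def r2(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


def rmse(y, y_hat):
    """Root mean squared error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))


def mape(y, y_hat):
    """Mean absolute percentage error, in percent (observed values must be nonzero)."""
    return 100.0 / len(y) * sum(abs((a - b) / a) for a, b in zip(y, y_hat))


def mbe(y, y_hat):
    """Mean bias error: positive when the model overestimates on average."""
    return sum(b - a for a, b in zip(y, y_hat)) / len(y)


def permutation_importance_rmse(predict, X, y, n_repeats=50, seed=0):
    """Mean increase in RMSE when one feature column is shuffled (hypothetical helper)."""
    rng = random.Random(seed)
    baseline = rmse(y, predict(X))
    importances = []
    for col in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild the rows with only this column permuted.
            X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            increases.append(rmse(y, predict(X_perm)) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def toy_predict(rows):
    """Toy model that uses only the first feature (assumption for illustration)."""
    return [2.0 * row[0] for row in rows]


# Toy data: the target depends on feature 0 only; feature 1 is irrelevant.
X = [[float(i), float(i % 3)] for i in range(20)]
y = [2.0 * row[0] for row in X]
importances = permutation_importance_rmse(toy_predict, X, y)
```

In this toy setting, shuffling the irrelevant second feature leaves the error unchanged, while shuffling the first feature increases it; the same reading applies to the per-trait bar plots described in the text.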
In this optimization framework, the genotype and planting date were treated as decision variables, and four agronomic traits were simultaneously optimized: branch number, boll count, and seed yield per plant were maximized, while the growth period was minimized. The initial population of candidate solutions was randomly generated, and the tournament selection method (selTournamentDCD) was used to select individuals for reproduction. A binary crossover operator was applied to create offspring, while real-valued polynomial mutation (real_pm) introduced diversity to prevent premature convergence.

The optimization process adhered to the standard NSGA-II framework. In each generation, individuals in the population were first sorted into non-dominated fronts based on the principle of Pareto dominance. Each front was then assigned a rank, with lower ranks indicating better (non-dominated) solutions. To maintain both convergence toward the Pareto-optimal front and diversity among solutions, the algorithm utilized the concept of crowding distance. This metric estimates the density of surrounding solutions in the objective space by calculating the average distance between a solution and its immediate neighbors for each objective. During selection, individuals were prioritized based on lower rank and, within the same rank, higher crowding distance. This ensures that solutions in sparser regions of the objective space are favored, thereby preserving diversity. To achieve an improved fitness function during the optimization process, the values of crucial operators such as the crossover rate, maximum number of generations, initial population size, and mutation rate were tuned through trial and error.
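The ranking and crowding-distance steps described above can be sketched in plain Python. This is an illustrative simplification, not the DEAP implementation used in the study: it uses a simple repeated-filtering sort rather than the fast non-dominated sort, treats every objective as minimized (a maximized trait such as branch number is negated), and the candidate values are made-up numbers.

```python
from math import inf


def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def non_dominated_fronts(points):
    """Split point indices into ranked fronts; front 0 is the non-dominated set."""
    remaining = set(range(len(points)))
    fronts = []
    while remaining:
        front = sorted(
            i for i in remaining
            if not any(dominates(points[j], points[i]) for j in remaining if j != i)
        )
        fronts.append(front)
        remaining -= set(front)
    return fronts


def crowding_distance(points, front):
    """Per-objective normalized gap between each solution's two neighbours, summed."""
    dist = {i: 0.0 for i in front}
    for m in range(len(points[front[0]])):
        ordered = sorted(front, key=lambda i: points[i][m])
        lo, hi = points[ordered[0]][m], points[ordered[-1]][m]
        # Boundary solutions are always kept, so they get infinite distance.
        dist[ordered[0]] = dist[ordered[-1]] = inf
        if hi == lo:
            continue
        for k in range(1, len(ordered) - 1):
            gap = points[ordered[k + 1]][m] - points[ordered[k - 1]][m]
            dist[ordered[k]] += gap / (hi - lo)
    return dist


# Hypothetical candidates as (-branch number, growth period in days), both minimized.
candidates = [(-26, 176), (-24, 170), (-20, 165), (-22, 180), (-18, 190)]
fronts = non_dominated_fronts(candidates)
dist = crowding_distance(candidates, fronts[0])
```

Here the first three candidates form the first front because none of them is better than another in both objectives at once; within that front, the two extreme solutions receive infinite crowding distance and the middle one a finite value, which is the tie-breaking order used during selection.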
In the current study, the population size was 100, the number of generations was 200, the crossover probability was 0.9, and the mutation probability was 0.1. The distribution indices for the crossover and mutation operators were set to 15 and 20, respectively. All modeling and optimization procedures were implemented in Python (version 3.11.12). The RF and MLP models were implemented and evaluated using the Scikit-learn library (version 1.3.2) (Pedregosa et al., 2011), with hyperparameter tuning performed using GridSearchCV, and the NSGA-II optimization was carried out with the DEAP library (version 1.4.2) (Fortin et al., 2012).

Results

Correlation coefficient and statistical analyses

Comprehensive correlation and two-way ANOVA analyses were conducted to evaluate the influence of genotype and planting date on a broad set of morphological and yield traits in Roselle. The correlation results (Fig. 2a) revealed that most traits were more strongly associated with planting date than with genotype, highlighting the dominant role of temporal factors in trait variation. Notably, branch number exhibited the highest correlation with planting date (r = 0.63), followed by sepal weight (r = 0.49), calyx size (r = 0.47), harvest index (r = 0.34), and boll number (r = 0.31). In contrast, correlations with genotype were consistently weak, with the highest positive value observed for seed per plant (r = 0.082), negligible values for most other traits, and a slight negative correlation for plant height (r = −0.12).
Based on these findings, traits with minimal correlation to either input, such as plant height, calyx, harvest index, and sepal weight, were excluded from further modeling to enhance predictive clarity and reduce noise. Complementing the correlation analysis, two-way ANOVA was performed to assess the statistical significance of genotype, planting date, and their interaction (G×E) on each trait (Fig. 2b). Planting date consistently emerged as the most influential factor across most traits, underscoring the critical role of environmental conditions and sowing time in shaping Roselle performance. Branch number, boll number, and seed yield per plant showed highly significant responses (p < 0.01) to all three factors (genotype, planting date, and their interaction), suggesting that these traits are highly responsive to both genotype and planting date. In contrast, harvest index, growth period, sepal weight, and calyx size were mainly influenced by planting date and the G×E interaction, but not by genotype (p > 0.05), highlighting their strong environmental dependence. Overall, among all the morphological traits measured in this study, branch number, boll number, seed per plant, and growth period were selected as the final targets for predictive modeling and optimization.

Model performance evaluation

In this study, RF and MLP models were used to predict morphological parameters based on genotype and planting date. The performances of both models are presented in Table 2. The accuracy of each model was evaluated by R², RMSE, MBE, and MAPE. The results show that both models provided acceptable levels of accuracy. However, the RF algorithm (R² = 0.84) outperformed the MLP (R² = 0.80) in most of the target traits.
Specifically, the RF model yielded the highest R² and lowest error values in the majority of traits, indicating robust generalization ability and a well-fitted predictive structure. The regression lines showed good agreement between the observed and predicted data for all growth parameters during the training and testing of the RF model (Fig. 3a-h).

Table 2 Comparison statistics of multilayer perceptron (MLP) and random forest (RF) models for various morphological traits of H. sabdariffa under training and testing conditions. Traits include branch number (branch), growth period (growth_period), number of bolls per plant (boll), and seeds per plant (seedperplant). R²: coefficient of determination; RMSE: root mean square error; MBE: mean bias error; MAPE: mean absolute percentage error.

Model | Subset | Criterion | branch | growth_period | boll | seedperplant
RandomForest | Training | R² | 0.857 | 0.969 | 0.856 | 0.702
RandomForest | Training | RMSE | 1.90 | 5.59 | 8.33 | 190.15
RandomForest | Training | MBE | 0.007 | 0.020 | 0.141 | -3.856
RandomForest | Training | MAPE | 0.097 | 0.025 | 0.083 | 0.181
RandomForest | Testing | R² | 0.845 | 0.946 | 0.834 | 0.770
RandomForest | Testing | RMSE | 1.88 | 6.72 | 8.49 | 199.83
RandomForest | Testing | MBE | -0.21 | -0.33 | -1.17 | 17.45
RandomForest | Testing | MAPE | 0.093 | 0.031 | 0.089 | 0.198
MLP | Training | R² | 0.871 | 0.967 | 0.860 | 0.755
MLP | Training | RMSE | 1.81 | 5.81 | 8.23 | 172.66
MLP | Training | MBE | -0.005 | -0.002 | 0.004 | 0.758
MLP | Training | MAPE | 0.092 | 0.026 | 0.077 | 0.160
MLP | Testing | R² | 0.860 | 0.953 | 0.821 | 0.739
MLP | Testing | RMSE | 1.99 | 6.83 | 8.98 | 210.12
MLP | Testing | MBE | -0.356 | -0.259 | -1.285 | 28.62
MLP | Testing | MAPE | 0.105 | 0.032 | 0.096 | 0.220

Feature Importance Evaluation

The permutation importance analysis revealed clear and consistent patterns regarding the relative influence of each feature, including planting date and genotype, on the prediction of Roselle morphological traits (Fig. 4a-d). Among the evaluated features, including five planting dates and ten genotypes, planting date emerged as the dominant factor across most traits. Traits such as branch number, growth period, and boll number demonstrated high sensitivity to planting date, with May, April, and March contributing most significantly to model performance.
These temporal variables consistently ranked above genotypic features, indicating that environmental timing plays a more decisive role than genetic variation in shaping phenotypic outcomes. In terms of genotypic influence, although generally less impactful than planting date, a few genotypes showed relatively higher importance in specific traits. For branch number and boll count, 'Iranshahr' and 'Qaleganj' emerged as the most influential genotypes, contributing noticeably to model predictions. In the case of the growth period, 'HA' was the most prominent genotype, though its influence remained modest compared to planting date. For seed per plant, both planting date and genotype exhibited very low importance, indicating limited predictability and a likely dependence on unmeasured physiological or environmental factors. Overall, the analysis confirmed that planting date was consistently more critical than genotype in determining Roselle trait variability. Nonetheless, certain genotypes, such as Iranshahr, Qaleganj, and HA, exhibited trait-specific influence, highlighting their potential role in breeding programs when integrated with optimized planting schedules.

Model optimization using NSGA-II

The NSGA-II algorithm was integrated with the RF model, the most accurate predictive algorithm in this study. The combined RF-NSGA-II model determined the optimal values of the input variables (genotype and planting date) to simultaneously optimize four agronomic traits: branch number, growth period, boll number, and seeds per plant. The results of the multi-objective optimization process using the NSGA-II algorithm are summarized in Table 3. The theoretically optimal performance can be achieved with the genotype Qaleganj and the planting date of May, yielding predicted values for branch number, growth duration, boll number, and seed production per plant of 26.009, 175.872, 116.078, and 1517.165, respectively.
Table 3 Optimization of genotype and planting date of Roselle according to the RF-NSGA-II algorithm to obtain the best morphological traits, including number of branches, growth period, number of bolls, and number of seeds per plant.

Input items                Output items
Genotype   Planting date   Predicted branch number   Predicted growth period   Predicted boll number   Predicted seed per plant
Qaleganj   May             26.009                    175.872                   116.078                 1517.165

Discussion

Yield prediction plays an important role in efficient and sustainable crop production. Accurate and timely predictions support farmers' decisions regarding genotypes, planting date, irrigation, fertilization, harvesting, and trading (Khaki & Wang, 2019). Yield prediction in crop science is inherently challenging because of the multifaceted interactions among genetic factors (G), environmental conditions (E), and management practices (Leukel et al., 2023). Traditional linear models, including multiple linear regression and correlation-based approaches, often fall short in capturing these nonlinear, dynamic relationships, particularly under variable environmental contexts (Bejo et al., 2014). In contrast, ML models such as RF, support vector regression (SVR), ANNs, and deep learning (DL) architectures have emerged as robust alternatives, offering improved accuracy and adaptability in diverse agricultural scenarios (Leukel et al., 2023).

In this study, we compared the predictive accuracy of MLP and RF on the yield traits of Roselle. Although the difference in R² values between RF (0.84) and MLP (0.80) was relatively small, RF consistently outperformed MLP, which aligns with previous research highlighting RF's superior capability to handle high-dimensional and complex agricultural datasets (Gupta et al., 2023; Asamoah et al., 2024).
The ensemble structure of RF, which aggregates multiple independent decision trees, minimizes overfitting and enhances model generalizability, advantages that are especially important when working with the noisy or limited datasets typical of agricultural studies (Gupta et al., 2023). In contrast, MLP typically requires meticulous hyperparameter tuning and careful network architecture design to accurately capture complex nonlinear relationships, which can be computationally demanding and prone to overfitting. Another notable advantage of RF is its relative simplicity and computational efficiency. Training an RF is more straightforward and faster than training an MLP, making RF not only more robust in capturing nonlinear interactions but also more practical for agricultural applications where resources and data quality may be constrained. Overall, these findings highlight RF as a more accurate, interpretable, and computationally efficient model for Roselle yield prediction, particularly in the context of the complex genotype-environment interactions that are typical of crop yield studies (Gupta et al., 2023; Asamoah et al., 2024).

After the RF model was identified as the most accurate model, it was linked to the NSGA-II algorithm. The RF-NSGA-II results indicated that the optimal input variables for maximizing morphological traits, namely the Qaleganj genotype and a planting date of May 5, correspond to the Pareto frontier solutions, highlighting this combination as the best compromise among the studied scenarios.

Permutation importance analysis identified planting date as the most influential factor across most traits, overshadowing the contribution of genotype. This aligns with previous studies, which have consistently highlighted early planting as a critical determinant of biomass accumulation and overall productivity in Roselle and other crops (Aung and Uape, 2022; Parsa Motlagh et al., 2018).
Early planting dates (March to May) likely provide more favorable thermal and photoperiodic conditions, enhancing growth rates, extending the vegetative phase, and eventually improving yield (El-Sagher et al., 2024).

Although genotype had a comparatively smaller effect, its role was not negligible. Specific genotypes such as Iranshahr and Qaleganj showed higher predictive importance for branch number and boll count, while the HA genotype was prominent for the growth period. These findings corroborate earlier observations by Ibrahim et al. (2013) and Tetteh et al. (2019), which identified substantial genetic variability in Roselle for key yield-contributing traits, including plant height, branching, calyx yield, fibre production, and nutritional composition. Collectively, these studies highlight the considerable potential for selective breeding and genetic improvement in Roselle, emphasizing the importance of exploiting this variability to develop high-yielding and resilient cultivars. High heritability observed in traits such as plant height and number of branches suggests that these are primarily governed by genetic factors and can be effectively improved through selection. The interactions between genotype and planting date observed in this study suggest that while environmental timing is the primary driver of trait variability, genotype selection can further refine yield performance when optimized within suitable environmental windows.

Moreover, both planting date and genotype showed consistently low predictive importance for seed yield per plant, suggesting that this trait depends on factors not directly captured in this dataset. This finding underscores the complexity and multifaceted nature of seed production in Roselle.
Specifically, it highlights that while other morphological traits (such as branch number and boll count) are primarily shaped by genotype and planting date, seed yield appears to be more strongly influenced by unmeasured physiological and environmental interactions. Our findings are consistent with earlier studies that have underscored the pivotal role of micro-environmental and physiological factors in determining seed production in Roselle (Atta et al., 2011; Ibrahim et al., 2013). These factors include variations in plant morphology, soil moisture levels, and physiological responses to environmental stressors, all of which contribute significantly to yield differences across genotypes.

To the best of our knowledge, this is the first study to apply ML-NSGA-II for modeling and optimizing morphological traits of Roselle in response to various genotypes and planting dates. In conclusion, our study highlights the applicability and reliability of ML models, particularly RF, for analyzing complex genotype-environment interactions in Roselle cultivation.

Conclusion

This study aimed to predict and understand how genotype and planting date affect the morphological traits of Roselle by applying, for the first time, the machine learning models RF and MLP. The comparative analysis demonstrated that, despite the relatively small difference in predictive performance, the RF model consistently outperformed MLP, confirming its robustness in capturing the complex genotype-by-environment interactions typical of agricultural systems. By integrating RF with the NSGA-II algorithm, we identified the combination of the Qaleganj genotype and a planting date of May 5 as optimal for maximizing morphological traits. Furthermore, the RF-NSGA-II hybrid algorithm proved effective for multi-objective optimization, offering a tool for identifying ideal input combinations under varying conditions.
These findings emphasize the potential of advanced ML-based approaches as robust alternatives to traditional statistical methods, offering new avenues for optimizing and improving morphological traits in Roselle and other crops. Future research can expand on these findings by incorporating additional environmental variables, exploring genomic selection approaches, and validating the models across multiple growing seasons and agroecological zones.

Declarations

CRediT authorship contribution statement

WarqaaMuhammed ShariffAl-Sheikh: data collection and experimental work. Fazilat Fakhrzad: experimental design and data analysis. Mohammed M. Mohammed: manuscript writing. Heidar Meftahizadeh: experimental work and data analysis.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Funding

This research received no specific funding.

Data availability

Data is provided within the manuscript file.

References

Ali, B.H., Wabel, N.A. and Blunden, G., 2005. Phytochemical, pharmacological and toxicological aspects of Hibiscus sabdariffa L.: a review. Phytotherapy Research: An International Journal Devoted to Pharmacological and Toxicological Evaluation of Natural Product Derivatives, 19 (5), pp.369-375. https://doi.org/10.1002/ptr.1628
Zand-Silakhoor, A., Madani, H., Sharifabad, H.H., Mahmoudi, M. and Nourmohammadi, G., 2022. Influence of different irrigation regimes and planting times on the quality and quantity of calyx, seed oil content and water use efficiency of roselle (Hibiscus sabdariffa L.). Grasas y Aceites, 73 (3), pp.e472-e472. https://doi.org/10.3989/gya.0564211
Fortin, F.A., De Rainville, F.M., Gardner, M.A.G., Parizeau, M. and Gagné, C., 2012. DEAP: Evolutionary algorithms made easy. The Journal of Machine Learning Research, 13 (1), pp.2171-2175.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V. and Vanderplas, J., 2011. Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research, 12, pp.2825-2830.
Mohamed, R., Fernandez, J., Pineda, M. and Aguilar, M., 2007. Roselle (Hibiscus sabdariffa) seed oil is a rich source of γ‐tocopherol. Journal of Food Science, 72 (3), pp.S207-S211. https://doi.org/10.1111/j.1750-3841.2007.00285.x
Mulyaningsih, E.S., Hartati, N., Dyan Anggraheni, Y.G., Harmoko, R., Indrayani, S., Rahman, N., Fitriani, H., Nuro, F., Hapsari, Y. and Andika, N.R., 2025. Morpho Genetic Variability and Anthocyanine (Cyanidin-3-OGlucoside) Concent of Indonesia Roselle (Hibiscus sabdariffa L.). International Journal on Advanced Science, Engineering & Information Technology, 15 (1).
Mahunu, G.K., 2021. Breeding, genetic diversity, and safe production of Hibiscus sabdariffa under climate change. In Roselle (Hibiscus sabdariffa) (pp. 1-14). Academic Press.
Richardson, M.L. and Arlotta, C.G., 2021. Differential yield and nutrients of Hibiscus sabdariffa L. genotypes when grown in urban production systems. Scientia Horticulturae, 288, p.110349.
Ullah, M.Z., 2024. Genetic variability of Roselle (Hibiscus sabdariffa L.) genotypes. Innovare Journal of Agricultural Sciences, 12 (1), pp.1-6. http://dx.doi.org/10.22159/ijags.2024v12i1.49802
Tetteh, A.Y., Ankrah, N.A., Coffie, N. and Niagiah, A., 2019. Genetic diversity, variability and characterization of the agro-morphological traits of Northern Ghana Roselle (Hibiscus sabdariffa var. altissima) accessions. African Journal of Plant Science, 13 (6), pp.168-184. https://doi.org/10.5897/AJPS2019.1783
Ibrahim, M.M. and Hussein, R.M., 2006. Variability, heritability and genetic advance in some genotype of roselle (Hibiscus sabdariffa). World Journal of Agricultural Sciences, 2 (3), pp.340-345.
Aung, C. and Uape, M., 2022. The effect of Different Planting Dates on the Growth and Yield of Roselle (Hibiscus sabdariffa L.) during the Rainy Season. University of Yangon Research Journal, 11 (1). https://meral.edu.mm/records/9019
El-Sagher, M., Mostafa, G.G., El-Ghadban, E.M.A., Soliman, W.S. and Gahory, A.A., 2024. Sowing date as a determining factor for Roselle, Hibiscus sabdariffa, production: I. Effect on vegetative and yield components. Aswan University Journal of Sciences and Technology, 4 (1), pp.28-37. https://dx.doi.org/10.21608/aujst.2024.337887
Yoosefzadeh-Najafabadi, M., Earl, H.J., Tulpan, D., Sulik, J. and Eskandari, M., 2021a. Application of machine learning algorithms in plant breeding: predicting yield from hyperspectral reflectance in soybean. Frontiers in Plant Science, 11, p.624273. https://doi.org/10.3389/fpls.2020.624273
Yoosefzadeh-Najafabadi, M., Tulpan, D. and Eskandari, M., 2021b. Application of machine learning and genetic optimization algorithms for modeling and optimizing soybean yield using its component traits. PLoS ONE, 16 (4), p.e0250665. https://doi.org/10.1371/journal.pone.0250665
Zarbakhsh, S. and Shahsavar, A.R., 2022. Artificial neural network-based model to predict the effect of γ-aminobutyric acid on salinity and drought responsive morphological traits in pomegranate. Scientific Reports, 12 (1), p.16662. https://doi.org/10.1038/s41598-022-21129-z
Fakhrzad, F., Jowkar, A. and Hosseinzadeh, J., 2022. Mathematical modeling and optimizing the in vitro shoot proliferation of wallflower using multilayer perceptron non-dominated sorting genetic algorithm-II (MLP-NSGAII). PLoS ONE, 17 (9), p.e0273009. https://doi.org/10.1371/journal.pone.0273009
Prasad, N.R., Patel, N.R. and Danodia, A., 2021. Crop yield prediction in cotton for regional level using random forest approach. Spatial Information Research, 29, pp.195-206. https://doi.org/10.1007/s41324-020-00346-6
Shook, J., Gangopadhyay, T., Wu, L., Ganapathysubramanian, B., Sarkar, S. and Singh, A.K., 2021. Crop yield prediction integrating genotype and weather variables using deep learning. PLoS ONE, 16 (6), p.e0252402. https://doi.org/10.48550/arXiv.2006.13847
Leukel, J., Zimpel, T. and Stumpe, C., 2023. Machine learning technology for early prediction of grain yield at the field scale: A systematic review. Computers and Electronics in Agriculture, 207, p.107721. https://doi.org/10.1016/j.compag.2023.107721
Khaki, S. and Wang, L., 2019. Crop yield prediction using deep neural networks. Frontiers in Plant Science, 10, p.621. https://doi.org/10.3389/fpls.2019.00621
Gupta, I., Ayalasomayajula, S., Shashidhara, Y., Kataria, A., Shashidhara, S., Kataria, K. and Undurti, A., 2023. Innovations in Agricultural Forecasting: A Multivariate Regression Study on Global Crop Yield Prediction. arXiv preprint arXiv:2312.02254. https://doi.org/10.48550/arXiv.2312.02254
Asamoah, E., Heuvelink, G.B., Chairi, I., Bindraban, P.S. and Logah, V., 2024. Random forest machine learning for maize yield and agronomic efficiency prediction in Ghana. Heliyon, 10 (17). https://doi.org/10.1016/j.heliyon.2024.e37065
Parsa Motlagh, B., Rezvani Moghaddam, P. and Azami Sardooei, Z., 2018. Responses of Calyx Phytochemical Characteristic, Yield and Yield Components of Roselle (Hibiscus sabdariffa L.) to Different Sowing Dates and Densities. International Journal of Horticultural Science and Technology, 5 (2), pp.241-251.
Atta, S., Seyni, H.H., Bakasso, Y., Sarr, B., Lona, I. and Saadou, M., 2011. Yield character variability in Roselle (Hibiscus sabdariffa L.). African Journal of Agricultural Research, 6 (6), pp.1371-1377. https://doi.org/10.5897/AJAR10.334
Ibrahim, E.B., Abdalla, A.W.H., Ibrahim, E.A. and El Naim, A.M., 2013. Variability in some roselle (Hibiscus sabdariffa L.) genotypes for yield and its attributes. International Journal of Agriculture and Forestry, 3 (7), pp.261-266.

Additional Declarations

No competing interests reported.
Editorial decision: Revision requested 14 Jul, 2025; Reviews received at journal 14 Jul, 2025; Reviews received at journal 11 Jul, 2025; Reviewers agreed at journal 09 Jul, 2025; Reviews received at journal 06 Jul, 2025; Reviewers agreed at journal 03 Jul, 2025; Reviewers agreed at journal 02 Jul, 2025; Reviewers invited by journal 01 Jul, 2025; Editor assigned by journal 01 Jul, 2025; Editor invited by journal 23 Jun, 2025; Submission checks completed at journal 21 Jun, 2025; First submitted to journal 17 Jun, 2025.
{"props":{"pageProps":{"initialData":{"identity":"rs-6917293","acceptedTermsAndConditions":true,"allowDirectSubmit":false,"archivedVersions":[],"articleType":"Article","associatedPublications":[],"authors":[{"id":479885936,"identity":"94374d3e-ffc1-4e5f-858f-b019c884900f","order_by":0,"name":"WarqaaMuhammed ShariffAl-Sheikh","email":"","orcid":"","institution":"University of Al-Qadisiyah","correspondingAuthor":false,"prefix":"","firstName":"WarqaaMuhammed","middleName":"","lastName":"ShariffAl-Sheikh","suffix":""},{"id":479885938,"identity":"d3b87f6a-b3f2-4f78-b09b-accc8b82005d","order_by":1,"name":"Fazilat Fakhrzad","email":"","orcid":"","institution":"Shiraz University","correspondingAuthor":false,"prefix":"","firstName":"Fazilat","middleName":"","lastName":"Fakhrzad","suffix":""},{"id":479885939,"identity":"b6d6650a-ac2c-4066-adb8-738b1b28b7be","order_by":2,"name":"Mohammed M. 
Mohammed","email":"","orcid":"","institution":"University of Baghdad","correspondingAuthor":false,"prefix":"","firstName":"Mohammed","middleName":"M.","lastName":"Mohammed","suffix":""},{"id":479885940,"identity":"4a01ef80-3a41-4033-af29-a5aeb9bc4e63","order_by":3,"name":"Heidar Meftahizadeh","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAABGElEQVRIie3RMUvEMBTA8RcKcQnnGjnPfoVKoXWQu6+SUqhLwaEgDh0CQl1qb+2kX8EuzjkCnVpuc6nIidDJoS5FEIqc3ODQVtwc8psehD/vQQAU5R8yBEADIMh23k6Ik90THk5QKnYJSv+SfM8a+ZkMHvYoV1dN+HSoX8vs9TR8ni1jWW8gnMNkKnqTReU5XOQ1QbEXmH4emGkZ2QbkLuAJ699S+fZGYEk08K2pj5nD12BRwALwwIVGdf7ORScJ3n+zP0865tyt91oK3VjiI76KJCHUtzQUMee+jC2KorHEO+ZlIgmldXBwkzAzK4oL6iQuGU7cF37ZyoW+dLPmo2Wz2+LsgTbt/EiP+5MBDOC371EURVFGfAEcsmTNTqnGSAAAAABJRU5ErkJggg==","orcid":"","institution":"Ardakan University","correspondingAuthor":true,"prefix":"","firstName":"Heidar","middleName":"","lastName":"Meftahizadeh","suffix":""}],"badges":[],"createdAt":"2025-06-17 20:38:07","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-6917293/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-6917293/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":85912302,"identity":"fa733c31-b761-452d-9eb4-77ab1dd5dacd","added_by":"auto","created_at":"2025-07-03 05:54:06","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":344351,"visible":true,"origin":"","legend":"\u003cp\u003eSchematic diagram of the procedure used in this study, modeling morphological traits based on two input variables, including roselle genotypes and planting dates, using Random Forest (RF) and multilayer perceptron (MLP), and the step-by-step optimization process of morphological traits via non-dominated sorting genetic algorithm-II 
(NSGA-II).\u003c/p\u003e","description":"","filename":"1.png","url":"https://assets-eu.researchsquare.com/files/rs-6917293/v1/6cb8f46dce17c34f29a6a92f.png"},{"id":85913386,"identity":"3e3b3438-6ab5-45a5-afa6-16181399b8e1","added_by":"auto","created_at":"2025-07-03 06:10:06","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":205637,"visible":true,"origin":"","legend":"\u003cp\u003ea) The correlation heatmap between plant traits (targets) and two input variables: genotype and planting date. b) The heatmap of ANOVA p-values showing the significance of genotype, planting date, and their interaction on each trait. Traits with p-values \u0026lt; 0.05 are significant.\u003c/p\u003e","description":"","filename":"2.png","url":"https://assets-eu.researchsquare.com/files/rs-6917293/v1/b29a1426471e9a654f1fa03b.png"},{"id":85912306,"identity":"bdb18d20-889b-4d23-8946-fe899dc9da27","added_by":"auto","created_at":"2025-07-03 05:54:06","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":418098,"visible":true,"origin":"","legend":"\u003cp\u003eThe scatter plot of observed values vs. 
predicted values of (a,b) branch number (Training) and (Testing), (c,d) boll number (Training) and (Testing), (e,f) seed per plant (Training) and (Testing), (g,h) growth period (Training) and (Testing) obtained by the Random Forest (RF) model.\u003c/p\u003e","description":"","filename":"3.png","url":"https://assets-eu.researchsquare.com/files/rs-6917293/v1/ff1af3d45ab5b4605ffb677b.png"},{"id":85912677,"identity":"139aba4d-98fd-4110-828f-09057927e118","added_by":"auto","created_at":"2025-07-03 06:02:06","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":114145,"visible":true,"origin":"","legend":"\u003cp\u003ePlots of feature importance in the Random Forest model showing the influence of planting dates and different genotypes in predicting Roselle yield traits: a) branch number; b) growth period; c) boll number and d) seed per plant\u003c/p\u003e","description":"","filename":"4.png","url":"https://assets-eu.researchsquare.com/files/rs-6917293/v1/f1fe1fd433dd4647a57b2069.png"},{"id":85914174,"identity":"9064fa4e-41df-43b4-a233-f742512bb11e","added_by":"auto","created_at":"2025-07-03 06:26:08","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":1923756,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-6917293/v1/b3606cb0-d1a3-47d0-99b1-028c18e941be.pdf"}],"financialInterests":"No competing interests reported.","formattedTitle":"Machine Learning-Driven Models to Predict the optimum Genotype and Planting Date on yield and phytochemical Traits in Roselle (Hibiscus sabdariffa L.)","fulltext":[{"header":"Introduction","content":"\u003cp\u003eRoselle (\u003cem\u003eHibiscus sabdariffa\u003c/em\u003e L.) is an annual herbaceous plant belonging to the Malvaceae family that is widely cultivated in tropical and subtropical regions due to its high adaptability and drought tolerance. 
Various parts of the plant, including the flowers, stem fiber, leaves, seeds, fruits, and roots, serve multiple purposes in the food and medicinal industries. Among these, the bright red fleshy calyx, which contains vitamin C, iron, beta-carotene, anthocyanins, and phenolic compounds, holds the greatest economic value (Ali et al., 2005). Additionally, roselle seeds contain high levels of protein, vitamin E, and unsaturated fatty acids such as oleic and linoleic acids (Zand-Silakhoor et al., 2022). In recent years, global demand for roselle and its processed products has increased steadily, particularly in the health food and natural product markets (Mahunu, 2021).\u003c/p\u003e \u003cp\u003ePlant growth and development are influenced by a variety of environmental and genetic factors. Among these, the selection of suitable genotypes and the determination of optimal planting dates are critical agronomic decisions that directly impact plant performance (Mahunu, 2021). Genotype selection allows breeders to identify plant varieties with favorable characteristics such as early maturity, high yield, and resistance to environmental stress (Mulyaningsih et al., 2025). The results of research conducted on different genotypes of roselle indicate that these traits have high heritability and are largely influenced by genetic factors (Ibrahim and Hussein, 2006). In this regard, Richardson and Arlotta (2021) reported significant differences among genotypes in flower and leaf production and nutrient density. Ullah (2024) also showed that genetic differences among genotypes can significantly affect traits such as plant height, number of branches, fruit weight, and number of fruits per plant. Simultaneously, the planting date must align with regional climatic conditions to ensure proper germination, growth, and flowering, thereby enhancing productivity. 
Multiple studies have shown that delayed sowing can significantly reduce crop performance, including plant height, number of bolls, biomass, calyx, and seed yield. Also, early planting leads to a longer growth period and more desirable growth traits (El-Sagher et al., 2024). Aung and Uape (2022) demonstrated that planting date significantly affects the growth and yield of Roselle. They reported that planting in July produced the best results in terms of early flowering, increased flower bud count, and longer calyces. El-Sagher et al. (2024) reported that planting on May 15 led to the best performance in terms of vegetative growth (plant height, stem diameter, number of branches and fruits) and yield components (fresh and dry weights of shoots, calyces, and seeds).\u003c/p\u003e \u003cp\u003eHowever, the interaction between genotype and planting date is complex and non-linear, requiring advanced modeling approaches to accurately predict outcomes and make informed recommendations. Traditional statistical methods, while useful, often fall short in capturing the non-linear interactions and high-dimensional relationships among multiple plant traits and environmental variables. In contrast, machine learning (ML) techniques offer powerful alternatives by learning patterns from historical data without relying on predefined models (Fakhrzad et al., 2022). Among these, Random Forest (RF) and Multi-layer Perceptron (MLP) are popular ML algorithms used in agricultural modeling due to their flexibility and robustness. MLP, a type of feed-forward artificial neural network (ANN), is particularly known for its ability to model highly nonlinear functions due to its deep architecture composed of interconnected neurons and hidden layers (Fakhrzad et al., 2022). Random Forest (RF) is a widely used machine learning algorithm known for its simplicity, robust performance, and reliable predictive capabilities. 
Its flexibility in handling both classification and regression problems has made it a valuable tool in numerous fields, including biological and agricultural research. RF models are particularly effective in processing complex datasets characterized by noise, imbalance, and high dimensionality, and they are also recognized for their ability to mitigate overfitting (Yoosefzadeh Najafabadi et al., 2021a).\u003c/p\u003e \u003cp\u003eIn biosystems, prediction alone is insufficient, and decision-making often involves balancing multiple conflicting objectives. Among the various optimization approaches, evolutionary algorithms, particularly the Non-dominated Sorting Genetic Algorithm II (NSGA-II), address this challenge by identifying a set of optimal solutions known as the Pareto front. Unlike single-objective genetic algorithms, NSGA-II can handle multiple objectives simultaneously and identify a set of Pareto-optimal solutions, representing the best trade-offs between conflicting goals. These solutions offer decision-makers a spectrum of options, enabling tailored strategies based on specific priorities and constraints. The integration of ML models with NSGA-II provides a synergistic framework that leverages the strengths of both approaches (Fakhrzad et al., 2022; Zarbakhsh and Shahsavar, 2022). Recently, the efficiency of the ANN-NSGA II model was used to predict and optimize pomegranate morphological traits under salinity and drought stress influenced by γ-aminobutyric acid. This approach helped identify optimal treatment conditions and provided insights into improving stress tolerance (Zarbakhsh and Shahsavar, 2022). Moreover, in a study by Yoosefzadeh-Najafabadi et al. (2021b), the combination of machine learning algorithms and genetic optimization was employed to model and optimize soybean yield based on its component traits. 
This integrated approach provided a better understanding of the relationships between yield and its morphological components, and can be effectively used in selecting parental lines and designing crosses aimed at improving the genetic yield potential of soybean cultivars (Yoosefzadeh-Najafabadi et al., 2021b).\u003c/p\u003e \u003cp\u003eThe present study aims to develop an integrated framework for predicting and optimizing the main agronomic traits of Roselle, focusing on four key traits: the number of branches, growth period, number of bolls, and seeds per plant. Using a comprehensive dataset collected from a field experiment, we trained and compared the performance of multiple ML models, followed by the application of NSGA-II for multi-objective optimization. While ML models ensure the accurate prediction of morphological traits based on genotype and planting date, NSGA-II enables the optimization of these traits under a multi-objective setting, assisting in the selection of the best genotype\u0026ndash;planting date combinations that meet multiple morphological targets simultaneously. The novelty of this study lies in its data-driven approach to simultaneously address trait prediction and optimal genotype-date selection. Specifically, the objectives are to:\u003c/p\u003e \u003cp\u003e(1) examine how different planting dates and genotype combinations influence roselle's morphological performance,\u003c/p\u003e \u003cp\u003e(2) identify stable and high-performing genotypes across diverse environmental conditions,\u003c/p\u003e \u003cp\u003e(3) develop predictive machine learning models that can accurately forecast key traits based on genotype and planting date, and\u003c/p\u003e \u003cp\u003e(4) employ cutting-edge multi-objective optimization algorithms to recommend the most effective genotype\u0026ndash;planting date combinations that maximize crop yield. 
The insights derived from this model can inform breeding programs and cultivation planning, contributing to sustainable and high-yielding Roselle production systems.\u003c/p\u003e"},{"header":"Materials and methods","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003e2.1. Plant materials and experimental design\u003c/h2\u003e \u003cp\u003eThe field experiment was conducted in Dalgan (27\u0026deg; 28\u0026prime; N, 59\u0026deg; 27\u0026prime; E; 389 m a.s.l.), located in Sistan and Baluchestan province, southeast Iran. The region has a hot-arid climate, with minimal rainfall and high summer temperatures. The soil texture was loam with a pH of 7.6 and EC of 2.12 dS m⁻\u0026sup1;. The study employed a factorial experimental design based on a randomized complete block design (RCBD) with three replications. The experimental treatments consisted of ten roselle genotypes including eight native accessions \u003cem\u003eJiroft\u003c/em\u003e, \u003cem\u003eDalgan\u003c/em\u003e, \u003cem\u003eBampoor\u003c/em\u003e, \u003cem\u003eIranshahr\u003c/em\u003e, \u003cem\u003eNikshahr\u003c/em\u003e, \u003cem\u003eRoodbar\u003c/em\u003e, \u003cem\u003eSaravan\u003c/em\u003e, and \u003cem\u003eQaleganj\u003c/em\u003e were collected from different agro-ecological zones of Iran, along with two exotic landraces: HA (originating from Ghana) and HS-24 (from Bangladesh), both supplied by the Jiroft Agricultural Research Station. The ten genotypes were sown under five different planting dates: March 6, April 6, May 5, June 5, and July 1.\u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003e\u003c/h3\u003e\n\u003cdiv class=\"Heading\"\u003e\u003cb\u003e2.2. Morphological trait measurement\u003c/b\u003e\u003c/div\u003e \u003cp\u003eMorphological trait measurements were performed at the physiological maturity stage, defined as the point when approximately 70% of the seeds within each flower capsule had browned and the sepals had reached their maximum development. 
The following traits were recorded for each genotype and planting date treatment: plant height, number of branches per plant, fruit length, number of bolls per plant, fresh sepal weight per plant, number of seeds per capsule, seed weight per plant, 1000-seed weight, biomass yield, harvest index, calyx yield, and growth period (life cycle duration). Sepals were harvested at maturity and dried at ambient room temperature (~\u0026thinsp;30\u0026deg;C) for further chemical analysis. Pearson correlation analysis was conducted in Python (version 3.11.12) to evaluate the relationships between the input variables (genotype and planting date) and the output traits, as well as their interactions.\u003c/p\u003e\n\u003ch3\u003eData pre-processing and statistical analyses\u003c/h3\u003e\n\u003cp\u003eThe dataset included ten genotypes and five planting dates, with four primary output traits: number of branches, growth period, number of bolls, and seed yield per plant. Input features were encoded using one-hot encoding, and the output variables were normalized using z-score standardization. Outliers were removed using IQR and Z-score techniques to enhance data quality. The dataset was split into 80% for training and 20% for testing. Prior to model training, a two-step statistical analysis was conducted to evaluate the relevance of each phenotypic trait with respect to the input variables (genotype and planting date). First, Pearson correlation coefficients were computed to assess the linear relationship between the inputs and each target. Subsequently, a two-way ANOVA was performed for each target to test the statistical significance of three effects: genotype, planting date, and their interaction.\u003c/p\u003e\n\u003ch3\u003eHyperparameter Optimization in ML Models\u003c/h3\u003e\n\u003cp\u003eIn machine learning, tuning model hyperparameters is a critical step that significantly affects predictive performance and generalization ability. 
In this study, a structured grid search with cross-validation (GridSearchCV) was used to tune the hyperparameters of the two machine learning models, MLP and RF, with 10-fold cross-validation. During the K-fold cross-validation process (K\u0026thinsp;=\u0026thinsp;10), the dataset was partitioned into 10 subsets. Each subset served once as the validation set while the remaining nine subsets were used to train the model. This approach ensured that all data points contributed to both training and validation, thereby reducing the risk of overfitting or underfitting. The hyperparameter combinations yielding the highest cross-validated R\u0026sup2; scores and lowest generalization errors on the test set were selected as the optimal configurations.\u003c/p\u003e\n\u003ch3\u003e2.3. Description of ML models and optimization algorithm\u003c/h3\u003e\n\u003cdiv id=\"Sec8\" class=\"Section2\"\u003e \u003ch2\u003eModel Development\u003c/h2\u003e \u003cp\u003eIn this study, a supervised machine learning framework was developed to predict agronomic traits of roselle as influenced by genotype and planting date. Two models, Multilayer Perceptron (MLP) and Random Forest (RF), were implemented within a unified scikit-learn (Pedregosa et al., 2011) pipeline structure. These models were selected for their proven effectiveness in handling nonlinear relationships and complex feature interactions in plant trait prediction tasks. Each model was trained and evaluated based on multi-output regression performance (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003eRF Model\u003c/h3\u003e\n\u003cp\u003eThe RF model was implemented as a robust ensemble-based learning method for multi-output regression to predict agronomic traits of roselle plants. RF operates by constructing an ensemble of decision trees, each trained on a bootstrapped subset of the original dataset. 
For a single decision tree with \u003cem\u003eL\u003c/em\u003e terminal leaves, the input feature space is partitioned into \u003cem\u003eL\u003c/em\u003e non-overlapping regions \u003cem\u003eRₘ\u003c/em\u003e (\u003cem\u003em\u003c/em\u003e = 1, ..., \u003cem\u003eL\u003c/em\u003e), and the prediction function is defined as:\u003cdiv id=\"Equa\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equa\" name=\"EquationSource\"\u003e\n$$f\\left(x\\right)=\\sum_{m=1}^{L}{c}_{m}\\,\\varPi\\left(x,{R}_{m}\\right)$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003ewhere \u003cem\u003ecₘ\u003c/em\u003e represents the predicted constant value in region \u003cem\u003eRₘ\u003c/em\u003e, and \u003cem\u003eΠ (x, Rₘ)\u003c/em\u003e is an indicator function defined by:\u003cdiv id=\"Equb\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equb\" name=\"EquationSource\"\u003e\n$$\\varPi\\left(x,{R}_{m}\\right)=\\left\\{\\begin{array}{l}1,\\quad x\\in {R}_{m}\\\\ 0,\\quad \\text{otherwise}\\end{array}\\right.$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eThe final RF prediction \u003cem\u003eŷ\u003c/em\u003e for an input sample \u003cem\u003ex\u003c/em\u003e is obtained by averaging the outputs of all \u003cem\u003eT\u003c/em\u003e individual trees:\u003cdiv id=\"Equc\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equc\" name=\"EquationSource\"\u003e\n$$\\hat{y}=\\frac{1}{T}\\sum_{t=1}^{T}{f}_{t}\\left(x\\right)$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003ewhere \u003cem\u003efₜ(x)\u003c/em\u003e is the output of the \u003cem\u003et\u003c/em\u003eth tree. 
This ensemble averaging mechanism enhances predictive performance by reducing variance, improving robustness across different training subsets, and lowering the risk of overfitting.\u003c/p\u003e \u003cp\u003eIn this study, the RF model was configured with 50 estimators (n_estimators\u0026thinsp;=\u0026thinsp;50) and a maximum depth of 5 per tree (max_depth\u0026thinsp;=\u0026thinsp;5). At each node, a random subset of predictors was considered when determining the best split, which encourages diversity among the trees and improves generalization. The model\u0026rsquo;s hyperparameters were optimized using GridSearchCV with 10-fold cross-validation.\u003c/p\u003e\n\u003ch3\u003eMLP Model\u003c/h3\u003e\n\u003cp\u003eIn this study, the MLP model was implemented as a supervised learning algorithm with a feedforward neural network architecture to predict agronomic traits of \u003cem\u003eH. sabdariffa\u003c/em\u003e. The final MLP model consisted of an input layer, two hidden layers with five neurons each (5, 5), and an output layer. 
The input vector \u003cem\u003ex\u003c/em\u003e = [\u003cem\u003ex₁, x₂, ..., xₙ\u003c/em\u003e], representing genotype and planting date encoded via one-hot encoding, was passed through the hidden layers, each employing the hyperbolic tangent (tanh) activation function: \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$h_{j}=\\tanh\\left(\\sum_{i} w_{ji}\\,x_{i}+b_{j}\\right)\$\u003c/span\u003e\u003c/span\u003e, where \u003cem\u003ewⱼᵢ\u003c/em\u003e denotes the weight between the \u003cem\u003ei\u003c/em\u003eth input and the \u003cem\u003ej\u003c/em\u003eth hidden neuron, and \u003cem\u003ebⱼ\u003c/em\u003e is the bias of the \u003cem\u003ej\u003c/em\u003eth neuron. The final output \u003cem\u003eŷ\u003c/em\u003e, representing predicted plant traits (branch number, growth period, boll number, and seed yield), was computed through a linear activation function:\u003c/p\u003e \u003cp\u003e \u003cspan class=\"InlineEquation\"\u003e \u003cspan class=\"mathinline\"\u003e\$\\hat{y}=\\sum_{j} w_{j}\\,h_{j}+b_{0}\$\u003c/span\u003e \u003c/span\u003e \u003c/p\u003e \u003cp\u003ewhere \u003cem\u003ewⱼ\u003c/em\u003e is the weight connecting the \u003cem\u003ej\u003c/em\u003eth hidden neuron to the output, and \u003cem\u003eb₀\u003c/em\u003e is the output bias. The model was trained using the Adam optimizer with a maximum of 1000 iterations. To improve generalization and prevent overfitting, L2 regularization (weight decay) was incorporated via the alpha parameter, set to 1. 
Accordingly, the loss function minimized during training was defined as the regularized mean squared error (MSE):\u003cdiv id=\"Equd\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equd\" name=\"EquationSource\"\u003e\n$$MSE=\\frac{1}{K}\\sum_{k=1}^{K}{\\left({y}_{k}-{\\hat{y}}_{k}\\right)}^{2}+\\alpha\\sum_{j}{w}_{j}^{2}$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003ewhere \u003cem\u003eK\u003c/em\u003e is the number of training samples, \u003cem\u003eyₖ\u003c/em\u003e and \u003cem\u003eŷₖ\u003c/em\u003e are the observed and predicted outputs for the \u003cem\u003ek\u003c/em\u003eth sample, and \u003cem\u003eα\u003c/em\u003e\u0026thinsp;=\u0026thinsp;1 is the regularization parameter controlling the penalty applied to large weights. This penalty enhances the model\u0026rsquo;s robustness against noise and overfitting. The architecture and number of hidden neurons were selected through a combination of empirical trial-and-error and structured grid search, seeking the best trade-off between complexity and predictive accuracy.\u003c/p\u003e \u003cdiv id=\"Sec11\" class=\"Section2\"\u003e \u003ch2\u003eModel Performance Evaluation\u003c/h2\u003e \u003cp\u003eTo evaluate the performance of the developed machine learning models, several statistical indicators were employed, including the coefficient of determination (R\u0026sup2;), root mean square error (RMSE), mean absolute percentage error (MAPE), and mean bias error (MBE). Together, these metrics assess the models\u0026rsquo; accuracy, precision, and bias; their definitions are given in Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e. They were computed for both the training and testing subsets to check how well the models generalize to unseen data. The prediction results for each target trait were subsequently compared and visualized through scatter plots of observed vs. 
predicted values (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003ea-h). The two models were then compared on these performance metrics for each target trait.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eDescription of regression evaluation metrics\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"3\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMetric\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eFormula\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eDescription\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eR\u0026sup2; (Coefficient of Determination)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eR\u0026sup2; = 1 - (Σ (y\u003csub\u003ei\u003c/sub\u003e - ŷ\u003csub\u003ei\u003c/sub\u003e)\u0026sup2;) / (Σ (y\u003csub\u003ei\u003c/sub\u003e - ȳ)\u0026sup2;)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eR\u0026sup2; indicates the proportion of the variance in the observed values explained by the model. 
A higher R\u0026sup2; value (closer to 1) reflects better model performance and fit.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eRMSE (Root Mean Squared Error)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eRMSE = \u0026radic;[(1/n) Σ (y\u003csub\u003ei\u003c/sub\u003e - ŷ\u003csub\u003ei\u003c/sub\u003e)\u0026sup2;]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eRMSE measures the standard deviation of prediction errors. It quantifies how closely the predicted values match the observed ones. Lower values indicate higher accuracy.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMAPE (Mean Absolute Percentage Error)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMAPE = (100/n) Σ |(y\u003csub\u003ei\u003c/sub\u003e - ŷ\u003csub\u003ei\u003c/sub\u003e) / y\u003csub\u003ei\u003c/sub\u003e|\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eMAPE expresses the average absolute error as a percentage of actual values. 
It allows for easy interpretation and comparison across different scales or units.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMBE (Mean Bias Error)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMBE = (1/n) Σ (ŷ\u003csub\u003ei\u003c/sub\u003e - y\u003csub\u003ei\u003c/sub\u003e)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eMBE quantifies the average bias in predictions, indicating whether a model tends to overestimate (positive MBE) or underestimate (negative MBE) values.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003ctfoot\u003e \u003ctr\u003e\u003ctd colspan=\"3\"\u003eWhere y\u003csub\u003ei\u003c/sub\u003e: observed value, ŷ\u003csub\u003ei\u003c/sub\u003e: predicted value, ȳ: mean of observed values, n: number of samples\u003c/td\u003e\u003c/tr\u003e \u003c/tfoot\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec12\" class=\"Section2\"\u003e \u003ch2\u003ePermutation-based feature importance\u003c/h2\u003e \u003cp\u003eTo assess the relative contribution of each input feature to the model\u0026rsquo;s predictive performance, permutation-based feature importance analysis was performed for the RF and MLP models. For each model, a MultiOutputRegressor was trained and evaluated, and permutation importance was computed separately for each target trait: number of branches, growth period, number of bolls, and seed yield per plant. The permutation importance method involves randomly shuffling the values of each feature and observing the resulting change in prediction error. This process was repeated 50 times to obtain a stable estimate of feature relevance. The procedure was applied to the test set after transforming it using the model\u0026rsquo;s preprocessing pipeline (including standardization and one-hot encoding). 
The importance scores were averaged across all repetitions, visualized as bar plots for each target trait to highlight the most influential genotype and planting date features, and also aggregated to assess global feature influence. The implementation used the permutation_importance function of the Scikit-learn library (version 1.3.2) (Pedregosa et al., 2011).\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec13\" class=\"Section2\"\u003e \u003ch2\u003eOptimization of ML Model via Non-Dominated Sorting Genetic Algorithm II (NSGA-II)\u003c/h2\u003e \u003cp\u003eTo identify optimal genotype and planting date combinations for maximizing agronomic performance, the best-performing machine learning model, RF, was employed as a surrogate fitness function and integrated with NSGA-II for multi-objective optimization (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e). In this optimization framework, the genotype and planting date were treated as decision variables, and four agronomic traits were simultaneously optimized: branch number, boll count, and seed yield per plant were maximized, while the growth period was minimized. The initial population of candidate solutions was randomly generated, and the tournament selection method (selTournamentDCD) was used to select the elite individuals for reproduction. A simulated binary crossover (SBX) operator was applied to create offspring, while real-valued polynomial mutation (real_pm) introduced diversity to prevent premature convergence. The optimization process adhered to the standard NSGA-II framework. In each generation, individuals in the population were first sorted into non-dominated fronts based on the principle of Pareto dominance. Each front was then assigned a rank, with lower ranks indicating better (non-dominated) solutions. 
To maintain both convergence toward the Pareto-optimal front and diversity among solutions, the algorithm utilized the concept of crowding distance. This metric estimates the density of surrounding solutions in the objective space by calculating the average distance between a solution and its immediate neighbors for each objective. During selection, individuals were prioritized based on lower rank and, within the same rank, higher crowding distance. This ensures that solutions in sparser regions of the objective space are favored, thereby preserving diversity and preventing premature convergence. To improve the fitness function during optimization, the values of the main operators (crossover rate, maximum number of generations, initial population size, and mutation rate) were set by trial and error: population size\u0026thinsp;=\u0026thinsp;100, number of generations\u0026thinsp;=\u0026thinsp;200, crossover probability\u0026thinsp;=\u0026thinsp;0.9, and mutation probability\u0026thinsp;=\u0026thinsp;0.1. The distribution indices for the crossover and mutation operators were set to 15 and 20, respectively. All modeling and optimization procedures were implemented in Python (version 3.11.12). The RF and MLP models were implemented and evaluated using the Scikit-learn library (version 1.3.2) (Pedregosa et al., 2011), and the NSGA-II optimization was carried out with the DEAP library (version 1.4.2) (Fortin et al., 2012). 
Hyperparameter tuning was performed using GridSearchCV.\u003c/p\u003e \u003c/div\u003e"},{"header":"Results","content":"\u003cdiv id=\"Sec15\" class=\"Section2\"\u003e \u003ch2\u003eCorrelation and statistical analyses\u003c/h2\u003e \u003cp\u003eCorrelation and two-way ANOVA analyses were conducted to evaluate the influence of genotype and planting date on a broad set of morphological and yield traits in Roselle. The correlation results (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ea) revealed that most traits were more strongly associated with planting date than with genotype, highlighting the dominant role of temporal factors in trait variation. Notably, branch number exhibited the highest correlation with planting date (r\u0026thinsp;=\u0026thinsp;0.63), followed by sepal weight (r\u0026thinsp;=\u0026thinsp;0.49), calyx size (r\u0026thinsp;=\u0026thinsp;0.47), harvest index (r\u0026thinsp;=\u0026thinsp;0.34), and boll number (r\u0026thinsp;=\u0026thinsp;0.31). In contrast, correlations with genotype were consistently weak: the largest positive value was observed for seeds per plant (r\u0026thinsp;=\u0026thinsp;0.082), the other traits showed negligible values, and plant height showed a slight negative correlation (r = \u0026minus;\u0026thinsp;0.12). Based on these findings, traits with minimal correlation to either input, such as plant height, calyx, harvest index, and sepal weight, were excluded from further modeling to enhance predictive clarity and reduce noise. Complementing the correlation analysis, two-way ANOVA was performed to assess the statistical significance of genotype, planting date, and their interaction (G\u0026times;E) on each trait (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eb). 
Our findings highlight that planting date consistently emerged as the most influential factor across most traits, underscoring the critical role of environmental conditions and sowing time in shaping Roselle performance. Traits such as branch number, boll number, and seed yield per plant showed highly significant responses (p\u0026thinsp;\u0026lt;\u0026thinsp;0.01) to all three effects (genotype, planting date, and their interaction), indicating that these traits respond strongly to both genotype and planting date. In contrast, traits such as harvest index, growth period, sepal weight, and calyx size were mainly influenced by planting date and the G\u0026times;E interaction, but not by genotype (p\u0026thinsp;\u0026gt;\u0026thinsp;0.05), highlighting their strong environmental dependence. Overall, among all the morphological traits measured in this study, branch number, boll number, seeds per plant, and growth period were selected as the final targets for predictive modeling and optimization.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec16\" class=\"Section2\"\u003e \u003ch2\u003eModel performance evaluation\u003c/h2\u003e \u003cp\u003eIn this study, RF and MLP models were used to predict morphological parameters based on genotype and planting date. The performances of both models are presented in Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e. The accuracy of each model was evaluated by R\u0026sup2;, RMSE, MBE, and MAPE. The results show that both models provided acceptable levels of accuracy. However, the RF algorithm (R\u0026sup2; = 0.84) outperformed the MLP (R\u0026sup2; = 0.80) in most of the target traits. Specifically, the RF model yielded the highest R\u0026sup2; and lowest error values in the majority of traits, indicating robust generalization ability and a well-fitted predictive structure. 
The regression lines indicated good agreement between the observed and predicted data for all growth parameters during the training and testing processes of the RF model (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003ea-h).\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eComparison statistics of multilayer perceptron (MLP) and random forest (RF) models for various morphological traits of \u003cem\u003eH. sabdariffa\u003c/em\u003e under training and testing conditions. Traits include branch number (branch), growth period (growth_period), number of bolls per plant (boll), and seeds per plant (seedperplant). R\u0026sup2;: coefficient of determination; RMSE: root mean square error; MBE: mean bias error; MAPE: mean absolute percentage error.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"7\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eModel\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e 
\u003cp\u003eSubset\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eCriterion\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003ebranch\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003egrowth_period\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003eboll\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c7\"\u003e \u003cp\u003eseedperplant\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"7\" rowspan=\"8\"\u003e \u003cp\u003e\u003cb\u003eRandomForest\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\" morerows=\"3\" rowspan=\"4\"\u003e \u003cp\u003eTraining\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eR2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.857\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.969\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.856\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.702\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eRMSE\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1.90\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e5.59\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e8.33\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e190.15\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" 
colname=\"c3\"\u003e \u003cp\u003eMBE\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.007\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.020\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.141\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e-3.856\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eMAPE\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.097\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.025\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.083\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.181\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\" morerows=\"3\" rowspan=\"4\"\u003e \u003cp\u003eTesting\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eR2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.845\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.946\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.834\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.770\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eRMSE\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1.88\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e 
\u003cp\u003e6.72\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e8.49\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e199.83\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eMBE\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e-0.21\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e-0.33\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e-1.17\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e17.45\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eMAPE\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.093\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.031\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.089\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.198\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"7\" rowspan=\"8\"\u003e \u003cp\u003e\u003cb\u003eMLP\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\" morerows=\"3\" rowspan=\"4\"\u003e \u003cp\u003eTraining\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eR2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.871\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.967\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" 
colname=\"c6\"\u003e \u003cp\u003e0.860\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.755\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eRMSE\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1.81\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e5.81\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e8.23\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e172.66\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eMBE\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e-0.005\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e-0.002\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.004\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.758\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eMAPE\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.092\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.026\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.077\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.160\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\" morerows=\"3\" rowspan=\"4\"\u003e \u003cp\u003eTesting\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" 
colname=\"c3\"\u003e \u003cp\u003eR2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.860\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.953\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.821\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.739\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eRMSE\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1.99\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e6.83\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e8.98\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e210.12\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eMBE\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e-0.356\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e-0.259\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e-1.285\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e28.62\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eMAPE\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.105\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.032\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.096\u003c/p\u003e 
\u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.220\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec17\" class=\"Section2\"\u003e \u003ch2\u003eFeature Importance Evaluation\u003c/h2\u003e \u003cp\u003eThe permutation importance analysis revealed clear and consistent patterns regarding the relative influence of each feature, including planting date and genotype, on the prediction of Roselle morphological traits (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003ea-d). Among the evaluated features, including five planting dates and ten genotypes, planting date emerged as the dominant factor across most traits. Traits such as branch number, growth period, and boll number demonstrated high sensitivity to planting date, with May, April, and March contributing most significantly to model performance. These temporal variables consistently ranked above genotypic features, indicating that environmental timing plays a more decisive role than genetic variation in shaping phenotypic outcomes. In terms of genotypic influence, although generally less impactful than planting date, a few genotypes showed relatively higher importance in specific traits. For branch number and boll count, \u0026lsquo;\u003cem\u003eIranshahr\u0026rsquo;\u003c/em\u003e and \u0026lsquo;\u003cem\u003eQaleganj\u0026rsquo;\u003c/em\u003e emerged as the most influential genotypes, contributing noticeably to model predictions. In the case of the growth period, \u0026lsquo;HA\u003cem\u003e\u0026rsquo;\u003c/em\u003e was the most prominent genotype, though its influence remained modest compared to planting date. For seed per plant, both planting date and genotype exhibited very low importance, indicating limited predictability and a likely dependence on unmeasured physiological or environmental factors. 
Overall, the analysis confirmed that planting date was consistently more critical than genotype in determining Roselle trait variability. Nonetheless, certain genotypes, such as \u003cem\u003eIranshahr\u003c/em\u003e, \u003cem\u003eQaleganj\u003c/em\u003e, and HA, exhibited trait-specific influence, highlighting their potential role in breeding programs when integrated with optimized planting schedules.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec18\" class=\"Section2\"\u003e \u003ch2\u003eModel optimization using NSGA-II\u003c/h2\u003e \u003cp\u003eIn this study, the NSGA-II algorithm was integrated with the RF model, the more accurate of the two predictive algorithms. The combined RF-NSGA-II model determined the values of the input variables (genotype and planting date) that simultaneously maximize four agronomic traits: branch number, growth period, boll number, and seeds per plant. The results of the multi-objective optimization process using the NSGA-II algorithm are summarized in Table\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e. 
The theoretically optimal performance was obtained with the genotype \u003cem\u003eQaleganj\u003c/em\u003e and a planting date in May, with predicted values of 26.009 for branch number, 175.872 for growth period, 116.078 for boll number, and 1517.165 for seeds per plant.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab3\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eOptimization of genotype and planting date of Roselle according to the RF-NSGA-II algorithm to obtain the best morphological traits, including number of branches, growth period, number of bolls, and number of seeds per plant.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"6\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colspan=\"2\" nameend=\"c2\" namest=\"c1\"\u003e \u003cp\u003eInput Items\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colspan=\"4\" nameend=\"c6\" namest=\"c3\"\u003e \u003cp\u003eOutput Items\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGenotype\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003ePlanting date\u003c/p\u003e 
\u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003ePredicted branch number\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003ePredicted growth period\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003ePredicted boll number\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003ePredicted seeds per plant\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eQaleganj\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMay\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e26.009\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e175.872\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e116.078\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e1517.165\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e"},{"header":"Discussion","content":"\u003cp\u003eYield prediction plays an important role in efficient and sustainable crop production. Accurate and timely predictions support farmers\u0026rsquo; decision-making regarding genotypes, planting date, irrigation, fertilization, harvesting, and trading (Khaki \u0026amp; Wang, 2019). Yield prediction in crop science is inherently challenging because of the multifaceted interactions among genetic factors (G), environmental conditions (E), and management practices (Leukel et al., 2023). Traditional linear models, including multiple linear regression and correlation-based approaches, often fall short in capturing these nonlinear, dynamic relationships, particularly under variable environmental contexts (Bejo et al., 2014). 
In contrast, ML models such as RF, support vector regression (SVR), ANNs, and deep learning (DL) architectures have emerged as robust alternatives, offering improved accuracy and adaptability in diverse agricultural scenarios (Leukel et al., 2023).\u003c/p\u003e \u003cp\u003eIn this study, we compared the predictive accuracy of MLP and RF for the yield traits of Roselle. Although the difference in R\u0026sup2; values between RF (0.84) and MLP (0.80) was relatively small, RF consistently outperformed MLP, which aligns with previous research highlighting RF's superior capability to handle high-dimensional and complex agricultural datasets (Gupta et al., 2023; Asamoah et al., 2024). The ensemble structure of RF, leveraging multiple independent decision trees, minimizes overfitting and enhances model generalizability, advantages that are especially important when working with the noisy or limited datasets typical of agricultural studies (Gupta et al., 2023). In contrast, MLP typically requires meticulous hyperparameter tuning and careful network architecture design to accurately capture complex nonlinear relationships, which can be computationally demanding and prone to overfitting. Another notable advantage of RF is its relative simplicity and computational efficiency. The training process for RF is more straightforward and faster than that for MLP, making RF not only more robust in capturing nonlinear interactions but also more practical for agricultural applications where resources and data quality may be constrained. Overall, these findings highlight RF as a more accurate, interpretable, and computationally efficient model for Roselle yield prediction, particularly in the context of the complex genotype-environment interactions that are typical in crop yield studies (Gupta et al., 2023; Asamoah et al., 2024).\u003c/p\u003e \u003cp\u003eAfter the RF model was identified as the more accurate of the two models, the NSGA-II algorithm was linked to the RF. 
The results of the RF-NSGA-II algorithm indicated that the optimal input variables for maximizing morphological traits, identified as the \u003cem\u003eQaleganj\u003c/em\u003e genotype and a planting date of May 5, correspond to the Pareto frontier solutions, highlighting this combination as the best compromise or optimal point among the studied scenarios. Permutation importance analysis identified planting date as the most influential factor across most traits, overshadowing the contributions of genotype. This aligns with findings of previous studies which have consistently highlighted early planting as a critical determinant of biomass accumulation and overall productivity in Roselle and other crops (Aung and Uape, 2022; Parsa Motlagh et al., 2018). Early planting dates (March to May) likely provide more favorable thermal and photoperiodic conditions, enhancing growth rates and extending the vegetative phase, and eventually improving the yield (El-Sagher et al., 2024).\u003c/p\u003e \u003cp\u003eAlthough genotype had a comparatively lesser effect, its role was not negligible. Specific genotypes such as \u003cem\u003eIranshahr\u003c/em\u003e and \u003cem\u003eQaleganj\u003c/em\u003e demonstrated higher predictive importance for branch number and boll count, while the HA genotype showed prominence in extending the growth period. These findings corroborate earlier observations by Ibrahim et al. (2013) and Tetteh et al. (2019), which identified substantial genetic variability in Roselle for key yield-contributing traits, including plant height, branching, calyx yield, fibre production, and nutritional composition. Collectively, these studies highlight the considerable potential for selective breeding and genetic improvement in Roselle, emphasizing the importance of exploiting this variability to develop high-yielding and resilient cultivars. 
High heritability observed in traits such as plant height and number of branches suggests that these are primarily governed by genetic factors and can be effectively improved through selection. The interactions between genotype and planting date observed in this study suggest that while environmental timing is the primary driver of trait variability, genotype selection can further refine yield performance when optimized within suitable environmental windows. Moreover, our study revealed that seed yield per plant consistently exhibited low predictive importance for both planting date and genotype, suggesting a reliance on factors not directly captured in this dataset. This finding underscores the complexity and multifaceted nature of seed production in Roselle. Specifically, it highlights that while other morphological traits (such as branch number and boll count) are primarily shaped by genotype and planting date, seed yield is more strongly influenced by unmeasured physiological and environmental interactions. Our findings are consistent with earlier studies that have underscored the pivotal role of micro-environmental and physiological factors in determining seed production in Roselle (Atta et al., 2011; Ibrahim et al., 2013). These factors include variations in plant morphology, soil moisture levels, and physiological responses to environmental stressors, all of which contribute significantly to yield differences across genotypes. To the best of our knowledge, this is the first study to apply ML-NSGA-II for modeling and optimizing morphological traits of Roselle in response to various genotypes and planting dates. 
In conclusion, our study highlights the applicability and reliability of ML models, particularly RF, for analyzing complex genotype-environment interactions in Roselle cultivation.\u003c/p\u003e"},{"header":"Conclusion","content":"\u003cp\u003eThis study aimed to predict and understand how genotype and planting date affect the morphological traits of Roselle by leveraging advanced machine learning models, specifically RF and MLP, for the first time. The comparative analysis demonstrated that despite the relatively small difference in predictive performance, the RF model consistently outperformed MLP, confirming its robustness in capturing the complex genotype-by-environment interactions typical in agricultural systems. By integrating RF with the NSGA-II algorithm, we identified the combination of the \u003cem\u003eQaleganj\u003c/em\u003e genotype and the planting date of May 5 as optimal for maximizing morphological traits. Furthermore, the RF-NSGA-II hybrid algorithm proved effective for multi-objective optimization, offering a tool to identify ideal input combinations under varying conditions. These findings emphasize the potential of advanced ML-based approaches as robust alternatives to traditional statistical methods, offering new avenues for optimizing and improving morphological traits in Roselle and other crops in future studies. Future research can expand on these findings by incorporating additional environmental variables, exploring genomic selection approaches, and validating the models across multiple growing seasons and agroecological zones.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eCRediT authorship contribution statement\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eWarqaaMuhammed ShariffAl-Sheikh: data collection and collaboration in the experiments. Fazilat Fakhrzad: experimental design and data analysis. Mohammed M. 
Mohammed: collaboration in writing the manuscript. Heidar Meftahizade: performing the experiments and data analysis.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u0026nbsp;Declaration of competing interest\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u0026nbsp;The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFunding\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNo specific financial credit was used in this experiment.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eData availability\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eData is provided within the manuscript \u0026nbsp;file.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eAli, B.H., Wabel, N.A. and Blunden, G., 2005. Phytochemical, pharmacological and toxicological aspects of \u003cem\u003eHibiscus sabdariffa\u003c/em\u003e L.: a review. \u003cem\u003ePhytotherapy Research: An International Journal Devoted to Pharmacological and Toxicological Evaluation of Natural Product Derivatives\u003c/em\u003e, \u003cem\u003e19\u003c/em\u003e(5), pp.369-375. https://doi.org/10.1002/ptr.1628\u003c/li\u003e\n\u003cli\u003eZand-Silakhoor, A., Madani, H., Sharifabad, H.H., Mahmoudi, M. and Nourmohammadi, G., 2022. Influence of different irrigation regimes and planting times on the quality and quantity of calyx, seed oil content and water use efficiency of roselle (\u003cem\u003eHibiscus sabdariffa\u003c/em\u003e L.). \u003cem\u003eGrasas y Aceites\u003c/em\u003e, \u003cem\u003e73\u003c/em\u003e(3), pp.e472-e472. https://doi.org/10.3989/gya.0564211\u003c/li\u003e\n\u003cli\u003eFortin, F.A., De Rainville, F.M., Gardner, M.A.G., Parizeau, M. and Gagn\u0026eacute;, C., 2012. DEAP: Evolutionary algorithms made easy. 
\u003cem\u003eThe Journal of Machine Learning Research\u003c/em\u003e, \u003cem\u003e13\u003c/em\u003e(1), pp.2171-2175.\u003c/li\u003e\n\u003cli\u003ePedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V. and Vanderplas, J., 2011. Scikit-learn: Machine learning in Python. \u003cem\u003ethe Journal of machine Learning research\u003c/em\u003e, \u003cem\u003e12\u003c/em\u003e, pp.2825-2830.\u003c/li\u003e\n\u003cli\u003eMohamed, R., Fernandez, J., Pineda, M. and Aguilar, M., 2007. Roselle (\u003cem\u003eHibiscus sabdariffa\u003c/em\u003e) seed oil is a rich source of \u0026gamma;‐tocopherol. \u003cem\u003eJournal of food science\u003c/em\u003e, \u003cem\u003e72\u003c/em\u003e(3), pp.S207-S211. https://doi.org/10.1111/j.1750-3841.2007.00285.x\u003c/li\u003e\n\u003cli\u003eMulyaningsih, E.S., Hartati, N., Dyan Anggraheni, Y.G., Harmoko, R., Indrayani, S., Rahman, N., Fitriani, H., Nuro, F., Hapsari, Y. and Andika, N.R., 2025. Morpho Genetic Variability and Anthocyanine (Cyanidin-3-OGlucoside) Concent of Indonesia Roselle (\u003cem\u003eHibiscus sabdariffa\u003c/em\u003e L.). \u003cem\u003eInternational Journal on Advanced Science, Engineering \u0026amp; Information Technology\u003c/em\u003e, \u003cem\u003e15\u003c/em\u003e(1).\u003c/li\u003e\n\u003cli\u003eMahunu, G.K., 2021. Breeding, genetic diversity, and safe production of \u003cem\u003eHibiscus sabdariffa\u003c/em\u003e under climate change. In \u003cem\u003eRoselle (Hibiscus sabdariffa)\u003c/em\u003e (pp. 1-14). Academic Press.\u003c/li\u003e\n\u003cli\u003eRichardson, M.L. and Arlotta, C.G., 2021. Differential yield and nutrients of \u003cem\u003eHibiscus sabdariffa\u003c/em\u003e L. genotypes when grown in urban production systems. \u003cem\u003eScientia Horticulturae\u003c/em\u003e, \u003cem\u003e288\u003c/em\u003e, p.110349.\u003c/li\u003e\n\u003cli\u003eUllah, M. Z. (2024). 
Genetic variability of Roselle (\u003cem\u003eHibiscus sabdariffa\u003c/em\u003e L.) genotypes. \u003cem\u003eInnovare Journal of Agricultural Sciences\u003c/em\u003e. 12 (1), 1-6. http://dx.doi.org/10.22159/ijags.2024v12i1.49802. \u003c/li\u003e\n\u003cli\u003eTetteh, A.Y., Ankrah, N.A., Coffie, N. and Niagiah, A., 2019. Genetic diversity, variability and characterization of the agro-morphological traits of Northern Ghana Roselle (\u003cem\u003eHibiscus sabdariffa\u003c/em\u003e var. altissima) accessions. \u003cem\u003eAfrican Journal of Plant Science\u003c/em\u003e, 13(6), pp.168-184. https://doi.org/10.5897/AJPS2019.1783.\u003c/li\u003e\n\u003cli\u003eIbrahim, M. M. and Hussein, R. M. 2006. Variability, heritability and genetic advance in some genotype of roselle (\u003cem\u003eHibiscus sabdariffa\u003c/em\u003e). \u003cem\u003eWorld Journal of Agricultural Sciences\u003c/em\u003e, 2 (3): 340-345. \u003c/li\u003e\n\u003cli\u003eAung, C. and Uape, M., 2022. The effect of Different Planting Dates on the Growth and Yield of Roselle (Hibiscus sabdariffa L.) during the Rainy Season. \u003cem\u003eUniversity of Yangon Research Journal\u003c/em\u003e, 11 (1). https://meral.edu.mm/records/9019\u003c/li\u003e\n\u003cli\u003eEl-Sagher, M., Mostafa, G.G., El-Ghadban, E.M.A., Soliman, W.S. and Gahory, A.A., 2024. Sowing date as a determining factor for Roselle, \u003cem\u003eHibiscus sabdariffa\u003c/em\u003e, production: I. Effect on vegetative and yield components. \u003cem\u003eAswan University Journal of Sciences and Technology\u003c/em\u003e, \u003cem\u003e4\u003c/em\u003e(1), pp.28-37. https://dx.doi.org/10.21608/aujst.2024.337887\u003c/li\u003e\n\u003cli\u003eYoosefzadeh-Najafabadi, M., Earl, H.J., Tulpan, D., Sulik, J. and Eskandari, M., 2021a. Application of machine learning algorithms in plant breeding: predicting yield from hyperspectral reflectance in soybean. \u003cem\u003eFrontiers in plant science\u003c/em\u003e, \u003cem\u003e11\u003c/em\u003e, p.624273. 
https://doi.org/10.3389/fpls.2020.624273\u003c/li\u003e\n\u003cli\u003eYoosefzadeh-Najafabadi, M., Tulpan, D. and Eskandari, M., 2021b. Application of machine learning and genetic optimization algorithms for modeling and optimizing soybean yield using its component traits. \u003cem\u003ePlos one\u003c/em\u003e, \u003cem\u003e16\u003c/em\u003e(4), p.e0250665. https://doi.org/10.1371/journal.pone.0250665\u003c/li\u003e\n\u003cli\u003eZarbakhsh, S. and Shahsavar, A.R., 2022. Artificial neural network-based model to predict the effect of \u0026gamma;-aminobutyric acid on salinity and drought responsive morphological traits in pomegranate. \u003cem\u003eScientific Reports\u003c/em\u003e, \u003cem\u003e12\u003c/em\u003e(1), p.16662. https://doi.org/10.1038/s41598-022-21129-z\u003c/li\u003e\n\u003cli\u003eFakhrzad, F., Jowkar, A. and Hosseinzadeh, J., 2022. Mathematical modeling and optimizing the in vitro shoot proliferation of wallflower using multilayer perceptron non-dominated sorting genetic algorithm-II (MLP-NSGAII). \u003cem\u003ePLoS One\u003c/em\u003e, \u003cem\u003e17\u003c/em\u003e(9), p.e0273009. https://doi.org/10.1371/journal.pone.0273009\u003c/li\u003e\n\u003cli\u003ePrasad, N.R., Patel, N.R. and Danodia, A., 2021. Crop yield prediction in cotton for regional level using random forest approach. \u003cem\u003espatial information research\u003c/em\u003e, \u003cem\u003e29\u003c/em\u003e, pp.195-206. https://doi.org/10.1007/s41324-020-00346-6. \u003c/li\u003e\n\u003cli\u003eShook, J., Gangopadhyay, T., Wu, L., Ganapathysubramanian, B., Sarkar, S. and Singh, A.K., 2021. Crop yield prediction integrating genotype and weather variables using deep learning. \u003cem\u003ePlos one\u003c/em\u003e, \u003cem\u003e16\u003c/em\u003e(6), p.e0252402. https://doi.org/10.48550/arXiv.2006.13847.\u003c/li\u003e\n\u003cli\u003eLeukel, J., Zimpel, T. and Stumpe, C., 2023. Machine learning technology for early prediction of grain yield at the field scale: A systematic review. 
\u003cem\u003eComputers and Electronics in Agriculture\u003c/em\u003e, \u003cem\u003e207\u003c/em\u003e, p.107721. https://doi.org/10.1016/j.compag.2023.107721. \u003c/li\u003e\n\u003cli\u003eKhaki, S. and Wang, L., 2019. Crop yield prediction using deep neural networks. \u003cem\u003eFrontiers in plant science\u003c/em\u003e, \u003cem\u003e10\u003c/em\u003e, p.621. https://doi.org/10.3389/fpls.2019.00621.\u003c/li\u003e\n\u003cli\u003eGupta, I., Ayalasomayajula, S., Shashidhara, Y., Kataria, A., Shashidhara, S., Kataria, K. and Undurti, A., 2023. Innovations in Agricultural Forecasting: A Multivariate Regression Study on Global Crop Yield Prediction. \u003cem\u003earXiv preprint arXiv:2312.02254\u003c/em\u003e. https://doi.org/10.48550/arXiv.2312.02254\u003c/li\u003e\n\u003cli\u003eAsamoah, E., Heuvelink, G.B., Chairi, I., Bindraban, P.S. and Logah, V., 2024. Random forest machine learning for maize yield and agronomic efficiency prediction in Ghana. \u003cem\u003eHeliyon\u003c/em\u003e, \u003cem\u003e10\u003c/em\u003e(17). https://doi.org/10.1016/j.heliyon.2024.e37065\u003c/li\u003e\n\u003cli\u003eParsa Motlagh, B., Rezvani Moghaddam, P. and Azami Sardooei, Z., 2018. Responses of Calyx Phytochemical Characteristic, Yield and Yield Components of Roselle (\u003cem\u003eHibiscus sabdariffa\u003c/em\u003e L.) to Different Sowing Dates and Densities. \u003cem\u003eInternational Journal of Horticultural Science and Technology\u003c/em\u003e, \u003cem\u003e5\u003c/em\u003e(2), pp.241-251.\u003c/li\u003e\n\u003cli\u003eAtta, S., Seyni, H.H., Bakasso, Y., Sarr, B., Lona, I. and Saadou, M., 2011. Yield character variability in Roselle \u003cem\u003e(Hibiscus sabdariffa\u003c/em\u003e L.). \u003cem\u003eAfrican Journal of Agricultural Research\u003c/em\u003e, \u003cem\u003e6\u003c/em\u003e(6), pp.1371-1377. https://doi.org/10.5897/AJAR10.334\u003c/li\u003e\n\u003cli\u003eIbrahim, E.B., Abdalla, A.W.H., Ibrahim, E.A. and El Naim, A.M., 2013. 
Variability in some roselle (\u003cem\u003eHibiscus sabdariffa\u003c/em\u003e L.) genotypes for yield and its attributes. \u003cem\u003eInternational Journal of Agriculture and Forestry\u003c/em\u003e, \u003cem\u003e3\u003c/em\u003e(7), pp.261-266.\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":true,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"scientific-reports","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"scirep","sideBox":"Learn more about [Scientific Reports](http://www.nature.com/srep/)","snPcode":"","submissionUrl":"","title":"Scientific Reports","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"Scientific Reports","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"Machine Learning techniques, prediction, optimization algorithm, Multi-layer Perceptron, Random Forest","lastPublishedDoi":"10.21203/rs.3.rs-6917293/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-6917293/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eAccurate prediction and optimization of morphological traits in Roselle are essential for enhancing crop productivity and adaptability to diverse environments. 
In the present study, a machine learning framework was developed using Random Forest and Multi-layer Perceptron algorithms to model and predict key morphological traits, branch number, growth period, boll number, and seed number per plant, based on genotype and planting date. The dataset was generated from a field experiment involving ten Roselle genotypes and five planting dates. Both RF and MLP exhibited robust predictive capabilities; however, RF (R\u0026sup2; = 0.84) demonstrated superior performance compared to MLP (R\u0026sup2; = 0.80), underscoring its efficacy in capturing the nonlinear genotype-by-environment interactions. Permutation-based feature importance analysis further revealed that planting date had a more significant impact on trait variation than genotype. To identify optimal combinations of genotype and planting date for maximizing morphological traits, the RF model was integrated with the Non-dominated Sorting Genetic Algorithm II (NSGA-II). According to the RF\u0026ndash;NSGA-II optimization results, the optimal values for branch number (26), growth period (176 days), boll number (116), and seed number per plant (1517) were achieved with the \u003cem\u003eQaleganj\u003c/em\u003e population planted on May 5. 
Collectively, these findings highlight the potential of integrating machine learning and evolutionary optimization algorithms as powerful computational tools for crop improvement and agronomic decision-making.\u003c/p\u003e","manuscriptTitle":"Machine Learning-Driven Models to Predict the optimum Genotype and Planting Date on yield and phytochemical Traits in Roselle (Hibiscus sabdariffa L.)","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-07-03 05:54:02","doi":"10.21203/rs.3.rs-6917293/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Revision requested","date":"2025-07-14T08:59:51+00:00","index":"","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-07-14T04:57:52+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-07-11T16:05:03+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"298955227036865067031752305774144773848","date":"2025-07-09T05:17:44+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-07-06T20:11:36+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"316452330439952395308151132257921621129","date":"2025-07-04T01:05:16+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"74101014128690003297059209674173254314","date":"2025-07-03T01:46:21+00:00","index":"hide","fulltext":""},{"type":"reviewersInvited","content":"","date":"2025-07-01T23:27:02+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2025-07-01T14:37:11+00:00","index":"","fulltext":""},{"type":"editorInvited","content":"","date":"2025-06-23T17:59:57+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2025-06-21T15:20:40+00:00","index":"","fulltext":""},{"type":"submitted","content":"Scientific 
Reports","date":"2025-06-17T20:23:31+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"scientific-reports","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"scirep","sideBox":"Learn more about [Scientific Reports](http://www.nature.com/srep/)","snPcode":"","submissionUrl":"","title":"Scientific Reports","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"Scientific Reports","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"70760b8e-7f25-4a83-9afa-c8584a8e0cc4","owner":[],"postedDate":"July 3rd, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"under-review","subjectAreas":[{"id":50958614,"name":"Biological sciences/Ecology"},{"id":50958615,"name":"Biological sciences/Plant sciences"}],"tags":[],"updatedAt":"2025-08-07T05:38:32+00:00","versionOfRecord":[],"versionCreatedAt":"2025-07-03 05:54:02","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-6917293","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-6917293","identity":"rs-6917293","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
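The Pareto-frontier idea behind the RF-NSGA-II step described in the article can be sketched in a few lines. Because the design space has only ten genotypes and five planting dates (50 combinations), every combination can in principle be scored by a trained surrogate model and the non-dominated set enumerated directly. The sketch below is an illustration under stated assumptions, not the authors' implementation: it swaps NSGA-II for a brute-force non-dominance filter, and the candidate labels (`G1`, `G2`) and all predicted values are hypothetical placeholders, not data from the paper.

```python
from typing import Dict, List, Tuple

Candidate = Tuple[str, str]      # (genotype, planting date)
Objectives = Tuple[float, ...]   # predicted traits, all to be maximized


def dominates(a: Objectives, b: Objectives) -> bool:
    """True if a is at least as good as b on every trait and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(predictions: Dict[Candidate, Objectives]) -> List[Candidate]:
    """Return the candidates that no other candidate dominates."""
    front = []
    for cand, obj in predictions.items():
        dominated = any(
            dominates(other_obj, obj)
            for other, other_obj in predictions.items()
            if other != cand
        )
        if not dominated:
            front.append(cand)
    return front


# Hypothetical surrogate predictions:
# (branch number, growth period, boll number, seeds per plant)
predictions = {
    ("G1", "March"): (20.0, 170.0, 90.0, 1200.0),
    ("G1", "May"): (24.0, 172.0, 110.0, 1400.0),
    ("G2", "April"): (22.0, 176.0, 100.0, 1300.0),
    ("G2", "June"): (18.0, 150.0, 80.0, 1000.0),
}

front = pareto_front(predictions)
print(front)
```

In this toy input, two candidates survive because each is best on at least one trait. A single recommended combination, such as the one reported in Table 3, would then require an additional selection rule applied to the front, which this sketch does not attempt to reproduce.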

Source provenance

europepmc: last seen: 2026-05-20T01:45:00.602351+00:00
unpaywall: last seen: 2026-05-24T02:00:01.246996+00:00

License: CC-BY-4.0