Machine Learning Driven Approach for Modelling Milk Production in India

Research Article

Sanjivani Srivastava, Jyotshana Singh, Shweta Chaudhary, Thota Sivasankar, and 1 more

This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-8702287/v1 This work is licensed under a CC BY 4.0 License. Status: Under Revision. Version 1 posted 11

Abstract

Milk production in India is integral to its food and nutritional security and is instrumental in achieving SDGs 1, 2, and 5. Accurate early prediction of milk production is crucial for data-driven policymaking, with implications for resource allocation and supply chain optimization. However, modelling milk production remains complex due to its dependence on multiple factors and the significant regional heterogeneity within the country.
While Artificial Intelligence (AI) driven approaches have expanded modelling capabilities in livestock production, their use in modelling milk production remains underexplored in the Indian context. Therefore, the present study aimed to develop state-level predictive models for milk production, employing a suite of Machine Learning (ML) algorithms. Time-series data encompassing fifteen variables from 2000-01 to 2022-23 across twenty-seven Indian states were systematically collected and analyzed. Seven machine learning techniques, namely Decision Tree (DT), Random Forest (RF), eXtreme Gradient Boosting (XGB), Light Gradient Boosting Machine (LightGBM), CatBoost (CB), and two Ensemble techniques, were tested and compared using an extensive array of performance metrics. Notably, the ensemble technique combining DT, RF, and XGB demonstrated superior predictive accuracy, surpassing individual models. Conversely, DT exhibited the weakest relative performance. The findings clearly point to genetic improvement, remunerative pricing for producers, and assured irrigation as strategic priorities to boost India’s milk production.

Keywords: Machine learning, Predictive modelling, Milk production, India

1. INTRODUCTION

India is the world’s largest producer of milk, accounting for 22% of global milk production (FAO). With a significant segment of its population following a lactovegetarian diet, India is also the world’s leading consumer of milk (USDA Report, 2021), which makes milk and milk products indispensable to the country’s policies on food and nutritional security. The dairy sector directly supports the livelihoods of more than eighty million people, with the workforce predominantly comprising women and small and marginal farmers (GoI, 2024).
Within India’s socioeconomic setting, livestock rearing and milk production are intertwined with poverty alleviation, income diversification, improved welfare of rural households, higher resilience to economic vulnerability, and women’s empowerment (Birthal and Negi, 2012; FAO, 2018; Tricarico et al., 2020). The milk cooperatives in the country have been recognized for delivering higher economic incentives and inclusiveness in resource distribution (Gaillard and Dervillé, 2022). Thus, dairying is a vital component of the country’s agricultural sector and is instrumental to achieving Sustainable Development Goals (SDGs) 1, 2, and 5, i.e., ‘No poverty’, ‘Zero hunger’, and ‘Gender equality’.

In recent decades, fuelled by factors such as a rising population, increased disposable income, and growing urbanization, India has been experiencing a ‘Livestock Revolution’ (Delgado et al., 2001). The Indian dairy industry is expanding at a substantial rate of 6.2% annually (GoI, 2022). Milk stands out as the only livestock product in the country with a surplus in per-capita availability (Srivastava et al., 2024). However, the income elasticity of demand for milk and milk products is higher than that for egg, meat, and fish in the country, implying that an increase in consumer income is likely to translate into a much higher demand for the former (Vasant and Zhang-Yue, 2010). The rise in milk prices has already been identified as a major contributor to food price inflation within the country (Sekhar et al., 2017). Despite this, a significant portion of the marketable surplus of raw milk continues to be channelled through unorganized supply chains that lack essential storage and transportation infrastructure (Petare, 2013), resulting in adulteration and spoilage. As per the NITI Aayog (2021), only 40% of the milk sold in the country is processed, compared to 90% in developed countries.
Thus, the government aims to double milk handling to provide dairy farmers with higher incomes (National Action Plan for Dairy Development, 2018). Looking ahead, global milk production is forecast to grow at 1.8% per annum over the next decade, while the dairy herd population is projected to grow at an annual rate of 1.1%, particularly in low-yield regions, such as Sub-Saharan Africa, and in major milk-producing countries like India and Pakistan (OECD-FAO, 2022). These trends spark a ‘food versus feed’ debate and raise concerns about the sector’s long-term environmental sustainability, owing to mounting land-use pressure and the significant contribution of livestock production to greenhouse gas emissions (FAO, 2006).

Against this backdrop, accurate early prediction of milk production in the country would equip policymakers with insights to make informed decisions, ensuring sustained productivity and long-term food security (Lyngkhoi et al., 2022). In addition, such an attempt would align with the government’s Viksit Bharat Initiative, which emphasizes data-driven policymaking. However, modelling milk production is complex due to its dependence on multiple factors (Rao, 2017; Sarkar et al., 2022). Particularly in India, wide heterogeneity exists among states in terms of socioeconomic attributes, dietary preferences, sources of feed and fodder, infrastructural and technological progress, and institutional working, leading to regional disparities in milk production (Rajeshwaran et al., 2014; Kale et al., 2019). Therefore, investigating variables at the state level becomes crucial for designing robust, contextually appropriate models. This entails selecting a modelling technique that scales to the available data and handles it efficiently. Owing to advances in applied mathematics, statistics, computing, and information sciences, Artificial Intelligence (AI) is finding increasing application (Franco & Santurro, 2020).
Machine Learning (ML) algorithms, a subset of AI, have been documented for their ability to handle high-dimensional datasets, universal approximation, and adaptive learning. Hence, they are being increasingly employed for a variety of applications in livestock production, such as detecting diseases in cattle (Bhardwaj et al., 2023), predicting the yield of milch species (Kumar et al., 2019), forecasting the production of livestock products (Suseendran & Duraisamy, 2021 r, 2024; Kim et al., 2024), and estimating methane emissions from livestock farming (Perumal et al., 2024). In the Indian context, researchers have explored various methodologies to forecast milk production, ranging from traditional modelling approaches such as ARIMA, VAR, and Holt’s model (Paul et al., 2014; Deshmukh & Paramasivam, 2016; Mishra et al., 2020; Jirli et al., 2025) to mathematical models like deterministic growth models (Srivastava, 2024), AI algorithms like neural networks (Shankar et al., 2023), and hybrid models (Subbanna et al., 2021). However, studies employing advanced ML methods for large-scale modelling of milk production within the country remain limited. Therefore, the present study aimed to develop robust predictive models¹, employing a range of machine learning techniques, to estimate state-level milk production across India.

2. MATERIALS AND METHODS

Despite their accuracy in forecasting, ML models have not been fully utilized in predicting milk production, likely due to the lack of multi-parameter datasets (Ozella et al., 2023). To address this issue, the study retrieved and compiled time-series information on fifteen variables for twenty-seven Indian states², spanning a period of over two decades, from 2000–01 to 2022–23.
Open data repositories of government agencies, namely the Department of Animal Husbandry & Dairying, the National Dairy Development Board, the Department of Agriculture & Farmers Welfare, and the Reserve Bank of India, were used as sources. Time-series variables corresponding to different base years were spliced to the latest base year of 2011-12. Due to regional heterogeneity, data compilation was performed at the state level to improve the models' applicability and robustness. Subsequently, the variables were pre-processed and used to train various ML models, wherein proper data splitting and validation strategies ensured the quality of training. The performance of these models was finally evaluated using the test dataset, followed by a comparison to identify the best-performing model. The modelling framework adopted in the study is summarized in Fig. 1, and each step is detailed as follows.

2.1 Selection of model parameters

To build a model with good accuracy and applicability, comprehensive indicators or relevant proxy indicators that are in concordance with existing theory need to be selected, with a clear rationale behind the inclusion of each variable and the underlying presumption about its relationship with the dependent variable. Based on the availability of secondary data, sixteen input variables were selected a priori for developing the milk production models, as detailed in Table 1.

Table 1 Input variables selected for developing the milk production models

| S. No. | Determinant | Notation | Rationale for inclusion | Unit | Minimum value | Maximum value | A priori relationship |
|---|---|---|---|---|---|---|---|
| 1 | Number of in-milk crossbred and exotic cows | N_cb | Cows and buffaloes are the major milking species in the country and contribute about 97% of milk production. Moreover, a substantial increase in milk production from buffaloes and cows is projected for the country by 2025–26 (Devi et al., 2022). | Thousands | 36.1 | 2776.38 | + |
| 2 | Number of in-milk indigenous and non-descript cows | N_i | As for S. No. 1 | Thousands | 7.34 | 5547.87 | + |
| 3 | Number of in-milk buffalo | N_b | As for S. No. 1 | Thousands | 6.55 | 11759.06 | + |
| 4 | Average yield per in-milk crossbred and exotic cow | Y_cb | Given the variability in production performance of different species, their average yields were taken (Kale et al., 2018). | Kg/day | 3.1 | 13.43 | + |
| 5 | Average yield per in-milk indigenous and non-descript cow | Y_i | As for S. No. 4 | Kg/day | 1 | 6.91 | + |
| 6 | Average yield per in-milk buffalo | Y_b | As for S. No. 4 | Kg/day | 1.67 | 9.11 | + |
| 7 | Area under pastures and permanent grazing lands | A_p | Pastures and permanent grazing lands, fodder crops, and crop residue are the prime sources of feed/fodder in India (Khan & Tomar, 2022; Singh et al., 2022). Feed availability is directly linked to animal nutrition and productivity (Prajapati et al., 2019; Kumar et al., 2023). | Thousand hectares | 0 | 1709 | + |
| 8 | Area under fodder crops | A_fc | As for S. No. 7 | Thousand hectares | 0 | 5370 | + |
| 9 | Area under cereals and millets | A_cm | As for S. No. 7 | Thousand hectares | 172 | 18842 | + |
| 10 | Ratio of gross irrigated area to total cropped area | R | Irrigation is central to India's crop-milk mixed farming system (Goswami et al., 2017). Improved access to groundwater irrigation facilitates a transition from drought to dairying in the bovine economy (Kishore et al., 2016). | Unitless | 3.6 | 98.8 | + |
| 11 | Artificial inseminations conducted | AI | Artificial insemination is crucial in enhancing bovine productivity by advancing the genetic potential (NAAS, 2020; Saha and Bhattacharya, 2021). | Thousands | 45 | 13394 | + |
| 12 | Number of veterinary institutions | V | The count of veterinary institutions was selected as the indicator of animal health care (Kumar et al., 2013). Poor veterinary infrastructure is linked to the imbalanced progress of dairy development in the country (Kale et al., 2016). | Units | 801 | 7897 | + |
| 13 | Producer prices | PP | Remunerative producer prices are incentives for adopting improved production technologies (Cwalina et al., 2020). | Lakh Rupees | 95725 | 127292 | + |
| 14 | Per capita net state domestic product (at constant prices) | NSDPpc | Taken as a proxy for per-capita income, a measure of people’s purchasing power, since one of the main drivers of growth in milk demand is an increase in disposable incomes (Gerosa and Skoet, 2012). | Rupees | 7091 | 423716 | + |
| 15 | Production year | Yr | To capture the local or global effects arising from socio-politico-economic factors, which are difficult to capture otherwise but affect the production and supply of milk. | | 0 | 22 | Not known |
| 16 | State | St | Post-White Revolution, the Indian dairy sector is identified with inter- and intra-state inequalities in production (Gupta & Purohit, 2010; Jaiswal, 2022), addressing which is imperative to meet future demand. | | 0 | 26 | Not known |

2.2 Machine Learning Models selected

ML algorithms have gained prominence across numerous fields due to their speed and accuracy in handling voluminous data. In this study, seven machine-learning techniques were employed, namely Decision Tree (DT), Random Forest (RF), eXtreme Gradient Boosting (XGB), Light Gradient Boosting Machine (LightGBM), CatBoost (CB), and two Ensemble models. Ensemble Technique 1 (ET-1) combined Random Forest and XGB, while Ensemble Technique 2 (ET-2) combined DT, RF, and XGB. Each of these modelling techniques is described briefly in Table 2.

Table 2 Summary of ML techniques used for the development of predictive models

| Model Name | Acronym | Description | Reference |
|---|---|---|---|
| Decision Tree | DT | Decision trees are popular tools for predictive modeling owing to their simplicity. They are highly user-friendly, typically requiring minimal or no tuning, and can be trained rapidly. | Blockeel et al., 2023 |
| Random Forest | RF | The tree-based ensemble method is widely regarded for its rapid training, direct application to high-dimensional datasets, and statistical advantages such as providing measures of variable importance and unsupervised learning. | Cutler et al., 2012 |
| eXtreme Gradient Boosting | XGB | Tree boosting is a widely used, powerful machine-learning method. XGBoost is a scalable end-to-end implementation of this approach, recognized for its predictive accuracy and interpretability. | Ali et al., 2023 |
| Light Gradient Boosting Machine | LightGBM | The LightGBM regressor is a tree-based ensemble learning algorithm, developed to improve upon the efficiency and scalability constraints of XGBoost when applied to high-dimensional feature spaces and large-scale datasets. | Soomro et al., 2024 |
| CatBoost | CB | CatBoost excels in handling categorical data, thereby simplifying the process of model development and reducing the risk of information loss or overfitting. | Kulkarni, 2022 |
| Ensemble Technique 1 | ET-1 | Ensemble learning is a machine learning concept in which multiple models are combined to tackle the same task, with the collective performance of the models surpassing that of any single constituent model. | Fawagreh et al., 2014 |
| Ensemble Technique 2 | ET-2 | As for ET-1 | Fawagreh et al., 2014 |

2.3 Data pre-processing

As machine learning algorithms operate on numerical inputs, it is necessary to encode categorical variables into numerical values using suitable encoding techniques (Potdar et al., 2017). In the study, label encoding, a type of determined encoding technique, was selected for its lower computational complexity (Hancock and Khoshgoftaar, 2020). The technique was applied to the 'state' and 'year' columns to replace the categorical values with distinct, scalar numerical values.

2.4 Model Development, Validation and Evaluation

The processed dataset, comprising 558 observations, was divided into training and testing subsets using an 80:20 ratio.
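The two preparatory steps described so far, label encoding of the categorical columns and the 80:20 partition, can be illustrated with a minimal standard-library sketch. The toy records, the helper names `label_encode` and `split_80_20`, and the seed are assumptions made for illustration; the study does not state which library routines it used for these steps.

```python
import random


def label_encode(values):
    """Replace each distinct category with an integer code (alphabetical order)."""
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def split_80_20(rows, seed=42):
    """Shuffle rows with a fixed seed, then cut them into 80% train / 20% test."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(0.8 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical toy records (state, year, milk production); not the study's data.
states = ["Punjab", "Kerala", "Assam", "Bihar", "Gujarat"]
years = ["2000-01", "2001-02"]
pairs = [(s, y) for s in states for y in years]
raw = [(s, y, 100.0 * (i + 1)) for i, (s, y) in enumerate(pairs)]

# Encode the 'state' and 'year' columns as integers.
state_codes, state_map = label_encode([r[0] for r in raw])
year_codes, year_map = label_encode([r[1] for r in raw])
encoded = [(sc, yc, r[2]) for sc, yc, r in zip(state_codes, year_codes, raw)]

# Partition the encoded rows 80:20.
train, test = split_80_20(encoded)
print(state_map, year_map, len(train), len(test))
```

Label encoding assigns arbitrary integers, so the codes carry no inherent order for 'state'; tree-based models can still split on them, which is one reason this encoding is commonly paired with such models.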
The training process allows for automatic adjustment of weights and biases in a model during each iteration, while the validation process, which is performed on a reserved, unseen portion of the training data, allows for computing the loss at each iteration, thereby ensuring the achievement of optimal model performance. Finally, the testing dataset is used to assess the performance of a developed model.

Given that the study uses a medium-sized dataset, partitioning the data decreases the number of samples available for training, which makes the results susceptible to the randomisation of train, test, and validation sets. Therefore, implementing cross-validation becomes crucial to ensure the robustness and generalizability of the developed models (Kaliappan et al., 2023). Cross-validation systematically divides data into multiple training and testing subsets to reduce performance variance arising from bias in data selection and to maximize the use of available data for both training and evaluation. Among various procedures, K-fold cross-validation, where the dataset is divided into k equal-sized folds, with each fold serving once as a validation set and the remaining folds as a training set, is widely used for model selection (Lyashenko and Jha, 2022).

Subsequently, for each model, the following configuration was applied to perform K-fold cross-validation: `kf = KFold(n_splits=5, shuffle=True, random_state=42)`. It implies that the entire dataset was divided into five equal folds. The models were trained five times, each time using four of the folds for training and one fold for testing. Prior to the partition, the data was randomized to ensure the folds were representative of the overall data distribution. Lastly, to ensure a consistent basis for comparison across models, a `random_state` parameter was specified, so the dataset underwent an identical random split each time.
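A minimal sketch of the five-fold procedure, using the `KFold` configuration quoted above. The synthetic data, the choice of a decision tree regressor, and MAE as the per-fold score are assumptions for illustration and do not reproduce the study's dataset or pipeline.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (assumption; stands in for the study's dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([3.0, -2.0, 1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Configuration quoted in the text.
kf = KFold(n_splits=5, shuffle=True, random_state=42)

fold_mae, fold_sizes = [], []
for train_idx, test_idx in kf.split(X):
    # Train on four folds, evaluate on the held-out fold.
    model = DecisionTreeRegressor(min_samples_leaf=2, min_samples_split=8,
                                  random_state=42)
    model.fit(X[train_idx], y[train_idx])
    fold_mae.append(mean_absolute_error(y[test_idx], model.predict(X[test_idx])))
    fold_sizes.append(len(test_idx))

mean_mae = float(np.mean(fold_mae))
print(fold_sizes, round(mean_mae, 3))
```

Because `random_state` is fixed, re-running the loop with a different estimator reuses exactly the same folds, which is what makes the per-model scores comparable.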
Hyperparameter tuning is another crucial process to find the optimal set of hyperparameters, such as learning rate, batch size, number of layers in neural networks, regularization parameters, number of trees, depth of trees, etc., that yield the best performance for an ML model (Ilemobayo et al., 2024). Hence, the hyperparameters chosen for each model are reported below, with selection performed using random search.

Firstly, the decision tree (DT) regressor was configured as follows: `max_depth=None`, `max_features=None`, `min_samples_leaf=2`, and `min_samples_split=8`. Here, `min_samples_split` defines the minimum number of samples required to split an internal node, whereas `min_samples_leaf` indicates the minimum number of samples that a terminal node must contain post-split.

For the Random Forest regressor, the selected hyperparameters were `max_depth=20`, `min_samples_leaf=2`, `min_samples_split=8`, and `n_estimators=400`. The `n_estimators` parameter specifies the number of individual decision trees within the forest. Its value is selected to strike a balance between capturing data complexity and preventing overfitting due to excessive model complexity. The `max_depth` parameter limits the maximum depth of each tree, i.e., the longest path from the root to a leaf, to constrain complexity.

For the XGB regressor, optimal hyperparameters were identified as follows: `n_estimators=200`, `learning_rate=0.1`, `max_depth=4`, `subsample=0.8`, and `colsample_bytree=0.8`. The learning rate is the step-size shrinkage applied when updating weights and is one of the key parameters in controlling overfitting. `subsample` and `colsample_bytree` denote the proportions of samples and features, respectively, used during the construction of each tree (Dean, 2020), where the latter helps in preventing dominant features from skewing the model output.
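The random search mentioned above can be sketched with scikit-learn's `RandomizedSearchCV`. The candidate grid, the number of sampled configurations (`n_iter=10`), the scoring metric, and the synthetic data are all assumptions for illustration; the study does not report its search space or budget.

```python
import numpy as np
from sklearn.model_selection import KFold, RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption; not the study's dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.2, size=120)

# Hypothetical candidate values; the reported DT settings are among them.
param_space = {
    "max_depth": [None, 5, 10, 20],
    "min_samples_leaf": [1, 2, 4],
    "min_samples_split": [2, 4, 8],
}

search = RandomizedSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_distributions=param_space,
    n_iter=10,                          # sample 10 of the 36 combinations
    scoring="neg_mean_absolute_error",  # higher (less negative) is better
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    random_state=42,
)
search.fit(X, y)
best = search.best_params_
print(best)
```

Random search evaluates only a sample of the grid, which keeps the cost manageable when several models each have multiple hyperparameters.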
For the LGBM Regressor, the parameters were set to `num_leaves=31`, `learning_rate=0.01`, and `n_estimators=500`, and for the CB Regressor to `iterations=1000`, `learning_rate=0.03`, and `depth=4`. Lastly, for the ensemble models, optimised base learners, namely Random Forest, XGBoost, and Decision Trees, were combined to leverage their complementary strengths, a design shown to improve predictive accuracy (Sagi and Rokach, 2018).

Finally, the ML models were evaluated based on the following metrics:

Error percentage:

$$\text{Error percentage}=\frac{\sum_{i=1}^{N}\left|\text{Observed output}_i-\text{Predicted output}_i\right|}{N}\times 100$$

Root Mean Square Error (RMSE):

$$\text{RMSE}=\sqrt{\frac{\sum_{i=1}^{N}\left(\text{Predicted output}_i-\text{Observed output}_i\right)^2}{N}}$$

Mean Absolute Error (MAE):

$$\text{MAE}=\frac{\sum_{i=1}^{N}\left|\text{Observed output}_i-\text{Predicted output}_i\right|}{N}$$

R² (Coefficient of multiple determination), where Mean denotes the mean of the observed output:

$$R^2=1-\frac{\sum_{i=1}^{N}\left(\text{Predicted output}_i-\text{Observed output}_i\right)^2}{\sum_{i=1}^{N}\left(\text{Observed output}_i-\text{Mean}\right)^2}$$

In addition, several diagnostic visualisations, such as scatter plots comparing actual and predicted values, error distribution plots, residual box plots, and a radar chart, were used to qualitatively assess the models’ fitting, consistency, and robustness. A feature importance plot was also generated to facilitate the interpretation of the model's behaviour and to identify the top predictive variables.

3. RESULTS AND DISCUSSION

The study employed seven Machine Learning (ML) models, wherein each model was trained and evaluated using an 80:20 data split. Hyperparameters such as maximum depth, number of estimators, learning rate, and feature subsampling ratios were fine-tuned iteratively. Additionally, a five-fold cross-validation was conducted, and L2 regularisation was applied during training to ensure the models' robustness and to improve their generalization ability. Subsequently, their performance was assessed based on the metrics generated from the test dataset. As the study aimed to conduct a comparative analysis to identify the best-performing model, a `random_state` value of 42 was assigned to each model to ensure identical data splits. The error metrics of the ML models are presented in Table 3.

Table 3 Error metrics of the models

| Model | MAE | R² score | RMSE | Error percentage |
|---|---|---|---|---|
| DT | 2062.54 | 0.89 | 3288.03 | 19.65% |
| RF | 1829.21 | 0.90 | 3100.69 | 16.61% |
| XGB | 1819.39 | 0.89 | 3191.47 | 14.57% |
| CB | 1717.74 | 0.90 | 3111.83 | 16.17% |
| LightGBM | 1989.6 | 0.88 | 3365.92 | 17.87% |
| ET-1 | 1723.04 | 0.91 | 3009.55 | 13.93% |
| ET-2 | 1683.67 | 0.91 | 2929.79 | 13.63% |
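As a worked illustration of three of the reported metrics, the sketch below computes MAE, RMSE, and R² in plain Python, using the standard definition of R² (residual sum of squares over the total sum of squares around the observed mean). The four observed and predicted values are made-up numbers chosen so the arithmetic is easy to follow; they are not taken from the study.

```python
import math


def mae(obs, pred):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)


def rmse(obs, pred):
    """Root mean square error."""
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(obs, pred)) / len(obs))


def r2(obs, pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1 - ss_res / ss_tot


# Hypothetical values for illustration only.
observed = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 320.0, 380.0]

print(mae(observed, predicted))   # absolute errors 10, 10, 20, 20 -> mean 15
print(rmse(observed, predicted))  # sqrt((100 + 100 + 400 + 400) / 4) = sqrt(250)
print(r2(observed, predicted))    # 1 - 1000 / 50000 = 0.98
```

RMSE exceeds MAE here (about 15.81 versus 15) because squaring weights the two larger errors more heavily, which is why the two metrics can rank models differently.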
It was found that Ensemble Technique 2 demonstrated the best performance, achieving the lowest Mean Absolute Error (MAE) of 1683.67, the lowest Root Mean Square Error (RMSE) of 2929.79, the lowest error percentage of 13.63%, and the highest coefficient of determination (R²) value of 0.91. Ensemble Technique 1 followed close behind with an MAE of 1723.04, an RMSE of 3009.55, an error percentage of 13.93%, and an R² of 0.91. The ensemble models, owing to their augmented capability to manage the bias-variance tradeoff (Ranglani, 2024), yield improved performance. Next in line was the CatBoost model, which achieved an MAE of 1717.74, an RMSE of 3111.83, and an R² of 0.90. However, the XGBoost model outperformed it in terms of error percentage. Since the error percentage metric is quite sensitive to the presence of large outliers, CB was considered to perform better than XGB.

The Decision Tree (DT) model yielded the highest MAE of 2062.54 and the highest error percentage of 19.65%, along with an RMSE of 3288.03, which only LightGBM exceeded (3365.92; Table 3). This could be attributed to the inherent simplicity of DT, wherein it operates with nearly no assumptions regarding the target function, guiding the hypothesis solely based on the training data. However, this characteristic causes DT to capture superfluous patterns, rendering it highly susceptible to variance in data and resulting in a relatively weak performance. The results underscore the capability of ensemble techniques to attain higher prediction accuracy and lower variance than the individual learners they integrate, reinforcing their foundational learning principle that a combination of diverse models often outperforms an individual model in isolation (Sharma et al., 2018).
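To make the ensemble idea concrete, the sketch below combines three tree-based regressors by averaging their predictions. This is an assumption: the study does not state how the base learners' outputs were combined, and averaging via scikit-learn's `VotingRegressor` is only one common option. `GradientBoostingRegressor` is used as a stand-in for XGBoost to avoid an extra dependency, the forest is kept small (`n_estimators=50`) for speed, and the data are synthetic.

```python
import numpy as np
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              VotingRegressor)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption; not the study's dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.3, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Base learners loosely mirroring the ET-2 composition (DT + RF + boosted trees).
base = [
    ("dt", DecisionTreeRegressor(min_samples_leaf=2, min_samples_split=8,
                                 random_state=42)),
    ("rf", RandomForestRegressor(n_estimators=50, max_depth=20, random_state=42)),
    ("gb", GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                                     max_depth=4, subsample=0.8, random_state=42)),
]

ensemble = VotingRegressor(estimators=base).fit(X_tr, y_tr)
ens_pred = ensemble.predict(X_te)

# The ensemble prediction is the unweighted mean of the fitted base learners.
manual_avg = np.mean([est.predict(X_te) for est in ensemble.estimators_], axis=0)
print(np.allclose(ens_pred, manual_avg))
```

Averaging tends to cancel errors that the base learners do not share, which is the variance-reduction effect the discussion above attributes to ensembles.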
Notably, this performance was achieved across a wide range of target values, spanning from a minimum of 11.0 to a maximum of 33,873.61, highlighting the models’ ability to capture a wide range of data patterns.

Next, error distribution plots were visualized using histograms overlaid with Kernel Density Estimation (KDE) curves, in Fig. 2, to provide a statistical overview of the models’ residuals. An ideal model is expected to yield a symmetric error distribution, conforming to a bell-shaped curve centred at zero, as this characteristic signifies minimal bias in predictions and uniform performance across the dataset.

Figure 2: Error distribution plots of the models

From Fig. 2, it was observed that amongst all models, DT exhibited a relatively symmetric distribution around zero, but with notable irregularities. The error range extended from approximately −4,000 to +8,000, indicating fluctuations between under-prediction and over-prediction contingent on data characteristics. The widespread errors observed in both directions indicate that the model is capturing extraneous noise rather than identifying fundamental patterns within certain regions of the feature space. This can be ascribed to the susceptibility of decision trees to overfitting when they grow excessively deep, i.e., the model constructs highly specific decision pathways that may attain a good fit for the training data yet fail to generalize effectively to previously unseen data. In contrast, the remaining models demonstrated right-skewed distributions with peaks close to zero, suggesting a common tendency to overestimate higher target values. An analysis of the kernel density estimate, represented by the pink curve, revealed that the ensemble techniques, particularly ET-2, showed a steady decline, implying that the model produced large prediction errors less frequently and thereby provided a more robust performance.

To further aid the visual error analysis, box plots were generated and are presented in Fig. 3.
These are instrumental in assessing model performance, as each plot encapsulates key summary statistics of the residuals: the median, the interquartile range (IQR), and the overall spread, including outliers. First, the IQR, depicted by the height of a box, shows the spread of the middle 50% of the residuals. Second, the median, represented by the horizontal line within each box, when situated on the zero line, indicates that the model does not systematically overestimate or underestimate predictions. Then, the whiskers, extending from the box edges to the most extreme values lying within 1.5 times the IQR of those edges, represent the spread of the majority of residuals, excluding the extreme values. Lastly, the outliers, marked as red dots beyond the whiskers, represent prediction errors that deviate significantly from the typical range.

Figure 3: Residual box plots of the models

From Fig. 3, it was observed that the CB model exhibited the smallest interquartile range, followed by ET-1, ET-2, and XGB, all of which demonstrated comparable box heights. This signifies that these models yielded more consistent predictions, with minimal variation in error. Additionally, these models had the shortest whiskers, reinforcing their strong predictive stability and accuracy. Interestingly, the DT model exhibited no apparent outliers. However, the absence of plotted red dots does not necessarily imply superior performance. A data point is drawn as an outlier only when it lies more than 1.5 times the IQR above the third quartile or below the first quartile. Thus, in this case, the absence of outliers reflects the model’s inherently high variance, as evidenced by the widest interquartile range and the longest whiskers. Overall, the presence of outliers in all the other models indicated that, despite improvements in central tendency, extreme prediction errors were common across the ML algorithms for the given dataset.
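The 1.5 × IQR rule explains why a model with widely spread residuals can show no plotted outliers. The sketch below applies the rule to two made-up residual sets that share the same extreme value; the numbers and the helper name `box_stats` are assumptions for illustration only.

```python
import statistics


def box_stats(residuals):
    """Quartiles, IQR, whisker fences, and outliers under the 1.5 * IQR rule."""
    q1, median, q3 = statistics.quantiles(residuals, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [r for r in residuals if r < low_fence or r > high_fence]
    return {"median": median, "iqr": iqr,
            "fences": (low_fence, high_fence), "outliers": outliers}


# Hypothetical residuals: both sets contain the same extreme error of 30.
tight = [-2.0, -1.0, 0.0, 1.0, 2.0, 30.0]      # consistent model, one large miss
wide = [-20.0, -10.0, 0.0, 10.0, 20.0, 30.0]   # high-variance model

tight_stats = box_stats(tight)
wide_stats = box_stats(wide)
print(tight_stats)  # IQR 2.5, fences (-4.5, 5.5): 30 is flagged as an outlier
print(wide_stats)   # IQR 25.0, fences (-45.0, 55.0): 30 sits inside the whiskers
```

The same residual of 30 is an outlier for the tight model but not for the wide one, because the fences scale with the IQR.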
Regarding the median line, it was found positioned closest to the zero line for the DT model. This suggests that its residuals are, on average, centred around zero, i.e., the model has low bias. The above observations align with the theoretical understanding that Decision Trees are low-bias but high-variance learners (Ibrahim, 2022). The remaining models were found to have their medians slightly above zero, with a more pronounced deviation observed in the case of LightGBM.

Figure 4: Radar Chart

Thereafter, a radar chart was plotted (Fig. 4). Each axis represents one of the three metrics, R², MAE, and RMSE, with models projected along it according to their scaled performance. A greater distance from the centre on the R² axis indicates higher predictive accuracy. To make the axes visually comparable, the error metrics were inverted during normalization. Hence, movement toward the outer edge on the MAE and RMSE axes reflects lower error. Subsequently, for each model, its normalized values on the three axes were connected to form a polygon. A larger and more symmetric polygon denotes that the model performs well across all metrics.

From Fig. 4, ET-2 was noted to have the most expansive and symmetric polygon on the radar plot, reflecting strong and consistent performance across R², MAE, and RMSE. ET-1 closely followed, demonstrating a high degree of generalization. CB, XGB, and RF formed moderately large polygons, indicative of stable predictive accuracy, though marginally less than the ensembles. In contrast, LightGBM and DT yielded notably contracted polygons, denoting lower R² and higher error metrics, and thus inferior overall performance. A holistic assessment of polygonal areas and symmetry, which provides an insight into each model’s balance and efficacy, revealed Ensemble Technique 2 as the most reliable model across all evaluated criteria.
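The scaling behind such a radar chart can be sketched from the Table 3 values. Min-max scaling is an assumption here: the text states only that the error metrics were inverted during normalization, not which scheme was used, so the resulting numbers need not match Fig. 4.

```python
# (MAE, R2, RMSE) per model, copied from Table 3.
table3 = {
    "DT": (2062.54, 0.89, 3288.03),
    "RF": (1829.21, 0.90, 3100.69),
    "XGB": (1819.39, 0.89, 3191.47),
    "CB": (1717.74, 0.90, 3111.83),
    "LightGBM": (1989.6, 0.88, 3365.92),
    "ET-1": (1723.04, 0.91, 3009.55),
    "ET-2": (1683.67, 0.91, 2929.79),
}


def min_max(values, invert=False):
    """Scale values to [0, 1]; with invert=True the smallest value maps to 1."""
    lo, hi = min(values), max(values)
    return [((hi - v) if invert else (v - lo)) / (hi - lo) for v in values]


models = list(table3)
mae_scaled = min_max([table3[m][0] for m in models], invert=True)   # lower is better
r2_scaled = min_max([table3[m][1] for m in models])                 # higher is better
rmse_scaled = min_max([table3[m][2] for m in models], invert=True)  # lower is better

# One point per axis for each model; larger means better on every axis.
radar = {m: {"R2": r, "MAE": a, "RMSE": e}
         for m, r, a, e in zip(models, r2_scaled, mae_scaled, rmse_scaled)}

for name, axes in radar.items():
    print(name, {k: round(v, 3) for k, v in axes.items()})
```

Under this scaling ET-2 reaches the outer edge on all three axes, while DT and LightGBM sit near the centre, in line with the polygon sizes described above.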
However, comprehending the rationale behind a model's predictions is often as critical as achieving high accuracy. Despite their robust performance, complex models such as ensemble techniques pose the significant challenge of opacity, highlighting a trade-off between accuracy and interpretability (Lundberg and Lee, 2017). To address this issue, a feature importance plot, presented in Fig. 5, was generated using the XGB model. The F-score for a feature, in tree-based models, reflects how often the feature is used to split the data across all trees in the ensemble; in XGBoost, this split count is the default 'weight' importance, whereas the improvement in model performance due to the splits is reported separately as 'gain'. Hence, this metric helps quantify the relative contribution of individual features to model predictions and serves as a tool for model interpretability. Nevertheless, interpretations should account for the limitation that such plots primarily reflect overall influence rather than feature interactions or individual prediction effects, i.e., they capture a global perspective rather than a local one. Additionally, the high importance of a feature indicates an association within the model and not necessarily causation.

Figure 5: Feature Importance Plot

From Fig. 5, it was noted that Production year, denoted as year_encoded, had the highest F-score of 220, followed closely by Artificial inseminations conducted (AI), with a score of 200. This suggests that these factors exert a more significant impact on the model's predictions. Production year emerged as the top feature due to an almost consistent pattern of year-on-year growth in milk production across all states. The high importance of AI as a predictive feature aligns with the available literature on its significant impact on livestock productivity, health, and offspring quality.
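To make the notion of a split-count importance concrete, the following minimal Python sketch counts how often each feature is used as a split across a small ensemble. It is only an illustration: the trees are hand-written nested dictionaries, not output from the study's XGB model, the helper names `count_splits` and `split_count_importance` are hypothetical, and the feature labels are borrowed from Fig. 5 purely as placeholders.

```python
from collections import Counter


def count_splits(node, counts):
    """Recursively tally the feature used at every internal node of one tree."""
    if "leaf" in node:
        return
    counts[node["feature"]] += 1
    count_splits(node["left"], counts)
    count_splits(node["right"], counts)


def split_count_importance(trees):
    """Return (feature, number of splits) pairs, most frequently used first."""
    counts = Counter()
    for tree in trees:
        count_splits(tree, counts)
    return counts.most_common()


def leaf(value):
    """Build a terminal node holding a predicted value."""
    return {"leaf": value}


# Hypothetical three-tree ensemble; structure and leaf values are invented.
trees = [
    {"feature": "year_encoded",
     "left": {"feature": "AI", "left": leaf(1.0), "right": leaf(2.0)},
     "right": leaf(3.0)},
    {"feature": "AI",
     "left": {"feature": "year_encoded", "left": leaf(0.5), "right": leaf(1.5)},
     "right": {"feature": "R", "left": leaf(2.5), "right": leaf(3.5)}},
    {"feature": "year_encoded", "left": leaf(0.2), "right": leaf(0.8)},
]

ranking = split_count_importance(trees)
print(ranking)
```

A count of this kind says how often a feature was chosen, not how much each split reduced the error, which is why it is best read alongside gain-based or local explanation methods.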
Next in line were ‘Ratio of gross irrigated area to total cropped area (R)’, ‘Area under cereals and millets (A_cm)’, ‘Average yield per in-milk crossbred and exotic cows (Y_cb)’, ‘Producer prices (PP)’, and ‘Number of in-milk crossbred and exotic cows (N_cb)’, with F-scores above 150. Among the variables denoting fodder availability, the pre-eminence of irrigation over land-area variables emphasizes the importance of the former in enhancing both the yield and stability of fodder supply. This highlights that while expanding the area under crops and grazing lands is required to meet the fodder demand of the growing livestock population, the synergies of crop-livestock systems are most effective with assured irrigation, a finding similar to Mynavathi et al. (2019). Further, the importance of both the count and yield of crossbred cows, i.e., N_cb and Y_cb, reinforces the growing status of these animals in the country’s milking herd (NAAS, 2016). Remunerative prices to producers emerged as another crucial variable, underscoring their importance in augmenting dairy activity (Kumar et al., 2011).

4. CONCLUSION

India is the world’s largest producer and consumer of milk, with the dairy sector directly employing over eighty million people, mainly comprising women and small and marginal farmers. Within the country’s socioeconomic setting, livestock rearing and milk production are found to be intertwined with poverty alleviation, income diversification, higher welfare of rural households, improved resilience to economic vulnerability, and women’s empowerment. Thus, dairying is a vital component of India’s agricultural sector and is crucial to achieving SDGs 1, 2, and 5. Accurate early prediction of milk production in the country would support data-driven policymaking to ensure food security and supply chain optimization.
Owing to the multi-factorial dependency and regional disparity in production, the study aimed to develop robust models, utilizing Machine Learning (ML) techniques, to predict milk production at the state level. To this end, time-series data on fifteen variables, for the period from 2000-01 to 2022-23, across twenty-seven states were compiled. Seven Machine Learning techniques, namely Decision Tree (DT), Random Forest (RF), eXtreme Gradient Boosting (XGB), Light Gradient Boosting Machine (LightGBM), CatBoost (CB), and two Ensemble models, were employed. Ensemble Technique 1 (ET-1) combined RF and XGB, while Ensemble Technique 2 (ET-2) combined DT, RF, and XGB. Results revealed that Ensemble methods produced the best results, particularly ET-2. It achieved the lowest MAE, RMSE, and error percentage along with the highest R² of 0.91. The ET-1 model followed close behind. A qualitative error assessment using error distribution plots, residual box plots, and a radar chart reaffirmed that the ET-2 model yielded the most reliable and robust performance. In contrast, DT performed the weakest. A feature importance plot, generated to comprehend the relative contribution of individual features to model predictions, showed Production year, Artificial inseminations conducted, Ratio of gross irrigated area to total cropped area, Area under cereals and millets, Producer prices, and the Average yield and Number of in-milk crossbred and exotic cows as the top features contributing to milk production. Thus, the findings demonstrate the robustness of ML techniques in context-specific modelling and point to genetic improvement, provision of remunerative producer prices, and assured irrigation as strategic priorities to boost milk production in the country.

5. LIMITATIONS AND FUTURE DIRECTIONS

The authors endeavoured to incorporate as comprehensive a dataset as possible; however, the study is constrained by its reliance on secondary data sources.
Consequently, a few crucial variables, such as outbreaks of livestock diseases, data on dairy cooperative societies, and socio-economic characteristics of dairy households, could not be included despite their established significance in influencing milk production within the Indian context. Future studies may consider including these variables to further enrich the policy implications of their work. Furthermore, a comparative evaluation of machine learning models against traditional statistical forecasting models or hybrid approaches documented in the literature, using a similarly extensive and diverse dataset for large-scale milk prediction, can be attempted in future studies to enhance the methodological rigour in the field.

Declarations

6. ETHICAL APPROVAL AND ACCORDANCE

Not applicable, as the study is based solely on secondary data available free of cost in the public domain. The authors assure that the research work presented is original, and the manuscript is not under consideration for publication by any other journal. No part of the work presented in the manuscript has been published earlier. The final manuscript was read and approved by all authors, who have collectively consented to submit it to this journal.

Third Party Material: All of the material is owned by the authors and/or no permissions are required.

Consent to Participate: Not applicable.

7. CONSENT TO PUBLISH

Not applicable.

8. OTHER DECLARATIONS

Acknowledgement: The work presented in this manuscript is an extension of research that received a Research Award from the National Bank for Agriculture and Rural Development (NABARD), Government of India. The authors are deeply grateful for the recognition.

Funding statement: The work received no funding.

Competing Interests: The authors declare no potential conflicts of interest. The authors have no financial or proprietary interests in any material discussed in the manuscript.
Data availability: The data compiled and analyzed for the current study are available in open data repositories hosted on websites of government agencies: Archive Land Use Statistics – At a Glance | Official website of Directorate of Economics and Statistics, Department of Agriculture and Farmers Welfare, Ministry of Agriculture and Farmers Welfare, Government of India; https://dahd.gov.in/schemes/programmes/animal-husbandry-statistics; Reserve Bank of India - Handbook of Statistics on Indian Economy; agriculture value-output-milk Statistics and Growth Figures Year-wise of india – Indiastat; https://www.indiastat.com/data/agriculture/veterinary-institutions-animal-health-services/data-year/all

References

Ilemobayo A, Durodola J, Alade O, Awotunde OJ, Olanrewaju OT, Falana A, Edu O…E, O. Hyperparameter tuning in machine learning: A comprehensive review. J Eng Res Rep. 2024;26(6):388–95. https://doi.org/10.9734/jerr/2024/v26i61188.
Bhardwaj S, Tarafdar A, Baghel M, Dutt T, Gaur GK. Determining point of economic cattle milk production through machine learning and evolutionary algorithm for enhancing food security. J Food Qual. 2023;2023(1):7568139.
Birthal PS, Negi DS. Livestock for higher, sustainable and inclusive agricultural growth. Econo Political Wkly. 2012;47:89–99.
Blockeel H, Devos L, Frénay B, Nanfack G, Nijssen S. Decision trees: from efficient prediction to responsible AI. Front Artif Intell. 2023;6:1124553.
Bogatinovski J, Todorovski L, Džeroski S, Kocev D. Comprehensive comparative study of multi-label classification methods. Expert Syst Appl. 2022;203:117215.
Chand R, Raju SS. Livestock sector composition and factors affecting its growth. Indian J Agric Econ. 2008;63(2):198–210.
Cwalina K, Borusiewicz A, Ferrari M, Herrmann IT, Priekulis J. Factors influencing the development of milk production in agricultural holdings. Agricultural Eng. 2020;24(4):23–34. https://doi.org/10.1515/agriceng-2020-0033.
Dean N. (2020).
Subsampling Big Data for Tree-Based Learning Algorithms (Doctoral dissertation, North Carolina Agricultural and Technical State University).
Delgado C, Rosegrant M, Steinfeld H, Ehul S, Courbis C. Livestock to 2020: the next food revolution. Outlook Agric. 2001;30(1):27–9.
Deshmukh SS, Paramasivam R. Forecasting of milk production in India with ARIMA and VAR time series models. Asian J Dairy Food Res. 2016;35(1):17–22. 10.18805/ajdfr.v35i1.9246.
Devi Monika RH, Umme, Weerasinghe WPMCN, Pradeep M, Shiwani T, Karakaya Kadir. Future Milk Production Prospects in India for Various Animal Species using Time Series Models. Indian J Anim Res. 2022;56(9):1170–5. 10.18805/IJAR.B-4409.
Di Franco G, Santurro M. Machine learning, artificial neural networks and social research. Qual Quantity. 2021;55(3):1007–25.
FAO. Livestock's Long Shadow. Rome: Food and Agriculture Organisation; 2006.
Fawagreh K, Gaber MM, Elyan E. Random forests: from early developments to recent advancements. Syst Sci Control Engineering: Open Access J. 2014;2(1):602–9. 10.1080/21642583.2014.956265.
Food and Agriculture Organization of the United Nations. (2018). Dairy Development's Impact on Poverty Reduction. FAO. 4 p. https://openknowledge.fao.org/handle/20.500.14283/ca2185en
Food and Agriculture Organization of the United Nations. (2021). World Food and Agriculture-Statistical Yearbook 2021. FAO. 368 p. https://doi.org/10.4060/cb4477en
Gaillard C, Dervillé M. Dairy farming, cooperatives and livelihoods: lessons learned from six Indian villages. J Asian Econ. 2022;78:101422.
Gateway to dairy production and products. FAO. https://www.fao.org/dairy-production-products/production/en
Gerosa & Skoet. (2012). Milk availability: trends in production and demand and medium-term outlook. ESA Working paper 12-01. FAO. https://www.fao.org/agrifood-economics/publications/detail/en/c/163733/
Goswami A, Rajan A, Verma S, Shah T.
Irrigation and India's crop-milk agrarian economy: a simple recursive model and some early results. IWMI-Tata Water Policy Res Highlight. 2017;2:1–12.
Gupta SK, Purohit A. Growth of productivity in dairy industry: an inter-state analysis. Int J Educ Adm. 2010;2(4):685–9.
Gür YE. Innovation in the dairy industry: forecasting cow cheese production with machine learning and deep learning models. Int J Agric Environ Food Sci. 2024;8(2):327–46. https://doi.org/10.31015/jaefs.2024.2.9.
Hancock JT, Khoshgoftaar TM. Survey on categorical data for neural networks. J Big Data. 2020;7(1):1–41. 10.1186/s40537-020-00305-w.
https://apeda.gov.in/ Three years export summary statement (2022).
https:// / Attaché Reports (GAIN) – Dairy and Products Annual (2020).
https:// - Dairy Industry In India: Growth, FDI, Companies, Exports (2022).
Ibrahim M. Evolution of Random Forest from Decision Tree and Bagging: A Bias-Variance Perspective. Dhaka Univ J Appl Sci Eng. 2022;7(1):66–71.
Jaiswal S. Interstate Analysis of Productive Performance in India's Dairy Manufacturing Sector Using Stochastic Frontier Models. Q Literature Magazine- Naagfani. 2022;12(42):370–7.
Jirli B, Kumar S, Pachava V, Golla SK. (2025). Forecasting milk production in India: Strategic insights for policymakers and farmers. Indian Journal of Extension Education, 61(2), 14–18.
Kale, R. B., Ponnusamy, K., Sendhil, R., Maiti, S., Chandel, B. S., Jha, S. K., … Lal, S. P. (2019). Determinants of inequality in dairy development of India. National Academy Science Letters, 42, 195–198.
Kale RB, Ponnusamy K, Chakravarty AK, Sendhil R, Mohammad A. Assessing resource and infrastructure disparities to strengthen Indian dairy sector. Indian J Anim Sci. 2016;86(6):720–5.
Kaliappan J, Bagepalli AR, Almal S, Mishra R, Hu YC, Srinivasan K. Impact of cross-validation on machine learning models for early detection of intrauterine fetal demise. Diagnostics. 2023;13(10):1692.
Khan F, Tomar A. (2022).
Livestock Feed and Fodder Resources of India and Strategies for their Judicious Utilization: A Review. In Biodiversity in the Service of Mankind. Walnut Publication. 10.6084/m9.figshare.20003078.v1
Kim M, Kang SR, Na MH. Prediction of Milk Production in Dairy Cows Using Statistical Regression Model and Machine Learning Methods. Korean Data Anal Soc. 2024. https://doi.org/10.37727/jkdas.2024.26.6.1855.
Kishore A, Birthal PS, Joshi PK, Shah T, Saini A. Patterns and drivers of dairy development in India: insights from analysis of household and district-level data. Agricultural Economic Res Rev. 2016;29(1):1–14. https://doi.org/10.5958/0974-0279.2016.00014.8.
Kulkarni CS. Advancing Gradient Boosting: A Comprehensive Evaluation of the CatBoost Algorithm for Predictive Modelling. J Artif Intell Mach Learn Data Sci. 2022;1(5):54–7.
Kumar A, Parappurathu S, Joshi PK. Structural Transformation in Dairy Sector of India. Agricultural Econ Res Rev. 2013;26(2):209–20.
Kumar A, Staal SJ, Singh DK. Smallholder dairy farmers’ access to modern milk marketing chains in India. Agricultural Econ Res Rev. 2011;24(2):243–54.
Kumar S, Singh P, Devi U, Yathish KR, Saujanya PL, Kumar R, Mahanta SK. An overview of the current fodder scenario and the potential for improving fodder productivity through genetic interventions in India. Anim Nutr Feed Technol. 2023;23(3):631–44.
Kumar V, Chakravarty AK, Magotra A, Patil CS, Shivahre PR. Comparative study of ANN and conventional methods in forecasting first lactation milk yield in Murrah buffalo. Indian J Anim Sci. 2019;89(11):1262–8.
Lundberg SM, Lee SI. (2017). A unified approach to interpreting model predictions. Advances in neural information processing systems, 30. https://doi.org/10.48550/arXiv.1705.07874
Lyashenko V, Jha A. (2022). Cross-validation in machine learning: how to do it right. Online, https://neptune.ai/blog/cross-validation-in-machine-learning-how-to-do-it-right, 24.
Lyngkhoi DR, Singh SB, Singh R, Tyngkan H.
Trend analysis of milk production in India. Asian J Dairy Food Res. 2022;41(2):183–7. 10.18805/ajdfr.DR-1789.
Ministry of Agriculture and Farmers’ Welfare. (2018). National Action Plan for Dairy Development VISION-2022. Government of India. 68 p.
Mishra P. (2020). Time Series Investigation of Milk Production in Major States of India Using ARIMA Modelling. J Anim Res, 10.
Mynavathi VS, Murugeswari R, Kumar VRS, Gopi H, Valli C, Babu M. Impact of Irrigation Methods on Soil, Water, and Nutrient Use Efficiency of Integrated Cropping: Livestock Production Systems. Management Strategies for Water Use Efficiency and Micro Irrigated Crops. Apple Academic; 2019. pp. 35–41.
National Academy of Agricultural Sciences. 2016. Breeding Policy for Cattle and Buffalo in India. https://naas.org.in/Policy%20Papers/policy%2082.pdf
National Academy of Agricultural Sciences. 2020. Livestock Improvement through Artificial Insemination. https://naas.org.in/Policy%20Papers/policy%2096.pdf
NITI Aayog. Agriculture, Animal Husbandry and Fisheries Sector Report. 274p ed. Government of India; 2021.
Ozella L, Rebuli B, Forte K, C., Giacobini M. (2023). A literature review of modeling approaches applied to data collected in automatic milking systems. Animals, 13(12), 1916.
Paul RK, Alam W, Paul AK. Prospects of livestock and dairy production in India under time series framework. Indian J Anim Sci. 2014;84(4):462–6.
Perumal A, Mazumdar C, Srinatha TN, Rath S, Likhitha S. Mitigating Climate Impact: A Machine Learning Approach to Forecast Methane Emissions from Indian Livestock. Indian J Agric Econ. 2024;79(3):610–20.
Petare PA. (2013, August 11). Issues and challenges of supply chain management with perspective to Indian dairy industry [Paper Presentation]. International Conference on Issues and Challenges in Current Global Economy: It’s Impact on Commerce, Engineering and Technology, Pune, India.
Potdar K, Pardawala TS, Pai CD.
A comparative study of categorical variable encoding techniques for neural network classifiers. Int J Comput Appl. 2017;175(4):7–9.
Prajapati B, Prajapati J, Kumar K, Srivastava A. Determination of the relationships between quality parameters and yields of fodder obtained from intercropping systems by correlation analysis. Forage Res. 2019;45(3):219–24.
Rajeshwaran S, Naik G, Dhas RAC. (2014). Rising milk price – A cause for concern on food security. Working Paper 472. Bangalore: Indian Institute of Management.
Ranglani H. (2024). Empirical Analysis of the Bias-Variance Tradeoff Across Machine Learning Models. Mach Learn Applications: Int J (MLAIJ), Vol. 11.
Rao MK. Factors Affecting Milk Production: A Case Study in Andhra Pradesh. J Rural Dev. 2017;36(1):21–32.
Sagi O, Rokach L. (2018). Ensemble learning: A survey. Wiley interdisciplinary reviews: data Min Knowl discovery, 8(4), e1249.
Saha A, Bhattacharyya S. Artificial insemination for milk production in India: a statistical insight. Indian J Anim Sci. 2020;90(8):1186–90.
Sarkar A, Gupta H, Dutta A. (2022). In search of the determinants of dairy production in an emerging market: a panel data approach.
Sekhar CSC, Roy D, Bhatt Y. 2017. Food inflation and food price volatility in India: Trends and determinants. IFPRI Discussion Paper 1640. Washington, DC: International Food Policy Research Institute (IFPRI). https://hdl.handle.net/10568/147644
Shankar SV, Ajaykumar R, Ananthakrishnan S, Aravinthkumar A, Harishankar K, Sakthiselvi T, Navinkumar C. Modeling and forecasting of milk production in the western zone of Tamil Nadu. Asian J Dairy Food Res. 2023;42(3):427–32.
Sharma DK, Garg A, Kumar A. Ensemble learning in machine learning: Integrating multiple models for improved predictions. Int J Appl Res. 2018;4(7):61–5. 10.22271/allresearch.2018.v4.i7a.11443.
Soomro AA, Mokhtar AA, Hussin HB, Lashari N, Oladosu TL, Jameel SM, Inayat M.
Analysis of machine learning models and data sources to forecast burst pressure of petroleum corroded pipelines: A comprehensive review. Eng Fail Anal. 2024;155:107747. https://doi.org/10.1016/j.engfailanal.2023.107747.
Subbanna YB, Kumar S, Puttaraju SKM. Forecasting buffalo milk production in India: Time series approach. Buffalo Bull. 2021;40(2):335–43. https://kuojs.lib.ku.ac.th/index.php/BufBu/article/view/3993.
Suseendran G, Duraisamy B. (2021). Prediction of Dairy Milk Production Using Machine Learning Techniques. Intelligent Computing and Innovation on Data Science. https://doi.org/10.1007/978-981-16-3153-5_60
Tricarico JM, Kebreab E, Wattiaux MA. (2020). MILK Symposium review: Sustainability of dairy production and consumption in low-income countries with emphasis on productivity and environmental impact. Journal of Dairy Science, 103(11), 9791–9802. https://doi.org/10.3168/jds.2020-18269
Vasant PG, Zhang-Yue Z. Rising demand for livestock products in India: nature, patterns and implications. Australasian Agribusiness Rev. 2010;18:103–35.

Additional Declarations

No competing interests reported.

Status: Under Revision

Editorial decision: Revision requested, 25 Feb, 2026
Reviews received at journal, 24 Feb, 2026
Reviewers agreed at journal, 17 Feb, 2026
Reviewers agreed at journal, 13 Feb, 2026
Reviews received at journal, 11 Feb, 2026
Reviewers agreed at journal, 11 Feb, 2026
Reviewers invited by journal, 11 Feb, 2026
Editor invited by journal, 06 Feb, 2026
Editor assigned by journal, 04 Feb, 2026
Submission checks completed at journal, 04 Feb, 2026
First submitted to journal, 04 Feb, 2026
09:39:24","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":1412496,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-8702287/v1/1efdfe01-12d2-4fcb-9cd6-8d28cace0378.pdf"}],"financialInterests":"No competing interests reported.","formattedTitle":"\u003cp\u003eMachine Learning Driven Approach for Modelling Milk Production in India\u003c/p\u003e","fulltext":[{"header":"1. INTRODUCTION","content":"\u003cp\u003eIndia is the world\u0026rsquo;s largest producer of milk, accounting for 22% of global milk production (FAO). With a significant segment of its population following a lactovegetarian diet, India is also the world\u0026rsquo;s leading consumer of milk (USDA Report, 2021), which makes milk and milk products indispensable to the country\u0026rsquo;s policies on food and nutritional security.\u003c/p\u003e \u003cp\u003eThe dairy sector directly supports the livelihoods of more than eighty million people, with the workforce predominantly comprising women and small and marginal farmers (GoI, 2024). Within India\u0026rsquo;s socioeconomic setting, livestock rearing and milk production are found intertwined with poverty alleviation, income diversification, improved welfare of rural households, higher resilience to economic vulnerability, and women empowerment (Birthal and Negi, \u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e2012\u003c/span\u003e; FAO, 2018; Tricarico \u003cem\u003eet al.\u003c/em\u003e,2020). The milk cooperatives in the country have been recognized for delivering higher economic incentives and inclusiveness in resource distribution (Gaillard and Dervill\u0026eacute;, \u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e2022\u003c/span\u003e). 
Thus, dairying is a vital component of the country\u0026rsquo;s agricultural sector and is instrumental to achieving Sustainable Development Goals (SDGs) 1, 2, and 5, i.e., \u0026lsquo;No poverty\u0026rsquo;, \u0026lsquo;Zero hunger\u0026rsquo;, and \u0026lsquo;Gender equality\u0026rsquo;.\u003c/p\u003e \u003cp\u003eIn recent decades, fuelled by factors such as a rising population, increased disposable income, and growing urbanization, India is experiencing a \u0026lsquo;Livestock Revolution\u0026rsquo; (Delgado \u003cem\u003eet al.\u003c/em\u003e,2001). The Indian dairy industry is expanding at a substantial rate of 6.2% annually (GoI, 2022). Milk stands out as the only livestock product in the country with a surplus in per-capita availability (Srivastava \u003cem\u003eet al.\u003c/em\u003e, 2024). However, the income elasticity of milk and milk products is found to be higher than that of egg, meat, and fish in the country, implying that an increase in consumer income is likely to translate into a much higher demand for the former (Vasant and Zhang-Yue, \u003cspan citationid=\"CR68\" class=\"CitationRef\"\u003e2010\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eThe rise in milk prices has already been identified as a major contributor to food price inflation within the country (Sekhar et al., \u003cspan citationid=\"CR61\" class=\"CitationRef\"\u003e2017\u003c/span\u003e). Despite this, a significant portion of the marketable surplus of raw milk continues to be channelled through unorganized supply chains that lack essential storage and transportation infrastructure (Petare, 2013), resulting in its adulteration and spoilage. As per the NITI Aayog (\u003cspan citationid=\"CR48\" class=\"CitationRef\"\u003e2021\u003c/span\u003e), only 40% of the milk sold in the country gets processed, compared to 90% in developed countries. 
Thus, the government aims to double the milk handling to provide dairy farmers with higher incomes (National Action Plan for Dairy Development, 2018).\u003c/p\u003e \u003cp\u003eLooking ahead, global milk production is forecasted to grow at 1.8% per annum over the next decade, while the dairy herd population is projected to grow at an annual rate of 1.1%, particularly in low-yield regions, such as Sub-Saharan Africa and in major milk-producing countries like India and Pakistan (OECD-FAO, 2022). These trends spark \u0026lsquo;food versus feed\u0026rsquo; debate and raise concerns about the sector\u0026rsquo;s long-term environmental sustainability, owing to mounting land-use pressure and a significant contribution of livestock production to greenhouse gas emissions (FAO, \u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e2006\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eAgainst this backdrop, accurate early prediction of milk production in the country would equip policymakers with insights to make informed decisions, ensuring sustained productivity and long-term food security (Lyngkhoi \u003cem\u003eet al.\u003c/em\u003e,2022). In addition, such an attempt would align with the government\u0026rsquo;s Viksit Bharat Initiative, which emphasizes data-driven policymaking.\u003c/p\u003e \u003cp\u003eHowever, modelling milk production is complex due to its dependency on multiple factors (Rao, \u003cspan citationid=\"CR57\" class=\"CitationRef\"\u003e2017\u003c/span\u003e; Sarkar et al., \u003cspan citationid=\"CR60\" class=\"CitationRef\"\u003e2022\u003c/span\u003e). 
In India in particular, states differ widely in socioeconomic attributes, dietary preferences, sources of feed and fodder, infrastructural and technological progress, and institutional functioning, leading to regional disparities in milk production (Rajeshwaran et al., \u003cspan citationid=\"CR55\" class=\"CitationRef\"\u003e2014\u003c/span\u003e; Kale \u003cem\u003eet al.\u003c/em\u003e, 2019). Therefore, investigating variables at the state level becomes crucial for designing robust, contextually appropriate models. This entails selecting a modelling technique that can handle such data scalably and efficiently. Owing to advances in the disciplines of applied mathematics, statistics, computing and information sciences, Artificial Intelligence (AI) is finding increasing application (Franco \u0026amp; Santurro, 2020). Machine Learning (ML) algorithms, which are a subset of AI, have been documented for their ability to handle high-dimensional datasets, their universal approximation capability, and their adaptive learning. 
Hence, they are being increasingly employed for a variety of applications in livestock production, such as detecting diseases in cattle (Bhardwaj et al., \u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2023\u003c/span\u003e), predicting the yield of milch species (Kumar et al., \u003cspan citationid=\"CR39\" class=\"CitationRef\"\u003e2019\u003c/span\u003e), forecasting the production of livestock products (Suseendran \u0026amp; Duraisamy, \u003cspan citationid=\"CR66\" class=\"CitationRef\"\u003e2021\u003c/span\u003er, 2024; Kim et al., \u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e2024\u003c/span\u003e), and estimating methane emissions from livestock farming (Perumal et al., \u003cspan citationid=\"CR51\" class=\"CitationRef\"\u003e2024\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eIn the Indian context, researchers have explored various methodologies, ranging from traditional modelling approaches such as ARIMA, VAR, and Holt\u0026rsquo;s model (Paul et al., \u003cspan citationid=\"CR50\" class=\"CitationRef\"\u003e2014\u003c/span\u003e; Deshmukh \u0026amp; Paramasivam, \u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e2016\u003c/span\u003e; Mishra \u003cem\u003eet al.\u003c/em\u003e, 2020; Jirli et al., \u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e2025\u003c/span\u003e) to mathematical models like deterministic growth models (Srivastava, 2024), AI algorithms like neural networks (Shankar et al., \u003cspan citationid=\"CR62\" class=\"CitationRef\"\u003e2023\u003c/span\u003e), and hybrid models (Subbanna et al., \u003cspan citationid=\"CR65\" class=\"CitationRef\"\u003e2021\u003c/span\u003e) to forecast milk production. However, studies employing advanced ML methods for large-scale modelling of milk production within the country remain limited. 
Therefore, the present study aimed to develop robust predictive models\u003csup\u003e1\u003c/sup\u003e, employing a range of machine learning techniques, to estimate state-level milk production across India.\u003c/p\u003e"},{"header":"2. MATERIALS AND METHODS","content":"\u003cp\u003eDespite their accuracy in forecasting, ML models have not been fully utilized in predicting milk production, likely due to the lack of multi-parameter datasets (Ozella et al., \u003cspan class=\"CitationRef\"\u003e2023\u003c/span\u003e). To address this issue, the study retrieved and compiled time-series information on fifteen variables for twenty-seven Indian states\u003csup\u003e2\u003c/sup\u003e, spanning a period of over two decades, from 2000\u0026ndash;01 to 2022\u0026ndash;23. Open data repositories of government agencies, namely the Department of Animal Husbandry \u0026amp; Dairying, the National Dairy Development Board, the Department of Agriculture \u0026amp; Farmers Welfare, and the Reserve Bank of India, were used as sources. Time-series variables corresponding to different base years were spliced to the latest base year of 2011\u0026ndash;12. Due to regional heterogeneity, data compilation was performed at the state level to improve the models' applicability and robustness.\u003c/p\u003e\n\u003cp\u003eSubsequently, the data were pre-processed and used to train various ML models, with data splitting and validation strategies applied to safeguard the quality of training. The performance of these models was finally evaluated using the test dataset, followed by a comparison to identify the best-performing model. 
The modelling framework adopted in the study is summarized in Fig.\u0026nbsp;\u003cspan class=\"InternalRef\"\u003e1\u003c/span\u003e, and each step is detailed as follows.\u003c/p\u003e\n\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e\n\u003ch2\u003e2.1 Selection of model parameters\u003c/h2\u003e\n\u003cp\u003eIn order to build a model with good accuracy and applicability, comprehensive indicators or relevant proxy indicators that are in concordance with existing theory need to be selected, with a clear rationale behind the inclusion of each variable and the underlying presumption about its relationship with the dependent variable. Based on the availability of secondary data, sixteen input variables were selected a priori for developing the milk production models, as detailed in Table\u0026nbsp;\u003cspan class=\"InternalRef\"\u003e1\u003c/span\u003e.\u003c/p\u003e\n\u003cdiv class=\"gridtable\"\u003e\n\u003cdiv class=\"colspec\" align=\"left\"\u003e\u0026nbsp;\u003c/div\u003e\n\u003ctable id=\"Tab1\" border=\"1\"\u003e\u003ccaption\u003e\n\u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e\n\u003cdiv class=\"CaptionContent\"\u003e\n\u003cp\u003eInput variables selected for developing the milk production models\u003c/p\u003e\n\u003c/div\u003e\n\u003c/caption\u003e\n\u003cthead\u003e\n\u003ctr\u003e\n\u003cth align=\"left\"\u003e\n\u003cp\u003eS.\u003c/p\u003e\n\u003cp\u003eNo.\u003c/p\u003e\n\u003c/th\u003e\n\u003cth align=\"left\"\u003e\n\u003cp\u003eDeterminants\u003c/p\u003e\n\u003c/th\u003e\n\u003cth align=\"left\"\u003e\n\u003cp\u003eNotations\u003c/p\u003e\n\u003c/th\u003e\n\u003cth align=\"left\"\u003e\n\u003cp\u003eRationale for inclusion\u003c/p\u003e\n\u003c/th\u003e\n\u003cth align=\"left\"\u003e\n\u003cp\u003eUnit\u003c/p\u003e\n\u003c/th\u003e\n\u003cth align=\"left\"\u003e\n\u003cp\u003eMinimum Value\u003c/p\u003e\n\u003c/th\u003e\n\u003cth align=\"left\"\u003e\n\u003cp\u003eMaximum Value\u003c/p\u003e\n\u003c/th\u003e\n\u003cth 
align=\"left\"\u003e\n\u003cp\u003eApriori Relationship\u003c/p\u003e\n\u003c/th\u003e\n\u003c/tr\u003e\n\u003c/thead\u003e\n\u003ctbody\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e1.\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eNumber of in-milk crossbred and exotic cows\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eN\u003csub\u003ecb\u003c/sub\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd rowspan=\"3\" align=\"left\"\u003e\n\u003cp\u003eCows and buffaloes are the major milking species in the country and contribute to about 97% of milk production. Moreover, a substantial increase in milk production from buffaloes and cows is projected for the country by 2025\u0026ndash;26 (Devi \u003cem\u003eet al.\u003c/em\u003e, 2022).\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eThousands\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e36.1\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e2776.38\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e\u003cstrong\u003e+\u003c/strong\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e2.\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eNumber of in-milk indigenous and non-descript cows\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eN\u003csub\u003ei\u003c/sub\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eThousands\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e7.34\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e5547.87\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e\u003cstrong\u003e+\u003c/strong\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd 
align=\"left\"\u003e\n\u003cp\u003e3.\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eNumber of in-milk buffalo\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eN\u003csub\u003eb\u003c/sub\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eThousands\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e6.55\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e11759.06\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e\u003cstrong\u003e+\u003c/strong\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e4.\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eAverage yield per in-milk crossbred and exotic cows\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eY\u003csub\u003ecb\u003c/sub\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd rowspan=\"3\" align=\"left\"\u003e\n\u003cp\u003eGiven the variability in production performance of different species, their average yields were taken (Kale \u003cem\u003eet al.\u003c/em\u003e, 2018)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eKg/day\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e3.1\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e13.43\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e\u003cstrong\u003e+\u003c/strong\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e5.\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eAverage yield per in-milk indigenous and non-descript cows\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eY\u003csub\u003ei\u003c/sub\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd 
align=\"left\"\u003e\n\u003cp\u003eKg/day\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e1\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e6.91\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e\u003cstrong\u003e+\u003c/strong\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e6.\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eAverage yield per in-milk buffalo\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eY\u003csub\u003eb\u003c/sub\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eKg/day\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e1.67\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e9.11\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e\u003cstrong\u003e+\u003c/strong\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e7.\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eArea under pastures and permanent grazing lands\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eA\u003csub\u003ep\u003c/sub\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd rowspan=\"3\" align=\"left\"\u003e\n\u003cp\u003ePastures \u0026amp; permanent grazing lands, fodder crops, and crop residue are the prime sources of feed/fodder in India (Khan \u0026amp; Tomar, \u003cspan class=\"CitationRef\"\u003e2022\u003c/span\u003e; Singh \u003cem\u003eet al.\u003c/em\u003e, 2022). 
Feed availability is directly linked to animal nutrition and productivity (Prajapati et al., \u003cspan class=\"CitationRef\"\u003e2019\u003c/span\u003e; Kumar et al., \u003cspan class=\"CitationRef\"\u003e2023\u003c/span\u003e)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eThousand hectares\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e0\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e1709\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e\u003cstrong\u003e+\u003c/strong\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e8.\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eArea under fodder crops\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eA\u003csub\u003efc\u003c/sub\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eThousand hectares\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e0\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e5370\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e\u003cstrong\u003e+\u003c/strong\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e9.\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eArea under cereals and millets\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eA\u003csub\u003ecm\u003c/sub\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eThousand hectares\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e172\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e18842\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd 
align=\"left\"\u003e\n\u003cp\u003e\u003cstrong\u003e+\u003c/strong\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e10.\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eRatio of gross irrigated area to total cropped area\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eR\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eIrrigation is central to India's crop-milk mixed farming system (Goswami et al., \u003cspan class=\"CitationRef\"\u003e2017\u003c/span\u003e). Improved access to groundwater irrigation facilitates a transition from draught to dairying in the bovine economy (Kishore et al., \u003cspan class=\"CitationRef\"\u003e2016\u003c/span\u003e)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eUnitless\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e3.6\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e98.8\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e\u003cstrong\u003e+\u003c/strong\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e11.\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eArtificial inseminations conducted\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eAI\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eArtificial insemination is crucial in enhancing bovine productivity by advancing the genetic potential (NAAS, 2020; Saha and Bhattacharya, 2021)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eThousands\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e45\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e13394\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd 
align=\"left\"\u003e\n\u003cp\u003e\u003cstrong\u003e+\u003c/strong\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e12.\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eNumber of veterinary institutions\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eV\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eThe count of veterinary institutions was selected as the indicator of animal health care (Kumar et al., \u003cspan class=\"CitationRef\"\u003e2013\u003c/span\u003e). Poor veterinary infrastructure is linked to the imbalanced progress of dairy development in the country (Kale et al., \u003cspan class=\"CitationRef\"\u003e2016\u003c/span\u003e)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eUnits\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e801\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e7897\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e\u003cstrong\u003e+\u003c/strong\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e13.\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eProducer prices\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003ePP\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eRemunerative producer prices are incentives for adopting improved production technologies (Cwalina et al., \u003cspan class=\"CitationRef\"\u003e2020\u003c/span\u003e)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eLakh Rupees\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e95725\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e127292\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd 
align=\"left\"\u003e\n\u003cp\u003e+\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e14.\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003ePer capita net state domestic product (at constant prices)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eNSDPpc\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eTaken as a proxy for per-capita income, which is a measure of people\u0026rsquo;s purchasing power. Since one of the main drivers for growth in milk demand is an increase in disposable incomes (Gerosa and Skoet, 2012)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eRupees\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e7091\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e423716\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e+\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e15.\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eProduction year\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eYr\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eTo capture the local or global effects arising due to socio-politico-economic factors, which are difficult to capture otherwise, but affect the production and supply of milk.\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e0\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e22\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eNot known\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e16.\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd 
align=\"left\"\u003e\n\u003cp\u003eState\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eSt\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003ePost-White Revolution, the Indian dairy sector is identified with inter- and intra-state inequalities in production (Gupta \u0026amp; Purohit, \u003cspan class=\"CitationRef\"\u003e2010\u003c/span\u003e; Jaiswal, \u003cspan class=\"CitationRef\"\u003e2022\u003c/span\u003e), addressing which is imperative to meet future demand.\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\u0026nbsp;\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e0\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e26\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eNot known\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003c/tbody\u003e\n\u003c/table\u003e\n\u003c/div\u003e\n\u003cdiv class=\"gridtable\"\u003e\n\u003cdiv class=\"colspec\" align=\"left\"\u003e\u0026nbsp;\u003c/div\u003e\n\u003cdiv class=\"colspec\" align=\"left\"\u003e\u0026nbsp;\u003c/div\u003e\n\u003cdiv class=\"colspec\" align=\"left\"\u003e\u0026nbsp;\u003c/div\u003e\n\u003cdiv class=\"colspec\" align=\"left\"\u003e\u0026nbsp;\u003c/div\u003e\n\u003ctable id=\"Tab2\" border=\"1\"\u003e\u003ccaption\u003e\n\u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e\n\u003cdiv class=\"CaptionContent\"\u003e\n\u003cp\u003eSummary of ML techniques used for the development of predictive models\u003c/p\u003e\n\u003c/div\u003e\n\u003c/caption\u003e\n\u003cthead\u003e\n\u003ctr\u003e\n\u003cth align=\"left\"\u003e\n\u003cp\u003eModel Name\u003c/p\u003e\n\u003c/th\u003e\n\u003cth align=\"left\"\u003e\n\u003cp\u003eAcronym\u003c/p\u003e\n\u003c/th\u003e\n\u003cth align=\"left\"\u003e\n\u003cp\u003eDescription\u003c/p\u003e\n\u003c/th\u003e\n\u003cth 
align=\"left\"\u003e\n\u003cp\u003eReference\u003c/p\u003e\n\u003c/th\u003e\n\u003c/tr\u003e\n\u003c/thead\u003e\n\u003ctbody\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eDecision Tree\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eDT\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eDecision trees are popular tools for predictive modeling owing to their simplicity. They are highly user-friendly, typically requiring minimal or no tuning, and can be trained rapidly.\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eBlockeel et al., \u003cspan class=\"CitationRef\"\u003e2023\u003c/span\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eRandom Forest\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eRF\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eThe tree-based ensemble method is widely regarded for its rapid training, direct application to high-dimensional datasets, and statistical advantages such as providing measures of variable importance and unsupervised learning.\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eCutler \u003cem\u003eet al.\u003c/em\u003e, 2012\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eeXtreme Gradient Boosting\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eXGB\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eTree boosting is a widely used, powerful machine-learning method. 
XGBoost is a scalable end-to-end implementation of this approach, recognized for its predictive accuracy and interpretability.\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eAli \u003cem\u003eet al.\u003c/em\u003e, 2023\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eLight Gradient Boosting Machine\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eLight GBM\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eThe LightGBM regressor is a tree-based ensemble learning algorithm, developed to improve upon the efficiency and scalability constraints of XGBoost when applied to high-dimensional feature spaces and large-scale datasets.\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eSoomro et al., \u003cspan class=\"CitationRef\"\u003e2024\u003c/span\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eCatBoost\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eCB\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eCatBoost excels in handling categorical data, thereby simplifying the process of model development and reducing the risk of information loss or overfitting\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eKulkarni, \u003cspan class=\"CitationRef\"\u003e2022\u003c/span\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eEnsemble Technique 1\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eET-1\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd rowspan=\"2\" align=\"left\"\u003e\n\u003cp\u003eEnsemble learning is a machine learning concept in which multiple models are combined to tackle the same task, with the collective performance of the models surpassing that of any single constituent 
model.\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd rowspan=\"2\" align=\"left\"\u003e\n\u003cp\u003eFawagreh et al., \u003cspan class=\"CitationRef\"\u003e2014\u003c/span\u003e\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eEnsemble Technique 2\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eET-2\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003c/tbody\u003e\n\u003c/table\u003e\n\u003c/div\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec4\" class=\"Section2\"\u003e\n\u003ch2\u003e2.2 Machine Learning Models selected\u003c/h2\u003e\n\u003cp\u003eML algorithms have gained prominence across numerous fields due to their speed and accuracy in handling voluminous data. In this study, seven machine-learning techniques were employed, namely Decision Tree (DT), Random Forest (RF), eXtreme Gradient Boosting (XGB), Light Gradient Boosting Machine (Light GBM), CatBoost (CB), and two Ensemble models. Ensemble Technique 1 (ET-1) combined Random Forest and XGB, while Ensemble Technique 2 (ET-2) combined DT, RF, and XGB. Each of these modelling techniques is described briefly in Table\u0026nbsp;\u003cspan class=\"InternalRef\"\u003e2\u003c/span\u003e.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec5\" class=\"Section2\"\u003e\n\u003ch2\u003e2.3 Data pre-processing\u003c/h2\u003e\n\u003cp\u003eAs machine learning algorithms operate on numerical inputs, it is necessary to encode categorical variables into numerical values using suitable encoding techniques (Potdar et al., \u003cspan class=\"CitationRef\"\u003e2017\u003c/span\u003e). In the study, label encoding, a type of determined encoding technique, was selected for its lower computational complexity (Hancock and Khoshgoftaar, \u003cspan class=\"CitationRef\"\u003e2020\u003c/span\u003e). 
The technique was applied to the 'state' and 'year' columns to replace the categorical values with distinct, scalar numerical values.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec6\" class=\"Section2\"\u003e\n\u003ch2\u003e2.4 Model Development, Validation and Evaluation\u003c/h2\u003e\n\u003cp\u003eThe processed dataset, comprising 558 observations, was divided into training and testing subsets using an 80:20 ratio. The training process allows for automatic adjustment of weights and biases in a model during each iteration, while the validation process, which is performed on a reserved, unseen portion of the training data, allows for computing the loss at each iteration, thereby ensuring the achievement of optimal model performance. Finally, the testing dataset is used to assess the performance of a developed model.\u003c/p\u003e\n\u003cp\u003eGiven that the study uses a medium-sized dataset, partitioning the data decreases the number of samples available for training, which makes the results susceptible to the randomisation of train, test, and validation sets. Therefore, implementing cross-validation becomes crucial to ensure the robustness and generalizability of the developed models (Kaliappan et al., \u003cspan class=\"CitationRef\"\u003e2023\u003c/span\u003e). Cross-validation systematically divides data into multiple training and testing subsets to reduce performance variance arising from bias in data selection and to maximize the use of available data for both training and evaluation. 
Among various procedures, K-fold cross-validation, where the dataset is divided into k equal-sized folds, with each fold serving once as a validation set and the remaining folds as a training set, is widely used for model selection (Lyashenko and Jha, \u003cspan class=\"CitationRef\"\u003e2022\u003c/span\u003e).\u003c/p\u003e\n\u003cp\u003eSubsequently, for each model, the following configuration was applied to perform K-fold cross-validation: kf\u0026thinsp;=\u0026thinsp;KFold(n_splits\u0026thinsp;=\u0026thinsp;5, shuffle=True, random_state\u0026thinsp;=\u0026thinsp;42). It implies that the entire dataset was divided into five equal folds. The models were trained five times, each time using four of the folds for training and one fold for testing. Prior to the partition, the data was randomized to ensure the folds were representative of the overall data distribution. Lastly, to ensure a consistent basis for comparison across models, a \u0026lsquo;random_state\u0026rsquo; parameter was specified, so the dataset underwent an identical random split each time.\u003c/p\u003e\n\u003cp\u003eHyperparameter tuning is another crucial process to find the optimal set of hyperparameters, such as learning rate, batch size, number of layers in neural networks, regularization parameters, number of trees, depth of trees, etc., that yield the best performance for an ML model (Ilemobayo et al., \u003cspan class=\"CitationRef\"\u003e2024\u003c/span\u003e). Hence, a detailed account of hyperparameter tuning for each model is presented, with selection performed using random search. Firstly, the decision tree (DT) regressor was configured as follows: \u0026lsquo;max_depth=None\u0026rsquo;, \u0026lsquo;max_features=None\u0026rsquo;, \u0026lsquo;min_samples_leaf\u0026thinsp;=\u0026thinsp;2\u0026rsquo;, and \u0026lsquo;min_samples_split\u0026thinsp;=\u0026thinsp;8\u0026rsquo;. 
Here, min_samples_split defines the minimum number of samples required to split an internal node, whereas min_samples_leaf indicates the minimum number of samples that a terminal node must contain after a split. For the Random Forest regressor, the selected hyperparameters were as follows: \u0026lsquo;max_depth\u0026thinsp;=\u0026thinsp;20\u0026rsquo;, \u0026lsquo;min_samples_leaf\u0026thinsp;=\u0026thinsp;2\u0026rsquo;, \u0026lsquo;min_samples_split\u0026thinsp;=\u0026thinsp;8\u0026rsquo;, and \u0026lsquo;n_estimators\u0026thinsp;=\u0026thinsp;400\u0026rsquo;. The n_estimators parameter specifies the number of individual decision trees within the forest. Its value is selected to strike a balance between capturing data complexity and preventing overfitting due to excessive model complexity. The max_depth parameter limits the maximum depth of each tree, i.e., the number of successive splits between the root and the deepest leaf, to constrain complexity.\u003c/p\u003e\n\u003cp\u003eFor the XGB regressor, optimal hyperparameters were identified as follows: \u0026lsquo;n_estimators\u0026thinsp;=\u0026thinsp;200\u0026rsquo;, \u0026lsquo;learning_rate\u0026thinsp;=\u0026thinsp;0.1\u0026rsquo;, \u0026lsquo;max_depth\u0026thinsp;=\u0026thinsp;4\u0026rsquo;, \u0026lsquo;subsample\u0026thinsp;=\u0026thinsp;0.8\u0026rsquo;, and \u0026lsquo;colsample_bytree\u0026thinsp;=\u0026thinsp;0.8\u0026rsquo;. The learning rate is the step-size shrinkage applied when the weights are updated at each boosting step and is one of the key parameters in controlling overfitting. Subsample and colsample_bytree denote the fractions of training samples and of features, respectively, drawn for the construction of each tree (Dean, \u003cspan class=\"CitationRef\"\u003e2020\u003c/span\u003e), where the latter helps in preventing dominant features from skewing the model output. 
For the LGBM Regressor, the parameters were set to \u0026lsquo;num_leaves\u0026thinsp;=\u0026thinsp;31\u0026rsquo;, \u0026lsquo;learning_rate\u0026thinsp;=\u0026thinsp;0.01\u0026rsquo;, and \u0026lsquo;n_estimators\u0026thinsp;=\u0026thinsp;500\u0026rsquo;, and for the CB Regressor to \u0026lsquo;iterations\u0026thinsp;=\u0026thinsp;1000\u0026rsquo;, \u0026lsquo;learning_rate\u0026thinsp;=\u0026thinsp;0.03\u0026rsquo;, and \u0026lsquo;depth\u0026thinsp;=\u0026thinsp;4\u0026rsquo;. Lastly, for the ensemble models, optimised base learners, namely Random Forest, XGBoost, and Decision Trees, were combined to leverage their complementary strengths, a design shown to improve predictive accuracy (Sagi and Rokach, \u003cspan class=\"CitationRef\"\u003e2018\u003c/span\u003e).\u003c/p\u003e\n\u003cp\u003eFinally, the ML models were evaluated based on the following metrics:\u003c/p\u003e\n\u003col style=\"list-style-type: lower-alpha;\"\u003e\n\u003cli\u003e\n\u003cp\u003eError percentage = \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\frac{{\\sum\\:}_{\\text{i}=1}^{\\text{N}}\\left|\\text{O}\\text{b}\\text{s}\\text{e}\\text{r}\\text{v}\\text{e}\\text{d}\\:\\text{o}\\text{u}\\text{t}\\text{p}\\text{u}\\text{t}-\\text{P}\\text{r}\\text{e}\\text{d}\\text{i}\\text{c}\\text{t}\\text{e}\\text{d}\\:\\text{o}\\text{u}\\text{t}\\text{p}\\text{u}\\text{t}\\right|}{\\text{N}}*100\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n\u003c/li\u003e\n\u003cli\u003e\n\u003cp\u003eRoot Mean Square Error (RMSE) \u0026ndash;\u003cbr 
/\u003e$$\\text{RMSE}=\\sqrt{\\sum_{\\text{i}=1}^{\\text{N}}\\frac{\\left(\\text{Predicted output}-\\text{Observed output}\\right)^2}{\\text{N}}}$$\u003c/p\u003e\n\u003c/li\u003e\n\u003cli\u003e\n\u003cp\u003eMean Absolute Error (MAE) \u0026ndash;\u003cbr /\u003e$$\\text{MAE}=\\frac{\\sum_{\\text{i}=1}^{\\text{N}}\\left|\\text{Observed output}-\\text{Predicted output}\\right|}{\\text{N}}$$\u003c/p\u003e\n\u003c/li\u003e\n\u003cli\u003e\n\u003cp\u003eR\u003csup\u003e2\u003c/sup\u003e (Coefficient of multiple determination) \u0026ndash;\u003cbr /\u003e$$\\text{R}^2=1-\\frac{\\sum_{\\text{i}=1}^{\\text{N}}\\left(\\text{Predicted output}-\\text{Observed output}\\right)^2}{\\sum_{\\text{i}=1}^{\\text{N}}\\left(\\text{Observed output}-\\text{Mean observed output}\\right)^2}$$\u003c/p\u003e\n\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003eIn addition, several diagnostic visualisations, such as scatter plots comparing actual and predicted values, error distribution plots, residual box plots, and a radar chart, were used to qualitatively assess the models\u0026rsquo; fit, consistency, and robustness. 
A feature importance plot was also generated to facilitate the interpretation of the model's behaviour and to identify the top predictive variables.\u003c/p\u003e\n\u003c/div\u003e"},{"header":"3. RESULTS AND DISCUSSION","content":"\u003cp\u003eThe study employed seven Machine Learning (ML) models, wherein each model was trained and evaluated using an 80:20 data split. Hyperparameters such as maximum depth, number of estimators, learning rate, and feature subsampling ratios were fine-tuned iteratively. Additionally, a five-fold cross-validation was conducted, and L2 regularisation was applied during training to ensure models' robustness and to improve their generalization ability. Subsequently, their performance was assessed based on the metrics generated from the test dataset. As the study aimed to conduct a comparative analysis to identify the best-performing model, a random_state value of 42 was assigned to each model to ensure similar data splits.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab3\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eError metrics of the models\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"5\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMODEL\u003c/p\u003e \u003c/th\u003e \u003cth 
align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMAE\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eR\u003csup\u003e2\u003c/sup\u003e score\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eRMSE\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eError Percentage\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eDT\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e2062.54\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.89\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e3288.03\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e19.65%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eRF\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e1829.21\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.90\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e3100.69\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e16.61%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eXGB\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e1819.39\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.89\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e3191.47\u003c/p\u003e \u003c/td\u003e \u003ctd 
align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e14.57%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCB\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e1717.74\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.90\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e3111.83\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e16.17%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLightGBM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e1989.6\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.88\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e3365.92\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e17.87%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eET-1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e1723.04\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.91\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e3009.55\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e13.93%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eET-2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e1683.67\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" 
char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.91\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e2929.79\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e13.63%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eThe error metrics of ML models are presented in Table\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e. It was found that Ensemble Technique 2 demonstrated the best performance in terms of achieving the lowest Mean Absolute Error (MAE) of 1683.67, the lowest Root Mean Square Error (RMSE) of 2929.79, the lowest error percentage of 13.63%, and the highest coefficient of determination (R\u0026sup2;) value of 0.91. Ensemble Technique 1 followed close behind with MAE of 1723.04, RMSE of 3009.55, error percentage of 13.93%, and R\u003csup\u003e2\u003c/sup\u003e of 0.91. The ensemble models, owing to their augmented capability to manage the bias-variance tradeoff (Ranglani, \u003cspan citationid=\"CR56\" class=\"CitationRef\"\u003e2024\u003c/span\u003e), yield improved performance.\u003c/p\u003e \u003cp\u003eNext in line was the CatBoost model, which was observed to achieve an MAE of 1717.74, an RMSE of 3111.83, and an R\u003csup\u003e2\u003c/sup\u003e of 0.90. However, the XGBoost model marginally outperformed it in terms of error percentage. Since the error percentage metric is quite sensitive to the presence of large outliers, CB was considered to perform better than XGB. The Decision Tree (DT) model yielded the highest Mean Absolute Error (MAE) of 2062.54, the highest Root Mean Square Error (RMSE) of 3288.03, and the highest error percentage of 19.65%. 
This could be attributed to the inherent simplicity of DT, which makes almost no assumptions about the target function and forms its hypothesis solely from the training data. However, this characteristic causes DT to capture superfluous patterns, rendering it highly susceptible to variance in data, resulting in a relatively weak performance.\u003c/p\u003e \u003cp\u003eThe results underscore the capability of ensemble techniques to attain higher prediction accuracy and lower variance than the individual models they integrate, reinforcing their foundational learning principle that a combination of diverse models often outperforms an individual model in isolation (Sharma \u003cem\u003eet al.\u003c/em\u003e, 2018). Notably, this performance was achieved across a wide range of target values, spanning from a minimum of 11.0 to a maximum of 33,873.61, highlighting the models\u0026rsquo; ability to capture a wide range of data patterns.\u003c/p\u003e \u003cp\u003eNext, error distribution plots were visualized using histograms overlaid with Kernel Density Estimation (KDE) curves, in Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e, to provide a statistical overview of the models\u0026rsquo; residuals. An ideal model is expected to yield a symmetric error distribution, conforming to a bell-shaped curve centred at zero, as this characteristic signifies minimal bias in predictions and uniform performance across the dataset.\u003c/p\u003e \u003cp\u003eFigure \u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e: \u003cb\u003eError distribution plots of the models\u003c/b\u003e\u003c/p\u003e \u003cp\u003eFrom Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e, it was observed that amongst all models, DT exhibited a relatively symmetric distribution around zero, but with notable irregularities. 
The error range extended from approximately\u0026thinsp;\u0026minus;\u0026thinsp;4,000 to +\u0026thinsp;8,000, indicating fluctuations between under-prediction and over-prediction contingent on data characteristics. The widespread errors observed in both directions indicate that the model is capturing extraneous noise rather than identifying fundamental patterns within certain regions of the feature space. This can be ascribed to the susceptibility of DT to overfitting when it grows excessively deep, i.e., the model constructs highly specific decision pathways that may attain a good fit for the training data yet fail to generalize effectively to previously unseen data. In contrast, the remaining models demonstrated right-skewed distributions with peaks close to zero, suggesting a common tendency to overestimate higher target values. An analysis of the kernel density estimate, represented by the pink curve, revealed that the ensemble techniques, particularly ET-2, showed a steady decline, implying that the model produced large prediction errors less frequently and thereby provided a more robust performance.\u003c/p\u003e \u003cp\u003eTo further aid the visual error analysis, box plots were generated and are presented in Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e. These are instrumental in assessing model performance, as each plot encapsulates key summary statistics of the residuals: the median, the interquartile range (IQR), and the overall spread, including outliers. First, the IQR, depicted by the height of a box, shows the spread of the middle 50% of the residuals. Second, the median, represented by the horizontal line within each box, when situated on the zero line, indicates that the model does not systematically overestimate or underestimate predictions. 
Then, the whiskers, extending from the box edges to the most extreme values lying within 1.5 times the IQR of those edges, represent the spread of the majority of residuals, excluding the extreme values. Lastly, the outliers, marked as red dots beyond the whiskers, represent prediction errors that deviate significantly from the typical range.\u003c/p\u003e \u003cp\u003eFigure \u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e: \u003cb\u003eResidual box plots of the models\u003c/b\u003e\u003c/p\u003e \u003cp\u003eFrom Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e, it was observed that the CB model exhibited the smallest interquartile range, followed by ET-1, ET-2, and XGB, all of which demonstrated comparable box heights. This signifies that these models yielded more consistent predictions, with minimal variation in error. Additionally, these models had the shortest whiskers, reinforcing their strong predictive stability and accuracy. Interestingly, the DT model exhibited no apparent outliers. However, the absence of plotted red dots does not necessarily imply superior performance. Outliers are plotted only when data points lie more than 1.5 times the IQR above the third quartile or below the first quartile. Thus, in this case, the absence of outliers reflects the model\u0026rsquo;s inherently high variance, as evidenced by the widest interquartile range and the longest whiskers. Overall, the presence of outliers across the remaining models indicated that, despite improvements in central tendency, the occurrence of extreme prediction errors was common across ML algorithms for the given dataset. Regarding the median line, it was found positioned closest to the zero line for the DT model. This suggests that its residuals are, on average, centred around zero, i.e., the model has low bias. 
The above observations align with the theoretical understanding that Decision Trees are low-bias but high-variance learners (Ibrahim, \u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e2022\u003c/span\u003e). The remaining models were found to have their medians slightly above zero, with a more pronounced deviation observed in the case of LightGBM.\u003c/p\u003e \u003cp\u003eFigure \u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e: \u003cb\u003eRadar Chart\u003c/b\u003e\u003c/p\u003e \u003cp\u003eThereafter, a radar chart was plotted (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e). Each axis represents one of three metrics (R\u0026sup2;, MAE, and RMSE), with models projected along it according to their scaled performance. A greater distance from the centre on the R\u0026sup2; axis indicates higher predictive accuracy. To make the axes visually comparable, the error metrics were inverted during normalization. Hence, movement toward the outer edge on the MAE and RMSE axes reflects lower error. Subsequently, for each model, its normalized values on the three axes were connected to form a polygon. A larger and more symmetric polygon denotes that the model performs well across all metrics.\u003c/p\u003e \u003cp\u003eFrom Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e, ET-2 was noted to have the most expansive and symmetric polygon on the radar plot, reflecting strong and consistent performance across R\u0026sup2;, MAE, and RMSE. ET-1 closely followed, demonstrating a high degree of generalization. CB, XGB, and RF formed moderately large polygons, indicative of stable predictive accuracy, though marginally less than the ensembles. 
In contrast, LGBM and DT yielded notably contracted polygons, denoting lower R\u0026sup2; and higher error metrics, and thus inferior overall performance. A holistic assessment of polygonal areas and symmetry, which provides an insight into the model\u0026rsquo;s balance and efficacy, revealed Ensemble Technique-2 as the most reliable model across all evaluated criteria.\u003c/p\u003e \u003cp\u003eHowever, comprehending the rationale behind a model's predictions is often as critical as achieving high accuracy. Despite their robust performance, complex models such as Ensemble techniques pose the significant challenge of opacity, highlighting a dilemma between accuracy and interpretability (Lundberg and Lee, \u003cspan citationid=\"CR40\" class=\"CitationRef\"\u003e2017\u003c/span\u003e). To address this issue, a feature importance plot, as presented in Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e, was generated using the XGB model. The F-score for a feature, particularly for tree-based models, measures how often the feature is used to split the data across all trees in the ensemble and the improvement in the model's performance due to the splits. Hence, this metric facilitates quantifying the relative contribution of individual features to model predictions and serves as a tool for model interpretability. Nevertheless, the interpretations should be made considering the limitation that the plots primarily reflect overall influence rather than feature interactions or individual prediction effects, i.e., they capture a global perspective rather than a local one. 
Additionally, the high importance of a feature indicates an association within the model and not necessarily causation.\u003c/p\u003e \u003cp\u003eFigure \u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e: \u003cb\u003eFeature Importance Plot\u003c/b\u003e\u003c/p\u003e \u003cp\u003eFrom Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e, it was noted that the Production year, denoted as year_encoded, had the highest F-score of 220, followed closely by the Artificial inseminations conducted (AI), having a score of 200. This suggests that these factors exert a more significant impact on the model's predictions. Production year emerged as the top feature due to an almost consistent pattern of year-on-year growth in milk production across all states. The high importance of AI as a predictive feature aligns with the available literature on its significant impact on livestock productivity, health, and offspring quality. Next in line were \u0026lsquo;Ratio of gross irrigated area to total cropped area (R)\u0026rsquo;, \u0026lsquo;Area under cereals and millets (A\u003csub\u003ecm\u003c/sub\u003e)\u0026rsquo;, \u0026lsquo;Average yield per in-milk crossbred and exotic cows (Y\u003csub\u003ecb\u003c/sub\u003e)\u0026rsquo;, \u0026lsquo;Producer prices (PP)\u0026rsquo;, and \u0026lsquo;Number of in-milk crossbred and exotic cows (N\u003csub\u003ecb\u003c/sub\u003e)\u0026rsquo;, with F-scores above 150. Among the variables denoting fodder availability, the pre-eminence of irrigation over land area variables emphasizes the importance of the former in enhancing both the yield and stability of fodder supply. This highlights that while expanding the area under crops and grazing lands is required to meet the fodder demand of the growing livestock population, the synergies of crop-livestock systems are most effective with assured irrigation, a finding similar to Mynavathi et al. 
(\u003cspan citationid=\"CR45\" class=\"CitationRef\"\u003e2019\u003c/span\u003e). Further, the importance of both the count and yield of crossbred cows, i.e., N\u003csub\u003ecb\u003c/sub\u003e and Y\u003csub\u003ecb\u003c/sub\u003e, corroborates the growing status of crossbred cattle in the country\u0026rsquo;s milking herd (NAAS, 2016). Remunerative prices to producers emerged as another crucial variable, underscoring their importance in augmenting dairy activity (Kumar et al., \u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e2011\u003c/span\u003e).\u003c/p\u003e"},{"header":"4. CONCLUSION","content":"\u003cp\u003eIndia is the world\u0026rsquo;s largest producer and consumer of milk, with the dairy sector directly employing over eighty million people, mainly comprising women and small and marginal farmers. Within the country\u0026rsquo;s socioeconomic setting, livestock rearing and milk production are found to be intertwined with poverty alleviation, income diversification, higher welfare of rural households, improved resilience to economic vulnerability, and women\u0026rsquo;s empowerment. Thus, dairying is a vital component of India\u0026rsquo;s agricultural sector and is crucial to achieving SDGs 1, 2, and 5.\u003c/p\u003e \u003cp\u003eAccurate early prediction of milk production in the country would support data-driven policymaking to ensure food security and supply chain optimization. Owing to the multi-factorial dependency and regional disparity in production, the study aimed to develop robust models, utilizing Machine Learning (ML) techniques, to predict milk production at the state level. To this end, time-series data on fifteen variables, for a period from 2000-01 to 2022-23, across twenty-seven states were compiled. Seven Machine Learning techniques, namely Decision Tree (DT), Random Forest (RF), eXtreme Gradient Boosting (XGB), Light Gradient Boosting Machine (LightGBM), CatBoost (CB), and two Ensemble models, were employed. 
Ensemble Technique 1 (ET-1) combined RF and XGB, while Ensemble Technique 2 (ET-2) combined DT, RF, and XGB.\u003c/p\u003e \u003cp\u003eResults revealed that Ensemble methods produced the best results, particularly Ensemble Technique 2 (ET-2). It achieved the lowest MAE, RMSE, and error percentage along with the highest R\u0026sup2; of 0.91. The ET-1 model followed close behind. A qualitative error assessment using error distribution plots, residual box plots, and a radar chart reaffirmed that the ET-2 model yielded the most reliable and robust performance. In contrast, DT was found to perform the weakest. A feature importance plot, generated to comprehend the relative contribution of individual features to model predictions, showed Production year, Artificial inseminations conducted, Ratio of gross irrigated area to total cropped area, Area under cereals and millets, Producer prices, and Average yields and Numbers of in-milk crossbred and exotic cows as the top features contributing to milk production. Thus, the findings demonstrate the robustness of ML techniques in context-specific modelling and point to genetic improvement, provision of remunerative producer prices, and assured irrigation as strategic priorities to boost milk production in the country.\u003c/p\u003e"},{"header":"5. LIMITATIONS AND FUTURE DIRECTIONS","content":"\u003cp\u003eThe authors endeavoured to incorporate as comprehensive a dataset as possible; however, the study is constrained by its reliance on secondary data sources. Consequently, a few crucial variables, such as outbreaks of livestock diseases, data on dairy cooperative societies, and socio-economic characteristics of dairy households, could not be included despite their established significance in influencing milk production within the Indian context. Future studies may consider including these variables to further enrich the policy implications of their work. 
Furthermore, a comparative evaluation of machine learning models against traditional statistical forecasting models or hybrid approaches documented in the literature, using a similarly extensive and diverse dataset for large-scale milk prediction, can be attempted in future studies to enhance the methodological rigour in the field.\u003c/p\u003e "},{"header":"Declarations","content":"\u003ch2\u003e\u003cstrong\u003e6. \u003c/strong\u003e\u003cstrong\u003eETHICAL APPROVAL AND ACCORDANCE \u003c/strong\u003e\u003c/h2\u003e\n\u003cp\u003eNot applicable, as the study is based solely on secondary data available free of cost in the public domain. The authors assure that the research work presented is original, and the manuscript is not under consideration for publication by any other journal. No part of the work presented in the manuscript has been published earlier. The final manuscript was read and approved by all authors, who have collectively consented to submit it to this journal.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eThird Party Material:\u003c/strong\u003e All of the material is owned by the authors and/or no permissions are required.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConsent to Participate\u003c/strong\u003e: Not applicable\u003c/p\u003e\n\u003ch2\u003e\u003cstrong\u003e7. \u003c/strong\u003e\u003cstrong\u003eCONSENT TO PUBLISH \u003c/strong\u003e\u003c/h2\u003e\n\u003cp\u003eNot applicable.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e8. OTHER DECLARATIONS\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAcknowledgement:\u003c/strong\u003e The work presented in this manuscript is an extension of research that received a Research Award from the National Bank for Agriculture and Rural Development (NABARD), Government of India. 
The authors are deeply grateful for the recognition.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFunding statement: \u003c/strong\u003eThe work received no funding.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCompeting Interests:\u003c/strong\u003e The authors declare no potential conflicts of interest. The authors have no financial or proprietary interests in any material discussed in the manuscript.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eData availability:\u003c/strong\u003e The data compiled and analyzed for the current study are available in open data repositories hosted on websites of government agencies, Archive Land Use Statistics \u0026ndash; At a Glance | Official website of Directorate of Economics and Statistics, Department of Agriculture and Farmers Welfare, Ministry of Agriculture and Farmers Welfare, Government of India; https://dahd.gov.in/schemes/programmes/animal-husbandry-statistics; \u003cu\u003eReserve Bank of India - Handbook of Statistics on Indian Economy\u003c/u\u003e; \u003cu\u003eagriculture value-output-milk Statistics and Growth Figures Year-wise of india\u0026ndash; Indiastat\u003c/u\u003e; \u003cu\u003ehttps://www.indiastat.com/data/agriculture/veterinary-institutions-animal-health-services/data-year/all\u003c/u\u003e\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eIlemobayo A, Durodola J, Alade O, Awotunde OJ, Olanrewaju OT, Falana A, Edu O\u0026hellip;E, O. Hyperparameter tuning in machine learning: A comprehensive review. J Eng Res Rep. 2024;26(6):388\u0026ndash;95. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.9734/jerr/2024/v26i61188\u003c/span\u003e\u003cspan address=\"10.9734/jerr/2024/v26i61188\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBhardwaj S, Tarafdar A, Baghel M, Dutt T, Gaur GK. 
Determining point of economic cattle milk production through machine learning and evolutionary algorithm for enhancing food security. J Food Qual. 2023;2023(1):7568139.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBirthal PS, Negi DS. Livestock for higher, sustainable and inclusive agricultural growth. Econo Political Wkly. 2012;47:89\u0026ndash;99.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBlockeel H, Devos L, Fr\u0026eacute;nay B, Nanfack G, Nijssen S. Decision trees: from efficient prediction to responsible AI. Front Artif Intell. 2023;6:1124553.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBogatinovski J, Todorovski L, Džeroski S, Kocev D. Comprehensive comparative study of multi-label classification methods. Expert Syst Appl. 2022;203:117215.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChand R, Raju SS. Livestock sector composition and factors affecting its growth. Indian J Agric Econ. 2008;63(2):198\u0026ndash;210.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCwalina K, Borusiewicz A, Ferrari M, Herrmann IT, Priekulis J. Factors influencing the development of milk production in agricultural holdings. Agricultural Eng. 2020;24(4):23\u0026ndash;34. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1515/agriceng-2020-0033\u003c/span\u003e\u003cspan address=\"10.1515/agriceng-2020-0033\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDean N. (2020). \u003cem\u003eSubsampling Big Data for Tree-Based Learning Algorithms\u003c/em\u003e (Doctoral dissertation, North Carolina Agricultural and Technical State University).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDelgado C, Rosegrant M, Steinfeld H, Ehul S, Courbis C. Livestock to 2020: the next food revolution. Outlook Agric. 
2001;30(1):27\u0026ndash;9.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDeshmukh SS, Paramasivam R. Forecasting of milk production in India with ARIMA and VAR time series models. Asian J Dairy Food Res. 2016;35(1):17\u0026ndash;22. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.18805/ajdfr.v35i1.9246\u003c/span\u003e\u003cspan address=\"10.18805/ajdfr.v35i1.9246\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDevi Monika RH, Umme, Weerasinghe WPMCN, Pradeep M, Shiwani T, Karakaya Kadir. Future Milk Production Prospects in India for Various Animal Species using Time Series Models. Indian J Anim Res. 2022;56(9):1170\u0026ndash;5. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.18805/IJAR.B-4409\u003c/span\u003e\u003cspan address=\"10.18805/IJAR.B-4409\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDi Franco G, Santurro M. Machine learning, artificial neural networks and social research. Qual Quantity. 2021;55(3):1007\u0026ndash;25.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFAO. Livestock's Long Shadow. Rome: Food and Agriculture Organisation; 2006.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFawagreh K, Gaber MM, Elyan E. Random forests: from early developments to recent advancements. Syst Sci Control Engineering: Open Access J. 2014;2(1):602\u0026ndash;9. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1080/21642583.2014.956265\u003c/span\u003e\u003cspan address=\"10.1080/21642583.2014.956265\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFood and Agriculture Organization of the United Nations. (2018). 
\u003cem\u003eDairy Development's Impact on Poverty Reduction\u003c/em\u003e. FAO. 4p. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://openknowledge.fao.org/handle/20.500.14283/ca2185en\u003c/span\u003e\u003cspan address=\"https://openknowledge.fao.org/handle/20.500.14283/ca2185en\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFood and Agriculture Organization of the United Nations. (2021). \u003cem\u003eWorld Food and Agriculture-Statistical Yearbook 2021\u003c/em\u003e. FAO. 368 p. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.4060/cb4477en\u003c/span\u003e\u003cspan address=\"10.4060/cb4477en\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGaillard C, Dervill\u0026eacute; M. Dairy farming, cooperatives and livelihoods: lessons learned from six indian villages. J Asian Econ. 2022;78:101422.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGateway to dairy production. and products. FAO. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.fao.org/dairy-production-products/production/en\u003c/span\u003e\u003cspan address=\"https://www.fao.org/dairy-production-products/production/en\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGerosa \u0026amp; Skoet. (2012). Milk availability: trends in production and demand and medium-term outlook. ESA Working paper 12\u0026thinsp;\u0026ndash;\u0026thinsp;01. FAO. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.fao.org/agrifood-economics/publications/detail/en/c/163733/\u003c/span\u003e\u003cspan address=\"https://www.fao.org/agrifood-economics/publications/detail/en/c/163733/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGoswami A, Rajan A, Verma S, Shah T. Irrigation and India's crop-milk agrarian economy: a simple recursive model and some early results. IWMI-Tata Water Policy Res Highlight. 2017;2:1\u0026ndash;12.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGupta SK, Purohit A. Growth of productivity in dairy industry: an inter-state analysis. Int J Educ Adm. 2010;2(4):685\u0026ndash;9.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eG\u0026uuml;r YE. Innovation in the dairy industry: forecasting cow cheese production with machine learning and deep learning models. Int J Agric Environ Food Sci. 2024;8(2):327\u0026ndash;46. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.31015/jaefs.2024.2.9\u003c/span\u003e\u003cspan address=\"10.31015/jaefs.2024.2.9\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHancock JT, Khoshgoftaar TM. Survey on categorical data for neural networks. J Big Data. 2020;7(1):1\u0026ndash;41. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1186/s40537-020-00305-w\u003c/span\u003e\u003cspan address=\"10.1186/s40537-020-00305-w\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ehttps://apeda.gov.in/ Three years export summary statement (2022).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ehttps://\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e\u003c/span\u003e\u003cspan address=\"http://www.fas.usda.gov\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e/ Attach\u0026eacute; Reports (GAIN) \u0026ndash; \u003cem\u003eDairy and Products Annual (2020)\u003c/em\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ehttps://\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e\u003c/span\u003e\u003cspan address=\"http://www.investindia.gov.in/sector/food-processing\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e - \u003cem\u003eDairy Industry In India: Growth, FDI, Companies, Exports\u003c/em\u003e (2022).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eIbrahim M. Evolution of Random Forest from Decision Tree and Bagging: A Bias-Variance Perspective. Dhaka Univ J Appl Sci Eng. 2022;7(1):66\u0026ndash;71.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eJaiswal S. Interstate Analysis of Productive Performance in India's Dairy Manufacturing Sector Using Stochastic Frontier Models. Q Literature Magazine- Naagfani. 2022;12(42):370\u0026ndash;7.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eJirli B, Kumar S, Pachava V, Golla SK. (2025). Forecasting milk production in India: Strategic insights for policymakers and farmers. \u003cem\u003eIndian Journal of Extension Education\u003c/em\u003e, \u003cem\u003e61\u003c/em\u003e(2), 14\u0026ndash;18.Kale, R. 
B., Ponnusamy, K., Sendhil, R., Maiti, S., Chandel, B. S., Jha, S. K., \u0026hellip; Lal, S. P. (2019). Determinants of inequality in dairy development of India. \u003cem\u003eNational Academy Science Letters\u003c/em\u003e, \u003cem\u003e42\u003c/em\u003e, 195\u0026ndash;198.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKale RB, Ponnusamy K, Chakravarty AK, Sendhil R, Mohammad A. Assessing resource and infrastructure disparities to strengthen Indian dairy sector. Indian J Anim Sci. 2016;86(6):720\u0026ndash;5.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKaliappan J, Bagepalli AR, Almal S, Mishra R, Hu YC, Srinivasan K. Impact of cross-validation on machine learning models for early detection of intrauterine fetal demise. Diagnostics. 2023;13(10):1692.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKhan F, Tomar A. (2022). Livestock Feed and Fodder Resources of India and Strategies for their Judicious Utilization: A Review. \u003cem\u003eIn Biodiversity in the Service of Mankind\u003c/em\u003e. Walnut Publication. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.6084/m9.figshare.20003078.v1\u003c/span\u003e\u003cspan address=\"10.6084/m9.figshare.20003078.v1\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKim M, Kang SR, Na MH. Prediction of Milk Production in Dairy Cows Using Statistical Regression Model and Machine Learning Methods. Korean Data Anal Soc. 2024. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.37727/jkdas.2024.26.6.1855\u003c/span\u003e\u003cspan address=\"10.37727/jkdas.2024.26.6.1855\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKishore A, Birthal PS, Joshi PK, Shah T, Saini A. 
Patterns and drivers of dairy development in India: insights from analysis of household and district-level data. Agricultural Economic Res Rev. 2016;29(1):1\u0026ndash;14. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.5958/0974- 0279.2016.00014.8\u003c/span\u003e\u003cspan address=\"10.5958/0974- 0279.2016.00014.8\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKulkarni CS. Advancing Gradient Boosting: A Comprehensive Evaluation of the CatBoost Algorithm for Predictive Modelling. J Artif Intell Mach Learn Data Sci. 2022;1(5):54\u0026ndash;7.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKumar A, Parappurathu S, Joshi PK. Structural Transformation in Dairy Sector of India. Agricultural Econ Res Rev. 2013;26(2):209\u0026ndash;20.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKumar A, Staal SJ, Singh DK. Smallholder dairy farmers\u0026rsquo; access to modern milk marketing chains in India. Agricultural Econ Res Rev. 2011;24(2):243\u0026ndash;54.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKumar S, Singh P, Devi U, Yathish KR, Saujanya PL, Kumar R, Mahanta SK. An overview of the current fodder scenario and the potential for improving fodder productivity through genetic interventions in India. Anim Nutr Feed Technol. 2023;23(3):631\u0026ndash;44.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKumar V, Chakravarty AK, Magotra A, Patil CS, Shivahre PR. Comparative study of ANN and conventional methods in forecasting first lactation milk yield in Murrah buffalo. Indian J Anim Sci. 2019;89(11):1262\u0026ndash;8.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLundberg SM, Lee SI. (2017). A unified approach to interpreting model predictions. \u003cem\u003eAdvances in neural information processing systems\u003c/em\u003e, \u003cem\u003e30\u003c/em\u003e. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.48550/arXiv.1705.07874\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.1705.07874\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLyashenko V, Jha A. (2022). Cross-validation in machine learning: how to do it right. \u003cem\u003es interneta\u003c/em\u003e, \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://neptune.ai/blog/cross-validation-in-machine-learning-how-to-do-it-\u003c/span\u003e\u003cspan address=\"https://neptune.ai/blog/cross-validation-in-machine-learning-how-to-do-it-\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003cem\u003eright, 24\u003c/em\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLyngkhoi DR, Singh SB, Singh R, Tyngkan H. Trend analysis of milk production in India. Asian J Dairy Food Res. 2022;41(2):183\u0026ndash;7. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.18805/ajdfr.DR-1789\u003c/span\u003e\u003cspan address=\"10.18805/ajdfr.DR-1789\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMinistry of Agriculture and Farmers\u0026rsquo; Welfare. (2018). \u003cem\u003eNational Action Plan for Dairy Development VISION-2022\u003c/em\u003e. Government of India. 68 p.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMishra P. (2020). Time Series Investigation of Milk Production in Major States of India Using ARIMA Modelling. J Anim Res, \u003cem\u003e10\u003c/em\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMynavathi VS, Murugeswari R, Kumar VRS, Gopi H, Valli C, Babu M. Impact of Irrigation Methods on Soil, Water, and Nutrient USE Efficiency of Integrated Cropping: Livestock Production Systems. 
Management Strategies for Water Use Efficiency and Micro Irrigated Crops. Apple Academic; 2019. pp. 35\u0026ndash;41.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNational Academy of Agricultural Sciences. 2016. \u003cem\u003eBreeding Policy for Cattle and Buffalo in India.\u003c/em\u003e \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://naas.org.in/Policy%20Papers/policy%2082.pdf\u003c/span\u003e\u003cspan address=\"https://naas.org.in/Policy%20Papers/policy%2082.pdf\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNational Academy of Agricultural Sciences. 2020. \u003cem\u003eLivestock Improvement through Artificial Insemination.\u003c/em\u003e \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://naas.org.in/Policy%20Papers/policy%2096.pdf\u003c/span\u003e\u003cspan address=\"https://naas.org.in/Policy%20Papers/policy%2096.pdf\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNITI Aayog. Agriculture, Animal Husbandry and Fisheries Sector Report. 274p ed. Government of India; 2021.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eOzella L, Rebuli B, Forte K, C., Giacobini M. (2023). A literature review of modeling approaches applied to data collected in automatic milking systems. \u003cem\u003eAnimals\u003c/em\u003e, \u003cem\u003e13\u003c/em\u003e(12), 1916.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePaul RK, Alam W, Paul AK. Prospects of livestock and dairy production in India under time series framework. Indian J Anim Sci. 2014;84(4):462\u0026ndash;6.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePerumal A, Mazumdar C, Srinatha TN, Rath S, Likhitha S. 
Mitigating Climate Impact: A Machine Learning Approach to Forecast Methane Emissions from Indian Livestock. Indian J Agric Econ. 2024;79(3):610\u0026ndash;20.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePetare PA. (2013, August 11). \u003cem\u003eIssues and challenges of supply chain management with perspective to Indian dairy industry\u003c/em\u003e [Paper Presentation]. International Conference on Issues and Challenges in Current Global Economy: It\u0026rsquo;s Impact on Commerce, Engineering and Technology, Pune, India.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePotdar K, Pardawala TS, Pai CD. A comparative study of categorical variable encoding techniques for neural network classifiers. Int J Comput Appl. 2017;175(4):7\u0026ndash;9.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePrajapati B, Prajapati J, Kumar K, Srivastava A. Determination of the relationships between quality parameters and yields of fodder obtained from intercropping systems by correlation analysis. Forage Res. 2019;45(3):219\u0026ndash;24.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRajeshwaran S, Naik G, Dhas RAC. (2014). Rising milk price \u0026ndash; A cause for concern on food security. Working Paper 472. Bangalore: Indian Institute of Management.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRanglani H. (2024). Empirical Analysis of the Bias-Variance Tradeoff Across Machine Learning Models. Mach Learn Applications: Int J (MLAIJ) \u003cem\u003eVol\u003c/em\u003e. \u003cem\u003e11\u003c/em\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRao MK. Factors Affecting Milk Production: A Case Study in Andhra Pradesh. J Rural Dev. 2017;36(1):21\u0026ndash;32.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSagi O, Rokach L. (2018). Ensemble learning: A survey. 
Wiley interdisciplinary reviews: data Min Knowl discovery, 8(4), e1249.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSaha A, Bhattacharyya S. Artificial insemination for milk production in India: a statistical insight. Indian J Anim Sci. 2020;90(8):1186\u0026ndash;90.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSarkar A, Gupta H, Dutta A. (2022). In search of the determinants of dairy production in an emerging market: a panel data approach.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSekhar CSC, Roy D, Bhatt Y. 2017. Food inflation and food price volatility in India: Trends and determinants. IFPRI Discussion Paper 1640. Washington, DC: International Food Policy Research Institute (IFPRI). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://hdl.handle.net/10568/147644\u003c/span\u003e\u003cspan address=\"https://hdl.handle.net/10568/147644\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eShankar SV, Ajaykumar R, Ananthakrishnan S, Aravinthkumar A, Harishankar K, Sakthiselvi T, Navinkumar C. Modeling and forecasting of milk production in the western zone of Tamil Nadu. Asian J Dairy Food Res. 2023;42(3):427\u0026ndash;32.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSharma DK, Garg A, Kumar A. Ensemble learning in machine learning: Integrating multiple models for improved predictions. Int J Appl Res. 2018;4(7):61\u0026ndash;5. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.22271/allresearch.2018.v4.i7a.11443\u003c/span\u003e\u003cspan address=\"10.22271/allresearch.2018.v4.i7a.11443\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSoomro AA, Mokhtar AA, Hussin HB, Lashari N, Oladosu TL, Jameel SM, Inayat M. 
Analysis of machine learning models and data sources to forecast burst pressure of petroleum corroded pipelines: A comprehensive review. Eng Fail Anal. 2024;155:107747. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.engfailanal.2023.107747\u003c/span\u003e\u003cspan address=\"10.1016/j.engfailanal.2023.107747\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSubbanna YB, Kumar S, Puttaraju SKM. Forecasting buffalo milk production in India: Time series approach. Buffalo Bull. 2021;40(2):335\u0026ndash;43. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://kuojs.lib.ku.ac.th/index.php/BufBu/article/view/3993\u003c/span\u003e\u003cspan address=\"https://kuojs.lib.ku.ac.th/index.php/BufBu/article/view/3993\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSuseendran G, Duraisamy B. (2021). Prediction of Dairy Milk Production Using Machine Learning Techniques. \u003cem\u003eIntelligent Computing and Innovation on Data Science\u003c/em\u003e. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1007/978-981-16-3153-5_60\u003c/span\u003e\u003cspan address=\"10.1007/978-981-16-3153-5_60\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eTricarico JM, Kebreab E, Wattiaux MA. (2020). MILK Symposium review: Sustainability of dairy production and consumption in low-income countries with emphasis on productivity and environmental impact. Journal of Dairy Science, 103(11), 9791\u0026ndash;9802. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.3168/jds.2020-18269\u003c/span\u003e\u003cspan address=\"10.3168/jds.2020-18269\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eVasant PG, Zhang-Yue Z. Rising demand for livestock products in India: nature, patterns and implications. Australasian Agribusiness Rev. 2010;18:103\u0026ndash;35.\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"discover-artificial-intelligence","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"diai","sideBox":"Learn more about [Discover Artificial Intelligence](https://www.springer.com/44163)","snPcode":"","submissionUrl":"","title":"Discover Artificial Intelligence","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"Discover Series","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"Machine learning, Predictive modelling, Milk production, India","lastPublishedDoi":"10.21203/rs.3.rs-8702287/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-8702287/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eMilk production in India is integral to its food and nutritional security and is instrumental in achieving SDGs 1, 2, and 5. 
Accurate early prediction of milk production is crucial for data-driven policymaking, with implications for resource allocation and supply chain optimization. However, modelling milk production remains complex due to its dependence on multiple factors and the significant regional heterogeneity within the country. While the application of Artificial Intelligence (AI) driven approaches has expanded modelling capabilities in the realm of livestock production, their application in modelling milk production remains underexplored within the Indian context. Therefore, the present study aimed to develop state-level predictive models for milk production, employing a suite of Machine Learning (ML) algorithms.\u003c/p\u003e \u003cp\u003eTime-series data encompassing fifteen variables from 2000-01 to 2022-23 across twenty-seven Indian states were systematically collected and analyzed. Seven machine learning techniques, including Decision Tree (DT), Random Forest (RF), eXtreme Gradient Boosting (XGB), Light Gradient Boosting Machine (LightGBM), CatBoost (CB), and Ensemble techniques, were tested and compared using an extensive array of performance metrics. Notably, the ensemble technique combining DT, RF, and XGB demonstrated superior predictive accuracy, surpassing individual models. By contrast, DT performed worst among the models compared. 
The findings clearly point to genetic improvement, remunerative pricing for producers, and assured irrigation as strategic priorities to boost India\u0026rsquo;s milk production.\u003c/p\u003e","manuscriptTitle":"Machine Learning Driven Approach for Modelling Milk Production in India","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2026-02-16 09:15:18","doi":"10.21203/rs.3.rs-8702287/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Revision requested","date":"2026-02-25T11:58:10+00:00","index":"","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2026-02-24T08:11:32+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"334632543257530834748151559572787880369","date":"2026-02-17T05:47:14+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"188870050206385832783972865360006127538","date":"2026-02-13T09:10:46+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2026-02-11T10:20:38+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"196116780135626405213359747277763398607","date":"2026-02-11T09:58:46+00:00","index":"hide","fulltext":""},{"type":"reviewersInvited","content":"","date":"2026-02-11T07:40:29+00:00","index":"","fulltext":""},{"type":"editorInvited","content":"","date":"2026-02-06T06:51:43+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2026-02-05T02:38:17+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2026-02-04T14:32:53+00:00","index":"","fulltext":""},{"type":"submitted","content":"Discover Artificial Intelligence","date":"2026-02-04T14:22:43+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"discover-artificial-intelligence","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"diai","sideBox":"Learn more about [Discover 
Artificial Intelligence](https://www.springer.com/44163)","snPcode":"","submissionUrl":"","title":"Discover Artificial Intelligence","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"Discover Series","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"75ffac43-5892-432e-b9c4-e22a7b6b5466","owner":[],"postedDate":"February 16th, 2026","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"in-revision","subjectAreas":[],"tags":[],"updatedAt":"2026-05-18T11:24:51+00:00","versionOfRecord":[],"versionCreatedAt":"2026-02-16 09:15:18","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-8702287","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-8702287","identity":"rs-8702287","version":["v1"]},"buildId":"XKTyCvWXoU3ODBz1xrDgd","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
