Research on Applying Machine Learning Models to Predict and Assess Return on Assets (ROA)

Vu Hong Son Pham, Tung Duong Le

Research Article, Research Square. This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-4129810/v1
This work is licensed under a CC BY 4.0 License.
Status: Under Review. Version 1.

Abstract

Return on Assets (ROA), a profitability measure, is crucial in corporate finance for assessing how efficiently a company uses its assets to generate profit. At present, predicting the ROA index is a tedious, manual process. It usually involves making educated guesses or waiting for accurate data, which become available only after financial reports have been compiled. This paper introduces a machine learning model for predicting the ROA index.
The model draws on data from 78 companies listed on the Vietnam Stock Exchanges (HOSE and HNX) over the period 2012 to 2022. The Random Forest (RF) model was tested on datasets from selected Vietnamese businesses in 2023. The results showed high accuracy, with an error rate of less than 1%, an R² value of 0.9762, and a Root Mean Square Error (RMSE) of 0.5826. These findings indicate potential real-world uses in predicting and improving business performance. In conclusion, the integration of machine learning into financial analysis and prediction represents substantial progress: it enhances both accuracy and efficiency and holds promise for future advances in financial management practice. This study aims to encourage further research and development in this area, leading to more advanced and efficient financial management tools.

Keywords: profit, working capital, debt ratio, growth rate, Vietnamese construction enterprises, machine learning models, optimization

Introduction

In recent years, the application of information technology in the construction industry has attracted the interest of researchers. Numerous studies, both domestic and international, have explored the use of artificial intelligence (AI) in various types of projects, including civil projects (L. N. Son 2023), (L. N. Son 2024), (N. D. Son 2024), (Nguyen Dang Nghiep Trinh 2024), (N. T. Son 2024), transport (N. V. Son 2023), and electricity (T. H. Son 2023).

In construction management, machine learning (ML) techniques have improved the predictability and understanding of profitability within the industry. Recent studies highlight the growing role of ML in forecasting financial outcomes, a technological shift that promises to refine strategic planning and decision-making. (E. Adinyira 2021) pioneered the use of a Support Vector Regression Algorithm (SVRA) to predict construction project profit margins in Ghana, reporting a predictive accuracy of 73.66%. This study demonstrates the applicability of ML to construction profitability forecasts and sets a benchmark for future research in similar emerging markets. Similarly, (Hong Zhang 2015) employed principal component analysis (PCA) and a support vector machine (SVM) to predict the profitability of Chinese construction firms, achieving an accuracy rate exceeding 80%. These findings show that ML algorithms can interpret multifaceted financial data effectively.

Adding to this body of knowledge, (Mahfouz 2012) introduced a decision support system that estimates productivity rates in construction projects by integrating SVM and Naive Bayes models. This approach underscores the versatility of ML in handling diverse construction management challenges and the importance of predictive accuracy for project outcomes.

In a study of the Vietnamese context, (Thi Nhu Hoa Le 2020) identified a set of critical factors influencing the profitability of local construction companies. Company age, debt ratio, growth rate, asset utilization performance, company size, and the proportion of fixed assets were highlighted as determinants of financial success. This research offers insight into the profitability drivers specific to the Vietnamese market.

Furthermore, (Wassie 2020) investigates the impact of capital structure on the profitability of construction firms in Ethiopia, revealing that both the debt-to-equity ratio and the long-term-debt-to-total-assets ratio exhibit a significant positive correlation with ROE and ROA.
This finding aligns with the broader view that capital structure decisions play a pivotal role in determining company value and operating costs, and it underlines the need for strategic financial planning in the construction domain. Moreover, a conceptual review by (Ngo 2023) of working capital management practices in the construction industry highlights the intricate relationship between working capital components and firm profitability, further substantiating the critical role of financial management in the sector's sustainability. This is consistent with the exploration by (Soa La Nguyen 2023) of the associations between ROE, ROA, liquidity, and debt, which emphasizes the nuanced impact of short-term loans on firm profitability and points to the potential of ML to refine such predictive analyses.

Collectively, these studies underscore the potential of ML in the construction industry and present a promising avenue for future research and application. By using advanced analytical techniques, the construction sector can achieve greater accuracy in profitability predictions, enabling better-informed strategic decisions that drive sustainable growth and financial stability.

This study introduces a suite of machine learning (ML) models, chosen according to criteria such as data characteristics, computational efficiency, and the specific objectives of the analysis. The investigation covers a variety of algorithms, including supervised regression techniques such as Lasso, Ridge Regression (RR), k-Nearest Neighbors Regressor (KNR), and Support Vector Regression (SVR), as well as ensemble methods such as Random Forest (RF), Extra Trees Regressor (ETR), Gradient Boosting Regressor (GBR), and Extreme Gradient Boosting (XGBoost). Additionally, the study evaluates an artificial neural network approach through the Multilayer Perceptron (MLP) model.
The effectiveness of these models is measured quantitatively using the R-squared (R²) and Root Mean Square Error (RMSE) metrics. The following sections detail the ML methodologies under review, the research approach employed, and a discussion of the findings derived from the analysis.

Research methodology

A range of studies have demonstrated the effectiveness of the random forest algorithm in various financial applications. (Yongsong Cai 2020) and (WU 2012) found that the algorithm outperformed other methods in predicting enterprise return on net assets and diagnosing assets impairment, respectively. (Zhu 2019) applied the algorithm to forecasting the direction of fund return rates and to stock selection, reporting positive results. (Sevil 2020) and (McGroarty 2014) extended the algorithm's application to predicting IPO initial returns and developing an automated trading system, respectively, with both studies finding the algorithm superior to other methods. (Scornet 2016) and (Breiman 1999) provided comprehensive overviews of the algorithm, highlighting its versatility and robustness.

2.1 Return on assets (ROA) forecasting model:

Predicting Return on Assets (ROA) enables businesses to develop risk management strategies and adjust their financial posture more effectively by forecasting future outcomes. This helps businesses in:
- Making critical strategic decisions for the company's future, guiding the business towards sustainable and financially secure growth.
- Planning and adjusting financial ratios (such as leverage ratio, working capital, etc.) to optimize profits.

The aim of this study is to contribute to financial management practices within the construction industry, while also providing a model trained on this dataset for predicting and assessing ROA in the construction sector in Vietnam.
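As a point of reference for the target variable, the snippet below is a minimal sketch of the ROA definition used in this paper's variable table (net income divided by average total assets, expressed in percent). It assumes that "average total assets" means the mean of the opening and closing balances, which is a common convention but is not stated in the text. The function name and the figures are hypothetical.

```python
def return_on_assets(net_income, total_assets_start, total_assets_end):
    """Return ROA in percent: net income / average total assets * 100.

    Assumption: average total assets is the mean of the opening and
    closing balances of the period (not specified in the paper).
    """
    average_assets = (total_assets_start + total_assets_end) / 2.0
    if average_assets == 0:
        raise ValueError("average total assets must be non-zero")
    return 100.0 * net_income / average_assets


# Hypothetical figures, in billions of VND.
roa = return_on_assets(net_income=12.0,
                       total_assets_start=380.0,
                       total_assets_end=420.0)
print(f"ROA = {roa:.2f}%")  # 12 / 400 * 100
```

Because ROA can only be computed this way once the financial statements are final, the models discussed next try to estimate it earlier from other indicators.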
2.1.1 Linear regression analysis model

Linear regression analysis is a statistical method used to examine the relationship between one or more independent variables, also known as feature variables, and a dependent variable, or target variable. The goal is to predict the value of the dependent variable from the independent ones by identifying the straight line that minimizes the discrepancy between the predicted and actual values of the dependent variable. (Khoi 2019) identified internal variables such as firm size, return on equity, and earnings per share as significant predictors of ROA in the Vietnamese stock market by using a basic regression model. Moreover, the predictability of real estate asset returns using a vector regression model that incorporates financial spreads was found to decline over longer forecasting horizons (Tsolacos 2010). The single (non-ensemble) regression models used in our study are Lasso Regression, Ridge Regression, K-Neighbors Regression (KNR), and Support Vector Regression (SVR). Of these, Lasso and Ridge are regularized linear models, whereas KNR and SVR are not restricted to linear relationships.

2.1.2 Ensemble model

Ensemble models, first introduced in the early 1990s (Hansen 1990), saw significant development in the early 21st century (Friedman 2001). Unlike a single regression model, an ensemble is a machine learning approach that combines the predictions of multiple individual models. Its aim is to improve generalization and accuracy by combining the predictions of base estimators built with a given learning algorithm. Two main families, averaging and boosting, are usually distinguished:

Averaging builds several estimators independently and then averages their predictions. The combined estimator generally outperforms any single base estimator because its variance is reduced. Models such as RF and ETR operate on this principle.

Boosting, on the other hand, builds base estimators sequentially, with the aim of reducing the bias of the combined estimator. The objective is to combine several weak models into a powerful ensemble. Models such as GBR and XGBoost fall under this category.

2.1.3 Artificial neural network model

Artificial Neural Networks (ANNs) are computational models inspired by biological neural networks. Their origins date back to the 1940s and 1950s, when Warren McCulloch and Walter Pitts first conceptualized the neuron model (McCulloch 1943). This work laid the groundwork for the first artificial neural models. The field progressed significantly in the 1990s with the introduction of advanced models such as the Convolutional Neural Network (CNN) and the Counterpropagation Neural Net (CPN) (A. &. Kaveh 1998). These models have been applied to various tasks, including speech recognition, natural language processing, and image and video analysis. Today, ANNs are prevalent in both industry and research, and the field is evolving rapidly (A. &. Kaveh 2023). (Mehmet Kayakus 2023) found that artificial neural networks (ANNs) and support vector regression (SVR) were successful in predicting ROA in the iron and steel industry.

The Multilayer Perceptron (MLP), a type of ANN with one or more hidden layers, is particularly useful for regression problems. It evolved from the Perceptron model of the 1950s and 1960s, and the 1980s saw the emergence of multilayer neural network architectures for tackling complex problems. MLP remains a commonly used machine learning model. In this paper, we use the MLP model to examine predictability on our existing dataset.

2.2 Model evaluation indicators:

RMSE, or Root Mean Square Error, measures the average magnitude of the error, also known as the residual, between the predicted value and the actual value.
The smaller the RMSE value, the smaller the error and the more accurate the estimate, which suggests that the model is reliable. For the formula, refer to (Hodson 2022):

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

R², or the coefficient of determination, measures the proportion of the variance in the target data that the regression model can explain. An R² value close to 1 is typically considered to indicate good predictive ability. For the formula, refer to (Davide Chicco 2021):

$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$

In these formulas, $y_i$ is the actual value of observation i, $\hat{y}_i$ is the predicted value of observation i, $\bar{y}$ is the average value of y, and n is the number of samples.

Evaluation of machine learning models

3.1 Data set overview

3.1.1 Data collection

The necessary data are selected according to the analysis requirements and the model under construction. Identifying the right data source, its format, and the collection method is a crucial part of the process. The types of data collected are dictated by the purpose of the research and the intended results. Specifically:
- Data on each company's business results are gathered from financial reports and balance sheets (revenue, cost of goods sold, financial costs, selling costs, profit, interest, etc.) available at https://finance.vietstock.vn/.
- Financial variables are computed using specific formulas.

3.1.2 Data processing

a. Missing values

To manage missing values while preserving the data structure, two methods are proposed:
i) Replace the missing value with the column's mean or median. This technique is generally applied when the number of missing values is insignificant.
ii) Exclude the row containing the missing value from the data series. In a time series such as power or meteorological data, the discarded row would contain all the data for the time step associated with the missing value.

b. Data analysis:

This study applies the cross-validation technique to assess the machine learning models. We used 5 folds to limit the computational resources required.
Compared with the conventional technique (which splits the data into a training part and a testing part), cross-validation differs as follows:

- Data use efficiency:
  Cross-validation: each sample is used for testing once and for training (k-1) times, maximizing data usage, which is crucial for small datasets.
  Traditional split: the data are divided in a fixed manner, typically 70% for training and 30% for testing. Each sample is used only for training or only for testing, which does not make the best use of all the data.
- Bias and variance reduction:
  Cross-validation: since each sample is used for both training and testing, this method helps curb bias and variance, giving a more accurate estimate of model performance.
  Traditional split: bias or variance can be high if the split does not accurately represent the dataset's full structure or distribution, particularly if the dataset is small or lacks diversity.
- Flexibility and general applicability:
  Cross-validation: offers a more flexible and general evaluation method that suits various data types and problems. The number of folds can be adjusted to balance computational cost against the accuracy of the performance estimate.
  Traditional split: suitable when data are plentiful and a quick model evaluation is needed. However, it may not accurately reflect the model's applicability to new data.
- Computation and time:
  Cross-validation: needs more computational resources and time because the model is trained k times, but this is considered a fair trade-off for a more accurate performance estimate.
  Traditional split: requires fewer resources and less time, making it a good option when computational resources are limited or quick results are needed, at the possible expense of the accuracy of the performance estimate.
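The comparison above can be sketched in a few lines of scikit-learn. This is an illustrative sketch only: it uses a synthetic dataset from `make_regression` as a stand-in for the study's 740-row financial table, a Random Forest with default-like settings chosen here for illustration, and a 70/30 split and 5 folds as described in the text. The variable names are hypothetical and the printed scores have no relation to the paper's results.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import KFold, cross_val_predict, train_test_split

# Synthetic stand-in (740 samples, 14 features); NOT the study's data.
X, y = make_regression(n_samples=740, n_features=14, n_informative=8,
                       noise=10.0, random_state=0)

model = RandomForestRegressor(n_estimators=50, random_state=0)

# Traditional split: 70% training, 30% testing, evaluated once.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
model.fit(X_train, y_train)
holdout_pred = model.predict(X_test)
holdout_rmse = float(np.sqrt(mean_squared_error(y_test, holdout_pred)))
holdout_r2 = r2_score(y_test, holdout_pred)

# 5-fold cross-validation: every sample is predicted exactly once by a
# model that was trained on the other four folds.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_pred = cross_val_predict(model, X, y, cv=cv)
cv_rmse = float(np.sqrt(mean_squared_error(y, cv_pred)))
cv_r2 = r2_score(y, cv_pred)

print(f"hold-out: RMSE={holdout_rmse:.3f}  R2={holdout_r2:.3f}")
print(f"5-fold  : RMSE={cv_rmse:.3f}  R2={cv_r2:.3f}")
```

The hold-out estimate depends on which rows happen to fall in the 30% test part, while the cross-validated estimate uses all 740 rows for testing, at the cost of fitting the model five times.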
3.2 Data description

3.2.1 Information about the financial database:

The forecast model is built on data collected from the financial reports of 76 construction firms listed on the Vietnam Stock Exchange (HOSE, HNX), spanning 2012 to 2022. The data cover comprehensive business results and balance sheets.

3.2.2 Description of financial database:

The financial results of companies play a significant role in determining financial indicators. Summarized financial report information can be accessed from the website https://finance.vietstock.vn/. This website, established on 02/08/2002, aims to provide accurate data about corporate finance, stocks, bonds, and macro information. The data from this website are used in this article for calculations.

3.3 Variables in the predictive model

The model includes 14 independent variables (features) and 1 dependent variable (target) (Table 1).

Table 1: Statistics of variables included in the model (Source: Synthesized by the authors) (Ngo 2023)

| No. | Name/Variable Name | Unit | Measurement |
| 1 | Return on Asset/ROA | % | Net income / Average total assets |
| 2 | Days Sales Outstanding/DSO | Days | (Average accounts receivable balance / Net sales) x 365 |
| 3 | Days Inventory Outstanding/DIO | Days | (Average inventory balance / Cost of goods sold) x 365 |
| 4 | Days Payable Outstanding/DPO | Days | (Average accounts payable balance / Cost of goods sold) x 365 |
| 5 | Cash Conversion Cycle/CCC | Days | DSO + DIO - DPO |
| 6 | Net Working Capital/NWC | Billions of VND | Current Assets - Current Liabilities |
| 7 | Size/SIZ | Billions of VND | Natural logarithm of net sales |
| 8 | Financial Leverage/LEV | % | Total Debt / Total Assets |
| 9 | Current Ratio/CR | % | Current Assets / Current Debt |
| 10 | Growth Rate/GRO | % | Percentage change in net sales |
| 11 | Return on Equity/ROE | % | Net Income / Shareholder's Equity |
| 12 | Earnings Before Interest Taxes/EBIT | Billions of VND | Revenue - Operating Expenses (excluding interest and taxes) |
| 13 | Gross Margin/GROS | % | (Total Revenue - Cost of Goods Sold) / Total Revenue |
| 14 | Quick Ratio/QR | % | (Current Assets - Inventories) / Current Liabilities |
| 15 | Return/RE | Billions of VND | Net Profit Margin |

The characteristics of the dataset are illustrated by a graph (Figure 2), which shows the frequency of values.

Results and discussion

When preparing data for machine learning algorithms, an essential step is to explore, visualize, and preprocess the available features. This step gives a clear overview of the data that will be fed into the model and helps identify anything that needs further refinement before proceeding.

The dataset used here consists of 740 rows and 15 columns, which gives the machine learning algorithms a reasonably large sample to learn from. To examine the details of the available columns in the dataset, the pandas data analysis library is employed.
It provides an efficient means of handling and exploring the data in detail. This article also examines the relationships in the data by studying the correlation between variables, in order to select the most suitable data for the forecasting model. Here, the Pearson correlation is used to identify quantities that are highly correlated with the ROA value. The results of this analysis are shown in Figure 3. The correlation matrix reveals that:
- Variables DSO, DIO, LEV, GRO, ROE, EBIT, GROS, and RE have a strong correlation with ROA.
- Variables DPO, CCC, NWC, SIZ, CR, and QR have a weak correlation with ROA.

Despite the weak correlations, all variables are retained in the machine learning models, because the Pearson coefficient captures only linear relationships and the models may also exploit non-linear ones.

The results of the models applied to predict return on assets (ROA), cross-validated with five folds (n_splits = 5), are presented in Table 2. Among these, the ensemble models (RF, GBR, and XGBoost) yield the best results.

Table 2: Comparison results of models

| No. | Model | R² | RMSE |
| 1 | Lasso | 0.5908 | 2.001 |
| 2 | RR | 0.7194 | 2.0012 |
| 3 | KNR | 0.452 | 2.79 |
| 4 | SVR | 0.3968 | 2.9339 |
| 5 | RF | 0.9762 | 0.5826 |
| 6 | GBR | 0.97 | 0.6528 |
| 7 | XGBoost | 0.9822 | 0.504 |
| 8 | MLP | 0.838 | 1.517 |

The ranking of the models by R² and RMSE is shown in Figure 4. The model is evaluated and selected based on three criteria:
- R² score: the closer to 1, the better.
- RMSE score: the lower, the better.
- Learning curves chart: this chart plots the training and test scores against the number of training samples.
The closer these two lines are, the better the model generalizes.

In deciding between the RF and XGBoost models, both of which have high R² and low RMSE scores, we observe a larger gap between the training and testing lines for XGBoost (1 unit) than for RF (approximately 0.7 units). Therefore, this paper uses the RF model for computation.

Calibration of RF model

The GridSearchCV tool searches for the optimal hyperparameters of a machine learning model. It tests a range of parameter values and measures the model's performance for each combination, using cross-validation to evaluate the model across different training and validation subsets. In simple terms, GridSearchCV takes a discrete set of candidate values for each parameter and forms every combination of these values. The model is trained with each combination, and its performance is evaluated by cross-validation. The combination with the best validation performance is then selected. The selected parameters are:
- max_depth = 10
- min_samples_leaf = 2
- min_samples_split = 5
- n_estimators = 200

Looking at the ranking of feature importance, the ROE index has the most significant impact on ROA. The variables LEV, EBIT, and CR also have a substantial influence on the ROA index (Figure 5).

From the learning curves (Figure 6), we can make the following observation:
- The training score and cross-validation score remain relatively stable. As more samples are used for training, the two scores converge or nearly converge. This suggests that the model generalizes well.
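The calibration steps above (grid search, feature-importance ranking, learning curves) can be sketched with scikit-learn as follows. This is a sketch under stated assumptions, not the authors' script: the data are synthetic, the candidate grid is hypothetical (it merely contains the reported values, since the grid actually searched is not given in the text), and RMSE is assumed as the selection metric.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold, learning_curve

# Synthetic stand-in with 14 features; NOT the study's financial data.
X, y = make_regression(n_samples=200, n_features=14, n_informative=8,
                       noise=10.0, random_state=0)

# Hypothetical candidate grid; it only happens to include the values
# reported in the text (10, 2, 5, 200).
param_grid = {
    "max_depth": [5, 10],
    "min_samples_leaf": [2],
    "min_samples_split": [2, 5],
    "n_estimators": [50, 200],
}

cv = KFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",  # assumed selection metric
    cv=cv,
)
search.fit(X, y)  # trains every combination on every fold, then refits

best_model = search.best_estimator_
best_rmse = -search.best_score_  # the scorer returns negative RMSE

# Impurity-based feature importances of the refitted forest, ranked.
importances = best_model.feature_importances_
ranking = np.argsort(importances)[::-1]

# Learning curve: train and validation R2 at growing training sizes.
train_sizes, train_scores, val_scores = learning_curve(
    best_model, X, y, cv=cv, train_sizes=[0.3, 0.6, 1.0], scoring="r2")
gap = train_scores.mean(axis=1) - val_scores.mean(axis=1)

print("best parameters:", search.best_params_)
print(f"cross-validated RMSE: {best_rmse:.3f}")
print("feature indices by importance:", ranking.tolist())
print("train/validation R2 gap per training size:", np.round(gap, 3))
```

A gap that shrinks as the training size grows corresponds to the converging curves described above; on the real data, the column names of the feature table would replace the integer indices in the importance ranking.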
3.5 Prediction:

The predictive model outlined in this article can assist business owners in the following ways:
- Enhancing business decisions: the model predicts ROA before the final profit indicators have been calculated, enabling businesses to anticipate future scenarios and analyze "What-If" situations. This supports planning and risk management.
- Streamlining and automating the prediction process: applying machine learning models standardizes the prediction process, making it automated, quick, and less reliant on manual intervention.
- Integrating data, knowledge, and context: while this model considers only 15 variables, integrating additional variables and context into the analysis can better reflect the actual situation and the influence of the external business environment.

When applying these machine learning models to forecast the ROA of construction businesses in 2023, we obtained the results shown in Figure 7.

Despite the good results (error < 1%), this study has not yet been validated on a larger dataset. Doing so would involve a larger number of feature variables and require further enhancement of the computational model to improve the accuracy of the prediction tool.

Conclusions, application and Proposal

This article presents a financial perspective on the field of construction, applying machine learning models such as linear regression, ensemble models, and neural network models. The results indicate that the RF model delivered better results than the second-ranked GBR model (an R² higher by 0.0062, that is, 0.62 percentage points), and that it generalized better than the XGBoost model when the distance between the training and testing lines on the learning curves chart is considered. However, the SVR and KNR models were found to be unsuitable for this dataset, given their poor R² results (less than 0.5).
This study sets the stage for various research avenues that could be pursued in the near future, including: comparative analysis across industries, temporal stability of ML predictions, integration with other financial health indicators, impact of external variables, advanced ML techniques and algorithms, cross-country comparisons, ML interpretability and decision-making, and sustainability and environmental considerations. These proposed research avenues can expand the understanding built by this study. Moreover, they could offer valuable insights for industry professionals, policymakers, and scholars interested in the intersection of finance, construction, and machine learning.

Declarations

Funding
The authors declare that no funds, grants, or other support were received during the preparation of this manuscript.

Competing Interests
The authors have no relevant financial or non-financial interests to disclose.

Author Contributions
Both authors wrote, prepared, and reviewed the manuscript.

Acknowledgments
For this work, we gratefully acknowledge the time and facilities provided by Ho Chi Minh City University of Technology (HCMUT), VNUHCM.

References

Breiman, L. 1999. "Random Forests--random Features." Computer Science, Mathematics. https://www.stat.berkeley.edu/~breiman/random-forests.pdf.
Davide Chicco, Matthijs J. Warrens, Giuseppe Jurman. 2021. "The coefficient of determination R-squared is more informative than SMAPE, MAE, MAPE, MSE and RMSE in regression analysis evaluation." PeerJ Computer Science. https://doi.org/10.7717/peerj-cs.623.
E. Adinyira, E. Adjei, K. Agyekum, F. Fugar. 2021. "Application of machine learning in predicting construction project profit in Ghana using Support Vector Regression Algorithm (SVRA)." Engineering, Construction and Architectural Management. https://doi.org/10.1108/ECAM-08-2020-0618.
Friedman, J. H. 2001. "Greedy function approximation: A gradient boosting machine." The Annals of Statistics. https://doi.org/10.1214/aos/1013203451.
Hansen, L. K., & Salamon, P. 1990. "Neural network ensembles." IEEE Transactions on Pattern Analysis and Machine Intelligence. https://doi.org/10.1109/34.58871.
Hodson, Timothy O. 2022. "Root-mean-square error (RMSE) or mean absolute error (MAE): when to use them or not." Geoscientific Model Development. https://doi.org/10.5194/gmd-15-5481-2022.
Hong Zhang, Fei Yang, Yang Li, Heng Li. 2015. "Predicting profitability of listed construction companies based on principal component analysis and support vector machine: Evidence from China." Automation in Construction 53, 22-28. https://doi.org/10.1016/J.AUTCON.2015.03.001.
Kaveh, A., & Iranmanesh, A. 1998. "Comparative study of backpropagation and improved counterpropagation neural nets in structural analysis and optimization." International Journal of Space Structures 13(4), 177-185. https://doi.org/10.1177/026635119801300.
Kaveh, A., & Khavaninzadeh, N. 2023. "Efficient training of two ANNs using four meta-heuristic algorithms for predicting the FRP strength." Structures 52, 256-272. https://doi.org/10.1016/j.istruc.2023.03.178.
Khoi, Lucille V. Pointer & Phan. 2019. "Predictors of Return on Assets and Return on Equity for Banking and Insurance Companies on Vietnam Stock Exchange." Entrepreneurial Business and Economics Review. https://doi.org/10.15678/EBER.2019.070411.
Mahfouz, Tarek. 2012. "A Productivity Decision Support System for Construction Projects Through Machine Learning (ML)." Engineering, Computer Science.
McCulloch, W. S., & Pitts, W. 1943. "A logical calculus of the ideas immanent in nervous activity." The Bulletin of Mathematical Biophysics. https://doi.org/10.1007/BF02478259.
McGroarty, Ash Booth & Enrico Gerding & Frank. 2014. "Automated trading with performance weighted random forests and seasonality." Expert Systems with Applications. https://doi.org/10.1016/j.eswa.2013.12.009.
Mehmet Kayakus, Burçin Tutcu, Mustafa Terzioglu, Hasan Talaş and Güler Ferhan Ünal Uyar. 2023. "ROA and ROE Forecasting in Iron and Steel Industry Using Machine Learning Techniques for Sustainable Profitability." Sustainability. https://doi.org/10.3390/su15097389.
Ngo, Thi Quy Vo and Ngoc Cuong. 2023. "Does working capital management matter? A comparative case between consumer goods firms and construction firms in Vietnam." Cogent Business & Management. https://doi.org/10.1080/23311975.2023.2271543.
Nguyen Dang Nghiep Trinh, Nguyen Van Nam, Pham Vu Hong Son. 2024. "Achieving improved performance in construction projects: advanced time and cost optimization framework." Evolutionary Intelligence. https://doi.org/10.1007/s12065-024-00918-7.
Rincy, T. N., & Gupta, R. 2020. "Ensemble Learning Techniques and its Efficiency in Machine Learning: A Survey." 2nd International Conference on Data, Engineering and Applications (IDEA). https://doi.org/10.1109/IDEA49133.2020.9170675.
Scornet, Gérard Biau & Erwan. 2016. "A random forest guided tour." TEST. https://doi.org/10.1007/S11749-016-0481-7.
Sevil, Boubekeur Baba & Güven. 2020. "Predicting IPO initial returns using random forest." Borsa Istanbul Review. https://doi.org/10.1016/J.BIR.2019.08.001.
Soa La Nguyen, C. Pham, Tu Van Truong, Trong Van Phi, Linh Le, T. Vu. 2023. "Relationship between Capital Structure and Firm Profitability: Evidence from Vietnamese Listed Companies." International Journal of Financial Studies. https://doi.org/10.3390/ijfs11010045.
Son, Luu Ngoc Quynh Khoi & Pham Vu Hong. 2024. "Artificial intelligent support model for multiple criteria decision in construction management." OPSEARCH. https://doi.org/10.1007/s12597-024-00749-1.
Son, Luu Ngoc Quynh Khoi & Pham Vu Hong. 2023. "Optimization in Construction Management Using Adaptive Opposition Slime Mould Algorithm." Advances in Civil Engineering. https://doi.org/10.1155/2023/7228896.
Son, Nguyen Dang Nghiep Trinh & Nguyen Van Nam & Pham Vu Hong. 2024. "Advanced vehicle routing in cement distribution: a discrete Salp Swarm Algorithm approach." International Journal of Management Science and Engineering Management. https://doi.org/10.1080/17509653.2024.2324172.
Son, Nguyen Trieu Vi & Pham Vu Hong. 2024. "Applying ant colony optimization algorithm to optimize construction time and costs for mass concrete projects." Asian Journal of Civil Engineering. https://doi.org/10.1007/s42107-024-00990-5.
Son, Nguyen Van Nam & Pham Vu Hong. 2023. "Cement Transport Vehicle Routing with a Hybrid Sine Cosine Optimization Algorithm." Advances in Civil Engineering. https://doi.org/10.1155/2023/2728039.
Son, Tran Hoang Duy & Pham Vu Hong. 2023. "Research on applying machine learning models to predict the electricity generation capacity of rooftop solar energy systems on buildings." Asian Journal of Civil Engineering. https://doi.org/10.1007/s42107-023-00722-1.
Tsolacos, Chris Brooks & Sotiris. 2010. "Forecasting real estate returns using financial spreads." Journal of Property Research. https://doi.org/10.1080/09599910110060037.
Thi Nhu Hoa Le, Van Anh Mai, Cong Van Nguyen. 2020. "Determinants of profitability: evidence from construction companies listed on Vietnam Securities Market." Management Science Letters. https://doi.org/10.5267/j.msl.2019.9.028.
Wassie, Fekadu Agmas. 2020. "Impacts of capital structure: profitability of construction companies in Ethiopia." Journal of Financial Management of Property and Construction. https://doi.org/10.1108/JFMPC-08-2019-0072.
WU, CHING-LUNG CHEN & CHEI-WEI. 2012. "Diagnosing assets impairment by using random forests model." International Journal of Information Technology & Decision Making. https://doi.org/10.1142/S0219622012500046.
Yongsong Cai, Qi Yin, Qian Su, Xinyu Huang, Yin Zhang and Ting Liu. 2020. "Prediction Method of Enterprise Return on Net Assets Based on Improved Random Forest Algorithm." Conference Series. https://doi.org/10.1088/1742-6596/1682/1/012083. Zhu, Zheng Tan & Ziqin Yan & Guangwei. 2019. "Stock selection with random forest: An exploitation of excess return in the Chinese stock market." Heliyon. https://doi.org/10.1016/j.heliyon.2019.e02310. Additional Declarations: No competing interests reported. Status: Under Review. Version 1 posted. Editorial decision: Revision requested 25 Mar, 2024; Reviews received at journal 24 Mar, 2024; Reviewers agreed at journal 23 Mar, 2024; Reviewers invited by journal 23 Mar, 2024; Editor assigned by journal 20 Mar, 2024; Submission checks completed at journal 19 Mar, 2024; First submitted to journal 19 Mar, 2024.
Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-4129810","acceptedTermsAndConditions":true,"allowDirectSubmit":false,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":281791771,"identity":"5787a54e-cd49-423e-92de-881ca3ec5734","order_by":0,"name":"Vu Hong Son Pham","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAAyklEQVRIiWNgGAWjYFACHhBhkwDlMROtJQ2khbGBFC2HSdBiLpF7+MPHHefz+KedMX/AUGGd2CCRewCvFssZeWmSM8/cLpa4nWPYwHAmHaglLwGvFoMzZ8yYedtuJ26QBmphbDuc2MBzxoCQFuPPvG3noFr+EaPleI+BNG/bAaiWBqAW9h78Wizbe8wkZ7YlJ864nVY4I+FYunEbIS3mzDzGHz622SX2z07e8OFDjbVsPzMPAYeh8BKAmA2vegwto2AUjIJRMAqwAQAHLEbT0+qmMQAAAABJRU5ErkJggg==","orcid":"","institution":"Ho Chi Minh City University of Technology (HCMUT)","correspondingAuthor":true,"prefix":"","firstName":"Vu","middleName":"Hong Son","lastName":"Pham","suffix":""},{"id":281791772,"identity":"d8a37ed9-2be1-410a-b678-acfd4de92c23","order_by":1,"name":"Tung Duong Le","email":"","orcid":"","institution":"Ho Chi Minh City University of Technology (HCMUT)","correspondingAuthor":false,"prefix":"","firstName":"Tung","middleName":"Duong","lastName":"Le","suffix":""}],"badges":[],"createdAt":"2024-03-19 11:16:12","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-4129810/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-4129810/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":53269268,"identity":"74c37799-b2bc-41ad-b103-28ca939d9d79","added_by":"auto","created_at":"2024-03-22 16:09:36","extension":"png","order_by":1,"title":"Figure 
1","display":"","copyAsset":false,"role":"figure","size":34931,"visible":true,"origin":"","legend":"\u003cp\u003eBasic structure of combined model. Source: (Rincy 2020)\u003c/p\u003e","description":"","filename":"1.png","url":"https://assets-eu.researchsquare.com/files/rs-4129810/v1/759022eb1de320a052df6dfb.png"},{"id":53269267,"identity":"ca5298ea-297a-4700-8104-9a6c41a8bc64","added_by":"auto","created_at":"2024-03-22 16:09:36","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":2092810,"visible":true,"origin":"","legend":"\u003cp\u003eDescriptive chart of data distribution\u003c/p\u003e","description":"","filename":"2.png","url":"https://assets-eu.researchsquare.com/files/rs-4129810/v1/d2d01016a012a9b56247a6ba.png"},{"id":53269920,"identity":"85233ad1-600c-477f-a170-ab2929785615","added_by":"auto","created_at":"2024-03-22 16:17:36","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":47616,"visible":true,"origin":"","legend":"\u003cp\u003ePearson Correlation Matrix\u003c/p\u003e","description":"","filename":"3.png","url":"https://assets-eu.researchsquare.com/files/rs-4129810/v1/09020c76ebd3946003b91e38.png"},{"id":53269269,"identity":"bd4096e9-c305-401b-a197-5706d15829a4","added_by":"auto","created_at":"2024-03-22 16:09:36","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":14259,"visible":true,"origin":"","legend":"\u003cp\u003eRanking of machine learning models\u003c/p\u003e","description":"","filename":"4.png","url":"https://assets-eu.researchsquare.com/files/rs-4129810/v1/f945a65b56318cf166d79f5f.png"},{"id":53269272,"identity":"b44bca85-938d-43d9-a622-2c43ef1c9117","added_by":"auto","created_at":"2024-03-22 16:09:36","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":18413,"visible":true,"origin":"","legend":"\u003cp\u003eRanking of the importance of 
independent variables\u003c/p\u003e","description":"","filename":"5.png","url":"https://assets-eu.researchsquare.com/files/rs-4129810/v1/389a9cde451825d61959e183.png"},{"id":53269270,"identity":"a0f0b860-9c93-4f74-883e-1d746ed071b1","added_by":"auto","created_at":"2024-03-22 16:09:36","extension":"png","order_by":6,"title":"Figure 6","display":"","copyAsset":false,"role":"figure","size":54392,"visible":true,"origin":"","legend":"\u003cp\u003eLearning curve of GBR model\u003c/p\u003e","description":"","filename":"6.png","url":"https://assets-eu.researchsquare.com/files/rs-4129810/v1/f194687d146b314cebbd7202.png"},{"id":53269273,"identity":"a795303e-4ba1-424d-9f17-8442fbfd12f2","added_by":"auto","created_at":"2024-03-22 16:09:36","extension":"png","order_by":7,"title":"Figure 7","display":"","copyAsset":false,"role":"figure","size":24482,"visible":true,"origin":"","legend":"\u003cp\u003ePredicting the Return on Assets (ROA) for Vietnamese Construction Enterprises for the year 2023\u003c/p\u003e","description":"","filename":"7.png","url":"https://assets-eu.researchsquare.com/files/rs-4129810/v1/83f0c2746cdcf857aa4a3fd0.png"},{"id":53270373,"identity":"a3d8cb1d-2c46-411d-9dce-4b1d85dc277c","added_by":"auto","created_at":"2024-03-22 16:25:37","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":463156,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-4129810/v1/72509435-504d-485a-af5a-b177c5965e6c.pdf"}],"financialInterests":"No competing interests reported.","formattedTitle":"\u003cp\u003eResearch on Applying Machine Learning Models to Predict and Assess Return on Assets (Roa)\u003c/p\u003e","fulltext":[{"header":"Introduction","content":"\u003cp\u003eIn recent years, the application of information technology in the construction industry has piqued the interest of scientists. 
Numerous studies, both domestic and international, have explored the use of artificial intelligence (AI) in various types of projects. These include civil projects (L. N. Son 2023) (L. N. Son 2024), (N. D. Son 2024), (Nguyen Dang Nghiep Trinh 2024), (N. T. Son 2024) , transport (N. V. Son 2023) and electricity (T. H. Son 2023).\u003c/p\u003e\n\u003cp\u003eIn the evolving field of construction management, the application of machine learning (ML) techniques has marked a significant stride toward enhancing the predictability and understanding of profitability within the industry. Recent studies highlight the burgeoning role of ML in forecasting financial outcomes, underscoring a technological shift that promises to refine strategic planning and decision-making processes.\u003c/p\u003e\n\u003cp\u003e(E. Adinyira 2021)pioneered the use of a Support Vector Regression Algorithm (SVRA) to predict construction project profit margins in Ghana, showcasing a commendable predictive accuracy of 73.66%. This study not only demonstrates the applicability of ML in construction profitability forecasts but also sets a benchmark for future research in similar emerging markets. Following a similar trajectory, (Hong Zhang 2015) employed principal component analysis (PCA) and a support vector machine (SVM) to navigate the complex financial landscapes of Chinese construction firms, achieving an impressive accuracy rate exceeding 80%. These findings illuminate the potential of ML algorithms to dissect and interpret multifaceted financial data effectively. Adding to the body of knowledge, (Mahfouz 2012) introduced a decision support system designed to estimate productivity rates in construction projects through the integration of SVM and Naive Bayes models. 
This approach not only underscores the versatility of ML in handling diverse construction management challenges but also reinforces the importance of predictive accuracy in enhancing project outcomes.\u003c/p\u003e\n\u003cp\u003eIn a study focusing on the Vietnamese context, (Thi Nhu Hoa Le 2020) identified a set of critical factors influencing the profitability of local construction companies. Variables such as company age, debt ratio, growth rate, asset utilization performance, company size, and the proportion of fixed assets were highlighted as determinants of financial success. This research enriches the discourse on profitability drivers in construction, offering valuable insights into the Vietnamese market\u0026apos;s unique characteristics.\u003c/p\u003e\n\u003cp\u003eFurthermore, (Wassie 2020) investigates the impact of capital structure on the profitability of construction firms in Ethiopia, revealing that both the debt-to-equity ratio and the ratio of long-term debt to total assets exhibit a significant positive correlation with ROE and ROA. This finding aligns with the broader discourse on the pivotal role of capital structure decisions in influencing company worth and operational costs, and underscores the importance of strategic financial planning in the construction domain. Moreover, a conceptual review by (Ngo 2023) on working capital management practices in the construction industry highlights the intricate relationship between working capital components and firm profitability, further substantiating the critical role of financial management in the sector\u0026apos;s sustainability. 
This aligns with (Soa La Nguyen 2023) exploration into the associations between ROE, ROA, liquidity, and debt, emphasizing the nuanced impact of short-term loans on firm profitability and underscoring the potential of ML to refine these predictive analyses.\u003c/p\u003e\n\u003cp\u003eCollectively, these studies underscore the transformative potential of ML in the construction industry, presenting a promising avenue for future research and application. By leveraging advanced analytical techniques, the construction sector can achieve greater accuracy in profitability predictions, enabling more informed strategic decisions that drive sustainable growth and financial stability.\u003c/p\u003e\n\u003cp\u003eThis study introduces a suite of machine learning (ML) models, chosen based on criteria such as data characteristics, computational efficiency, and the specific objectives of the analysis. The investigation encompasses a variety of algorithms, including supervised regression techniques like Lasso, Ridge Regression (RR), k-Nearest Neighbors Regressor (KNR), and Support Vector Regression (SVR), as well as ensemble methods such as Random Forest (RF), Extra Trees Regressor (ETR), Gradient Boosting Regressor (GBR), and Extreme Gradient Boosting (XGBoost). Additionally, the study evaluates an artificial neural network approach through the Multilayer Perceptron (MLP) model. The effectiveness of these models will be quantitatively measured using R Square (R\u003csup\u003e2\u003c/sup\u003e) and Root Mean Square Error (RMSE) metrics. The forthcoming section will detail the ML methodologies under review, the research approach employed, and a discussion on the findings derived from the analysis.\u003c/p\u003e"},{"header":"Research methodology","content":"\u003cp\u003eA range of studies have demonstrated the effectiveness of the random forest algorithm in various financial applications. 
(Yongsong Cai 2020) and (WU 2012) both found that the algorithm outperformed other methods in predicting enterprise return on net assets and diagnosing assets impairment, respectively. (Zhu 2019) applied the algorithm to forecast fund return rate direction and select stocks, reporting positive results. (Sevil 2020) and (McGroarty 2014) further extended the algorithm\u0026apos;s application to predicting IPO initial returns and developing an automated trading system, respectively, with both studies finding the algorithm to be superior to other methods. (Scornet 2016) and (Breiman 1999) provided comprehensive overviews of the algorithm, highlighting its versatility and robustness.\u003c/p\u003e\n\u003cp\u003e2.1 Return\u0026nbsp;on assets (ROA) forecasting model:\u003c/p\u003e\n\u003cp\u003ePredicting Return on Assets (ROA) enables businesses to develop risk management strategies and adjust their financial posture more effectively by forecasting future outcomes. This aids businesses in:\u003c/p\u003e\n\u003cul\u003e\n \u003cli\u003eMaking critical strategic decisions for the company\u0026apos;s future, guiding the business towards sustainable and financially secure growth.\u003c/li\u003e\n \u003cli\u003ePlanning and adjusting financial ratios (such as leverage ratio, working capital, etc.) to optimize profits.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eThe aim of this study is to contribute to financial management practices within the construction industry, while also providing a trained dataset for predicting and assessing the ROA for the construction sector in Vietnam.\u003c/p\u003e\n\u003cp\u003e2.1.1 Linear regression analysis model\u003c/p\u003e\n\u003cp\u003eThe linear regression analysis model, a statistical method, is employed to examine the relationship between one or more independent variables, also known as feature variables, and a dependent variable, or target variable. 
The goal is to predict the value of the dependent variable from the independent variables by identifying the straight line that minimizes the discrepancy between the predicted and actual values of the dependent variable. (Khoi 2019) identified internal variables such as firm size, return on equity, and earnings per share as significant predictors of ROA in the Vietnamese stock market by using a basic regression model. Moreover, the predictability of real estate asset returns using a vector regression model which incorporates financial spreads was found to be reduced over longer forecasting horizons (Tsolacos 2010). The supervised regression models used in our study are Lasso Regression, Ridge Regression, K Neighbors Regression, and Support Vector Regression (SVR). Of these, only Lasso and Ridge are strictly linear models; K Neighbors Regression is non-parametric, and SVR is linear or non-linear depending on its kernel.\u003c/p\u003e\n\u003cp\u003e2.1.2 Ensemble model\u003c/p\u003e\n\u003cp\u003eThe ensemble model, first introduced in the early 1990s (Hansen 1990), saw significant development in the early 21st century \u003cspan lang=\"VI\"\u003e(Friedman 2001)\u003c/span\u003e. This model is a departure from the independent linear regression analysis model. Instead, it is a machine learning approach that combines the predictions of multiple individual models. The aim of the ensemble model is to enhance generalization and accuracy by combining the predictions of base estimators built with a certain learning algorithm.\u003c/p\u003e\n\u003cp\u003eThe two main methods, averaging and boosting, are typically differentiated as follows:\u003c/p\u003e\n\u003cp\u003eAveraging involves independently constructing multiple estimators and then averaging their predictions. Generally, the combined estimator in averaging tends to outperform any single base estimator as it reduces variance. Models like RF and ETR operate on this principle.\u003c/p\u003e\n\u003cp\u003eOn the other hand, with boosting, base estimators are built sequentially with an aim to decrease the bias of the combined estimator. 
The objective here is to combine several weak models into a strong ensemble. Models such as GBR and XGBoost fall under this category.\u003c/p\u003e\n\u003cp\u003e2.1.3 Artificial neural network model\u003c/p\u003e\n\u003cp\u003eArtificial Neural Networks (ANNs) are computational models inspired by biological neural networks. Their origins date back to the 1940s and 1950s, when Warren McCulloch and Walter Pitts first conceptualized the neuron model (McCulloch 1943). This work laid the groundwork for the first artificial neural models. The field progressed significantly in the 1990s with the introduction of advanced models like the Convolutional Neural Network (CNN) and the Counterpropagation Neural Network (CPN) (A. \u0026amp;. Kaveh 1998). These models have been applied to various tasks, including speech recognition, natural language processing, and image and video analysis. Today, ANNs are prevalent in both industry and research, and the field is evolving rapidly (A. \u0026amp;. Kaveh 2023). In addition, (Mehmet Kayakus 2023) found that artificial neural networks (ANNs) and support vector regression (SVR) were successful in predicting ROA in the iron and steel industry.\u003c/p\u003e\n\u003cp\u003eThe Multilayer Perceptron (MLP), a type of ANN with one or more hidden layers, is particularly useful for regression problems. This model evolved from the Perceptron model of the 1950s and 1960s. The 1980s marked the emergence of multilayer neural network architectures used for tackling complex problems. Today, MLP is still a commonly used machine learning model. 
In this paper, we will use the MLP model to examine predictability using our existing dataset.\u003c/p\u003e\n\u003cp\u003e2.2 Model evaluation indicators:\u003c/p\u003e\n\u003cp\u003eRMSE, or Root Mean Square Error, measures the average magnitude of the error, also known as the residual, between the predicted value and the actual value. The smaller the RMSE value, the smaller the error, indicating more accurate estimates and a more reliable model. It is computed as RMSE = √( (1/n) Σ (y\u003csub\u003ei\u003c/sub\u003e \u0026minus; ŷ\u003csub\u003ei\u003c/sub\u003e)\u003csup\u003e2\u003c/sup\u003e ), with the sum taken over i = 1, ..., n. For the formula, see (Hodson 2022).\u003c/p\u003e\n\u003cp\u003e\u003cimg src=\"https://myfiles.space/user_files/122228_c8a1650c59388082/122228_custom_files/img1711094658.png\"\u003e\u003c/p\u003e\n\u003cp\u003eR\u003csup\u003e2\u003c/sup\u003e, or the coefficient of determination, measures the proportion of the target data\u0026apos;s variance that our regression model can explain. An R\u003csup\u003e2\u003c/sup\u003e value close to 1 is typically considered to indicate good predictive ability. It is computed as R\u003csup\u003e2\u003c/sup\u003e = 1 \u0026minus; Σ (y\u003csub\u003ei\u003c/sub\u003e \u0026minus; ŷ\u003csub\u003ei\u003c/sub\u003e)\u003csup\u003e2\u003c/sup\u003e / Σ (y\u003csub\u003ei\u003c/sub\u003e \u0026minus; y̅)\u003csup\u003e2\u003c/sup\u003e, with both sums taken over i = 1, ..., n. For the formula, see (Davide Chicco 2021).\u003c/p\u003e\n\u003cp\u003e\u003cimg src=\"https://myfiles.space/user_files/122228_c8a1650c59388082/122228_custom_files/img1711094705.png\"\u003e\u003c/p\u003e\n\u003cp\u003eIn these formulas, y\u003csub\u003ei\u003c/sub\u003e is the actual value of observation i, ŷ\u003csub\u003ei\u003c/sub\u003e is the predicted value of observation i, y̅ is the average value of y, and n is the number of observations.\u003c/p\u003e\n"},{"header":"Evaluation of machine learning models","content":"\u003cp\u003e3.1 Data set overview\u003c/p\u003e\n\u003cp\u003e3.1.1 Data collection\u003c/p\u003e\n\u003cp\u003eIn accordance with the analysis requirements and the model under construction, we\u0026apos;ll select the necessary data. Identifying the right data source, its format, and the collection method forms a crucial part of the process. 
The types of data we collect will be dictated by the purpose of our research and the results we\u0026apos;re aiming for. Here are the specifics:\u003c/p\u003e\n\u003cp\u003e- We\u0026apos;ll gather data on the company\u0026apos;s business results from financial reports and balance sheets (revenue, cost of goods sold, financial costs, sales costs, profit, interest,...) available at https://finance.vietstock.vn/.\u003c/p\u003e\n\u003cp\u003e- We\u0026apos;ll compute financial variables using specific formulas.\u003c/p\u003e\n\u003cp\u003e3.1.2 Data processing\u003c/p\u003e\n\u003cp\u003ea. Missing values\u003c/p\u003e\n\u003cp\u003e\u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; To manage missing values while preserving the data structure, two methods are proposed:\u003c/p\u003e\n\u003cp\u003ei) Replace the missing value with the column\u0026apos;s mean or median. This technique is generally applied when the number of missing values is insignificant.\u003c/p\u003e\n\u003cp\u003eii) Exclude the row containing the missing value from the data series. In the realm of power or meteorological data series, this discarded row would embody all the data for a specific time step associated with the missing value.\u003c/p\u003e\n\u003cp\u003eb. Data analysis:\u003c/p\u003e\n\u003cp\u003e\u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp;\u0026nbsp;This study applies the Cross-Validation technique for assessing the machine learning model. We\u0026apos;ve used 5 folds for optimizing resources. 
When compared with the conventional technique (which splits data into training and testing parts), here\u0026apos;s what we find:\u003c/p\u003e\n\u003cp\u003e- Data Use Efficiency:\u003c/p\u003e\n\u003cp\u003eCross-Validation technique: Each dataset sample becomes a test set once and a training set (k-1) times - maximizing data usage, which is crucial for small datasets.\u003c/p\u003e\n\u003cp\u003eTraditional Split: The data is divided in a fixed manner, typically 70% for training and 30% for testing. Some data is only used either for training or testing, which doesn\u0026apos;t make the best use of all data.\u003c/p\u003e\n\u003cp\u003e- Bias and Variance Reduction:\u003c/p\u003e\n\u003cp\u003eCross-Validation technique: Since each sample is used for both training and testing, this method helps curb bias and variance for a more accurate model performance estimate.\u003c/p\u003e\n\u003cp\u003eTraditional Split: There could be high bias or variance if the dataset split doesn\u0026apos;t accurately represent the dataset\u0026apos;s full structure or distribution, particularly if the dataset is small or lacks diversity.\u003c/p\u003e\n\u003cp\u003e- Flexibility and General applicability:\u003c/p\u003e\n\u003cp\u003eCross-Validation technique: Offers a more flexible and general model evaluation method, good for various data types and different problems. It also allows fold number adjustment to balance computational efficiency and model performance estimate accuracy.\u003c/p\u003e\n\u003cp\u003eTraditional Split: Suitable when there\u0026apos;s plenty of data and a quick model evaluation is needed. However, this method may not accurately reflect the model\u0026apos;s applicability to new data.\u003c/p\u003e\n\u003cp\u003e- Computation and Time:\u003c/p\u003e\n\u003cp\u003eCross-Validation technique: Needs more computational resources and time as the model is trained k times. 
But, it\u0026apos;s considered a fair trade-off for a more accurate model performance estimate.\u003c/p\u003e\n\u003cp\u003eTraditional Split: Requires fewer resources and less time, making it a good option when computational resources are limited or quick results are needed. But, this may compromise the accuracy of the model performance estimate.\u003c/p\u003e\n\u003cp\u003e3.2 Data description\u003c/p\u003e\n\u003cp\u003e3.2.1 Information about the financial\u0026nbsp;database:\u003c/p\u003e\n\u003cp\u003e\u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; The forecast model is built on data collected from the financial reports of 76 construction firms listed on the Vietnam Stock Exchange (HOSE, HNX) spanning from 2012 to 2022. You\u0026apos;ll find the data, which delves into comprehensive business outcomes and balance sheets.\u003c/p\u003e\n\u003cp\u003e3.2.2 Description of financial\u0026nbsp;database:\u003c/p\u003e\n\u003cp\u003e\u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp;\u0026nbsp;The financial results of companies play a significant role in determining financial indicators. Summarized financial report information can be accessed from the website: https://finance.vietstock.vn/. This website, established on 02/08/2002, aims to provide accurate data about corporate finance, stocks, bonds, and macro information. 
The data from this website is used in this article for calculations.\u003c/p\u003e\n\u003cp\u003e3.3 Variables in the predictive model\u003c/p\u003e\n\u003cp\u003eThe model includes 14 independent variables (features) and 1 dependent variable (target) (\u003cstrong\u003eTable 1\u003c/strong\u003e).\u003c/p\u003e\n\u003cp\u003eTable 1: Statistics of variables included in the model (Source: Synthesized by the authors) (Ngo 2023)\u003c/p\u003e\n\u003cdiv\u003e\n \u003ctable border=\"0\" cellspacing=\"0\" cellpadding=\"0\" width=\"718\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd width=\"7.520891364902507%\"\u003e\n \u003cp\u003eNo.\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"38.579387186629525%\"\u003e\n \u003cp\u003eName/Variable Name\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"18.384401114206128%\"\u003e\n \u003cp\u003eUnit\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"35.51532033426184%\"\u003e\n \u003cp\u003eMeasurement\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"7.520891364902507%\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"38.579387186629525%\"\u003e\n \u003cp\u003eReturn on Assets/ROA\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"18.384401114206128%\"\u003e\n \u003cp\u003e%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"35.51532033426184%\"\u003e\n \u003cp\u003eNet income / Average total assets\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"7.520891364902507%\"\u003e\n \u003cp\u003e2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"38.579387186629525%\"\u003e\n \u003cp\u003eDays Sales Outstanding/DSO\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"18.384401114206128%\"\u003e\n \u003cp\u003eDays\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd 
width=\"35.51532033426184%\"\u003e\n \u003cp\u003e(Average accounts receivable balance / Net sales) x 365\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"7.520891364902507%\"\u003e\n \u003cp\u003e3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"38.579387186629525%\"\u003e\n \u003cp\u003eDays Inventory Outstanding/DIO\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"18.384401114206128%\"\u003e\n \u003cp\u003eDays\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"35.51532033426184%\"\u003e\n \u003cp\u003e(Average inventory balance / Cost of goods sold) x 365\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"7.520891364902507%\"\u003e\n \u003cp\u003e4\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"38.579387186629525%\"\u003e\n \u003cp\u003eDays Payable\u0026nbsp;Outstanding/DPO\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"18.384401114206128%\"\u003e\n \u003cp\u003eDays\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"35.51532033426184%\"\u003e\n \u003cp\u003e(Average accounts payable balance / Cost of goods sold) x 365\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"7.520891364902507%\"\u003e\n \u003cp\u003e5\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"38.579387186629525%\"\u003e\n \u003cp\u003eCash Conversion\u0026nbsp;Cycle/CCC\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"18.384401114206128%\"\u003e\n \u003cp\u003eDays\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"35.51532033426184%\"\u003e\n \u003cp\u003eDSO + DIO \u0026minus; DPO\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"7.520891364902507%\"\u003e\n \u003cp\u003e6\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"38.579387186629525%\"\u003e\n \u003cp\u003eNet Working\u0026nbsp;Capital/NWC\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"18.384401114206128%\"\u003e\n \u003cp\u003eBillions of VND\u003c/p\u003e\n 
\u003c/td\u003e\n \u003ctd width=\"35.51532033426184%\"\u003e\n \u003cp\u003eCurrent Assets - Current Liabilities\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"7.520891364902507%\"\u003e\n \u003cp\u003e7\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"38.579387186629525%\"\u003e\n \u003cp\u003eSize/SIZ\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"18.384401114206128%\"\u003e\n \u003cp\u003eBillions of VND\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"35.51532033426184%\"\u003e\n \u003cp\u003eNatural Logarithm of net sales\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"7.520891364902507%\"\u003e\n \u003cp\u003e8\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"38.579387186629525%\"\u003e\n \u003cp\u003eFinancial\u0026nbsp;Leverage/LEV\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"18.384401114206128%\"\u003e\n \u003cp\u003e%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"35.51532033426184%\"\u003e\n \u003cp\u003eTotal debt/Total Asset\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"7.520891364902507%\"\u003e\n \u003cp\u003e9\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"38.579387186629525%\"\u003e\n \u003cp\u003eCurrent Ratio/CR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"18.384401114206128%\"\u003e\n \u003cp\u003e%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"35.51532033426184%\"\u003e\n \u003cp\u003eCurrent Assets/Current Debt\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"7.520891364902507%\"\u003e\n \u003cp\u003e10\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"38.579387186629525%\"\u003e\n \u003cp\u003eGrowth Rate/GRO\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"18.384401114206128%\"\u003e\n \u003cp\u003e%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"35.51532033426184%\"\u003e\n \u003cp\u003ePercentage change in net sales\u003c/p\u003e\n 
\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"7.520891364902507%\"\u003e\n \u003cp\u003e11\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"38.579387186629525%\"\u003e\n \u003cp\u003eReturn on Equity/ROE\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"18.384401114206128%\"\u003e\n \u003cp\u003e%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"35.51532033426184%\"\u003e\n \u003cp\u003eNet Income/Shareholder\u0026apos;s Equity\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"7.520891364902507%\"\u003e\n \u003cp\u003e12\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"38.579387186629525%\"\u003e\n \u003cp\u003eEarnings Before Interest\u0026nbsp;Taxes/EBIT\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"18.384401114206128%\"\u003e\n \u003cp\u003eBillions of VND\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"35.51532033426184%\"\u003e\n \u003cp\u003eRevenue - Operating Expenses (excluding interest and taxes)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"7.520891364902507%\"\u003e\n \u003cp\u003e13\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"38.579387186629525%\"\u003e\n \u003cp\u003eGross\u0026nbsp;Margin/GROS\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"18.384401114206128%\"\u003e\n \u003cp\u003e%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"35.51532033426184%\"\u003e\n \u003cp\u003e(Total Revenue - Cost of Goods Sold)/Total Revenue\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"7.520891364902507%\"\u003e\n \u003cp\u003e14\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"38.579387186629525%\"\u003e\n \u003cp\u003eQuick\u0026nbsp;Ratio/QR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"18.384401114206128%\"\u003e\n \u003cp\u003e%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"35.51532033426184%\"\u003e\n \u003cp\u003e(Current Assets - Inventories)/Current Liabilities\u003c/p\u003e\n 
\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"7.520891364902507%\"\u003e\n \u003cp\u003e15\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"38.579387186629525%\"\u003e\n \u003cp\u003eReturn/RE\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"18.384401114206128%\"\u003e\n \u003cp\u003eBillions of VND\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"35.51532033426184%\"\u003e\n \u003cp\u003eNet Profit Margin\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n\u003c/div\u003e\n\u003cp\u003eThe characteristics of the dataset are illustrated by a graph (Figure 2) which shows the frequency of values.\u003c/p\u003e"},{"header":"Results and discussion","content":"\u003cp\u003eIn the process of preparing data for machine learning algorithms, an essential step involves exploring, visualizing, and preprocessing the available features. This step is crucial as it gives a clear overview of the data that will be fed into the machine learning model, helping to identify any areas that may need further refinement before proceeding.\u003c/p\u003e\n\u003cp\u003eThe dataset utilised in this process consists of a substantial 740 rows and 15 columns. Such a volume of data provides a rich resource for the machine learning algorithms to draw from, thus increasing the potential accuracy of the forecasts produced.\u003c/p\u003e\n\u003cp\u003eIn order to examine the details of the available columns in the dataset, the pandas data analysis library is employed. This powerful tool provides an efficient and effective means of handling and exploring the data in detail, allowing for a thorough examination of the dataset in its entirety.\u003c/p\u003e\n\u003cp\u003eThis article goes a step further and delves into the relationships between the data by studying the correlation between different variables. This is done in order to select the most suitable data for the forecasting model. 
In this instance, the Pearson correlation coefficient is used to identify the variables that correlate strongly with ROA. The results of this analysis are shown in Figure 3. This step helps ensure that the most relevant data is included in the forecasting model, which increases the likelihood of accurate and meaningful results.\u003c/p\u003e\n\u003cp\u003eThe correlation matrix reveals that:\u003c/p\u003e\n\u003cp\u003e- Variables DSO, DIO, LEV, GRO, ROE, EBIT, GROS, and RE have a strong correlation with ROA.\u003c/p\u003e\n\u003cp\u003e- Variables DPO, CCC, NWC, SIZ, CR, and QR have a weak correlation with ROA.\u003c/p\u003e\n\u003cp\u003eDespite these differences, all variables are retained in the machine learning models, because the Pearson coefficient measures only linear association and weakly correlated variables may still contribute through non-linear relationships.\u003c/p\u003e\n\u003cp\u003eThe results of the models applied to predict return on assets (ROA) are presented in \u003cstrong\u003eTable 2\u003c/strong\u003e. Under five-fold cross-validation (n_splits = 5), the ensemble models (RF, GBR, and XGBoost) yield the best results.\u003c/p\u003e\n\u003cp\u003eTable 2: Comparison results of models\u003c/p\u003e\n\u003cdiv\u003e\n \u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd width=\"13.930348258706468%\"\u003e\n \u003cp\u003eNo.\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"22.139303482587064%\"\u003e\n \u003cp\u003eModel\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"32.83582089552239%\"\u003e\n \u003cp\u003eR\u003csup\u003e2\u003c/sup\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"31.09452736318408%\"\u003e\n \u003cp\u003eRMSE\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"13.930348258706468%\"\u003e\n 
\u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"22.139303482587064%\"\u003e\n \u003cp\u003eLasso\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"32.83582089552239%\"\u003e\n \u003cp\u003e0.5908\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"31.09452736318408%\"\u003e\n \u003cp\u003e2.001\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"13.930348258706468%\"\u003e\n \u003cp\u003e2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"22.139303482587064%\"\u003e\n \u003cp\u003eRR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"32.83582089552239%\"\u003e\n \u003cp\u003e0.7194\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"31.09452736318408%\"\u003e\n \u003cp\u003e2.0012\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"13.930348258706468%\"\u003e\n \u003cp\u003e3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"22.139303482587064%\"\u003e\n \u003cp\u003eKNR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"32.83582089552239%\"\u003e\n \u003cp\u003e0.452\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"31.09452736318408%\"\u003e\n \u003cp\u003e2.79\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"13.930348258706468%\"\u003e\n \u003cp\u003e4\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"22.139303482587064%\"\u003e\n \u003cp\u003eSVR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"32.83582089552239%\"\u003e\n \u003cp\u003e0.3968\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"31.09452736318408%\"\u003e\n \u003cp\u003e2.9339\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"13.930348258706468%\"\u003e\n \u003cp\u003e5\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"22.139303482587064%\"\u003e\n \u003cp\u003e\u003cstrong\u003eRF\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"32.83582089552239%\"\u003e\n 
\u003cp\u003e\u003cstrong\u003e0.9762\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"31.09452736318408%\"\u003e\n \u003cp\u003e\u003cstrong\u003e0.5826\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"13.930348258706468%\"\u003e\n \u003cp\u003e6\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"22.139303482587064%\"\u003e\n \u003cp\u003eGBR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"32.83582089552239%\"\u003e\n \u003cp\u003e0.97\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"31.09452736318408%\"\u003e\n \u003cp\u003e0.6528\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"13.930348258706468%\"\u003e\n \u003cp\u003e7\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"22.139303482587064%\"\u003e\n \u003cp\u003eXGBoost\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"32.83582089552239%\"\u003e\n \u003cp\u003e0.9822\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"31.09452736318408%\"\u003e\n \u003cp\u003e0.504\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd width=\"13.930348258706468%\"\u003e\n \u003cp\u003e8\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"22.139303482587064%\"\u003e\n \u003cp\u003eMLP\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"32.83582089552239%\"\u003e\n \u003cp\u003e0.838\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd width=\"31.09452736318408%\"\u003e\n \u003cp\u003e1.517\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n\u003c/div\u003e\n\u003cp\u003eThe ranking of the models by R\u003csup\u003e2\u003c/sup\u003e and RMSE is shown in \u003cstrong\u003eFigure\u0026nbsp;4\u003c/strong\u003e.\u003c/p\u003e\n\u003cp\u003eThe model is evaluated and selected based on three criteria:\u003c/p\u003e\n\u003cp\u003e- R\u003csup\u003e2\u003c/sup\u003e Score: A score closer to 1 is better.\u003c/p\u003e\n\u003cp\u003e- RMSE 
Score: A lower score is better.\u003c/p\u003e\n\u003cp\u003e- Learning curves: This chart plots the training and test scores against the number of training samples. The closer the two curves are, the better the model generalizes.\u003c/p\u003e\n\u003cp\u003eIn deciding between the RF and XGBoost models, both of which have strong R\u003csup\u003e2\u003c/sup\u003e and RMSE scores, we observe a larger gap between the training and testing curves for XGBoost (1 unit) than for RF (approximately 0.7 units). This study therefore uses the RF model.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u003cem\u003eCalibration of RF\u0026nbsp;model\u003c/em\u003e\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eGridSearchCV searches for the best hyperparameters of a machine learning model. It tests a range of parameter values and measures the model\u0026apos;s performance for each set, using cross-validation to evaluate the model across different training and validation splits.\u003c/p\u003e\n\u003cp\u003eIn simple terms, GridSearchCV takes a list of candidate values for each parameter and forms every combination of these values. The model is trained with each combination and scored by cross-validation, and the combination with the best validation performance is selected. For the RF model, the selected parameters are:\u003c/p\u003e\n\u003cp\u003e- max_depth = 10.\u003c/p\u003e\n\u003cp\u003e- min_samples_leaf = 2.\u003c/p\u003e\n\u003cp\u003e- min_samples_split = 5.\u003c/p\u003e\n\u003cp\u003e- n_estimators = 200.\u003c/p\u003e\n\u003cp\u003eThe feature-importance ranking shows that ROE has the greatest impact on ROA. 
Additionally, the variables LEV, EBIT, and CR have a substantial influence on ROA (Figure 5).\u003c/p\u003e\n\u003cp\u003eThe learning curves (Figure 6) show that the training score and the cross-validation score remain relatively stable. As more samples are used for training, the two scores converge or nearly converge, which suggests that the model generalizes well.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u003cem\u003ePrediction\u003c/em\u003e\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe predictive model outlined in this article can assist business owners in the following ways:\u003c/p\u003e\n\u003cp\u003e- Enhancing business decisions: The model provides a prediction of ROA before specific profit indicators have been calculated, enabling businesses to explore future scenarios and analyze \u0026quot;What-If\u0026quot; situations. This supports planning and risk management.\u003c/p\u003e\n\u003cp\u003e- Streamlining and automating the prediction process: Applying machine learning models standardizes the prediction process, making it automated, quick, and less reliant on manual intervention.\u003c/p\u003e\n\u003cp\u003e- Integrating data, knowledge, and context: This model considers only 15 variables. Integrating additional variables and context into the analysis could better reflect the actual situation and the influence of the external business environment.\u003c/p\u003e\n\u003cp\u003eApplying the model to forecast ROA for construction businesses in 2023 yielded the results shown in Figure 7.\u003c/p\u003e\n\u003cp\u003eDespite these strong results (error \u0026lt; 1%), the model has not yet been validated on a larger dataset. 
Such validation would involve a larger number of feature variables and require further enhancement of the computational model to improve the accuracy of the prediction tool.\u003c/p\u003e"},{"header":"Conclusions, application and Proposal","content":"\u003cp\u003eThis article presents a financial perspective in the field of construction, applying machine learning models that include linear regression models, ensemble models, and neural network models. The results indicate that the RF model outperformed the GBR model (R\u003csup\u003e2\u003c/sup\u003e higher by 0.0062, or 0.62 percentage points) and was preferred over the XGBoost model because of the smaller gap between its training and testing curves in the learning curves chart. The SVR and KNR models, however, were found to be unsuitable for this dataset, given their low R\u003csup\u003e2\u003c/sup\u003e values (below 0.5).\u003c/p\u003e\n\u003cp\u003eThis study sets the stage for several research avenues that could be pursued in the near future: comparative analysis across industries, temporal stability of ML predictions, integration with other financial health indicators, impact of external variables, advanced ML techniques and algorithms, cross-country comparisons, ML interpretability and decision-making, and sustainability and environmental considerations.\u003c/p\u003e\n\u003cp\u003eThese proposed research avenues can expand the understanding built by this study. 
Moreover, they could offer valuable insights for industry professionals, policymakers, and scholars interested in the intersection of finance, construction, and machine learning.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eFunding\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors declare that no funds, grants, or other support were received during the preparation of this manuscript.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCompeting Interests\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors have no relevant financial or non-financial interests to disclose.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthor Contributions\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eBoth authors wrote, prepared, and reviewed the manuscript.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAcknowledgments\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eWe gratefully acknowledge the time and facilities provided for this work by Ho Chi Minh City University of Technology (HCMUT), VNUHCM.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eBreiman, L. 1999. \u0026quot;Random Forests--random Features.\u0026quot; \u003cem\u003eComputer Science, Mathematics.\u003c/em\u003e https://www.stat.berkeley.edu/~breiman/random-forests.pdf.\u003c/li\u003e\n\u003cli\u003eDavide Chicco, Matthijs J. Warrens, Giuseppe Jurman. 2021. \u0026quot;The coefficient of determination R-squared is more informative than SMAPE, MAE, MAPE, MSE and RMSE in regression analysis evaluation.\u0026quot; \u003cem\u003ePeerJ Computer Science.\u003c/em\u003e https://doi.org/10.7717/peerj-cs.623.\u003c/li\u003e\n\u003cli\u003eE. Adinyira, E. Adjei, K. Agyekum, F. Fugar. 2021. 
\u0026quot;Application of machine learning in predicting construction project profit in Ghana using Support Vector Regression Algorithm (SVRA).\u0026quot; \u003cem\u003eEngineering, Construction and Architectural Management.\u003c/em\u003e https://doi.org/10.1108/ECAM-08-2020-0618.\u003c/li\u003e\n\u003cli\u003eFriedman, J. H. 2001. \u0026quot;Greedy function approximation: A gradient boosting machine.\u0026quot; \u003cem\u003eThe Annals of Statistics.\u003c/em\u003e https://doi.org/10.1214/aos/1013203451.\u003c/li\u003e\n\u003cli\u003eHansen, L. K., \u0026amp; Salamon, P. 1990. \u0026quot;Neural network ensembles.\u0026quot; \u003cem\u003eIEEE Transactions on Pattern Analysis and Machine Intelligence.\u003c/em\u003e https://doi.org/10.1109/34.58871.\u003c/li\u003e\n\u003cli\u003eHodson, Timothy O. 2022. \u0026quot;Root-mean-square error (RMSE) or mean absolute error (MAE): when to use them or not.\u0026quot; \u003cem\u003eGeoscientific Model Development.\u003c/em\u003e https://doi.org/10.5194/gmd-15-5481-2022.\u003c/li\u003e\n\u003cli\u003eHong Zhang, Fei Yang, Yang Li, Heng Li. 2015. \u0026quot;Predicting profitability of listed construction companies based on principal component analysis and support vector machine\u0026mdash;Evidence from China.\u0026quot; \u003cem\u003eAutomation in Construction\u003c/em\u003e Vol 53, Pages 22-28. https://doi.org/10.1016/J.AUTCON.2015.03.001.\u003c/li\u003e\n\u003cli\u003eKaveh, A., \u0026amp; Iranmanesh, A. 1998. \u0026quot;Comparative study of backpropagation and improved counterpropagation neural nets in structural analysis and optimization.\u0026quot; \u003cem\u003eInternational Journal of Space Structures\u003c/em\u003e 13(4), 177\u0026ndash;185. https://doi.org/10.1177/026635119801300.\u003c/li\u003e\n\u003cli\u003eKaveh, A., \u0026amp; Khavaninzadeh, N. 2023. 
\u0026quot;Efficient training of two ANNs using four meta-heuristic algorithms for predicting the FRP strength.\u0026quot; \u003cem\u003eStructures\u003c/em\u003e 52, 256\u0026ndash;272. https://doi.org/10.1016/j.istruc.2023.03.178.\u003c/li\u003e\n\u003cli\u003eKhoi, Lucille V. Pointer \u0026amp; Phan. 2019. \u0026quot;Predictors of Return on Assets and Return on Equity for Banking and Insurance Companies on Vietnam Stock Exchange.\u0026quot; \u003cem\u003eEntrepreneurial Business and Economics Review.\u003c/em\u003e DOI: 10.15678/EBER.2019.070411.\u003c/li\u003e\n\u003cli\u003eLucille V. Pointer, Phan Dinh Khoi. 2019. \u0026quot;Predictors of Return on Assets and Return on Equity for Banking and Insurance Companies on Vietnam Stock Exchange.\u0026quot; \u003cem\u003eEntrepreneurial business and economics review.\u003c/em\u003e https://doi.org/10.15678/eber.2019.070411.\u003c/li\u003e\n\u003cli\u003eMahfouz, Tarek. 2012. \u0026quot;A Productivity Decision Support System for Construction Projects Through Machine Learning (ML).\u0026quot; \u003cem\u003eEngineering, Computer Science.\u003c/em\u003e \u003c/li\u003e\n\u003cli\u003eMcCulloch, W. S., \u0026amp; Pitts, W. 1943. \u0026quot;A logical calculus of the ideas immanent in nervous activity.\u0026quot; \u003cem\u003eThe Bulletin of Mathematical Biophysics.\u003c/em\u003e https://doi.org/10.1007/BF02478259.\u003c/li\u003e\n\u003cli\u003eMcGroarty, Ash Booth \u0026amp; Enrico Gerding \u0026amp; Frank. 2014. \u0026quot;Automated trading with performance weighted random forests and seasonality.\u0026quot; \u003cem\u003eExpert Systems with Applications.\u003c/em\u003e https://doi.org/10.1016/j.eswa.2013.12.009.\u003c/li\u003e\n\u003cli\u003eMehmet Kayakus, Bur\u0026ccedil;in Tutcu, Mustafa Terzioglu, Hasan Tala\u0026cedil;s and G\u0026uuml;ler Ferhan \u0026Uuml;nal Uyar. 2023. 
\u0026quot;ROA and ROE Forecasting in Iron and Steel Industry Using Machine Learning Techniques for Sustainable Profitability.\u0026quot; \u003cem\u003eSustainability.\u003c/em\u003e https://doi.org/10.3390/su15097389.\u003c/li\u003e\n\u003cli\u003eNgo, Thi Quy Vo and Ngoc Cuong. 2023. \u0026quot;Does working capital management matter? A comparative case between consumer goods firms and construction firms in Vietnam.\u0026quot; \u003cem\u003eCogent Business \u0026amp; Management.\u003c/em\u003e https://doi.org/10.1080/23311975.2023.2271543.\u003c/li\u003e\n\u003cli\u003eNguyen Dang Nghiep Trinh, Nguyen Van Nam, Pham Vu Hong Son. 2024. \u0026quot;Achieving improved performance in construction projects: advanced time and cost optimization framework.\u0026quot; \u003cem\u003eEvolutionary Intelligence.\u003c/em\u003e https://doi.org/10.1007/s12065-024-00918-7.\u003c/li\u003e\n\u003cli\u003eRincy, T. N., \u0026amp; Gupta, R. 2020. \u0026quot;Ensemble Learning Techniques and its Efficiency in Machine Learning: A Survey.\u0026quot; \u003cem\u003e2nd International Conference on Data, Engineering and Applications (IDEA).\u003c/em\u003e https://doi.org/10.1109/IDEA49133.2020.9170675.\u003c/li\u003e\n\u003cli\u003eScornet, G\u0026eacute;rard Biau \u0026amp; Erwan. 2016. \u0026quot;A random forest guided tour.\u0026quot; \u003cem\u003eTEST.\u003c/em\u003e https://doi.org/10.1007/S11749-016-0481-7.\u003c/li\u003e\n\u003cli\u003eSevil, Boubekeur Baba \u0026amp; G\u0026uuml;ven. 2020. \u0026quot;Predicting IPO initial returns using random forest.\u0026quot; \u003cem\u003eBorsa Istanbul Review.\u003c/em\u003e https://doi.org/10.1016/J.BIR.2019.08.001.\u003c/li\u003e\n\u003cli\u003eSoa La Nguyen, C. Pham, Tu Van Truong, Trong Van Phi, Linh Le, T. Vu. 2023. 
\u0026quot;Relationship between Capital Structure and Firm Profitability: Evidence from Vietnamese Listed Companies.\u0026quot; \u003cem\u003eInternational Journal of Financial Studies.\u003c/em\u003e https://doi.org/10.3390/ijfs11010045.\u003c/li\u003e\n\u003cli\u003eSon, Luu Ngoc Quynh Khoi \u0026amp; Pham Vu Hong. 2024. \u0026quot;Artificial intelligent support model for multiple criteria decision in construction management.\u0026quot; \u003cem\u003eOPSEARCH.\u003c/em\u003e https://doi.org/10.1007/s12597-024-00749-1.\u003c/li\u003e\n\u003cli\u003eSon, Luu Ngoc Quynh Khoi \u0026amp; Pham Vu Hong. 2023. \u0026quot;Optimization in Construction Management Using Adaptive Opposition Slime Mould Algorithm.\u0026quot; \u003cem\u003eAdvances in Civil Engineering.\u003c/em\u003e https://doi.org/10.1155/2023/7228896.\u003c/li\u003e\n\u003cli\u003eSon, Nguyen Dang Nghiep Trinh \u0026amp; Nguyen Van Nam \u0026amp; Pham Vu Hong. 2024. \u0026quot;Advanced vehicle routing in cement distribution: a discrete Salp Swarm Algorithm approach.\u0026quot; \u003cem\u003eInternational Journal of Management Science and Engineering Management.\u003c/em\u003e https://doi.org/10.1080/17509653.2024.2324172.\u003c/li\u003e\n\u003cli\u003eSon, Nguyen Trieu Vi \u0026amp; Pham Vu Hong. 2024. \u0026quot;Applying ant colony optimization algorithm to optimize construction time and costs for mass concrete projects.\u0026quot; \u003cem\u003eAsian Journal of Civil Engineering.\u003c/em\u003e https://doi.org/10.1007/s42107-024-00990-5.\u003c/li\u003e\n\u003cli\u003eSon, Nguyen Van Nam \u0026amp; Pham Vu Hong. 2023. \u0026quot;Cement Transport Vehicle Routing with a Hybrid Sine Cosine Optimization Algorithm.\u0026quot; \u003cem\u003eAdvances in Civil Engineering.\u003c/em\u003e https://doi.org/10.1155/2023/2728039.\u003c/li\u003e\n\u003cli\u003eSon, Tran Hoang Duy \u0026amp; Pham Vu Hong. 2023. 
\u0026quot;Research on applying machine learning models to predict the electricity generation capacity of rooftop solar energy systems on buildings.\u0026quot; \u003cem\u003eAsian Journal of Civil Engineering.\u003c/em\u003e https://doi.org/10.1007/s42107-023-00722-1.\u003c/li\u003e\n\u003cli\u003eTsolacos, Chris Brooks \u0026amp; Sotiris. 2010. \u0026quot;Forecasting real estate returns using financial spreads.\u0026quot; \u003cem\u003eJournal of Property Research .\u003c/em\u003e https://doi.org/10.1080/09599910110060037.\u003c/li\u003e\n\u003cli\u003eThi Nhu Hoa Le, Van Anh Mai, Cong Van Nguyen. 2020. \u0026quot;Determinants of profitability: evidence from construction companies listed on Vietnam Securities Market.\u0026quot; \u003cem\u003eManagement Science Letters.\u003c/em\u003e https://doi.org/10.5267/j.msl.2019.9.028.\u003c/li\u003e\n\u003cli\u003eWassie, Fekadu Agmas. 2020. \u0026quot;Impacts of capital structure: profitability of construction companies in Ethiopia.\u0026quot; \u003cem\u003eJournal of Financial Management of Property and Construction.\u003c/em\u003e 10.1108/JFMPC-08-2019-0072.\u003c/li\u003e\n\u003cli\u003eWU, CHING-LUNG CHEN \u0026amp; CHEI-WEI. 2012. \u0026quot;DIAGNOSING ASSETS IMPAIRMENT BY USING RANDOM FORESTS MODEL.\u0026quot; \u003cem\u003eInternational Journal of Information Technology \u0026amp; Decision Making.\u003c/em\u003e https://doi.org/10.1142/S0219622012500046.\u003c/li\u003e\n\u003cli\u003eYongsong Cai, Qi Yin, Qian Su, Xinyu Huang, Yin Zhang and Ting Liu. 2020. \u0026quot;Prediction Method of Enterprise Return on Net Assets Based on Improved Random Forest Algorithm.\u0026quot; \u003cem\u003eConference Series.\u003c/em\u003e https://doi.org/10.1088/1742-6596/1682/1/012083.\u003c/li\u003e\n\u003cli\u003eZhu, Zheng Tan \u0026amp; Ziqin Yan \u0026amp; Guangwei. 2019. 
\u0026quot;Stock selection with random forest: An exploitation of excess return in the Chinese stock market.\u0026quot; \u003cem\u003eHeliyon .\u003c/em\u003e https://doi.org/10.1016/j.heliyon.2019.e02310.\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":true,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"asian-journal-of-civil-engineering","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"","sideBox":"Learn more about [Asian Journal of Civil Engineering](https://www.springer.com/journal/42107)","snPcode":"42107","submissionUrl":"https://submission.nature.com/new-submission/42107/3","title":"Asian Journal of Civil Engineering","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"Springer Hybrid","inReviewEnabled":true,"inReviewRevisionsEnabled":false},"keywords":"profit, working capital, debt ratio, growth rate, Vietnamese construction enterprises, machine learning models, optimization","lastPublishedDoi":"10.21203/rs.3.rs-4129810/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-4129810/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"Return on Assets (ROA), a profitability measure, is crucial in corporate finance for assessing how efficiently a company uses assets to generate profit. Currently, predicting the ROA index is a tedious, manual process. It usually involves making educated guesses or waiting for the accurate data, which becomes available only after financial reports have been compiled. This paper introduces a machine learning model for predicting the ROA index. The model draws data from 78 companies listed on the Vietnam Stock Exchanges (HOSE and HNX) over the span of 2012 to 2022. The Random Forest (RF) model was put to the test using datasets from selected Vietnamese businesses in 2023. The results demonstrated a high level of precision, with an error rate of less than 1%, an R2 value of 0.9762, and a Root Mean Square Error (RMSE) of 0.5826. These findings indicate potential real-world uses in predicting and boosting business performance. 
In conclusion, the integration of machine learning in financial analysis and prediction represents substantial progress. It enhances both accuracy and efficiency and holds promise for future advancements in financial management practices. This study aims to encourage more research and development in this area, leading to more advanced and efficient financial management tools.","manuscriptTitle":"Research on Applying Machine Learning Models to Predict and Assess Return on Assets (Roa)","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2024-03-22 16:09:31","doi":"10.21203/rs.3.rs-4129810/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Revision requested","date":"2024-03-25T08:46:23+00:00","index":"","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2024-03-24T08:41:38+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"2348505c-995b-4ae6-8200-9957cd0ab78a","date":"2024-03-23T16:24:01+00:00","index":"hide","fulltext":""},{"type":"reviewersInvited","content":"","date":"2024-03-23T16:20:18+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2024-03-20T11:31:56+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2024-03-20T00:15:07+00:00","index":"","fulltext":""},{"type":"submitted","content":"Asian Journal of Civil Engineering","date":"2024-03-19T11:15:05+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"asian-journal-of-civil-engineering","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"","sideBox":"Learn more about [Asian Journal of Civil Engineering](https://www.springer.com/journal/42107)","snPcode":"42107","submissionUrl":"https://submission.nature.com/new-submission/42107/3","title":"Asian Journal of Civil Engineering","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"Springer Hybrid","inReviewEnabled":true,"inReviewRevisionsEnabled":false}}],"origin":"","ownerIdentity":"44e8a548-4d35-4cad-b908-09fb8439a4af","owner":[],"postedDate":"March 22nd, 2024","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"under-review","subjectAreas":[],"tags":[],"updatedAt":"2024-03-27T08:29:26+00:00","versionOfRecord":[],"versionCreatedAt":"2024-03-22 16:09:31","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-4129810","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-4129810","identity":"rs-4129810","version":["v1"]},"buildId":"qtupq5eGEP_6zYnWcrvyt","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
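The Results and discussion section selects models by their R² and RMSE scores and screens input variables with the Pearson correlation. As a minimal sketch of how these three quantities are computed, the following uses only the Python standard library. The function names and the sample ROA values are hypothetical illustrations and are not taken from the paper's dataset or code.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error: sqrt of the mean squared residual."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical ROA values in percent (illustration only, not the paper's data).
actual = [5.0, 7.0, 9.0, 11.0]
predicted = [5.5, 6.5, 9.5, 10.5]

print("RMSE:", rmse(actual, predicted))
print("R2:", r2_score(actual, predicted))
print("Pearson r:", pearson(actual, predicted))
```

A lower RMSE and an R² closer to 1 both indicate a better fit, which matches the selection criteria stated in the paper. The Pearson coefficient captures only linear association, so a low value does not by itself rule out a useful non-linear relationship.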