A hybrid regression framework for estimating urban residential community drainage infrastructure: Model optimization and recursive prediction

2025 · doi:10.21203/rs.3.rs-6754917/v1

preprint OA: closed CC-BY-4.0

Research Article

Kaiwei Hu, Yixiao He, Shuping Li

This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-6754917/v1. This work is licensed under a CC BY 4.0 License. Status: Under Review. Version 1.

Abstract

Urban drainage infrastructure plays a vital role in maintaining water quality and mitigating urban flood risks. However, estimating the quantity of residential drainage facilities in megacities remains challenging due to infrastructure aging, fragmented management, and data deficiencies.
This study proposes a hybrid regression framework combining multiple linear regression (MLR), support vector regression (SVR), and random forest regression (RFR) to improve prediction accuracy of drainage facility quantities at the residential community scale. Based on data from 120 residential communities in City S, the study analyzes correlations between drainage facilities (e.g., pipeline length, number of manholes) and community attributes (e.g., building area, number of buildings, households), and incorporates a recursive forecasting mechanism to enhance multi-step estimation accuracy. The MLR model was optimized using weighted least squares to address heteroscedasticity, while SVR and RFR parameters were tuned via grid search and ten-fold cross-validation. The results indicated that the weighted linear regression (WLR) model performed best in predicting pipeline lengths, while the SVR model achieved higher accuracy in predicting the number of manholes. The proposed modeling framework offers reliable data support and a practical methodology for planning and fine-tuning the management of drainage systems in highly heterogeneous megacities.

Keywords: Urban drainage infrastructure; Residential community; Hybrid regression framework; Recursive prediction; Heteroscedasticity correction

1. Introduction

With the accelerating pace of urbanization, drainage infrastructure has become increasingly critical in safeguarding urban water quality, mitigating flood risks, and supporting ecological resilience (Rodriguez et al., 2021; Dong et al., 2017). As a vital component of this system, residential drainage facilities constitute a crucial element of urban drainage management (Liu et al., 2023).
However, aging infrastructure, inadequate maintenance, and fragmented management have resulted in a series of critical issues (Fathy et al., 2020; Kozak et al., 2020), including pipeline blockages, sewage overflows, and accelerated facility deterioration (Wang et al., 2021). These problems are particularly severe in megacities, where rapid urbanization exerts greater pressure on the robustness of drainage management systems (Maglia and Raimondi, 2025; Qiao et al., 2018).

In recent years, cities around the world have adopted a variety of strategies to address similar challenges (Fu et al., 2019; Cao et al., 2024; Yazdi, 2018). Nevertheless, significant gaps remain in the existing research: international practices largely focus on macro-level infrastructure planning, with a lack of statistical modeling at the micro scale, such as for residential communities. Moreover, basic information on residential drainage facilities is often scattered and lacks a centralized database (Sun et al., 2021), while the labor required for facility enumeration is immense and difficult to complete in the short term.

Investigating the underlying mathematical relationships between residential drainage facilities and community characteristics, and developing predictive models using statistical methods to estimate facility quantities (Kattan and Gerds, 2020), are essential steps for designing effective management mechanisms (Arya and Kumar, 2023). Traditional statistical approaches often struggle to capture the complex relationships between facilities and community attributes, and existing studies frequently rely on single regression models for prediction. However, these models are constrained by data heterogeneity and nonlinear associations (Zhao et al., 2024), resulting in poor predictive accuracy and limited generalizability.
This issue is particularly pronounced in multi-step recursive forecasting, where error accumulation becomes significant (Ailobhio and Ikughur, 2024; Brooks et al., 2016).

Hybrid modeling approaches have emerged as effective solutions for addressing prediction problems in complex systems (Xu et al., 2019; Dai et al., 2022). For example, MLR serves as a fundamental and effective statistical technique for analyzing and interpreting linear relationships between variables (Etemadi and Khashei, 2021; Liao et al., 2021). Its simplicity and interpretability have facilitated its widespread application across various fields (Ciulla and D'Amico, 2019). However, MLR may perform poorly when dealing with nonlinear data, requiring the use of more advanced models (Beigzadeh et al., 2020). SVR is a powerful machine learning technique capable of effectively handling nonlinear relationships (Tsirikoglou et al., 2017; Alizadeh Gharaei et al., 2024). SVR introduces kernel functions to transform nonlinear problems from low-dimensional to high-dimensional spaces, thereby enabling linear separability (Hsia and Lin, 2020; Klopfenstein and Vaiter, 2021). RFR is an ensemble learning method that constructs multiple decision trees to model complex nonlinear relationships (Ma et al., 2023; Zhong et al., 2021). RFR improves accuracy and robustness by aggregating predictions from multiple decision trees (Neo et al., 2024; Desai and Ouarda, 2021). However, integrating the strengths of different models to build a predictive framework capable of accommodating the heterogeneity of urban drainage facilities continues to pose a critical challenge. Moreover, existing studies often focus on estimating a single type of facility, without systematically modeling interdependencies among multiple variables, thereby amplifying error propagation in recursive predictions (Wen et al., 2023).
To address the above issues, this study proposes a hybrid regression framework that integrates MLR, SVR, and RFR to improve the accuracy of predicting the quantity of drainage facilities at the residential community scale. Using empirical data from 120 residential communities in City S, the study systematically analyzes the correlations between drainage facility quantities and community characteristics to uncover their underlying mathematical patterns. To address the heteroscedasticity in traditional MLR models, weighted least squares is applied as a corrective measure, while grid search and cross-validation techniques are employed to optimize the hyperparameters of the SVR and RFR models, enhancing the models' capacity to capture nonlinear relationships and improve predictive accuracy. A recursive prediction mechanism is developed, in which the predicted results of intermediate variables, such as sanitary sewer pipeline length, are iteratively fed into subsequent models, enabling dynamic multi-step estimation of facility quantities, thereby providing data-driven technical support for the long-term management framework of residential drainage systems.

2. Research methods

2.1 Modeling software and evaluation indicators

2.1.1 Software tools

This study employs the Statistical Package for the Social Sciences (SPSS) for statistical analysis and data management (Gholami and Khaleghi, 2019). Its core functionalities include data management, descriptive statistics, exploratory data analysis, inferential statistics, and predictive modeling. In addition, the Python programming language is widely used in regression analysis, as it offers a rich ecosystem of third-party libraries, including statistical analysis packages with built-in regression tools (Hill et al., 2024), and machine learning libraries that include regression algorithms such as support vector machines and random forests.
These libraries can handle large datasets and train complex models, while also facilitating efficient data cleaning, transformation, and visualization, thereby enhancing the understanding of data distributions and relationships (Mao et al., 2024).

2.1.2 Evaluation indicators

When evaluating model fit and accuracy, two commonly used metrics are Root Mean Square Error (RMSE) and the coefficient of determination (R²) (Winter et al., 2025), as detailed in Text S1. In predictive applications, model accuracy is of primary concern. RMSE is an important metric, but it does not fully capture prediction accuracy, as it reflects only the average magnitude of errors while ignoring their distribution. In contrast, a higher R² value indicates better explanatory power of the model and generally reflects stronger statistical performance. Therefore, models with the highest R² values were prioritized during model selection.

2.2 Data preparation

2.2.1 Sample selection

Given the large number of residential communities in City S, the broad range of construction periods, and significant variation in community characteristics, sampling was necessary to obtain representative data. To ensure prediction accuracy, the sampling process needed to cover communities with diverse building sizes across the city. A multistage sampling approach combining stratified and random sampling was adopted (Chen et al., 2021). The sample size was determined based on the rule that linear regression requires at least 20 times the number of predictors. This was considered in conjunction with the drainage facility–community characteristic analysis (Text S2), which identified building area, number of buildings, and number of households as key predictors. Assuming a regression model with three predictors and the need to divide data into training and validation sets, 120 community samples were selected.
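The two evaluation indicators of Section 2.1.2 can be illustrated with a minimal sketch. It uses the standard textbook definitions of RMSE and R², which may differ in detail from the formulas in Text S1 (not reproduced here); the function names and the sample values are hypothetical and are not data from the study.

```python
import math


def rmse(observed, predicted):
    """Root mean square error: square root of the mean squared residual."""
    residuals = [o - p for o, p in zip(observed, predicted)]
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical pipeline lengths, for illustration only (not data from the study).
observed = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]
print(rmse(observed, predicted), r_squared(observed, predicted))
```

RMSE is expressed in the units of the dependent variable, whereas R² is dimensionless, which is one reason the two are reported together.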
During the multistage sampling process, a list of 3,685 communities and their building area data was obtained from City S's drainage renovation platform. First, administrative districts were used as strata, and 50% of communities were randomly selected from each stratum, yielding 1,843 samples. These were then divided into 12 layers based on ascending building area, from which 10 communities were randomly selected per layer, resulting in 120 final samples. The dataset used for analysis included independent variables (administrative district, building area, number of buildings, number of households) and dependent variables (length of sanitary sewer pipelines, length of stormwater pipelines, number of sewage manholes, number of stormwater manholes). After encoding administrative districts and removing missing and outlier values, the data were imported into SPSS for further analysis.

2.2.2 Data verification

Prior to model construction, it was necessary to test the four key assumptions of linear regression (linearity, independence, normality, and homoscedasticity) to ensure model accuracy and reliability. The Pearson correlation coefficient is a commonly used metric for measuring linear associations between variables (Haddad et al., 2022). It ranges from −1 to +1, with values closer to +1 indicating a stronger positive correlation. In SPSS, all variables were selected and analyzed using Pearson correlation coefficients. The results were presented in Table 1.

Table 1 Linearity test of sample data.
| Pearson correlation coefficient | Sanitary sewer pipeline length | Stormwater pipeline length | Sewage manhole | Stormwater manhole |
|---|---|---|---|---|
| Administrative district | -0.033 | -0.028 | -0.051 | -0.054 |
| Building area | 0.869** | 0.856** | 0.806** | 0.831** |
| Number of buildings | 0.811** | 0.804** | 0.771** | 0.763** |
| Number of households | 0.877** | 0.862** | 0.878** | 0.857** |
| Sanitary sewer pipeline length | 1 | 0.972** | 0.940** | 0.947** |
| Stormwater pipeline length | 0.972** | 1 | 0.945** | 0.974** |
| Sewage manhole | 0.940** | 0.945** | 1 | 0.953** |
| Stormwater manhole | 0.947** | 0.974** | 0.953** | 1 |

Note: * p < 0.05, ** p < 0.01

The correlation coefficients between the four dependent variables (length of sanitary sewer pipelines, length of stormwater pipelines, number of sewage manholes, and number of stormwater manholes) and the predictors (building area, number of buildings, number of households) mostly exceeded 0.8, indicating strong linear relationships (Da Cunha De Sá-Caputo et al., 2021). In addition, significant correlations were observed among the four dependent variables themselves. Therefore, recursive prediction was incorporated during model fitting, where predicted variables served as independent inputs to estimate subsequent target variables.

The assumption of independence requires that data samples be mutually independent and free from autocorrelation to ensure that model residuals are not influenced by one another. If autocorrelation exists, it may reduce the model's accuracy and cause the calculated error to underestimate the true error. This assumption was tested using the Durbin-Watson (DW) statistic, which ranges from 0 to 4 (Text S3). A value closer to 2 indicates a higher likelihood of sample independence (Kabaila et al., 2021). As shown in Table S1, the DW values for all four models were close to 2, indicating that the samples were independent and free from autocorrelation, thus satisfying the conditions for linear regression analysis.

Normality ensures the validity of parameter estimation in regression models.
If the error terms are not normally distributed, confidence intervals may become unstable. Linear regression assumes that residuals follow a normal distribution, which can be tested by plotting histograms of the residuals (Schmidt and Finan, 2018). As shown in Figure S1, the histograms for the sanitary sewer pipeline length, stormwater pipeline length, and stormwater manhole models aligned closely with the normal curve, indicating that their residuals were approximately normally distributed. However, the histogram for the sewage manhole model exhibited a sharp peak, suggesting that its residuals did not follow a normal distribution. According to the central limit theorem, for large sample sizes the sampling distribution of the mean approximates a normal distribution regardless of the population distribution. Therefore, the normality assumption is less critical for linear regression when large datasets are used (Bilon, 2023).

Homoscedasticity can be assessed using a residual plot with standardized predicted values on the x-axis and standardized residuals on the y-axis. If the residuals are randomly scattered around the horizontal line at zero without any discernible pattern or outliers, the assumption of homoscedasticity is considered to be met (Moravec et al., 2024). Conversely, if the variance of residuals increases with the predicted values of the dependent variable, a pattern known as heteroscedasticity, it can lead to inaccurate predictions, as observed in the sanitary sewer pipeline length model (Figure S2). Therefore, in subsequent model optimization, it was necessary to address heteroscedasticity to ensure accurate predictions.

2.3 Model optimization and validation methods

To optimize the model and improve statistical accuracy and fitting performance, weighting was applied to independent variables and model parameters were adjusted. The model was trained on a training set and validated using a validation set to assess its performance.
For example, in a weighted linear regression model, a weight function was first defined to construct a weighted least squares model (Meermeyer, 2015). The dataset was divided into ten folds; in each iteration, nine folds were used for training and one for validation. The model's performance was evaluated using R² and RMSE, and the optimal model was identified by selecting the one with the highest R². The final model was then output, and a residual plot was generated.

K-fold cross-validation is a widely employed technique for model evaluation (Wong and Yang, 2017), especially effective for small datasets. It involves dividing the dataset into K subsets; in each iteration, one subset is used for validation while the others are used for training. This process is repeated K times, making efficient use of the data, reducing overfitting, and providing a reliable estimate of model stability. Empirical studies have suggested that K = 10 often yields the best overall performance (Rodriguez et al., 2010). In 10-fold cross-validation, the dataset was split into ten subsets. The process was repeated ten times, with each subset used once as the validation set and the remaining nine used for training. R² and RMSE were calculated in each iteration to evaluate model performance. The model with the highest R² was selected as the best-performing model, and relevant data and visualizations were output.

3. Results and discussion

3.1 Model fitting, optimization and validation

3.1.1 MLR model fitting

This study examined four dependent variables: the length of sanitary sewer pipelines, the length of stormwater pipelines, the number of sewage manholes, and the number of stormwater manholes. The independent variables included building area, number of buildings, and number of households.
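The weighted least squares fit and ten-fold cross-validation described in Section 2.3 can be sketched with NumPy as follows. This is a sketch under stated assumptions, not the SPSS procedure used in the study: the weight function `1 / area**2`, the helper names, and the synthetic data are all hypothetical.

```python
import numpy as np


def fit_wls(X, y, w):
    """Weighted least squares: minimise sum_i w_i * (y_i - b0 - x_i . b)^2."""
    design = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    sw = np.sqrt(w)
    # Scaling rows by sqrt(w) turns the weighted problem into ordinary least squares.
    beta, *_ = np.linalg.lstsq(design * sw[:, None], y * sw, rcond=None)
    return beta


def predict(beta, X):
    return beta[0] + X @ beta[1:]


def kfold_wls(X, y, w, k=10, seed=0):
    """Return one (R2, RMSE) pair per fold, each measured on the held-out fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        beta = fit_wls(X[train], y[train], w[train])
        resid = y[val] - predict(beta, X[val])
        rmse = float(np.sqrt(np.mean(resid ** 2)))
        ss_tot = float(np.sum((y[val] - y[val].mean()) ** 2))
        scores.append((1.0 - float(np.sum(resid ** 2)) / ss_tot, rmse))
    return scores


# Synthetic heteroscedastic data (hypothetical): noise grows with the predictor.
rng = np.random.default_rng(42)
area = rng.uniform(1.0, 10.0, 120)
y = 50.0 + 20.0 * area + rng.normal(0.0, 2.0 * area)
scores = kfold_wls(area[:, None], y, w=1.0 / area ** 2)
print(max(scores))  # fold with the highest validation R2
```

Down-weighting observations with large predictor values is one common way to counter residual variance that grows with the fitted value; the appropriate weight function depends on the data.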
Since the length of sanitary sewer pipelines had a correlation coefficient above 0.9 with the other three dependent variables, it was considered significantly correlated and was used as a new predictor in recursive forecasting. Specifically, sanitary sewer pipeline length was first predicted and then used as an input variable in the other three models. The detailed variable names were listed in Table S2.

SPSS software was utilized to perform backward elimination for variable selection and conduct preliminary linear regression analysis for the four models (Table S3). When evaluating a linear regression model, four key aspects were considered: (1) model significance (Magara and Boury-Jamot, 2024), where a p-value less than 0.05 indicates statistical relevance; (2) goodness-of-fit, measured by R², with values closer to 1 indicating better model performance; (3) multicollinearity, assessed using the variance inflation factor (VIF) (Salmerón et al., 2018), where values below 10 suggest no multicollinearity; and (4) residual analysis (Kozak and Piepho, 2018), which helps examine unexplained variance and assess the model's assumptions. Together, these factors offer a comprehensive assessment of the overall model quality. Table 2 indicated that all four models met the criteria of significance level < 0.05 and VIF < 10, confirming that the models were valid and free from multicollinearity. Among them, the sanitary sewer pipeline length model achieved an R² value of 0.877, indicating a good fit but still leaving room for improvement.

Table 2 Multiple linear regression model fitting results.
| Model name / predictor | Model significance | R² | Adjusted R² | VIF |
|---|---|---|---|---|
| Model 1: Sanitary sewer pipeline length | < 0.001 | 0.877 | 0.873 | |
| Building area | | | | 3.366 |
| Number of buildings | | | | 2.186 |
| Number of households | | | | 3.582 |
| Model 2: Stormwater pipeline length | < 0.001 | 0.946 | 0.945 | |
| Building area | | | | 4.099 |
| Number of buildings | | | | 2.926 |
| Sanitary sewer pipeline length | | | | 6.244 |
| Model 3: Sewage manhole | < 0.001 | 0.899 | 0.897 | |
| Building area | | | | 4.406 |
| Number of households | | | | 4.649 |
| Sanitary sewer pipeline length | | | | 6.056 |
| Model 4: Stormwater manhole | < 0.001 | 0.901 | 0.899 | |
| Number of households | | | | 4.317 |
| Sanitary sewer pipeline length | | | | 4.317 |

Figure S3 showed that the residuals of all four models increased with the predicted values, indicating the presence of heteroscedasticity. This issue compromised predictive accuracy (Amado et al., 2025), as samples with large dependent variable values tended to be underestimated, while those with small values might be overestimated.

3.1.2 MLR model optimization and verification

To address this issue, weighted least squares (WLS) regression was applied. The process involved first using SPSS's weight estimation function to calculate variable weights, followed by performing weighted linear regression based on the optimal weight coefficients. The model with the highest R² was selected as the optimal one. As shown in Figure S4, the residuals of the four models were evenly distributed above and below zero, indicating that the assumption of homoscedasticity was satisfied. Among the four models, only the stormwater pipeline length model achieved an R² value greater than 0.9, indicating a strong fit. The remaining three models showed comparatively weaker performance and required further validation of the linear regression approach. Results from ten-fold cross-validation of the weighted linear regression models were presented in Table 3.

Table 3 Model weighted linear regression and cross validation of optimized results.
| Model name | Term | Coefficient | Model significance | R² | Optimized R² | RMSE |
|---|---|---|---|---|---|---|
| Model 1: Sanitary sewer pipeline length | Constant | 36.837 | < 0.001 | 0.873 | 0.966 | 490.290 |
| | Building area | 0.018 | | | | |
| | Number of buildings | 12.568 | | | | |
| | Number of households | 0.948 | | | | |
| Model 2: Stormwater pipeline length | Constant | 9.536 | < 0.001 | 0.963 | 0.991 | 118.092 |
| | Building area | 0.004 | | | | |
| | Sanitary sewer pipeline length | 0.885 | | | | |
| Model 3: Sewage manhole | Constant | 0.044 | < 0.001 | 0.894 | 0.959 | 127.814 |
| | Number of households | 0.112 | | | | |
| | Sanitary sewer pipeline length | 0.112 | | | | |
| Model 4: Stormwater manhole | Constant | 0.355 | < 0.001 | 0.884 | 0.943 | 78.356 |
| | Number of households | 0.082 | | | | |
| | Sanitary sewer pipeline length | 0.074 | | | | |

All models showed statistical significance (p < 0.001), with increased R² values and reduced RMSE, indicating improved model performance. Figure 1 illustrated the fit between observed and predicted values for each model. The stormwater pipeline length model demonstrated the best performance (R² = 0.991). All four models had R² values exceeding 0.9, indicating suitability for predicting new data. Figure 2 showed the residual plots of the models. The residuals of the sewage and stormwater pipeline length models were evenly distributed above and below zero, suggesting homoscedasticity. In contrast, heteroscedasticity was observed in the sewage and stormwater manhole models. Although model optimization improved the R² values and fitting accuracy, residual plots still exhibited signs of heteroscedasticity, indicating that the models did not fully capture the variability of the error terms.

3.1.3 SVR model fitting

In nonlinear regression analysis, although linear regression may exclude some variables with weak linear correlations, these variables may still exhibit nonlinear relationships with the dependent variable. Therefore, building area, number of buildings, and number of households were retained as independent variables. In this study, the dataset was divided into a 7:3 split for training and testing.
The training set was used to build the SVR model, with the radial basis function (RBF) selected as the kernel. The model's accuracy was evaluated using RMSE and R². The key Python code for the SVR model was provided in Text S4. The results from the code execution (Table S4) showed that all models had R² values below 0.9 and relatively large RMSE, indicating poor fit and prediction accuracy. Therefore, further parameter adjustments and optimization were necessary.

3.1.4 SVR model optimization and verification

The performance of the SVR model was influenced not only by the data structure but also by its sensitivity to hyperparameter settings, primarily the regularization parameter (C) and the epsilon-insensitive loss parameter (ε) (Hsia and Lin, 2020b). The parameter C controls model complexity: smaller values encourage a simpler model, while larger values enhance the fit to training data. ε defines the prediction tolerance and sensitivity to noise: smaller values increase sensitivity to errors, while larger values result in smoother predictions. Hyperparameter tuning was typically performed using grid search combined with cross-validation (Xie et al., 2023), with C set to 0.1, 1, 10, and 100, and ε set to 0.01, 0.1, 1, and 10. These combinations were implemented programmatically to build and optimize the SVR model (Code in Text S5). The results in Table 4 showed that the R² of the sewage manhole and stormwater manhole models increased, while RMSE values decreased, indicating improved accuracy of the SVR models.

Table 4 Optimized support vector regression model results.
| Model name | Independent variables | C | ε | R² | RMSE |
|---|---|---|---|---|---|
| Sanitary sewer pipeline length | Building area; Number of buildings; Number of households | 10 | 0.01 | 0.895 | 310.844 |
| Stormwater pipeline length | Building area; Number of buildings; Number of households; Sanitary sewer pipeline length | 10 | 0.01 | 0.989 | 142.897 |
| Sewage manhole | Building area; Number of buildings; Number of households; Sanitary sewer pipeline length | 10 | 0.1 | 0.957 | 69.505 |
| Stormwater manhole | Number of households; Sanitary sewer pipeline length | 10 | 0.1 | 0.971 | 32.278 |

Figure 3 showed the fit between the real and predicted values of the four dependent variables in the SVR model. The lines for the sewage manhole and stormwater manhole models nearly coincided, indicating excellent fit and suitability for predicting new data. Figure 4 showed the residual plots of the models. The residuals for the sanitary sewer pipeline length and stormwater pipeline length models fanned out along the zero axis, indicating the presence of heteroscedasticity. In contrast, the residuals for the sewage manhole and stormwater manhole models were evenly distributed above and below zero, showing good homoscedasticity and improvement in heteroscedasticity. This indicated that the SVR model could capture the variability of error terms at different levels.

3.1.5 RFR model fitting

The fitting procedure for the RFR model was similar to that of the SVR model. The dataset was randomly split into training and validation sets in a 7:3 ratio. The RFR model was implemented using functions from the scikit-learn (sklearn) library, and its performance was evaluated based on two key metrics: RMSE and R². The implementation details were provided in Text S6. As shown in Table S5, the RFR model achieved a slight improvement in R² compared to the SVR model. However, the R² values remained below 0.9, indicating a relatively poor model fit. Moreover, the large RMSE values suggested a significant average deviation between the predicted and actual values, reflecting low predictive accuracy.
These results highlighted the need for further model tuning and optimization.

3.1.6 RFR model optimization and verification

The performance of the RFR model was primarily determined by two key hyperparameters: the number of estimators (n_estimators) and the maximum depth of the trees (max_depth) (Dai et al., 2018). The n_estimators parameter defines the number of decision trees in the forest. A larger number generally improves model performance but increases computational cost. Common values include 300, 500, and 1000. The max_depth parameter controls the maximum depth of each tree. While deeper trees can improve model fitting, they are more prone to overfitting. Typical tuning begins with small values such as 3, 5, or 7. Hyperparameter tuning was performed using grid search (Zong et al., 2025) in combination with ten-fold cross-validation. Parameter combinations were implemented programmatically to build and optimize the model (Text S7). According to Table 5, the R² values obtained from the RFR model were the lowest, and the RMSE values were the highest among the three regression approaches, indicating that the model had the least accurate predictive performance.

Table 5 Optimized random forest regression model results.

| Model name | Independent variables | n_estimators | max_depth | R² | RMSE |
|---|---|---|---|---|---|
| Sanitary sewer pipeline length | Building area; Number of buildings; Number of households | 1000 | 3 | 0.867 | 871.871 |
| Stormwater pipeline length | Building area; Sanitary sewer pipeline length | 500 | 3 | 0.977 | 166.868 |
| Sewage manhole | Building area; Number of buildings; Number of households; Sanitary sewer pipeline length | 300 | 3 | 0.898 | 107.203 |
| Stormwater manhole | Number of households; Sanitary sewer pipeline length | 1000 | 5 | 0.964 | 52.194 |

As illustrated in Fig. 5, the predicted values showed low consistency with the actual values for all four dependent variables, further confirming that the RFR model was not well-suited for this forecasting task.
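The grid search with ten-fold cross-validation used for both SVR (Section 3.1.4) and RFR (Section 3.1.6) might look roughly like the following scikit-learn sketch. It is not the code of Text S5 or Text S7, which is not reproduced here. The SVR grid uses the C and ε values stated in the text; the RFR grid is assumed from the values the text calls common or typical. The feature scaling step, the helper names, and the synthetic data are assumptions added for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Candidate values: SVR grid as stated in Section 3.1.4, RFR grid assumed from Section 3.1.6.
SVR_GRID = {"svr__C": [0.1, 1, 10, 100], "svr__epsilon": [0.01, 0.1, 1, 10]}
RFR_GRID = {"n_estimators": [300, 500, 1000], "max_depth": [3, 5, 7]}


def make_svr():
    # Standardising the inputs is an assumption; RBF kernels are scale-sensitive.
    return Pipeline([("scale", StandardScaler()), ("svr", SVR(kernel="rbf"))])


def make_rfr(seed=0):
    return RandomForestRegressor(random_state=seed)


def tune(estimator, grid, X, y, seed=0):
    """Exhaustive grid search scored by mean R2 over ten shuffled folds."""
    cv = KFold(n_splits=10, shuffle=True, random_state=seed)
    search = GridSearchCV(estimator, grid, cv=cv, scoring="r2")
    search.fit(X, y)
    return search.best_params_, search.best_score_


def synthetic_communities(n=120, seed=0):
    """Hypothetical stand-in data: households and sewer length -> manhole count."""
    rng = np.random.default_rng(seed)
    households = rng.uniform(50.0, 500.0, n)
    sewer_length = 5.0 * households + rng.normal(0.0, 50.0, n)
    manholes = 0.08 * households + 0.07 * sewer_length + rng.normal(0.0, 5.0, n)
    return np.column_stack([households, sewer_length]), manholes
```

A call such as `tune(make_svr(), SVR_GRID, *synthetic_communities())` returns the best parameter combination and its mean cross-validated R²; `tune(make_rfr(), RFR_GRID, X, y)` follows the same pattern but is considerably slower because each candidate trains hundreds of trees per fold.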
3.2 Model comparison

Figure 6 compared the models developed using three different regression methods. Overall, the WLR and SVR models demonstrated better fitting performance. Specifically, in the sewage and stormwater pipeline length models, the R² values obtained from linear regression were higher than those of the other two methods, demonstrating its superior accuracy and suitability for estimating these variables. In contrast, for the sewage and stormwater manhole models, the SVR method achieved higher R² values and the lowest RMSE, indicating better accuracy and smaller prediction errors, and thus was more appropriate for estimating manhole numbers.

The fitting and residual plots provided a visual comparison of the three regression methods across different dependent variables. As shown in Fig. 7, linear regression performed best in fitting sewage and stormwater pipeline lengths, whereas SVR showed superior performance in modeling sewage and stormwater manholes. Furthermore, the residual plots in Fig. 8 demonstrated the symmetry of residual distribution around the zero line, supporting the observed trends in model performance.

3.3 Model results

3.3.1 Model 1: Sanitary sewer pipeline length

The dependent variable in the first model was sanitary sewer pipeline length (S), with building area (A), number of buildings (B), and number of households (H) in the residential community as the independent variables. A WLR model was applied, and the resulting equation, written with the coefficients reported in Table 3, was expressed as follows:

$$S = 36.837 + 0.018A + 12.568B + 0.948H$$

3.3.2 Model 2: Stormwater pipeline length

For the stormwater pipeline length model, the dependent variable was stormwater pipeline length, and the independent variables included building area (A) and the predicted sanitary sewer pipeline length (S). The same WLR approach was employed, with recursive prediction based on the first model's output.
The model equation, written with the coefficients reported in Table 3, is as follows:

$$\text{Stormwater pipeline length} = 9.536 + 0.004A + 0.885S$$

3.3.3 Model 3: Sewage manhole

The dependent variable in this model was the number of sewage manholes, while the independent variables included building area, number of buildings, number of households, and sanitary sewer pipeline length. An SVR model was employed. Unlike linear regression models, which offer explicit mathematical expressions, SVR utilizes nonlinear kernel functions for data transformation, leading to greater model complexity. Consequently, the model cannot be expressed with a simple formula and must be implemented using Python programming to estimate new data.

3.3.4 Model 4: Stormwater manhole

The dependent variable in this model was the number of stormwater manholes, with the number of households and sanitary sewer pipeline length serving as the independent variables. An SVR model was applied, which required implementation through Python programming to perform predictions on new data.

3.4 Error analysis

3.4.1 Data measurement error

Measurement errors may arise during data collection (Stoklosa et al., 2016), such as inaccuracies from manual measurements in design drawings or the inclusion of secondary pipelines in automated CAD statistics. These errors can introduce bias or noise into the dataset, thereby compromising the statistical accuracy of the model.

3.4.2 Small number of training samples

Typically, building effective statistical models using random forest or support vector machine algorithms requires 1,000 to 10,000 samples. A smaller sample size may limit the model's ability to learn underlying patterns and trends, potentially causing underfitting or overfitting and reducing predictive accuracy (Lee and Yang, 2022).

3.4.3 Error propagation

In multi-step predictions using recursive forecasting, errors may accumulate progressively and propagate through subsequent steps.
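The recursive chain, and the way a first-step error carries into the second step, can be sketched with the WLR coefficients of Models 1 and 2 taken at face value from Table 3. The function names and the input values are hypothetical, and the `sewer_error` argument is an illustrative device for injecting a first-step error, not part of the study's method.

```python
def predict_sewer_length(area, buildings, households):
    """Model 1 (WLR), coefficients as reported in Table 3."""
    return 36.837 + 0.018 * area + 12.568 * buildings + 0.948 * households


def predict_storm_length(area, sewer_length):
    """Model 2 (WLR), fed with the sewer length predicted by Model 1."""
    return 9.536 + 0.004 * area + 0.885 * sewer_length


def recursive_prediction(area, buildings, households, sewer_error=0.0):
    """Chain Model 1 into Model 2; sewer_error perturbs the first-step output."""
    sewer = predict_sewer_length(area, buildings, households) + sewer_error
    storm = predict_storm_length(area, sewer)
    return sewer, storm


# Hypothetical community (values chosen for illustration only).
base = recursive_prediction(area=10000, buildings=10, households=200)
shifted = recursive_prediction(area=10000, buildings=10, households=200, sewer_error=10.0)
print(base, shifted[1] - base[1])
```

Because Model 2 is linear in the sewer length, a first-step error of 10 shifts the stormwater estimate by about 0.885 × 10 = 8.85; the manhole models, which are SVR-based, would propagate the same input error in a way that cannot be read off a single coefficient.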
Even if the initial statistical errors are minor, they may compound over time, resulting in substantial deviations between the predicted and actual values.

4. Conclusions

This study developed a hybrid regression model integrating MLR, SVR, and RFR to estimate the number of drainage facilities in urban residential communities. The results indicated that drainage facility estimation was affected not only by structural variables such as building area, number of buildings, and number of households, but also by complex nonlinear relationships. Therefore, the MLR model was optimized using weighted least squares, while the SVR and RFR models were tuned via grid search and ten-fold cross-validation, resulting in improved estimation accuracy. The findings showed that the WLR model effectively predicted pipeline length, achieving R² values above 0.96. For estimating the number of sewage and stormwater manholes, the SVR model outperformed others due to its strong nonlinear fitting capabilities. Moreover, incorporating a recursive modeling mechanism enhanced prediction accuracy by capturing interdependencies among variables. However, despite model optimization, measurement errors and limited sample sizes remained significant sources of uncertainty, and recursive forecasting introduced the risk of cumulative error propagation. Overall, this study provides a scientific foundation and practical framework for estimating urban drainage infrastructure, contributing to the advancement of data-driven and intelligent urban drainage management systems.

Declarations

Funding: The authors declare that no funds, grants, or other support were received during the preparation of this manuscript.

Competing Interests: The authors have no relevant financial or non-financial interests to disclose.

Author Contributions: All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by Kaiwei Hu, Yixiao He and Shuping Li.
The first draft of the manuscript was written by Kaiwei Hu and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

References

Ailobhio DT, Ikughur JA (2024) A Review of Some Goodness-of-Fit Tests for Logistic Regression Model. Asian J Probab Stat 26(7):75–85
Alizadeh Gharaei MS, Ramezani Y, Nazeri Tahroudi M (2024) Toward coupling of nonlinear support vector regression and crowd intelligence optimization algorithms in estimation of suspended sediment load. Appl Water Sci 14(9)
Amado C, Bianco AM, Boente G, Rodrigues IM (2025) Robust estimation of heteroscedastic regression models: a brief overview and new proposals. Stat Pap 66(3)
Arya S, Kumar A (2023) Evaluation of stormwater management approaches and challenges in urban flood control. Urban Clim 51:101643
Beigzadeh B, Bahrami M, Amiri MJ, Mahmoudi MR (2020) A new approach in adsorption modeling using random forest regression, Bayesian multiple linear regression, and multiple linear regression: 2,4-D adsorption by a green adsorbent. Water Sci Technol 82(8):1586–1602
Bilon XJ (2023) Normality and significance testing in simple linear regression model for large sample sizes: a simulation study. Commun Stat Simul Comput 52(6):2781–2797
Brooks C, Burke SP, Stanescu S (2016) Finite sample weighting of recursive forecast errors. Int J Forecast 32(2):458–474
Cao T, Truong V, Nguyen N (2024) An efficient optimization framework for Urban drainage system design. E3S Web of Conferences 533:5001
Chen Z, Sarkar A, Li X, Xia X (2021) Effects of joint adoption for multiple green production technologies on welfare-a survey of 650 kiwi growers in Shaanxi and Sichuan. Int J Clim Chang Strateg Manag 13(3):229–249
Ciulla G, D'Amico A (2019) Building energy performance forecasting: A multiple linear regression approach. Appl Energy 253:113500
Sá-Caputo DCD, Sonza D, Coelho-Oliveira A, Pessanha-Freitas AC, Reis J, Francisca-Santos AS, Dos Anjos A, Paineiras-Domingos EM, de Rezende Bessa Guerra LL, Da Silva Franco T, Xavier A, Barbosa VL, Silva E, Moura-Fernandes CJ, Mendonça MC, Lacerda VAR, Mulder AC, Seixas A, Sartorio A, Taiar A, Bernardo-Filho R (2021) M., Evaluation of the Relationships between Simple Anthropometric Measures and Bioelectrical Impedance Assessment Variables with Multivariate Linear Regression Models to Estimate Body Composition and Fat Distribution in Adults: Preliminary Results. Biology 10(11):1209
Dai B, Gu C, Zhao E, Qin X (2018) Statistical model optimized random forest regression model for concrete dam deformation monitoring. Struct Control Health Monit 25(6):e2170
Dai W, Mohammadi S, Cremaschi S (2022) A hybrid modeling framework using dimensional analysis for erosion predictions. Comput Chem Eng 156:107577
Desai S, Ouarda TBMJ (2021) Regional hydrological frequency analysis at ungauged sites with random forest regression. J Hydrol 594:125861
Dong X, Guo H, Zeng S (2017) Enhancing future resilience in urban drainage system: Green versus grey infrastructure. Water Res 124:280–289
Etemadi S, Khashei M (2021) Etemadi multiple linear regression. Measurement 186:110080
Fathy I, Abdel-Aal GM, Fahmy MR, Fathy A, Zeleňáková M (2020) The Negative Impact of Blockage on Storm Water Drainage Network. Water 12(7):1974
Fu X, Goddard H, Wang X, Hopton ME (2019) Development of a scenario-based stormwater management planning support system for reducing combined sewer overflows (CSOs). J Environ Manage 236:571–580
Gholami V, Khaleghi MR (2019) A comparative study of the performance of artificial neural network and multivariate regression in simulating springs discharge in the Caspian Southern Watersheds, Iran. Appl Water Sci 9(1)
Haddad S, Boukhayma A, Caizzone A (2022) Continuous PPG-Based Blood Pressure Monitoring Using Multi-Linear Regression. IEEE J Biomed Health Inf 26(5):2096–2105
Hill C, Du L, Johnson M, McCullough BD (2024) Comparing programming languages for data analytics: Accuracy of estimation in Python and R. WIREs Data Min Knowl Discov 14(3)
Hsia J, Lin C (2020a) Parameter Selection for Linear Support Vector Regression. IEEE Trans Neural Netw Learn Syst 31(12):5639–5644
Hsia J, Lin C (2020b) Parameter Selection for Linear Support Vector Regression. IEEE Trans Neural Netw Learn Syst 31(12):5639–5644
Kabaila P, Farchione D, Alhelli S, Bragg N (2021) The effect of a Durbin–Watson pretest on confidence intervals in regression. Stat Neerl 75(1):4–23
Kattan MW, Gerds TA (2020) A Framework for the Evaluation of Statistical Prediction Models. Chest 158(1):S29–S38
Klopfenstein Q, Vaiter S (2021) Linear support vector regression with linear constraints. Mach Learn 110(7):1939–1974
Kozak M, Piepho HP (2018) What's normal anyway? Residual plots are more telling than significance tests when checking ANOVA assumptions. J Agron Crop Sci 204(1):86–98
Kozak S, Petterson S, McAlister T, Jennison I, Bagraith S, Roiko A (2020) Utility of QMRA to compare health risks associated with alternative urban sewer overflow management strategies. J Environ Manage 262:110309
Lee J, Yang C (2022) Deep neural network and meta-learning-based reactive sputtering with small data sample counts. J Manuf Syst 62:703–717
Liao K, Park ES, Zhang J, Cheng L, Ji D, Ying Q, Yu JZ (2021) A multiple linear regression model with multiplicative log-normal error term for atmospheric concentration data. Sci Total Environ 767:144282
Liu Y, Zhao W, Wei Y, Sebastian FSM, Wang M (2023) Urban waterlogging control: A novel method to urban drainage pipes reconstruction, systematic and automated. J Clean Prod 418:137950
Ma X, Zhang J, Wang P, Zhou L, Sun Y (2023) Estimating the nonlinear response of landscape patterns to ecological resilience using a random forest algorithm: Evidence from the Yangtze River Delta. Ecol Indic 153:110409
Magara F, Boury-Jamot B (2024) About statistical significance, and the lack thereof. Lab Anim 58(5):448–452
Maglia N, Raimondi A (2025) A new approach on design and verification of integrated sustainable urban drainage systems for stormwater management in urban areas. J Environ Manage 373:123882
Mao K, Chen C, Zhang J, Li Y (2024) ORLEP: an efficient offline reinforcement learning evaluation platform. Multimed Tools Appl 83(12):37073–37087
Meermeyer M (2015) Weighted linear regression models with fixed weights and spherical disturbances. Comput Stat 30(4):929–955
Moravec M, Pinosova M, Badida M, Izarikova G, Badidova M (2024) Analysis of the Acoustic Parameters of Building Partition Structures of Varying Composition. Buildings-Basel 14(8):2440
Neo PK, Leong YW, Soon MF, Goh QS, Thumsorn S, Ito H (2024) Development of a Machine Learning Model to Predict the Color of Extruded Thermoplastic Resins. Polymers 16(4):481
Qiao X, Kristoffersson A, Randrup TB (2018) Challenges to implementing urban sustainable stormwater management from a governance perspective: A literature review. J Clean Prod 196:943–952
Rodriguez JD, Perez A, Lozano JA (2010) Sensitivity Analysis of k-Fold Cross Validation in Prediction Error Estimation. IEEE Trans Pattern Anal Mach Intell 32(3):569–575
Rodriguez M, Fu G, Butler D, Yuan Z, Sharma K (2021) Exploring the Spatial Impact of Green Infrastructure on Urban Drainage Resilience. Water 13(13):1789
Salmerón R, García CB, García J (2018) Variance Inflation Factor and Condition Number in multiple linear regression. J Stat Comput Simul 88(12):2365–2384
Schmidt AF, Finan C (2018) Linear regression and the normality assumption. J Clin Epidemiol 98:146–151
Stoklosa J, Huang Y, Furlan E, Hwang W (2016) On quadratic logistic regression models when predictor variables are subject to measurement error. Comput Stat Data Anal 95:109–121
Sun Y, Hu X, Li Y, Peng Y, Yu Y (2021) A framework for deriving dispatching rules of integrated urban drainage systems. J Environ Manage 298:113401
Tsirikoglou P, Abraham S, Contino F, Lacor C, Ghorbaniasl G (2017) A hyperparameters selection technique for support vector regression models. Appl Soft Comput 61:139–148
Wang J, Liu G, Wang J, Xu X, Shao Y, Zhang Q, Liu Y, Qi L, Wang H (2021) Current status, existent problems, and coping strategy of urban drainage pipeline network in China. Environ Sci Pollut Res 28(32):43035–43049
Wen K, Wu W, Wu X (2023) Electricity demand forecasting and risk management using Gaussian process model with error propagation. J Forecast 42(4):957–969
Winter AR, Zhu Y, Asimow NG, Patel MY, Cohen RC (2025) A Scalable Calibration Method for Enhanced Accuracy in Dense Air Quality Monitoring Networks. Environ Sci Technol 59(5):2599–2610
Wong T, Yang N (2017) Dependency Analysis of Accuracy Estimates in k-Fold Cross Validation. IEEE Trans Knowl Data Eng 29(11):2417–2427
Xie S, Lin H, Chen Y, Duan H, Liu H, Liu B (2023) Prediction of shear strength of rock fractures using support vector regression and grid search optimization. Mater Today Commun 36:106780
Xu W, Peng H, Zeng X, Zhou F, Tian X, Peng X (2019) A hybrid modelling method for time series forecasting based on a linear regression model and deep learning. Appl Intell 49(8):3002–3015
Yazdi J (2018) Water quality monitoring network design for urban drainage systems, an entropy method. Urban Water J 15(3):227–233
Zhao J, Lao F, Yan G, Zhang Y (2024) How data heterogeneity affects innovating knowledge and information in gene identification: A statistical learning perspective. J Innov Knowl 9(3):100514
Zhong Y, Yang H, Zhang Y, Li P (2021) Online Rebuilding Regression Random Forests. Knowledge-Based Syst 221:106960
Zong Y, Nian Y, Zhang C, Tang X, Wang L, Zhang L (2025) Hybrid Grid Search and Bayesian optimization-based random forest regression for predicting material compression pressure in manufacturing processes. Eng Appl Artif Intell 141:109580

Supplementary Files

Highlights.docx
Supplementarymaterial.docx

Status: Under Review

Editor invited by journal: 03 Dec, 2025
Reviewers agreed at journal: 16 Jul, 2025
Reviewers invited by journal: 16 Jul, 2025
Editor assigned by journal: 27 May, 2025
First submitted to journal: 27 May, 2025
Figure 1. WLR model fitting diagram: (a) sanitary sewer pipeline length model, (b) stormwater pipeline length model, (c) sewage manhole model, and (d) stormwater manhole model.
Figure 2. Residual plots of WLR models: (a) sanitary sewer pipeline length model, (b) stormwater pipeline length model, (c) sewage manhole model, and (d) stormwater manhole model.
Figure 3. SVR model fitting diagram: (a) sanitary sewer pipeline length model, (b) stormwater pipeline length model, (c) sewage manhole model, and (d) stormwater manhole model.
Figure 4. Residual plots of SVR models: (a) sanitary sewer pipeline length model, (b) stormwater pipeline length model, (c) sewage manhole model, and (d) stormwater manhole model.
Figure 5. RFR model fitting diagram: (a) sanitary sewer pipeline length model, (b) stormwater pipeline length model, (c) sewage manhole model, and (d) stormwater manhole model.
Figure 6. Comparison of the evaluation results for the WLR, SVR, and RFR models: (a) sanitary sewer pipeline length model, (b) stormwater pipeline length model, (c) sewage manhole model, and (d) stormwater manhole model.
Figure 7. Comparison of the model fitting results for the WLR, SVR, and RFR models: (a) sanitary sewer pipeline length model, (b) stormwater pipeline length model, (c) sewage manhole model, and (d) stormwater manhole model.
Figure 8. Comparison of the residuals for the WLR, SVR, and RFR models: (a) sanitary sewer pipeline length model, (b) stormwater pipeline length model, (c) sewage manhole model, and (d) stormwater manhole model.
align=\"left\" colname=\"c5\"\u003e\u003cp\u003e-0.054\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eBuilding area\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e0.869**\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e0.856**\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e0.806**\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e0.831**\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eNumber of buildings\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e0.811**\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e0.804**\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e0.771**\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e0.763**\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eNumber of households\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e0.877**\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e0.862**\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e0.878**\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e0.857**\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eSanitary sewer pipeline length\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e1\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e0.972**\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" 
colname=\"c4\"\u003e\u003cp\u003e0.940**\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e0.947**\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eStormwater pipeline length\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e0.972**\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e1\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e0.945**\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e0.974**\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eSewage manhole\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e0.940**\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e0.945**\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e1\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e0.953**\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eStormwater manhole\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e0.947**\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e0.974**\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e0.953**\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e1\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colspan=\"5\" nameend=\"c5\" namest=\"c1\"\u003e\u003cp\u003eNote: * p\u0026thinsp;\u0026lt;\u0026thinsp;0.05, ** 
p\u0026thinsp;\u0026lt;\u0026thinsp;0.01\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eThe correlation coefficients between the four dependent variables\u0026mdash;length of sanitary sewer pipelines, length of stormwater pipelines, number of sewage manholes, and number of stormwater manholes\u0026mdash;and the predictors (building area, number of buildings, number of households) mostly exceeded 0.8, indicating strong linear relationships (Da Cunha De S\u0026aacute;-Caputo et al., \u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e2021\u003c/span\u003e). In addition, significant correlations were observed among the four dependent variables themselves. Therefore, recursive prediction was incorporated during model fitting, where predicted variables served as independent inputs to estimate subsequent target variables.\u003c/p\u003e\u003cp\u003eThe assumption of independence requires that data samples be mutually independent and free from autocorrelation to ensure that model residuals are not influenced by one another. If autocorrelation exists, it may reduce the model\u0026rsquo;s accuracy and cause the calculated error to underestimate the true error. This assumption was tested using the Durbin-Watson (DW) statistic, which ranges from 0 to 4 (Text S3). A value closer to 2 indicates a higher likelihood of sample independence (Kabaila et al., \u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e2021\u003c/span\u003e). As shown in Table \u003cspan refid=\"MOESM1\" class=\"InternalRef\"\u003eS1\u003c/span\u003e, the DW values for all four models were close to 2, indicating that the samples were independent and free from autocorrelation, thus satisfying the conditions for linear regression analysis.\u003c/p\u003e\u003cp\u003eNormality ensures the validity of parameter estimation in regression models. 
If the error terms are not normally distributed, confidence intervals may become unreliable. Linear regression assumes that residuals follow a normal distribution, which can be tested by plotting histograms of the residuals (Schmidt and Finan, \u003cspan citationid=\"CR43\" class=\"CitationRef\"\u003e2018\u003c/span\u003e). As shown in Figure \u003cspan refid=\"MOESM1\" class=\"InternalRef\"\u003eS1\u003c/span\u003e, the histograms for the sanitary sewer pipeline length, stormwater pipeline length, and stormwater manhole models aligned closely with the normal curve, indicating that their residuals were approximately normally distributed. However, the histogram for the sewage manhole model exhibited a sharp peak, suggesting that its residuals did not follow a normal distribution. According to the central limit theorem, for large sample sizes the sampling distribution of the mean approximates a normal distribution regardless of the population distribution. Therefore, the normality assumption is less critical for linear regression when large datasets are used (Bilon, \u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e2023\u003c/span\u003e).\u003c/p\u003e\u003cp\u003eHomoscedasticity can be assessed using a residual plot with standardized predicted values on the x-axis and standardized residuals on the y-axis. If the residuals are randomly scattered around the horizontal line at zero without any discernible pattern or outliers, the assumption of homoscedasticity is considered to be met (Moravec et al., \u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e2024\u003c/span\u003e). Conversely, if the variance of the residuals increases with the predicted values of the dependent variable (heteroscedasticity), it can lead to inaccurate predictions, as observed in the sanitary sewer pipeline length model (Figure \u003cspan refid=\"MOESM2\" class=\"InternalRef\"\u003eS2\u003c/span\u003e). 
Therefore, in subsequent model optimization, it was necessary to address heteroscedasticity to ensure accurate predictions.\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e\u003cdiv id=\"Sec9\" class=\"Section2\"\u003e\u003ch2\u003e\u003cem\u003e2.3 Model optimization and validation methods\u003c/em\u003e\u003c/h2\u003e\u003cp\u003eTo optimize the model and improve statistical accuracy and fitting performance, weighting was applied to independent variables and model parameters were adjusted. The model was trained on a training set and validated using a validation set to assess its performance.\u003c/p\u003e\u003cp\u003eFor example, in a weighted linear regression model, a weight function was first defined to construct a weighted least squares model (Meermeyer, \u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e2015\u003c/span\u003e). The dataset was divided into ten folds; in each iteration, nine folds were used for training and one for validation. The model\u0026rsquo;s performance was evaluated using R\u003csup\u003e2\u003c/sup\u003e and RMSE, and the optimal model was identified by selecting the one with the highest R\u003csup\u003e2\u003c/sup\u003e. The final model was then output, and a residual plot was generated.\u003c/p\u003e\u003cp\u003eK-fold cross-validation is a widely employed technique for model evaluation (Wong and Yang, \u003cspan citationid=\"CR50\" class=\"CitationRef\"\u003e2017\u003c/span\u003e) and is especially effective for small datasets. It involves dividing the dataset into K subsets; in each iteration, one subset is used for validation while the others are used for training. This process is repeated K times, making efficient use of the data, reducing overfitting, and providing a reliable estimate of model stability. 
Empirical studies have suggested that K\u0026thinsp;=\u0026thinsp;10 often yields the best overall performance (Rodriguez et al., \u003cspan citationid=\"CR40\" class=\"CitationRef\"\u003e2010\u003c/span\u003e).\u003c/p\u003e\u003cp\u003eIn 10-fold cross-validation, the dataset was split into ten subsets. The process was repeated ten times, with each subset used once as the validation set and the remaining nine used for training. R\u003csup\u003e2\u003c/sup\u003e and RMSE were calculated in each iteration to evaluate model performance. The model with the highest R\u003csup\u003e2\u003c/sup\u003e was selected as the best-performing model, and relevant data and visualizations were output.\u003c/p\u003e\u003c/div\u003e"},{"header":"3. Results and discussion","content":"\u003cdiv id=\"Sec11\" class=\"Section2\"\u003e\u003ch2\u003e3.1 Model fitting, optimization and validation\u003c/h2\u003e\u003cdiv id=\"Sec12\" class=\"Section3\"\u003e\u003ch2\u003e3.1.1 MLR model fitting\u003c/h2\u003e\u003cp\u003eThis study examined four dependent variables: the length of sanitary sewer pipelines, the length of stormwater pipelines, the number of sewage manholes, and the number of stormwater manholes. The independent variables included building area, number of buildings, and number of households. Since the length of sanitary sewer pipelines had a correlation coefficient above 0.9 with the other three dependent variables, it was considered significantly correlated and was used as a new predictor in recursive forecasting. Specifically, sanitary sewer pipeline length was first predicted and then used as an input variable in the other three models. The detailed variable names were listed in Table \u003cspan refid=\"MOESM2\" class=\"InternalRef\"\u003eS2\u003c/span\u003e. 
SPSS was used to perform backward elimination for variable selection and to conduct preliminary linear regression analysis for the four models (Table S3).\u003c/p\u003e\u003cp\u003eWhen evaluating a linear regression model, four key aspects were considered: (1) model significance (Magara and Boury-Jamot, \u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e2024\u003c/span\u003e), where a p-value less than 0.05 indicates statistical significance; (2) goodness-of-fit, measured by R\u003csup\u003e2\u003c/sup\u003e, with values closer to 1 indicating better model performance; (3) multicollinearity, assessed using the variance inflation factor (VIF) (Salmer\u0026oacute;n et al., \u003cspan citationid=\"CR42\" class=\"CitationRef\"\u003e2018\u003c/span\u003e), where values below 10 suggest no serious multicollinearity; and (4) residual analysis (Kozak and Piepho, \u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e2018\u003c/span\u003e), which helps examine unexplained variance and assess the model\u0026rsquo;s assumptions. Together, these factors offer a comprehensive assessment of the overall model quality.\u003c/p\u003e\u003cp\u003eTable\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e indicated that all four models met the criteria of significance level\u0026thinsp;\u0026lt;\u0026thinsp;0.05 and VIF\u0026thinsp;\u0026lt;\u0026thinsp;10, confirming that the models were statistically significant and free from serious multicollinearity. 
Among them, the sanitary sewer pipeline length model achieved an R\u003csup\u003e2\u003c/sup\u003e value of 0.877, indicating a good fit but still leaving room for improvement.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eMultiple linear regression model fitting results.\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"6\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eModel name\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eModel significance\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eR\u003csup\u003e2\u003c/sup\u003e\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003eAdjusted R\u003csup\u003e2\u003c/sup\u003e\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colspan=\"2\" nameend=\"c6\" namest=\"c5\"\u003e\u003cp\u003eVIF\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\" morerows=\"2\" rowspan=\"3\"\u003e\u003cp\u003eModel 1: Sanitary sewer pipeline length\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\" morerows=\"2\" 
rowspan=\"3\"\u003e\u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\" morerows=\"2\" rowspan=\"3\"\u003e\u003cp\u003e0.877\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\" morerows=\"2\" rowspan=\"3\"\u003e\u003cp\u003e0.873\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eBuilding area\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e3.366\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eNumber of buildings\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e2.186\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eNumber of households\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e3.582\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\" morerows=\"2\" rowspan=\"3\"\u003e\u003cp\u003eModel 2: Stormwater pipeline length\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\" morerows=\"2\" rowspan=\"3\"\u003e\u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\" morerows=\"2\" rowspan=\"3\"\u003e\u003cp\u003e0.946\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\" morerows=\"2\" rowspan=\"3\"\u003e\u003cp\u003e0.945\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eBuilding area\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e4.099\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eNumber of buildings\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e2.926\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" 
colname=\"c5\"\u003e\u003cp\u003eSanitary sewer pipeline length\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e6.244\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\" morerows=\"2\" rowspan=\"3\"\u003e\u003cp\u003eModel 3: Sewage manhole\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\" morerows=\"2\" rowspan=\"3\"\u003e\u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\" morerows=\"2\" rowspan=\"3\"\u003e\u003cp\u003e0.899\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\" morerows=\"2\" rowspan=\"3\"\u003e\u003cp\u003e0.897\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eBuilding area\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e4.406\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eNumber of households\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e4.649\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eSanitary sewer pipeline length\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e6.056\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e\u003cp\u003eModel 4: Stormwater manhole\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\" morerows=\"1\" rowspan=\"2\"\u003e\u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\" morerows=\"1\" rowspan=\"2\"\u003e\u003cp\u003e0.901\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\" morerows=\"1\" rowspan=\"2\"\u003e\u003cp\u003e0.899\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" 
colname=\"c5\"\u003e\u003cp\u003eNumber of households\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e4.317\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eSanitary sewer pipeline length\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e4.317\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eFigure S3 showed that the residuals of all four models increased with the predicted values, indicating the presence of heteroscedasticity. This issue compromised predictive accuracy (Amado et al., \u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e2025\u003c/span\u003e), as samples with large dependent variable values tended to be underestimated, while those with small values might be overestimated.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec13\" class=\"Section3\"\u003e\u003ch2\u003e3.1.2 MLR model optimization and verification\u003c/h2\u003e\u003cp\u003eTo address this issue, weighted least squares (WLS) regression was applied. The process involved first using SPSS\u0026rsquo;s weight estimation function to calculate variable weights, followed by performing weighted linear regression based on the optimal weight coefficients. The model with the highest R\u003csup\u003e2\u003c/sup\u003e was selected as the optimal one. As shown in Figure S4, the residuals of the four models were evenly distributed above and below zero, indicating that the assumption of homoscedasticity was satisfied.\u003c/p\u003e\u003cp\u003eAmong the four models, only the stormwater pipeline length model achieved an R\u003csup\u003e2\u003c/sup\u003e value greater than 0.9, indicating a strong fit. The remaining three models showed comparatively weaker performance and required further validation of the linear regression approach. 
Results from ten-fold cross-validation of the weighted linear regression models were presented in Table\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab3\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eModel weighted linear regression and cross validation of optimized results.\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"7\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eModel name\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colspan=\"2\" nameend=\"c3\" namest=\"c2\"\u003e\u003cp\u003eCoefficient\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003eModel significance\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c5\"\u003e\u003cp\u003eR\u003csup\u003e2\u003c/sup\u003e\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c6\"\u003e\u003cp\u003eOptimized R\u003csup\u003e2\u003c/sup\u003e\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" 
colname=\"c7\"\u003e\u003cp\u003eRMSE\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\" morerows=\"3\" rowspan=\"4\"\u003e\u003cp\u003eModel 1: Sanitary sewer pipeline length\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eConstant\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e36.837\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\" morerows=\"3\" rowspan=\"4\"\u003e\u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\" morerows=\"3\" rowspan=\"4\"\u003e\u003cp\u003e0.873\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\" morerows=\"3\" rowspan=\"4\"\u003e\u003cp\u003e0.966\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\" morerows=\"3\" rowspan=\"4\"\u003e\u003cp\u003e490.290\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eBuilding area\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e0.018\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eNumber of buildings\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e12.568\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eNumber of households\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e0.948\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\" morerows=\"2\" rowspan=\"3\"\u003e\u003cp\u003eModel 2: Stormwater pipeline length\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eConstant\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" 
colname=\"c3\"\u003e\u003cp\u003e9.536\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\" morerows=\"2\" rowspan=\"3\"\u003e\u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\" morerows=\"2\" rowspan=\"3\"\u003e\u003cp\u003e0.963\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\" morerows=\"2\" rowspan=\"3\"\u003e\u003cp\u003e0.991\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\" morerows=\"2\" rowspan=\"3\"\u003e\u003cp\u003e118.092\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eBuilding area\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e0.004\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eSanitary sewer pipeline length\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e0.885\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\" morerows=\"2\" rowspan=\"3\"\u003e\u003cp\u003eModel 3: Sewage manhole\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eConstant\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e0.044\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\" morerows=\"2\" rowspan=\"3\"\u003e\u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\" morerows=\"2\" rowspan=\"3\"\u003e\u003cp\u003e0.894\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\" morerows=\"2\" rowspan=\"3\"\u003e\u003cp\u003e0.959\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\" morerows=\"2\" rowspan=\"3\"\u003e\u003cp\u003e127.814\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eNumber of 
households\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e0.112\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eSanitary sewer pipeline length\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e0.112\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\" morerows=\"2\" rowspan=\"3\"\u003e\u003cp\u003eModel 4: Stormwater manhole\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eConstant\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e0.355\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\" morerows=\"2\" rowspan=\"3\"\u003e\u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\" morerows=\"2\" rowspan=\"3\"\u003e\u003cp\u003e0.884\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\" morerows=\"2\" rowspan=\"3\"\u003e\u003cp\u003e0.943\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\" morerows=\"2\" rowspan=\"3\"\u003e\u003cp\u003e78.356\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eNumber of households\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e0.082\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eSanitary sewer pipeline length\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e0.074\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eAll models showed statistical significance (p\u0026thinsp;\u0026lt;\u0026thinsp;0.001), with increased R\u003csup\u003e2\u003c/sup\u003e values and reduced RMSE, indicating improved model 
performance. Figure\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e1\u003c/span\u003e illustrated the fit between observed and predicted values for each model. The stormwater pipeline length model demonstrated the best performance (R\u003csup\u003e2\u003c/sup\u003e\u0026thinsp;=\u0026thinsp;0.991). All four models had R\u003csup\u003e2\u003c/sup\u003e values exceeding 0.9, indicating suitability for predicting new data. Figure\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e2\u003c/span\u003e showed the residual plots of the models. The residuals of the sanitary sewer and stormwater pipeline length models were evenly distributed above and below zero, suggesting homoscedasticity. In contrast, heteroscedasticity was observed in the sewage and stormwater manhole models. Although model optimization improved the R\u003csup\u003e2\u003c/sup\u003e values and fitting accuracy, residual plots still exhibited signs of heteroscedasticity, indicating that the models did not fully capture the variability of the error terms.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec14\" class=\"Section3\"\u003e\u003ch2\u003e3.1.3 SVR model fitting\u003c/h2\u003e\u003cp\u003eLinear regression may exclude variables with weak linear correlations, yet these variables may still exhibit nonlinear relationships with the dependent variable. Therefore, in the nonlinear regression analysis, building area, number of buildings, and number of households were all retained as independent variables. The dataset was divided into training and testing sets at a 7:3 ratio. The training set was used to build the SVR model, with the radial basis function (RBF) selected as the kernel. The model\u0026rsquo;s accuracy was evaluated using RMSE and R\u003csup\u003e2\u003c/sup\u003e. The key Python code for the SVR model is provided in Text S4. 
The results (Table S4) showed that all models had R\u003csup\u003e2\u003c/sup\u003e values below 0.9 and relatively large RMSE, indicating poor fit and low prediction accuracy. Therefore, further parameter adjustments and optimization were necessary.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec15\" class=\"Section3\"\u003e\u003ch2\u003e3.1.4 SVR model optimization and verification\u003c/h2\u003e\u003cp\u003eThe performance of the SVR model depended not only on the data structure but also on its hyperparameter settings, primarily the regularization parameter (C) and the epsilon-insensitive loss parameter (Ɛ) (Hsia and Lin, \u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e2020b\u003c/span\u003e). The parameter C controls model complexity: smaller values encourage a simpler model, while larger values enhance the fit to training data. Ɛ defines the width of the tolerance band within which prediction errors are not penalized: smaller values make the model more sensitive to small errors, while larger values result in smoother predictions. Hyperparameter tuning was performed using grid search combined with cross-validation (Xie et al., \u003cspan citationid=\"CR51\" class=\"CitationRef\"\u003e2023\u003c/span\u003e), with C set to 0.1, 1, 10, and 100, and Ɛ set to 0.01, 0.1, 1, and 10. 
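\u003c/p\u003e\u003cp\u003eThe grid search described above can be sketched in Python as follows. This is a minimal sketch, assuming scikit-learn, and is not the code from Text S5. The synthetic data, the variable names, the random seeds, and the StandardScaler step are assumptions of this example; ten-fold cross-validation is used as stated in the Conclusions.\u003c/p\u003e

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (assumption): 80 communities with three predictors,
# building area, number of buildings, number of households.
rng = np.random.default_rng(0)
X = rng.uniform(low=[1e4, 5, 100], high=[2e5, 60, 3000], size=(80, 3))
y = 0.002 * X[:, 0] + 3.0 * X[:, 1] + 0.05 * X[:, 2] + rng.normal(0, 10, 80)

# 7:3 split into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# RBF-kernel SVR; the scaling step is an assumption of this sketch.
pipeline = Pipeline([('scale', StandardScaler()), ('svr', SVR(kernel='rbf'))])

# Candidate values for C and epsilon as listed in the text (4 x 4 = 16 pairs).
param_grid = {
    'svr__C': [0.1, 1, 10, 100],
    'svr__epsilon': [0.01, 0.1, 1, 10],
}

# Ten-fold cross-validated grid search, scored by R2.
search = GridSearchCV(pipeline, param_grid, cv=10, scoring='r2')
search.fit(X_train, y_train)

# Evaluate the refitted best model on the held-out test set.
rmse = float(np.sqrt(np.mean((search.predict(X_test) - y_test) ** 2)))
print('best parameters:', search.best_params_)
print('test R2:', search.score(X_test, y_test), 'test RMSE:', rmse)
```

\u003cp\u003eBecause the RBF kernel is distance-based, the sketch standardizes the predictors inside the pipeline, so each cross-validation fold is scaled using its own training portion only.\u003c/p\u003e\u003cp\u003e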
These combinations were implemented programmatically to build and optimize the SVR model (Code in Text S5).\u003c/p\u003e\u003cp\u003eThe results in Table\u0026nbsp;\u003cspan refid=\"Tab4\" class=\"InternalRef\"\u003e4\u003c/span\u003e showed that the R\u003csup\u003e2\u003c/sup\u003e of the sewage manhole and stormwater manhole models increased, while RMSE values decreased, indicating improved accuracy of the SVR models.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab4\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 4\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eOptimized support vector regression model results.\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"6\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eModel name\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eIndependent variable\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eC\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003eƐ\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c5\"\u003e\u003cp\u003eR\u003csup\u003e2\u003c/sup\u003e\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" 
colname=\"c6\"\u003e\u003cp\u003eRMSE\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\" morerows=\"2\" rowspan=\"3\"\u003e\u003cp\u003eSanitary sewer pipeline length\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eBuilding area\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\" morerows=\"2\" rowspan=\"3\"\u003e\u003cp\u003e10\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\" morerows=\"2\" rowspan=\"3\"\u003e\u003cp\u003e0.01\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\" morerows=\"2\" rowspan=\"3\"\u003e\u003cp\u003e0.895\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\" morerows=\"2\" rowspan=\"3\"\u003e\u003cp\u003e310.844\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eNumber of buildings\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eNumber of households\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\" morerows=\"3\" rowspan=\"4\"\u003e\u003cp\u003eStormwater pipeline length\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eBuilding area\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\" morerows=\"3\" rowspan=\"4\"\u003e\u003cp\u003e10\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\" morerows=\"3\" rowspan=\"4\"\u003e\u003cp\u003e0.01\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\" morerows=\"3\" rowspan=\"4\"\u003e\u003cp\u003e0.989\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\" morerows=\"3\" rowspan=\"4\"\u003e\u003cp\u003e142.897\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eNumber of 
buildings\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eNumber of households\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eSanitary sewer pipeline length\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\" morerows=\"3\" rowspan=\"4\"\u003e\u003cp\u003eSewage manhole\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eBuilding area\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\" morerows=\"3\" rowspan=\"4\"\u003e\u003cp\u003e10\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\" morerows=\"3\" rowspan=\"4\"\u003e\u003cp\u003e0.1\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\" morerows=\"3\" rowspan=\"4\"\u003e\u003cp\u003e0.957\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\" morerows=\"3\" rowspan=\"4\"\u003e\u003cp\u003e69.505\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eNumber of buildings\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eNumber of households\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eSanitary sewer pipeline length\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eStormwater manhole\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eNumber of households\u003c/p\u003e\u003cp\u003eSanitary sewer pipeline length\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e10\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e0.1\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" 
colname=\"c5\"\u003e\u003cp\u003e0.971\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e32.278\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eFigure \u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e3\u003c/span\u003e showed the fit between the observed and predicted values of the four dependent variables in the SVR model. The lines for the sewage manhole and stormwater manhole models nearly coincided, indicating excellent fit and suitability for predicting new data. Figure\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e4\u003c/span\u003e showed the residual plots of the models. The residuals for the sanitary sewer pipeline length and stormwater pipeline length models fanned out around the zero line, indicating the presence of heteroscedasticity. In contrast, the residuals for the sewage manhole and stormwater manhole models were evenly distributed above and below zero, indicating approximately homoscedastic errors and an improvement over the heteroscedasticity observed in the linear manhole models. This indicated that the SVR model could capture the variability of error terms at different levels.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec16\" class=\"Section3\"\u003e\u003ch2\u003e3.1.5 RFR model fitting\u003c/h2\u003e\u003cp\u003eThe RFR modeling procedure was similar to that used for the SVR model. The dataset was randomly split into training and validation sets at a ratio of 7:3. The RFR model was implemented using functions from the scikit-learn (sklearn) library, and its performance was evaluated based on two key metrics: RMSE and R\u003csup\u003e2\u003c/sup\u003e. The implementation details were provided in Text S6.\u003c/p\u003e\u003cp\u003eAs shown in Table S5, the RFR model achieved a slight improvement in R\u003csup\u003e2\u003c/sup\u003e compared to the SVR model. However, the R\u003csup\u003e2\u003c/sup\u003e values remained below 0.9, indicating a relatively poor model fit. 
Moreover, the large RMSE values suggested a large average deviation between the predicted and actual values, reflecting low predictive accuracy. These results highlighted the need for further model tuning and optimization.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec17\" class=\"Section3\"\u003e\u003ch2\u003e3.1.6 RFR model optimization and verification\u003c/h2\u003e\u003cp\u003eThe performance of the RFR model was primarily determined by two key hyperparameters: the number of estimators (n_estimators) and the maximum depth of the trees (max_depth) (Dai et al., \u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e2018\u003c/span\u003e). The n_estimators parameter defines the number of decision trees in the forest. A larger number generally improves model performance but increases computational cost. Common values include 300, 500, and 1000. The max_depth parameter controls the maximum depth of each tree. While deeper trees can improve model fitting, they are more prone to overfitting. Typical tuning begins with small values such as 3, 5, or 7. Hyperparameter tuning was performed using grid search (Zong et al., \u003cspan citationid=\"CR56\" class=\"CitationRef\"\u003e2025\u003c/span\u003e) in combination with ten-fold cross-validation. 
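\u003c/p\u003e\u003cp\u003eThe tuning procedure above can be sketched in Python as follows. This is a minimal sketch, assuming scikit-learn, and is not the code from Text S7. The synthetic data, the variable names, and the random seeds are assumptions of this example. The tree counts are also scaled down by a factor of ten (30, 50, and 100 instead of 300, 500, and 1000) purely so that the sketch runs quickly.\u003c/p\u003e

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption): 80 communities with two predictors,
# number of households and sanitary sewer pipeline length.
rng = np.random.default_rng(1)
X = rng.uniform(low=[100, 200], high=[3000, 6000], size=(80, 2))
y = 0.02 * X[:, 0] + 0.03 * X[:, 1] + rng.normal(0, 5, 80)

# 7:3 split into training and validation sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Grid over the two hyperparameters named in the text.
# n_estimators is scaled down tenfold here for speed (assumption).
param_grid = {
    'n_estimators': [30, 50, 100],
    'max_depth': [3, 5, 7],
}

# Ten-fold cross-validated grid search, scored by R2.
search = GridSearchCV(
    RandomForestRegressor(random_state=1), param_grid, cv=10, scoring='r2'
)
search.fit(X_train, y_train)

# Report R2 and RMSE of the refitted best forest on the held-out set.
best_forest = search.best_estimator_
rmse = float(np.sqrt(np.mean((best_forest.predict(X_test) - y_test) ** 2)))
print('best parameters:', search.best_params_)
print('validation R2:', best_forest.score(X_test, y_test), 'RMSE:', rmse)
```

\u003cp\u003eRestoring the tree counts listed in the text only requires changing the n_estimators entry of the grid; the rest of the sketch is unchanged.\u003c/p\u003e\u003cp\u003e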
Parameter combinations were implemented programmatically to build and optimize the model (Text S7).\u003c/p\u003e\u003cp\u003eAccording to Table\u0026nbsp;\u003cspan refid=\"Tab5\" class=\"InternalRef\"\u003e5\u003c/span\u003e, the R\u003csup\u003e2\u003c/sup\u003e values obtained from the RFR model were the lowest, and the RMSE values were the highest among the three regression approaches, indicating that the model had the least accurate predictive performance.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab5\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 5\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eOptimized random forest regression model results.\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"6\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eModel name\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eIndependent variables\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003en_estimators\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003emax_depth\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c5\"\u003e\u003cp\u003eR\u003csup\u003e2\u003c/sup\u003e\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" 
colname=\"c6\"\u003e\u003cp\u003eRMSE\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\" morerows=\"2\" rowspan=\"3\"\u003e\u003cp\u003eSanitary sewer pipeline length\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eBuilding area\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\" morerows=\"2\" rowspan=\"3\"\u003e\u003cp\u003e1000\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\" morerows=\"2\" rowspan=\"3\"\u003e\u003cp\u003e3\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\" morerows=\"2\" rowspan=\"3\"\u003e\u003cp\u003e0.867\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\" morerows=\"2\" rowspan=\"3\"\u003e\u003cp\u003e871.871\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eNumber of buildings\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eNumber of households\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e\u003cp\u003eStormwater pipeline length\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eBuilding area\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\" morerows=\"1\" rowspan=\"2\"\u003e\u003cp\u003e500\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\" morerows=\"1\" rowspan=\"2\"\u003e\u003cp\u003e3\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\" morerows=\"1\" rowspan=\"2\"\u003e\u003cp\u003e0.977\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\" morerows=\"1\" rowspan=\"2\"\u003e\u003cp\u003e166.868\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eSanitary sewer pipeline 
length\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\" morerows=\"3\" rowspan=\"4\"\u003e\u003cp\u003eSewage manhole\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eBuilding area\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\" morerows=\"3\" rowspan=\"4\"\u003e\u003cp\u003e300\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\" morerows=\"3\" rowspan=\"4\"\u003e\u003cp\u003e3\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\" morerows=\"3\" rowspan=\"4\"\u003e\u003cp\u003e0.898\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\" morerows=\"3\" rowspan=\"4\"\u003e\u003cp\u003e107.203\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eNumber of buildings\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eNumber of households\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eSanitary sewer pipeline length\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e\u003cp\u003eStormwater manhole\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eNumber of households\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\" morerows=\"1\" rowspan=\"2\"\u003e\u003cp\u003e1000\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\" morerows=\"1\" rowspan=\"2\"\u003e\u003cp\u003e5\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\" morerows=\"1\" rowspan=\"2\"\u003e\u003cp\u003e0.964\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\" morerows=\"1\" rowspan=\"2\"\u003e\u003cp\u003e52.194\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" 
colname=\"c2\"\u003e\u003cp\u003eSanitary sewer pipeline length\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eAs illustrated in Fig.\u0026nbsp;\u003cspan refid=\"Fig7\" class=\"InternalRef\"\u003e5\u003c/span\u003e, the predicted values showed low consistency with the actual values for all four dependent variables, further confirming that the RFR model was not well suited for this forecasting task.\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e\u003cdiv id=\"Sec18\" class=\"Section2\"\u003e\u003ch2\u003e3.2 Model comparison\u003c/h2\u003e\u003cp\u003eFigure \u003cspan refid=\"Fig8\" class=\"InternalRef\"\u003e6\u003c/span\u003e compared the models developed using three different regression methods. Overall, the WLR and SVR models demonstrated better fitting performance than the RFR models. Specifically, in the sewage and stormwater pipeline length models, the R\u003csup\u003e2\u003c/sup\u003e values obtained from linear regression were higher than those of the other two methods, demonstrating its superior accuracy and suitability for estimating these variables. In contrast, for the sewage and stormwater manhole models, the SVR method achieved the highest R\u003csup\u003e2\u003c/sup\u003e values and the lowest RMSE, indicating better accuracy and smaller prediction errors, and thus was more appropriate for estimating manhole numbers. The fitting and residual plots provided a visual comparison of the three regression methods across different dependent variables. As shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e7\u003c/span\u003e, linear regression performed best in fitting sewage and stormwater pipeline lengths, whereas SVR showed superior performance in modeling sewage and stormwater manholes. 
Furthermore, the residual plots in Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e8\u003c/span\u003e demonstrated the symmetry of residual distribution around the zero line, supporting the observed trends in model performance.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec19\" class=\"Section2\"\u003e\u003ch2\u003e3.3 Model results\u003c/h2\u003e\u003cdiv id=\"Sec20\" class=\"Section3\"\u003e\u003ch2\u003e3.3.1 Model 1: Sanitary sewer pipeline length\u003c/h2\u003e\u003cp\u003eThe dependent variable in the first model was sanitary sewer pipeline length, with building area (A), number of buildings (B), and number of households (H) in the residential community as the independent variables. A WLR model was applied, and the resulting equation was expressed as follows:\u003c/p\u003e\u003cp\u003e\u003cimg src=\"https://myfiles.space/user_files/127393_c7e80a1c9bb65875/127393_custom_files/img1752862686.png\"\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec21\" class=\"Section3\"\u003e\u003ch2\u003e3.3.2 Model 2: Stormwater pipeline length\u003c/h2\u003e\u003cp\u003eFor the stormwater pipeline length model, the dependent variable was stormwater pipeline length, and the independent variables included building area (A) and the predicted sanitary sewer pipeline length (S). The same WLR approach was employed, with recursive prediction based on the first model\u0026rsquo;s output. The model equation is as follows:\u003c/p\u003e\u003cp\u003e\u003cimg src=\"https://myfiles.space/user_files/127393_c7e80a1c9bb65875/127393_custom_files/img1752862747.png\"\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec22\" class=\"Section3\"\u003e\u003ch2\u003e3.3.3 Model 3: Sewage manhole\u003c/h2\u003e\u003cp\u003eThe dependent variable in this model was the number of sewage manholes, while the independent variables included building area, number of buildings, number of households, and sanitary sewer pipeline length. An SVR model was employed. 
Unlike linear regression models, which offer explicit mathematical expressions, SVR utilizes nonlinear kernel functions for data transformation, leading to greater model complexity. Consequently, the model cannot be expressed with a simple formula and must be implemented using Python programming to estimate new data.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec23\" class=\"Section3\"\u003e\u003ch2\u003e3.3.4 Model 4: Stormwater manhole\u003c/h2\u003e\u003cp\u003eThe dependent variable in this model was the number of stormwater manholes, with the number of households and sanitary sewer pipeline length serving as the independent variables. An SVR model was applied, which required implementation through Python programming to perform predictions on new data.\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e\u003cdiv id=\"Sec24\" class=\"Section2\"\u003e\u003ch2\u003e3.4 Error analysis\u003c/h2\u003e\u003cdiv id=\"Sec25\" class=\"Section3\"\u003e\u003ch2\u003e3.4.1 Data measurement error\u003c/h2\u003e\u003cp\u003eMeasurement errors may arise during data collection (Stoklosa et al., \u003cspan citationid=\"CR44\" class=\"CitationRef\"\u003e2016\u003c/span\u003e), such as inaccuracies from manual measurements in design drawings or the inclusion of secondary pipelines in automated CAD statistics. These errors can introduce bias or noise into the dataset, thereby compromising the statistical accuracy of the model.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec26\" class=\"Section3\"\u003e\u003ch2\u003e3.4.2 Small number of training samples\u003c/h2\u003e\u003cp\u003eTypically, building effective statistical models using random forest or support vector machine algorithms requires 1,000 to 10,000 samples. 
A smaller sample size may limit the model\u0026rsquo;s ability to learn underlying patterns and trends, potentially causing underfitting or overfitting and reducing predictive accuracy (Lee and Yang, \u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e2022\u003c/span\u003e).\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec27\" class=\"Section3\"\u003e\u003ch2\u003e3.4.3 Error propagation\u003c/h2\u003e\u003cp\u003eIn multi-step predictions using recursive forecasting, errors may accumulate progressively and propagate through subsequent steps. Even if the initial statistical errors are minor, they may compound over time, resulting in substantial deviations between the predicted and actual values.\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e"},{"header":"4. Conclusions","content":"\u003cp\u003eThis study developed a hybrid regression model integrating MLR, SVR, and RFR to estimate the number of drainage facilities in urban residential communities. The results indicated that drainage facility estimation was affected not only by structural variables such as building area, number of buildings, and number of households, but also by complex nonlinear relationships. Therefore, the MLR model was optimized using weighted least squares, while the SVR and RFR models were tuned via grid search and ten-fold cross-validation, resulting in improved estimation accuracy. The findings showed that the WLR model effectively predicted pipeline length, achieving R\u003csup\u003e2\u003c/sup\u003e values above 0.96. For estimating the number of sewage and stormwater manholes, the SVR model outperformed others due to its strong nonlinear fitting capabilities. Moreover, incorporating a recursive modeling mechanism enhanced prediction accuracy by capturing interdependencies among variables. 
However, despite model optimization, measurement errors and limited sample sizes remained significant sources of uncertainty, and recursive forecasting introduced the risk of cumulative error propagation. Overall, this study provides a scientific foundation and practical framework for estimating urban drainage infrastructure, contributing to the advancement of data-driven and intelligent urban drainage management systems.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eFunding:\u0026nbsp;\u003c/strong\u003eThe authors declare that no funds, grants, or other support were received during the preparation of this manuscript.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCompeting Interests:\u0026nbsp;\u003c/strong\u003eThe authors have no relevant financial or non-financial interests to disclose.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthor Contributions:\u003c/strong\u003e All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by Kaiwei Hu, Yixiao He and Shuping Li. The first draft of the manuscript was written by Kaiwei Hu and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eAilobhio DT, Ikughur JA (2024) A Review of Some Goodness-of-Fit Tests for Logistic Regression Model. Asian J Probab Stat 26(7):75\u0026ndash;85\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eAlizadeh Gharaei MS, Ramezani Y, Nazeri Tahroudi M (2024) Toward coupling of nonlinear support vector regression and crowd intelligence optimization algorithms in estimation of suspended sediment load. Appl Water Sci 14(9)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eAmado C, Bianco AM, Boente G, Rodrigues IM (2025) Robust estimation of heteroscedastic regression models: a brief overview and new proposals. 
Stat Pap 66(3)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eArya S, Kumar A (2023) Evaluation of stormwater management approaches and challenges in urban flood control. Urban Clim 51:101643\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eBeigzadeh B, Bahrami M, Amiri MJ, Mahmoudi MR (2020) A new approach in adsorption modeling using random forest regression, Bayesian multiple linear regression, and multiple linear regression: 2,4-D adsorption by a green adsorbent. Water Sci Technol 82(8):1586\u0026ndash;1602\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eBilon XJ (2023) Normality and significance testing in simple linear regression model for large sample sizes: a simulation study. Commun Stat Simul Comput 52(6):2781\u0026ndash;2797\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eBrooks C, Burke SP, Stanescu S (2016) Finite sample weighting of recursive forecast errors. Int J Forecast 32(2):458\u0026ndash;474\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eCao T, Truong V, Nguyen N (2024) An efficient optimization framework for urban drainage system design. E3S Web Conf 533:5001\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eChen Z, Sarkar A, Li X, Xia X (2021) Effects of joint adoption for multiple green production technologies on welfare-a survey of 650 kiwi growers in Shaanxi and Sichuan. Int J Clim Chang Strateg Manag 13(3):229\u0026ndash;249\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eCiulla G, D'Amico A (2019) Building energy performance forecasting: A multiple linear regression approach. 
Appl Energy 253:113500\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eS\u0026aacute;-Caputo DCD, Sonza D, Coelho-Oliveira A, Pessanha-Freitas AC, Reis J, Francisca-Santos AS, Dos Anjos A, Paineiras-Domingos EM, de Rezende Bessa Guerra LL, Da Silva Franco T, Xavier A, Barbosa VL, Silva E, Moura-Fernandes CJ, Mendon\u0026ccedil;a MC, Lacerda VAR, Mulder AC, Seixas A, Sartorio A, Taiar R, Bernardo-Filho M (2021) Evaluation of the Relationships between Simple Anthropometric Measures and Bioelectrical Impedance Assessment Variables with Multivariate Linear Regression Models to Estimate Body Composition and Fat Distribution in Adults: Preliminary Results. Biology 10(11):1209\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eDai B, Gu C, Zhao E, Qin X (2018) Statistical model optimized random forest regression model for concrete dam deformation monitoring. Struct Control Health Monit 25(6):e2170\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eDai W, Mohammadi S, Cremaschi S (2022) A hybrid modeling framework using dimensional analysis for erosion predictions. Comput Chem Eng 156:107577\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eDesai S, Ouarda TBMJ (2021) Regional hydrological frequency analysis at ungauged sites with random forest regression. J Hydrol 594:125861\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eDong X, Guo H, Zeng S (2017) Enhancing future resilience in urban drainage system: Green versus grey infrastructure. Water Res 124:280\u0026ndash;289\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eEtemadi S, Khashei M (2021) Etemadi multiple linear regression. Measurement 186:110080\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eFathy I, Abdel-Aal GM, Fahmy MR, Fathy A, Zeleň\u0026aacute;kov\u0026aacute; M (2020) The Negative Impact of Blockage on Storm Water Drainage Network. 
Water 12(7):1974\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eFu X, Goddard H, Wang X, Hopton ME (2019) Development of a scenario-based stormwater management planning support system for reducing combined sewer overflows (CSOs). J Environ Manage 236:571\u0026ndash;580\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eGholami V, Khaleghi MR (2019) A comparative study of the performance of artificial neural network and multivariate regression in simulating springs discharge in the Caspian Southern Watersheds, Iran. Appl Water Sci 9(1)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eHaddad S, Boukhayma A, Caizzone A (2022) Continuous PPG-Based Blood Pressure Monitoring Using Multi-Linear Regression. IEEE J Biomed Health Inform 26(5):2096\u0026ndash;2105\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eHill C, Du L, Johnson M, McCullough BD (2024) Comparing programming languages for data analytics: Accuracy of estimation in Python and R. WIREs Data Min Knowl Discov 14(3)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eHsia J, Lin C (2020a) Parameter Selection for Linear Support Vector Regression. IEEE Trans Neural Netw Learn Syst 31(12):5639\u0026ndash;5644\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eHsia J, Lin C (2020b) Parameter Selection for Linear Support Vector Regression. IEEE Trans Neural Netw Learn Syst 31(12):5639\u0026ndash;5644\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eKabaila P, Farchione D, Alhelli S, Bragg N (2021) The effect of a Durbin\u0026ndash;Watson pretest on confidence intervals in regression. Stat Neerl 75(1):4\u0026ndash;23\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eKattan MW, Gerds TA (2020) A Framework for the Evaluation of Statistical Prediction Models. 
Chest 158(1):S29\u0026ndash;S38\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eKlopfenstein Q, Vaiter S (2021) Linear support vector regression with linear constraints. Mach Learn 110(7):1939\u0026ndash;1974\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eKozak M, Piepho HP (2018) What's normal anyway? Residual plots are more telling than significance tests when checking ANOVA assumptions. J Agron Crop Sci 204(1):86\u0026ndash;98\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eKozak S, Petterson S, McAlister T, Jennison I, Bagraith S, Roiko A (2020) Utility of QMRA to compare health risks associated with alternative urban sewer overflow management strategies. J Environ Manage 262:110309\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLee J, Yang C (2022) Deep neural network and meta-learning-based reactive sputtering with small data sample counts. J Manuf Syst 62:703\u0026ndash;717\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLiao K, Park ES, Zhang J, Cheng L, Ji D, Ying Q, Yu JZ (2021) A multiple linear regression model with multiplicative log-normal error term for atmospheric concentration data. Sci Total Environ 767:144282\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLiu Y, Zhao W, Wei Y, Sebastian FSM, Wang M (2023) Urban waterlogging control: A novel method to urban drainage pipes reconstruction, systematic and automated. J Clean Prod 418:137950\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eMa X, Zhang J, Wang P, Zhou L, Sun Y (2023) Estimating the nonlinear response of landscape patterns to ecological resilience using a random forest algorithm: Evidence from the Yangtze River Delta. Ecol Indic 153:110409\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eMagara F, Boury-Jamot B (2024) About statistical significance, and the lack thereof. 
Lab Anim 58(5):448\u0026ndash;452\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eMaglia N, Raimondi A (2025) A new approach on design and verification of integrated sustainable urban drainage systems for stormwater management in urban areas. J Environ Manage 373:123882\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eMao K, Chen C, Zhang J, Li Y (2024) ORLEP: an efficient offline reinforcement learning evaluation platform. Multimed Tools Appl 83(12):37073\u0026ndash;37087\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eMeermeyer M (2015) Weighted linear regression models with fixed weights and spherical disturbances. Comput Stat 30(4):929\u0026ndash;955\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eMoravec M, Pinosova M, Badida M, Izarikova G, Badidova M (2024) Analysis of the Acoustic Parameters of Building Partition Structures of Varying Composition. Buildings-Basel 14(8):2440\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eNeo PK, Leong YW, Soon MF, Goh QS, Thumsorn S, Ito H (2024) Development of a Machine Learning Model to Predict the Color of Extruded Thermoplastic Resins. Polymers 16(4):481\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eQiao X, Kristoffersson A, Randrup TB (2018) Challenges to implementing urban sustainable stormwater management from a governance perspective: A literature review. J Clean Prod 196:943\u0026ndash;952\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eRodriguez JD, Perez A, Lozano JA (2010) Sensitivity Analysis of k-Fold Cross Validation in Prediction Error Estimation. Ieee Trans Pattern Anal Mach Intell 32(3):569\u0026ndash;575\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eRodriguez M, Fu G, Butler D, Yuan Z, Sharma K (2021) Exploring the Spatial Impact of Green Infrastructure on Urban Drainage Resilience. 
Water 13(13):1789\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eSalmer\u0026oacute;n R, Garc\u0026iacute;a CB, Garc\u0026iacute;a J (2018) Variance Inflation Factor and Condition Number in multiple linear regression. J Stat Comput Simul 88(12):2365\u0026ndash;2384\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eSchmidt AF, Finan C (2018) Linear regression and the normality assumption. J Clin Epidemiol 98:146\u0026ndash;151\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eStoklosa J, Huang Y, Furlan E, Hwang W (2016) On quadratic logistic regression models when predictor variables are subject to measurement error. Comput Stat Data Anal 95:109\u0026ndash;121\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eSun Y, Hu X, Li Y, Peng Y, Yu Y (2021) A framework for deriving dispatching rules of integrated urban drainage systems. J Environ Manage 298:113401\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eTsirikoglou P, Abraham S, Contino F, Lacor C, Ghorbaniasl G (2017) A hyperparameters selection technique for support vector regression models. Appl Soft Comput 61:139\u0026ndash;148\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eWang J, Liu G, Wang J, Xu X, Shao Y, Zhang Q, Liu Y, Qi L, Wang H (2021) Current status, existent problems, and coping strategy of urban drainage pipeline network in China. Environ Sci Pollut Res 28(32):43035\u0026ndash;43049\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eWen K, Wu W, Wu X (2023) Electricity demand forecasting and risk management using Gaussian process model with error propagation. J Forecast 42(4):957\u0026ndash;969\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eWinter AR, Zhu Y, Asimow NG, Patel MY, Cohen RC (2025) A Scalable Calibration Method for Enhanced Accuracy in Dense Air Quality Monitoring Networks. 
Environ Sci Technol 59(5):2599\u0026ndash;2610\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eWong T, Yang N (2017) Dependency Analysis of Accuracy Estimates in k-Fold Cross Validation. Ieee Trans Knowl Data Eng 29(11):2417\u0026ndash;2427\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eXie S, Lin H, Chen Y, Duan H, Liu H, Liu B (2023) Prediction of shear strength of rock fractures using support vector regression and grid search optimization. Mater Today Commun 36:106780\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eXu W, Peng H, Zeng X, Zhou F, Tian X, Peng X (2019) A hybrid modelling method for time series forecasting based on a linear regression model and deep learning. Appl Intell 49(8):3002\u0026ndash;3015\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eYazdi J (2018) Water quality monitoring network design for urban drainage systems, an entropy method. Urban Water J 15(3):227\u0026ndash;233\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eZhao J, Lao F, Yan G, Zhang Y (2024) How data heterogeneity affects innovating knowledge and information in gene identification: A statistical learning perspective. J Innov Knowl 9(3):100514\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eZhong Y, Yang H, Zhang Y, Li P (2021) Online Rebuilding Regression Random Forests. Knowledge-Based Syst 221:106960\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eZong Y, Nian Y, Zhang C, Tang X, Wang L, Zhang L (2025) Hybrid Grid Search and Bayesian optimization-based random forest regression for predicting material compression pressure in manufacturing processes. 
Eng Appl Artif Intell 141:109580\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":false,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"water-resources-management","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"warm","sideBox":"Learn more about [Water Resources Management](https://www.springer.com/journal/11269)","snPcode":"11269","submissionUrl":"https://submission.nature.com/new-submission/11269/3","title":"Water Resources Management","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"em","reportingPortfolio":"Springer Hybrid","inReviewEnabled":true,"inReviewRevisionsEnabled":false},"keywords":"Urban drainage infrastructure, Residential community, Hybrid regression framework, Recursive prediction, Heteroscedasticity correction","lastPublishedDoi":"10.21203/rs.3.rs-6754917/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-6754917/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eUrban drainage infrastructure plays a vital role in maintaining water quality and mitigating urban flood risks. However, estimating the quantity of residential drainage facilities in megacities remains challenging due to infrastructure aging, fragmented management, and data deficiencies. 
This study proposes a hybrid regression framework combining multiple linear regression (MLR), support vector regression (SVR), and random forest regression (RFR) to improve prediction accuracy of drainage facility quantities at the residential community scale. Based on data from 120 residential communities in City S, the study analyzes correlations between drainage facilities (e.g., pipeline length, number of manholes) and community attributes (e.g., building area, number of buildings, households), and incorporates a recursive forecasting mechanism to enhance multi-step estimation accuracy. The MLR model was optimized using weighted least squares to address heteroscedasticity, while SVR and RFR parameters were tuned via grid search and ten-fold cross-validation. The results indicated that the weighted linear regression (WLR) model performs best in predicting pipeline lengths, while the SVR model achieved higher accuracy in predicting the number of manholes. The proposed modeling framework offers reliable data support and a practical methodology for planning and fine-tuning the management of drainage systems in highly heterogeneous megacities.\u003c/p\u003e","manuscriptTitle":"A hybrid regression framework for estimating urban residential community drainage infrastructure: Model optimization and recursive prediction","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-07-18 18:23:07","doi":"10.21203/rs.3.rs-6754917/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"editorInvited","content":"Water Resources Management","date":"2025-12-03T08:00:42+00:00","index":"","fulltext":""},{"type":"reviewerAgreed","content":"","date":"2025-07-16T14:16:04+00:00","index":0,"fulltext":""},{"type":"reviewersInvited","content":"","date":"2025-07-16T08:28:13+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2025-05-28T02:39:38+00:00","index":"","fulltext":""},{"type":"submitted","content":"Water Resources 
Management","date":"2025-05-27T11:28:54+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"water-resources-management","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"warm","sideBox":"Learn more about [Water Resources Management](https://www.springer.com/journal/11269)","snPcode":"11269","submissionUrl":"https://submission.nature.com/new-submission/11269/3","title":"Water Resources Management","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"em","reportingPortfolio":"Springer Hybrid","inReviewEnabled":true,"inReviewRevisionsEnabled":false}}],"origin":"","ownerIdentity":"e25eac7a-c263-46f0-a61b-26aac753fc83","owner":[],"postedDate":"July 18th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"under-review","subjectAreas":[],"tags":[],"updatedAt":"2025-07-18T18:23:07+00:00","versionOfRecord":[],"versionCreatedAt":"2025-07-18 18:23:07","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-6754917","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-6754917","identity":"rs-6754917","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}



License: CC-BY-4.0