This section presents a comparative analysis of six ML algorithms used to predict the shear strength of RC T-beams. The algorithms include AdaBoost, DT, RF, KNN, Ridge, and the novel Levy-DT, which enhances the standard DT with the LF mechanism. The proposed Levy-DT aims to improve prediction accuracy, offering a more robust approach for complex structural design scenarios. The input parameters comprise geometrical and material properties of the T-beams: shear span-depth ratio (a/d), web width (b), flange width (bf), flange depth (hf), reinforcement ratios (ρs1, ρs2), and concrete compressive strength (fc’). The output is the ultimate shear force (Vu), which characterizes the shear strength of the beam and represents the internal force resisting sliding along the cross-section. Data preprocessing is performed in several stages to ensure reliable model training and testing. First, missing data are addressed using “SimpleImputer” with a mean strategy, which replaces gaps in the dataset with the mean value of each respective feature76. This imputation approach is adopted because the proportion of missing data is low and most features are approximately symmetrically distributed; it preserves the statistical characteristics of the dataset while avoiding the bias introduced by deletion or arbitrary imputation. Next, feature scaling is implemented using “StandardScaler” to normalize all input variables to a common scale with zero mean and unit variance, which is particularly important for distance-based algorithms and improves convergence speed for most ML models. The experimental design incorporates both a traditional train-test split and “k-fold cross-validation” to ensure model robustness. Initially, 80% of the dataset (156 samples) is designated for model training, while the remaining 20% (39 samples) is reserved for testing.
Additionally, a “5-fold cross-validation” approach is implemented to evaluate model performance stability across different data subsets, with each fold maintaining the same preprocessing pipeline to prevent data leakage. All computations and model development, including implementation of the Levy-DT algorithm, are conducted using Python77.
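The preprocessing and splitting steps described above can be sketched as follows. The feature matrix and target below are synthetic placeholders (the study's dataset is not reproduced here), and the variable names are illustrative; the key point is that the imputer and scaler are fitted on the training portion only, as the text requires:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 195-sample, 7-feature dataset
# (a/d, b, bf, hf, rho_s1, rho_s2, fc'); values are illustrative only.
rng = np.random.default_rng(42)
X = rng.normal(size=(195, 7))
X[rng.random(X.shape) < 0.02] = np.nan   # a few missing entries
y = rng.normal(size=195)                 # stand-in for Vu

# 80/20 split first, so imputation and scaling statistics
# are learned from the training data alone (no leakage)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)  # 156 train / 39 test

imputer = SimpleImputer(strategy="mean")  # mean imputation of gaps
scaler = StandardScaler()                 # zero mean, unit variance

X_train = scaler.fit_transform(imputer.fit_transform(X_train))
X_test = scaler.transform(imputer.transform(X_test))
```

Fitting the transformers on the training split and only applying them to the test split mirrors the leakage-prevention discipline the text describes for the cross-validation folds.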
To ensure a fair and rigorous comparison between the proposed Levy-DT and the baseline models, hyperparameter optimization is performed for the standard DT algorithm. This optimization aims to identify the regularization parameters that best balance model complexity with generalization capability. It is conducted using GridSearchCV with 5-fold cross-validation to systematically evaluate parameter combinations. The search space covers the critical regularization parameters: max_depth values from 5 to 15 plus unrestricted depth, min_samples_split values from 2 to 15, min_samples_leaf values from 1 to 8, max_features options including square root, logarithmic, and all features, and min_impurity_decrease thresholds from 0.0 to 0.002. A balanced model selection approach is employed to address potential overfitting: rather than selecting the model with the highest cross-validation score alone, the selection criterion incorporates both test performance and the overfitting gap between training and test scores. Initially, the five most promising models identified through cross-validation are shortlisted for further evaluation. For each candidate, R² scores are computed on both the training and test sets to quantify potential overfitting, and the performance gap between these two scores serves as a measure of model robustness. A balance score is then calculated by penalizing the test R² in proportion to the overfitting gap whenever the gap exceeds a threshold of 0.1. The model with the highest balance score is ultimately selected, reflecting an optimal compromise between accuracy and generalization.
This systematic optimization process provides a robust baseline for comparison with the proposed Levy-enhanced approach, allowing for meaningful evaluation of the enhancement achieved through LF integration.
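The tuning and balance-score selection described above can be sketched as follows. This is a hedged illustration on synthetic data: the grid below is deliberately reduced relative to the full ranges quoted in the text, and the exact penalty form (subtracting the portion of the train-test gap above 0.1) is an assumption about how the study's balance score is computed:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in dataset (195 samples, 7 features)
X, y = make_regression(n_samples=195, n_features=7, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Reduced illustrative grid; the text's full ranges are wider
param_grid = {
    "max_depth": [5, 10, 15, None],
    "min_samples_split": [2, 8, 15],
    "min_samples_leaf": [1, 4, 8],
    "min_impurity_decrease": [0.0, 0.001, 0.002],
}
search = GridSearchCV(DecisionTreeRegressor(random_state=0),
                      param_grid, cv=5, scoring="r2")
search.fit(X_tr, y_tr)

# Shortlist the five best CV candidates, then pick the one with the
# highest balance score: test R² minus a penalty for any train-test
# gap exceeding the 0.1 threshold (assumed penalty form).
top5 = np.argsort(search.cv_results_["rank_test_score"])[:5]
best_model, best_balance = None, -np.inf
for i in top5:
    model = DecisionTreeRegressor(random_state=0,
                                  **search.cv_results_["params"][i])
    model.fit(X_tr, y_tr)
    gap = model.score(X_tr, y_tr) - model.score(X_te, y_te)
    balance = model.score(X_te, y_te) - max(0.0, gap - 0.1)
    if balance > best_balance:
        best_model, best_balance = model, balance
```

The penalty term leaves candidates with a gap below 0.1 ranked purely by test R², so a slightly less accurate but much better-generalizing tree can win the shortlist.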
The sensitivity analysis conducted on the Levy-DT algorithm, as visualized in Fig. 3, provides valuable insights into the influence of key parameters on model performance and computational efficiency. This analysis addresses the relationship between parameter selection and model performance metrics, particularly the R² score and training time.
The analysis of the Levy lambda (λ) parameter reveals a complex relationship with model performance. As illustrated in the first graph, the R² score demonstrates notable sensitivity to λ variations, with peak performance occurring at λ = 1.3 (R² ≈ 0.979). Lower λ values (1.1–1.3) generally yield superior predictive performance compared to higher values (1.4–2.0), where performance stabilizes at a lower level. This indicates that the stochastic search behavior governed by λ has an optimal range for this specific dataset. Training time exhibits an irregular pattern across λ values; notably, at λ = 1.3, where the highest R² score is observed, the training time remains moderate, indicating a favorable balance between performance and computational cost.
The β parameter demonstrates a pronounced effect on model performance, with optimal R² scores observed at a β value of 0.02. However, the difference in R² between β = 0.02 and β = 0.04 is minimal (Δ ≈ 0.0001). Considering that β = 0.04 yields the shortest training time, it is ultimately selected to balance predictive performance and computational efficiency. This choice reflects a deliberate trade-off between near-optimal accuracy and significantly improved training speed. The performance trend shows an initial improvement as β increases from 0.005 to 0.02, followed by a slight decrease at higher values, suggesting that intermediate perturbation magnitudes enable effective exploration of promising solution regions without excessive deviation from original feature values.
Among all the parameters investigated, the maximum tree depth (Dmax) demonstrates the most significant effect on predictive outcomes. R² values fluctuate considerably across different depth limits, with notable improvements at depths of 5 and 10, and declines at depth 7 and when no upper limit is applied. This non-monotonic relationship underscores the complexity of tree-based model capacity optimization and the importance of careful depth selection to prevent both underfitting and overfitting. Training time generally increases with greater Dmax values, reflecting the expected computational cost of building deeper trees. While a general upward trend is visible, the pattern is not strictly linear; in fact, some configurations such as Dmax = 10 achieve high R² values with only a moderate increase in training time, indicating that well-chosen depth values can maintain computational efficiency when aligned with the LF-guided search.
Based on comprehensive sensitivity analysis, the optimal parameter configuration is determined to be λ = 1.3, β = 0.04, Dmax = 10, with 15 iterations. This configuration achieved an R² score of 0.979, RMSE of 26.962, and MAE of 14.446, with a training time of 0.1193 s. When compared to standard DT methods, the Levy-DT algorithm introduces additional computational overhead due to the stochastic perturbation process and multiple iterations. However, the sensitivity analysis demonstrates that this overhead can be minimized through judicious parameter selection while still achieving superior predictive performance. The modest training time of 0.1193 s for the optimal configuration suggests that the LF enhancement mechanism introduces acceptable computational costs relative to the performance benefits gained. The parameter tuning process reveals the delicate balance between model performance and computational efficiency in the Levy-DT algorithm. The optimal configuration achieves superior predictive accuracy compared to standard DT while maintaining reasonable computational demands, validating the practical applicability of the LF enhancement mechanism in tree-based regression tasks.

Sensitivity analysis of λ, β, and Dmax parameters for the Levy-DT algorithm.
The cross-validation analysis implemented in the study, the results of which are presented in Fig. 4, represents a robust methodology for evaluating model performance across different data subsets. The k-fold cross-validation procedure is systematically implemented with several key characteristics. The dataset is partitioned into five equal subsets (folds), with each fold serving as a validation set once while the remaining folds form the training set. Within each fold iteration, crucial preprocessing steps are applied independently to prevent data leakage. Missing values are imputed using the mean strategy via SimpleImputer, and features are standardized using StandardScaler to ensure zero mean and unit variance. For each fold, models are trained on the preprocessed training subset with consistent hyperparameters. Standard models (RF, AdaBoost, KNN, Ridge) are trained with their default parameters, while both DT and Levy-DT models utilize optimized hyperparameters determined through systematic parameter tuning, with DT employing regularization parameters identified via GridSearchCV and Levy-DT utilizing optimized parameters (λ, scale, maximum depth) determined from the sensitivity analysis. The R² score is calculated for each fold’s validation set, capturing the model’s ability to explain variance in unseen data. Figure 4 shows that the proposed Levy-DT algorithm achieved the highest mean R² score (0.939) across all folds, demonstrating its robust predictive capability for shear strength estimation in RC T-beams. The performance hierarchy is clearly established with Levy-DT (0.939) and Ridge (0.932) displaying superior performance, RF (0.912) and AdaBoost (0.907) showing good but slightly reduced predictive power, DT (0.886) providing moderate performance, and KNN (0.607) significantly underperforming compared to other algorithms. 
The error bars represent the standard deviation across folds and indicate that Levy-DT and Ridge demonstrate high stability with small error bars, that KNN shows considerable variability suggesting sensitivity to specific data partitions, and that the other algorithms maintain relatively consistent performance across data subsets. The cross-validation results substantiate the effectiveness of the LF enhancement to the DT algorithm. The improvement of approximately 6% in mean R² over the standard DT (from 0.886 to 0.939) indicates that the stochastic perturbation mechanism successfully mitigates local optima issues in the base algorithm. Furthermore, the consistency across folds confirms that the performance improvement is not coincidental but rather a systematic enhancement provided by the LF mechanism.
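The leakage-safe fold procedure described above can be sketched with a scikit-learn Pipeline, which refits the imputer and scaler inside each fold. The data and the two models shown are illustrative stand-ins, not the study's tuned configurations:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in dataset (195 samples, 7 features)
X, y = make_regression(n_samples=195, n_features=7, noise=10.0, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

for name, model in [("DT", DecisionTreeRegressor(max_depth=10, random_state=0)),
                    ("Ridge", Ridge())]:
    # Imputer and scaler are part of the pipeline, so their statistics
    # are recomputed from each fold's training subset (no leakage)
    pipe = make_pipeline(SimpleImputer(strategy="mean"),
                         StandardScaler(), model)
    scores = cross_val_score(pipe, X, y, cv=cv, scoring="r2")
    print(f"{name}: mean R2 = {scores.mean():.3f} +/- {scores.std():.3f}")
```

Passing the whole pipeline to `cross_val_score` is what guarantees the preprocessing is fitted per fold; fitting the scaler once on the full dataset beforehand would leak validation statistics into training.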

Model performance evaluation with 5-Fold Cross-validation.
Upon examination of the test data, as presented in Table 3, it is evident that the Levy-DT algorithm exhibits superior performance among the evaluated models. It achieves the highest R² value of 0.982, indicating strong predictive accuracy. Furthermore, the model yields the lowest RMSE (27.941) and MSE (780.698) values, reflecting its excellent capability to minimize prediction errors. The MAE of 14.551 further confirms its precision in estimating the shear strength of RC T-beams. The optimized DT model, which underwent systematic hyperparameter tuning to ensure fair comparison, demonstrates improved generalization capability compared to an unconstrained baseline. However, even with optimization, the DT model achieves a moderate R² of 0.731 with corresponding RMSE (97.281) and MSE (9463.552) values that are substantially higher than those of Levy-DT. This performance gap highlights the effectiveness of the LF enhancement in further improving the DT algorithm’s predictive capability beyond conventional optimization approaches. The Ridge regression model demonstrates solid performance with an R² of 0.906, RMSE of 57.544, and MSE of 3311.308. While not as accurate as Levy-DT, it performs better than ensemble methods such as RF (R² = 0.847) and AdaBoost (R² = 0.827), which show higher error values. KNN, on the other hand, yields the weakest test results with an R² of 0.730, RMSE of 97.490, and MSE of 9504.266, indicating that it struggles to capture the nonlinearities of the dataset.
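The four metrics quoted from Table 3 are related in a fixed way (RMSE is the square root of MSE), and can be computed from any prediction vector as follows; the arrays below are small illustrative values, not the study's data:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Illustrative measured vs. predicted shear strengths (kN)
y_true = np.array([120.0, 250.0, 310.0, 95.0, 180.0])
y_pred = np.array([115.0, 260.0, 300.0, 100.0, 175.0])

mse = mean_squared_error(y_true, y_pred)   # mean of squared errors -> 55.0
rmse = np.sqrt(mse)                        # RMSE = sqrt(MSE)
mae = mean_absolute_error(y_true, y_pred)  # mean of absolute errors -> 7.0
r2 = r2_score(y_true, y_pred)              # 1 - MSE / variance of y_true
```

Because RMSE is just the square root of MSE, the table's paired values can be cross-checked against each other (for Levy-DT, 27.941² ≈ 780.7, matching the reported MSE of 780.698).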
As detailed in Table 4, performance on the training dataset reveals important insights about model behavior and generalization capability. The optimized DT model, despite systematic hyperparameter tuning aimed at preventing overfitting, still achieves very high training performance (R² = 0.986, RMSE = 9.988, MSE = 99.753, MAE = 5.250). However, when compared to its test performance (R² = 0.731, RMSE = 97.281 in Table 3), a substantial performance gap remains evident, indicating that even with regularization, complete elimination of overfitting remains challenging for the standard DT algorithm. In contrast, the Levy-DT model demonstrates consistently high performance across both training and test datasets with minimal performance degradation, suggesting that the incorporation of LF-based perturbations provides an additional layer of regularization that effectively enhances generalization beyond conventional hyperparameter optimization approaches. This consistent performance across training and testing phases validates the effectiveness of the stochastic perturbation mechanism in mitigating overfitting tendencies inherent in tree-based algorithms. The Ridge model performs exceptionally well during training (R² = 0.990, RMSE = 13.860), demonstrating its inherent stability and regularization effectiveness. Similarly, RF (R² = 0.982) and AdaBoost (R² = 0.969) exhibit strong training accuracy with acceptable error metrics, benefiting from their ensemble nature that naturally provides regularization through model averaging and boosting mechanisms respectively. KNN, by contrast, shows considerably higher RMSE (64.768) and MSE (4194.849) during training, along with a relatively lower R² of 0.784, indicating a weaker fit even on the training data. This poor performance can be attributed to several factors. 
First, the feature space characteristics of the dataset make distance-based algorithms such as KNN less effective, as the Euclidean distance metric fails to capture the complex relationships between features and target variables in this context. Moreover, the dimensionality of the dataset poses challenges for KNN, as the algorithm suffers from the “curse of dimensionality” where distance measurements become less meaningful in higher-dimensional spaces. The comparison with other algorithms, particularly tree-based methods such as DT and Levy-DT, demonstrates the limitations of purely distance-based approaches for this specific prediction task, where hierarchical decision boundaries prove more effective than neighborhood-based predictions.
Figure 5 compares the predicted values with the true values for both the training and test datasets across the algorithms employed. Each subplot visualizes the alignment of the predicted values (y-axis) with the true values (x-axis), where the dotted diagonal line represents perfect predictions. The Levy-DT algorithm demonstrates a strong agreement between the predicted and true values for both the training (green) and test (blue) datasets, as most points lie close to the diagonal line. This reflects the superior performance of the algorithm in accurately capturing underlying data patterns with excellent generalization capability. The optimized DT algorithm exhibits reasonable correspondence between the predicted and true values in the training dataset; however, it presents a noticeable increase in deviation in the test dataset, indicating a performance gap between training and testing phases despite hyperparameter optimization. This performance difference highlights the inherent limitations of conventional regularization techniques in completely addressing generalization challenges in DT algorithms, thereby demonstrating the added value of the LF enhancement mechanism. The Ridge and RF algorithms exhibit a moderate degree of alignment, with discernible deviations from the diagonal line, especially within the test set, which indicates a comparatively lower level of predictive accuracy. AdaBoost also performs reasonably well, though its predictions for the test data deviate more from the true values compared to the tree-based algorithms, signaling its relatively lower accuracy for this specific dataset. The KNN algorithm has the most scattered predictions, particularly for larger values, highlighting its weaker performance compared to the other models. This is consistent with the lower R² and higher error metrics observed for KNN in the test data, as it struggles to generalize well to the complex relationships in the dataset. 
In general, the figure visually reinforces the quantitative results from the tables, emphasizing the robustness of Levy-DT in providing accurate predictions with superior generalization capability, particularly in comparison to both the optimized baseline DT and other comparative methods.

Performance evaluation of different regression algorithms in shear strength prediction.
Figure 6 presents the performance of the Levy-DT model across 15 iterations using two key metrics: the R² score and the RMSE. Initially, the R² score plot reveals a perfect score (1.0) for the training data at iteration 0, corresponding to the standard DT model, while the validation score remains lower, around 0.95. This discrepancy indicates the standard DT’s tendency to overfit the training data. However, as the iterations progress, the “Best R² So Far” curve shows a gradual increase, approaching 0.98. This curve represents the best performance obtained up to each iteration and illustrates how the Levy-DT algorithm progressively improves its generalization capability. Notably, substantial fluctuations are observed between the training and validation R² scores at several iterations (e.g., the 3rd, 5th, and 9th). These variations are attributed to the stochastic nature of LF-based perturbations. Nonetheless, the applied improvement strategies enable the algorithm to recover from such instabilities, and the “Best R² So Far” curve maintains consistently high performance. A similar trend is evident in the RMSE plot. Initially, the training RMSE is very low (close to 0), whereas the validation RMSE is significantly higher. As the iterations proceed, the gap between the training and validation RMSE tends to narrow, indicating an enhancement in the model’s generalization performance. Similar to the R² score plot, the RMSE values also exhibit fluctuations at certain iterations, consistent with the inherent randomness introduced by the Levy mechanism. After the 10th iteration, both training and validation metrics become more stable, suggesting that the Levy-DT algorithm begins to converge and achieves optimal performance. At this stage, the best R² score stabilizes around 0.98, while the difference between training and validation metrics is minimized. 
This convergence analysis highlights a significant advantage of the Levy-DT model: it effectively mitigates the overfitting issue commonly observed in standard DTs, while consistently enhancing generalization across iterations. These findings demonstrate that Levy-DT can offer reliable and stable performance in complex regression tasks.
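The best-so-far behavior traced in Fig. 6 can be sketched as an iterative loop. The paper does not publish its exact update rule, so where the perturbation is applied (here, Lévy-distributed noise added to the training features before each refit) is an assumption, and the data are synthetic:

```python
import numpy as np
from math import gamma, sin, pi
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

def levy(lam, beta, size, rng):
    # Mantegna-style heavy-tailed steps (assumed sampler)
    s = (gamma(1 + lam) * sin(pi * lam / 2)
         / (gamma((1 + lam) / 2) * lam * 2 ** ((lam - 1) / 2))) ** (1 / lam)
    return beta * rng.normal(0, s, size) / np.abs(rng.normal(0, 1, size)) ** (1 / lam)

# Synthetic stand-in dataset
X, y = make_regression(n_samples=195, n_features=7, noise=10.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

rng = np.random.default_rng(0)
# Iteration 0: the standard DT baseline
best_model = DecisionTreeRegressor(max_depth=10, random_state=0).fit(X_tr, y_tr)
best_r2 = best_model.score(X_va, y_va)

for _ in range(15):  # 15 iterations, matching the configuration in the text
    # Perturb training features with Levy noise (lam=1.3, beta=0.04), refit
    X_pert = X_tr + levy(1.3, 0.04, X_tr.shape, rng)
    model = DecisionTreeRegressor(max_depth=10, random_state=0).fit(X_pert, y_tr)
    r2 = model.score(X_va, y_va)
    if r2 > best_r2:  # keep the best-so-far model, as in Fig. 6's curve
        best_model, best_r2 = model, r2
```

Because only improving iterations replace the incumbent, the "Best R² So Far" curve is monotone even while individual iterations fluctuate, which matches the behavior described for Fig. 6.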

Convergence behavior of the Levy-DT model over training iterations.
Figure 7 provides a detailed comparison between the actual and predicted shear strength values for a series of specimens, as derived from the Levy-DT algorithm. The blue line represents the measured shear strengths, while the red line shows the values predicted by the Levy-DT model. The close overlap of the two lines for most specimens indicates that the algorithm reproduces the measured shear strengths with high fidelity. The model captures the important variations in shear strength across the dataset, enabling a meaningful comparison of RC beams under different conditions. One of the most striking features of the plot is the pronounced peaks and troughs in both the observed and predicted series, which reflect the inherent variability of the dataset owing to differing specimen properties. The algorithm matches these peaks and valleys with a high level of accuracy, both where shear strength rises sharply and where it drops, indicating that Levy-DT recognizes and responds to the underlying structural patterns even where the response varies strongly between samples. The robustness of the algorithm is evident across all regions of the data: even where the shear strength changes rapidly, the predicted values remain close to the measurements. This demonstrates the model's ability to generalize across a diverse set of specimens whose geometries and materials alter their shear behavior, maintaining a near-perfect match between predicted and measured values even in complex, multi-parameter datasets.
Another notable aspect of Fig. 7 is the model's treatment of atypical specimens. Although small discrepancies between predicted and measured values are visible, particularly at some peaks, these errors are minor and do not compromise the overall quality of the predictions. This indicates that the Levy-DT algorithm retains high predictive accuracy even for data points representing challenging or atypical conditions, and that it handles noise and outliers in the data effectively.

Comparison of actual and predicted shear strength values for Levy-DT algorithm.
Figure 8 shows the Taylor diagram comparing the ML models used to predict the shear strength of RC T-beams. The diagram summarizes how closely each model's predictions match the observed values through three performance measures: standard deviation, correlation coefficient, and root mean square difference (RMSD), allowing a comprehensive comparison of predictive accuracy. The radial distance from the origin represents the standard deviation of the predicted values, so models whose radial position matches that of the observed data reproduce its variability. The correlation coefficient is read along the circular arcs radiating from the center, with values from 0 to 1; models with stronger correlation lie closer to the right-hand side of the diagram, indicating closer agreement between predictions and observations. RMSD is encoded by color, with the gradient scale on the right ranging from 15 to 50: lower (better) RMSD values appear in cooler colors such as blue and cyan, while higher values appear in warmer colors such as orange and red. As the figure shows, the Levy-DT algorithm occupies the best position, combining a strong correlation (near 0.99), a comparatively low standard deviation, and one of the lowest RMSD values, rendered in cool blue. Levy-DT thus captures the variability of the data while remaining closely aligned with the measured values, making it the best-performing model among those examined. The traditional DT method sits close to Levy-DT, with a high correlation coefficient and a slightly larger standard deviation, and still compares favorably with the remaining models. KNN, in contrast, lies far from the reference point, with a lower correlation coefficient (around 0.7) indicating weaker predictive ability.
Its standard deviation substantially exceeds that of the observed values, and its RMSD falls into the warmer color range, confirming its lower precision. KNN therefore struggles to capture the variability in the data and exhibits significantly higher prediction errors. The AdaBoost and RF models demonstrate satisfactory performance: both are positioned closer to the reference than KNN, with correlation coefficients of around 0.8, but their RMSD values exceed those of Levy-DT and DT, implying a trade-off between correlation and error reduction. Both also exhibit standard deviations larger than the observed data, suggesting that they overstate the variability of the dataset. Ridge regression lies between the tree-based models and KNN, with moderate overall performance: its correlation coefficient is slightly higher than those of AdaBoost and RF but lower than those of Levy-DT and DT, its standard deviation is close to the observed value, and its RMSD is moderate, represented by a mid-range color on the scale. Ridge therefore delivers acceptable predictive performance but does not match the tree-based algorithms in reducing error. Overall, the Taylor diagram in Fig. 8 provides a clear and informative depiction of the performance differences among the ML models. Levy-DT emerges as the best model across all key measures, with DT not far behind, and the weaker positions of KNN, AdaBoost, and RF in terms of correlation and standard deviation underscore the advantage of tree-based methods for shear strength prediction. The figure also highlights the importance of selecting models that strike an appropriate balance between variance, correlation, and error when making predictions in structural engineering.

Taylor diagram representation of algorithm performance.
