Experimental assessment with data-driven machine learning-based prediction of compressive strength of waste natural fiber-reinforced sustainable concrete

Analytical overview of dataset attributes

In the present investigation, a comprehensive dataset of 444 data points was compiled from previously published research. The dataset focuses on concrete mix compositions and comprises the critical input parameters cement, water, natural fiber, supplementary cementitious materials (SCMs), sand, coarse aggregate, and curing period. Each of the 444 mix-proportion samples therefore has 8 attributes, with compressive strength as the response variable. Table 3 summarizes the statistics of the input and output variables, including their minimum and maximum values, kurtosis, skewness, and standard deviation (std). Variables with a low standard deviation, such as natural fiber, SCM, and compressive strength, have values clustered around the mean, whereas cement, sand, and coarse aggregate, with larger standard deviations, are more widely dispersed. Skewness, which can be zero, positive, negative, or undefined, measures the asymmetry of a variable’s probability distribution about its mean⁷⁵. Positive skewness, as for compressive strength and SCM (Fig. 4e), indicates that the data are concentrated at lower values with a longer tail to the right, whereas negative skewness (Fig. 4a) indicates that the data are concentrated at higher values with a longer tail to the left. Kurtosis measures the sharpness or flatness of a distribution relative to the normal distribution and indicates whether the data contain more or fewer extreme values, with values between −10 and +10 generally regarded as typical⁷⁶. Table 3 reports a notably higher kurtosis for curing period (days), reflecting the small number of distinct curing ages and their strong clustering around 28 days. The skewness and kurtosis values indicate that several input variables are not normally distributed. Such non-normality can distort the performance metrics (R², RMSE, MAPE, and MAE) of models that assume normally distributed data, which would otherwise call for normalization or transformation.
Non-parametric models, which do not rely on normality assumptions, were therefore employed in this study.
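The statistics in Table 3 can be reproduced for any single variable with a few lines of code. The sketch below is illustrative only: the software used for Table 3 is not stated, so the moment-based (population) estimators for skewness and excess kurtosis are an assumption, and libraries that apply small-sample corrections (for example pandas) will give slightly different values. The curing ages in the usage lines are hypothetical.

```python
from statistics import fmean, stdev


def describe(values):
    """Summary statistics of one variable, in the spirit of Table 3.

    Skewness and kurtosis use simple moment-based (population) estimators;
    kurtosis is reported as *excess* kurtosis, so a normal distribution gives 0.
    """
    n = len(values)
    mean = fmean(values)
    # central moments with divisor n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "std": stdev(values),          # sample standard deviation (divisor n - 1)
        "skewness": m3 / m2 ** 1.5,    # > 0: longer right tail, < 0: longer left tail
        "kurtosis": m4 / m2 ** 2 - 3,  # > 0: sharper peak / heavier tails than normal
    }


# Hypothetical curing ages (days), not the study's data
curing_days = [7, 28, 28, 28, 28, 28, 56, 90]
for name, value in describe(curing_days).items():
    print(f"{name:>9}: {value:.3f}")
```

A distribution dominated by one repeated value, like the hypothetical curing ages above, is exactly the case in which excess kurtosis becomes large.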

The Seaborn library was employed to generate the correlation contour maps in Fig. 4, which illustrate the relationship between each mix component of the concrete and compressive strength (CS). Figure 4a shows a high concentration of cement contents within the 350–450 kg/m³ range. Figure 4b shows that compressive strength values are most concentrated at fine aggregate contents of 650–800 kg/m³, indicating that this is a commonly adopted range in the mixes. In Fig. 4c, a large group of data points lies between 800 and 1200 kg/m³ of coarse aggregate, where compressive strength tends to peak.

Table 3 Statistical description of the parameters.

Figure 4d illustrates the relationship between water content and compressive strength, with the highest data concentration observed between 150 and 220 kg/m³ of water and 20–50 MPa of strength. The marginal plots indicate that most values cluster around 180–200 kg/m³ of water and 30–40 MPa of compressive strength. Figure 4e–g illustrates the influence of SCM content, curing age, and natural fiber percentage on compressive strength. Figure 4e reveals two distinct peaks in SCM content, mostly between 5 and 10 kg/m³, each associated with different strength levels. As shown in Fig. 4f, the data are concentrated at curing ages of 20–40 days, with most values clustered around 28 days, so curing age shows the least variation of all input parameters. Lastly, Fig. 4g indicates that compressive strength is highest at natural fiber contents near 0%, where most observations lie, with only a few observations extending beyond 1%.

Figure 5 presents the Pearson correlation coefficients calculated between all pairs of variables. The coefficient lies between −1 and 1: values greater than 0 indicate a positive correlation between two variables, values less than 0 indicate a negative correlation, and the closer the coefficient is to either extreme, the stronger the correlation. The heat map shows that six of the seven input variables are positively correlated with compressive strength (CS), the exception being water; several of these inputs are, however, more strongly correlated with water than with CS. Notably, cement shows the highest positive correlation with CS (r = 0.47), suggesting that concrete strength depends directly on this factor. The moderate negative correlation between water content and CS indicates that strength decreases as water content increases. Curing age and fine aggregate are also positively correlated with CS, indicating that strength tends to increase as they increase.
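For reference, the coefficient behind each cell of a heat map like Fig. 5 can be computed directly from its definition, r = Σ(x − x̄)(y − ȳ) / √(Σ(x − x̄)² Σ(y − ȳ)²). The plain-Python sketch below is not the authors’ code; the variable names and values in the usage lines are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))  # co-variation
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(columns):
    """Pairwise r for a dict {name: values}, i.e. the numbers behind a correlation heat map."""
    names = list(columns)
    return {(a, b): pearson_r(columns[a], columns[b]) for a in names for b in names}


# Hypothetical mixes: cement and water in kg/m³, CS in MPa
data = {
    "cement": [300, 350, 400, 450],
    "water": [200, 190, 180, 170],
    "CS": [28, 33, 37, 41],
}
matrix = correlation_matrix(data)
print(round(matrix[("cement", "CS")], 3), round(matrix[("water", "CS")], 3))
```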

Error plot and performance evaluation of ML algorithms

Figure 6a illustrates the predictive performance of the XGB model in estimating concrete compressive strength (CS), presenting the experimental and predicted CS values along with the error distribution for the training and testing datasets. During the training phase, the model maintains relatively consistent prediction accuracy, with errors largely contained within 31 MPa. During the testing phase, the error margin increases slightly, to approximately 37 MPa, indicating a controlled rise in prediction deviation. Figure 7a provides a scatter plot of predicted versus experimental compressive strength, with R² values of 0.9556 for training and 0.8295 for testing, indicating accurate prediction of the training set and moderately accurate prediction of the testing set. Figure 6b illustrates the predictive performance of the LGBM model. In the line graph, the training predictions align with the experimental values, with errors remaining within 29 MPa; during the testing phase, the errors stay within 27 MPa. The LGBM training error range is thus slightly smaller than that of XGB (29 vs. 31 MPa), and its testing error range is 10 MPa smaller (27 vs. 37 MPa). Figure 7b indicates a strong correlation between predicted and actual values, with R² values of 0.9230 during training and 0.8637 during testing. LGBM therefore performs slightly worse than XGB during training but outperforms it in testing R², suggesting that LGBM provides more stable and reliable predictions than XGB. Figure 6c compares the observed and predicted compressive strength of the NRFC concrete using the RF model.
The line plot shows that the estimated values follow the experimental trends well in the training phase, with most errors remaining within 32 MPa. In the testing phase, the errors remain within 30 MPa, although accuracy on unseen data is reduced. The RF error range is larger than that of LGBM in both stages; compared with XGB, it is slightly larger in training (32 vs. 31 MPa) but smaller in testing (30 vs. 37 MPa).

The scatter plot in Fig. 7c presents the relationship between predicted and actual values for the RF model; the training data points align more closely with the experimental values than the testing data points do. The R² values for training and testing are 0.9379 and 0.7884, respectively. These results imply that, while the RF model learns the training data effectively, its generalization capacity is weaker than that of the LGBM and XGB models, with higher dispersion and lower testing accuracy. In Fig. 6d, the error plot shows that the predicted values of the KNN model follow the experimental trends for both the training and testing datasets. However, noticeable fluctuations occur throughout, with errors exceeding 50 MPa during the training phase and remaining within 40 MPa during the testing phase, indicating reduced consistency in the model’s generalization. The mismatch between observed and predicted results is larger, with errors in both stages exceeding those of the XGB, LGBM, and RF models, suggesting weaker predictive reliability. Figure 7d supports this observation with a scatter plot of predicted versus experimental values for the KNN model. The R² values of 0.8460 for training and 0.8257 for testing indicate moderate accuracy. Although the KNN model is less accurate than the XGB, LGBM, and RF models during training, its testing accuracy is comparable to that of XGB, higher than that of RF, and lower than that of LGBM.
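Figure 6 is captioned as a relative error plot, while the text quotes error bounds in MPa. The sketch below computes both quantities per sample; reading a phrase such as "errors within 31 MPa" as the largest absolute error is an assumption about how the plots were interpreted, and the CS values in the usage lines are hypothetical.

```python
def prediction_errors(y_true, y_pred):
    """Per-sample absolute error (MPa) and relative error (%) of CS predictions."""
    abs_err = [abs(p - t) for t, p in zip(y_true, y_pred)]
    rel_err = [100.0 * e / abs(t) for e, t in zip(abs_err, y_true)]
    return abs_err, rel_err


def error_bound(y_true, y_pred):
    """Largest absolute error, i.e. the kind of bound quoted as 'errors within X MPa'."""
    abs_err, _ = prediction_errors(y_true, y_pred)
    return max(abs_err)


# Hypothetical experimental and predicted CS values (MPa)
observed = [32.0, 41.5, 27.3, 55.0]
predicted = [30.1, 44.0, 29.0, 49.8]
abs_err, rel_err = prediction_errors(observed, predicted)
print(error_bound(observed, predicted), [round(r, 1) for r in rel_err])
```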

Fig. 6 Relative error of compressive strength for training and testing datasets of (a) XGB, (b) LGBM, (c) RF, (d) KNN, (e) DT, (f) GB models.

As shown in Fig. 6e, the compressive strength values predicted by the DT model align closely with the experimental data during the training phase. Most prediction errors are within 32 MPa, with a significant portion within 5 MPa, indicating a high degree of accuracy. The training error range is similar to that of the XGB, LGBM, and RF models. In the testing phase, however, prediction accuracy declines clearly, with errors exceeding 55 MPa. This indicates weaker generalization, with a testing error range wider than those of the XGB, LGBM, RF, and KNN models. Figure 7e supports this observation through a scatter plot of predicted versus experimental values for the DT model. The training data points cluster tightly around the ideal diagonal line, yielding a high R² of 0.9499, which indicates training accuracy superior to that of KNN and comparable to that of XGB, LGBM, and RF. In contrast, the testing data show large scatter and deviation, with an R² of 0.7212, the lowest testing value of all models, indicating that the predictive reliability of DT is greatly diminished on unseen data. Figure 6f shows the error between observed and predicted CS values for the training and testing sets of the GB model. The noticeable fluctuations, especially in the testing phase, reveal that the model sometimes struggles with new data. Training errors range up to 30 MPa, whereas testing errors extend to about 45 MPa, suggesting that, in terms of error range, the GB model is more reliable than DT but slightly less reliable than the XGB, LGBM, RF, and KNN models. The scatter plot of predicted versus experimental values for the GB model is shown in Fig. 7f.
In the training stage, as with the DT model, most points lie close to the diagonal line representing perfect prediction. While the training data fit closely, the testing data show a broader spread, particularly at higher strengths. The R² values of 0.9499 for training and 0.8326 for testing reflect this pattern: the model learns well during training but is less accurate on unseen data. The GB model performs reasonably well in the training stage in comparison with the XGB, DT, RF, and LGBM models, but its testing accuracy is lower than that of LGBM. According to the error plots, the LGBM model shows smaller errors than all other models in both the training and testing phases. Based on the scatter plots, LGBM has slightly lower predictive accuracy than XGB, RF, DT, and GB during training; however, it outperforms all models during testing, indicating a superior capability to predict and generalize to new data.
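The comparison above rests on two R² values per model. As a simple cross-check, the sketch below tabulates the training and testing R² values reported in the text, ranks the models by testing R², and computes the train–test gap as a rough overfitting indicator; the gap is an illustrative addition, not a criterion stated by the authors.

```python
# Training and testing R² values as reported in the text (Fig. 7)
R2 = {
    "XGB": (0.9556, 0.8295),
    "LGBM": (0.9230, 0.8637),
    "RF": (0.9379, 0.7884),
    "KNN": (0.8460, 0.8257),
    "DT": (0.9499, 0.7212),
    "GB": (0.9499, 0.8326),
}


def rank_by_test_r2(scores):
    """Sort models by testing R² (best first) and attach the train-test gap."""
    rows = [(name, train, test, train - test) for name, (train, test) in scores.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)


for name, train, test, gap in rank_by_test_r2(R2):
    print(f"{name:5s} train={train:.4f} test={test:.4f} gap={gap:.4f}")
```

With these values the ranking runs LGBM, GB, XGB, KNN, RF, DT, and DT has the largest gap (0.2287). KNN has the smallest gap (0.0203), which reflects its uniformly moderate fit rather than superior accuracy, so the gap should be read together with the testing R² itself.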

Comparative analysis of ML algorithms

The radar charts in Fig. 8 and Table 4 present the evaluation metrics of the machine learning algorithms, namely R², RMSE, MAE, and MAPE for training and testing; test performance, rather than training accuracy, is the critical factor in identifying the best model. Figure 8a compares the R² values of the six models on the training and testing datasets. Among them, LGBM stood out, combining the highest testing R² (0.8637) with a training R² of 0.9230, which indicates that it fits the training data well and generalizes well to new data. The XGB model, by contrast, fits the training data best, with the highest training R² of 0.956, but its prediction and generalization ability is weaker than that of LGBM, with a testing R² of 0.829.

Other models, such as GB, RF, and DT, performed well during training but dropped considerably on the test set, pointing to overfitting and poorer generalization, while KNN showed the weakest training performance (R² = 0.8460), although its testing R² of 0.8257 exceeds those of RF and DT. Figure 8b displays a radar chart comparing the mean absolute error (MAE) of the six models on both datasets; lower MAE values indicate better performance. The chart shows that LGBM performed best, with the lowest testing MAE (2.950). In contrast, the DT, RF, GB, XGB, and KNN models had higher MAE values, especially on the testing set, indicating larger average prediction errors and possible overfitting. Figure 8c displays the MAPE values, where lower values again indicate better performance; LGBM achieves the lowest MAPE (0.108) among all algorithms, followed by GB (0.114) and XGB (0.115). Figure 8d displays the RMSE values, where lower values likewise signify better performance. As with MAPE, LGBM performs best on the testing data, with the lowest RMSE (4.192), followed by GB (4.646) and XGB (4.689). As presented in Table 5, all models considered in this study performed within an acceptable range compared with models used in previous studies.
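The four indicators in Table 4 and Fig. 8 follow standard definitions. The plain-Python sketch below shows them side by side; returning MAPE as a fraction rather than a percentage is an assumption inferred from the magnitude of the reported values (e.g. 0.108), and the CS values in the usage lines are hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """R², RMSE, MAE and MAPE (as a fraction) for paired observed/predicted values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)             # residual sum of squares
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)    # total sum of squares
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "RMSE": math.sqrt(ss_res / n),
        "MAE": sum(abs(r) for r in residuals) / n,
        "MAPE": sum(abs(r) / abs(t) for r, t in zip(residuals, y_true)) / n,
    }


# Hypothetical testing-set values (MPa)
y_test = [25.4, 31.0, 38.2, 44.9, 52.3]
y_hat = [27.0, 29.8, 36.5, 47.1, 49.9]
for key, value in regression_metrics(y_test, y_hat).items():
    print(f"{key}: {value:.3f}")
```

Because RMSE squares the residuals, it penalizes a few large misses more than MAE does, which is why the two rankings can differ between models.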

Table 4 Evaluation of the models’ performance indicators.

Table 5 Comparison of model performance with models from previous research.

Evaluating performance using the Taylor diagram

The Taylor diagram in Fig. 9 evaluates the performance of the six machine learning models (XGB, LGBM, RF, KNN, DT, and GB) by illustrating their standard deviation and correlation with the observed data during training and testing. This tool helps assess how well each model captures the pattern and variability of the observed values. In both phases, all models show high correlation with the observed data, with R values clustering around 0.95 or higher, indicating effective predictive capability. The dashed black lines indicate lines of constant correlation, which show how closely the models match the observed patterns. In the training phase, LGBM and RF lie closest to the observed reference point (red star), indicating strong correlation and a similar standard deviation; these models thus capture the variability of the data while maintaining good predictive performance. The testing phase follows a similar pattern, with LGBM and RF again remaining closer to the reference point than the other models. The differences are small, but LGBM is slightly closer to the reference point, indicating slightly better generalization on the test set. XGB and GB also perform well in both phases but deviate slightly more from the observed standard deviation. In contrast, KNN and DT are positioned farther from the reference, particularly in terms of standard deviation, suggesting lower precision in capturing the variability of the observed data. Overall, the Taylor diagram indicates that LGBM and RF are the most stable and consistent models in this study, balancing precision and variability in both the training and testing datasets.
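A Taylor diagram places each model using three linked statistics: the standard deviation of its predictions, its correlation with the observations, and the centred RMS difference. The sketch below computes them from paired observations and predictions; it uses population (divisor n) standard deviations so that the law-of-cosines relation between the three holds, and the sample values are hypothetical.

```python
import math


def taylor_stats(obs, pred):
    """Statistics that place one model on a Taylor diagram.

    Returns (std_obs, std_pred, r, crmsd): the observed and predicted standard
    deviations, their Pearson correlation and the centred RMS difference.
    With population statistics, up to rounding,
        crmsd**2 == std_pred**2 + std_obs**2 - 2 * std_pred * std_obs * r,
    which is the law-of-cosines relation the diagram is built on.
    """
    n = len(obs)
    mo, mp = sum(obs) / n, sum(pred) / n
    so = math.sqrt(sum((o - mo) ** 2 for o in obs) / n)
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred) / n)
    r = sum((o - mo) * (p - mp) for o, p in zip(obs, pred)) / (n * so * sp)
    crmsd = math.sqrt(sum(((p - mp) - (o - mo)) ** 2 for o, p in zip(obs, pred)) / n)
    return so, sp, r, crmsd


# Hypothetical observed vs predicted CS (MPa). The reference point sits at radius
# std_obs on the r = 1 axis; the model sits at radius std_pred and angle acos(r).
so, sp, r, crmsd = taylor_stats([20, 30, 40, 50], [22, 29, 43, 47])
print(f"std_obs={so:.3f} std_pred={sp:.3f} r={r:.4f} crmsd={crmsd:.3f}")
```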

SHapley additive explanations (SHAP) analysis

SHAP values were used to estimate the impact of the input features on the model predictions, as shown in Fig. 10. The mean absolute SHAP value denotes the average contribution of each feature across all data samples. Feature importance was evaluated using the LGBM model because of its strong performance in the previous analyses. Among all input variables, cement content emerged as the most influential factor, with a mean absolute SHAP value of 5.73, highlighting its dominant role in enhancing compressive strength (CS), consistent with its known contribution to strength gain through hydration and the binding of other mix components. Water content and coarse aggregate followed in importance, with water playing the more critical role owing to its influence on hydration, workability, and durability, while coarse aggregate improves load distribution and reduces shrinkage. SCM ranked next, contributing to strength by refining the concrete microstructure and supporting secondary C-S-H formation. Fine aggregate also showed a notable impact by ensuring mix cohesion and density. Finally, fiber content and curing age, though lower in SHAP value, played distinct roles in the material’s strength development over time.
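Global importance bars such as those in Fig. 10 are conventionally the mean absolute SHAP value per feature. Assuming a SHAP matrix has already been computed (for a tree model this could come from `shap.TreeExplainer(model).shap_values(X)` in the `shap` package), the aggregation is a short step; the matrix values below are hypothetical.

```python
def mean_abs_shap(shap_matrix, feature_names):
    """Global importance: mean absolute SHAP value of each feature, most important first.

    shap_matrix has one row per data instance and one column per feature.
    """
    n = len(shap_matrix)
    importance = {
        name: sum(abs(row[j]) for row in shap_matrix) / n
        for j, name in enumerate(feature_names)
    }
    return sorted(importance.items(), key=lambda item: item[1], reverse=True)


# Hypothetical SHAP values (MPa) for three instances and three features
features = ["cement", "water", "curing_period"]
shap_matrix = [
    [6.1, -2.0, 0.4],
    [-4.8, 1.5, 0.2],
    [5.9, -2.6, -0.3],
]
for name, value in mean_abs_shap(shap_matrix, features):
    print(f"{name:>14}: {value:.2f}")
```

Taking the absolute value matters: positive and negative contributions of the same feature would otherwise cancel in the average.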

SHAP values were also used to evaluate the relative influence of each input variable on the predicted compressive strength; the findings are presented in Fig. 11. Each dot represents one data instance: its position on the x-axis is the SHAP value, whose magnitude and sign give the size and direction of that feature’s effect on the model output. Features are ranked on the y-axis by their overall effect, and the color gradient from blue to red denotes the actual feature value, from low (blue) to high (red). In Fig. 11, cement content is the most impactful feature, consistently contributing positively to strength predictions, especially at higher values, as shown by the cluster of red points with strongly positive SHAP values. Water content also plays a significant role, though its influence is mixed.

Higher water content may initially aid workability, but it tends to lower strength, as reflected by the spread of positive and negative SHAP values. Coarse aggregate appears similarly impactful, generally enhancing strength but with an influence that depends on its amount. SCM and fine aggregate also show a positive effect on strength, though to a lesser extent, contributing to a denser and more cohesive mix. Natural fiber and curing period, in contrast, exhibit relatively small SHAP values, indicating a subtler effect on the prediction, although they still contribute positively overall. The distribution and variation of SHAP values across data points emphasize the interactive nature of these features in the model’s prediction of compressive strength.
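The sign and magnitude of a SHAP value come from Shapley values: a feature’s average marginal contribution over all coalitions of the other features. The sketch below computes them exactly by enumeration for a tiny hypothetical model, with "absent" features set to a single hypothetical baseline mix. This is a simplified variant for illustration; explainers in the `shap` package use a different formulation and background data, so their numbers would not match exactly.

```python
from itertools import combinations
from math import factorial


def exact_shapley(f, x, baseline):
    """Exact Shapley values of model f at instance x.

    A feature that is 'absent' from a coalition takes its value from `baseline`
    (a single reference instance). Every coalition is enumerated, so this is only
    practical for a handful of features.
    """
    m = len(x)

    def value(coalition):
        # features in the coalition come from x, all others from the baseline
        return f([x[k] if k in coalition else baseline[k] for k in range(m)])

    phi = []
    for i in range(m):
        others = [k for k in range(m) if k != i]
        contribution = 0.0
        for size in range(m):  # coalitions of the other features have 0..m-1 members
            weight = factorial(size) * factorial(m - size - 1) / factorial(m)
            for subset in combinations(others, size):
                s = set(subset)
                contribution += weight * (value(s | {i}) - value(s))
        phi.append(contribution)
    return phi


# Hypothetical toy model (not fitted to the study's data): CS in MPa from
# cement, water and natural fiber (kg/m³), with a cement-fiber interaction.
def toy_strength(z):
    cement, water, fiber = z
    return 10 + 0.08 * cement - 0.10 * water - 2.0 * fiber ** 2 + 0.01 * cement * fiber


x = [420, 175, 0.5]          # instance to explain
baseline = [350, 190, 1.0]   # reference mix
phi = exact_shapley(toy_strength, x, baseline)
print([round(p, 3) for p in phi])
print(sum(phi), toy_strength(x) - toy_strength(baseline))  # equal up to rounding
```

The last line illustrates the additivity property behind SHAP plots: the attributions sum to the difference between the prediction and the baseline prediction.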

The heatmap in Fig. 12 visualizes the SHAP values of all features across all data instances, capturing how each input variable contributes to individual predictions. The top curve, f(x), shows the model output for each instance, while the color gradient represents the magnitude and sign of each feature’s SHAP value, from positive (red) to negative (blue). Cement content has a strong positive impact on the predictions in the initial portion of the dataset (approximately the first 100 instances), indicated by intense red coloring, while its contribution becomes more neutral or slightly negative towards the end. Water content, in contrast, shows a more varied influence, with alternating bands of red and blue, suggesting a dual role in both enhancing and reducing compressive strength depending on the values of other features. Coarse aggregate and SCM show mixed SHAP values throughout the dataset, indicating that their impact on prediction varies widely across data combinations. Fine aggregate, natural fiber, and curing period appear to have a relatively minor influence overall, as shown by the lighter, more neutral tones, although subtle patterns suggest conditional relevance in certain subsets of the data. The heatmap thus highlights the dominant role of cement and shows how the contribution of each feature shifts from instance to instance, offering a detailed view of feature importance in predicting concrete compressive strength.

SHAP interaction plots analysis

The SHAP interaction plots in Fig. 13 provide insight into how the input variables interact to influence the output. In Fig. 13a, the relationship between cement and water shows that cement has a positive effect extending up to around 600 kg/m³, which is modulated by water content. Figure 13b illustrates that fine aggregate positively affects strength between 400 and 600 kg/m³, particularly when paired with higher SCM levels; beyond 700 kg/m³, the impact fluctuates, indicating limited benefit at higher amounts. Figure 13c reveals that the SHAP value remains relatively stable as SCM increases from 0 to 50 kg/m³, with an optimal range around 30 kg/m³, where its influence on the model output is consistent. Water content interacts subtly with SCM, with higher water levels slightly altering its effect on the prediction. Figure 13d shows the SHAP value of coarse aggregate over the range 200–1600 kg/m³: it increases steadily up to the most favorable range of 800–1200 kg/m³, where coarse aggregate strongly influences the model output, and further increases beyond this range reduce the predicted strength.

In contrast, Fig. 13e presents a more variable pattern: the impact of natural fiber fluctuates as its content increases from 0 to 3 kg/m³. Nevertheless, its positive influence becomes more pronounced when paired with higher cement levels, with the ideal range falling between 1 and 2 kg/m³. Figure 13f shows a non-linear relationship between curing period and SHAP value: after an initial increase, the effect stabilizes, with the optimal curing period identified as 5 to 7 days, especially when natural fiber levels are high. Lastly, Fig. 13g shows a clear interaction between water and fine aggregate. Their combined influence is most evident between 150 and 250 kg/m³ of water; in this range, higher fine aggregate levels tend to reduce the positive effect of water, while lower levels allow it to contribute more strongly. Beyond this range, the effect of water becomes more consistent and less sensitive to changes in fine aggregate.
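Interaction effects like those described for Fig. 13 can be quantified with SHAP interaction values, which apply the Shapley interaction index to feature pairs. The text does not state whether Fig. 13 shows interaction values or dependence plots colored by a second feature, so the sketch below only illustrates the underlying idea. It enumerates coalitions against a single hypothetical baseline instance, whereas explainers in the `shap` package use a different formulation, so the numbers would not match exactly; the toy model is hypothetical.

```python
from itertools import combinations
from math import factorial


def shap_interaction(f, x, baseline, i, j):
    """Shapley interaction index between features i and j (i != j) at instance x.

    Follows the SHAP interaction-value definition, which splits each pairwise
    effect equally between (i, j) and (j, i), so the total i-j interaction is
    twice the returned value. Absent features take values from `baseline`.
    """
    m = len(x)

    def value(coalition):
        return f([x[k] if k in coalition else baseline[k] for k in range(m)])

    rest = [k for k in range(m) if k not in (i, j)]
    total = 0.0
    for size in range(len(rest) + 1):
        weight = factorial(size) * factorial(m - size - 2) / (2 * factorial(m - 1))
        for subset in combinations(rest, size):
            s = set(subset)
            # second-order difference: joint effect of i and j beyond their separate effects
            delta = value(s | {i, j}) - value(s | {i}) - value(s | {j}) + value(s)
            total += weight * delta
    return total


# Hypothetical toy model: cement (0), water (1), natural fiber (2) with a cement-fiber term
def toy_strength(z):
    cement, water, fiber = z
    return 10 + 0.08 * cement - 0.10 * water + 0.01 * cement * fiber


x, baseline = [420, 175, 1.5], [350, 190, 1.0]
phi_cf = shap_interaction(toy_strength, x, baseline, 0, 2)
phi_cw = shap_interaction(toy_strength, x, baseline, 0, 1)
print(phi_cf, phi_cw)  # the cement-water value is zero up to rounding: no such term in the model
```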

PDP analysis

Figure 14 depicts the influence of each input variable on the compressive strength of concrete. As shown in Fig. 14a, cement content has a strong, predominantly positive influence on CS: CS increases steadily from approximately 25 MPa to over 42 MPa as cement increases from 200 to 450 kg/m³, an improvement of around 68% that emphasizes cement’s critical role in strength development. Figure 14b shows a more complex, generally negative trend: as fine aggregate content increases from 150 to around 1050 kg/m³, CS drops from about 34.5 MPa to nearly 32.5 MPa, indicating that excessive fine aggregate reduces strength by affecting packing density or increasing water demand. Figure 14c shows that CS increases gradually from around 29 MPa to 36 MPa as coarse aggregate rises from 200 to approximately 1000 kg/m³, although an excessive amount may disrupt the homogeneity and workability of the concrete. Figure 14d shows a negative relationship between water content and CS: as water increases from around 100 to 200 kg/m³, CS declines from approximately 35 MPa to 31 MPa. Beyond 200 kg/m³, strength fluctuates, suggesting that the weakening effect of excess water reaches a saturation point. Figure 14e shows that CS increases at fiber contents of 0.1–0.2 kg/m³, peaking near 34 MPa, and then declines steadily to around 30.5 MPa at 1.5 kg/m³, remaining flat thereafter. This suggests that excessive fiber content may impair mix uniformity and compaction, reducing strength. In contrast, Fig. 14f shows a positive influence of SCM on CS: strength rises from approximately 35.5 MPa to 36.5 MPa as SCM content increases to around 20–30 kg/m³ and then remains almost constant. The effect of curing time on CS is shown in Fig. 14g: CS increases sharply from 28.5 MPa to 32.8 MPa as curing extends from 15 to 28 days and then rises slowly to around 34.3 MPa at 90 days.
This trend indicates that, while early curing is critical, extended curing still contributes to long-term strength. The mix recommendations derived from the PDP analysis are summarized in Table 6.
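Each PDP curve in Fig. 14 averages the model’s predictions while one feature is swept over a grid and all other features keep their observed values. The sketch below shows that computation for any prediction function; libraries such as scikit-learn expose the same idea as `sklearn.inspection.partial_dependence` (its `method="brute"` option). The stand-in model and mixes are hypothetical, not the trained LGBM or the study’s data. The helper `percent_change` reproduces the roughly 68% gain quoted for Fig. 14a.

```python
def partial_dependence(predict, rows, feature, grid):
    """One-way partial dependence of a fitted model on a single feature.

    For each grid value v, the chosen feature is set to v in every row while all
    other features keep their observed values, and the predictions are averaged.
    `predict` maps one row (a list of feature values) to a predicted CS in MPa.
    """
    curve = []
    for v in grid:
        total = 0.0
        for row in rows:
            modified = list(row)
            modified[feature] = v
            total += predict(modified)
        curve.append(total / len(rows))
    return curve


def percent_change(start, end):
    """Relative change in percent, e.g. between two points of a PDP curve."""
    return 100.0 * (end - start) / start


# Hypothetical stand-in for the trained model: cement (feature 0) and water (feature 1)
def toy_model(row):
    cement, water = row
    return 0.1 * cement - 0.1 * water + 20


mixes = [[300, 180], [380, 170], [450, 200]]  # hypothetical observed mixes
print(partial_dependence(toy_model, mixes, feature=0, grid=[200, 300, 400, 450]))
print(percent_change(25, 42))  # 68.0, the ~68 % gain read from Fig. 14a
```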

This parametric optimization approach (SHAP and PDP) allows materials to be selected and their dosages adjusted on a scientific basis rather than by trial and error. For example, the PDP-derived optimum contents of cement, aggregates, natural fibers, and SCM can guide field engineers in designing mixes that achieve maximum strength while remaining sustainable. Furthermore, the SHAP results confirm the relevance of parameters such as curing period and water content from a materials-science perspective by relating them to hydration kinetics and microstructure formation.
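To turn a PDP curve into a recommended dosage range, one simple rule is to keep the contiguous grid values whose partial dependence stays within a small tolerance of the peak. The sketch below implements that rule; the 1% tolerance and the curve values are hypothetical, and the criterion actually used to build Table 6 is not stated in the text.

```python
def near_optimal_range(grid, curve, tolerance=0.01):
    """Contiguous grid range around the PDP peak whose values stay within
    `tolerance` (relative) of the peak. Returns (low, high, peak_value).

    Assumes positive curve values, such as predicted CS in MPa.
    """
    peak_idx = max(range(len(curve)), key=curve.__getitem__)
    threshold = curve[peak_idx] * (1.0 - tolerance)
    lo = hi = peak_idx
    # widen the window while neighbouring grid points remain above the threshold
    while lo > 0 and curve[lo - 1] >= threshold:
        lo -= 1
    while hi < len(curve) - 1 and curve[hi + 1] >= threshold:
        hi += 1
    return grid[lo], grid[hi], curve[peak_idx]


# Hypothetical SCM grid (kg/m³) and PDP values (MPa), shaped like a rise-then-plateau curve
scm_grid = [0, 10, 20, 30, 40, 50]
scm_pdp = [35.5, 35.9, 36.4, 36.5, 36.5, 36.4]
print(near_optimal_range(scm_grid, scm_pdp))
```

Returning a contiguous window rather than every point above the threshold avoids merging two separate peaks into one misleading range.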

Table 6 Optimized values of all mix materials based on the PDP analysis.