Ensemble machine learning models for predicting strength of concrete with foundry sand and coal bottom ash as fine aggregate replacements

Machine Learning


Dataset description

With an emphasis on combinations that partially substitute natural fine aggregates with variable amounts of bottom ash sand and foundry sand, the data includes concrete mix designs, curing period and the accompanying compressive strength values. The mix proportions reported in the previously published studies were adopted, and the corresponding compressive strength values at different curing periods were considered, resulting in a total of 172 data entries 27,28,52,53,54,55. The dataset comprises a broad spectrum of concrete mixtures, reflecting diverse combinations of constituent materials, mix proportions, and curing conditions. With its own set of input parameters and corresponding compressive strength as the output variable, each record relates to a unique concrete mix design. The step-by-step methodology used in the current study is shown in Fig. 1.

Fig. 1
figure 1

The step-by-step methodology used in the current study.

The input features selected for model development based on their documented influence in prior studies and practical importance to concrete performance are detailed in Table 2. In particular, these are the contentious parameters of the study: cement content, fine aggregate (FA), coarse aggregate (CA), FS, and CBA contents, water content, water-cement (w/c) ratio, superplasticizer (SP), and curing duration. The output variable, compressive strength (CS), quantifies the concrete’s capacity to resist axial loading at a specified age.

Table 2 Description of input and output parameters used in the current study.

Structurally, the dataset is organized in a tabular format, where each row represents a unique concrete mix and its corresponding CS measurement. All input variables are continuous numerical features, as is the output variable CS, which is expressed in megapascals (MPa). Each parameter in the dataset spans a large range of values, offering enough variety to efficiently train and verify machine learning models.

The selection of input parameters for model development was guided by two primary considerations, i.e. their physical relevance and established influence on concrete’s compressive strength based on prior experimental studies, and their availability and consistency across published datasets used to build the current data corpus. Cement content, water content, and w/c ratio were included due to their direct control over hydration, workability, and binder strength development. Fine and coarse aggregates represent the inert skeleton of concrete, influencing its packing density and load transfer characteristics. FS and CBA were specifically included as input features because they are the key sustainable replacement materials investigated in this study. Their replacement ratios significantly impact the microstructure and mechanical properties of concrete, particularly in terms of strength gain and porosity control. Superplasticizer dosage was considered because it affects the flowability and compaction of mixes with high replacement content or fine particles like FS. Curing duration was included due to its known role in influencing hydration progression and long-term strength development. All selected parameters are continuous variables and are commonly reported in concrete mix design literature, ensuring both practical relevance and compatibility with machine learning workflows.

Statistical description

The dataset utilised for ML summary statistics, such as mean, standard deviation, median, variance, skewness, kurtosis, and interquartile range, is shown in Table 3. With a standard deviation of 49.91 and a mean cement content of 396.78 kg/m3, there is significant variation in the cement proportions among the various mix proportions that are employed. A few samples may have less cement than the average, according to the modest negative skewness of − 0.29. The distribution of FA is symmetric, with a minor positive skewness of 0.25, showing that few samples have FA greater than the mean FA content. The mean of the FA was 518.53 kg/m3. The mean CA content is 1101.38 kg/m3, and the standard deviation of 169.30 kg/m3. The CA also has a small negative skewness of 1.46, but has the second highest variance in the dataset after the variance of FA content in the dataset. The average percentage of FS used to partially replace natural sand is 10.87%, with the dataset showing a slight positive skewness of 1.16. For CBA, the average replacement level is 9.7%, and the dataset exhibits a positive skewness of 2.95. The mean water content in the dataset is 196.91 kg/m3, with a relatively small skewness of 0.95. The w/c ratio’s volatility is quite low, indicating that the majority of concrete mixes maintain a constant w/c ratio. The w/c ratio has a standard deviation of 0.06 with a mean of 0.48, and its distribution shows a positive skewness of 0.43. Similarly, SP exhibits a small positive skewness of 1.65 and kurtosis of 1.97, which shows that certain samples have noticeably higher SP dosages.

Table 3 Descriptive statistics of input and output parameters.

This implies that mixes were designed with improved workability. The dataset is substantially skewed towards lower concrete ages, as indicated by the concrete age’s skewness of 1.57 and kurtosis of 1.28. The majority of the samples were evaluated at a moderate age of hydration and later at a well-hydrated age, as the maximum age is 365 days and the median age is 56 days. The output variable, CS shows a broad range of 13.45–63.3 MPa, with a mean of 35.47 MPa and a standard deviation of 10.77 MPa. The variability in CS across various mix designs is further demonstrated by the interquartile range of 14.74 MPa. The skewness values indicate that variables like FS and CBA exhibit moderate positive skewness, with long right tails, implying that higher substitution percentages are less frequently reported in literature-derived mixes. This asymmetry, along with kurtosis values slightly exceeding, suggests the presence of peaked distributions with occasional outliers. To ensure dataset integrity, the 172 mix records were collected from different peer-reviewed studies and checked for duplication based on composition, property values, and publication metadata. Any overlapping or repeated entries were removed. Despite the natural imbalance in the range of replacement levels (especially for FS and CBA), ensemble models are robust to such irregularities due to their internal handling of feature heterogeneity. These characteristics were further considered during model tuning and validation.

Although the dataset of 172 mixes was carefully curated from multiple peer-reviewed studies and duplicates were removed, certain limitations in representativeness must be acknowledged. The replacement levels of FS and CBA are unevenly distributed, with most mixes clustered below 20% substitution. Higher replacement levels (≥ 50%) are comparatively rare, which may constrain the generalizability of predictions for extreme substitution scenarios. Similarly, curing conditions are heterogeneous across studies, with a majority of mixes tested at standard ages (28–90 days) but fewer at long-term curing (≥ 180 days). Variability in material properties such as particle gradation, residual carbon in CBA, or binder coatings in FS also introduces heterogeneity, as these characteristics differ by source and processing methods. While ensemble ML models are robust to such imbalances and variability, these dataset characteristics highlight the importance of cautious interpretation when applying the trained models to untested ranges or region-specific materials. This transparency ensures that the model’s strengths are clear while also delineating the boundaries of its applicability.

Correlation analysis & data visualization

All variables’ histogram distributions are shown in Fig. 2, and underlying trends are revealed by superimposing Kernel Density Estimates on top. While variables like FS, CBA, SP, and curing days show significant concentrations close to zero, the distributions of cement, w/c ratio, and CS, for instance, show unimodal patterns, indicating that a sizable portion of samples lack these components. To explore the relationships among the variables, both Pearson and Spearman correlation coefficients are computed, as defined in Eqs. (1) and (2), respectively. Spearman correlation is a non-parametric metric that captures monotonic associations and is more resilient to outliers and non-linear trends than Pearson correlation, which gauges the strength of linear relationships between two continuous variables.

$$r = \frac{{\sum\nolimits_{i = 1}^{n} {\left( {x_{i} – \overline{x}} \right)\left( {y_{i} – \overline{y}} \right)} }}{{\sqrt {\sum\nolimits_{i = 1}^{n} {\left( {x_{i} – \overline{x}} \right)^{2} } \sum\nolimits_{i = 1}^{n} {\left( {y_{i} – \overline{y}} \right)^{2} } } }}$$

(1)

Fig. 2
figure 2

Histogram distribution of all the input and output parameters.

$$\mu =1-\frac{6\sum_{i=1}^{n}{d}_{i}^{2}}{n({n}^{2}-1)}$$

(2)

Figure 3 presents heatmaps illustrating both Pearson and Spearman correlation coefficients. In Fig. 3 (Left), the Pearson correlation analysis reveals that curing days exhibit the strongest positive correlation with CS (r = 0.45), followed by SP with a correlation of 0.21. On the other hand, a negative correlation of − 0.23 between CS and the w/c ratio suggests that greater w/c ratios typically result in lower CS, most likely as a result of bleeding and the interfacial transition zone being weaker. The Spearman correlation heatmap in Fig. 1 (Right) displays similar trends while also capturing non-linear, monotonic relationships, offering deeper insight into rank-based associations within the dataset.

Fig. 3
figure 3

Pearson correlation (Left) & Spearman correlation (Right).

Overall, the combined use of visualizations and correlation analyses provides a meaningful understanding of the data. The strong positive correlations observed for SP and curing duration underscore their significant role in enhancing strength development. In contrast, the negative correlation with water content highlights the adverse effects of excess water on concrete performance.

Data preprocessing

Treatment and outlier detection

The interquartile range (IQR) approach is used to find data points that significantly depart from the majority. It does this by using the difference between the first quartile (Q1) and the third quartile (Q3). Equation (3) provides the definition of IQR, and Eqs. (4) and (5) define the typical cutoffs for outlier detection.

$${\text{IQR}} = {\text{Q3}} – {\text{Q1}}$$

(3)

$${\text{Lower}}\;{\text{Bound}} = {\text{Q1}} – {1}.{5} \times {\text{IQR}}$$

(4)

$${\text{Upper}}\;{\text{Bound}} = {\text{Q3}} + {1}.{5} \times {\text{IQR}}$$

(5)

Any data point falling outside the Lower Bound, Upper Bound is considered an outlier. To maintain the central tendency of the distribution and lessen significant distortions brought on by extreme values, these values are swapped out for the median of the associated feature.

Feature scaling and normalization

Preprocessing data is essential for improving predictive models’ resilience and performance. It guarantees that problems like inconsistent feature representations, outliers, and missing values are methodically fixed. The dataset is loaded into the analytical environment to start the preparation step. The target variable, CS, is isolated from the input parameters. To enable the model to learn from the majority of the data while being validated on previously unseen cases, the data is split into training and testing subsets using an 80:20 ratio, consisting of a total of 172 samples, with 138 samples allocated for training and 34 samples for testing. Feature scaling is a crucial step in getting data ready for deep learning regression models. Using the standard scaler technique, input features are adjusted to have a mean of zero and a standard deviation of one in order to prevent bias during training brought on by different feature magnitudes. Alternatively, normalization can be applied using Eq. (6), where each feature value is scaled to a [0, 1] range.

$${x}_{norm}= \frac{x-\text{min}(x)}{\text{max}\left(x\right)-\text{min}(x)}$$

(6)

where x is the original value of the feature, min(x) and max(x) denote the feature’s minimum and maximum value, respectively.

Train-test split strategy

To ensure robust model evaluation and minimize overfitting, the dataset was partitioned using an 80:20 train-test split strategy. With this method, the dataset is divided at random so that 80% of it is utilised to train the ML models and 20% is set aside for evaluating how well they perform on hypothetical cases. In addition to ensuring that the models learn from a significant amount of data, this division offers an objective evaluation of the models’ capacity for generalization.

During the splitting process, the target variable CS was separated from the input features to maintain data integrity. To preserve the statistical distribution of the dataset, the splitting was performed using stratified sampling where necessary. This technique maintains the proportion of various input parameters across both the training and testing subsets, ensuring representativeness. The train-test split forms the basis for further steps such as feature scaling, hyperparameter tuning, and cross-validation, ultimately supporting the development of reliable and generalizable predictive models.

ML algorithm considered

The selection of ML algorithms in this study was based on their proven effectiveness in handling nonlinear, multivariable problems in material and structural engineering applications. Traditional models like Linear Regression (LR), Decision Tree Regressor (DTR), and Support Vector Regressor (SVR) were included as baseline models due to their interpretability and widespread use in regression analysis. However, these models often struggle to capture the complex, nonlinear interactions present in concrete mixtures with multiple variable replacements, such as FS and CBA. To address this limitation, ensemble methods such as Random Forest (RF), Gradient Boosting (GBM), and Extreme Gradient Boosting (XGBoost) were selected for their superior ability to model complex relationships through bagging and boosting strategies. Ensemble learners reduce overfitting, improve generalization, and have shown high predictive accuracy in previous studies related to concrete strength, asphalt mix design, and engineered cementitious composites. XGBoost, in particular, offers efficient training, regularization capabilities, and built-in feature importance analysis, making it a robust choice for this study. Additionally, models like Gaussian Process Regressor (GPR), Kernel Ridge (KR), and Multi-Layer Perceptron (MLP) were included to explore probabilistic learning, nonlinear kernel effects, and neural network architectures, respectively, ensuring a comprehensive model comparison.

Linear regression

By taking a linear approach, LR models the connection between input data and the target variable. Consider a dataset with n samples and p features, where each sample is represented as xi = (xi1,xi2,…,xip), and the corresponding target value is yi. The LR model is expressed as given in Eq. (7). Despite its simplicity and interpretability, linear regression often fails to capture complex nonlinear relationships. However, it is frequently used as a standard by which to measure the effectiveness of more complex models. The model parameters are estimated by minimizing the sum of squared residuals expressed as given in Eq. (8).

$$\widehat{{y}_{i}}= {\beta }_{0}+ {\beta }_{1}{x}_{i1}+ {\beta }_{2}{x}_{i2}+\dots + {\beta }_{p}{x}_{ip}$$

(7)

Here, \({\beta }_{0}\) denotes the intercept, and \({\beta }_{j}\) (for j = 1, 2, …., p) are the coefficients associated with each feature.

$$\underset{{\beta }_{0, }{\beta }_{1}, \dots , {\beta }_{p}}{\text{min}}\sum_{i=1}^{n}\left({{y}_{i}-{\widehat{y}}_{i})}^{2}\right.$$

(8)

Decision tree regressor

The DTR constructs a tree-based structure, where each internal node represents a decision rule based on a specific feature, and each leaf node contains a predicted output value. The procedure usually employs a criterion, like variance reduction or mean squared error minimisation, to identify the best splits. At any given node, which contains a subset of the data D, the best split is chosen to maximize the improvement in prediction accuracy using Eq. (9).

$$\sum_{{x}_{i}\in {D}_{left}}({y}_{i}-{\overline{y} }_{left}{)}^{2}+\sum_{{x}_{i}\in {D}_{right}}({y}_{i}-{\overline{y} }_{right}{)}^{2}$$

(9)

Here, \({D}_{left}\) and \({D}_{right}\) represent the two partitions of dataset D, created based on whether the feature \({x}_{j}\) satisfies the condition \({x}_{j}\le t\) or \({x}_{j}\ge t\), respectively. While decision trees offer easily interpretable, rule-based structures, if hyperparameters like the minimum number of samples needed at a leaf node or the maximum tree depth are not appropriately set, they are prone to overfitting.

Random forest

A bootstrap sample of the original dataset and a randomly chosen subset of features are used to train each decision tree in the RF ensemble learning technique. This randomized training process reduces the correlation among individual trees, thereby enhancing prediction accuracy and increasing robustness to noise. The final production is obtained by averaging the output of all trees using Eq. (10).

$$\overline{y }= \frac{1}{B}\sum_{b=1}^{B}{\widehat{y}}^{(b)}$$

(10)

where B denotes the total number of trees, and \({\widehat{y}}^{(b)}\) is the prediction made by the b-th tree. The bagging technique effectively lowers the variance and helps prevent overfitting, often outperforming a single decision tree.

Gradient boosting machines

The GBM uses error correction to gradually improve the model’s predictions, creating an additive ensemble of weak learners, often shallow decision trees.

The final predictive model can be expressed as Eq. (11).

$$F_{M} \left( x \right) = \mathop \sum \limits_{m = 1}^{M} \upeta h_{m} \left( x \right)$$

(11)

where hm(x) is the weak learner at iteration m, η is the learning rate that controls the contribution of each learner, and M is the total number of iterations. According to Eq. (12), the new learner is trained to fit the loss function’s negative gradient with regard to the prior prediction at each stage.

$${r}_{im}= \frac{\delta L({y}_{i}, F\left({x}_{i}\right))}{\delta F({x}_{i})}$$

(12)

Extreme gradient boosting

The XGBoost is an advanced implementation of the gradient boosting algorithm, designed for improved performance and efficiency. It enhances the training process by incorporating second-order derivatives (i.e., both gradients and Hessians) for more accurate optimization. Additionally, XGBoost introduces regularization terms to explicitly control model complexity, helping to prevent overfitting. A scoring function is added to the model. This regularization mechanism helps reduce overfitting by discouraging the formation of excessively complex trees. Furthermore, XGBoost is designed for high computational efficiency, leveraging parallel processing and optimized memory usage to achieve faster training compared to many other gradient boosting implementations.

Support vector regressor

With kernel functions, the SVR works effectively for managing non-linear interactions. SVR maintains a balance between model correctness and complexity while minimizing the error within a given margin. The mathematical formulation is given in Eq. (13).

$$y = \sum\limits_{{i = 1}}^{n} {\left( {\alpha _{i} – \alpha _{i}^{*} } \right)} K\left( {X_{i} ,X} \right) + b$$

(13)

where K \(\left({X}_{i},X\right)\) is the kernel function (radial basis function, polynomial), \({\alpha }_{i}\), \({\alpha }_{i}^{*}\) are Lagrange multipliers, \(b\) is the bias term. SVR effectively captures non-linear patterns but requires careful tuning of hyperparameters.

Gaussian process regressor

A non-parametric, Bayesian method of regression known as the GPR establishes a distribution over potential functions that could fit the data. GPR uses a probabilistic framework to describe data and assess prediction uncertainty, in contrast to parametric models that assume a fixed shape. Any finite set of function values has a joint Gaussian distribution, and it is assumed that the goal values are samples from a multivariate normal distribution. The covariance (or kernel) function k(x,x′), which expresses similarity between data points, serves as the basis for the prediction for a new input x  . Common kernels include the Radial Basis Function, Matern, and Rational Quadratic. The GPR prediction consists of both a mean and a variance, offering a measure of confidence for each prediction. Because GPR can simulate complicated, nonlinear connections with defined uncertainty, it is especially helpful for small to moderate datasets. Large datasets may be limited by their computational complexity, which scales cubically with the number of training samples. Equation (14) provides the posterior predictive distribution for a test point.

$$\widehat{{y}_{*}}= N({\mu }_{*}, {\sigma }_{*}^{2})$$

(14)

where the mean μ  and variance \({\sigma }_{*}^{2}\) depend on the training data, the kernel, and the observed noise.

Kernel ridge

In order to model nonlinear interactions between inputs and outputs, KR integrates the ideas of kernel techniques with ridge regression. While ridge regression applies L2 regularization to linear models to prevent overfitting by employing a kernel function to transfer the input data into a high-dimensional space, KR expands on this capability and makes it possible to identify intricate, nonlinear patterns. KR is computationally efficient for moderately sized datasets since it offers a closed-form solution in dual space. However, like other kernel methods, its performance may degrade on very large datasets due to the need to compute and invert large kernel matrices. The prediction function in KR is expressed as Eq. (15).

$$\widehat{y}= \sum_{i=1}^{n}{\alpha }_{i}k(x, {x}_{i})$$

(15)

where αi are the dual coefficients, k(x,xi ) is the kernel function evaluating similarity between the test input x and each training instance xi.

Multi-Layer perceptron regressor

One kind of feedforward neural network is the MLP Regressor, which consists of an input layer, one or more hidden layers, and an output layer. Each layer consists of many neurons, each of which applies a nonlinear activation function σ() and calculates a weighted sum of its inputs. Equation (16) provides the prediction for an MLP with a single hidden layer. Backpropagation and a gradient-based method, such as stochastic gradient descent, are used to optimize the model parameters to minimize the loss function, which is typically the mean squared error (MSE) in regression tasks.

$$\widehat{y}= {v}^{T}\sigma \left(Wx+b\right)+c$$

(16)

Here, W is the weight matrix of the hidden layer, b is the bias vector, σ is the activation function (typically ReLU), v is the output weight vector, and c is the output bias term.

Modal training, hyperparameter tuning, and cross validation

As explained in previous sections, the preprocessed dataset was separated into subsets for testing and training. The training set was used to train each model initially, and Bayesian optimization was used to adjust the hyperparameters. A final training run was carried out to produce the final version of each model after the ideal hyperparameters had been determined. The scikit-learn library in Python was used to create each model. For specific algorithms, such as Extreme Gradient Boosting, dedicated libraries like XGBoost were employed to leverage their optimized implementations. Predetermined tolerance limits for iterative solvers like SVR and a set number of maximum epochs for the MLP model were used to control convergence.

The optimization of hyperparameters plays a crucial role in adapting each machine learning model to the specific characteristics of the dataset, particularly considering the limited sample size and the non-linear relationships between inputs and outputs. In this study, a two-step hyperparameter tuning approach was employed. First, Grid Search was used as the primary method, conducting an exhaustive search across predefined parameter ranges, with each combination evaluated through fivefold cross-validation to ensure robust performance and minimize overfitting. Second, the search space was strategically defined based on domain expertise and prior studies in materials informatics and construction-related regression tasks, which helped narrow down the most promising parameter ranges and reduce computational effort. The final hyperparameter settings, presented in Table 4, were carefully selected to balance model complexity, generalization ability, and predictive accuracy. For instance, the maximum depth of the Random Forest model was constrained to avoid overfitting, while the learning rates in boosting algorithms were fine-tuned to promote stable and gradual learning.

Table 4 Hyperparameters configuration for machine learning models.

Five-fold cross-validation was used to assess the model’s capacity for generalization. There were five roughly equal subsets (folds) created from the dataset. One-fold was utilized as the validation set and the other four as the training set for every iteration. To guarantee that every data point was used precisely once for validation, this procedure was carried out five times. The mean squared error across all folds was computed as shown in Eq. (17). To give a more thorough assessment of model performance, additional performance metrics were computed and averaged across the folds in addition to mean squared error. These metrics included the coefficient of determination, mean absolute error, mean absolute percentage error, and root mean squared error.

$$\text{CV-Metric}= \frac{1}{K}\sum\limits_{{k = 1}}^{K} {{\text{Metric}}^{{\left( k \right)}} } \quad {\text{where}}\;K = 5$$

(17)

Evaluation metrics

Mean absolute error

Mean Absolute Error (MAE) is the average of the absolute differences between the predicted and actual values. Without taking into account the direction of the errors, it provides an estimate of the average size of the errors in a set of forecasts. MAE is a simple metric that shows how well predictions match actual data and is easy to understand. It was determined using Eq. (18).

$${\text{MAE}} = \frac{1}{k}\sum\limits_{{i = 1}}^{k} {\left| {x_{i} – y_{i} } \right|}$$

(18)

Root mean square error

Root Mean Square Error (RMSE) measures the square root of the average of the squared differences between predicted and actual values. It is susceptible to outliers because it assigns greater weight to huge errors. When significant errors are undesired and require harsher penalties, RMSE is very helpful. It is determined using Eq. (19).

$$\text{RMSE}=\sqrt{\frac{1}{k}\sum_{i=1}^{k}\left({\left({x}_{i}-{y}_{i}\right)}^{2}\right)}$$

(19)

Coefficient of correlation

Coefficient of correlation (R2) represents the proportion of variance in the target variable that is explained by the model. It has a range of 0–1, where 1 denotes a perfect fit and 0 denotes no variation explained by the model. A popular statistic for assessing the goodness of fit in regression tasks is R2. It is determined using Eq. (20).

$${\text{R}}^{2} = 1 – \frac{SSR}{{SST}}$$

(20)

Mean absolute percentage error

Mean Absolute Percentage Error (MAPE) measures the average percentage error between predicted and actual values, making it intuitive and easy to interpret in percentage terms. Lower MAPE indicates better model accuracy. However, it is sensitive when actual values are close to zero. It is determined using Eq. (21).

$$MAPE= \frac{100\%}{n}\sum_{i=1}^{n}\frac{{y}_{i}-\widehat{{y}_{i}}}{{y}_{i}}$$

(21)

Coefficient of variation of RMSE

Coefficient of Variation of RMSE (CVRMSE) expresses RMSE as a percentage of the mean actual value, allowing comparison across datasets with different scales determined using Eq. (22). More prediction accuracy in relation to the data scale is shown by a lower CVRMSE. Useful for comparing model performance across different datasets.

$${CV}_{RMSE}= \frac{RMSE}{\overline{y} }*100\%$$

(22)



Source link