Machine learning and interactive GUI for concrete compressive strength prediction

Machine Learning


This research builds upon prior work by employing machine learning (ML) models to predict the compressive strength (CS) of concrete across a broad range of values, from 2.33 to 82.60 MPa. The main objective was to evaluate the efficacy of different ML models for predicting the CS of concrete. Figure 1 shows the flowchart of the methodological approach used in this study. Initially, a database of 1030 data points is collected, covering mix components such as cement and aggregates, and their properties are analyzed through histograms and heatmaps. Then, two types of predictive models are applied: non-ensemble models and ensemble models. The models' performance is evaluated by comparing actual and predicted values, using metrics such as R² and RMSE, and through k-fold cross-validation. Sensitivity analysis is conducted, and the results are benchmarked against previous studies to identify the best predictive model. This approach aims to help researchers gauge the effect of different variables on the prediction of CS in a more time-efficient and cost-effective manner than extensive experimental studies.

Figure 1

Database collection

To develop a model for predicting outcomes and to analyze the data statistically, researchers can use data from laboratory experiments or gather information from previously published studies. For this research, a substantial dataset consisting of 1030 data points related to the CS of concrete was assembled by reviewing past scholarly articles: Song et al.19, Song et al.37, and Yeh57. The data analysis of the study focused on eight principal attributes, which were used as the input variables: cement (C), blast furnace slag (Slag), fly ash (FA), water (W), superplasticizer (SP), coarse aggregate (Cagg), fine aggregate (Fagg), and the number of days of curing (Age). In the figures and discussion that follow, these inputs are denoted X1 to X8 in the order listed, and the compressive strength, the outcome variable, is denoted Y. Table 2 provides a concise statistical description of the collected data. Each row refers to a distinct variable, while the columns contain specific statistical measures for these variables.

Table 2 Descriptive statistics for the collected database.

Furthermore, the frequency distribution of the dataset is visually represented in Fig. 2 through histogram plots. These plots are invaluable for understanding the distribution patterns of each variable, such as normality, skewness, and the presence of outliers, which align with the statistics presented in Table 2. The x-axis represents each variable, while the y-axis indicates the frequency of occurrences. This visualization enables a thorough assessment of these variables. The general observations include:

  • Several variables (i.e., X2, X3, X5, and X8) show strong positive skewness, indicating a higher concentration of lower values and fewer higher values.

  • The X4, X6, X7, and Y variables display more balanced distributions with central tendencies.

  • Outliers are more prominent in features with positive skewness, where higher values occur less frequently.

Figure 2

Histograms of input variables.

Correlation analysis

Examining the correlation between variables is crucial for understanding the connections between the input features and the target strength, since this analysis informs the choice of the most effective prediction model. The most widely used measure for this purpose is the Pearson correlation coefficient (r)58,59. It is calculated as the ratio of the covariance (cov) of two variables (x, y) to the product of their standard deviations, as represented in Eq. (1).

$$r = \frac{\operatorname{cov}(x, y)}{\sigma_{x} \sigma_{y}} = \frac{\sum_{i=1}^{n} \left( x_{i} - \overline{x} \right)\left( y_{i} - \overline{y} \right)}{\sqrt{\sum_{i=1}^{n} \left( x_{i} - \overline{x} \right)^{2}} \sqrt{\sum_{i=1}^{n} \left( y_{i} - \overline{y} \right)^{2}}}$$

(1)

where \(\overline{x}\) and \(\overline{y}\) are the means of the two variables x and y, and n is the number of data points.
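As an illustration of Eq. (1), the following minimal sketch computes r directly from the sums of deviations. The function name `pearson_r` and the sample values are hypothetical and are not taken from the study's database.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient computed term by term as in Eq. (1)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Numerator: sum of the products of deviations from the two means.
    numerator = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Denominator: product of the root sums of squared deviations.
    denominator = math.sqrt(sum((xi - mean_x) ** 2 for xi in x)) * math.sqrt(
        sum((yi - mean_y) ** 2 for yi in y)
    )
    if denominator == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return numerator / denominator


# Made-up illustrative values (cement content vs. strength), not real data.
cement = [150.0, 250.0, 350.0, 450.0, 540.0]
strength = [18.0, 30.0, 41.0, 55.0, 62.0]
r = pearson_r(cement, strength)
print(round(r, 3))
```

Applying such a function to every pair of columns would yield the kind of matrix that a correlation heatmap displays.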

Figure 3 presents a heatmap of the pairwise correlations between all variables. Notably, the strongest positive correlations with Y are observed for X1, X5, and X8, with r-values of 0.50, 0.37, and 0.33, respectively. This indicates that the CS of concrete is most strongly associated with cement content, followed by superplasticizer, and finally by the number of days of curing. Conversely, the strongest negative correlation, between X4 and X5 (−0.66), suggests an inverse relationship between water and superplasticizer. There is also a negative correlation between water and the CS of concrete, with an r-value of −0.29. The remaining variables show only weak correlations with the CS and with each other, indicating that they do not have strong linear relationships. The absence of uncorrelated features implies that all eight input parameters are relevant and can be effectively employed in predicting the CS of concrete.

Figure 3

Pearson correlation of input and output variables.

Figure 4 presents a scatterplot matrix that provides a comprehensive visual analysis of the eight input variables and their relationship with the output variable (Y). The matrix includes histograms on the diagonal, illustrating the distribution of each variable individually, as previously discussed. The off-diagonal cells contain scatter plots that show the pairwise relationships between variables. The input parameters X1 and X5 exhibit a clear positive linear relationship with the output variable Y, meaning that Y tends to increase as X1 and X5 increase. In addition, the variables X3 and X5 exhibit a positive linear relationship, indicating that as the value of X3 increases, the value of X5 also tends to increase.

Figure 4

Scatter pair plots matrix with interaction variables.

A strong negative linear relationship is evident between the input parameters X3 and X4, indicating that as X3 increases, X4 tends to decrease. Furthermore, the correlation between X4 and X5 is highly negative, suggesting that higher values of X4 correspond to lower values of X5. In contrast, the X2 input exhibits weak or unclear linear associations with the majority of other variables, suggesting a low level of correlation. Similar to X2, the input X6 does not exhibit distinct linear patterns with the majority of the other variables. The input X7 shows a minor correlation with X5 but no significant linear associations with most other variables. The input X8 exhibits clear clusters when plotted against other input parameters, indicating the existence of sub-groups within the data. Finally, a subtle non-linear pattern can be observed between inputs X6 and X7; additional investigation may be necessary to understand the underlying relationship.

Data normalization

Some machine learning models may not function optimally when the input features differ in scale. As indicated in Table 2, the cement range lies between 102 and 540 kg/m3, while the range for superplasticizers is between 0.0 and 32.2%, highlighting the disparate magnitudes of different input features. To address this, data normalization or rescaling is employed, which adjusts all input variables to a uniform scale. This process uses the min–max mapping function, as outlined in Eq. (2).

$$X_{n} = \frac{X - X_{min}}{X_{max} - X_{min}}$$

(2)

In this equation, Xn represents the normalized data, Xmin and Xmax denote the minimum and maximum values of each input variable, and X refers to the original dataset undergoing rescaling. The primary benefit of data rescaling lies in its ability to expedite computations and enhance the accuracy and stability of the machine learning-based prediction model.
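The rescaling of Eq. (2) can be sketched as follows. The helper name `min_max_scale` and the handling of a constant feature are assumptions made for this illustration; only the range endpoints come from the text, and the middle values are made up.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the interval [0, 1] following Eq. (2)."""
    x_min, x_max = min(values), max(values)
    span = x_max - x_min
    if span == 0:
        # A constant feature has no spread; mapping it to 0.0 is a choice made here.
        return [0.0 for _ in values]
    return [(v - x_min) / span for v in values]


# Endpoints 102 and 540 are quoted in the text; 321 is a made-up middle value.
cement = [102.0, 321.0, 540.0]
# Endpoints 0.0 and 32.2 are quoted in the text; 8.05 is a made-up middle value.
superplasticizer = [0.0, 8.05, 32.2]

print(min_max_scale(cement))
print(min_max_scale(superplasticizer))
```

After rescaling, both features occupy the same [0, 1] interval regardless of their original magnitudes. In practice, the minimum and maximum are usually taken from the training data only and then reused for the test data.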

Non-ensemble models

In this research, six different non-ensemble models were utilized, namely Multiple Linear Regression (MLR), Multiple Nonlinear Regression (MNLR), Support Vector Regression (SVR), Gene Expression Programming (GEP), Artificial Neural Networks (ANN), and Adaptive Neuro-Fuzzy Inference System (ANFIS). These models were developed using the Python programming environment within the ANACONDA software, MATLAB, and SPSS programs. Concise explanations of each model are provided in the subsequent sub-sections.

MLR model

The MLR model is an extension of simple linear regression that predicts a single output variable from multiple input variables60,61,62. This method assumes a linear relationship between the inputs and the output. It is particularly valuable in situations where various factors influence the response variable, allowing for the assessment of the relative contribution of each predictor. The MLR model is straightforward, interpretable, and widely used in various fields for its ability to provide insights into relationships between variables. It is effective both for prediction and for understanding the underlying data structure.

MNLR model

Nonlinear models are straightforward, easy to understand, and effective for making predictions63. These models are versatile in terms of the range of average outcomes they can express. However, they might not be as adaptable as linear models when describing different data types. Nonetheless, if the nonlinear model is well-suited for a particular situation, it could be more efficient, use fewer parameters, and be simpler to understand. This clarity is often due to how parameters relate to significant, meaningful processes.

The process of using the MNLR model involves several steps: firstly, identifying the variable we want to predict; secondly, creating a nonlinear equation that represents how this variable is affected by other variables; thirdly, inputting initial guesses for the parameters of this equation, with the Levenberg–Marquardt method being the chosen technique for estimation; and finally, initiating the MNLR analysis to generate and review the results in the output log.

SVR model

The SVR model is an extension of Support Vector Machines (SVMs) used for regression problems64,65. SVR finds the best-fit hyperplane in a high-dimensional space that can predict continuous values, maintaining a balance between the complexity of the model and the amount of error tolerated. It is especially useful for datasets with many features and is known for its robustness against overfitting. This study uses the linear kernel, which models the target variable as a linear function of the input features (a hyperplane in the feature space). It therefore assumes that a change in the input features results in a proportional change in the predicted value.
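The two ideas in this description, a linear prediction function and a tolerated amount of error, can be sketched as below. The epsilon-insensitive loss is the standard SVR loss, but the symbol `epsilon`, the weights, the bias, and the inputs are all hypothetical values chosen for illustration, not settings from the study.

```python
def linear_svr_predict(weights, bias, features):
    """Prediction of a linear-kernel SVR: f(x) = w . x + b."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def epsilon_insensitive_loss(actual, predicted, epsilon):
    """Errors within +/- epsilon cost nothing; larger errors cost linearly."""
    return max(0.0, abs(actual - predicted) - epsilon)


# Hypothetical weights, bias, and inputs used only for illustration.
weights = [0.5, -0.2]
bias = 1.0
prediction = linear_svr_predict(weights, bias, [10.0, 5.0])

print(prediction)
print(epsilon_insensitive_loss(5.3, prediction, epsilon=0.5))  # inside the tube
print(epsilon_insensitive_loss(7.0, prediction, epsilon=0.5))  # outside the tube
```

Training an SVR amounts to choosing the weights and bias so that this loss stays small while the weights themselves stay small, which is the balance between error and complexity mentioned above.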

GEP model

The GEP model was developed to create computer programs and is similar to Genetic Algorithms (GAs) and Genetic Programming (GP)61,66. Figure 5 shows a flowchart of the GEP model. The GEP model follows a structured flow that begins with creating an initial chromosome population, representing potential solutions. These chromosomes are then expressed as computer programs. Following this, each program is executed, and its performance is evaluated based on a predefined fitness function.

Figure 5

Flowchart of GEP model67.

If the termination condition is met, the process ends. Otherwise, it iterates to produce a new generation. This involves selecting the best-performing programs to continue to the next cycle and using genetic operators to create a new generation of chromosomes. These genetic operators include mutation, inversion, one-point recombination, two-point recombination, gene recombination, and insertion sequence (IS) transposition. The cycle repeats, continually evaluating the fitness of programs and generating new ones until the best possible solution is found or another termination condition is met, at which point the model concludes.

ANN model

ANNs are a cornerstone of machine learning, inspired by the structure and function of the human brain. They are particularly adept at identifying complex, non-linear relationships within large datasets. ANNs consist of interconnected nodes or neurons, which collectively learn to perform tasks like regression and classification by considering examples. Their flexibility and adaptability make them suitable for various applications, from image recognition to natural language processing67,68,69. A typical multilayer perceptron consists of three types of layers: an input layer, one or more hidden layers, and an output layer. To predict new data, the model passes information through numerous neurons organized into a network. These neurons are interconnected through weights and biases, which are crucial determinants of the model's precision. Networks can be categorized into basic ANNs with a single hidden layer and deep neural networks with multiple hidden layers. Adding hidden layers can increase the ANN's ability to capture the connections between inputs and outputs, and thereby improve model accuracy. Figure 6 shows that the variable Y, the CS of concrete, was set as the output of the ANN model, while the eight variables were assigned as its inputs.

Figure 6

Inputs and output variables used for ANN model development.
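How weights and biases turn eight inputs into a single prediction can be sketched with one forward pass through a network with a single hidden layer. The tanh activation, the four hidden neurons, and the random weights are assumptions made for this illustration; the text does not specify the architecture used in the study.

```python
import math
import random


def mlp_forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass of a one-hidden-layer perceptron with one linear output."""
    # Hidden layer: weighted sum of the inputs plus a bias, passed through tanh.
    hidden = [
        math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(hidden_weights, hidden_biases)
    ]
    # Output layer: weighted sum of the hidden activations plus a bias.
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias


rng = random.Random(0)
n_inputs, n_hidden = 8, 4  # eight inputs as in the text; four hidden neurons is arbitrary
hidden_weights = [[rng.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_hidden)]
hidden_biases = [0.0] * n_hidden
output_weights = [rng.uniform(-1, 1) for _ in range(n_hidden)]

sample = [0.5] * n_inputs  # a hypothetical, already normalized mix
prediction = mlp_forward(sample, hidden_weights, hidden_biases, output_weights, 0.0)
print(prediction)
```

Training adjusts the weights and biases so that such predictions match the measured strengths; the untrained output shown here has no physical meaning.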

ANFIS model

The ANFIS model, initially introduced by Jang70 and subsequently elaborated upon by Jang et al.71, constitutes a universal approximation methodology. In this capacity, it can approximate any real continuous function defined on a compact set with arbitrary precision. The ANFIS structure closely resembles an ANN, featuring five layers, each comprised of nodes, including rules. Notably, the Sugeno fuzzy model, as proposed by Takagi and Sugeno72, is frequently employed in ANFIS. A prototypical rule set for a first-order Sugeno fuzzy model, embodying two fuzzy If–Then rules, can be succinctly expressed as follows:

$${\text{Rule}}\;1:\;\;{\text{If}}\;x\;{\text{is}}\;A_{1} \;\;{\text{and}}\;\;y\;{\text{is}}\;B_{1} ,\;\;{\text{then}}\;f_{1} = p_{1} x + q_{1} y + r_{1}$$

(3)

$${\text{Rule}}\; 2:\;\;{\text{ If}}\; x\;{\text{ is}}\; A_{2} \;{\text{ and}}\; y\;{\text{ is}}\; B_{2} , \;{\text{then}}\; f_{2} = p_{2} x + q_{2 } y + r_{2}$$

(4)

Here, \((A_{1} ,\;A_{2} ,\;B_{1} ,\;B_{2} )\) represent fuzzy sets, and \((p_{i} ,\;q_{i} ,\;r_{i} )\), with i = 1, 2, denote the parameters of the consequent part of each rule. This formulation describes the basic configuration of a Sugeno fuzzy model as it applies to an ANFIS. Figure 7 depicts the equivalent ANFIS framework. In this ANFIS setup, nodes within the same layer perform similar functions. The structure is composed of five layers: the first is the input layer, followed by the rule layer, then the normalization layer, the consequent layer, and finally, the output layer. For an in-depth explanation of the ANFIS framework, one can refer to Chang and Chang73.
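The two rules of Eqs. (3) and (4) can be evaluated with the short sketch below. Gaussian membership functions and the product as the "and" operator are common choices but are assumptions here, as are all parameter values.

```python
import math


def gaussian_mf(value, center, sigma):
    """Degree of membership of `value` in a Gaussian fuzzy set."""
    return math.exp(-0.5 * ((value - center) / sigma) ** 2)


def sugeno_output(x, y, rules):
    """First-order Sugeno inference: normalized firing strengths weight each f_i."""
    strengths, consequents = [], []
    for rule in rules:
        # Rule layer: firing strength as the product of the two memberships.
        strengths.append(gaussian_mf(x, *rule["A"]) * gaussian_mf(y, *rule["B"]))
        # Consequent layer: f_i = p_i * x + q_i * y + r_i.
        p, q, r = rule["pqr"]
        consequents.append(p * x + q * y + r)
    total = sum(strengths)
    # Normalization and output layers: weighted average of the consequents.
    return sum(w / total * f for w, f in zip(strengths, consequents))


# Hypothetical fuzzy sets as (center, sigma) and consequent parameters (p, q, r).
rules = [
    {"A": (0.0, 1.0), "B": (0.0, 1.0), "pqr": (1.0, 2.0, 0.5)},
    {"A": (5.0, 1.0), "B": (5.0, 1.0), "pqr": (-1.0, 0.0, 3.0)},
]
print(sugeno_output(2.5, 2.5, rules))  # both rules fire equally at the midpoint
```

In an ANFIS, the membership parameters and the consequent parameters are the quantities tuned during training.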

Figure 7

Ensemble models

In this research, four different ensemble models were utilized, namely Adaptive Boosting (AdaBoost), Random Forest (RF), eXtreme Gradient Boosting (XGBoost), and Categorical Gradient Boosting (CatBoost). The development of these models was carried out using the Python programming environment within the ANACONDA program. Concise explanations of each model are provided in the subsequent sub-sections.

AdaBoost model

Boosting is a well-known algorithm in machine learning, first suggested by Schapire75. Subsequently, Freund76 developed AdaBoost. This method focuses on combining several basic classifiers created during training into a single strong classifier. Additionally, it enhances the training process to improve the formation of these basic classifiers. The AdaBoost model is an ensemble technique that combines multiple weak learners to form a strong learner. In regression, it sequentially fits a model to adjust the weights of instances based on the errors of the previous model, focusing more on difficult-to-predict instances. The AdaBoost model is often used to improve the accuracy of decision trees and is known for its simplicity and effectiveness in reducing bias and variance.

RF model

The RF model is an ensemble learning method that operates by constructing many decision trees during training and outputting the average prediction of the individual trees. Breiman77 first developed the RF model, which combines the ideas of randomly selecting features and grouping data samples together. The RF model is widely used for both classification and regression tasks. It is particularly well-known for its ability to handle large datasets with higher dimensionality and provides estimates of feature importance, which can be very insightful.

XGBoost model

The XGBoost model implements gradient-boosted decision trees designed for speed and performance. It is a highly flexible and versatile algorithm known for its efficiency in handling sparse data and its ability to perform well on a wide range of regression and classification problems. The XGBoost model has been used successfully in numerous machine learning competitions due to its scalability and ability to produce highly competitive predictive models62.

The choice to use XGBoost for this research is based on its useful characteristics. It uses regularization to avoid overfitting and second-order gradients for quicker convergence. It can handle missing data when finding splits, and it supports random subsampling of rows and features to increase diversity among the trees and reduce overfitting. Shrinkage of each tree's contribution (the learning rate) is also employed to limit overfitting. Furthermore, XGBoost is designed with system-level enhancements such as parallel processing and cache optimization, which make it both fast and capable of handling large datasets.

CatBoost model

Categorical Gradient Boosting (CatBoost) represents a recent advancement in gradient-boosting algorithms designed to handle categorical features while minimizing information loss78. CatBoost distinguishes itself through two key characteristics: the utilization of ordered boosting to mitigate target leakage and its effectiveness, particularly on small datasets. Within the CatBoost model, the computation of the sample average for \(x_{{\sigma_{i,k} }}\) involves considering the target values of preceding samples in a random permutation: \(\sigma\) = (\(\sigma_{1}\), \(\sigma_{2}\), …, \(\sigma_{N}\)) of the dataset, thereby guarding against overfitting. This process is illustrated by Eq. (5).

$$x_{\sigma_{i,k}} = \frac{\sum_{j=1}^{i-1} \left[ x_{\sigma_{i,k}} = x_{\sigma_{j,k}} \right] y_{\sigma_{j}} + \overline{a} P}{\sum_{j=1}^{i-1} \left[ x_{\sigma_{i,k}} = x_{\sigma_{j,k}} \right] + \overline{a}}$$

(5)

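where the bracket \(\left[ x_{\sigma_{i,k}} = x_{\sigma_{j,k}} \right]\) is an indicator that equals 1 when the condition is met and 0 otherwise, \(y_{\sigma_{j}}\) is the target value of the j-th sample in the permutation, P represents a predetermined (prior) value, and \(\overline{a}\) is the coefficient that determines the significance of P.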
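The ordered statistic can be sketched as follows, in its standard form: for each sample, the targets of the preceding samples that share its category are summed, the weighted prior is added, and the result is divided by the number of such samples plus the same weight. The function name, the category labels, and the values are hypothetical; the input order is assumed to be the random permutation already.

```python
def ordered_target_statistic(categories, targets, prior, a=1.0):
    """Encode each sample from the targets of earlier samples in the same category.

    `categories` and `targets` are assumed to be listed in permutation order.
    """
    encoded = []
    target_sums, counts = {}, {}
    for category, target in zip(categories, targets):
        previous_sum = target_sums.get(category, 0.0)
        previous_count = counts.get(category, 0)
        # Only samples seen so far contribute, so a sample never sees its own target.
        encoded.append((previous_sum + a * prior) / (previous_count + a))
        target_sums[category] = previous_sum + target
        counts[category] = previous_count + 1
    return encoded


# Made-up categorical feature and targets, for illustration only.
categories = ["A", "B", "A", "A", "B"]
targets = [1.0, 0.0, 0.0, 1.0, 1.0]
encoded = ordered_target_statistic(categories, targets, prior=0.5, a=1.0)
print(encoded)
```

The first occurrence of each category receives the prior P, because no preceding samples are available for it. The eight inputs listed earlier are all numeric, so this encoding is shown only to illustrate the mechanism.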

Hyperparameters tuning

To optimize the hyperparameter settings in the current study, Bayesian optimization (BO) is employed. Unlike traditional methods such as grid search, BO models the objective function with a prior distribution and iteratively refines the search within the hyperparameter space for the optimal configuration. First, the range and prior distribution of each model's hyperparameters are specified. BO then identifies the configuration that maximizes performance within this predefined hyperparameter space. This approach enhances the efficiency of hyperparameter tuning, minimizing redundant experiments and accelerating the identification of the most effective hyperparameter combinations79,80,81,82. Figure 8 illustrates a framework example of BO-XGBoost.

Figure 8

Framework example of XGBoost with Bayesian optimization83.

Comprehensive performance evaluation of models

The dataset was partitioned into two distinct sets: training and testing. The training set is used to fit the model, and the testing set is used to evaluate the model's predictive performance. The split ratio was chosen to balance the need for sufficient training data against the necessity of a robust evaluation. Hence, the collected dataset was divided into 80% for training and 20% for testing. This ensures that the model is trained on a representative sample of the data while still being tested on data it has not previously encountered. Two commonly used approaches, visual and quantitative, are applied to evaluate and compare the adopted ML models84.
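For the 1030 data points, the 80/20 partition can be sketched as below. Shuffling before splitting and the fixed seed are assumptions of this illustration; the text does not state how the samples were assigned to the two sets.

```python
import random


def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Return (train, test) index lists after a seeded random shuffle."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = train_test_split_indices(1030)
print(len(train_idx), len(test_idx))  # 824 206
```

Working with index lists rather than copies of the data keeps the split reusable for every model, so that all models are compared on the same test samples.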

Visual methods

Visual methods include scatter plots, violin boxplots, and Taylor diagrams. These methods offer a quick and informative way to compare models, showing how closely the predictions reproduce statistical measures of the actual values such as the maximum, minimum, median, and quartiles. However, they may not capture information about the performance ranking of the models. Scatter plots are used to visualize the relationship between two variables, here the actual and predicted values. Violin plots show the full distribution of the data. This is crucial when comparing models because it shows not only the central tendency (i.e., mean or median) but also the spread and density of the values. Taylor diagrams are a specialized graphical representation that quantifies the similarity between actual and predicted values. These diagrams plot the correlation, the standard deviation, and the root mean square error of predictions on a single chart. This provides a comprehensive view of a model's accuracy, variability, and overall performance compared to the actual observations.

Quantitative methods

Quantitative methods include seven performance indices: Determination Coefficient (R²), Willmott Index (WI), Root Mean Square Error (RMSE), Scatter Index (SI), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), and Mean Bias Error (MBE). R² and WI should ideally be 1, indicating perfect prediction accuracy, while RMSE, SI, MAE, MAPE, and MBE should ideally be 0, indicating no error in the predictions. A predictive model is therefore better the closer its performance indicators are to these values. The equations for calculating these indices are presented in Eqs. (6–12) as follows85:

$$R^{2} = 1 - \frac{\sum_{i=1}^{n} \left( x_{i} - y_{i} \right)^{2}}{\sum_{i=1}^{n} \left( x_{i} - \overline{x} \right)^{2}}$$

(6)

$${\text{WI}} = 1 - \frac{\sum_{i=1}^{n} \left( x_{i} - y_{i} \right)^{2}}{\sum_{i=1}^{n} \left( \left| x_{i} - \overline{x} \right| + \left| y_{i} - \overline{x} \right| \right)^{2}}$$

(7)

$${\text{RMSE}} = \sqrt{\frac{\sum_{i=1}^{n} \left( x_{i} - y_{i} \right)^{2}}{n}}$$

(8)

$${\text{SI}} = \frac{{\text{RMSE}}}{\overline{x}}$$

(9)

$${\text{MAE}} = \frac{1}{n} \sum_{i=1}^{n} \left| x_{i} - y_{i} \right|$$

(10)

$${\text{MAPE}} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{x_{i} - y_{i}}{x_{i}} \right|$$

(11)

$${\text{MBE}} = \frac{1}{n} \sum_{i=1}^{n} \left( x_{i} - y_{i} \right)$$

(12)

where \(x_{i}\) is the actual CS value; \(\overline{x}\) is the mean of the actual CS values; \(y_{i}\) is the predicted CS value; and n is the number of data points.
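The seven indices of Eqs. (6) to (12) can be computed together, as in the sketch below. The function name and the four actual and predicted values are hypothetical and serve only to make the calculation concrete.

```python
import math


def evaluate_predictions(actual, predicted):
    """Return the seven indices of Eqs. (6)-(12) as a dictionary."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and equally long")
    n = len(actual)
    mean_actual = sum(actual) / n
    errors = [x - y for x, y in zip(actual, predicted)]
    squared_error = sum(e ** 2 for e in errors)
    total_squares = sum((x - mean_actual) ** 2 for x in actual)
    willmott_denominator = sum(
        (abs(x - mean_actual) + abs(y - mean_actual)) ** 2
        for x, y in zip(actual, predicted)
    )
    rmse = math.sqrt(squared_error / n)
    return {
        "R2": 1 - squared_error / total_squares,
        "WI": 1 - squared_error / willmott_denominator,
        "RMSE": rmse,
        "SI": rmse / mean_actual,
        "MAE": sum(abs(e) for e in errors) / n,
        # MAPE is a fraction, as in Eq. (11); it requires non-zero actual values.
        "MAPE": sum(abs(e / x) for e, x in zip(errors, actual)) / n,
        "MBE": sum(errors) / n,
    }


# Made-up strengths in MPa, for illustration only.
actual = [20.0, 30.0, 40.0, 50.0]
predicted = [22.0, 29.0, 41.0, 47.0]
metrics = evaluate_predictions(actual, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Note that MBE can be close to zero even when individual errors are large, because positive and negative errors cancel; this is why it is reported alongside MAE and RMSE rather than alone.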

k-fold cross-validation

K-fold cross-validation (Fig. 9) is a widely used method to check the performance of ML models. It involves dividing the dataset into several parts, typically ten, known as “folds.” In this tenfold system, the dataset splits into ten subsets. For each test, nine groups are used to train the model, and one group is kept for testing. This approach is suitable for understanding variability within the data and doesn’t take too much time to compute62. Each of the ten subsets becomes the test set, with the others being used for training. A reliable measure of model accuracy is obtained by averaging results from all ten tests. This way of testing helps ensure effective training of the adopted ML models and reduces the chance of missing out on essential data in the dataset.

Figure 9

Schematic representation of k-fold cross-validation86.
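The tenfold procedure can be sketched as an index generator in which every subset serves as the test set exactly once. The shuffling, the seed, and the way indices are dealt into folds are assumptions of this illustration.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (nearly) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(1030, k=10))
for fold_number, (train, test) in enumerate(splits, start=1):
    print(fold_number, len(train), len(test))
```

A model would be fitted on each `train` list and scored on the matching `test` list, and the ten scores would then be averaged to obtain the cross-validated estimate.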

SHAP feature importance analysis

To analyze sensitivity and interpret ML models both globally and for individual predictions, researchers use the SHAP approach, which draws on principles from cooperative game theory47. The SHAP method was employed to gauge the comparative impact of input variables on the prediction process. As an advanced method within the realm of explainable artificial intelligence, SHAP helps clarify the complex interactions between the input variables and the model predictions, as shown in Fig. 10. It offers critical insights by identifying which features are most influential on predictions and how they modify the predicted results87,88. As Eq. (13) shows, the Shapley value \(\phi_{i}\) for feature i is the average marginal contribution of that feature across all possible orderings of the features. In this equation, N represents the set of all features, S represents a subset of features that excludes feature i, |S| denotes the cardinality of S, v(S) represents the model's prediction when only the features in S are considered, and v(S ∪ {i}) represents the model's prediction when feature i is added to S89.

$$\phi_{i} = \sum_{S \subseteq N \setminus \{ i \}} \frac{\left| S \right|! \left( \left| N \right| - \left| S \right| - 1 \right)!}{\left| N \right|!} \left[ v\left( S \cup \{ i \} \right) - v\left( S \right) \right]$$

(13)

Figure 10

SHAP values method workflow.
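For a small number of features, the Shapley value of Eq. (13) can be evaluated exactly by enumerating every subset S that excludes feature i. The value function `toy_value` below is hypothetical and stands in for a model's prediction on a subset of features; its numbers are made up.

```python
from itertools import combinations
from math import factorial


def shapley_values(features, value):
    """Exact Shapley values by enumerating all subsets S of N without i (Eq. 13)."""
    n = len(features)
    phi = {}
    for i in features:
        others = [f for f in features if f != i]
        total = 0.0
        for size in range(len(others) + 1):
            # Weight |S|! (|N| - |S| - 1)! / |N|! depends only on the subset size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value(s | {i}) - value(s))
        phi[i] = total
    return phi


def toy_value(subset):
    """Hypothetical v(S): additive effects plus one cement-age interaction."""
    v = 0.0
    if "cement" in subset:
        v += 10.0
    if "age" in subset:
        v += 4.0
    if "water" in subset:
        v -= 3.0
    if "cement" in subset and "age" in subset:
        v += 2.0
    return v


features = ["cement", "age", "water"]
phi = shapley_values(features, toy_value)
print(phi)
```

The interaction term is shared equally between the two features involved, and the values sum to the difference between v(N) and v of the empty set. The enumeration grows exponentially with the number of features, so practical SHAP implementations rely on approximations or model-specific algorithms.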


