Machine learning models based wear performance prediction of AZ31/TiC composites

This study presents an experimental approach to developing a predictive mapping from TiC content, sliding speed and applied load (inputs) to volume loss (output) using multiple supervised ML regression models. Initially, the experimental data were systematically organized and preprocessed to ensure their suitability for ML applications. The dataset was then partitioned into two subsets: training data (80%) and validation data (20%). Five distinct ML models were developed and trained using the training dataset.

In total, 135 wear measurements were collected, corresponding to 45 unique test conditions (combinations of three TiC levels, five loads and three speeds), each repeated in triplicate to quantify measurement variability. Condition-wise means and standard deviations were computed and used for model fitting, and replicates belonging to the same condition were kept together during cross-validation using a GroupKFold strategy. Only three controllable, a-priori factors (TiC content, load and sliding speed) were used as predictors. Although microstructural attributes such as grain size, hardness and dislocation density were measured experimentally, they represent downstream consequences of the processing conditions and would introduce information leakage and multicollinearity into the models. A pilot ablation study including hardness and grain size confirmed that these additions did not improve cross-validated accuracy and instead increased variance. We therefore restricted the model inputs to the controllable factors and emphasize that the reported high R² values arise from interpolation within the factorial design space rather than extrapolation beyond it.
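As a minimal sketch of how the 45-condition design and its condition-wise statistics could be assembled, the snippet below enumerates a full factorial and summarizes triplicates. The load and speed values and all volume-loss numbers are assumptions for illustration; this section states only the number of levels, the TiC contents, and that 50 N is one of the loads.

```python
import itertools
import random
import statistics

# Hypothetical factor levels: only the counts (3 x 5 x 3) are given in the text,
# so the specific load and speed values below are assumptions.
TIC_LEVELS = [5, 10, 15]             # TiC content, %
LOAD_LEVELS = [10, 20, 30, 40, 50]   # applied load, N (assumed)
SPEED_LEVELS = [0.5, 1.0, 1.5]       # sliding speed, m/s (assumed)
N_REPEATS = 3

random.seed(0)
conditions = list(itertools.product(TIC_LEVELS, LOAD_LEVELS, SPEED_LEVELS))

# Synthetic replicate measurements keyed by condition (placeholder values only).
measurements = {
    cond: [random.uniform(0.1, 1.0) for _ in range(N_REPEATS)]
    for cond in conditions
}

# Condition-wise mean and sample standard deviation of the replicates.
summary = {
    cond: (statistics.mean(vals), statistics.stdev(vals))
    for cond, vals in measurements.items()
}

n_measurements = sum(len(vals) for vals in measurements.values())
print(len(conditions), n_measurements)
```

Keying every replicate by its condition tuple also gives a natural group label, which is what a grouped cross-validation split needs to keep replicates together.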

Data preparation and model evaluation were performed using a leakage‑free pipeline: (1) an 80/20 train–test split was made on the condition‑wise means; (2) a scikit‑learn Pipeline was constructed chaining StandardScaler and each estimator; (3) GridSearchCV with 5‑fold cross‑validation was applied on the training set only, ensuring that scaling and model fitting occurred inside each fold; (4) the best hyperparameters were selected based on the mean cross‑validated R² and the final model was evaluated on the untouched 20% test set. This procedure prevents optimistic bias due to data leakage.
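The four steps above can be sketched with scikit-learn as follows. This is an illustrative outline rather than the code used in the study: the data are synthetic stand-ins for the 45 condition-wise means, Random Forest is used as one representative estimator, and the parameter grid is hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the 45 condition-wise means (columns: TiC, load, speed).
tic = np.repeat([5, 10, 15], 15)
load = np.tile(np.repeat([10, 20, 30, 40, 50], 3), 3)
speed = np.tile([0.5, 1.0, 1.5], 15)
X = np.column_stack([tic, load, speed]).astype(float)
y = 0.02 * load - 0.01 * tic - 0.1 * speed + rng.normal(0, 0.01, size=45)

# (1) 80/20 split on the condition-wise means.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# (2) Scaler and estimator chained so that scaling is refit inside every CV fold.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestRegressor(random_state=0)),
])

# (3) Grid search with 5-fold CV on the training set only (hypothetical grid).
param_grid = {"model__n_estimators": [50, 100], "model__max_features": [2, 3]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="r2")
search.fit(X_train, y_train)

# (4) Single evaluation of the refit best model on the untouched test set.
test_r2 = search.score(X_test, y_test)
print(search.best_params_, round(test_r2, 3))
```

Because the scaler lives inside the pipeline, its mean and variance are estimated from the training portion of each fold only, which is what keeps the cross-validated scores free of leakage.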

To probe robustness beyond simple random splits, we also conducted leave-one-factor-level-out cross-validation (e.g., excluding all data at 50 N during training) and grouped folds by FSP stir zone location. Predictive accuracy decreased under these more stringent protocols, particularly when the highest load level was held out, but the relative ranking of algorithms (RF ≥ GB ≫ LR) remained unchanged. These additional analyses suggest that the developed models interpolate reliably within the studied domain, while highlighting that true external validation requires additional FSP batches or different counterface materials, which are planned for future work.
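A leave-one-factor-level-out split can be sketched in a few lines of plain Python. The factor levels below are assumptions (the text names only the 50 N level), and the helper `leave_one_level_out` is a hypothetical name introduced for illustration.

```python
import itertools

# Hypothetical factor levels; only the counts and the 50 N level appear in the text.
TIC_LEVELS = [5, 10, 15]
LOAD_LEVELS = [10, 20, 30, 40, 50]
SPEED_LEVELS = [0.5, 1.0, 1.5]
rows = list(itertools.product(TIC_LEVELS, LOAD_LEVELS, SPEED_LEVELS))


def leave_one_level_out(rows, factor_index):
    """Yield (held_out_level, train_idx, test_idx) for each level of one factor."""
    levels = sorted({row[factor_index] for row in rows})
    for level in levels:
        test_idx = [i for i, row in enumerate(rows) if row[factor_index] == level]
        train_idx = [i for i, row in enumerate(rows) if row[factor_index] != level]
        yield level, train_idx, test_idx


# Hold out each load level in turn (factor_index=1 is the load column).
splits = list(leave_one_level_out(rows, factor_index=1))
for level, train_idx, test_idx in splits:
    print(f"held-out load = {level} N: {len(train_idx)} train, {len(test_idx)} test")
```

Holding out the highest level forces the model to extrapolate past the training range, which is consistent with the larger accuracy drop reported for that case. scikit-learn's `LeaveOneGroupOut` produces the same kind of split when the load column is passed as the group label.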

To ensure model robustness, a 5-fold cross-validation technique was employed, wherein the training dataset was divided into five subsets, with each subset iteratively used for validation. This step ensured that the models were assessed for generalizability and overfitting before final evaluation. To determine the optimal model configuration, a grid search method was employed, systematically testing multiple hyperparameter configurations for each model. Cross-validation was performed to identify the best-performing model based on training data. The most accurate models were subsequently evaluated on the validation dataset and the results are discussed in this section. The ML dataset comprised 135 individual wear measurements derived from 45 unique experimental conditions (three TiC contents × five loads × three speeds) with three repeats per condition; unless otherwise noted, condition‑wise means were used as input to the models. This approach acknowledges the limited sample size and reduces variance by pooling replicates.

ML models

Five widely used regression models (Linear Regression (LR), Decision Tree (DT), Random Forest (RF), Gradient Boosting (GB) and Extreme Gradient Boosting (XGB)) were selected for performance comparison on the given dataset. The application of regression methodologies was necessitated by the nature of the dependent variable—volume loss—which is characterized as a continuous real-valued metric. The execution of all model implementations was conducted utilizing the Python programming language, with the computational tasks executed on the Google Colab platform. The scikit-learn package was utilized for ML tasks, while Seaborn was employed for data visualization. To enhance prediction accuracy, data preprocessing was performed using the Standard Scaling algorithm, which normalizes input and output features to improve the efficiency of ML models. Standardization ensured consistency across varying data scales, leading to more precise and reliable predictions. A detailed block diagram depicting the essential stages of the ML process is shown in Fig. 9. A pseudocode of the entire process is also shown as Algorithm 1.

Linear regression

Linear regression is a supervised learning algorithm that fits a linear equation to a dataset by minimizing the error between actual and predicted values, thereby linking the dependent variable to the independent variables. Its simplicity and interpretability make it popular in many sectors, including engineering, economics and healthcare. Model parameters are usually estimated by the least squares technique, which minimizes the sum of squared residuals. Although it models linear relationships well, linear regression has significant drawbacks with complicated, nonlinear data, which motivates more flexible ML methods such as polynomial regression or neural networks. Still, it remains a basic tool in data-driven decision-making because it offers an interpretable view of trends and patterns.
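For the three predictors used in this work, the linear model and its least squares estimate can be written as below. The coefficient symbols $\beta_j$, the design matrix $X$ (with a leading column of ones) and the response vector $\mathbf{y}$ are notation introduced here for illustration; the closed form holds when $X^{\top}X$ is invertible.

$$\hat{y} = \beta_{0} + \beta_{1} x_{\mathrm{TiC}} + \beta_{2} x_{\mathrm{load}} + \beta_{3} x_{\mathrm{speed}}, \qquad \hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \sum\limits_{i = 1}^{n} \left( y_{i} - \hat{y}_{i} \right)^{2} = \left( X^{\top} X \right)^{-1} X^{\top} \mathbf{y}$$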

Decision tree (DT)

The decision tree model predicts output values from the input features by means of a tree-like structure made up of nodes and leaves. The decision at each node follows a splitting criterion computed from feature values: information gain or Gini impurity for classification, and Mean Squared Error (MSE) or Mean Absolute Error (MAE) for regression. In regression, every leaf node holds a final numerical prediction of the target value. Deepening the tree yields a more complex model, which can cause overfitting: the model memorizes the training data and underperforms on unseen test data.
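To make the MSE splitting criterion concrete, the following sketch searches one feature for the threshold that minimizes the size-weighted MSE of the two child nodes. The load and volume-loss values are hypothetical, and `best_split` is an illustrative helper, not part of any library.

```python
def mse(values):
    """Mean squared deviation of values from their own mean (0.0 if empty)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_split(x, y):
    """Return (threshold, weighted_mse) of the best binary split on one feature."""
    best = (None, float("inf"))
    levels = sorted(set(x))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(levels, levels[1:]):
        threshold = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        weighted = (len(left) * mse(left) + len(right) * mse(right)) / len(y)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best


# Hypothetical loads (N) and volume losses, for illustration only.
load = [10, 20, 30, 40, 50]
volume_loss = [0.10, 0.12, 0.11, 0.40, 0.42]
threshold, score = best_split(load, volume_loss)
print(threshold, score)
```

With these numbers the chosen threshold falls between 30 N and 40 N, where the jump in volume loss occurs. A full tree repeats this search over every feature and recurses into each child node until a stopping rule such as maximum depth is reached.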

Random forest (RF)

Random Forest (RF) is a supervised ML method that performs well in both regression and classification tasks. It builds several decision trees from randomly drawn subsets of the training data. In regression, every tree is trained separately and the final prediction is the average of the outputs of all trees. The number of trees in the forest is set by n_estimators, which is a main hyperparameter. Other parameters, inherited from the Decision Tree model, include the minimum samples per split and the maximum depth. By improving model accuracy and reducing overfitting, the ensemble approach makes Random Forest a strong and consistent predictive modeling tool.

Gradient boosting regressor

Gradient boosting is an efficient ensemble ML technique that builds models sequentially, each new model trying to correct the errors of its predecessor. Its predictive accuracy and ability to handle complicated, high-dimensional data make it especially popular for regression tasks. The approach iteratively combines multiple weak learners into a robust predictive model: at each iteration, a new model is fitted to the residuals (errors) of the current ensemble, thereby minimizing a differentiable loss function. Despite its benefits, tuning of hyperparameters, including tree depth, number of estimators and learning rate, is crucial to avoid overfitting and ensure good generalization. Given a dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$, where $x_i$ denotes the feature vector and $y_i$ represents the corresponding target value, the goal is to identify a model $F(x)$ that minimizes the loss function $L(y, F(x))$. Through this iterative approach, Gradient Boosting enhances the model's accuracy by concentrating on the data points that previous models struggled with, leading to a strong predictive model that can tackle complex regression challenges.
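The iteration described above can be written compactly in the standard textbook form. The stage index $m$, the weak learner $h_m$, the pseudo-residuals $r_{im}$ and the learning rate $\nu$ are notation introduced here, with $L$ the loss function and $F$ the model.

$$F_{0}(x) = \arg\min_{c} \sum\limits_{i = 1}^{n} L\left( y_{i}, c \right), \qquad r_{im} = - \left[ \frac{\partial L\left( y_{i}, F(x_{i}) \right)}{\partial F(x_{i})} \right]_{F = F_{m-1}}, \qquad F_{m}(x) = F_{m-1}(x) + \nu\, h_{m}(x)$$

Here $h_m$ is the weak learner fitted to the pseudo-residuals $r_{im}$ at stage $m$. For the squared-error loss $L = \tfrac{1}{2}(y - F)^{2}$, the pseudo-residual reduces to the ordinary residual $r_{im} = y_{i} - F_{m-1}(x_{i})$, which is why each stage can be described as fitting the errors of the previous ensemble.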

Extreme gradient boosting (XGB)

Specifically meant to forecast output variables from input features, Extreme Gradient Boosting (XGB) is a sophisticated, tree-based supervised ML technique. Unlike Random Forest, which constructs trees independently and in parallel, XGB constructs trees sequentially, each new tree rectifying the mistakes of the prior ones. High predictive performance results from this boosting technique, which makes XGB especially good at managing structured and tabular data for regression and classification tasks. Like RF, XGB consists of multiple decision trees, but its training mechanism is fundamentally different. Additionally, XGB incorporates an L2 regularization function to control model complexity, reducing overfitting and enhancing generalization. This makes XGB a more efficient and precise alternative to traditional Gradient Boosting models.
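The regularized objective referred to above is usually written as follows in the standard XGBoost formulation. The symbols are introduced here for illustration: $l$ is the pointwise loss, $f_k$ the $k$-th tree, $T$ the number of leaves in a tree, $w$ its vector of leaf weights, and $\gamma$, $\lambda$ the regularization strengths.

$$\mathcal{L} = \sum\limits_{i = 1}^{n} l\left( y_{i}, \hat{y}_{i} \right) + \sum\limits_{k = 1}^{K} \Omega\left( f_{k} \right), \qquad \Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^{2}$$

The $\lambda$ term is the L2 penalty on leaf weights, while $\gamma$ penalizes each additional leaf, so both shrink the model toward simpler trees.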

Hyper parameter optimization

Hyperparameter optimization was performed using GridSearchCV from the scikit-learn library, employing a 5-fold cross-validation strategy for each model to identify the optimal combination of parameters. The search grids were designed to balance model performance with computational efficiency, particularly considering the limited size of the dataset. The hyperparameters and their respective ranges explored for each model are given in Table 3.

Table 3 Hyperparameters and their tuning ranges.

Each model was evaluated using R², RMSE, MAE and MSE on the validation folds. The final model configurations were selected based on the highest mean R² score across folds. This systematic tuning ensures that all models were fairly optimized for their respective parameter spaces and prevents performance bias due to arbitrary parameter selection.

The parameter tuning phase was carried out to determine the optimal configuration for each ML model. Because the tunable parameters differ from model to model, the parameter grids were tailored individually.

Upon completion of the hyperparameter optimization phase, the optimal configuration for each model was determined. For the Random Forest (RF) model, the maximum number of features considered for the best split (MF) is 2, the minimum number of samples required at a leaf node (MSL) is 1 and the total number of decision trees in the forest (NE) is 100. The optimal configuration for XGB consists of a learning rate (η) of 0.1 and 50 estimators (trees). The optimal decision tree uses the absolute error as the splitting criterion, a maximum depth of 8 and a minimum samples per leaf (MSL) of 1.
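For reference, the reported configurations map onto scikit-learn constructor arguments roughly as sketched below. The `random_state` values are assumptions added for reproducibility, and the XGB settings are kept as a plain dictionary (using the `learning_rate` and `n_estimators` parameter names of `xgboost.XGBRegressor`) so that the snippet does not require the xgboost package.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# Random Forest: MF = 2, MSL = 1, NE = 100 (values reported in the text).
rf = RandomForestRegressor(
    n_estimators=100,
    max_features=2,
    min_samples_leaf=1,
    random_state=0,  # assumption, not stated in the text
)

# Decision Tree: absolute-error criterion, maximum depth 8, MSL = 1.
dt = DecisionTreeRegressor(
    criterion="absolute_error",
    max_depth=8,
    min_samples_leaf=1,
    random_state=0,  # assumption, not stated in the text
)

# XGB: learning rate 0.1 and 50 trees, stored as keyword arguments.
xgb_params = {"learning_rate": 0.1, "n_estimators": 50}

print(rf.get_params()["n_estimators"], dt.get_params()["max_depth"], xgb_params)
```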

As mentioned earlier, four distinct measures, namely R², RMSE, MSE and MAE, were utilized in this work to evaluate and compare the performances of the ML models. The coefficient of determination, R², is a statistical concept used to assess the goodness of fit of a regression function. It is derived using the following Eq. (1).

$$R^{2} = 1 - \frac{{\sum\limits_{{i = 1}}^{n} {\left( {y_{i} - \hat{y}_{i} } \right)^{2} } }}{{\sum\limits_{{i = 1}}^{n} {\left( {y_{i} - \bar{y}} \right)^{2} } }}$$

(1)

Here n denotes the number of observations, $y_{i}$ the experimentally measured output value, $\hat{y}_{i}$ the predicted output value and $\bar{y}$ the arithmetic mean of the experimentally measured values. The RMSE, or root mean squared error, indicates the typical deviation between predicted and actual values. It is computed as the square root of the mean of the squared differences between the predicted and actual values, Eq. (2).

$$RMSE = \sqrt {MSE} = \sqrt {\frac{1}{n}\sum\limits_{{i = 1}}^{n} {\left( {y_{i} - \hat{y}_{i} } \right)^{2} } }$$

(2)

The MSE, or mean squared error, is a measure of the average squared difference between the actual and projected values. It is calculated using the following Eq. (3):

$$MSE = \frac{1}{n}\sum\limits_{{i = 1}}^{n} {\left( {y_{i} - \hat{y}_{i} } \right)^{2} }$$

(3)

The MAE, or mean absolute error, is a measure of the average absolute difference between the actual and predicted values. It is derived using the following Eq. (4):

$$MAE = \frac{1}{n}\sum\limits_{{i = 1}}^{n} {\left| {y_{i} - \hat{y}_{i} } \right|}$$

(4)

As shown in Eq. (1), the coefficient of determination (R²) evaluates how well the model's predictions match the actual data. A perfectly fitting model has an R² value of 1, while a value of 0 indicates that the model explains none of the variability in the data. R² is therefore the natural metric for the proportion of variance captured by the model. Three supplementary error metrics quantify the size of the prediction errors. Root Mean Squared Error (RMSE) and Mean Squared Error (MSE) are both based on the squared deviations between actual and predicted values. Because RMSE is the square root of MSE, the two metrics are monotonically related, and lower values of either indicate better model precision. Mean Absolute Error (MAE) computes the average absolute difference between actual and predicted values. Unlike RMSE and MSE, MAE is less sensitive to outliers, making it a useful complement when a few extreme deviations should not dominate the assessment.
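The four metrics can be computed directly from their definitions, as in the following plain-Python sketch. The function names and the small worked example are illustrative assumptions, not data from this study.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error between measured and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))


def mae(y_true, y_pred):
    """Mean absolute error between measured and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical measured and predicted values, for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
print(r2(y_true, y_pred), rmse(y_true, y_pred), mse(y_true, y_pred), mae(y_true, y_pred))
```

In this example the squared residuals sum to 0.10 and the total sum of squares is 5.0, so R² = 1 − 0.10/5.0 = 0.98, MSE = 0.025 and MAE = 0.15.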

Table 4 Comparative performance of ML models without hyperparameters optimization.

Table 5 Comparative performance of ML models with hyperparameters optimization.

The performance of the machine learning models before and after hyperparameter tuning, summarized in Tables 4 and 5, shows substantial improvements with tuning. The bar plots for R², RMSE, MSE and MAE (Fig. 10) highlight these enhancements. Gradient Boost and XGBoost exhibited the largest improvements, with Gradient Boost achieving the highest R² (0.9987) and the lowest RMSE (0.0088) after tuning. XGBoost reached R² = 0.9963 and RMSE = 0.0147 after tuning. In contrast, Linear Regression showed no change: both R² = 0.7713 and RMSE = 0.1166 were identical before and after hyperparameter optimization, highlighting the model's limitations with the given dataset. Although Gradient Boost showed a near-optimal R² of 0.9987 and RMSE of 0.0088, it is critical to evaluate the precision of these results relative to the inherent experimental measurement uncertainty. The measurement uncertainty was therefore estimated from repeatability trials and the standard deviation of wear volume measurements across triplicates, yielding an estimated experimental uncertainty margin of ± 0.05 mm³. The RMSE values for both Gradient Boost and XGBoost (0.0088 and 0.0147, respectively) are well within this margin, indicating that the discrepancies between predicted and observed wear values are within the acceptable tolerance of the measurement system. This suggests that the model deviations are not physically significant.

Confidence interval analysis of R² scores

The performance of the machine learning models was evaluated using the coefficient of determination (R²) and 95% confidence intervals (CIs) from bootstrapped resampling as shown in Table 6. GB and XGBoost demonstrated the highest R² values of 0.9987 and 0.9963, respectively, indicating exceptional predictive performance on the training dataset. These models also exhibited slightly wider confidence intervals, suggesting that while they are highly accurate, there is moderate uncertainty in their ability to generalize to unseen data. In comparison, the Random Forest model also showed robust performance with an R² of 0.9919 and a narrow confidence interval (0.9969, 0.9996), suggesting robust and stable predictions across various conditions. The Decision Tree model, while achieving a high R² of 0.9945, had a broader confidence interval (0.9854, 0.9966), indicating potential challenges with overfitting and a risk of less reliable generalization. Finally, Linear Regression demonstrated significantly lower predictive performance with an R² of 0.7713, accompanied by a broad confidence interval, which highlights its limitations in capturing complex patterns within the dataset.
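A percentile bootstrap for the R² confidence interval can be sketched as follows. The number of resamples, the seed, the helper names and the sample values are all assumptions for illustration; the section does not state how the bootstrap was configured.

```python
import random


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def bootstrap_r2_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for R², resampling (measured, predicted) pairs."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        yp = [y_pred[i] for i in idx]
        # Skip degenerate resamples where all targets coincide (R² undefined).
        if len(set(yt)) < 2:
            continue
        scores.append(r2_score(yt, yp))
    scores.sort()
    lower = scores[int((alpha / 2) * len(scores))]
    upper = scores[int((1 - alpha / 2) * len(scores)) - 1]
    return lower, upper


# Hypothetical measured and predicted volume losses, for illustration only.
y_true = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
y_pred = [0.12, 0.19, 0.31, 0.38, 0.52, 0.61, 0.68, 0.81, 0.88]
point = r2_score(y_true, y_pred)
lower, upper = bootstrap_r2_ci(y_true, y_pred)
print(point, lower, upper)
```

Resampling pairs keeps each prediction attached to its own measurement, so the interval reflects variability in the evaluation set rather than in the model fit.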

Table 6 Confidence interval analysis of R² Scores.

Comparative bar plots of R², RMSE, MSE and MAE before and after hyperparameter tuning are shown in Fig. 10. The R² values are consistently high across the tree-based models both before and after tuning, with Gradient Boost and XGBoost maintaining the highest values, reflecting excellent predictive accuracy. The Random Forest model also shows robust performance, while LR exhibits the lowest R², indicating poor predictive power. For RMSE, MSE and MAE, hyperparameter tuning significantly reduced the values for GB and XGBoost, which indicates a better model fit and smaller prediction errors. The DT and RF models show minor improvements, while Linear Regression's performance remains unchanged, highlighting its limitations in capturing the underlying data patterns. Overall, Fig. 10 demonstrates the effectiveness of hyperparameter tuning in enhancing model performance, particularly for GB and XGBoost, which improve across multiple metrics.

Figure 11 presents a comparative analysis of actual vs. predicted volume loss for the machine learning models: (a) LR, (b) DT, (c) RF, (d) GB and (e) XGBoost. After hyperparameter tuning, Gradient Boosting achieved an R² of 0.9987 and RMSE of 0.008755, demonstrating exceptional predictive performance with minimal prediction error. XGBoost closely follows with an R² of 0.9963 and RMSE of 0.014707, highlighting its effectiveness in capturing the underlying patterns and minimizing residuals. The Random Forest model also performed well, yielding an R² of 0.9919 and RMSE of 0.021840, reflecting strong generalization capability, though with slightly higher error than the boosting methods. Decision Tree demonstrated satisfactory performance with an R² of 0.9945 and RMSE of 0.018074, though it showed slightly more variability in predictions. In contrast, Linear Regression showed no improvement after tuning, with an R² of 0.7713 and RMSE of 0.116618, indicating significant limitations in capturing the complexity of the data and producing larger residuals. Overall, Gradient Boosting and XGBoost emerge as the top performers, while Linear Regression lags in predictive accuracy.

To assess the prediction errors in relation to experimental variability, residual plots were generated for each ML model (Fig. 12a–e). Figure 12 presents the residual plots for (a) LR, (b) DT, (c) RF, (d) GB and (e) XGBoost models, which highlight the effectiveness of each model in predicting wear volume loss. The residual plot for Linear Regression shows noticeable spread, indicating poor model fit and higher error, especially for larger predicted values. The DT residuals exhibit a moderate spread with some outliers, suggesting potential overfitting despite an improved fit over Linear Regression. RF, however, shows reduced error with most residuals close to zero, indicating better generalization. GB and XGBoost models provide the most favorable residual plots, with residuals tightly clustered around zero, indicating excellent model fit and minimal error, thus demonstrating their superior ability to capture the underlying data complexity.

Figure 13 illustrates the feature importance scores derived from multiple tree-based ML models, providing insight into the relative influence of input variables—load, sliding speed and TiC content—on the predicted wear volume loss. The plot reveals that normal load consistently exhibits the highest importance across all models, emphasizing its dominant role in governing material wear behavior. This aligns with tribological principles, where increased load typically intensifies contact stress and accelerates wear mechanisms such as ploughing, delamination, or adhesion. The reinforcement content (TiC vol%) is the second most influential factor, as it directly affects the composite’s hardness, grain refinement and wear resistance. In contrast, sliding speed shows the least importance, suggesting its secondary role compared to load and material properties in determining volumetric wear.
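Impurity-based feature importances of the kind plotted in Fig. 13 can be read from a fitted tree ensemble, as in the sketch below. The data are synthetic and constructed so that load dominates; they are not the measured dataset, and the factor levels and coefficients are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic full-factorial inputs (columns: TiC, load, speed); levels are assumed.
tic = np.repeat([5, 10, 15], 15)
load = np.tile(np.repeat([10, 20, 30, 40, 50], 3), 3)
speed = np.tile([0.5, 1.0, 1.5], 15)
X = np.column_stack([tic, load, speed]).astype(float)

# Synthetic response in which load dominates by construction.
y = 0.02 * load - 0.01 * tic - 0.05 * speed + rng.normal(0, 0.01, size=45)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# feature_importances_ holds normalized impurity reductions that sum to 1.
importances = dict(zip(["TiC", "load", "speed"], model.feature_importances_))
for name, value in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {value:.3f}")
```

Impurity-based scores reflect how often and how effectively a feature is used for splitting, so they describe the fitted model rather than a causal effect; they are best read alongside the physical interpretation given above.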

3D Response surface analysis

The 3D surface plots (Fig. 14) illustrate the predicted wear volume behavior of AZ31/TiC composites with 5%, 10% and 15% TiC reinforcement as a function of sliding speed and applied load. Across all compositions, wear volume increases with load due to intensified contact stresses, while it decreases with increasing sliding speed, which is attributed to the formation of protective tribolayers at higher velocities. A clear trend of improved wear resistance is observed with increasing TiC content, with the 15% TiC composite showing the lowest predicted wear volume across all conditions. This highlights the effectiveness of TiC in enhancing load-bearing capacity and reducing material removal. The non-linear shape of the surfaces reflects the complex interactions between process parameters, which are captured by the ML model. These plots not only confirm experimental trends but also provide a predictive tool for optimizing wear performance under various operating conditions.

Table 7 Comparison with published works.

The present study demonstrates superior wear prediction accuracy compared to existing literature on Mg composites (Table 7). Among all the models evaluated, GB achieved the highest R² value (0.9987) and the lowest RMSE (0.0088), in line with the performance of models such as GB and LightGBM reported in previous works^{51–52}. In contrast, linear regression performed poorly in this study, which differs from its better performance in some earlier studies. This discrepancy emphasizes the significance of dataset-specific hyperparameter tuning. Therefore, the findings of this study further support the reliability and interpretability of Gradient Boost as a robust model for wear prediction in AZ31/TiC composites.
