Correlation-based data analysis
The correlation heatmap and scatterplot matrix together provide insight into the relationships between key variables such as Base Fluid Ratio (BFR), Temperature (T), Concentration (Vol.%), Thermal Conductivity (TC), and Viscosity, as presented in Fig. 8. From the heatmap (Fig. 8(a)), it is evident that BFR has a strong positive correlation with thermal conductivity (r = 0.95), indicating that increasing the base fluid ratio significantly enhances heat transfer performance. Conversely, BFR shows a strong negative correlation with viscosity (r = −0.76), suggesting that a higher BFR reduces fluid resistance, which is desirable for efficient flow. Temperature has a weak positive correlation with thermal conductivity and a moderate negative correlation with viscosity, implying that higher temperatures slightly improve conductivity while reducing viscosity. Concentration appears to have minimal influence on both thermal conductivity and viscosity, as indicated by its weak correlations.
The scatterplot matrix (Fig. 8 (b)) visually confirms these trends. Clear linear patterns are visible between BFR and TC, as well as BFR and viscosity. In contrast, scatterplots involving concentration and temperature show more dispersed data points, reflecting their limited direct impact. The diagonal plots reveal the distribution of each variable, with BFR and TC showing more structured trends compared to the relatively uniform spread of concentration and temperature. Overall, BFR emerges as a dominant factor affecting both thermal conductivity and viscosity, highlighting its importance in optimizing fluid performance for heat transfer applications.
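For reproducibility, a minimal sketch of how such a heatmap and pair-plot can be generated with `pandas` and `seaborn` is given below. The file name and column labels are placeholders (assumptions, not taken from the study), and the plotting choices are illustrative rather than the exact settings used for Fig. 8.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical file and column names; adjust to the actual dataset layout.
df = pd.read_excel("nanofluid_data.xlsx")
cols = ["BFR", "T (°C)", "Conc. (Vol.%)", "TC", "Viscosity"]

# Pearson correlation heatmap (analogue of Fig. 8(a)).
corr = df[cols].corr(method="pearson")
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation heatmap")
plt.tight_layout()
plt.show()

# Scatterplot (pair-plot) matrix with histograms on the diagonal (analogue of Fig. 8(b)).
sns.pairplot(df[cols], diag_kind="hist")
plt.show()
```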

Correlation analysis (a) heatmap (b) pair-plot.
Model prediction
The dataset employed in this investigation was sourced from a structured Excel sheet consisting of three predictor variables and one target output. We used `pandas` for data handling and `numpy` for numerical computation before feeding these variables to the models. We shuffled the dataset at random to remove any ordering bias and used `SimpleImputer` to fill in missing values. We used `matplotlib` to produce both regression and error plots for visualization and assessment. We investigated eight regression models in all. Two of the key baseline models were LR and DT. We built LR with `sklearn.linear_model` and tuned `fit_intercept`, and we varied the `max_depth` parameter of the DT model, which came from `sklearn.tree`. We evaluated model performance with statistical measures including Mean Squared Error (MSE), Coefficient of Determination (R²), and Kling-Gupta Efficiency (KGE), ensuring a robust assessment framework. We also used the `xgboost` library to deploy the XGBoost model, tuning the hyperparameters `n_estimators`, `max_depth`, and `learning_rate`. All model metrics were stored in a single CSV file, and the prediction and error plots were exported to the local system at high resolution for further analysis and reporting.
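The sketch below illustrates the preprocessing and model-setup steps just described (loading, shuffling, imputation, and the three regressors with the hyperparameter values reported later in Table 3). The file name, column names, split ratio, and random seed are assumptions introduced for illustration only.

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score
from xgboost import XGBRegressor

# Hypothetical file/column names: three predictors and one target per output property.
df = pd.read_excel("dataset.xlsx")
df = df.sample(frac=1.0, random_state=42).reset_index(drop=True)  # shuffle to remove ordering bias

X = df[["BFR", "T", "Conc"]].values
y = df["TC"].values  # or "Viscosity" for the second target

# Fill any missing values (mean imputation assumed).
X = SimpleImputer(strategy="mean").fit_transform(X)

# Assumed 80/20 split; the exact ratio is not stated in this section.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "LR": LinearRegression(fit_intercept=True),
    "DT": DecisionTreeRegressor(max_depth=10, random_state=42),
    "XGBoost": XGBRegressor(n_estimators=100, learning_rate=0.1, max_depth=5, random_state=42),
}

rows = []
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    rows.append({"model": name,
                 "MSE": mean_squared_error(y_te, pred),
                 "R2": r2_score(y_te, pred)})

# Collect all metrics in a single CSV file, as described above.
pd.DataFrame(rows).to_csv("model_metrics.csv", index=False)
```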
Thermal conductivity (TC) model
Figures 9(a) and 9(b) provide insight into the thermal conductivity model’s predictive capability using LR. As evident in Fig. 9(a), the actual vs. predicted scatter plot aligns closely with the 1:1 reference line, with the bulk of predictions falling within the ±10% confidence bounds. The model achieved a Train R² of 0.9745 and a Test R² of 0.9510, indicating good performance in both stages. The MSE values for the Train and Test sets were very low (0.0001 and 0.0003), confirming a close fit. The Train KGE (0.9818) and Test KGE (0.9727) both show that the model is robust and agrees well with the observations. In Fig. 9(b), the residuals for both the train and test sets remain within ±0.04, show little spread, and are largely unbiased across the sample index. This indicates that LR was able to capture the linear relationship in the thermal conductivity dataset accurately using a simple `fit_intercept = True` setting, without major systematic errors. The model performance measures in Table 2 show that LR can estimate thermal conductivity accurately and generalizes well. Although simple, LR remains accurate across folds and sample points and therefore serves as a good foundation model for incorporating additional non-linear features.
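Because KGE is not provided by `scikit-learn`, a minimal sketch of computing it is shown below, assuming the standard Kling-Gupta formulation built from the linear correlation, the variability ratio, and the bias ratio; the exact variant used in the study is not stated here, so this is an assumed implementation.

```python
import numpy as np

def kling_gupta_efficiency(y_obs, y_pred):
    """KGE = 1 - sqrt((r - 1)^2 + (alpha - 1)^2 + (beta - 1)^2)."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    r = np.corrcoef(y_obs, y_pred)[0, 1]       # linear correlation
    alpha = np.std(y_pred) / np.std(y_obs)     # variability ratio
    beta = np.mean(y_pred) / np.mean(y_obs)    # bias ratio
    return 1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)
```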
Figures 9(c) and 9(d) indicate that the DT model improves on the LR baseline in predicting thermal conductivity values. Figure 9(c) shows that both the training and test points lie close to the 1:1 line, with most of them within the ±10% range. DT delivers very strong statistical performance, with a Train R² of 1.0000 and a Test R² of 0.9815. The Train MSE (0.0000) and Test MSE (0.0001) are also very low, indicating that the regression splits are highly accurate. The Train and Test KGE values of 1.0000 and 0.9646, respectively, likewise point to strong correlation, accurate magnitude reproduction, and little bias. Figure 9(d) shows very little residual dispersion in the training predictions (clustered at zero), while the test residuals vary little, remaining generally within ±0.03. The model’s setting of `max_depth = 10` (Table 3) appears well suited to capturing the complexity of the thermal conductivity data without introducing instability. These indicators demonstrate how robust and adaptable DT is at mapping correlations in structured thermal conductivity datasets. Although the test residuals differ somewhat from those of LR, DT consistently maintains pattern integrity and yields more accurate predictions in the thermal models.
Using ensemble boosting, Figs. 9(e) and 9(f) show how well the XGBoost model predicts thermal conductivity. Figure 9(e) shows that the actual and predicted data points agree closely, clustering tightly around the 1:1 diagonal. Almost all of the test and train samples lie inside the ±10% range, and the metrics are excellent: the Train R² is 0.9999 and the Test R² is 0.9941, both close to their theoretical maximum. The MSE values (Train: 0.0000, Test: 0.0000) are the lowest of all the models, and the KGE scores (Train: 0.9985, Test: 0.9613) show that correlation, variance match, and bias reduction are all highly consistent. The test residuals in Fig. 9(f) remain mostly close to zero, with little spread even at higher indices. The model’s tuned settings (Table 3), namely `n_estimators = 100`, `learning_rate = 0.1`, and `max_depth = 5`, appear well suited to the thermal conductivity task. XGBoost is the best-performing algorithm for predicting thermal conductivity because it strikes the optimal balance between accuracy and generalization: it outperforms LR and DT in both the numerical metrics and the distribution of residual errors. This makes it especially suitable for modelling high-fidelity thermal properties in nanofluid or composite systems.
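Parity plots with ±10% bounds and residual plots of the kind shown in Fig. 9 can be generated along the lines of the sketch below. It assumes `y_te` and the fitted models from the earlier pipeline sketch; the figure sizes, marker styles, and output file name are illustrative choices, not the study's settings.

```python
import numpy as np
import matplotlib.pyplot as plt

def parity_and_residual_plots(y_true, y_pred, label="XGBoost"):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

    # Parity plot: 1:1 line plus ±10% bounds.
    lo, hi = float(np.min(y_true)), float(np.max(y_true))
    line = np.linspace(lo, hi, 100)
    ax1.scatter(y_true, y_pred, s=20, alpha=0.7)
    ax1.plot(line, line, "k-", label="1:1 line")
    ax1.plot(line, 1.10 * line, "k--", label="+10%")
    ax1.plot(line, 0.90 * line, "k--", label="-10%")
    ax1.set_xlabel("Actual"); ax1.set_ylabel("Predicted")
    ax1.set_title(f"Actual vs. predicted ({label})"); ax1.legend()

    # Residuals (prediction error) against sample index.
    residuals = np.asarray(y_pred) - np.asarray(y_true)
    ax2.scatter(range(len(residuals)), residuals, s=20, alpha=0.7)
    ax2.axhline(0.0, color="k", linewidth=1)
    ax2.set_xlabel("Sample index"); ax2.set_ylabel("Prediction error")
    ax2.set_title(f"Prediction error ({label})")

    fig.tight_layout()
    fig.savefig(f"{label}_performance.png", dpi=300)  # high-resolution export
```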

Thermal conductivity model performance (a) Actual vs. Predicted values (LR) (b) Prediction error (LR) (c) Actual vs. Predicted values (DT) (d) Prediction error (DT) (e) Actual vs. Predicted values (XGBoost) (f) Prediction error (XGBoost).
Viscosity models
The predictive modeling of viscosity (mPa·s) using three ML algorithms—LR, DT, and Extreme Gradient Boosting (XGBoost)—is comprehensively illustrated in Figs. 10(a) to (f). The statistical metrics in Table 2 and the hyperparameter optimization details in Table 3 support these outcomes. The discussion evaluates the models in terms of how well they capture trends, their statistical strength, and their consistency across the training and testing sets. Figure 10(a) compares the actual viscosity values with those predicted by the LR model. The LR model achieves only moderate explanatory power, with a training coefficient of determination (R²) of 0.8047 and a testing R² of 0.8561. The MSE values for the training and testing sets are 1.0571 and 0.6937, respectively. The KGE values of 0.8544 (train) and 0.8280 (test) further indicate that the model captures correlation, variability, and bias only moderately well when taken together. Table 3 shows that the LR model used a simple `fit_intercept = True` setting. Some of the test predictions in Fig. 10(a) fall outside the ±10% range, revealing the limits of a basic linear model in representing the non-linear behaviour of viscosity. The error scatter plot in Fig. 10(b) supports this further: the residuals are widely spread, particularly for higher-index samples, suggesting that the fit is not uniform across the whole dataset range.
Figures 10(c) and 10(d) illustrate the DT model, which represents a substantial improvement over LR. It achieved excellent training accuracy (R² = 1.0000, MSE = 0.0000, KGE = 1.0000) and good generalization on the test set (R² = 0.9468, MSE = 0.2565, KGE = 0.9245). The model performs well with its `max_depth` parameter set to 10, identified as optimal within the searched range of 3 to 20 (Table 3). Figure 10(c) shows that almost all of the predictions fall within the ±10% range, with the data points clustered tightly around the optimum-fit line, demonstrating that the method consistently captures the non-linear relationships. The error distribution in Fig. 10(d) is narrower and more symmetrical than that of LR, indicating fewer deviations and a successfully learned input-output mapping. XGBoost, shown in Figs. 10(e) and (f), produced the most accurate predictions of all the models tested. As shown in Table 3, the model was set up with `n_estimators` = 100, `learning_rate` = 0.1, and `max_depth` = 5. With this setup, the training R² was 0.9999, the testing R² was 0.9944, the training KGE was 0.9991, and the testing KGE was 0.9903. The MSE values were 0.0008 for training and 0.0269 for testing. These figures show that XGBoost is the most accurate and balanced algorithm for predicting viscosity. Figure 10(e) shows that the observed and predicted values agree almost exactly, with all data points falling within the ±10% range, and Fig. 10(f) shows an error profile that is almost flat, with very little variation over the sample range, confirming highly consistent predictions with minimal bias.
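The search over `max_depth` values from 3 to 20 mentioned above can be reproduced with a simple grid search, as sketched below. The scoring metric and number of cross-validation folds are assumptions, since they are not specified in this section; `X_tr` and `y_tr` are taken from the earlier pipeline sketch.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Assumed 5-fold CV scored by negative MSE.
search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid={"max_depth": list(range(3, 21))},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X_tr, y_tr)
print(search.best_params_)  # e.g. {'max_depth': 10}, as reported in Table 3
```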
To sum up, LR provides an easy-to-interpret baseline, but it cannot capture the complex, non-linear relationships in the data. The DT model is better suited to modelling this non-linear behaviour, whereas XGBoost is more accurate and stable still. Figure 10 and Tables 2 and 3 show that XGBoost is the best model for predicting viscosity in this investigation, both statistically and visually.

Viscosity model performance (a) Actual vs. Predicted values (LR) (b) Prediction error (LR) (c) Actual vs. Predicted values (DT) (d) Prediction error (DT) (e) Actual vs. Predicted values (XGBoost) (f) Prediction error (XGBoost).
Explainable machine learning based transparent model
Figure 11 illustrates the global and local interpretability of the XGBoost model applied to the prediction task, using SHAP and LIME methodologies, respectively. The top part of Fig. 11(a) shows a SHAP summary plot depicting how much each input feature—BFR, T (°C), and Conc. (Vol.%)—contributes to the model’s output across the whole dataset. The horizontal position of each point shows the SHAP value for that observation, and the colour gradient shows the magnitude of the feature value (blue = low, red = high). BFR is the most important feature, since greater BFR values are consistently associated with positive SHAP values, indicating a substantial positive effect on the model prediction. Next in significance are T (°C) and Conc. (Vol.%), which both display mixed SHAP contributions depending on their value. Lower temperature and concentration levels tend to lower the predicted output, whereas higher values either raise it or have a smaller effect. The bottom panel (Fig. 11(b)) shows a LIME-based local explanation for one specific prediction, displaying the three most influential decision rules for that case. The criterion BFR > 0.60 has a large positive contribution (green bar), in line with SHAP’s global view. In contrast, Conc. (Vol.%) < 0.25 and T (°C) ≤ 33.75 exhibit negative contributions (red bars), meaning that these feature values lowered the predicted output for this instance. Overall, both interpretability frameworks support the conclusion that BFR is the primary driver of the model’s decisions, with temperature and concentration contributing only under certain conditions. This dual approach improves model transparency and supports data-driven optimization strategies.
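A SHAP summary of the kind shown in Fig. 11(a) can be produced roughly as follows for the fitted XGBoost model; the feature names and the `models`/`X_tr` objects are assumed from the earlier pipeline sketch, and `TreeExplainer` is used here because it is the usual choice for tree ensembles, not because the study states it.

```python
import pandas as pd
import shap

feature_names = ["BFR", "T (°C)", "Conc. (Vol.%)"]
X_df = pd.DataFrame(X_tr, columns=feature_names)

# Compute per-sample, per-feature SHAP contributions for the tree ensemble.
explainer = shap.TreeExplainer(models["XGBoost"])
shap_values = explainer.shap_values(X_df)

# Beeswarm-style summary: one dot per sample per feature, coloured by feature value.
shap.summary_plot(shap_values, X_df, feature_names=feature_names)
```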

Explainable ML based feature analysis of thermal conductivity model (a) SHAP values chart (b) LIME based local explanation of feature importance.
Figure 12 presents a dual-perspective interpretability analysis for the viscosity model, combining global SHAP insights (Fig. 12(a)) with a local LIME breakdown (Fig. 12(b)). In the SHAP summary, each dot represents an individual sample’s contribution to the predicted viscosity, with features ranked by overall influence: BFR exerts the greatest effect, followed by temperature (T, °C) and concentration (Conc., Vol.%). The horizontal axis quantifies the impact on model output, where values to the right increase predicted viscosity and those to the left decrease it. High BFR (red points) consistently shifts SHAP values toward positive extremes (up to +5), indicating that elevated base fluid ratios strongly elevate the predicted viscosity. Conversely, low BFR (blue) yields negative contributions (down to −3), damping viscosity predictions. Temperature exhibits a more symmetrical distribution: high T occasionally raises viscosity (positive SHAP) but often lowers it when paired with other features. Concentration displays subtle but discernible effects, with midrange values clustering near zero impact and extremes causing slight positive or negative shifts.
Beneath the global view, the LIME plot zooms in on one specific instance to reveal feature-level decision rules. Here, the condition BFR > 0.60 appears as a major negative driver (red bar, approximately −2.5), suggesting that once the base fluid ratio crosses this threshold, the model predicts a markedly lower viscosity than the baseline. In contrast, T ≤ 33.75 °C contributes positively (green bar, around +2.0), meaning that the cooler temperature in this sample tends to elevate the predicted viscosity. Lastly, Conc. (Vol.%) ≤ 0.25 exerts a modest negative influence (red bar, about −1.0), subtly lowering the local prediction. By juxtaposing SHAP’s dataset-wide feature importance with LIME’s case-specific rule contributions, Fig. 12 elucidates both the overarching drivers and the nuanced interactions that govern the viscosity predictions, thereby fostering deeper transparency and guiding targeted formulation adjustments.
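A local explanation of the kind shown in Figs. 11(b) and 12(b) can be generated with the `lime` package roughly as sketched below. The explained instance (index 0), the feature names, and the `models`/`X_tr`/`X_te` objects are assumptions carried over from the earlier pipeline sketch, not details taken from the study.

```python
import numpy as np
from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(
    training_data=np.asarray(X_tr),
    feature_names=["BFR", "T (°C)", "Conc. (Vol.%)"],
    mode="regression",
)

# Explain a single (hypothetical) test instance using all three features.
exp = explainer.explain_instance(
    np.asarray(X_te)[0],
    models["XGBoost"].predict,
    num_features=3,
)
print(exp.as_list())       # threshold rules with signed contributions
exp.as_pyplot_figure()     # bar chart analogous to the LIME panels
```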

Explainable ML based feature analysis of Viscosity model (a) SHAP values chart (b) LIME based local explanation of feature importance.
