Advanced hybrid machine learning-based modeling to predict ionic liquid properties at different temperatures

Machine Learning


Ionic liquid dataset

A large dataset covering 69 ionic liquids, compiled from previous sources, was used in this study[1]. The total number of data points considered is 1042, with IL structure and temperature as inputs and surface tension as output. Details of the parameters and their ranges are reported elsewhere[1]. Surface tension at constant pressure (0.101 MPa) ranges between 18.7 and 70.3 mN/m, and temperature between 268.29 and 532.4 K[20]. Taking temperature and chemical structure as inputs, this study selected surface tension as the target output for modeling[21]. The complete dataset is reported in[1].

Figure 1 shows the Pearson correlation matrix for the dataset, highlighting the linear association between all pairs of input and output features. The Pearson correlation coefficient (r) varies between −1 and +1: +1 represents a perfect positive linear relationship, −1 a perfect negative linear relationship, and 0 indicates no linear relationship. This plot was used as a preliminary data exploration step to identify potential multicollinearity and to understand how strongly the input features (such as temperature and molecular descriptors) were associated with the target variable (surface tension). Correlation coefficients were calculated directly from the normalized dataset using standard statistical methods. The insights gained guided both the feature selection and the structure of the machine learning models applied in later stages of the study.
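A Pearson correlation matrix of this kind can be computed directly with pandas. The sketch below uses synthetic stand-in data (the column names, sample size, and generating model are illustrative, not the paper's ionic-liquid dataset); only the temperature range matches the values reported above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: temperature (K) and a generic molecular
# descriptor as inputs, surface tension (mN/m) as output. The linear
# temperature dependence here is illustrative only.
rng = np.random.default_rng(0)
temperature = rng.uniform(268.29, 532.4, size=200)
descriptor = rng.normal(size=200)
surface_tension = 90.0 - 0.1 * temperature + rng.normal(scale=1.0, size=200)

df = pd.DataFrame({
    "temperature": temperature,
    "descriptor": descriptor,
    "surface_tension": surface_tension,
})

# Pearson correlation matrix: every entry lies in [-1, +1], and the
# diagonal is exactly 1 (each feature correlates perfectly with itself).
corr = df.corr(method="pearson")
print(corr)
```

For this synthetic data the temperature/surface-tension entry is strongly negative, the kind of association the matrix in Figure 1 is meant to reveal.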

Figure 1

Pearson correlation matrix of the dataset.

A series of preprocessing steps were applied to the dataset to ensure high-quality data analysis. Outlier detection was first performed using Cook's distance to identify influential points that could disproportionately affect the regression model. These outliers were carefully evaluated and removed as necessary to improve the robustness of the analysis.
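Cook's distance for ordinary least squares can be computed from the residuals and the leverages (the diagonal of the hat matrix). The following is a minimal sketch on synthetic data with one injected outlier; the 4/n cutoff is a common rule of thumb, and the data, threshold, and variable names are illustrative assumptions, not the study's procedure.

```python
import numpy as np

# Synthetic regression data with one deliberately influential outlier.
rng = np.random.default_rng(1)
n = 50
x = rng.uniform(270.0, 530.0, size=n)
y = 90.0 - 0.1 * x + rng.normal(scale=0.5, size=n)
y[0] += 15.0  # inject a gross outlier at index 0

# OLS fit with an intercept column.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Leverages h_i (diagonal of the hat matrix) and residual variance s^2.
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
p = X.shape[1]
s2 = resid @ resid / (n - p)

# Cook's distance: D_i = e_i^2 / (p * s^2) * h_i / (1 - h_i)^2
cooks_d = resid**2 / (p * s2) * h / (1.0 - h) ** 2

# Rule-of-thumb flag: points with D_i > 4/n are candidate outliers.
outliers = np.where(cooks_d > 4.0 / n)[0]
print(outliers)
```

The injected point is flagged while ordinary points fall well below the cutoff, which is the behavior the preprocessing step above relies on.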

Following outlier detection, we normalized the dataset using a Min-Max scaler. This scaling method transforms feature values into the common range [0, 1], allowing the model to operate effectively across features of different scales and minimizing the influence of any single feature's magnitude.

Principal component analysis (PCA) was applied for dimensionality reduction. PCA helped reduce the dimensionality of the dataset while preserving as much variance as possible, improving computational efficiency and interpretability. The resulting principal components serve as input to the machine learning model, ensuring a balance between complexity and performance.
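PCA can be implemented via singular value decomposition of the centered data, retaining enough components to cover a chosen variance fraction. The sketch below uses synthetic data with one nearly redundant feature; the 95% variance threshold is an illustrative choice, not necessarily the one used in the study.

```python
import numpy as np

# Synthetic normalized data with a nearly redundant feature (column 3
# almost duplicates column 0), so PCA can drop at least one dimension.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
X[:, 3] = X[:, 0] + 0.01 * rng.normal(size=100)

# PCA via SVD on the centered data; singular values come out sorted
# in decreasing order, so explained variance ratios are sorted too.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / np.sum(S**2)

# Keep the smallest k whose cumulative explained variance reaches 95%.
k = int(np.searchsorted(np.cumsum(explained), 0.95) + 1)
components = Xc @ Vt[:k].T  # reduced inputs for the downstream ML models
print(k, components.shape)
```

Because the redundant direction carries almost no variance, fewer than the original five components suffice, which is exactly the complexity/performance trade-off described above.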

Decision Tree (DT)

Trees are a fundamental data structure widely used across various subfields of artificial intelligence. Among them, decision trees are a useful and intuitive ML technique commonly implemented for classification, regression, and predictive modeling tasks. A decision tree consists of decision nodes, connecting edges, and leaf nodes that provide the final result or prediction[13,22]. The prediction or outcome of a DT is given by a leaf node[12,23,24]. DTs can be induced by various algorithms, including CART[24], CHAID[13], C4.5, and C5.0[25].
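A minimal regression-tree sketch with scikit-learn, on synthetic temperature/surface-tension data (the data, depth, and leaf size are illustrative assumptions, not the study's tuned model):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: temperature (K) as the single input, surface tension
# (mN/m) as output, with an illustrative linear trend plus noise.
rng = np.random.default_rng(3)
T = rng.uniform(268.29, 532.4, size=(300, 1))
sigma = 90.0 - 0.1 * T[:, 0] + rng.normal(scale=0.5, size=300)

# Decision nodes split on temperature thresholds; each leaf node stores
# the mean surface tension of the training samples that reach it.
tree = DecisionTreeRegressor(max_depth=4, min_samples_leaf=5, random_state=0)
tree.fit(T, sigma)

pred = tree.predict([[300.0]])
print(pred)
```

A query descends from the root through the decision nodes until it reaches a leaf, whose stored value is the prediction, matching the node/edge/leaf structure described above.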

Random Forest (RF)

RF is an ensemble learning model that builds a collection of decision trees and aggregates their predictions to increase accuracy and reduce overfitting[15]. This method incorporates randomness through two main mechanisms: bootstrapping (sampling with replacement) the training data for each tree and selecting a random subset of features at each split node[26,27].

In contrast to a single decision tree, which tends to overfit the training data, a random forest generates an ensemble of trees, each trained on a unique random subset of the dataset. Furthermore, at each node, the algorithm evaluates only a randomly selected subset of features to determine the optimal split. This blend of data and feature randomness enhances the generalization ability and overall resilience of the model.

After training an ensemble of trees, the final prediction for the regression task is derived by averaging the outputs of all individual trees. The main hyperparameters that affect the performance of a random forest include the number of trees in the forest, the minimum number of samples required to split a node, and the number of features evaluated at each split.
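The mechanics above map directly onto scikit-learn's random-forest regressor. This is a sketch on synthetic data; the hyperparameter values are illustrative placeholders, not the Harmony-Search-optimized settings from the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic multi-feature data standing in for temperature plus
# molecular descriptors; the target trend is illustrative only.
rng = np.random.default_rng(4)
X = rng.uniform(268.29, 532.4, size=(400, 3))
y = 90.0 - 0.1 * X[:, 0] + 0.01 * X[:, 1] + rng.normal(scale=0.5, size=400)

rf = RandomForestRegressor(
    n_estimators=200,      # number of trees in the forest
    min_samples_split=4,   # minimum samples required to split a node
    max_features="sqrt",   # random feature subset evaluated at each split
    random_state=0,        # bootstrap sampling is on by default
)
rf.fit(X, y)

# For regression, the forest's prediction is the mean of all tree outputs.
r2_train = rf.score(X, y)
print(r2_train)
```

The three constructor arguments correspond one-to-one to the hyperparameters listed above, which is what the Harmony Search procedure later tunes.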

The effectiveness of RF models lies in their ability to reduce variance without significantly increasing bias. This makes them particularly effective at managing high-dimensional data and datasets characterized by complex non-linear relationships. In this study, we tuned the RF hyperparameters with the Harmony Search algorithm to accurately predict the surface tension values of various ionic liquids.

Extremely randomized tree (ET)

ET is a model that generates an ensemble of decision trees without relying on the standard deterministic top-down split search[28]. ET constructs a set of DTs in a manner similar to previous tree-based ensemble models, but with an emphasis on randomization to reduce variance without significantly increasing bias[29].

Randomness is built into the tree growth process by creating split nodes with randomly selected features and cut points. In particular, each tree is built on the entire training set rather than on bootstrap samples. Randomization and ensemble averaging reduce the potential variance of the DT without introducing bias, since the original training samples are reused rather than bootstrap replicas. This method is mainly governed by three parameters: the minimum number of instances required to split a node (n_min), the number of features considered at each internal node (K), and the desired number of ensemble trees. After the ensemble is built, the individual tree predictions are aggregated (by majority vote for classification, or by averaging for regression) to obtain the final estimate of the ET model[30,31].
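In scikit-learn's Extra-Trees regressor, the three parameters named above map to `min_samples_split` (n_min), `max_features` (K), and `n_estimators` (tree count). A sketch on synthetic data with illustrative parameter values:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

# Synthetic stand-in data; the target relationship is illustrative.
rng = np.random.default_rng(5)
X = rng.uniform(268.29, 532.4, size=(400, 3))
y = 90.0 - 0.1 * X[:, 0] + rng.normal(scale=0.5, size=400)

et = ExtraTreesRegressor(
    n_estimators=200,     # desired number of ensemble trees
    max_features=2,       # K: features drawn at each internal node
    min_samples_split=4,  # n_min: minimum instances to split a node
    bootstrap=False,      # grow every tree on the full training set
    random_state=0,       # cut points are still chosen at random
)
et.fit(X, y)

r2 = et.score(X, y)
print(r2)
```

Note `bootstrap=False`: unlike RF, each tree sees the complete training set, and the extra randomness comes entirely from the randomized feature and cut-point selection.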

Hyperparameter tuning

Hyperparameter optimization was performed using the Harmony Search (HS) algorithm, a metaheuristic inspired by the way musicians improvise in search of a harmonious result. We used HS to optimize the parameters of each ML model, facilitating the exploration of different parameter combinations to enhance predictive accuracy and model generalization.

In this study, we randomly split the dataset into two segments, with 80% designated for training and 20% for testing. The training segment was used for model development and refinement, and the test segment for independent performance evaluation. Three-fold cross-validation (CV) was applied within the training set to enhance the robustness of the model and reduce the risk of overfitting. Three-fold CV was adopted because of the relatively limited data size (1042 instances); future studies may consider more folds (e.g., 5- or 10-fold CV) for more thorough validation. The cross-validated score was chosen as the fitness function because it directly reflects performance on the regression task. The three-fold CV approach helped alleviate overfitting problems often associated with small datasets by averaging model performance across multiple folds. It also increases the robustness and reliability of the selected parameters by exposing the model to diverse subsets of the data, ultimately ensuring a better balance between variance and bias.
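The 80/20 split with 3-fold CV on the training portion can be sketched as follows. The data are synthetic (only the dataset size of 1042 matches the study), and the model settings are placeholders:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, cross_val_score

# Synthetic dataset of the same size as the study (1042 points).
rng = np.random.default_rng(6)
X = rng.uniform(268.29, 532.4, size=(1042, 2))
y = 90.0 - 0.1 * X[:, 0] + rng.normal(scale=0.5, size=1042)

# Random 80/20 split: training for development, test held out.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 3-fold CV on the training segment only; the mean R^2 across folds
# is the kind of fitness value the Harmony Search loop consumes.
model = RandomForestRegressor(n_estimators=100, random_state=0)
cv_r2 = cross_val_score(model, X_train, y_train, cv=3, scoring="r2")
print(cv_r2.mean())
```

Keeping the test split entirely outside the CV loop is what makes the final 20% evaluation an independent estimate of generalization.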

Harmony Search iteratively explored the hyperparameter space and adjusted key parameters, such as maximum tree depth and minimum samples per leaf (for tree-based models), until an optimal parameter set that maximized the cross-validated R² score was found. This systematic adjustment process allows the model to achieve high predictive performance while reducing the risk of overfitting, making HS a suitable choice to handle the complexity and limited size of ionic liquid datasets.
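A minimal Harmony Search sketch (not the authors' implementation) tuning a tree's `max_depth` and `min_samples_leaf` by cross-validated R². The harmony memory size (HMS), memory-consideration rate (HMCR), pitch-adjustment rate (PAR), iteration budget, and parameter bounds are all illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data for the tuning demonstration.
rng = np.random.default_rng(7)
X = rng.uniform(268.29, 532.4, size=(300, 1))
y = 90.0 - 0.1 * X[:, 0] + rng.normal(scale=0.5, size=300)

bounds = [(2, 12), (1, 10)]     # (max_depth, min_samples_leaf) ranges
HMS, HMCR, PAR, iters = 6, 0.9, 0.3, 30  # illustrative HS settings

def fitness(h):
    """Mean 3-fold cross-validated R^2 for a parameter vector h."""
    model = DecisionTreeRegressor(max_depth=h[0], min_samples_leaf=h[1],
                                  random_state=0)
    return cross_val_score(model, X, y, cv=3, scoring="r2").mean()

# Initialize the harmony memory with random parameter vectors.
memory = [[rng.integers(lo, hi + 1) for lo, hi in bounds] for _ in range(HMS)]
scores = [fitness(h) for h in memory]

for _ in range(iters):
    new = []
    for j, (lo, hi) in enumerate(bounds):
        if rng.random() < HMCR:            # recall a value from memory...
            v = memory[rng.integers(HMS)][j]
            if rng.random() < PAR:         # ...with an optional pitch adjust
                v = int(np.clip(v + rng.integers(-1, 2), lo, hi))
        else:                              # or improvise a fresh value
            v = int(rng.integers(lo, hi + 1))
        new.append(v)
    s = fitness(new)
    worst = int(np.argmin(scores))
    if s > scores[worst]:                  # keep the new harmony if better
        memory[worst], scores[worst] = new, s

best = memory[int(np.argmax(scores))]
print(best, max(scores))
```

Each iteration improvises one candidate from memory recall, pitch adjustment, and fresh randomization, then replaces the worst stored harmony only if the candidate's cross-validated score is higher, which is how the search converges toward a high-R² parameter set.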


