Machine learning models for the prediction of hydrogen solubility in aqueous systems



Data collection & processing

In this study, the data required to develop high-accuracy and efficient machine learning models were collected from previous research available in various databases, with the aim of ensuring comprehensive and extensive coverage of information. Figure 1 presents the details of the extracted data, and box plots were used to analyze data distribution and identify key features. These plots provide insights into the central tendency and dispersion of the data.

The data were selected to cover a wide range of physical and chemical variables, including temperature, pressure, and salinity. Initially, a total of 1020 data points were collected from various literature sources. Following outlier detection and removal using the Gaussian method, the final dataset comprised 992 valid samples, which were subsequently used for machine learning model development. Additionally, the data include information on both pure and saline water, which enhances the diversity and applicability of the models. The broad range of data ensures that machine learning models can predict and analyze complex scenarios effectively. The necessary statistical information is detailed in Table 2. The collected dataset was compiled from the following literature sources: Chabab et al.59, Ollarves & Trusler60, Haza et al.61, Kling & Maurer62, Ruetschi & Amlie63, Wiebe & Gaddy64, Crozier & Yamamoto65, Gordon et al.66, and Morrison & Billett67.

The final dataset of 992 data points was randomly divided into training and testing subsets using varying ratios ranging from 10 to 90%. For each ratio, ten independent random splits were performed, and the average performance metrics were reported.
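This split-and-average protocol can be sketched as follows; the NumPy snippet below uses synthetic placeholder features and an ordinary least squares fit standing in for the trained models, so the data and variable names are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(size=(992, 3))              # placeholder features (e.g., T, P, salinity)
y = X @ np.array([1.0, 2.0, 0.5]) + 0.1 * rng.normal(size=992)

def split_once(X, y, train_frac, seed):
    """Randomly permute the indices and split at the given training fraction."""
    idx = np.random.default_rng(seed).permutation(len(y))
    n_train = int(train_frac * len(y))
    return X[idx[:n_train]], y[idx[:n_train]], X[idx[n_train:]], y[idx[n_train:]]

def avg_test_rmse(X, y, train_frac, n_repeats=10):
    """Average test RMSE over ten independent random splits, as in the text."""
    rmses = []
    for seed in range(n_repeats):
        Xtr, ytr, Xte, yte = split_once(X, y, train_frac, seed)
        beta, *_ = np.linalg.lstsq(Xtr, ytr, rcond=None)   # stand-in model
        rmses.append(np.sqrt(np.mean((Xte @ beta - yte) ** 2)))
    return float(np.mean(rmses))

scores = {frac: avg_test_rmse(X, y, frac) for frac in (0.1, 0.5, 0.9)}
```

Reporting the mean over several seeded splits, rather than a single split, reduces the variance of the performance estimate at each ratio.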

Fig. 1

The range of input data used from previous studies available in various databases.

For data analysis, various tools such as histograms, heat maps, box plots, and violin plots are widely used. A histogram displays the distribution of a numerical variable by dividing the data into intervals and showing the frequency of each interval as bars. A heat map uses colors to represent the intensity or value of data in a two-dimensional matrix, and it is useful for analyzing correlation matrices and multidimensional data. A box plot provides a summary of data distribution by displaying the median, quartiles, and interquartile range, and it is suitable for identifying outliers and comparing distributions. Meanwhile, a violin plot, which combines a box plot and a density plot, not only shows the median and quartiles but also illustrates the shape of the data distribution using a density curve, making it useful for more detailed analysis and identifying the multimodality of distributions.

In this paper, the method of outlier removal using the Gaussian approach has been employed to enhance data quality. To demonstrate and visualize the improvement in data quality, various tools such as histograms, heat maps, box plots, and violin plots have been used. The process of this study is divided into three main stages:

  1. Analysis of the data (Figs. 2 and 3).

  2. Identification and visualization of outliers (where the red triangle represents the outliers and the “*” symbol indicates the remaining consistent data) (Fig. 4).

Fig. 2
Fig. 3

Specialized analysis of input data before and after outlier removal: box plots, histograms, and violin plots.

The statistical analysis confirms that temperature and pressure are among the most influential parameters affecting hydrogen solubility, as demonstrated by the trends observed in the box plots and regression plots.

The Gaussian Removal method, or outlier removal using the Gaussian distribution, is a statistical technique used to identify and remove outliers in a dataset. This method is based on the assumption that the data follows a normal distribution (Gaussian distribution), and data points that deviate significantly from this distribution are identified as outliers.

The process of the Gaussian Removal method consists of three main steps. First, the mean \(\:\left(\mu\:\right)\) and standard deviation \(\:\left(\sigma\:\right)\) of the dataset are calculated. Then, data points are flagged as outliers if their distance from the mean exceeds a chosen multiple of the standard deviation; the acceptance interval is usually written as \(\:\mu\:\pm\:k\sigma\:\), where \(\:k\) is a constant that determines the sensitivity to outliers. Finally, data points that fall outside this range are identified as outliers and removed.

Formulas for calculating mean and standard deviation:

$$\:\mu\:=\frac{1}{n}\sum\:_{i=1}^{n}{x}_{i}$$

(3)

$$\:\sigma\:=\sqrt{\frac{1}{n}\sum\:_{i=1}^{n}{\left({x}_{i}-\mu\:\right)}^{2}}$$

(4)

Where \(\:{x}_{i}\) are the data points, \(\:n\) is the number of data points, and \(\:\mu\:\) is the mean of the data.

Gaussian removal equation:

Data points identified as outliers follow the following equation:

$$\:P\left(x\right)=\frac{1}{\sigma\:\sqrt{2\pi\:}}exp\left(-\frac{{\left(x-\mu\:\right)}^{2}}{{2\sigma\:}^{2}}\right)$$

(5)

Where \(\:P\left(x\right)\) is the probability density for data point \(\:x\), \(\:\mu\:\) is the mean of the data, \(\:\sigma\:\) is the standard deviation of the data and \(\:x\) are the data points.

Data points with very low probability values (i.e., significantly distant from the mean) are identified as outliers and removed.
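The three steps above reduce to a z-score filter over the interval \(\:\mu\:\pm\:k\sigma\:\); a minimal sketch, in which the array values and the choice of \(\:k\) are purely illustrative:

```python
import numpy as np

def gaussian_outlier_mask(x, k=3.0):
    """Flag points farther than k standard deviations from the mean (mu ± k·sigma)."""
    mu, sigma = x.mean(), x.std()
    return np.abs(x - mu) > k * sigma

data = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 25.0])  # 25.0 is an obvious outlier
mask = gaussian_outlier_mask(data, k=2.0)             # boolean mask of outliers
cleaned = data[~mask]                                 # consistent data that is retained
```

Points with large \(\:\left|x-\mu\:\right|\) are exactly those with very low probability density under Eq. (5), so thresholding the distance is equivalent to thresholding \(\:P\left(x\right)\).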

Fig. 4

Outlier removal using the Gaussian method.

Machine learning methods

Bayesian linear regression

Bayesian Linear Regression was employed to incorporate prior distributions over the model parameters, enabling regularization and uncertainty quantification during training. The Bayesian method is one of the fundamental approaches in machine learning, utilizing Bayesian probability principles for modeling and learning from data. This method combines prior knowledge and evidence from data. The core foundation of this approach is Bayes’ Theorem, which establishes a relationship between conditional probabilities (Fig. 5). Bayes’ Theorem is expressed as:

$$\:P\left(H|D\right)=\frac{P\left(D|H\right)P\left(H\right)}{P\left(D\right)}$$

(6)

In this equation, \(\:P\left(H|D\right)\) represents the posterior probability, which indicates the likelihood of hypothesis \(\:H\) given the evidence \(\:D\). \(\:P\left(D|H\right)\) is the likelihood, showing how probable the data \(\:D\) is under the assumption that hypothesis \(\:H\) is true. \(\:P\left(H\right)\) is the prior probability, representing initial knowledge about hypothesis \(\:H\), and \(\:P\left(D\right)\) is the marginal likelihood, acting as a normalizing factor.

In machine learning, \(\:H\) typically denotes the model or its parameters, while \(\:D\) represents the training data. The goal is to estimate the posterior probability \(\:P\left(H|D\right)\) to learn the model or its parameters. Bayesian methods are broadly categorized into parametric Bayesian learning and non-parametric Bayesian learning. In parametric learning, the model parameters are assumed to be fixed but unknown. For instance, if \(\:\theta\:\) represents the model parameters, the posterior distribution is expressed as:

$$\:P\left(\theta\:|D\right)=\frac{P\left(D|\theta\:\right)P\left(\theta\:\right)}{P\left(D\right)}$$

(7)

In contrast, non-parametric Bayesian learning is employed when the number of parameters or the model structure is unknown. This approach is commonly used in models like Gaussian Processes or Bayesian clustering, where model complexity adjusts automatically based on the data.

One of the key applications of Bayesian methods in machine learning is prediction. Predictions are made using the posterior expectation. For example, the prediction of \(\:{y}^{*}\) for a new data point \(\:{x}^{*}\) is calculated as:

$$\:P\left({y}^{*}|{x}^{*},D\right)=\int\:P\left({y}^{*}|{x}^{*},\theta\:\right)P\left(\theta\:|D\right)d\theta\:$$

(8)

Here, \(\:P\left({y}^{*}|{x}^{*},\theta\:\right)\) is the predictive probability of the output \(\:{y}^{*}\) given the input \(\:{x}^{*}\) and parameters \(\:\theta\:\), and \(\:P\left(\theta\:|D\right)\) represents the posterior distribution of the parameters given the data. This integral is often intractable and is approximated using methods such as Markov chain Monte Carlo (MCMC) sampling or other approximation techniques.

Bayesian methods are applied in various models, such as the Naive Bayes Classifier and Bayesian Networks. In the Naive Bayes Classifier, it is assumed that the features are independent of each other, and the probability of a class \(\:C\) given features \(\:{x}_{1},\:{x}_{2},\:\dots\:,\:{x}_{n}\) is calculated as:

$$\:P\left(C|{x}_{1},\:{x}_{2},\:\dots\:,\:{x}_{n}\right)\propto\:P\left(C\right)\prod\:_{i=1}^{n}P\left({x}_{i}|C\right)$$

(9)

In contrast, Bayesian Networks utilize causal relationships between variables. In these models, nodes represent variables, and edges denote probabilistic dependencies among them.
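Equation (9) can be illustrated numerically. The sketch below scores each class by its prior times the product of per-feature Gaussian likelihoods and then normalizes; the priors, means, and standard deviations are hypothetical values chosen only for illustration:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Per-feature Gaussian likelihood P(x_i | C), element-wise."""
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def naive_bayes_posterior(x, priors, means, stds):
    """Score each class by P(C) * prod_i P(x_i | C), per Eq. (9), then normalize."""
    scores = np.array([
        priors[c] * np.prod(gaussian_pdf(x, means[c], stds[c]))
        for c in range(len(priors))
    ])
    return scores / scores.sum()

# Two classes, two features, hypothetical class-conditional parameters
priors = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0], [3.0, 3.0]])
stds = np.array([[1.0, 1.0], [1.0, 1.0]])
posterior = naive_bayes_posterior(np.array([2.8, 3.1]), priors, means, stds)
```

The test point lies near the second class mean, so nearly all posterior mass falls on that class despite the equal priors.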

Fig. 5

The Bayesian method has several advantages and disadvantages. Its benefits include the ability to combine prior knowledge with new data, providing probabilistic distributions rather than deterministic values, and its applicability in scenarios with limited data. However, its primary drawbacks include computational complexity and sensitivity to the choice of the prior distribution.
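For the Bayesian linear regression described above, the posterior over the weights is available in closed form when both the prior and the noise are Gaussian. The sketch below assumes fixed prior precision \(\:\alpha\:\) and noise precision \(\:\beta\:\) (hyperparameters chosen by hand here, not fitted), with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
true_w = np.array([1.5, -0.7])
y = X @ true_w + 0.1 * rng.normal(size=200)   # noise std 0.1 → precision beta = 100

# Gaussian prior w ~ N(0, alpha^-1 I); Gaussian noise with precision beta.
alpha, beta = 1.0, 100.0                      # assumed hyperparameters
S_inv = alpha * np.eye(2) + beta * X.T @ X    # posterior precision matrix
S = np.linalg.inv(S_inv)                      # posterior covariance
m = beta * S @ X.T @ y                        # posterior mean of the weights

y_pred = X @ m                                # predictive mean on the training inputs
```

The prior acts as a ridge-style regularizer, and the covariance \(\:S\) quantifies the remaining uncertainty in the weights, which is the practical benefit over a point estimate.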

Linear regression

Linear Regression is a foundational and widely used supervised machine learning algorithm. It models the relationship between independent variables (features) and a dependent variable (target) by fitting a linear equation to the observed data. The primary objective is to predict the value of the dependent variable based on the given independent variables. The general mathematical representation of linear regression is as follows:

$$\:y={\beta\:}_{0}+{\beta\:}_{1}{x}_{1}+{\beta\:}_{2}{x}_{2}+\dots\:+{\beta\:}_{p}{x}_{p}+ϵ$$

(10)

Where, \(\:y\) is the dependent variable, \(\:{x}_{1},{x}_{2},\:\dots\:,\:{x}_{p}\:\) are the independent variables, \(\:{\beta\:}_{0}\) is the intercept, \(\:{\beta\:}_{1},{\beta\:}_{2},\:\dots\:,\:{\beta\:}_{p}\) are the coefficients, and \(\:ϵ\) is the error term. In matrix form, it can be written as:

$$\:y=X\beta\:+ϵ\:$$

(11)

Here, \(\:y\), \(\:X\), \(\:\beta\:\), and \(\:ϵ\) are the vectors and matrix representing the data, coefficients, and errors. The main goal of linear regression is to minimize the error between the predicted and observed values, which is typically measured using the Mean Squared Error (MSE). The formula for MSE is:

$$\:MSE=\frac{1}{n}\sum\:_{i=1}^{n}{\left({y}_{i}-{\widehat{y}}_{i}\right)}^{2}$$

(12)

Where, \(\:{y}_{i}\) represents the actual values, and \(\:{\widehat{y}}_{i}\) represents the predicted values.

The parameters of the model \(\:\left(\beta\:\right)\) are estimated using the Ordinary Least Squares (OLS) method. The OLS solution is derived as:

$$\:\beta\:={\left({X}^{T}X\right)}^{-1}{X}^{T}y$$

(13)

Once the model is trained, predictions for new data points are made using the following formula:

$$\:\widehat{y}=x\bullet\:\beta\:$$

(14)

Here, \(\:x\) is the feature vector of the new data point.
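Equations (12)-(14) can be exercised directly in NumPy; the synthetic data and coefficient values below are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
X = np.column_stack([np.ones(n), rng.uniform(size=(n, 2))])  # intercept column + 2 features
beta_true = np.array([0.5, 2.0, -1.0])
y = X @ beta_true + 0.05 * rng.normal(size=n)

# Eq. (13): closed-form OLS estimate beta = (X^T X)^-1 X^T y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# Eq. (12): mean squared error of the fit
mse = np.mean((y - X @ beta_hat) ** 2)

# Eq. (14): prediction for a new data point
x_new = np.array([1.0, 0.3, 0.6])
y_new = x_new @ beta_hat
```

In practice `np.linalg.lstsq` (or a QR decomposition) is preferred over the explicit inverse in Eq. (13), which is numerically fragile when \(\:{X}^{T}X\) is ill-conditioned.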

Artificial neural network

Artificial Neural Networks (ANNs) are a foundational method in machine learning, modeled after the structure and operation of biological neural networks. ANNs are used to recognize patterns, classify data, make predictions, and model intricate relationships between inputs and outputs. They are composed of interconnected layers of nodes (neurons), each transforming input data and passing it to the next layer. The architecture of an ANN consists of three primary layers: the input layer, hidden layers, and the output layer. The input layer receives raw data, with each neuron corresponding to a single feature of the dataset. The hidden layers process the input data through transformations, and the output layer provides the final results, such as classifications or predictions (Fig. 6).

Fig. 6

For a single neuron in a hidden or output layer, the pre-activation value \(\:z\) is calculated as:

$$\:z=\sum\:_{i=1}^{n}{w}_{i}{x}_{i}+b$$

(15)

Here, \(\:{x}_{i}\) are the inputs to the neuron, \(\:{w}_{i}\) are the weights associated with the inputs, \(\:b\) is the bias term, and \(\:z\) is the pre-activation value.
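A single neuron's forward pass per Eq. (15) is a one-liner; the sketch below also applies a ReLU activation, which is an assumption for illustration, since the text does not specify an activation function:

```python
import numpy as np

def neuron_forward(x, w, b):
    """Eq. (15): weighted sum of inputs plus bias, followed by a ReLU activation."""
    z = np.dot(w, x) + b       # pre-activation value
    return z, max(z, 0.0)      # (z, activated output)

x = np.array([0.5, -1.0, 2.0])   # inputs x_i
w = np.array([0.8, 0.2, -0.5])   # weights w_i
b = 0.1                          # bias term
z, a = neuron_forward(x, w, b)
```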

Support vector machine

Support Vector Machines (SVMs) are a robust supervised machine learning algorithm commonly used for classification, regression, and outlier detection. They are particularly effective in high-dimensional spaces and are capable of handling both linear and nonlinear classification tasks. The primary objective of an SVM is to identify a hyperplane that effectively separates different classes of data points within the feature space.

The central idea of SVM is to determine a hyperplane that maximizes the margin between two classes. This margin is defined as the distance between the hyperplane and the nearest data points from each class, known as support vectors. For a training dataset represented as \(\:\left\{\left({x}_{i},\:{y}_{i}\right)\right\}\), where \(\:{x}_{i}\) is the feature vector and \(\:{y}_{i}\in\:\left\{-1,\:+1\right\}\) is the class label, the hyperplane is mathematically defined as:

$$\:{w}^{T}x+b=0$$

(16)

Where, \(\:w\) represents the weight vector, \(\:x\) is the input feature vector, and \(\:b\) is the bias term. The hyperplane acts as the decision boundary, while the support vectors are the data points closest to this boundary.

In scenarios where the data is linearly separable, SVM seeks the optimal hyperplane that maximizes the margin between the two classes. The equations governing the boundary of the margin are:

$$\:{w}^{T}{x}_{i}+b=+1\:\:\:\:\:for\:{y}_{i}=+1$$

(17)

$$\:{w}^{T}{x}_{i}+b=-1\:\:\:\:\:for\:{y}_{i}=-1$$

(18)

This approach ensures that the hyperplane achieves maximum separation while maintaining the closest points (support vectors) at the boundary of the margin.
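The margin conditions of Eqs. (17)-(18) can be checked numerically on a hypothetical separable toy set whose maximum-margin hyperplane is known in advance (the data and \(\:w,\:b\) below are constructed for illustration, not fitted by an SVM solver):

```python
import numpy as np

# Toy separable data; the support vectors are (2, 2) and (-2, -2).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([+1, +1, -1, -1])
w = np.array([0.25, 0.25])   # canonical scaling: w^T x + b = ±1 at the support vectors
b = 0.0

scores = X @ w + b
margins = y * scores                          # y_i (w^T x_i + b) >= 1 for all points
geometric_margin = 2.0 / np.linalg.norm(w)    # width of the separating band, 2 / ||w||
```

The support vectors sit exactly on the margin boundary (\(\:y_{i}\left({w}^{T}{x}_{i}+b\right)=1\)), while all other points lie strictly outside it, which is precisely the configuration the SVM optimization seeks.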

Least squares boosting

Least Squares Boosting (LSBoost) is a machine learning method that integrates boosting with least squares regression to improve the accuracy of predictions. It is mainly used for regression tasks but can also be adapted to classification problems. LSBoost enhances traditional boosting techniques by focusing on minimizing the least squares error of the model. The process begins with an initial simple model, typically the mean of the target values. In each iteration of boosting, residuals are computed by calculating the difference between the actual values and the model’s current predictions. A new weak learner, usually a decision tree, is then trained to predict these residuals, aiming to minimize the least squares error. The model is updated by adding the scaled predictions of this weak learner, which are adjusted by a learning rate. After a predefined number of iterations or when the model’s performance stabilizes, the final model is constructed by combining the predictions of all weak learners.

The initial prediction is computed as:

$$\:{\widehat{y}}_{0}=\frac{1}{N}\sum\:_{i=1}^{N}{y}_{i}$$

(19)

Where, \(\:{y}_{i}\) is the actual target value for the \(\:{i}^{th}\) instance and \(\:N\) is the total number of instances.

At iteration \(\:m\), the residuals are calculated as:

$$\:{r}_{i}\left(m\right)={y}_{i}-{\widehat{y}}_{i}\left(m\right)$$

(20)

Where \(\:{\widehat{y}}_{i}\left(m\right)\) represents the prediction for the \(\:{i}^{th}\) instance at iteration \(\:m\).

A weak learner is then fit to these residuals, aiming to minimize the least squares error, expressed as:

$$\:min\sum\:_{i=1}^{N}{\left({r}_{i}\left(m\right)-{f}_{m}\left({x}_{i}\right)\right)}^{2}$$

(21)

Where \(\:{f}_{m}\left({x}_{i}\right)\) is the prediction of the weak learner for the \(\:{i}^{th}\) instance.

The model is updated by adding the predictions of the weak learner, scaled by a learning rate \(\:\alpha\:\):

$$\:{\widehat{y}}_{i}\left(m+1\right)={\widehat{y}}_{i}\left(m\right)+\alpha\:{f}_{m}\left({x}_{i}\right)$$

(22)

Where \(\:\alpha\:\) is the learning rate (also known as the shrinkage parameter).

After \(\:M\) iterations, the final prediction is given by:

$$\:{\widehat{y}}_{i}={\widehat{y}}_{0}+\sum\:_{m=1}^{M}\alpha\:{f}_{m}\left({x}_{i}\right)$$

(23)
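Equations (19)-(23) translate into a short boosting loop. The sketch below uses a hand-rolled depth-1 regression stump as the weak learner and synthetic one-dimensional data, both illustrative choices:

```python
import numpy as np

def fit_stump(x, r):
    """Depth-1 regression tree: pick the threshold minimizing squared error (Eq. 21)."""
    best = None
    for t in np.unique(x):
        left, right = r[x <= t], r[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        pred_l, pred_r = left.mean(), right.mean()
        err = ((left - pred_l) ** 2).sum() + ((right - pred_r) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, t, pred_l, pred_r)
    _, t, pl, pr = best
    return lambda q: np.where(q <= t, pl, pr)

rng = np.random.default_rng(3)
x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x)

alpha, M = 0.5, 50                     # learning rate and number of boosting rounds
y_hat = np.full_like(y, y.mean())      # Eq. (19): initial prediction is the mean
for _ in range(M):
    r = y - y_hat                      # Eq. (20): residuals
    f = fit_stump(x, r)                # Eq. (21): weak learner fit to the residuals
    y_hat = y_hat + alpha * f(x)       # Eq. (22): shrunken additive update

mse = np.mean((y - y_hat) ** 2)
```

Each round fits only the part of the signal the ensemble has not yet captured, so the training error decreases monotonically; the learning rate \(\:\alpha\:\) trades convergence speed against overfitting.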

Random forest

Random Forest (RF) is a widely used machine learning technique that excels in classification and regression tasks, especially when working with large and complex datasets. Its ability to reduce variance and prevent overfitting makes it a popular choice. RF is an ensemble method that combines multiple decision trees, each trained independently on a random subset of the data. The final result is obtained by aggregating the predictions from these individual trees (Fig. 7).

Fig. 7

The process of the RF algorithm starts by generating random samples from the training data through bootstrap sampling (sampling with replacement). Each decision tree is trained on a different random subset of the data. To introduce diversity among the trees, a random selection of features is made at each tree node. When predicting, the results from all trees are combined. For classification tasks, the final prediction is determined by majority voting, whereas for regression tasks, the predictions are averaged.

In classification, the final prediction for a new sample \(\:x\) is computed using the following formula, where \(\:{T}_{1},\:\:{T}_{2},\:\dots\:,\:\:{T}_{n}\) represent the decision trees and \(\:{C}_{1},\:\:{C}_{2},\:\dots\:,\:\:{C}_{n}\) are the possible classes:

$$\:Prediction\:\left(x\right)=\underset{{C}_{i}}{\text{argmax}}\left(\sum\:_{j=1}^{n}I\left({T}_{j}\left(x\right)={C}_{i}\right)\right)$$

(24)

Here, \(\:I\) is an indicator function that equals 1 if \(\:{T}_{j}\left(x\right)\) equals \(\:{C}_{i}\) and 0 otherwise.

For regression tasks, the final prediction for a new sample \(\:x\) is calculated as the average of the predictions from all the decision trees:

$$\:Prediction\:\left(x\right)=\frac{1}{n}\sum\:_{j=1}^{n}{T}_{j}\left(x\right)$$

(25)

Where \(\:{T}_{j}\left(x\right)\) is the predicted value from decision tree \(\:{T}_{j}\) for the sample \(\:x\).
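A minimal sketch of the bootstrap-and-vote scheme of Eq. (24), using crude depth-1 "trees" (a random feature with a median split) in place of full CART trees; the two-class synthetic data are illustrative:

```python
import numpy as np

def fit_tree(X, y, rng):
    """Depth-1 'tree': random feature, median split, majority label on each side."""
    feat = rng.integers(X.shape[1])            # random feature selection at the node
    t = np.median(X[:, feat])
    left = X[:, feat] <= t
    lab_l = 1 if y[left].mean() >= 0 else -1   # majority class on the left branch
    lab_r = 1 if y[~left].mean() >= 0 else -1  # majority class on the right branch
    return lambda Q: np.where(Q[:, feat] <= t, lab_l, lab_r)

def random_forest(X, y, n_trees=25, seed=0):
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(len(y), size=len(y))          # bootstrap sample (with replacement)
        trees.append(fit_tree(X[idx], y[idx], rng))
    def predict(Q):
        votes = np.sum([t(Q) for t in trees], axis=0)    # Eq. (24): majority vote
        return np.where(votes >= 0, 1, -1)
    return predict

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)
predict = random_forest(X, y)
accuracy = (predict(X) == y).mean()
```

Averaging many decorrelated trees, each trained on a different bootstrap sample with random feature choices, is what gives RF its variance reduction relative to a single deep tree.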


