Evaluation of hydraulic fracturing using machine learning

Machine Learning


Peculiarities of the applied machine learning methods

The machine learning framework developed in this study exhibits several distinctive features that set it apart from conventional approaches applied in HF analysis. First and foremost, a large-scale dataset comprising 16,000 data records was utilized, which is significantly larger than the datasets used in many previous studies. This extensive dataset enhances the robustness and generalizability of the models, allowing them to better capture the underlying patterns and interactions among HF parameters.

Secondly, the study integrates comprehensive statistical analysis—including metrics such as mean, variance, skewness, kurtosis, quartiles, and data visualization via box plots and violin plots—to better understand the distribution and variability of the input variables. Such detailed preprocessing is often overlooked in many data-driven HF studies, yet it plays a crucial role in improving model accuracy and interpretation.

Another unique aspect of the proposed methodology is the evaluation of model performance across multiple train/test ratios, ranging from 0.1 to 0.9. This systematic approach provides a deeper understanding of how data availability affects model performance and stability. The analysis of \(R^{2}\) values across these splits, supported by multiple independent runs for each model, offers insights into the consistency and reliability of different algorithms under varying data constraints.
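As a rough illustration of this kind of split sweep, the following Python sketch (not the MATLAB workflow used in the study) repeats a random train/test split several times for each ratio and records the test \(R^{2}\). Several things here are assumptions made only for illustration: the ratio is interpreted as the training fraction, the data are synthetic, and a one-variable least-squares line stands in for the RF, NN, and SVM models.

```python
import random

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1.0 - ss_res / ss_tot

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (stand-in for RF/NN/SVM)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def sweep_train_fractions(xs, ys, fractions, n_runs=5, seed=0):
    """Return {train fraction: [test R^2 of each independent run]}."""
    rng = random.Random(seed)
    results = {}
    for frac in fractions:
        scores = []
        for _ in range(n_runs):
            idx = list(range(len(xs)))
            rng.shuffle(idx)                      # new random split per run
            n_train = max(2, int(frac * len(idx)))
            train, test = idx[:n_train], idx[n_train:]
            a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
            y_pred = [a * xs[i] + b for i in test]
            scores.append(r2_score([ys[i] for i in test], y_pred))
        results[frac] = scores
    return results

# Synthetic noisy linear data (illustrative only, not the study's dataset).
_rng = random.Random(42)
xs = [i / 10 for i in range(200)]
ys = [2.0 * x + 1.0 + _rng.gauss(0, 0.5) for x in xs]
fractions = [round(0.1 * k, 1) for k in range(1, 10)]
results = sweep_train_fractions(xs, ys, fractions)
```

The spread of the scores stored for each fraction gives a simple view of how stable a model is when less training data is available.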

Furthermore, the models were developed using domain-specific parameters such as fracture height, fracture length, fluid viscosity, and injection time, which are directly derived from the governing physical equations of HF. This integration of physics-based variables into ML models enhances their relevance to real-world operations and bridges the gap between data-driven techniques and conventional engineering understanding.

Finally, by comparing three well-established algorithms—RF, NN, and SVM—under identical conditions, the study provides a fair and comprehensive evaluation of model capabilities, with RF demonstrating superior performance in terms of accuracy and error minimization.

Data preprocessing

In this study, the data were initially plotted using analytical formulas, and the assumptions considered for modeling were outlined. The data were then analyzed to examine patterns and key features. Subsequently, the SVM, NN, and RF methods were implemented using MATLAB libraries, and the datasets were organized and sorted using Microsoft Excel. These algorithms are among the most widely used ML techniques for prediction and data analysis. For each algorithm, the \(R^{2}\) value (coefficient of determination) was calculated as the metric for evaluating prediction accuracy. The dataset comprised 16,000 data points with 4 input variables. Table 2 presents the statistical insights related to the dataset and its distribution.

The required statistical information is also presented in Table 2.

In this paper, several key parameters for analyzing fractures in the HF process are examined. These parameters are each represented by a specific symbol: \(\:\mu\:\) (mu) denotes the viscosity of the fracturing fluid (in centipoise), \(\:h\) represents the crack height, \(\:t\) indicates the injection time, and \(\:X\) corresponds to the crack length. Understanding and accurately measuring these variables are crucial for interpreting the study’s results, as well as for developing predictive models and strategies. This is particularly important in HF, where the interaction between fluid flow and fracture propagation has a significant impact on the process outcomes.

The data were calculated using Eqs. 1–6. Machine learning methods were then applied to these data with the corresponding inputs. The data range is presented in Table 2 and Fig. 3.

The provided table presents a statistical analysis of the HF process, aimed at evaluating its performance. This analysis includes various statistical parameters such as maximum, minimum, range, median, first quartile (Q1), third quartile (Q3), mean, variance, and skewness.

The median divides the dataset into two equal parts, with half of the data points below it and the other half above it. This measure is particularly valuable in analyzing datasets containing outliers, as it is minimally affected by extreme values. To calculate the median, the data must first be arranged in ascending order. If the number of data points is odd, the median is the middle value. If the number of data points is even, the median is the average of the two middle values.

$$\text{Median}=\begin{cases}x_{\left(\frac{n+1}{2}\right)} & \text{for an odd number of data points } n\\[4pt] \dfrac{x_{\left(\frac{n}{2}\right)}+x_{\left(\frac{n}{2}+1\right)}}{2} & \text{for an even number of data points } n\end{cases}$$

(7)
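The two cases of Eq. 7 can be sketched in a few lines of Python. This is only an illustration of the definition; the function name is chosen here and is not taken from the study.

```python
def median(values):
    """Median following Eq. 7 (values need not be pre-sorted)."""
    ordered = sorted(values)           # ascending order, as the text requires
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]            # x_((n+1)/2) in 1-based indexing
    # average of x_(n/2) and x_(n/2 + 1) in 1-based indexing
    return (ordered[mid - 1] + ordered[mid]) / 2
```

For example, adding a single extreme value such as 1000 to the data 1, 2, 3, 4 moves the median only from 2.5 to 3, which is the robustness to outliers described above.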

Quartiles are statistical measures that divide a dataset into four equal parts. The Q1 marks the value below which 25% of the data points are located, indicating that the remaining 75% lie above it. Conversely, the Q3 identifies the value below which 75% of the data points fall, with 25% positioned above it. Quartiles play a crucial role in Box Plots, as they help visualize the distribution and concentration of data, offering valuable insights into its range and potential outliers.
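A minimal Python sketch of the quartiles behind a box plot is given below. Quartile conventions differ between software packages; this sketch assumes the "inclusive" interpolation method of Python's standard `statistics.quantiles`, and the 1.5 × IQR fences are the common box-plot convention for flagging potential outliers, not necessarily the one used for Fig. 3.

```python
import statistics

def box_plot_summary(values):
    """Q1, median, Q3, interquartile range, and 1.5*IQR outlier fences."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "Q1": q1,
        "median": q2,
        "Q3": q3,
        "IQR": iqr,
        "lower_fence": q1 - 1.5 * iqr,   # points below are potential outliers
        "upper_fence": q3 + 1.5 * iqr,   # points above are potential outliers
    }
```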

The mean, a measure of central tendency, represents the average of a dataset. It is determined by adding all the data points together and dividing the sum by the total number of observations. However, the mean is highly sensitive to outliers, which can distort its accuracy. As a result, in datasets with high variability or extreme values, the median or quartiles may serve as more reliable indicators of central tendency.

$$\text{Mean}=\frac{1}{n}\sum_{i=1}^{n}X_{i}$$

(8)

Variance measures the degree to which data points deviate from the mean. A high variance reflects a wide distribution, indicating that the data points are spread out significantly from the mean. On the other hand, a low variance indicates that the data points are closely concentrated around the mean. Variance is a valuable tool in statistical analysis, as it quantifies the level of variability within a dataset and offers a deeper understanding of its distribution and characteristics.

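$$\text{Population variance: }\sigma^{2}=\frac{1}{n}\sum_{i=1}^{n}\left(X_{i}-\mu\right)^{2}$$

(9)

$$\text{Sample variance: }s^{2}=\frac{1}{n-1}\sum_{i=1}^{n}\left(X_{i}-\bar{X}\right)^{2}$$

(10)

In Eqs. 9 and 10, \(\mu\) denotes the population mean (not the fluid viscosity defined earlier), \(\bar{X}\) is the sample mean, and \(n\) is the number of observations.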
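The difference between Eqs. 9 and 10 is only the divisor, which the following Python sketch makes explicit. The function names are illustrative.

```python
def mean(values):
    """Arithmetic mean (Eq. 8)."""
    return sum(values) / len(values)

def population_variance(values):
    """Eq. 9: squared deviations from the mean, divided by n."""
    m = mean(values)
    return sum((x - m) ** 2 for x in values) / len(values)

def sample_variance(values):
    """Eq. 10: squared deviations from the sample mean, divided by n - 1."""
    if len(values) < 2:
        raise ValueError("sample variance needs at least two observations")
    m = mean(values)
    return sum((x - m) ** 2 for x in values) / (len(values) - 1)
```

For the data 2, 4, 4, 4, 5, 5, 7, 9 the mean is 5 and the squared deviations sum to 32, so the population variance is 32/8 = 4 while the sample variance is 32/7 ≈ 4.57. With 16,000 records the two values are nearly identical.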

Skewness quantifies the asymmetry in a dataset’s distribution. A positive skewness indicates that the distribution extends towards higher values, with a larger concentration of lower values and a few extreme high values. In contrast, negative skewness suggests the distribution is elongated towards lower values, characterized by more high values and a few extreme low values. Analyzing skewness is crucial for understanding data distribution, as it reveals tendencies towards a particular direction, thereby influencing statistical interpretations and decision-making processes.

$$\text{Skewness}=\frac{n}{(n-1)(n-2)}\sum_{i=1}^{n}\left(\frac{X_{i}-\bar{X}}{s}\right)^{3}$$

(11)
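Eq. 11 can be sketched in Python as follows, taking \(s\) to be the sample standard deviation (the square root of Eq. 10). The function name is illustrative.

```python
import math

def sample_skewness(values):
    """Adjusted sample skewness following Eq. 11."""
    n = len(values)
    if n < 3:
        raise ValueError("skewness needs at least three observations")
    mean = sum(values) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    if s == 0:
        raise ValueError("skewness is undefined for constant data")
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in values)
```

A symmetric sample such as 1, 2, 3, 4, 5 gives a skewness of zero, while a sample with a long upper tail gives a positive value and one with a long lower tail gives a negative value.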

Each of these parameters, with their unique characteristics, plays a crucial role in describing and analyzing datasets, providing valuable insights into their distribution, variability, and asymmetry. The collected data have been carefully analyzed, and the parameters related to the HF process have been separately plotted from a statistical perspective. The analyses are visually presented using Box Plots and Violin Plots (Fig. 3), which illustrate a range of statistical indicators, including the median, Q1, Q3, mean, variance, skewness, as well as the maximum and minimum values.

Fig. 3

Box-plot and Violin Plots of HF parameters.

Based on Fig. 3, it can be concluded that:

In this study, the descriptive statistics for four critical variables related to HF are analyzed: viscosity of the fracturing fluid (µ), height of the fracture (h), length of the fracture (X), and injection time (t).

  • Input Layer: Receives the input data.

  • Hidden Layers: Perform computations and feature extraction.

  • Output Layer: Provides the prediction or final result.

  Each node in a layer is connected to nodes in the next layer via weighted connections. During training, these weights are adjusted to minimize the error between predictions and actual outputs (see Fig. 5).

    Fig. 5

    Equations and Evaluation Metrics of NN.

    (1) Activation functions

    Each neuron processes input values by applying an activation function, which determines whether the neuron should be activated. Common activation functions include:

    • Sigmoid:

      $$\:f\left(x\right)=\frac{1}{1+{e}^{-x}}$$

      (12)

    Used for binary outputs, compressing values between 0 and 1.

    • ReLU:

      $$f\left(x\right)=\max\left(0,x\right)$$

      (13)

    Commonly used in hidden layers for faster convergence.

    • SoftMax:

      $$f\left(x_{i}\right)=\frac{e^{x_{i}}}{\sum_{j=1}^{N}e^{x_{j}}}$$

      (14)

    Used in multi-class classification problems.
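The sigmoid and SoftMax functions can be sketched in Python as shown below. The branch on the sign of \(x\) and the subtraction of the maximum are numerical-stability choices made for this sketch. They do not change the mathematical result and are not described in the study.

```python
import math

def sigmoid(x):
    """Logistic sigmoid (Eq. 12), arranged to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    ex = math.exp(x)                  # safe: x < 0, so exp(x) < 1
    return ex / (1.0 + ex)

def softmax(xs):
    """SoftMax over a list of scores; the outputs are positive and sum to 1."""
    m = max(xs)                       # shifting by max(xs) leaves the ratios unchanged
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]
```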

    (2) Forward propagation

    During forward propagation, data flows from the input layer through hidden layers to the output layer. At each neuron, the weighted sum of inputs is calculated, followed by the application of an activation function:

    $$\:z=\sum\:_{i=1}^{n}{w}_{i}{x}_{i}+b$$

    (15)

    $$\:a=f\left(z\right)$$

    (16)

    Here, \(\:{w}_{i}\) are the weights, \(\:{x}_{i}\) are inputs, \(\:b\) is the bias term, \(\:z\) is the weighted sum, and \(\:a\) is the activated output.
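Eqs. 15 and 16 for a single neuron, and their repetition over a layer, can be sketched in Python as follows. The function names and the list-of-lists weight layout are assumptions made for illustration.

```python
import math

def sigmoid(z):
    """Logistic activation used as the example f in a = f(z)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_forward(weights, inputs, bias, activation):
    """Return (z, a): weighted sum (Eq. 15) and activated output (Eq. 16)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return z, activation(z)

def layer_forward(weight_rows, inputs, biases, activation):
    """Apply neuron_forward to every neuron of a layer; return the activations."""
    return [neuron_forward(w, inputs, b, activation)[1]
            for w, b in zip(weight_rows, biases)]
```

Feeding the output list of one `layer_forward` call into the next call reproduces the flow from the input layer through the hidden layers to the output layer.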

    (3) Cross-Entropy loss (for classification)

    $$\:L=-\frac{1}{n}\sum\:_{i=1}^{n}\sum\:_{j=1}^{C}{y}_{ij}\text{log}{\widehat{y}}_{ij}$$

    (17)

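    Where \(y_{ij}\) is the true label (1 if sample \(i\) belongs to class \(j\), and 0 otherwise), \(\widehat{y}_{ij}\) is the predicted probability, \(n\) is the number of samples, and \(C\) is the number of classes.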
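A small Python sketch of Eq. 17 for one-hot labels is given below. Clamping the predicted probabilities at a small `eps` is an assumption added here to avoid evaluating log(0). It is not part of the equation.

```python
import math

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean cross-entropy over n samples and C classes (Eq. 17).

    y_true: list of one-hot rows; y_pred: list of probability rows.
    """
    n = len(y_true)
    total = 0.0
    for true_row, pred_row in zip(y_true, y_pred):
        for yt, yp in zip(true_row, pred_row):
            total += yt * math.log(max(yp, eps))   # clamp avoids log(0)
    return -total / n
```

A perfectly confident correct prediction gives a loss of zero, and a uniform prediction over two classes gives log 2 ≈ 0.693.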

    (4) Backward propagation and optimization

    Backward propagation adjusts the weights to minimize the loss function using optimization algorithms like Gradient Descent. The gradients of the loss function with respect to the weights are computed using the chain rule of calculus.

    The weight update formula is:

    $$\:{w}^{(t+1)}={w}^{\left(t\right)}- \eta \frac{\partial\:L}{\partial\:w}$$

    (18)

    Where \(\eta\) is the learning rate, \(\:L\) is the loss, and \(\:w\) are the weights.
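The update rule of Eq. 18 can be illustrated on the smallest possible model. The sketch below assumes a single linear neuron without bias and a mean-squared-error loss instead of cross-entropy, purely to keep the chain-rule step visible; it is not the network trained in the study.

```python
def mse_loss(w, xs, ys):
    """Mean squared error of the one-weight model y_hat = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def mse_gradient(w, xs, ys):
    """dL/dw by the chain rule: d/dw (w*x - y)^2 = 2 * (w*x - y) * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def gradient_descent(xs, ys, w0=0.0, eta=0.05, n_steps=100):
    """Repeat w <- w - eta * dL/dw (Eq. 18) for a fixed number of steps."""
    w = w0
    for _ in range(n_steps):
        w = w - eta * mse_gradient(w, xs, ys)
    return w
```

With inputs 1, 2, 3 and targets 2, 4, 6 the weight converges towards 2. A learning rate that is too large for the data would make the same loop diverge, which is why \(\eta\) is normally tuned.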

    Random forest

    RF is an ML algorithm based on the ensemble learning technique, which combines multiple decision trees to improve model accuracy and reduce the risk of overfitting. It is applicable to both classification and regression tasks and leverages two primary techniques: Bootstrap Aggregation (Bagging) and Random Feature Selection.

    In the Bagging method, multiple subsets of training data are randomly generated with replacement. Each decision tree is trained on one of these subsets, reducing the variance of the model. In the Random Feature Selection method, at each node of the tree, only a random subset of features is considered for decision-making. This approach reduces the correlation between trees and enhances the final model’s accuracy.

    The final prediction in RF is made using majority voting for classification tasks and by calculating the mean output of all trees for regression tasks. This combination results in a highly accurate model that is resilient to small variations in data. Furthermore, RF is highly resistant to overfitting due to its use of random data subsets and feature limitations (see Fig. 6).

    Fig. 6

    Equations and Evaluation Metrics of RF.

    (1) Entropy (for classification)

    $$\:H\left(S\right)=-\sum\:_{i=1}^{C}{p}_{i}{{log}}_{2}\left({p}_{i}\right)$$

    (19)

    Where \(\:C\) is the number of classes, and \(\:{p}_{i}\) is the probability of each class.

    (2) Gini index (for classification)

    $$\:G\left(S\right)=1-\sum\:_{i=1}^{C}{p}_{i}^{2}$$

    (20)
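Both impurity measures (Eqs. 19 and 20) depend only on the class proportions at a node, as the following Python sketch shows. The function names are illustrative.

```python
import math
from collections import Counter

def class_probabilities(labels):
    """Relative frequency p_i of each class among the labels at a node."""
    counts = Counter(labels)
    n = len(labels)
    return [c / n for c in counts.values()]

def entropy(labels):
    """Shannon entropy in bits (Eq. 19)."""
    return -sum(p * math.log2(p) for p in class_probabilities(labels))

def gini(labels):
    """Gini index (Eq. 20)."""
    return 1.0 - sum(p ** 2 for p in class_probabilities(labels))
```

A node containing a single class has entropy 0 and Gini index 0, while an even two-class mix gives the two-class maxima of 1 bit and 0.5.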

    (3) Prediction aggregation

    • For classification:

      $$\:\widehat{y}=Mode\left\{{h}_{1}\left(x\right),\:{h}_{2}\left(x\right),\:\dots\:,\:{h}_{k}\left(x\right)\right\}$$

      (21)

    • For regression:

      $$\:\widehat{y}=\frac{1}{k}\sum\:_{i=1}^{k}{h}_{i}\left(x\right)$$

      (22)

    Where \(h_{i}\left(x\right)\) is the prediction of the \(i\)-th tree, and \(k\) is the number of trees.
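Given the list of individual tree outputs \(h_{1}(x),\dots,h_{k}(x)\), the two aggregation rules (Eqs. 21 and 22) reduce to a mode and a mean. A minimal Python sketch with illustrative function names:

```python
from collections import Counter

def aggregate_classification(tree_predictions):
    """Majority vote over the tree outputs (Eq. 21)."""
    return Counter(tree_predictions).most_common(1)[0][0]

def aggregate_regression(tree_predictions):
    """Mean of the tree outputs (Eq. 22)."""
    return sum(tree_predictions) / len(tree_predictions)
```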

    The implementation of the RF algorithm involves three main steps. The first step is generating random samples using Bootstrap. In this process, multiple random subsets are created from the original training dataset through sampling with replacement. These subsets serve as the training data for individual decision trees, ensuring diversity and reducing overfitting in the overall model.

    The second step involves building decision trees. Each subset is used to construct a unique decision tree. At each node within the tree, a random subset of features is selected, rather than considering all features. This ensures further randomness and reduces correlation among the trees. To determine the optimal decision-making criteria at each node, metrics such as Entropy or Gini Index are used to measure the impurity or information gain.

    The final step is aggregating the outputs of the trees. For classification tasks, the final prediction is based on a majority voting system, where the class predicted by the majority of trees becomes the output. For regression tasks, the final prediction is calculated by taking the mean of the outputs from all the decision trees. This aggregation method ensures a robust and accurate final prediction, leveraging the diversity of the ensemble.
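The three steps can be sketched end to end in Python. To stay short, this sketch makes several simplifications that are assumptions and not the study's implementation: each "tree" is a single-split stump rather than a full decision tree, the split criterion is the summed squared error (a common choice for regression, used here in place of Entropy or Gini), and all names are hypothetical.

```python
import random

def bootstrap_sample(rows, rng):
    """Step 1: draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]

def fit_stump(rows, feature_ids):
    """Step 2 (simplified): best single split over a random feature subset.

    Each row is (features, target). Returns (feature, threshold,
    left_mean, right_mean); a real tree would keep splitting recursively.
    """
    best = None
    for f in feature_ids:
        for threshold in sorted({r[0][f] for r in rows}):
            left = [r[1] for r in rows if r[0][f] <= threshold]
            right = [r[1] for r in rows if r[0][f] > threshold]
            if not left or not right:
                continue
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((y - ml) ** 2 for y in left)
                   + sum((y - mr) ** 2 for y in right))
            if best is None or sse < best[0]:
                best = (sse, f, threshold, ml, mr)
    if best is None:                    # no valid split: predict the mean
        m = sum(r[1] for r in rows) / len(rows)
        return (feature_ids[0], float("inf"), m, m)
    return best[1:]

def predict_stump(stump, features):
    f, threshold, ml, mr = stump
    return ml if features[f] <= threshold else mr

def fit_forest(rows, n_trees=25, n_sub_features=1, seed=0):
    """Train n_trees stumps, each on its own bootstrap sample."""
    rng = random.Random(seed)
    n_features = len(rows[0][0])
    forest = []
    for _ in range(n_trees):
        sample = bootstrap_sample(rows, rng)
        feature_ids = rng.sample(range(n_features), n_sub_features)
        forest.append(fit_stump(sample, feature_ids))
    return forest

def predict_forest(forest, features):
    """Step 3: average the tree outputs (regression)."""
    return sum(predict_stump(s, features) for s in forest) / len(forest)
```

Because every stump sees a different resample and a different feature subset, individual stumps disagree, and the averaging in `predict_forest` is what smooths out those differences.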

    Support vector machine

    SVM is a supervised ML algorithm used for both classification and regression tasks. It aims to find the best hyperplane that separates the data while maximizing the margin between the classes. The hyperplane serves as the boundary that separates data points belonging to different classes. In two-dimensional space, the hyperplane is a line, while in three-dimensional space, it becomes a plane. The general equation of the hyperplane is as follows:

    $$w\cdot x+b=0$$

    (23)

    Here, \(\:w\) represents the weight vector, \(\:x\) denotes the feature vector, and \(\:b\) is the bias term. The margin is the distance between the hyperplane and the nearest data points from each class, known as support vectors. The objective of SVM is to maximize this margin, which enhances the model’s generalization capability.

    To address non-linear problems, SVM employs kernel functions. These functions map the data into higher-dimensional spaces where it becomes linearly separable. Common kernels include the linear kernel, polynomial kernel, and radial basis function (RBF) kernel. Their equations are as follows:

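    $$K\left(x_{i},x_{j}\right)=x_{i}\cdot x_{j}$$

    (24)

    $$K\left(x_{i},x_{j}\right)=\left(x_{i}\cdot x_{j}+c\right)^{d}$$

    (25)

    $$K\left(x_{i},x_{j}\right)=\exp\left(-\gamma\left\|x_{i}-x_{j}\right\|^{2}\right)$$

    (26)

    Here, \(c\) is a constant offset (distinct from the regularization parameter \(C\) used below), \(d\) is the polynomial degree, and \(\gamma>0\) sets the width of the RBF kernel.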
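The three kernels (Eqs. 24 to 26) can be sketched in Python as plain functions of two feature vectors. The default parameter values and the lowercase name `c` for the polynomial offset are choices made for this sketch.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def linear_kernel(xi, xj):
    """Eq. 24."""
    return dot(xi, xj)

def polynomial_kernel(xi, xj, c=1.0, d=2):
    """Eq. 25, with offset c and degree d."""
    return (dot(xi, xj) + c) ** d

def rbf_kernel(xi, xj, gamma=1.0):
    """Eq. 26: depends only on the squared distance between xi and xj."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-gamma * sq_dist)
```

The RBF kernel equals 1 for identical points and decays towards 0 as the points move apart, with larger \(\gamma\) giving a faster decay.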

    In cases where the data is not perfectly separable, SVM uses the concept of a soft margin. This allows some data points to be misclassified. A regularization parameter, \(\:C\), controls the trade-off between maximizing the margin and minimizing classification errors.

    The optimization problem in SVM to find the optimal hyperplane is defined as:

    $$min\frac{1}{2}\left\| w \right\|^{2}$$

    (27)

    Subject to:

    $$\:{y}_{i}\left(w\cdot\:{x}_{i}+b\right)\ge\:1\:\:\:for\:all\:i$$

    (28)

    For non-separable data, slack variables (\(\:{\xi\:}_{i}\)) are introduced, and the optimization problem is modified as follows:

    $$\:min\frac{1}{2}{ \left\| w \right\| }^{2}+C\sum\:_{i=1}^{n}{\xi\:}_{i}$$

    (29)

    This problem is typically solved using its dual form, where the objective function becomes:

    $$\:max\sum\:_{i=1}^{n}{\alpha\:}_{i}-\frac{1}{2}\sum\:_{i=1}^{n}\sum\:_{j=1}^{n}{\alpha\:}_{i}{\alpha\:}_{j}{y}_{i}{y}_{j}K({x}_{i},{x}_{j})$$

    (30)

    Subject to:

    $$0\le \alpha_{i}\le C\quad \text{and}\quad \sum_{i=1}^{n}\alpha_{i}y_{i}=0$$

    (31)

    The decision function for a new sample \(x\) is given by:

    $$\:f\left(x\right)=sign\left(\sum\:_{i=1}^{n}{\alpha\:}_{i}{y}_{i}K\left({x}_{i},\:x\right)+b\right)$$

    (32)
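Once the multipliers \(\alpha_{i}\) and the bias \(b\) are known, Eq. 32 is a weighted sum of kernel evaluations against the support vectors. The Python sketch below assumes the dual problem has already been solved. The function names are hypothetical, and returning +1 when the score is exactly zero is a convention chosen here.

```python
def linear_kernel(xi, xj):
    """Linear kernel (Eq. 24), used as the default K."""
    return sum(a * b for a, b in zip(xi, xj))

def svm_decision(x, support_vectors, alphas, labels, bias, kernel=linear_kernel):
    """Class label in {-1, +1} following Eq. 32."""
    score = sum(a * y * kernel(sv, x)
                for a, y, sv in zip(alphas, labels, support_vectors)) + bias
    return 1 if score >= 0 else -1
```

As a hand-worked toy case, the two support vectors (1, 1) with label +1 and (−1, −1) with label −1, both with \(\alpha=0.25\) and \(b=0\), correspond to \(w=(0.5, 0.5)\), so each support vector lies exactly on its margin, \(y_{i}(w\cdot x_{i}+b)=1\).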

    Several studies have explored the use of machine learning algorithms for predicting the characteristics of oil and gas reservoirs, particularly in the hydraulic fracturing process. For instance, in 2022, Kamali et al.70 used machine learning models to predict permeability in carbonate reservoirs and identified the GMDH model as the most accurate. In a 2024 study by Feng et al.71, the CNN model demonstrated the best performance in predicting groundwater levels, an approach that could similarly be used to predict fluid behavior in hydraulically fractured reservoirs. In 2021, Barjouei et al.72 applied deep learning algorithms to predict liquid flow rates through oil wells, showing that deep learning models outperformed other models in terms of accuracy. These studies highlight the potential of machine learning algorithms, particularly deep learning models, in predicting reservoir characteristics and optimizing the hydraulic fracturing process.

    In their 2023 study, Ghorbani et al.73 used similar algorithms, including RF and SVM, to predict coronary artery disease and identified RF as the most accurate model. Applying algorithms such as RF to complex reservoir features can likewise improve prediction accuracy in processes related to oil and gas extraction, such as hydraulic fracturing. These studies demonstrate that advanced machine learning models can serve as valuable tools for predicting and optimizing production processes, especially in complex environments such as carbonate and oil reservoirs.


