This section explains each of the implemented machine-learning algorithms.
Convolutional neural network
As illustrated in Fig. 2, CNNs are a class of deep learning models designed to efficiently handle grid-structured data, which makes them particularly suitable for tasks involving images and spatial data29. The main characteristics of CNNs, such as convolutional layers, pooling operations, and hierarchical feature extraction, enable the model to automatically and adaptively learn spatial hierarchies of features30,31. This has greatly impacted many fields, including computer vision, pattern recognition, and multi-dimensional signal processing.

Fig. 2. Architecture of the convolutional neural networks applied in this study.
CNNs have reshaped the deep learning field through their ability to progressively extract more complex and abstract features directly from raw data, such as images. This is achieved through two essential architectural ideas: local connectivity, in which each unit responds only to a small neighborhood of its input, and parameter sharing, in which the same filter weights are reused across all positions.
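These two ideas can be made concrete with a minimal sketch in plain Python. The function names (`conv2d_valid`, `max_pool2d`), the toy 5×5 input, and the 2×2 edge kernel are illustrative assumptions and do not reproduce the architecture used in this study. One small kernel slides over the input, so each output value depends only on a local patch and every position reuses the same weights.

```python
def conv2d_valid(image, kernel):
    """Valid 2-D cross-correlation of a list-of-lists image with a small kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(len(image) - kh + 1):
        row = []
        for c in range(len(image[0]) - kw + 1):
            # Local connectivity: only a kh x kw patch contributes to this output.
            # Parameter sharing: the same kernel weights are used at every (r, c).
            s = 0.0
            for i in range(kh):
                for j in range(kw):
                    s += image[r + i][c + j] * kernel[i][j]
            row.append(s)
        out.append(row)
    return out


def relu(feature_map):
    """Element-wise rectified linear activation."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling with a size x size window."""
    pooled = []
    for r in range(0, len(feature_map) - size + 1, size):
        row = []
        for c in range(0, len(feature_map[0]) - size + 1, size):
            window = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            row.append(max(window))
        pooled.append(row)
    return pooled


# Toy image with a vertical edge between columns 1 and 2 (hypothetical data).
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hypothetical 2x2 kernel that responds to left-to-right intensity increases.
kernel = [[-1, 1], [-1, 1]]

conv_out = conv2d_valid(image, kernel)
features = max_pool2d(relu(conv_out))
```

The pooled feature map is nonzero only in the column that covers the edge. This is a small-scale version of the hierarchical feature extraction described above.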
Enhancements in CNN frameworks have greatly expanded the possibilities of deep learning, addressing critical issues related to computational complexity, interpretability, and generalization across various domains55. In the past few years, particularly since 2020, research has focused on developing new network architectures, including lightweight CNNs, hybrid models based on transformers, and self-supervised learning methods that lessen the dependence on large labeled datasets32,33. Recent advancements in Convolutional Neural Networks (CNNs), such as neural architecture search, attention mechanisms, and improved regularization methods, have significantly improved their effectiveness in tackling complex real-world challenges34.
Artificial neural network
ANNs are neural networks inspired by the human brain, consisting of interconnected nodes representing input, hidden, and output layers. These networks process data by computing weighted sums of inputs, applying an activation function, and propagating the result to deeper layers. Mathematically, a neuron operates as follows:
$$h(x)=f\left( \sum\limits_{i=1}^{n} \omega_{i} x_{i} + b \right)$$
(1)
where ωi are the weights, xi are the inputs, b is the bias term, and f is the activation function (e.g., sigmoid or ReLU). Optimization methods such as gradient descent minimize the error during training.
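As a minimal sketch of Eq. (1), the snippet below evaluates a single neuron and stacks several neurons into a layer. The weights, biases, and inputs are arbitrary illustrative values, and the helper names (`neuron`, `dense_layer`) are assumptions introduced here, not part of the study's implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    return max(0.0, z)


def neuron(x, w, b, activation):
    """Eq. (1): weighted sum of the inputs plus the bias, passed through f."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return activation(z)


def dense_layer(x, weights, biases, activation):
    """One layer = several neurons that all read the same input vector."""
    return [neuron(x, w, b, activation) for w, b in zip(weights, biases)]


# Hypothetical inputs and parameters, chosen only for illustration.
x = [0.5, -1.0, 2.0]
hidden = dense_layer(x, [[0.2, 0.4, 0.1], [-0.3, 0.1, 0.5]], [0.0, 0.1], relu)
output = neuron(hidden, [1.0, -1.0], 0.0, sigmoid)
```

Propagating `hidden` into a further neuron mirrors the layer-by-layer forward pass described in the text. Training would additionally adjust the weights and biases by gradient descent.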
ANNs are widely applied in image identification, natural language processing, and speech recognition. Techniques like dropout and transfer learning mitigate overfitting, ensuring adaptability to modern applications. Studies such as those by Zhu et al. and Heidari et al. highlight ANNs’ transformative role in deep learning, bridging computational advancements and applications35,36.
Decision tree
As shown in Fig. 3, a decision tree is a layered configuration displaying a flowchart that aids decision-making by representing features as internal nodes, branches as decision rules, and outcomes as leaf nodes. The tree begins with a root node at the top, and data is divided recursively according to attribute values. This straightforward and comprehensible framework allows for easy visualization of the decision-making process, which is why decision trees are commonly preferred for classification or regression tasks in machine learning37,38,39,40.
Each node in the tree evaluates a specific attribute, while the branches represent the possible outcomes of that test. The tree grows by recursively splitting the dataset on feature values: at every node, the algorithm selects the feature and decision criterion that divide the data into two or more subsets and best improve an objective such as information gain, the Gini index, or a chi-squared test statistic.
Decision trees can handle numerical and categorical data and are often used for selecting features, detecting outliers, and interpreting models. They may operate as standalone models or as parts of more complex ensemble methods, such as random forests or gradient boosting. The clarity of decision trees makes them attractive for situations where understanding the core decision-making process is crucial41,42,43. When building decision trees, metrics like Gain Ratio and Gini Index are essential for determining the optimal split to divide the data. These techniques assist in identifying the best feature and decision criterion at every node, thus enhancing the accuracy of the tree44,45,46,47.
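The split-selection step can be sketched for a single numeric feature using the Gini index. The data and the helper names (`gini`, `best_threshold`) are illustrative assumptions. A full tree would apply the same search recursively over all features.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted impurity) of the best binary split x <= t."""
    n = len(labels)
    best = (None, gini(labels))  # no split: impurity of the parent node
    candidates = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(candidates, candidates[1:]):
        t = (lo + hi) / 2.0
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best[1]:
            best = (t, score)
    return best


# Hypothetical one-feature dataset with two well-separated classes.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["a", "a", "a", "b", "b", "b"]
parent_impurity = gini(labels)
threshold, impurity = best_threshold(values, labels)
```

Here the parent node has impurity 0.5, and the split halfway between the two groups yields two pure children.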

Fig. 3. The structural layout of the decision tree algorithm.
Random forest
As depicted in Fig. 4, Random Forest is an ensemble of classification and regression trees, each built on a dataset of the same size as the training set, known as a bootstrap sample, which is created by random resampling from the training portion of the dataset48. When a tree is built, the entries of the initial dataset that are omitted from its bootstrap sample act as its testing dataset. The error rate aggregated across all such testing sets gives the out-of-bag estimate of the generalization error. It has been shown for bagged classifiers that the out-of-bag error is as accurate as using a test set of the same size as the training set. Consequently, employing the out-of-bag estimate removes the need for a separate testing set. To classify new input data points, each classification and regression tree votes for a class, and the forest selects the class that receives the highest number of votes49. The approach follows specific rules for tree construction, tree combination, self-assessment, and post-processing, and is robust against overfitting.
Compared with other machine learning methods, it is considered more robust when dealing with outliers or very high-dimensional parameter spaces50. Variable importance is an inherent form of feature selection performed by RF through a random subspace method and is assessed using the Gini impurity criterion. The Gini index evaluates the strength of predictive variables for regression and classification based on the reduction of impurity. To achieve the best split of a binary node, the improvement in the Gini index must be maximized. In basic terms, a low Gini (i.e., a larger decrease in impurity) suggests that a particular predictor plays a greater role in partitioning the data into the defined classes48,50,51.
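The bootstrap and out-of-bag mechanics described above can be sketched without any tree-fitting code. The helper names and the sample size of 1000 are illustrative assumptions. The snippet only shows how in-bag and out-of-bag indices arise and how votes are combined.

```python
import random
from collections import Counter


def bootstrap_indices(n, rng):
    """Draw n indices with replacement: one bootstrap sample of the training set."""
    return [rng.randrange(n) for _ in range(n)]


def out_of_bag(n, in_bag):
    """Indices never drawn into the bootstrap sample (the tree's own test set)."""
    drawn = set(in_bag)
    return [i for i in range(n) if i not in drawn]


def majority_vote(votes):
    """Forest prediction for classification: the most frequent class among trees."""
    return Counter(votes).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 1000                # hypothetical training-set size
in_bag = bootstrap_indices(n, rng)
oob = out_of_bag(n, in_bag)
oob_fraction = len(oob) / n
```

For large n, roughly a fraction 1/e (about 0.37) of the entries is left out of each bootstrap sample, which is why every tree has a sizeable out-of-bag set available for error estimation.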

Fig. 4. Schematic representation of the random forest algorithm.
Linear regression
Linear regression fits a linear model to the data, where the relationship between the dependent variable y and the independent variables xj is modeled as:
$$y = \omega_{0} + \sum\limits_{j = 1}^{n} \omega_{j} x_{j} + \varepsilon$$
(2)
Here, ωj are the weights, ω0 is the intercept, and ε is the error term. Despite the method's simplicity, its assumptions, such as linearity and homoscedasticity, may not hold for real-world data. Linear regression is widely used in predictive analytics and remains a foundation for regression-based modeling, as highlighted in works such as Chen et al.52.
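For the special case of a single predictor, the least-squares estimates of the intercept and weight in Eq. (2) have a closed form. The sketch below assumes ordinary least squares as the fitting criterion (the text does not name one) and uses noise-free toy data.

```python
def fit_simple_ols(x, y):
    """Least-squares intercept w0 and slope w1 for y = w0 + w1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    w1 = sxy / sxx
    w0 = mean_y - w1 * mean_x
    return w0, w1


# Hypothetical data generated exactly from y = 1 + 2x (no error term).
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
w0, w1 = fit_simple_ols(x, y)
```

With several predictors the same criterion leads to the normal equations, which are usually solved with a linear-algebra library rather than by hand.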
Ridge regression
Ridge regression is an extension of linear regression that addresses multicollinearity and overfitting by adding an L2 regularization term to the loss function. This term penalizes the squared magnitude of the coefficients, which prevents them from growing too large and reduces the model’s sensitivity to slight variations in the data. By introducing this penalty, ridge regression improves generalization, particularly in high-dimensional datasets with highly correlated predictors. It balances the trade-off between fitting the data and maintaining model simplicity. Typical application areas include finance, genomics, and machine learning, where it is valuable for improving predictive accuracy in complex or noisy datasets with many features.
Ridge Regression extends linear regression by adding L2 regularization to the cost function:
$$L = \sum\limits_{i = 1}^{n} (y_{i} - \hat{y}_{i})^{2} + \lambda \sum\limits_{j = 1}^{p} \omega_{j}^{2}$$
(3)
Here, L is the loss function, λ controls the regularization strength, and ωj are the coefficients. This penalization prevents coefficients from becoming too large and improves model stability in datasets with multicollinearity. Hoerl and Kennard’s work laid the foundation for Ridge Regression, establishing its effectiveness in high-dimensional datasets53,54.
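The shrinkage effect of the L2 penalty is easiest to see with one coefficient and no intercept, a simplifying assumption made only for this sketch. Setting the derivative of Eq. (3) to zero gives ω = Σxy / (Σx² + λ), so a larger λ pulls the coefficient toward zero without ever reaching it.

```python
def ridge_1d(x, y, lam):
    """Minimizer of sum((y_i - w * x_i)^2) + lam * w^2 for a single weight w."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)


# Hypothetical data with exact relation y = 2x.
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]
weights = {lam: ridge_1d(x, y, lam) for lam in (0.0, 1.0, 14.0)}
```

With λ = 0 the unpenalized slope of 2 is recovered, and with λ equal to Σx² = 14 the slope is exactly halved.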
Lasso regression
Lasso regression is a linear regression variant that addresses overfitting and multicollinearity with an emphasis on feature selection. Unlike ridge regression, which uses an L2 penalty, lasso regression adds an L1 penalty to the loss function, proportional to the absolute values of the coefficients. This L1 penalty can shrink some coefficients exactly to zero, thereby excluding the corresponding features. This characteristic makes lasso regression particularly useful for feature selection in high-dimensional datasets, where many predictors may be irrelevant or redundant. Shrinking some coefficients to zero helps improve model generalization and provides a simpler, more interpretable model by identifying the most important predictors. Lasso regression is well suited to datasets with many features when the goal is to reduce model complexity by eliminating features that do not contribute meaningfully to the prediction. The result is a sparse model that is easier to interpret and less prone to overfitting, particularly in cases with many correlated predictors55,56.
Applications of Lasso Regression are diverse and span multiple domains. In genomics, it is used to select a subset of genes that contribute most significantly to a disease outcome, effectively decreasing the dimensionality of the data while retaining the most crucial predictors. In finance, lasso regression can help identify key factors influencing asset prices or credit risks, especially when the number of potential predictors (such as economic indicators or market metrics) is large. In marketing, it is applied to determine which channels or features (e.g., pricing, advertising, or product attributes) most impact consumer behavior, allowing businesses to allocate resources efficiently. Furthermore, in machine learning, lasso regression is commonly used for sparse modeling in high-dimensional datasets, such as image processing or text classification, where only a small subset of features might be relevant for accurate predictions. Its ability to perform feature selection while preventing overfitting makes it especially useful in applications involving large-scale, complex datasets.
Lasso Regression adds L1 regularization to the cost function:
$$L = \sum\limits_{i = 1}^{n} (y_{i} - \hat{y}_{i})^{2} + \lambda \sum\limits_{j = 1}^{p} |\omega_{j}|$$
(4)
This penalty shrinks some coefficients to zero, effectively enabling feature selection. Lasso is particularly helpful for high-dimensional data, where irrelevant predictors must be excluded. The work of Zhang et al. and Zhan et al. on Lasso Regression underscores its utility in sparse modeling57,58.
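Why the L1 penalty produces exact zeros can be seen in the one-coefficient, no-intercept case, a simplification assumed only for this sketch. Minimizing Eq. (4) then gives a soft-thresholded solution: the coefficient is zero whenever |Σxy| does not exceed λ/2.

```python
def soft_threshold(rho, t):
    """Shrink rho toward zero by t, returning exactly 0 when |rho| <= t."""
    if rho > t:
        return rho - t
    if rho < -t:
        return rho + t
    return 0.0


def lasso_1d(x, y, lam):
    """Minimizer of sum((y_i - w * x_i)^2) + lam * |w| for a single weight w."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return soft_threshold(sxy, lam / 2.0) / sxx


# Hypothetical data with exact relation y = 2x.
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]
path = [(lam, lasso_1d(x, y, lam)) for lam in (0.0, 28.0, 56.0, 100.0)]
```

In contrast to the ridge solution, which only approaches zero, the lasso coefficient here is exactly zero once λ reaches 56. This is the mechanism behind its feature-selection behavior.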
Support vector regression
SVR is a regression approach built on the principles of the support vector machine (SVM), a method typically used for classification and here adapted to regression problems. In SVR, the model aims to find a function approximating the underlying relationship between input and output variables, with an important distinction: it does not minimize the error directly for all points but instead seeks a balance between error tolerance and model complexity. SVR defines a tolerance margin, represented by a “tube” (the epsilon-insensitive tube), within which errors are not penalized. Data points within this margin are considered well-predicted, while data points outside the margin incur a penalty based on how far they lie from the predicted values59,60.
SVR can be implemented in both linear and nonlinear forms. In the nonlinear case, SVR employs the kernel trick, which implicitly maps the input data into a higher-dimensional space, allowing the model to capture complex, nonlinear relationships. Popular kernels include the Radial Basis Function (RBF) kernel, which helps capture highly nonlinear patterns in the data. SVR is powerful when dealing with noisy data or when a model must balance fitting the data against overfitting. It is often used in financial forecasting, engineering, and time-series analysis, where precision and robustness against outliers are important. Its ability to work well with high-dimensional data and its inherent regularization properties make it a valuable tool in many fields requiring predictive modeling47.
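The epsilon-insensitive tube can be illustrated with the loss function alone. This is only a sketch of the penalty described above, not a full SVR solver. The tube half-width of 0.1 and the sample values are illustrative assumptions.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Zero inside the tube |y - f(x)| <= epsilon, growing linearly outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


# Hypothetical targets and predictions.
y_true = [1.0, 2.0, 3.0]
y_pred = [1.05, 2.5, 2.0]
losses = [epsilon_insensitive_loss(t, p) for t, p in zip(y_true, y_pred)]
```

The first prediction falls inside the tube and contributes nothing, while the other two are penalized only by the distance by which they exceed the tube.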
Gradient boosting machine
GBM, introduced by Jerome Friedman in 1999, is a supervised ensemble learning method that builds a robust predictive model by combining multiple decision trees. Each iteration adds a tree that reduces the errors of the existing trees, optimizing a loss function to increase accuracy in classification and regression tasks61,62. GBM is particularly valued for its ability to model complex, nonlinear relationships and to provide insight into feature importance, which aids feature selection. However, GBM can be resource-intensive and needs careful tuning of hyperparameters, such as the learning rate and tree depth, to balance overfitting against underfitting.
The algorithm uses a step-by-step, iterative approach, beginning with applying a simple model, such as a decision tree, to the data. The primary objective is to minimize the loss function L(y, f(x)), where y represents the true target, f(x) denotes the predictions generated by the model, and L is the loss function. The model is gradually refined at each stage to reduce mistakes and improve its predictions63,64.
The first stage initializes the model with the constant c that minimizes the total loss:
$$F_{0}(x) = \mathop{\arg\min}\limits_{c} \sum\limits_{i = 1}^{n} L(y_{i}, c)$$
(5)
Second stage: for every iteration m, calculate the negative gradient (pseudo-residuals):
$$r_{im} = - \left[ \frac{\partial L(y_{i}, F(x_{i}))}{\partial F(x_{i})} \right]_{F(x) = F_{m - 1}(x)}$$
(6)
Third stage: fit a base learner to these pseudo-residuals:
$$h_{m}(x) = \mathop{\arg\min}\limits_{h} \sum\limits_{i = 1}^{n} (r_{im} - h(x_{i}))^{2}$$
(7)
Fourth stage: Update the model:
$$F_{m}(x) = F_{m - 1}(x) + \upsilon \, h_{m}(x)$$
(8)
In the GBM algorithm, the learning rate υ determines how much each decision tree contributes to the overall model. In each iteration, the algorithm adds a new decision tree to correct the errors in the combined predictions of all previous trees. The flexibility of GBM arises from the ability to customize the loss function L to achieve specific objectives, making it an extremely adaptable approach. Regularization parameters such as the learning rate υ and the tree depth are included to enhance model generalization and prevent overfitting. As a result, the final model is a weighted sum of predictions from individual trees, gradually capturing more complex patterns in the data65,66,67. A diagrammatic illustration of the GBM algorithm is shown in Fig. 5.
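Eqs. (5) to (8) can be traced in a minimal sketch for one feature. Two simplifying assumptions are made here that the text does not prescribe: the loss is the squared error L = ½(y − F)², so the pseudo-residuals reduce to y − F, and the base learner is a depth-1 regression tree (a stump). All names and data are illustrative.

```python
def fit_stump(x, r):
    """Eq. (7) restricted to stumps: least-squares split of residuals r on x."""
    best = None
    xs = sorted(set(x))
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2.0
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        lv = sum(left) / len(left)
        rv = sum(right) / len(right)
        sse = (sum((ri - lv) ** 2 for ri in left)
               + sum((ri - rv) ** 2 for ri in right))
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    _, t, lv, rv = best
    return lambda v: lv if v <= t else rv


def gradient_boost(x, y, n_rounds=50, learning_rate=0.1):
    f0 = sum(y) / len(y)  # Eq. (5): for squared loss the best constant is the mean
    stumps = []
    pred = [f0] * len(y)
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]  # Eq. (6)
        h = fit_stump(x, residuals)                        # Eq. (7)
        stumps.append(h)
        # Eq. (8): move the current predictions a small step along h
        pred = [pi + learning_rate * h(xi) for pi, xi in zip(pred, x)]

    def model(v):
        return f0 + learning_rate * sum(h(v) for h in stumps)

    return model


def mse(model, x, y):
    return sum((yi - model(xi)) ** 2 for xi, yi in zip(x, y)) / len(y)


# Hypothetical step-shaped data with a little noise.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
mean_y = sum(y) / len(y)
baseline_mse = mse(lambda v: mean_y, x, y)
model = gradient_boost(x, y)
boosted_mse = mse(model, x, y)
```

Each round removes only a fraction (set by the learning rate) of the remaining residual, which is why many small trees are combined rather than one large step being taken.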

Fig. 5. Schematic flow of the gradient boosting machine (GBM) algorithm.
K-nearest neighbors
KNN is used for both classification and regression. For classification, the algorithm calculates the distance between the query point and all points in the training set using metrics such as the Euclidean, Manhattan, or Minkowski distance. It then identifies the K nearest neighbors and assigns the class that appears most frequently among them. In regression, KNN predicts the target value by averaging (or weighting) the values of the K nearest neighbors. The choice of K and the weighting scheme significantly influence model performance. Importantly, KNN is computationally intensive since it requires storing the entire dataset and performing distance calculations at prediction time, making it less efficient with large datasets. Despite this, KNN is valued for its simplicity and effectiveness, particularly when the data is well-distributed, and it makes no strong assumptions about the underlying distribution.
KNN has a wide range of applications across various fields due to its versatility and ease of implementation. In medicine, it is used for disease diagnosis, classifying patients based on features such as age, medical history, and test results. In finance, it aids in credit scoring and fraud detection by classifying transactions or customers as safe or suspicious. In e-commerce, KNN is applied in customer segmentation and personalized recommendations by analyzing customer behaviors. The algorithm is also valuable in image recognition and computer vision for tasks like facial recognition and object detection, where it classifies images based on pixel features.
Additionally, in geospatial analysis, KNN is used to predict geographic features, cluster regions with similar characteristics, and analyze land use patterns in urban planning and agriculture. The algorithm performs exceptionally well in problems with irregular decision boundaries and high-dimensional data settings where similar data points cluster closely in the feature space47,68.
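The classification and regression procedures described for KNN can be sketched as follows, using the Euclidean distance and unweighted neighbors. The two-cluster toy data and the function names are illustrative assumptions.

```python
import math
from collections import Counter


def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def k_nearest(train_x, query, k, distance=euclidean):
    """Indices of the k training points closest to the query."""
    order = sorted(range(len(train_x)),
                   key=lambda i: distance(train_x[i], query))
    return order[:k]


def knn_classify(train_x, train_y, query, k=3):
    """Majority class among the k nearest neighbors."""
    idx = k_nearest(train_x, query, k)
    return Counter(train_y[i] for i in idx).most_common(1)[0][0]


def knn_regress(train_x, train_y, query, k=3):
    """Unweighted average of the targets of the k nearest neighbors."""
    idx = k_nearest(train_x, query, k)
    return sum(train_y[i] for i in idx) / k


# Hypothetical two-cluster data.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
classes = ["a", "a", "a", "b", "b", "b"]
targets = [1.0, 1.0, 1.0, 10.0, 10.0, 10.0]

label_near_origin = knn_classify(points, classes, (0.5, 0.5))
label_far = knn_classify(points, classes, (5.5, 5.5))
value_near_origin = knn_regress(points, targets, (0.2, 0.2))
```

Every query sorts all training points by distance, which makes the runtime cost noted above explicit: nothing is learned in advance, and the full dataset is scanned at prediction time.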
Extreme gradient boosting
XGBoost is an efficient ML algorithm based on GBM principles, primarily used for classification and regression tasks. It starts with a base model, usually a simple decision tree, and iteratively builds additional trees to reduce the remaining errors. This correction is done by focusing on the residuals, i.e., the differences between the actual values and the predictions of the previous models. Each subsequent tree is fitted using gradient information from the loss function to reduce these residuals, adjusting the model’s predictions accordingly. One of the algorithm’s strengths is its incorporation of L1 and L2 regularization (as in Lasso and Ridge) to prevent overfitting, promoting simpler and more generalizable trees. Another key feature is its weighted quantile sketch for approximate split finding, combined with sparsity-aware handling of missing and sparse data. Early stopping is also employed to prevent overfitting by halting the training when the model’s performance on a validation set starts to degrade. The final model prediction is obtained by summing the outputs of all trees, with each tree’s contribution scaled by the learning rate. XGBoost also supports parallelized tree construction, making it faster than many other gradient-boosting implementations. The flexibility to tune hyperparameters such as the learning rate, maximum depth, subsampling ratio, and number of trees further enhances its performance and computational efficiency. These attributes make it particularly effective in handling large datasets and complex data patterns, leading to its widespread use in machine learning competitions and production environments.
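The early-stopping rule mentioned above can be sketched independently of any library. The function below is a generic patience-based criterion written for illustration. It is not the XGBoost API, and the `patience` parameter and the validation-loss values are assumptions.

```python
def early_stopping_round(validation_losses, patience=3):
    """Return (best_round, best_loss), stopping once `patience` consecutive
    rounds have passed without improving on the best validation loss."""
    best_round, best_loss = 0, float("inf")
    for i, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_round, best_loss = i, loss
        elif i - best_round >= patience:
            break  # validation performance has stopped improving
    return best_round, best_loss


# Hypothetical validation loss after each boosting round.
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.53, 0.51, 0.49]
stopped = early_stopping_round(losses, patience=3)
patient = early_stopping_round(losses, patience=10)
```

With a patience of 3 the search halts after three non-improving rounds and keeps round 3, never seeing the slightly better final round. A larger patience trades extra training time for the chance to find it.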
XGBoost’s versatility makes it highly effective across various industries. It is widely used in finance for credit scoring, risk assessment, and fraud detection. It builds predictive models that estimate the likelihood of events such as loan defaults or fraudulent transactions. In healthcare, XGBoost aids in disease prediction and medical image classification thanks to its capacity to capture intricate patterns in high-dimensional medical data. In e-commerce, it is utilized for customer segmentation, recommendation systems, and demand forecasting, efficiently handling vast customer interaction datasets. Marketing applications also benefit from using XGBoost to predict customer behavior, optimize ad targeting, and improve sales forecasting. Due to its robustness in managing sparse textual data, the algorithm is valuable in Natural Language Processing (NLP) tasks, such as text classification, sentiment analysis, and spam detection. Energy forecasting, image recognition (including autonomous driving and facial recognition), and geospatial analysis for predicting geographic trends have significantly improved with XGBoost’s predictive capabilities. Its success in machine learning competitions highlights its ability to solve diverse and complex prediction and classification problems across industries, solidifying its position as a go-to tool for machine learning practitioners69,70.
Gaussian process machine
A Gaussian process (GP) is a Bayesian machine-learning model for regression and classification tasks. It treats the function values at the inputs as a collection of random variables, any finite subset of which follows a joint Gaussian distribution. Mathematically, GP assumes that the function f(x) generating the predictions is distributed as:
$$f(x) \sim GP(m(x),k(x,x\prime ))$$
(9)
where m(x) denotes the mean function and k(x, x′) represents the covariance kernel defining the relationship between inputs x and x′. The choice of kernel k(x, x′) is critical, and it can take forms such as the RBF (Radial Basis Function) kernel:
$$k(x,x\prime ) = \exp \left( - \frac{|x - x\prime |^{2}}{2\ell ^{2}} \right)$$
(10)
where ℓ is the length scale parameter.
GPs provide uncertainty estimates for predictions, making them highly suitable for robotics, geostatistics, and optimization applications. However, GP models struggle with scalability, as their computational complexity grows cubically with the number of data points. Techniques such as sparse approximations help mitigate this challenge. Wenming et al. and Su et al. provide pivotal references for Gaussian Processes, explaining their theory, applications, and kernel design71,72.
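A minimal sketch of GP regression with the RBF kernel shows where the uncertainty estimates and the cubic cost come from. It assumes a zero mean function, scalar inputs, and a small noise (jitter) term on the diagonal. None of these choices are stated in the text, and the helper names are illustrative. The linear solve is written out by hand to keep the example self-contained.

```python
import math


def rbf(x1, x2, length_scale=1.0):
    """RBF kernel with a negative exponent, so similarity decays with distance."""
    return math.exp(-((x1 - x2) ** 2) / (2.0 * length_scale ** 2))


def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting.
    This O(n^3) step is the source of the cubic scaling of exact GPs."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - tail) / m[r][r]
    return z


def gp_posterior(train_x, train_y, query, length_scale=1.0, noise=1e-6):
    """Posterior mean and variance of a zero-mean GP at a scalar query point."""
    k = [[rbf(a, b, length_scale) + (noise if i == j else 0.0)
          for j, b in enumerate(train_x)]
         for i, a in enumerate(train_x)]
    k_star = [rbf(a, query, length_scale) for a in train_x]
    alpha = solve(k, train_y)   # K^{-1} y
    v = solve(k, k_star)        # K^{-1} k_*
    mean = sum(ks * al for ks, al in zip(k_star, alpha))
    var = rbf(query, query, length_scale) - sum(ks * vi for ks, vi in zip(k_star, v))
    return mean, var


# Hypothetical training data.
train_x = [0.0, 1.0, 2.0]
train_y = [0.0, 1.0, 0.5]
mean_at_train, var_at_train = gp_posterior(train_x, train_y, 1.0)
mean_far, var_far = gp_posterior(train_x, train_y, 10.0)
```

At a training input the posterior variance is close to zero, whereas far from the data the prediction reverts to the prior mean with variance close to one. This behavior is what makes GP uncertainty estimates useful.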
Light gradient boosting machine
LightGBM is an efficient, scalable gradient-boosting framework developed by Microsoft and designed to handle large, high-dimensional datasets while optimizing speed and performance. Like traditional gradient-boosting methods, it builds a sequence of decision trees, where each new tree corrects the errors made by the previous ones. However, LightGBM incorporates several optimizations to enhance its performance, most notably its leaf-wise tree growth strategy, as opposed to the level-wise approach used by many other boosting methods. This allows the model to converge faster and achieve better accuracy by focusing on the most impactful data splits. Additionally, LightGBM uses histogram-based algorithms that reduce both computational time and memory usage, making it suitable for large-scale datasets. Its advanced features, including categorical feature handling, parallel and GPU learning support, and built-in cross-validation, make it a powerful tool for rapid model deployment and real-time predictions.
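The idea behind histogram-based split finding can be sketched generically: feature values are bucketed into a few bins, gradient sums are accumulated per bin, and only bin boundaries are scanned as split candidates. The code below is an illustrative simplification with equal-width bins and a squared-loss gain score. It is not LightGBM's actual implementation, and all names and data are assumptions.

```python
def build_histogram(values, gradients, n_bins, lo, hi):
    """Accumulate the gradient sum and sample count of each equal-width bin."""
    width = (hi - lo) / n_bins
    grad_sum = [0.0] * n_bins
    count = [0] * n_bins
    for v, g in zip(values, gradients):
        b = min(int((v - lo) / width), n_bins - 1)
        grad_sum[b] += g
        count[b] += 1
    return grad_sum, count


def best_bin_split(grad_sum, count):
    """Scan bin boundaries only; score = G_L^2 / n_L + G_R^2 / n_R."""
    total_g, total_n = sum(grad_sum), sum(count)
    best_bin, best_score = None, float("-inf")
    g_left, n_left = 0.0, 0
    for b in range(len(count) - 1):
        g_left += grad_sum[b]
        n_left += count[b]
        n_right = total_n - n_left
        if n_left == 0 or n_right == 0:
            continue  # a split must leave samples on both sides
        g_right = total_g - g_left
        score = g_left ** 2 / n_left + g_right ** 2 / n_right
        if score > best_score:
            best_bin, best_score = b, score
    return best_bin, best_score


# Hypothetical feature values in [0, 1) and their gradients.
values = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
gradients = [-1.0, -1.0, -1.0, -1.0, 1.0, 1.0, 1.0, 1.0]
grad_sum, count = build_histogram(values, gradients, n_bins=4, lo=0.0, hi=1.0)
split_bin, split_score = best_bin_split(grad_sum, count)
```

After binning, the split search costs time proportional to the number of bins rather than the number of samples, which is the source of the speed and memory savings described above.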
LightGBM is extensively used across various industries for its speed, efficiency, and ability to handle vast data. It powers applications such as credit scoring, risk assessment, fraud detection, and high-frequency trading in finance. It supports predictive modeling for patient outcomes, treatment effectiveness, and disease diagnosis in healthcare. Its rapid computation and accuracy make it ideal for e-commerce and marketing tasks, including recommendation systems, customer segmentation, and churn prediction. LightGBM optimizes network operations and service quality in telecommunications and technology through complex data analysis. The algorithm is also widely applied in NLP tasks like text classification and sentiment analysis, leveraging its ability to handle sparse data. LightGBM’s versatility extends to forecasting applications, such as energy consumption prediction and sales forecasting, underscoring its broad utility in real-time data analysis and decision-making processes. Its exceptional performance and ease of use make it a preferred tool for machine learning practitioners seeking efficient and high-accuracy models65,73.
Elastic net
Elastic Net is a regularization technique that combines the benefits of both Lasso (L1 regularization) and Ridge (L2 regularization) regression methods, making it highly effective for datasets with numerous predictors, especially when some predictors are correlated or when the number of predictors exceeds the number of observations. Unlike Lasso, which sets some coefficients to zero for feature selection, and Ridge, which shrinks coefficients without eliminating them, Elastic Net applies both L1 and L2 penalties. This hybrid approach mitigates the limitations of each method: Lasso’s tendency to select only one variable from a group of correlated predictors and Ridge’s inability to eliminate any variables. By balancing both penalties, Elastic Net can perform feature selection while maintaining model stability, even in the presence of highly correlated predictors.
This technique is beneficial for high-dimensional datasets where multicollinearity is a concern. Lasso can result in a model where only one variable from a group of highly correlated features is retained, potentially missing important predictors. In contrast, Elastic Net retains multiple correlated features, enhancing model accuracy and interpretability. The technique’s flexibility comes from its key parameters: λ (lambda), which controls the regularization strength, and α (alpha), which determines the mix of Lasso and Ridge penalties. When α equals 1, Elastic Net behaves like Lasso; when α equals 0, it behaves like Ridge. Cross-validation is commonly used to find the optimal values of λ and α, ensuring that the model generalizes well to unseen data. Elastic Net’s ability to combine sparsity with stability makes it a robust feature selection and regularization solution, particularly in complex, high-dimensional datasets.
It helps address multicollinearity while performing feature selection for high-dimensional datasets. The objective function is expressed as:
$$L=\sum\limits_{i=1}^{n} (y_{i} - X_{i}^{T}\beta)^{2} + \lambda_{1} \sum\limits_{j=1}^{p} |\beta_{j}| + \lambda_{2} \sum\limits_{j=1}^{p} \beta_{j}^{2}$$
(11)
where λ1 controls the sparsity (lasso penalty) and λ2 controls the shrinkage (ridge penalty). Elastic Net finds a balance between feature selection and regularization, making it highly useful when predictors are highly correlated or when there are more features than data points.
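The objective in Eq. (11) and the interplay of its two penalties can be sketched as follows. The single-coefficient closed form assumes no intercept. The mapping from (λ, α) to (λ1, λ2) in `split_penalties` is one common convention assumed here for illustration, since the text does not state it explicitly.

```python
def elastic_net_objective(X, y, beta, lam1, lam2):
    """Eq. (11): residual sum of squares plus L1 and L2 penalties."""
    rss = 0.0
    for row, yi in zip(X, y):
        pred = sum(b * v for b, v in zip(beta, row))
        rss += (yi - pred) ** 2
    l1 = sum(abs(b) for b in beta)
    l2 = sum(b * b for b in beta)
    return rss + lam1 * l1 + lam2 * l2


def soft_threshold(rho, t):
    if rho > t:
        return rho - t
    if rho < -t:
        return rho + t
    return 0.0


def elastic_net_1d(x, y, lam1, lam2):
    """Minimizer of Eq. (11) for a single coefficient without intercept."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return soft_threshold(sxy, lam1 / 2.0) / (sxx + lam2)


def split_penalties(lam, alpha):
    """Assumed convention: alpha = 1 gives pure lasso, alpha = 0 pure ridge."""
    return lam * alpha, lam * (1.0 - alpha)


# Hypothetical data with exact relation y = 2x.
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]
X = [[v] for v in x]
beta_lasso = elastic_net_1d(x, y, lam1=28.0, lam2=0.0)
beta_ridge = elastic_net_1d(x, y, lam1=0.0, lam2=14.0)
beta_both = elastic_net_1d(x, y, lam1=28.0, lam2=14.0)
```

The L1 part sets the threshold below which a coefficient becomes exactly zero, while the L2 part enlarges the denominator and shrinks whatever survives. Applying both gives a smaller coefficient than either penalty alone.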
Elastic Net is widely applied in domains like genomics, where selecting the most informative predictors among thousands of features is crucial. Guo et al. describe Elastic Net, emphasizing its ability to overcome lasso’s limitations in multicollinear settings while preserving interpretability74,75.
Categorical boosting
CatBoost is a high-performance gradient boosting library developed by Yandex that is designed to efficiently handle categorical features in data modeling tasks involving mixed feature types. Unlike traditional gradient boosting methods that require manual categorical data preprocessing, CatBoost automates this process through permutation-driven transformations and ordered boosting. These innovations lead to improved accuracy and reduced overfitting. The library also introduces symmetric trees for faster model inference, reduced prediction latency, and support for GPU acceleration, enabling efficient processing of large datasets. CatBoost’s robustness, ease of use, and native capability to process categorical data have made it popular in various applications, including finance and e-commerce. Its competitive performance frequently surpasses traditional gradient-boosting methods in modeling accuracy and computational efficiency, making it an invaluable tool for data scientists and analysts.
CatBoost’s efficient handling of categorical data makes it widely applicable across various industries. In finance, it aids in credit scoring, fraud detection, and risk analysis, leveraging categorical features like customer demographics and transaction types. In e-commerce, it enhances recommendation systems, personalized marketing, and customer segmentation by analyzing user preferences and behaviors, even with incomplete data. The library is also beneficial in healthcare for predicting patient outcomes and disease diagnosis using mixed data types such as medical records and test results. Additionally, CatBoost supports churn prediction, targeted advertising, and sales forecasting with rich customer datasets in marketing.
Furthermore, it is applied in natural language processing tasks like text classification and sentiment analysis, thanks to its capability to handle mixed categorical and text data. Its scalability and efficiency make CatBoost popular for real-time predictive analytics and large-scale modeling across retail, telecommunications, and gaming sectors.
CatBoost is a gradient-boosting algorithm specifically optimized for categorical features, relying on ordered target encoding and randomized permutations to reduce overfitting and handle high-cardinality categorical data. Its objective is similar to general gradient boosting:
$$L = \sum\limits_{i = 1}^{n} l\left( y_{i}, \hat{y}_{i} \right) + \lambda \sum\limits_{j = 1}^{p} \Omega(T_{j})$$
(12)
CatBoost offers fast training, robustness, and improved handling of missing or categorical data. These features make it well suited to recommendation systems, retail forecasting, and financial modeling. Cha et al. describe CatBoost, outlining its advantages in efficiently handling categorical and tabular data67,76,77,78.
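The ordered target encoding mentioned above can be sketched in a few lines: rows are visited in a random permutation, and each row's category is encoded using only the targets of rows visited before it, so a row's own target never leaks into its encoding. The smoothing form (sum + a·prior) / (count + a) and the parameter names are assumptions made for illustration. This is not the CatBoost library's code.

```python
import random


def ordered_target_encoding(categories, targets, prior, prior_weight=1.0, seed=0):
    """Encode each row from the targets of same-category rows that precede it
    in a random permutation. Returns (encoded values, permutation used)."""
    n = len(categories)
    order = list(range(n))
    random.Random(seed).shuffle(order)
    sums, counts = {}, {}
    encoded = [0.0] * n
    for i in order:
        c = categories[i]
        s = sums.get(c, 0.0)
        k = counts.get(c, 0)
        # Only earlier rows contribute; the first row of a category gets the prior.
        encoded[i] = (s + prior_weight * prior) / (k + prior_weight)
        sums[c] = s + targets[i]
        counts[c] = k + 1
    return encoded, order


# Hypothetical categorical feature with a binary target.
categories = ["a", "a", "b", "a", "b"]
targets = [1.0, 0.0, 1.0, 1.0, 0.0]
encoded, order = ordered_target_encoding(categories, targets, prior=0.5)
```

A plain target mean per category would use every row's own label and tend to overfit. Restricting the statistic to preceding rows is how the permutation-based scheme reduces that risk.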
