Workflow overview
Figure 7 shows the schematic of the proposed ML workflow, which comprises three main parts: data acquisition and preprocessing, hyperparameter tuning using Grid Search (GS), and evaluation of the tuned model. Each part is described in detail in the following sections. The principal aim of this work is to assess the ability of MLAs to predict the lithologies found in a meteorite impact crater.

Workflow for the proposed machine learning process.
Data acquisition and preprocessing
The application of MLAs in predicting lithologies was investigated through the analysis of two boreholes, LB-07A and LB-08A, drilled within the Bosumtwi Impact Crater as part of the International Continental Scientific Drilling Program (ICDP)2. Physical property measurements and borehole log information were obtained, including density and magnetic susceptibility measurements on the core material, and borehole total magnetic field data extracted from the borehole deviation surveys33. These measurements, along with total gamma radiation and caliper data, were presented as borehole logs2.
Due to the inaccessibility of the original data, the borehole logs were extracted from figures in Morris et al.2 using a digital tool for this purpose34. The final output was then plotted and visually compared with the original borehole plot from Morris et al.2 to ensure fidelity to the original plots. Figures 8 and 9 show the petrophysical property logs for boreholes LB-07A and LB-08A, respectively. Both boreholes have the following individual logs: (a) simplified lithology; (b) density (g/cm³) measured on core segments; (c) magnetic susceptibility (\({\text{log}}_{10}\times {10}^{-5}\) SI) measured on core segments; (d) Total Magnetic Intensity (TMI) (nT), scalar magnetic intensity derived from borehole deviation survey; (e) total gamma (cps) derived from borehole survey by ICDP; and (f) caliper (mm) survey.

Physical property logs for borehole LB-07A (digitized and extracted from2).

Physical property logs for borehole LB-08A (digitized and extracted from2).
The dataset includes two wells with five features each: density, total gamma, caliper, scalar TMI, and magnetic susceptibility. The classification targets are five lithology classes: MGW, SSP, MLB, PLB, and SUE. Table 3 shows the number of samples for each lithology in the LB-07A and LB-08A datasets after preprocessing.
In the preprocessing stage, several key aspects were addressed to ensure the integrity and suitability of the well-log data for subsequent ML analysis. The two well-log datasets, both small and highly class-imbalanced, were combined to increase the overall dataset size and enhance the model’s ability to discern lithologic patterns. Before merging, a dataset identifier feature was added to each sample so that the two well-log datasets could be reconstructed after prediction. To maintain data quality, samples with a missing value for any feature were removed. Finally, the lithology target variable was encoded by mapping each class to a numeric value for compatibility with the algorithms.
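The preprocessing steps described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the column names, the helper functions, and the sample values are all hypothetical, and the study's own implementation is not shown in the text.

```python
# Hypothetical column names for the five features described in the text.
FEATURES = ["density", "total_gamma", "caliper", "tmi", "mag_sus"]


def tag_well(rows, well_id):
    """Return copies of the rows with a dataset identifier added."""
    return [dict(row, well=well_id) for row in rows]


def drop_missing(rows, features=FEATURES):
    """Keep only the rows in which every feature has a value."""
    return [r for r in rows if all(r.get(f) is not None for f in features)]


def encode_labels(rows, target="lithology"):
    """Map each lithology class to an integer code (sorted for reproducibility)."""
    classes = sorted({r[target] for r in rows})
    mapping = {name: code for code, name in enumerate(classes)}
    return [dict(r, label=mapping[r[target]]) for r in rows], mapping


# Made-up rows for illustration only; these are not values from the study.
lb07a = [
    {"density": 2.41, "total_gamma": 82.0, "caliper": 101.0,
     "tmi": 31050.0, "mag_sus": 1.3, "lithology": "SUE"},
    {"density": None, "total_gamma": 75.0, "caliper": 99.0,
     "tmi": 31020.0, "mag_sus": 1.1, "lithology": "MGW"},
]
lb08a = [
    {"density": 2.55, "total_gamma": 90.0, "caliper": 100.0,
     "tmi": 31100.0, "mag_sus": 1.6, "lithology": "MGW"},
    {"density": 2.38, "total_gamma": 79.0, "caliper": 102.0,
     "tmi": 31070.0, "mag_sus": 1.2, "lithology": "SUE"},
]

# 1) tag each well, 2) merge, 3) drop incomplete samples, 4) encode the target.
combined = tag_well(lb07a, "LB-07A") + tag_well(lb08a, "LB-08A")
clean = drop_missing(combined)
encoded, mapping = encode_labels(clean)
```

Because the identifier is attached before merging, every surviving sample can later be traced back to its borehole, which is what makes the post-prediction reconstruction possible.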
Hyperparameter optimization
Choosing an appropriate MLA for a given task is one step in developing an effective ML model. The other step is to obtain an optimal architecture for the algorithm by tuning its hyperparameters. Hyperparameter tuning is the process of finding the set of hyperparameters that produces the most effective ML model for a given task. Hyperparameters are parameters whose values must be specified before the learning process35, and appropriate values are selected either by the ML practitioner or through an optimization process. GS is a hyperparameter optimization approach that exhaustively searches for the optimal combination of hyperparameters within a fixed domain36. It involves specifying a range of values for selected hyperparameters, after which the performance of the MLA is assessed for every combination of these values. The following steps outline the GS hyperparameter tuning process.
1. Define a set of hyperparameters \(H\) and a range of values for each. Let \({h}_{i}\) represent a specific hyperparameter, \({h}_{i}^{1},{h}_{i}^{2},\ldots ,{h}_{i}^{{k}_{i}}\) represent the range of values for the hyperparameter \({h}_{i}\), and \({k}_{i}\) represent the number of values for \({h}_{i}\).
2. Construct a grid containing all combinations of hyperparameter values. The grid represents the entire search space. For \(n\) hyperparameters, the total number of grid points is \({k}_{1}\times {k}_{2}\times \dots \times {k}_{n}\). Each grid point is denoted as \(p\).
3. For each \(p\), set the MLA’s hyperparameter values, then train the algorithm on the training dataset and evaluate it on the test dataset using repeated stratified k-fold cross-validation to obtain the accuracy of the model.
4. Select the combination of hyperparameters that produces the best performance according to the evaluation metric.
GS is intuitive and easy to implement. However, as the hyperparameter space expands with the inclusion of more hyperparameters and values, GS suffers from the curse of dimensionality36: because the number of grid points is the product of the \({k}_{i}\), it grows multiplicatively with every added hyperparameter, and so does the cost of evaluating the grid.
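The four GS steps, and the multiplicative growth of the grid, can be sketched with the Python standard library alone. The search space and the scoring function below are hypothetical placeholders: the values are not those searched in this study, and `toy_score` merely stands in for the mean cross-validated accuracy that step 3 would compute.

```python
from itertools import product
from math import prod


def build_grid(search_space):
    """Step 2: enumerate every combination of hyperparameter values."""
    names = list(search_space)
    value_lists = [search_space[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


def grid_search(search_space, score_fn):
    """Steps 3 and 4: score every grid point and keep the best one."""
    best_point, best_score = None, float("-inf")
    for point in build_grid(search_space):
        score = score_fn(point)
        if score > best_score:
            best_point, best_score = point, score
    return best_point, best_score


# Step 1: an illustrative search space (assumed values, not the paper's).
search_space = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 5, 10, None],
    "min_samples_split": [2, 5, 10],
}


def toy_score(point):
    """Stand-in for the cross-validated accuracy of a model built from `point`."""
    score = 0.70
    if point["criterion"] == "entropy":
        score += 0.05
    if point["max_depth"] == 5:
        score += 0.10
    return score - 0.01 * point["min_samples_split"]


grid = build_grid(search_space)
n_points = prod(len(values) for values in search_space.values())  # k1 * k2 * k3
best_point, best_score = grid_search(search_space, toy_score)
```

With 2, 4, and 3 candidate values the grid already holds 24 points; adding a fourth hyperparameter with five values would raise this to 120, each requiring a full round of cross-validation.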
Description of machine-learning algorithms
Four MLAs were used in this study: Decision Tree (DT), Random Forest (RF), Logistic Regression (LR), and K Nearest Neighbors (KNN). These four MLAs were selected because they are well established, their decision-making processes are easy to understand, and they have been applied extensively across geophysical tasks. This section describes each of the four ML models and gives an overview of their hyperparameters.
The structure of a DT is presented in Fig. 10. The DT algorithm is a tree-structured classifier in which features are represented as internal or decision nodes, decision rules are represented as branches, and results are represented as leaf nodes. Branches (or decision rules) represent the chance outcomes that emanate from the root node, and they are used to create the hierarchy of the tree. DT is a powerful algorithm used in various lithology classification tasks4,37,38. A DT can be regarded as a deterministic algorithm for deciding which variable to test next, based on the previously tested variables and the results of their evaluation, until the function’s value can be determined. Six DT hyperparameters were optimized in this study: criterion (assesses split quality), splitter (chooses the splitting strategy: best or random), max depth (prevents overfitting by limiting tree depth), min samples split (determines the number of data points required to split a decision node), min samples leaf (sets the minimum number of samples per leaf node), and max features (reduces overfitting by limiting the features considered for splits).
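To make the role of the criterion hyperparameter concrete, the sketch below computes two widely used split-quality measures, Gini impurity and entropy, and the impurity reduction achieved by a candidate split. It is an illustrative assumption that these are the measures of interest; the text does not state which criterion values were searched, and the function names and toy labels are hypothetical.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy (in bits) of the class proportions."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())


def split_gain(parent, left, right, impurity=gini):
    """Impurity of the parent node minus the size-weighted impurity of its children."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted


# Toy node with two lithologies, split perfectly into pure children.
parent = ["MGW", "MGW", "SUE", "SUE"]
left, right = ["MGW", "MGW"], ["SUE", "SUE"]

gini_gain = split_gain(parent, left, right, impurity=gini)
entropy_gain = split_gain(parent, left, right, impurity=entropy)
```

A tree-growing procedure would evaluate such a gain for each candidate split and keep the largest one; max depth, min samples split, and min samples leaf then decide when to stop splitting.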

RF is an ensemble ML algorithm that uses a group of DTs. Each tree depends on a random vector that is sampled independently and with the same distribution for all trees in the forest39. As the number of trees increases, the generalization error of the RF algorithm converges almost surely to a limit. The generalization error of RF classifiers is influenced by the strength of the individual trees and the correlation between them. Figure 11 shows a schematic illustration of the RF algorithm. The main idea behind RF is to use multiple uncorrelated DT models to predict a label for each instance. Unlike a single DT, which selects the best feature out of all the features and uses the entire dataset, the trees in an RF do not have access to all features and data. RF uses feature randomness and bagging to ensure that each DT is not correlated with the other DTs. With feature randomness, each tree selects the best feature out of a random subset of features. The bagging method ensures that each tree is trained on a different random sample of the training data. RF is a highly versatile MLA that delivers precise predictions across diverse applications, while also enabling feature importance assessment during model training and facilitating the computation of pairwise proximity between samples40. RF has been applied in several areas such as bioinformatics40, subsidence susceptibility assessment41, lithology classification4, fault detection42, and facies and fracture prediction43. Six RF hyperparameters were optimized in this study: criterion (assesses split quality), n estimators (specifies the number of DTs in the forest), max depth (prevents overfitting by limiting tree depth), min samples split (determines the number of data points required to split a decision node), min samples leaf (sets the minimum number of samples per leaf node), and max features (reduces overfitting by limiting the features considered for splits).

A schematic illustration of a random forest algorithm.
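The two sources of randomness in RF, and the final vote, can be illustrated with a short sketch. It shows only the sampling and aggregation logic, not tree training, and the function names are hypothetical. Note also, as an addition to the description above, that common implementations draw the random feature subset afresh at each split rather than once per tree.

```python
import random
from collections import Counter


def bootstrap_sample(n_samples, rng):
    """Bagging: draw sample indices with replacement, same size as the data."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def random_feature_subset(n_features, max_features, rng):
    """Feature randomness: candidate features drawn without replacement."""
    return sorted(rng.sample(range(n_features), max_features))


def majority_vote(tree_predictions):
    """Aggregate the labels predicted by the individual trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(42)            # fixed seed so the sketch is repeatable
rows = bootstrap_sample(10, rng)   # indices one tree would be trained on
features = random_feature_subset(n_features=5, max_features=2, rng=rng)
forest_label = majority_vote(["SUE", "MGW", "SUE"])  # three hypothetical trees
```

Because each tree sees a different bootstrap sample and a restricted set of candidate features, the trees make partly independent errors, which the vote then averages out.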
LR is an extension of the linear regression method. While linear regression models continuous outcomes and assumes a linear relationship between the outcome and the independent variables, logistic regression is used for binary classification tasks, predicting the probability of a certain outcome occurring44. In multi-class prediction tasks, such as identifying different lithologies, LR is adapted to handle multiple classes. One adaptation is the one-vs-rest (one-vs-all)45 approach, in which a separate LR model is trained for each class and an input is assigned to the class whose model returns the highest probability. Another is multinomial (softmax) LR, in which a single LR model with multiple output classes is trained. LR estimates the conditional probability, denoted as \(Pr\left(G|X\right)\), where \(G\) represents the target class and \(X\) represents the input features. One advantage of LR is its interpretability, which is a desired feature in many ML applications. Three LR hyperparameters were considered in this study: penalty, C, and max iter. The penalty hyperparameter specifies the type of regularization (l1, l2, elasticnet, or None) used to help prevent overfitting. C controls the trade-off between fitting the training data and generalization. Max iter sets the maximum number of iterations allowed for the solvers to converge.
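The difference between the one-vs-rest and multinomial adaptations can be sketched as follows. The per-class linear scores are assumed to have been computed already (for example as a weighted sum of the features plus an intercept); the score values and function names here are hypothetical.

```python
from math import exp


def sigmoid(z):
    """Logistic function: maps a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + exp(-z))


def softmax(scores):
    """Normalize a list of scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_one_vs_rest(scores, classes):
    """Each class has its own binary model; pick the most confident one."""
    probabilities = [sigmoid(s) for s in scores]
    return classes[probabilities.index(max(probabilities))]


def predict_multinomial(scores, classes):
    """A single model yields one probability distribution over all classes."""
    probabilities = softmax(scores)
    return classes[probabilities.index(max(probabilities))]


classes = ["MGW", "SSP", "MLB", "PLB", "SUE"]
scores = [0.2, -1.0, 1.5, 0.0, -0.3]  # made-up linear scores for one sample

ovr_label = predict_one_vs_rest(scores, classes)
multinomial_label = predict_multinomial(scores, classes)
multinomial_probs = softmax(scores)
```

In the one-vs-rest case the five probabilities are produced independently and need not sum to 1, whereas the softmax output is a proper distribution over the lithology classes. In practice the two schemes are also trained differently, so their fitted scores, and hence their predictions, can differ.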
The graph of a logistic function (Fig. 12) represents the relationship between the input and the probability of the input belonging to a certain class. The logistic function converts the linear combination of input variables into a probability between 0 and 1. The x-axis of the graph represents the values of the input variables, whereas the y-axis represents the predicted probability of the outcome variable being in one of the classes.

Logistic regression graph.
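For reference, the logistic function plotted in such a graph can be written, for the binary case, as below. The coefficient symbols \(\beta_0, \beta_j\) and the feature count \(p\) are notation introduced here for illustration and do not appear elsewhere in the text.

```latex
\[
  z = \beta_0 + \sum_{j=1}^{p} \beta_j x_j ,
  \qquad
  \Pr\left(G = 1 \mid X = x\right) = \sigma(z) = \frac{1}{1 + e^{-z}}
\]
```

The function equals 0.5 at \(z = 0\) and approaches 0 and 1 as \(z\) tends to \(-\infty\) and \(+\infty\), respectively, which gives the curve its characteristic S shape.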
The KNN algorithm is straightforward: the class of a given input is the majority class among its k nearest neighbors, where k is a positive integer. In regression with KNN, the value of a given input is simply the average of the values of its k nearest neighbors. The distance from an input to its neighbors is typically measured using the Euclidean distance. KNN is non-parametric and lazy25, meaning that it makes no assumption about the data distribution and requires no training before prediction. It memorizes the training dataset and predicts based on local patterns. However, its performance depends on k and the distance metric25, and it can be computationally expensive for large datasets.
Figure 13 illustrates the feature space of the KNN algorithm. The circular boundary contains training data inputs closest to the x input awaiting prediction. The x input will be classified as a square because, among the three closest shapes, two are squares. Three KNN hyperparameters were optimized in this study: k, weights, and algorithm. The k hyperparameter determines the number of neighbors to consider when making predictions. The weights hyperparameter determines the weight given to each neighbor when making predictions. It can be set to either ‘uniform’, where all neighbors are weighted equally, or ‘distance’, where closer neighbors have more influence on the prediction. Finally, the algorithm hyperparameter specifies the algorithm used to compute the nearest neighbors.
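The effect of the k and weights hyperparameters can be seen in a small from-scratch sketch. The training points are made up, the function name is hypothetical, and the small constant added to the distance is a simplification to avoid division by zero, not how any particular library handles exact matches.

```python
from collections import Counter
from math import dist


def knn_predict(train_X, train_y, x, k=3, weights="uniform"):
    """Classify x by a (possibly distance-weighted) vote of its k nearest neighbors."""
    # Sort the training points by Euclidean distance to x and keep the k closest.
    neighbors = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], x))[:k]
    votes = Counter()
    for point, label in neighbors:
        if weights == "uniform":
            votes[label] += 1.0
        else:
            # 'distance': closer neighbors count more (inverse-distance weight).
            votes[label] += 1.0 / (dist(point, x) + 1e-12)
    return votes.most_common(1)[0][0]


# Made-up two-dimensional feature space with two classes.
train_X = [(1.0, 0.0), (0.0, 1.5), (0.5, 0.0), (3.0, 3.0), (4.0, 0.0)]
train_y = ["square", "square", "triangle", "triangle", "triangle"]
x = (0.0, 0.0)

label_uniform = knn_predict(train_X, train_y, x, k=3, weights="uniform")
label_distance = knn_predict(train_X, train_y, x, k=3, weights="distance")
label_k5 = knn_predict(train_X, train_y, x, k=5, weights="uniform")
```

With k = 3 and uniform weights, two of the three nearest neighbors are squares, so the input is labeled a square, as in the figure. Switching to distance weighting, or raising k to 5, changes the outcome to a triangle, which is why these hyperparameters are worth tuning.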

Training and evaluation
The computational experiments presented in this paper were conducted in Python with the scikit-learn library. Given the constraints of limited and imbalanced well-log datasets, it was impractical to use one well-log for training and the other for evaluating the trained model. Instead, the LB-07A and LB-08A well-logs were combined to enhance the robustness of the ML-based lithology identification models. This amalgamation served a dual purpose: it increased the overall dataset size and improved the generalizability of the MLAs by exposing them to a more comprehensive range of lithologies. Subsequently, the hyperparameters of each MLA were optimized using GS combined with repeated stratified k-fold cross-validation. The average overall accuracy, recall, precision, and F1 scores of the optimized ML models were obtained for evaluation, as well as their confusion matrices for further lithology-specific evaluation. Finally, the lithology predictions of the best-performing model were plotted alongside the actual lithology. The inferior quality and class imbalance of the LB-07A and LB-08A logs rendered traditional evaluation methods impractical. To overcome this, the predictions for each well-log were plotted by first saving the predictions from the repeated stratified k-fold cross-validation on the combined dataset and then reconstructing the LB-07A and LB-08A logs from these saved predictions.
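The evaluation and reconstruction procedure can be sketched as follows. This is a simplified stand-in under several assumptions: the fold assignment is a basic round-robin stratification rather than a library routine, the "model" is a placeholder that always predicts the majority training class, and combining the predictions from different repeats by voting is an assumption, since the text does not state how repeated predictions were merged.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, n_splits, rng):
    """Assign sample indices to folds so that each fold keeps the class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)
    return folds


def fit_majority_model(train_labels):
    """Placeholder model: always predicts the most common training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda _features: majority


def out_of_fold_predictions(X, y, n_splits=2, n_repeats=3, seed=0):
    """Repeated stratified k-fold: save one held-out prediction per sample per repeat."""
    rng = random.Random(seed)
    saved = defaultdict(list)
    for _ in range(n_repeats):
        for test_indices in stratified_folds(y, n_splits, rng):
            held_out = set(test_indices)
            train_labels = [y[i] for i in range(len(y)) if i not in held_out]
            model = fit_majority_model(train_labels)
            for i in test_indices:
                saved[i].append(model(X[i]))
    # Assumed aggregation: majority vote over the repeats for each sample.
    return [Counter(saved[i]).most_common(1)[0][0] for i in range(len(y))]


def split_by_well(wells, values):
    """Rebuild per-well sequences using the dataset identifier of each sample."""
    per_well = defaultdict(list)
    for well, value in zip(wells, values):
        per_well[well].append(value)
    return dict(per_well)


# Made-up combined dataset: identifiers, dummy features, imbalanced labels.
wells = ["LB-07A"] * 4 + ["LB-08A"] * 4
X = list(range(8))
y = ["MGW", "MGW", "MGW", "SUE", "MGW", "SUE", "MGW", "MGW"]

predictions = out_of_fold_predictions(X, y)
accuracy = sum(p == t for p, t in zip(predictions, y)) / len(y)
per_well = split_by_well(wells, predictions)
```

In this toy example the placeholder model reaches 75% accuracy while never predicting the minority class, which illustrates why precision, recall, F1, and the confusion matrix are reported alongside overall accuracy for imbalanced logs.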
