The purpose of the study is to design a system that predicts SOC using soil parameters and topography. This model helps farmers make informed decisions to increase crop production. The interaction between biological and environmental factors determines the amount of nutrients present in the soil. A collection of attributes spanning topography, spectral indices, and climate is selected to predict SOC. The roadmap for the proposed methodology is shown in Figure 2.

Figure 2. Details of the proposed methodology workflow.
Research location
The study area considered in this study is the Dhamtari district of Chhattisgarh, India, as shown in Figure 3, which was created using data provided by NBSS, India, and QGIS 3.14.16. Located at an average elevation of 457 m above sea level, the district covers 4084 square kilometers; its latitude and longitude extents are shown in Figure 3. The area receives an average annual rainfall of 1221 mm. Paddy (rice) is the main crop grown in this area.

Figure 3. Details of the survey area generated using QGIS 3.14.16.
Data processing
Many biological and environmental factors, and their interactions, control the concentration of nutrients in the soil. Covariates such as topography, climate, and remote sensing indices must be selected to serve as potential predictors of soil properties. The process of creating a dataset is shown in Figure 4. It consists of three main steps:
- Preprocessing of raster images (satellite images)
- Geospatial data extraction
- Data fusion (with field observation)

Figure 4. The process of creating datasets for modeling.
The above-mentioned pre-processing method17 is applied prior to primary data analysis and soil information extraction. The preprocessing results are shown in flow-diagram form in Figure 5. Data extraction consists of two main steps: the first stacks the rasters to assemble predictor values, and the second samples the raster values at the field sampling locations. Data fusion is the process of combining the data obtained in the extraction step with field observations.
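The stacking, extraction, and fusion steps can be sketched with plain NumPy arrays standing in for the raster layers. All values, sample indices, and SOC observations below are hypothetical; a real workflow would read georeferenced GeoTIFFs with a geospatial library rather than random arrays.

```python
import numpy as np

# Hypothetical 5x5 "rasters" standing in for preprocessed covariate layers
# (e.g. elevation, slope, ndvi); real work would read aligned GeoTIFFs.
elevation = np.random.default_rng(0).uniform(400, 500, (5, 5))
slope = np.random.default_rng(1).uniform(0, 15, (5, 5))
ndvi = np.random.default_rng(2).uniform(-1, 1, (5, 5))

# Step 1: raster stacking -- align layers into one (bands, rows, cols) cube.
stack = np.stack([elevation, slope, ndvi])

# Step 2: extraction -- sample predictor values at field-sample pixel indices.
sample_rows = np.array([0, 2, 4])
sample_cols = np.array([1, 3, 0])
predictors = stack[:, sample_rows, sample_cols].T  # (n_samples, n_predictors)

# Step 3: data fusion -- join extracted predictors with field SOC observations.
soc_observed = np.array([0.52, 0.71, 0.43])  # hypothetical SOC values (%)
dataset = np.column_stack([predictors, soc_observed])
print(dataset.shape)  # (3, 4): three samples, three predictors + SOC target
```

The key design point is that all layers must share one grid and projection before stacking, so that a single (row, col) index addresses the same ground location in every band.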

Figure 5. Raster preprocessing results in a dataset for applying machine learning models to assess soil factors.
Soil health card data was used in this study. The data was checked for location errors, duplicates, missing values, incorrect values, and outliers. Block-by-block location matching was performed to eliminate missing or incorrectly recorded entries. Boxplots were used to identify and eliminate outliers. Records sharing the same location were treated as duplicates and excluded where necessary. The data was then converted into geodatabase spatial point data frames for further investigation.

Landsat-8 multispectral data18 and the SRTM DEM were used in the pre-processing and data extraction steps prior to modeling. Landsat-8 images were acquired between November and December 2019, and topographic variables were derived from the 30 m resolution SRTM DEM. Additionally, spatially resolved climate data (averaged over 20 years) at 1 km2 resolution were obtained from WorldClim. From the preprocessed raster images, four topographic variables, two soil-related remote sensing indices, and four climate variables were extracted; Table 1 provides details on these attributes. Topographic variables were extracted at a resolution of 30 m using the SRTM DEM. Climate data were resampled to 30 m to match the spatial resolution of the digital elevation model (DEM). All variables were projected to the WGS 84 UTM zone 44N coordinate system.
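The duplicate removal and boxplot-based outlier screening described above can be sketched with pandas. The records below are hypothetical stand-ins for soil health card entries; the 1.5 × IQR fences are the standard boxplot whisker rule.

```python
import pandas as pd

# Hypothetical soil health card records; real data comes from the field survey.
df = pd.DataFrame({
    "lat": [20.70, 20.71, 20.71, 20.72, 20.73],
    "lon": [81.55, 81.56, 81.56, 81.57, 81.58],
    "soc": [0.45, 0.62, 0.62, 9.90, 0.51],  # 9.90 is an implausible outlier
})

# Drop duplicate records collected at the same location.
df = df.drop_duplicates(subset=["lat", "lon"])

# Boxplot (IQR) rule: keep values inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = df["soc"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = df["soc"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
clean = df[mask]
print(len(clean))  # one duplicate and one outlier removed
```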
Plant- and soil-related spectral indices were selected to investigate the relationship between multispectral reflectance and soil properties. These indices were calculated from each image using reflectance values over different spectral bands, according to the equations listed below.
$$ndvi = \frac{\rho_{nir} - \rho_{r}}{\rho_{nir} + \rho_{r}}$$
(1)
$$savi = \frac{\left(\rho_{nir} - \rho_{r}\right) \times \left(1 + t\right)}{\rho_{nir} + \rho_{r} + t}$$
(2)
where \(\rho_{nir}\) is the near-infrared reflectance, \(\rho_{r}\) is the red band reflectance, and the soil adjustment coefficient t is set to 1.
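Equations (1) and (2) translate directly into array arithmetic. The reflectance values below are hypothetical pixel samples, not measurements from the study's Landsat-8 scenes.

```python
import numpy as np

# Hypothetical red and near-infrared reflectance pixels
# (for Landsat-8 these would come from bands 4 and 5).
rho_r = np.array([0.10, 0.08, 0.12])
rho_nir = np.array([0.40, 0.35, 0.30])

t = 1.0  # soil adjustment coefficient, as specified in the text

ndvi = (rho_nir - rho_r) / (rho_nir + rho_r)                 # Eq. (1)
savi = (rho_nir - rho_r) * (1 + t) / (rho_nir + rho_r + t)   # Eq. (2)

print(np.round(ndvi, 3))
print(np.round(savi, 3))
```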
Modeling methods
The two methodologies dominating the machine learning field are ensemble learning and deep learning21,22,23. Ensemble learning combines two or more machine learning algorithms into an ensemble that performs better than the individual algorithms used alone. Rather than relying on a single model, combination rules aggregate the predictions made by individual learners into a single, more accurate prediction. Compared to individual base learners, ensemble learning methods train many base learners and combine their predictions to achieve superior performance and greater generalization ability24. Single machine learning algorithms have many limitations, such as producing highly variable, biased, and inaccurate models25,26, and many studies have shown that ensemble models typically perform more accurately than a single ML model27. The variance and bias errors of a single machine learning model can be reduced via an ensemble approach; bagging, for example, reduces variance without increasing bias28,29,30. Ensemble learning can also reduce the likelihood of overfitting because various baseline models are available. Boosting, bagging, and stacking are the three main categories into which ensemble learning techniques are divided31.
Boosting is an ensemble meta-algorithm that reduces variance and bias. Specifically, the boosting method trains weak learners on the input data and uses the predictions made by those learners to select incorrectly predicted training samples. The resulting strong learners achieve excellent accuracy and form the basis of boosting ensemble algorithms26. Weak learners are learners that perform only slightly better than random guessing; boosting can turn weak learners into strong learners. The basic principle of boosting is to repeatedly apply a base learning algorithm to modified versions of the input data24,32. Examples of boosting include gradient boosting (XGBoost)29 and adaptive boosting (AdaBoost)33.
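A minimal sketch of boosting with scikit-learn follows. The synthetic covariates and the model settings are assumptions for illustration, not the study's actual data or configuration from Table 3.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical stand-in for the SOC dataset: 100 samples, 3 covariates
# (think elevation, slope, ndvi) with a noisy linear target.
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, (100, 3))
y = 0.5 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 0.02, 100)

# Gradient boosting fits shallow trees sequentially, each one trained on
# the residual errors of the ensemble built so far.
model = GradientBoostingRegressor(n_estimators=100, max_depth=2, random_state=0)
model.fit(X, y)
print(round(model.score(X, y), 2))  # training-set r^2
```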
According to34, bagging is a type of ensemble learning that combines predictions from randomly generated training sets to improve the predictive capability of machine learning models. Bagging has the advantage of efficiently reducing variance without increasing bias. Examples of this technique include random forests35 and extra-trees classifiers.
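The two bagging-family learners named above can be sketched with scikit-learn on the same kind of hypothetical synthetic data (regression variants are used here, since SOC is a continuous target):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor

# Hypothetical synthetic stand-in data, as in the boosting sketch.
rng = np.random.default_rng(7)
X = rng.uniform(0, 1, (100, 3))
y = 0.5 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 0.02, 100)

# Bagging: each tree trains on a bootstrap resample of the training set and
# predictions are averaged, which lowers variance without raising bias.
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
et = ExtraTreesRegressor(n_estimators=100, random_state=0).fit(X, y)
print(round(rf.score(X, y), 2), round(et.score(X, y), 2))
```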
Stacking is an ensemble learning framework that aggregates the predictions of two or more ensemble members by training an additional machine learning algorithm. Wolpert36 developed it to reduce generalization errors in machine learning problems. Stacking can be useful when several machine learning models are each particularly skilled at a specific task. As discussed in37, stacking strategies use different machine learning models to determine when to utilize predictions from different models. According to the results of38 and39, meta-learning is a branch of machine learning in which an algorithm is trained on the outputs of other machine learning algorithms, producing predictions that exceed the accuracy of the base learners. This research study assessed all three categories of ensemble learning; details of the base learners in each category are shown in Table 2.
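Stacking can likewise be sketched with scikit-learn's `StackingRegressor`. The choice of base learners and the ridge meta-learner below are illustrative assumptions; the study's actual base learners are those listed in Table 2.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor

# Hypothetical synthetic stand-in data, as in the earlier sketches.
rng = np.random.default_rng(3)
X = rng.uniform(0, 1, (120, 3))
y = 0.5 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 0.02, 120)

# Stacking: base learners produce out-of-fold predictions that a meta-learner
# (here a ridge regression) combines into the final prediction.
stack = StackingRegressor(
    estimators=[
        ("tree", DecisionTreeRegressor(max_depth=3, random_state=0)),
        ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
    ],
    final_estimator=Ridge(),
)
stack.fit(X, y)
print(round(stack.score(X, y), 2))
```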
Hyperparameter settings
Hyperparameters directly modulate the behavior of training algorithms, which makes their careful selection important. The machine learning models were trained on the training dataset using these hyperparameters. Details of the hyperparameters considered are given in Table 3.
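One common way to select such hyperparameters is a cross-validated grid search; a minimal scikit-learn sketch follows. The grid values below are hypothetical placeholders, not the actual settings from Table 3.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Hypothetical synthetic stand-in data.
rng = np.random.default_rng(11)
X = rng.uniform(0, 1, (100, 3))
y = 0.5 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 0.02, 100)

# Hypothetical grid -- the study's actual candidate values are in Table 3.
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 4]}
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)  # best combination found by 3-fold CV
```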
Model evaluation
Root mean square error (RMSE), symmetric mean absolute percentage error (SMAPE), and the coefficient of determination (r2) were used as performance indicators. r2 is a statistical metric used to evaluate the goodness of fit of a regression model; it is useful for assessing the overall effectiveness and explanatory ability of the model. The coefficient of determination r2 is interpreted as the percentage of the variance of the dependent variable that is predictable from the independent variables40. RMSE is a measure of the mean difference between predicted and actual values in a dataset; it is a statistical measure that estimates the error deviation41. SMAPE is an evaluation metric based on percentage (or relative) errors42.
$$r^{2} = 1 - \frac{\mathop \sum \nolimits_{i = 1}^{m} \left({p_{i} - o_{i}}\right)^{2}}{\mathop \sum \nolimits_{i = 1}^{m} \left({\overline{o} - o_{i}}\right)^{2}}$$
(3)
$$rmse = \sqrt{\frac{1}{m}\mathop \sum \limits_{i = 1}^{m} \left({p_{i} - o_{i}}\right)^{2}}$$
(4)
$$smape = \frac{100}{m}\mathop \sum \limits_{i = 1}^{m} \frac{\left|{p_{i} - o_{i}}\right|}{\left({\left|{o_{i}}\right| + \left|{p_{i}}\right|}\right)/2}$$
(5)
Here, m is the number of observations, \(p_{i}\) is the predicted value of the ith sample, \(o_{i}\) is the actual value of the ith sample, and \(\overline{o}\) is the average of the actual values.
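Equations (3)–(5) can be implemented directly in NumPy. The observed and predicted values below are hypothetical examples used only to exercise the functions.

```python
import numpy as np

def r2(o, p):
    """Coefficient of determination, Eq. (3)."""
    return 1 - np.sum((p - o) ** 2) / np.sum((o.mean() - o) ** 2)

def rmse(o, p):
    """Root mean square error, Eq. (4)."""
    return np.sqrt(np.mean((p - o) ** 2))

def smape(o, p):
    """Symmetric mean absolute percentage error, Eq. (5)."""
    return 100 / len(o) * np.sum(np.abs(p - o) / ((np.abs(o) + np.abs(p)) / 2))

# Hypothetical observed vs. predicted SOC values.
o = np.array([0.50, 0.60, 0.40, 0.55])
p = np.array([0.48, 0.63, 0.42, 0.53])
print(round(rmse(o, p), 4), round(r2(o, p), 4), round(smape(o, p), 2))
```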
