Estimation of soil free iron content using spectral reflectance and machine learning algorithms

Study area and soil sampling sites

The study area is located in Zhejiang Province (27.4°-30.0° N, 118.4°-122.0° E) in southeastern China (Fig. 1a). The total area is 16,850 km², and the altitude is 58.45 m above sea level. The area has a humid subtropical climate with an average annual temperature of 18.4 °C, approximately 1395.3 mm of precipitation, and 150 rainy days each year³⁹. Based on land cover and topography, 135 sites were selected for soil sampling (Fig. 1a). The location of each sampling site was recorded by GPS, and four samples were taken from different soil horizons at depths of 0-80 cm using a bamboo shovel. Each sample weighed approximately 1 kg and was a composite of five subsamples. In total, 540 soil samples were obtained.

Free iron content measurement

Free iron in the soil samples was extracted using the dithionite-citrate-bicarbonate (DCB) method⁴⁰. The procedure includes the following steps: (1) Sample preparation: air-dry the soil samples and remove large particles and organic matter. (2) Reagent preparation: prepare a DCB reagent consisting of sodium dithionite, sodium citrate, and sodium bicarbonate. (3) Extraction: weigh a specific amount of soil (usually 1-2 g), add it to a flask containing the DCB reagent, heat the mixture for 15 minutes to ensure complete reaction, cool the mixture, and separate the soil residue from the solution by filtering. (4) Iron quantification: measure the concentration of iron in the solution using a spectrophotometer.

Statistics for the soil free iron content dataset are shown in Table 1. The soil free iron content of the total, training, and test samples ranged over 4.07-60.3 g/kg, 4.49-60.3 g/kg, and 4.07-56.44 g/kg, respectively. The median soil free iron content of the three datasets was 19.96 g/kg, 19.99 g/kg, and 19.82 g/kg, respectively. Q1 (first quartile) and Q3 (third quartile) of the training and test datasets are very close to those of the entire dataset. The distribution parameters, i.e., skewness and kurtosis, are 1.13, 1.16, and 1.06 and 1.09, 1.13, and 1.06 for the total, training, and test data, respectively. This shows that the training and test data represent the total data well.
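The summary statistics reported above can be reproduced along the following lines. This is a minimal sketch, not the authors' procedure: the function name `describe` is hypothetical, the moments are population moments, kurtosis is reported as excess kurtosis, and the quartiles use the "inclusive" interpolation rule. Statistical packages differ on all three choices, so values may deviate slightly from Table 1.

```python
import statistics


def describe(values):
    """Return range, quartiles, skewness and excess kurtosis of a sample.

    Assumptions (not stated in the text): population moments (divide by n),
    excess kurtosis (normal distribution = 0), inclusive quartile rule.
    """
    n = len(values)
    mean = statistics.fmean(values)
    # Central moments of order 2, 3 and 4.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2 - 3.0,
    }
```

Calling `describe` on the total, training, and test iron contents and comparing the resulting dictionaries is one way to check that the subsets have similar distributions.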

Table 1 Statistics of the total, training, and test datasets in the experiment.

Soil reflectance measurement

To minimize the effect of soil moisture and particle size on spectral measurements, all samples were air-dried and sieved to 0.25 mm. Before reflectance measurement, soil samples were placed in culture dishes 5.0 cm in diameter and 1.0 cm deep. Each dish was filled with soil, and the soil surface was leveled flush with the rim of the dish. Reflectance was measured using an ASD (Analytical Spectral Devices) FieldSpec 3 portable spectrometer (Malvern Panalytical Ltd, Malvern, UK) (Fig. 1b). Illumination was provided by a halogen light source, and the probe had a 25° field of view. The probe was placed approximately 2 cm above the soil surface. The spectrometer's wavelength range was 350-2500 nm, with a spectral resolution of 3 nm from 350-1000 nm and 10 nm from 1000-2500 nm. The spectrometer was calibrated against a white reference panel before each sample measurement. To reduce error, each sample was measured three times, and 10 spectral curves were averaged per measurement to obtain a representative spectrum.
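The averaging of replicate curves can be sketched as below. The text states that 10 curves were averaged per measurement; averaging the three measurements into one spectrum per sample is an assumption made here for illustration, and the function names `mean_spectrum` and `sample_spectrum` are hypothetical.

```python
def mean_spectrum(spectra):
    """Element-wise mean of reflectance curves on the same wavelength grid."""
    if not spectra:
        raise ValueError("need at least one spectrum")
    length = len(spectra[0])
    if any(len(s) != length for s in spectra):
        raise ValueError("all spectra must share the same wavelength grid")
    # zip(*spectra) groups the reflectance values band by band.
    return [sum(band) / len(spectra) for band in zip(*spectra)]


def sample_spectrum(measurements):
    """Average the curves within each measurement, then across measurements.

    `measurements` is a list of measurements, each a list of curves.
    Averaging across measurements is an assumption, not stated in the text.
    """
    return mean_spectrum([mean_spectrum(curves) for curves in measurements])
```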

Pre-processing and analysis of raw spectra

The soil spectral reflectance data were trimmed to the 400-2400 nm range to remove noisy bands. The original spectra were then preprocessed with first derivative (FD), standard normal variate (SNV), and continuum removal (CR) transformations. SNV normalizes each spectrum by subtracting its own mean and dividing by its own standard deviation. CR emphasizes spectral absorption features and can be seen as a form of albedo normalization. Because reflectance spectra contain large amounts of data that complicate analysis, it is necessary to reduce the data volume and select appropriate spectral variables for constructing the soil free iron content estimation model. Pearson correlation analysis and principal component analysis (PCA) were applied to reduce the dimensionality of the spectral data.
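Two of the transformations described above are simple enough to sketch directly. SNV follows the definition in the text. The first derivative is shown as a plain forward difference, which is an assumption: the software used in the study may apply a smoothing derivative instead. The function names and the `step_nm` parameter are hypothetical, and CR is not covered here.

```python
import statistics


def snv(spectrum):
    """Standard normal variate: centre and scale one spectrum by its own
    mean and (sample, n - 1) standard deviation."""
    mean = statistics.fmean(spectrum)
    sd = statistics.stdev(spectrum)
    return [(r - mean) / sd for r in spectrum]


def first_derivative(spectrum, step_nm=1.0):
    """Forward finite difference; the result is one band shorter.

    `step_nm` is the wavelength spacing between adjacent bands.
    """
    return [(b - a) / step_nm for a, b in zip(spectrum, spectrum[1:])]
```

After SNV each spectrum has zero mean and unit standard deviation, which removes additive and multiplicative brightness differences between samples.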

Building an estimation model

Ten-fold cross-validation was used to select the most appropriate model among the candidate models. The dataset, consisting of all soil samples, was divided into two parts using a stratified sampling method. The training set comprised 70% of the total data (i.e. 360 samples) and was used to develop the estimation models. The test set comprised 30% of the total data (i.e. 180 samples) and was used to test model performance. Soil free iron content estimation models were constructed from the full original spectral reflectance and its transformations (FD, SNV, and CR). In the models using Pearson correlation, bands of the original, FD, SNV, and CR spectra with correlation coefficients greater than 0.400 were selected as model inputs. In the models using PCA, principal components (PCs) with eigenvalues greater than 1.00 were selected as input variables.
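The correlation-based band selection and the stratified split can be illustrated as follows. This is a sketch under stated assumptions: the threshold is applied to the absolute value of the correlation coefficient, and the strata are equal-sized bins of sorted iron content. The text specifies neither detail, and all function and parameter names are hypothetical.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient; returns 0.0 for a constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    denom = math.sqrt(sxx * syy)
    return sxy / denom if denom > 0 else 0.0


def select_bands(spectra, iron, threshold=0.400):
    """Indices of bands whose |r| with iron content exceeds the threshold.

    `spectra` holds one list of band values per sample.
    """
    bands = zip(*spectra)  # regroup: one tuple of values per band
    return [j for j, band in enumerate(bands)
            if abs(pearson_r(band, iron)) > threshold]


def stratified_split(iron, test_fraction=0.3, n_strata=5, seed=0):
    """Split sample indices into train and test sets, drawing the same
    fraction from each stratum of sorted iron content."""
    rng = random.Random(seed)
    order = sorted(range(len(iron)), key=lambda i: iron[i])
    size = math.ceil(len(order) / n_strata)
    train, test = [], []
    for start in range(0, len(order), size):
        stratum = order[start:start + size]
        rng.shuffle(stratum)
        k = round(len(stratum) * test_fraction)
        test.extend(stratum[:k])
        train.extend(stratum[k:])
    return sorted(train), sorted(test)
```

Sampling within strata keeps the range and shape of the iron content distribution similar in both subsets, which is the property the dataset statistics are meant to confirm.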

Partial least squares (PLS), support vector machine (SVM), random forest (RF), and deep neural network (DNN) algorithms were used to construct the soil free iron content estimation models. PLS reduces the dimensionality of spectral data while retaining the variance related to iron content. SVM constructs hyperplanes in a high-dimensional space to perform regression on spectral features. The Gaussian kernel function was selected for the SVM after 10-fold cross-validation. RF is an ensemble method that uses multiple decision trees to improve prediction accuracy; the outputs of the individual trees are combined into a single result.

A DNN can learn complex patterns from data with high accuracy. When properly trained, DNNs allow spectral data to be interpreted reliably. The DNN structure used is shown in Figure 2. The input layer receives the three types of selected inputs and is followed by four hidden layers. Hidden layers 1-4 have 256, 128, 64, and 32 neurons, respectively. The output layer gives the soil free iron content. The ReLU activation function was applied after each hidden layer. Additionally, a dropout layer with a ratio of 0.1 was placed after the first hidden layer to prevent overfitting. The network was trained with the Adam optimizer. The maximum number of training epochs was set to 500, and the mini-batch size was 32. The initial learning rate was set to 0.001 and was reduced by 10% every 100 epochs. Figure 3 shows a flow chart of the entire process from data collection to analysis and modeling.
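The learning-rate schedule and the size of the network can be made concrete with a short sketch. Two assumptions are made here: "reduced by 10%" is read as multiplying the rate by 0.9 at each step, and the layers are taken to be fully connected. The input dimension is not given in the text, so `n_inputs` is a hypothetical parameter.

```python
HIDDEN_LAYERS = (256, 128, 64, 32)  # neurons per hidden layer, from the text


def learning_rate(epoch, initial=0.001, drop=0.10, every=100):
    """Step decay: multiply the rate by (1 - drop) once per `every` epochs.

    Reading "reduced by 10%" as a factor of 0.9 is an assumption.
    """
    return initial * (1.0 - drop) ** (epoch // every)


def count_parameters(n_inputs, hidden=HIDDEN_LAYERS, n_outputs=1):
    """Weights plus biases of a fully connected network (assumed layout)."""
    sizes = (n_inputs, *hidden, n_outputs)
    return sum(a * b + b for a, b in zip(sizes, sizes[1:]))
```

Under this reading the rate stays at 0.001 for epochs 0-99, drops to 0.0009 for epochs 100-199, and so on until the 500-epoch limit.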

$$R^{2} = 1 - \frac{\sum\limits_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2}}{\sum\limits_{i=1}^{n} (y_{i} - \bar{y})^{2}}$$

(1)

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum\limits_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2}}$$

(2)

$$\mathrm{RRMSE} = \sqrt{\frac{\frac{1}{n}\sum\limits_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2}}{\sum\limits_{i=1}^{n} (y_{i})^{2}}} \times 100\%$$

(3)

Here $y_{i}$ and $\hat{y}_{i}$ are the measured and predicted soil free iron content of sample $i$, respectively, $\bar{y}$ is the mean measured soil free iron content, and $n$ is the number of observations. The coefficient of determination (R²), root mean square error (RMSE), and relative root mean square error (RRMSE) were used to evaluate training and test performance (Eqs. (1)-(3)). R² represents the proportion of the variance of the dependent variable explained by the independent variables. RMSE measures the average difference between the values predicted by a model and the actual values. RRMSE expresses this error in relative terms and allows the accuracy of different models to be compared.
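The three evaluation metrics can be computed as in the following sketch. R² and RMSE follow their standard definitions. For RRMSE, the mean squared error is normalised by the sum of squared measured values before taking the square root, which is one reading of Eq. (3); other definitions divide the RMSE by the mean measured value instead. Function names are hypothetical.

```python
import math


def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


def rmse(y, y_hat):
    """Root mean square error, in the units of y (here g/kg)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))


def rrmse(y, y_hat):
    """Relative RMSE in percent (normalisation is an assumed reading)."""
    mse = sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)
    return math.sqrt(mse / sum(a ** 2 for a in y)) * 100.0
```

A model that always predicts the mean of the measured values gives R² = 0, which is a convenient sanity check when comparing the PLS, SVM, RF, and DNN results.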

Two-tailed Pearson correlation analyses were performed in Excel 2022, and PCA was performed using IBM SPSS Statistics 25.0 (SPSS Inc., NY, USA, 2017). Correlation, scatter, fitted-line, and PCA plots were drawn in Origin 2022 (OriginLab Corporation, MA, USA, 2022). The FD and SNV transformations and the PLS algorithm were run in Unscrambler X 10.4 (Camo Software AS, 2016). The SVM, RF, and DNN algorithms were run in MATLAB R2022a (The MathWorks, Inc., CA, USA, 2022). The CR transformation of the spectral data was performed in ENVI 5.3 (ITT Visual Information Solutions, CO, USA, 2015).
