Influence of environmental variables and remote sensing data
This study demonstrates the critical role of remote sensing data in improving the prediction of soil properties across a heterogeneous landscape. By incorporating a combination of spectral indices and environmental predictors, such as TCB, Landsat bands (B7, B5, B3, and B4), and the Salinity Index (SI3), we identified strong correlations between these remote sensing-derived variables and soil properties like calcium carbonate (CaCO3) and sulfate (SO4).
These results support the hypothesis that spectral reflectance, especially in the SWIR region, is closely linked to surface soil chemistry in arid regions. The positive correlations observed between CaCO3 and remote sensing indices reflect the influence of soil composition on surface reflectance properties. We further emphasize that this correlation reflects not only mineral content but also surface conditions, which can amplify spectral brightness.
High levels of CaCO3 and SO4 tend to increase surface brightness, particularly in arid and semi-arid regions where bare soil or sparse vegetation dominates the landscape. This finding underscores the value of remote sensing in capturing these soil-environment interactions, aligning with previous studies that highlight the utility of remote sensing in digital soil mapping14,15,63.
Calcium carbonate (CaCO3) and sulfate (SO4) are bright minerals that significantly influence the reflectance of soil surfaces in these regions. Their presence enhances the brightness captured by remote sensing indices like TCB and SI3, which are reliable proxies for carbonate and sulfate concentrations. Landsat bands in the shortwave infrared (SWIR) region, such as B5 and B7, are particularly sensitive to the reflectance of these minerals. The absorption features within these bands align with the presence of CaCO3 and SO4, enabling precise detection and mapping of these minerals across the landscape. As a result, the high reflectance of CaCO3 and SO4 in these bands correlates strongly with the spectral signature captured by remote sensing tools64,65.
Furthermore, the identification of key predictors such as elevation, PCA1, and TVDI for soil calcium, and PCA2 and TCB for calcium carbonate, provides valuable insights into the environmental factors that govern soil variability. Elevation influences soil calcium distribution by shaping microclimatic conditions (e.g., precipitation and temperature) and controlling runoff and sediment deposition patterns. In low-elevation areas, depositional environments are more likely to accumulate calcium-rich sediments, whereas high-elevation zones may experience leaching due to increased rainfall. Consistent with these findings, a Random Forest study of arid and semi-arid landscapes identified altitude, annual mean rainfall, and the Normalized Difference Salinity Index (NDSI) among the most influential predictors of soil pH distribution21. These patterns have been corroborated in studies exploring topographic controls on soil nutrients66,67.
TVDI reflects moisture availability and vegetation stress, which directly influence soil properties in arid and semi-arid conditions. Soils in drier regions with stressed vegetation are often characterized by increased carbonate and sulfate concentrations due to reduced leaching and enhanced deposition of weathered materials. The strong correlation of TCB with calcium carbonate highlights its utility in detecting bright, carbonate-rich soils, particularly in landscapes dominated by bare surfaces6,68.
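For readers unfamiliar with the index, the sketch below shows one common per-pixel formulation of TVDI, in which land surface temperature is scaled between a wet edge and an NDVI-dependent dry edge. It is an illustration only: the function name, the linear dry edge, the constant wet edge, the clipping to [0, 1], and all numeric values are assumptions and are not taken from this study's processing chain.

```python
def tvdi(lst, ndvi, dry_intercept, dry_slope, wet_edge_lst):
    """Temperature-Vegetation Dryness Index for a single pixel.

    Assumed formulation: TVDI = (LST - LST_wet) / (LST_dry - LST_wet),
    where the dry edge is a linear function of NDVI
    (LST_dry = dry_intercept + dry_slope * NDVI) and the wet edge is
    treated as a constant. Values near 1 indicate dry conditions.
    """
    dry_edge_lst = dry_intercept + dry_slope * ndvi
    span = dry_edge_lst - wet_edge_lst
    if span <= 0:
        raise ValueError("dry edge must be warmer than the wet edge")
    value = (lst - wet_edge_lst) / span
    # Clip to [0, 1]; pixels outside the fitted edges are treated as saturated.
    return min(1.0, max(0.0, value))


# Hypothetical pixel: LST in kelvin, sparse vegetation (NDVI = 0.2).
example = tvdi(lst=310.0, ndvi=0.2, dry_intercept=325.0,
               dry_slope=-25.0, wet_edge_lst=295.0)
print(round(example, 3))
```

In practice the dry and wet edges would be fitted from the LST-NDVI scatter of the scene itself rather than supplied by hand.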
In regions with minimal vegetation cover, such as those dominated by bare soil, the spectral signal is primarily influenced by soil properties rather than vegetation. This helps explain the strong correlation between remote sensing indices and soil chemical properties observed in our study. Furthermore, surface crusting, which is common in soils rich in carbonates and sulfates, amplifies reflectance by reducing soil roughness and enhancing surface brightness. These characteristics make remote sensing indices even more effective in detecting and mapping soil properties in such environments.
These findings underscore the value of remote sensing data in capturing soil-environment interactions and improving the spatial resolution of soil property maps. By providing spatially extensive, consistent information, remote sensing data enhance the precision of digital soil mapping, supporting informed decision-making for sustainable agricultural practices and soil conservation. This is particularly critical in arid and semi-arid regions, where soil degradation poses significant challenges to land management and food security.
Moreover, the enhanced detection of CaCO3 using TCB and PCA2 underscores the power of principal component transformation in reducing dimensionality and isolating meaningful spectral patterns associated with specific soil constituents. These results are consistent with recent advances in digital soil mapping literature and confirm the potential of integrating spectral indices, topographic variables, and principal components to improve prediction accuracy in dryland regions69.
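To make the role of the principal component transformation concrete, the following sketch extracts the first principal component from a small table of band values using power iteration on the covariance matrix. It is a minimal illustration under stated assumptions: the function name and the synthetic reflectance values are hypothetical, and the study's own PCA1 and PCA2 layers may have been derived with different software and inputs.

```python
import math


def first_principal_component(rows, n_iter=200):
    """Return (loadings, scores, explained_fraction) for the first PC.

    `rows` is a list of samples (pixels), each a list of band values.
    Power iteration on the sample covariance matrix is used; it assumes
    the starting vector is not orthogonal to the leading eigenvector.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient gives the variance captured along the loading vector.
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p))
                     for i in range(p))
    total_variance = sum(cov[i][i] for i in range(p))
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
    return v, scores, eigenvalue / total_variance


# Synthetic reflectance values for three bands (illustrative only).
pixels = [
    [0.21, 0.30, 0.35],
    [0.25, 0.36, 0.41],
    [0.18, 0.27, 0.30],
    [0.30, 0.41, 0.49],
    [0.23, 0.31, 0.38],
]
loadings, scores, explained = first_principal_component(pixels)
print([round(x, 3) for x in loadings], round(explained, 3))
```

Because the bands in this toy table rise and fall together, most of the variance collapses onto a single brightness-like component, which is the behavior that makes principal components useful as compact predictors.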
Effectiveness of machine learning models
The integration of remote sensing data with machine learning algorithms has become a transformative approach in digital soil mapping, especially for predicting soil properties in complex and heterogeneous landscapes. In this study, among all tested models, the Ensemble model consistently outperformed others, delivering the highest R2 and the lowest RMSE and MAE across key soil properties such as calcium, calcium carbonate, calcium sulfate, and sulfate. The superior performance of the Ensemble model stems from its ability to aggregate predictions from diverse base learners, balancing individual model biases and reducing variance caused by overfitting. This synergy enables the model to capture both broad trends and localized variations in soil characteristics, resulting in more robust and accurate predictions70,71.
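The aggregation idea can be illustrated with a simple weighted blend of base-model predictions. The sketch below is not the ensemble procedure used in this study, whose aggregation scheme is not detailed here; inverse-RMSE weighting, the function names, and the toy numbers are all assumptions chosen to show how opposing errors from different learners can cancel.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(
        sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    )


def inverse_rmse_weights(observed, preds_by_model, eps=1e-12):
    """Weight each model by 1 / RMSE, normalized so the weights sum to 1."""
    inv = {name: 1.0 / (rmse(observed, p) + eps)
           for name, p in preds_by_model.items()}
    total = sum(inv.values())
    return {name: value / total for name, value in inv.items()}


def blend(preds_by_model, weights):
    """Weighted average of the base-model predictions, sample by sample."""
    n = len(next(iter(preds_by_model.values())))
    return [sum(weights[m] * preds_by_model[m][i] for m in preds_by_model)
            for i in range(n)]


# Toy validation data (illustrative only): two models with opposite errors.
observed = [10.0, 20.0, 30.0]
predictions = {
    "model_a": [12.0, 18.0, 33.0],
    "model_b": [8.0, 22.0, 27.0],
}
weights = inverse_rmse_weights(observed, predictions)
blended = blend(predictions, weights)
print(weights, blended, rmse(observed, blended))
```

In a real workflow the weights would be estimated on held-out data (for example, cross-validation folds) so that the blend is not tuned to the samples used to evaluate it.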
In contrast, traditional regression-based methods such as GLM and GAM showed limited performance. Their reliance on linear or semi-linear assumptions restricts their capacity to model complex, nonlinear interactions between soil properties and environmental predictors. For instance, relationships involving RS indices like TCB and SI3 often include threshold effects, interactions, and saturation points, dynamics poorly captured by linear approaches25. This limitation was reflected in their relatively low R2 values (e.g., 0.39 for Ca in GLM and 0.46 in GAM) and higher RMSE values, suggesting reduced suitability for DSM in heterogeneous regions.
Nonlinear models such as RF, CART, and SVR outperformed GLM and GAM, indicating their effectiveness in handling complex interactions and high-dimensional data. Among these, RF emerged as particularly powerful, achieving an R2 of 0.82 for Ca and maintaining consistently low RMSE and MAE scores. RF’s ensemble-based architecture, which combines multiple decision trees, helps minimize overfitting and improves generalization, while also offering insights into variable importance72,73. CART, though slightly less accurate than RF, remained valuable due to its interpretability and ability to highlight dominant predictors of soil variability.
SVR showed moderate performance relative to RF and the Ensemble model. While SVR is adept at managing high-dimensional spaces and tends to resist overfitting through regularization, its effectiveness is contingent on careful kernel and parameter tuning. Without optimal settings, SVR may underperform, as suggested by its lower R2 values and higher errors in this study74. Future research could enhance SVR performance by refining hyperparameters or integrating it within hybrid modeling frameworks.
The clear outperformance of the Ensemble model across all metrics underscores the advantage of model stacking and blending in soil prediction tasks. For example, the Ensemble model reached an R2 of 0.89 for Ca, notably higher than RF (R2 = 0.82) and SVR (R2 = 0.49), while also yielding the lowest RMSE and MAE values. These results reflect the model’s ability to combine the predictive strength of RF with the interpretability of simpler algorithms like CART, leading to a more balanced and resilient predictive system.
The variation in model performance highlights the critical importance of algorithm selection in environmental modeling. Soil properties are shaped by a complex set of biophysical factors including mineral composition, moisture availability, and topographic features that often interact in nonlinear and spatially variable ways. Ensemble methods and nonlinear models like RF are better equipped to capture such relationships, as supported by numerous DSM studies72,75. Furthermore, the use of RS data enhances the models’ predictive capacity by offering spatially detailed information on environmental processes.
These findings demonstrate the potential of advanced ML models, particularly in arid and semi-arid regions where sparse data and high environmental variability are major challenges76. The ability of RF and Ensemble models to handle such complexity, combined with the broad spatial reach of RS data, makes them indispensable tools for precision agriculture and sustainable land management.
Notably, nonlinear models such as RF and CART also provide valuable insights into variable importance, helping to identify key environmental drivers of soil properties77. This interpretability, along with high predictive accuracy, reinforces the growing consensus on the effectiveness of ensemble approaches in heterogeneous and data-limited settings.
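One model-agnostic way to quantify variable importance is permutation importance: shuffle one predictor at a time and record how much the prediction error grows. The sketch below illustrates that idea only; it is not necessarily the importance measure used in this study, and the function names, the stand-in model, and the data are hypothetical.

```python
import math
import random


def rmse(observed, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(
        sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    )


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in RMSE when each predictor column is shuffled.

    `predict` maps one row (list of predictor values) to a prediction.
    A larger increase means the model relies more on that predictor.
    """
    rng = random.Random(seed)
    baseline = rmse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            shuffled = [row[:j] + [value] + row[j + 1:]
                        for row, value in zip(X, column)]
            permuted_error = rmse(y, [predict(row) for row in shuffled])
            increases.append(permuted_error - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Stand-in "model" that uses only the first predictor (illustrative only).
def toy_model(row):
    return 2.0 * row[0]


X = [[1.0, 5.0], [2.0, 3.0], [3.0, 9.0], [4.0, 1.0], [5.0, 7.0], [6.0, 2.0]]
y = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
importance = permutation_importance(toy_model, X, y)
print(importance)
```

Here the second predictor receives zero importance because the stand-in model ignores it, while shuffling the first predictor degrades the predictions.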
It is also important to contextualize model error metrics. In this study, the mean observed value of Ca was 25.68 meq/L. The Ensemble model achieved an RMSE of 13.69 meq/L and an MAE of 8.15 meq/L, roughly 53% and 32% of the mean, respectively, indicating good predictive accuracy relative to the other models tested. Similar patterns were observed for other soil variables, further supporting the robustness of the Ensemble model. Generally, RMSE and MAE values approaching zero signify lower prediction error and improved model performance.
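The comparison of error metrics against the observed mean can be made explicit by expressing each metric as a fraction of that mean. The short sketch below does this with the Ca values reported above; the function name is hypothetical, and dividing by the mean is only one of several normalization conventions (the range or the standard deviation are also used).

```python
def relative_to_mean(metric_value, observed_mean):
    """Express an error metric as a fraction of the observed mean."""
    if observed_mean == 0:
        raise ValueError("observed mean must be non-zero")
    return metric_value / observed_mean


# Values reported in the text for Ca (meq/L), Ensemble model.
mean_ca = 25.68
rmse_ca = 13.69
mae_ca = 8.15

nrmse = relative_to_mean(rmse_ca, mean_ca)
nmae = relative_to_mean(mae_ca, mean_ca)
print(f"RMSE / mean = {nrmse:.3f}")  # prints 0.533
print(f"MAE / mean  = {nmae:.3f}")   # prints 0.317
```

Reporting the normalized values alongside the raw RMSE and MAE makes errors comparable across soil properties measured on different scales.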
Implications for soil management
The spatial patterns of soil properties revealed by this study have important implications for soil management, particularly in regions like the western sand dunes, where significant variations in calcium, calcium carbonate, calcium sulfate, and sulfate concentrations were observed. These variations are heavily influenced by local environmental conditions and underscore the need for tailored, localized soil management strategies. The Ensemble model’s ability to generate accurate soil property maps provides valuable information that can guide environmental interventions aimed at mitigating soil degradation and improving land productivity.
In areas with elevated levels of lime and gypsum, such as the western and southwestern regions of the study area, these soil characteristics are primarily driven by the underlying parent materials, including limestone and marl. The presence of evaporite minerals like calcite and gypsum is also influenced by the high groundwater table around the salt marsh. In Iran’s arid and semi-arid regions, where soils typically exhibit high pH values (greater than 7) and abundant calcium carbonate, these conditions can hinder plant growth by limiting the availability of nutrients such as phosphorus, iron, and zinc78,79. This further highlights the importance of accurate soil property mapping for improving agricultural practices and ensuring long-term environmental sustainability.
The heterogeneity of the study area’s geomorphological surfaces, which include high and medium elevations, sand dunes, gypsum plateaus, lowlands with high gravel proportions, glaciated areas, wetlands, playas surrounding the wetlands, floodplains, and a delta, adds to the complexity of soil property distribution. This diversity in landforms creates varied conditions for soil formation, contributing to the significant differences observed in soil composition and structure. The use of remote sensing data, combined with advanced machine learning models, offers a robust solution for capturing and predicting these intricate patterns of soil variability. Traditional methods may fail to account for this complexity, but the detailed maps produced through this study provide essential insights for managing diverse landscapes.
Study limitations and future research opportunities
Although the findings of this research are encouraging, several limitations should be acknowledged to guide future investigations. A key consideration is the use of climatic data. In this study, direct climatic variables such as precipitation and air temperature were not included. This decision was based on the relatively small size of the watershed, the limited climatic variability across the area, and the presence of only one climatological station. These factors reduced the feasibility and potential value of integrating coarse-resolution climatic datasets.
Nevertheless, relevant climate-related information was still incorporated through remote sensing–based indices such as LST and TVDI. These indices effectively captured spatial variations in surface temperature and drought stress, which are important drivers of soil processes in arid and semi-arid regions.
Furthermore, due to the extremely low and relatively uniform annual rainfall, which is generally less than 80 mm, and the strong influence of geomorphological and soil-forming factors such as parent material, salinity, and groundwater levels, indirect climate proxies were considered sufficient to address the study’s objectives.
However, for larger or more climatically diverse regions, future research should consider the integration of gridded climatic datasets such as CHELSA bioclimatic variables or downscaled data from meteorological stations. Including such data could potentially improve the accuracy and generalizability of soil prediction models.
Additionally, increasing the number of soil samples and the range of auxiliary variables could enhance the precision of spatial distribution models. A more extensive dataset would allow for a more comprehensive representation of the environmental factors that influence soil properties. This, in turn, would strengthen the predictive capacity of machine learning models. Improved modeling of soil variability has important implications for sustainable land management, especially in areas facing challenges such as soil degradation, water scarcity, and reduced agricultural productivity.
