The impact of green infrastructure on ecosystem quality based on explainable machine learning: a case study of Shanxi Province, China



To achieve our objectives, we develop a four‐stage integrative framework that systematically links GI form and landscape pattern metrics with regional ecosystem quality and its underlying response mechanisms. First, we quantify ecosystem quality dynamics by applying the RSEI. Next, we characterize GI elements through MSPA and calculate key landscape pattern metrics to capture the structural and configurational attributes of green spaces. We then employ interpretable machine learning techniques to elucidate how individual GI features drive variations in ecosystem quality and to uncover the primary mechanisms of influence. Finally, leveraging the insights gained from these models, we formulate targeted recommendations for optimizing GI design and management, thereby enhancing the resilience and overall health of regional ecosystems. Specific details of each step are provided in the following sections.

Study area and data sources

Shanxi Province is situated in the central part of the Yellow River basin, extending from 34°31′ to 40°44′ N and from 110°15′ to 114°32′ E (Fig. 1). The province lies on the northeastern edge of the Loess Plateau and is characterized by distinctive loess soil and dramatic topography. The landscape is diverse, with elevations higher in the northeast and lower in the southwest, and encompasses prominent mountain systems such as the Lüliang and Taihang ranges as well as basins such as those along the Fenhe River. Shanxi has a semi-arid climate, with annual precipitation of roughly 468 mm36. The rugged topography and the concentration of heavy rainfall in summer are important contributors to soil erosion in this area37.

Fig. 1. Location map of the study area. The maps were generated by the authors using ArcGIS 10.2. Administrative boundaries used in the maps were obtained from the Resource and Environmental Science Data Center (RESDC, http://www.resdc.cn/Default.aspx).

Shanxi Province covers an area of 156,700 km², comprising 11 major cities and 117 county-level administrative units. At the end of 2023, approximately 34.66 million people resided in the province, with the population continuing to grow38. The region has a long history of human activity, with significant agricultural and industrial demands for natural resources. Holding nearly 40% of China’s total coal reserves, Shanxi has historically been a key energy production base39. However, extensive land cultivation and ongoing resource extraction have led to serious ecological degradation, including soil erosion, water resource depletion, and significant air pollution.

In response to these environmental challenges, various ecological restoration initiatives have been implemented to improve the region’s environmental quality and promote sustainable development. These efforts have included large-scale vegetation restoration, the conversion of sloping croplands to forests, and integrated approaches to address mining impacts and industrial pollution. These projects have significantly altered the vegetation and landscape of Shanxi, consequently impacting the overall health of the ecosystem.

Given that landscape changes often induce spatiotemporal heterogeneity in GI, satellite-derived remote sensing data provide a robust foundation for analysis. This study integrates land use data and ecological remote sensing indicators—widely recognized in ecology, urban planning, and remote sensing research—to assess GI dynamics over time. Specifically, GI was identified and landscape metrics were computed using the China Land Cover Dataset (CLCD)40. This widely used dataset, generated via deep learning methods on Landsat imagery, provides annual 30-m resolution land cover data for China from 1990 to the present. For the computation of the ecosystem quality, we employed the MOD09A1 and MOD11A2 datasets (2000–2022) from the U.S. Geological Survey, accessed via the Google Earth Engine platform (https://code.earthengine.google.com/). To supplement environmental covariates, monthly precipitation data (1 km spatial resolution) were sourced from the Science Data Bank (https://www.scidb.cn/), while the terrain data was obtained from the SRTM 90 m Digital Elevation Database (https://bigdata.cgiar.org/). Slope data were generated through DEM surface analysis in ArcGIS 10.2. All raster datasets were preprocessed by clipping to the study area and reprojecting to UTM Zone 49N.

Evaluation of ecosystem quality

The Remote Sensing Ecological Index (RSEI) functions as a comprehensive environmental assessment tool that synthesizes four key variables to evaluate ecosystem conditions. Through the incorporation of principal component analysis, RSEI minimizes researcher bias in the assessment process. Multiple research investigations have validated RSEI’s effectiveness in delivering impartial environmental assessments31,41. Considering these advantages, our research employed RSEI methodology to evaluate ecosystem conditions across Shanxi Province during the 2000–2022 timeframe.

The RSEI offers a comprehensive assessment by integrating four key ecological components: greenness, humidity, heat, and dryness42. Each of these factors plays a critical role in understanding the overall condition of the ecological system. Greenness indicates the health and density of vegetation, serving as a vital indicator of ecosystem vitality. Humidity reflects moisture levels, which are essential for plant growth and overall ecosystem sustainability. Heat represents temperature variations that can influence biological activity and environmental stressors, while dryness assesses aridity levels that can impact water availability and vegetation health. Together, these components provide a nuanced view of ecological dynamics, helping to highlight areas of environmental improvement or degradation. By analyzing these indices, researchers can better understand how human activities and natural processes affect ecosystem health.

The greenness index can be quantified using the Normalized Difference Vegetation Index (NDVI) or the Enhanced Vegetation Index (EVI). Given the heterogeneous plant distribution across the Loess Plateau landscape, we selected EVI for analysis. The calculations were based on satellite imagery acquired during periods of peak vegetation productivity, at 250-m spatial resolution.

$${\text{Greenness}}=2.5({\rho }_{2}-{\rho }_{1})/({\rho }_{2}+6.0{\rho }_{1}-7.5{\rho }_{3}+1)$$

where, ρi (i = 1, 2, 3) represents the i th band of the MOD09A1 product.
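The greenness formula can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' processing chain: it assumes the MOD09A1 reflectances have already been rescaled to the 0–1 range and that bands 1, 2 and 3 are the red, near-infrared and blue bands; the function name `evi_greenness` is hypothetical.

```python
import numpy as np

def evi_greenness(red, nir, blue):
    """Greenness (EVI) from reflectance in [0, 1].

    Assumed band mapping for MOD09A1: red = band 1, nir = band 2, blue = band 3.
    """
    red, nir, blue = (np.asarray(b, dtype=float) for b in (red, nir, blue))
    denom = nir + 6.0 * red - 7.5 * blue + 1.0
    # Return NaN instead of raising where the denominator is zero.
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(denom != 0, 2.5 * (nir - red) / denom, np.nan)
```

Dense vegetation (high near-infrared, low red reflectance) yields values well above those of bare soil, which is why the index serves as the greenness component.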

The humidity index is used to represent surface moisture31. By applying the tasseled cap transformation to the MOD09A1 dataset, components closely related to vegetation and soil moisture were extracted. The formula for calculating the humidity index is as follows:

$$\begin{array}{cc}{\text{Humidity}}=& 0.1147{\rho }_{1}+0.2489{\rho }_{2}+0.2408{\rho }_{3}+0.3132{\rho }_{4}-0.3122{\rho }_{5}\\ & -0.6416{\rho }_{6}-0.5087{\rho }_{7}\end{array}$$
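As a sketch, the humidity component is a weighted sum of the seven reflectance bands, using the coefficients from the equation above. The array layout (band axis first) and the names `WET_COEFFS` and `humidity` are assumptions made for illustration.

```python
import numpy as np

# Coefficients for bands 1..7, copied from the humidity equation.
WET_COEFFS = np.array([0.1147, 0.2489, 0.2408, 0.3132, -0.3122, -0.6416, -0.5087])

def humidity(bands):
    """Weighted band sum; `bands` has shape (7, rows, cols) or (7, n_pixels)."""
    bands = np.asarray(bands, dtype=float)
    if bands.shape[0] != 7:
        raise ValueError("expected the seven MOD09A1 bands along axis 0")
    # Contract the coefficient vector with the band axis.
    return np.tensordot(WET_COEFFS, bands, axes=(0, 0))
```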

The heat index is quantified through land surface temperature (LST), with the MOD11A2 LST product serving in this study. In calculating the heat index, the following formula is used to convert the original land surface temperature data (LST0) from Kelvin (K) to Celsius (°C):

$${\text{Heat}}=0.02{\text{LST}}_{0}-273.15$$
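The conversion is a single linear rescaling. The sketch below additionally treats a raw value of 0 as missing data, which is an assumption about the MOD11A2 fill value that should be checked against the product documentation; `heat_celsius` is a hypothetical name.

```python
import numpy as np

def heat_celsius(lst_raw):
    """Convert raw LST digital numbers to degrees Celsius (scale factor 0.02)."""
    lst_raw = np.asarray(lst_raw, dtype=float)
    # Assumption: 0 marks missing pixels, so it is mapped to NaN.
    return np.where(lst_raw > 0, 0.02 * lst_raw - 273.15, np.nan)
```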

Here, we select the Normalized Difference Built-up and Bare Soil Index (NDBSI) to represent dryness, which is calculated using the following formula:

$$\text{Dryness}=\left(\frac{\frac{2{\rho }_{6}}{{\rho }_{6}+{\rho }_{2}}-\left(\frac{{\rho }_{2}}{{\rho }_{2}+{\rho }_{1}}+\frac{{\rho }_{4}}{{\rho }_{4}+{\rho }_{6}}\right)}{\frac{2{\rho }_{6}}{{\rho }_{6}+{\rho }_{2}}+\left(\frac{{\rho }_{2}}{{\rho }_{2}+{\rho }_{1}}+\frac{{\rho }_{4}}{{\rho }_{4}+{\rho }_{6}}\right)}+\frac{\left({\rho }_{6}+{\rho }_{1}\right)-\left({\rho }_{2}+{\rho }_{3}\right)}{\left({\rho }_{6}+{\rho }_{1}\right)+\left({\rho }_{2}+{\rho }_{3}\right)}\right)/2$$

where, ρi (i = 1, 2, 3…6) represents the i-th band of the MOD09A1 product.
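The dryness formula is the mean of two normalized terms: a built-up term and a bare-soil term. The sketch below spells the two terms out separately; the argument names assume that bands 1, 2, 3, 4 and 6 correspond to red, near-infrared, blue, green and shortwave infrared, and `ndbsi` is a hypothetical function name.

```python
import numpy as np

def ndbsi(red, nir, blue, green, swir1):
    """Dryness as the mean of a built-up term and a bare-soil term.

    Assumed band mapping: red = band 1, nir = band 2, blue = band 3,
    green = band 4, swir1 = band 6.
    """
    red, nir, blue, green, swir1 = (
        np.asarray(b, dtype=float) for b in (red, nir, blue, green, swir1)
    )
    a = 2.0 * swir1 / (swir1 + nir)
    b = nir / (nir + red) + green / (green + swir1)
    built_up = (a - b) / (a + b)  # first fraction in the equation
    bare_soil = ((swir1 + red) - (nir + blue)) / ((swir1 + red) + (nir + blue))
    return (built_up + bare_soil) / 2.0
```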

The four indicators (greenness, humidity, heat, and dryness) were standardized to values between 0 and 1, and the Modified Normalized Difference Water Index (MNDWI) was applied to mask out water bodies. Principal Component Analysis (PCA) was then performed on the standardized indicators. Because the first principal component (PC1) derives its weights from the data rather than from the analyst, using it mitigates subjective weighting bias. The preliminary Remote Sensing Ecological Index (RSEI0) was therefore derived from PC1:

$${\text{RSEI}}_{0}=\left\{\begin{array}{c}\text{PC1}\left[f\left(\text{Greenness},\text{ Humidity},\text{ Heat},\text{ Dryness}\right)\right], {\text{ V}}_{\text{Greenness}}, {\text{V}}_{\text{Humidity}}>0\\ 1-\text{PC1}\left[f\left(\text{Greenness},\text{ Humidity},\text{ Heat},\text{ Dryness}\right)\right], {\text{ V}}_{\text{Greenness}}, {\text{V}}_{\text{Humidity}}<0\end{array}\right.$$
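The piecewise rule can be illustrated with a small NumPy sketch. It assumes that water pixels have already been removed, that the inputs contain no missing values, and that V denotes the PC1 loading of the corresponding indicator. The function names are hypothetical, and the eigen-decomposition stands in for whatever PCA routine the study actually used.

```python
import numpy as np

def minmax(x):
    """Rescale an array linearly to the range [0, 1]."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo)

def rsei0(greenness, humidity, heat, dryness):
    """Return (RSEI0 per pixel, PC1 loadings) for four indicator arrays."""
    # One column per indicator, one row per pixel.
    X = np.column_stack(
        [minmax(v).ravel() for v in (greenness, humidity, heat, dryness)]
    )
    Xc = X - X.mean(axis=0)
    # eigh returns eigenvalues in ascending order, so PC1 is the last column.
    _, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    v1 = eigvecs[:, -1]
    pc1 = Xc @ v1
    if v1[0] > 0 and v1[1] > 0:   # greenness and humidity load positively
        return pc1, v1
    if v1[0] < 0 and v1[1] < 0:   # both load negatively: flip the axis
        return 1.0 - pc1, v1
    raise ValueError("greenness and humidity loadings disagree in sign")
```

The sign check matters because the direction of an eigenvector is arbitrary; orienting the axis so that greenness and humidity contribute positively ensures that higher values consistently indicate better ecological conditions.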

It is important to note that the RSEI was originally developed using Landsat data. Due to issues such as the long revisit cycle and interference from band images associated with Landsat satellites, this study utilized MODIS data instead.

To better quantify and analyze the spatiotemporal changes in ecosystem quality across different regions, we normalized RSEI0 to obtain the final RSEI values used for displaying the research results:

$$\text{RSEI}=\frac{{\text{RSEI}}_{{0}_{\text{i}}}-{\text{RSEI}}_{{0}_{\text{min}}}}{{\text{RSEI}}_{{0}_{\text{max}}}-{\text{RSEI}}_{{0}_{\text{min}}}}$$

Based on previous research experience and the actual conditions of the study area, we consider regions with RSEI values below 0.3 to have poor quality, while areas with values above 0.7 are regarded as having good quality. Other regions fall into the moderate category. Finally, to better establish the relationship between GI morphological types and characteristics and ecosystem quality, the analysis results of RSEI and MSPA were assigned values within a 6 km × 6 km grid.
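A minimal sketch of the three post-processing steps (min-max normalization, thresholding at 0.3 and 0.7, and aggregation to the analysis grid) might look as follows. The block size is an assumption: it depends on the pixel size of the input raster (for example, 12 pixels per side for 500 m pixels), and block averaging on an aligned array is a simplification of assigning values to a 6 km × 6 km grid. All function names are hypothetical.

```python
import numpy as np

def normalize_rsei(rsei0):
    """Min-max normalization of RSEI0 to [0, 1]."""
    r = np.asarray(rsei0, dtype=float)
    lo, hi = np.nanmin(r), np.nanmax(r)
    return (r - lo) / (hi - lo)

def classify_rsei(rsei, poor_below=0.3, good_above=0.7):
    """Return 0 (poor), 1 (moderate) or 2 (good) for each value."""
    rsei = np.asarray(rsei, dtype=float)
    classes = np.full(rsei.shape, 1, dtype=np.int8)
    classes[rsei < poor_below] = 0
    classes[rsei > good_above] = 2
    return classes

def block_mean(raster, block):
    """Average a 2-D raster over non-overlapping block x block windows."""
    raster = np.asarray(raster, dtype=float)
    rows = raster.shape[0] - raster.shape[0] % block  # trim partial blocks
    cols = raster.shape[1] - raster.shape[1] % block
    trimmed = raster[:rows, :cols]
    return trimmed.reshape(rows // block, block, cols // block, block).mean(axis=(1, 3))
```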

MSPA and mapping of GI

Morphological Spatial Pattern Analysis (MSPA) is an effective method used in landscape ecology and remote sensing for the quantitative examination of spatial patterns in digital imagery43. The approach employs mathematical morphology, a technique that analyzes spatial patterns based on geometric structures, to assess landscape configurations. Using morphological operations, MSPA classifies and quantifies spatial features such as patches, corridors, and matrices. It not only maps these elements but also evaluates their structural attributes, which is critical for assessing green infrastructure (GI), where spatial connectivity and distribution directly inform urban planning and ecosystem management.

Land cover features such as forest, shrubland, grassland, and aquatic systems were categorized as primary GI components (foreground), while artificial surfaces, cropland and other non-vegetated land types were designated as non-GI elements (background). The 30 m CLCD land cover data was employed to extract GI components, a resolution that facilitates the identification of fine-scale GI types and detailed green space structures. This dichotomy was applied to land use data spanning 2000–2022, generating binary raster datasets for spatial analysis. Subsequent processing and visualization were performed using Guidos Toolbox to assess GI patterns. Consequently, the green infrastructure categories were organized into seven distinct, non-overlapping types based on their landscape morphology (Table S1): core, islet, perforation, edge, loop, bridge, and branch44. After accurately identifying and classifying the morphological categories, we calculated the proportion of each GI type within each 6 km × 6 km grid and presented the research findings through visual mapping.
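The foreground/background split and the per-grid proportions can be sketched as below. Several details are assumptions rather than facts stated in the text: the CLCD class codes (2 = forest, 3 = shrub, 4 = grassland, 5 = water), the 0/1/2 coding for missing/background/foreground expected by the MSPA tool, and the block size (200 pixels per side for 30 m pixels). Both function names are hypothetical, and the codes should be checked against the dataset legend before use.

```python
import numpy as np

GI_CLASSES = (2, 3, 4, 5)  # assumed CLCD codes: forest, shrub, grassland, water

def mspa_input(landcover, gi_classes=GI_CLASSES, nodata=0):
    """Binary MSPA input: 2 = foreground (GI), 1 = background, 0 = missing."""
    lc = np.asarray(landcover)
    out = np.where(np.isin(lc, gi_classes), 2, 1).astype(np.uint8)
    out[lc == nodata] = 0
    return out

def class_share_per_block(classified, value, block):
    """Fraction of pixels equal to `value` in each block x block window."""
    hit = (np.asarray(classified) == value).astype(float)
    rows = hit.shape[0] - hit.shape[0] % block  # trim partial blocks
    cols = hit.shape[1] - hit.shape[1] % block
    hit = hit[:rows, :cols]
    return hit.reshape(rows // block, block, cols // block, block).mean(axis=(1, 3))
```

The same `class_share_per_block` helper could be applied to the MSPA output raster, once per morphological class, to obtain the proportion of core, edge, bridge and so on in each grid cell.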

Quantitative analysis of GI’s features

The landscape pattern, which reflects landscape heterogeneity, results from ecological processes operating at multiple scales45. Landscape pattern metrics (LPMs) serve as quantitative tools for analyzing and describing landscape features46 and are well established for assessing the spatial distribution and characteristics of GI at the landscape level. In this study, the landscape metrics were computed using the “pylandstats” library in Python; their names and the features they quantify are presented in Table 1. The computed results were also averaged at the 6 km × 6 km grid level to characterize the overall condition of the region.
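To make one of these metrics concrete, the Shannon diversity index of a grid cell can be computed directly from class proportions as SHDI = −Σ p_i ln p_i. The sketch below is a hand-written NumPy illustration, not the pylandstats API, and the function name is hypothetical.

```python
import numpy as np

def shannon_diversity(class_raster, nodata=None):
    """Shannon diversity index of the class proportions in a raster window."""
    values = np.asarray(class_raster).ravel()
    if nodata is not None:
        values = values[values != nodata]
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    # Only classes that are present contribute, so log(p) is always finite.
    return float(-(p * np.log(p)).sum())
```

A window containing a single class gives 0, and the value rises as more classes share the area more evenly.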

Table 1 Selected landscape pattern metrics and their explainable target features used in this study.

Relative contributions of GI types and feature factors to RSEI

Existing research indicates that the coverage, morphology, and landscape characteristics of GI can significantly influence its ecological functions. However, the relationships among these factors are quite complex. Based on a review of various GI characteristics and considering the temporal and spatial variability of ecosystem quality, we developed separate models for different GI forms and characteristics for the years 2000 and 2022. These models aim to reveal their specific impacts on ecosystem quality.

In this study, we employed eXtreme Gradient Boosting (XGBoost), an advanced gradient boosting framework for decision trees, to develop our explanatory model. This approach constructs a strong learner by combining multiple weak models, using Classification and Regression Trees as the base learners47. It is recognized as one of the fastest decision tree algorithms currently available48. The incorporation of regularization parameters mitigates overfitting by penalizing tree complexity49. This capability, along with the model’s interpretability, makes it a more attractive choice than alternatives such as neural networks. In addition to delivering robust predictive performance, the framework facilitates clearer evaluation of variable importance, improving our understanding of how different factors influence the model’s predictions. The method was selected to address the non-linear relationships between GI factors and RSEI values. The dataset was partitioned into a 70% training set and a 30% test set. To optimize performance and prevent overfitting, hyperparameters were tuned by grid search with fivefold cross-validation. Predictive accuracy was evaluated using R-squared (R²) and Root Mean Squared Error (RMSE).
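The evaluation protocol described here (a 70/30 split, fivefold cross-validation on the training portion, and R² and RMSE as accuracy measures) can be sketched independently of any particular model library. The helpers below are hypothetical NumPy stand-ins written for illustration; they do not reproduce the authors' code or the XGBoost API, and the random seed is an arbitrary choice.

```python
import numpy as np

def split_70_30(n_samples, seed=0):
    """Shuffle sample indices and return (train_idx, test_idx) at 70/30."""
    order = np.random.default_rng(seed).permutation(n_samples)
    n_train = int(round(0.7 * n_samples))
    return order[:n_train], order[n_train:]

def kfold_indices(indices, k=5):
    """Yield (train_idx, validation_idx) pairs for k-fold cross-validation."""
    folds = np.array_split(np.asarray(indices), k)
    for i in range(k):
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, folds[i]

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root mean squared error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```

In a grid search, each candidate hyperparameter combination would be scored by averaging a metric over the five validation folds, and only the selected combination would then be evaluated once on the held-out 30%.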

Ultimately, we created three models that reflect the relative impacts of the morphological types and characteristic factors of GI on ecosystem quality in 2000, 2010 and 2022. Each model incorporated the proportion of GI and of the seven morphological types in each grid cell, along with eight landscape characteristic factors such as Shannon diversity, as explanatory variables. The Shapley Additive exPlanations (SHAP) tool was utilized to interpret the model outputs and determine the influence of each factor on ecosystem quality. The entire process was implemented in Python.
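SHAP attributes a prediction to individual features using Shapley values from cooperative game theory. The idea can be shown with an exact, brute-force computation on a toy model. This sketch is not the SHAP library and not the tree-specific algorithm it uses for boosted trees; it assumes that "absent" features are replaced by a fixed baseline value, and the function name and toy model are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, baseline):
    """Exact Shapley values of `model` at point `x` relative to `baseline`.

    Enumerates every coalition, so it is only feasible for a few features.
    """
    n = len(x)

    def value(coalition):
        # Features outside the coalition take their baseline value.
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):  # coalition sizes 0 .. n-1
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi
```

For a toy model `2*z[0] + z[1]*z[2]` evaluated at `[1, 2, 3]` against a zero baseline, the additive term is attributed entirely to the first feature and the interaction term is shared equally between the other two. The attributions always sum to the difference between the prediction and the baseline prediction, which is the property that makes per-feature contributions comparable across grid cells.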


