Application of machine learning algorithms for groundwater level prediction in the Najafabad plain

Study area

The Najafabad Plain, located in Isfahan Province in central Iran, is a key agricultural region within the Zayandeh Rud Basin. It spans approximately 1,712 km², comprising a central alluvial aquifer of about 940.9 km², with the remaining 772.5 km² consisting of surrounding highlands. Geographically, the plain lies between longitudes 50°57′ and 51°44′26″ E and latitudes 32°20′13″ to 32°49′21″ N (Fig. 1). The aquifer is hydrologically isolated from adjacent groundwater systems and functions as the principal reservoir for subsurface water in the area.

The elevation of the plain ranges from roughly 2,906 m above sea level in the northern highlands to approximately 1,538 m in the southern lowlands, near the Zayandeh Rud River, which provides intermittent surface flow. The region experiences a semi-arid climate, with mean annual precipitation ranging from 158 to 172 mm, while the potential evapotranspiration exceeds 1,500 mm. The average annual temperature is approximately 15 °C. Water usage in the region is dominated by agriculture, accounting for about 93.3% of the total water consumption. Domestic and industrial uses represent around 4.5% and 2.2%, respectively. Groundwater supplies nearly 72% of the total water demand, primarily through wells. In total, 15,933 groundwater extraction points have been identified, including 15,753 wells, 151 qanats, and 29 springs. Among them, 15,673 wells are located within the alluvial aquifer, collectively extracting approximately 881 million cubic meters of water annually²².

In recent years, declining precipitation and increasing water demand have exerted considerable pressure on groundwater resources. Although the development of new wells is legally restricted, unauthorized wells—frequently reported by the Isfahan Regional Water Authority—pose a significant challenge to sustainable groundwater management in the plain.

Data collection and preprocessing

Meteorological, surface Water, and groundwater data

To investigate groundwater level fluctuations in the Najafabad Plain, comprehensive hydrological and meteorological datasets were collected and preprocessed. Meteorological variables, including precipitation and temperature, were obtained from the Najafabad synoptic station, which served as the primary source for representing regional climatic conditions²³. Surface water data were acquired from three active hydrometric stations—Lenj, Mousian, and Zefreh—located along the Zayandeh Rud River, the principal surface water source in the region²⁴. These stations provide valuable insights into river flow dynamics and their interactions with the underlying groundwater system. The total annual surface inflow into the plain, including contributions from the Lenjanat, Mahyar-e-Shomali, and Karon sub-basins, was estimated at approximately 557.4 million cubic meters.

These representative stations (see Table 1; Fig. 1) were selected based on their strategic locations and the availability of long-term, reliable data, ensuring adequate spatial and temporal coverage of the key hydrological and climatic variables influencing groundwater levels across the study area. During the study period (2011–2021), the average annual precipitation was approximately 195 mm in the upland regions and 153.7 mm in the plain. March was identified as the wettest month, with average rainfall values of 31.6 mm in the uplands and 29.3 mm in the plain. Mean annual temperatures were estimated at around 13.8 °C in the uplands and 15.4 °C in the plain.

Table 1 Selected stations for the Research.

Groundwater level data were collected from 53 observation wells distributed across the Najafabad Plain, corresponding to a monitoring density of approximately 1.4 wells per 25 km² (see Fig. 1)²⁴. These wells provided continuous records of groundwater level fluctuations throughout the aquifer system, which extends beneath approximately 87% of the study area and constitutes the primary source of groundwater extraction.

For spatial analysis, the study area was divided into two major zones: the uplands and the plain. The uplands, primarily located in the northern and western parts of the region, cover an area of about 679.2 km² and are predominantly composed of Jurassic and Cretaceous limestone formations. In contrast, the plain—spanning approximately 932.2 km²—is characterized by Quaternary alluvial deposits with variable thicknesses and hydraulic conductivities. This zoning approach enabled differentiation of hydroclimatic and geological conditions across the region, thereby enhancing the accuracy of groundwater level modeling in response to meteorological and surface water drivers.

Outlier detection and treatment

To enhance the quality and reliability of the dataset, outlier detection was performed using the boxplot method—a widely adopted graphical technique in hydrological studies for identifying anomalous values. This approach effectively visualizes the distribution and dispersion of data, highlighting values that deviate significantly from the typical range. Given that the primary objective of this study is not the analysis of extreme events, the identified outliers—presumed to be due to measurement or recording errors—were excluded from further analysis. Subsequently, the resulting gaps in the time series were filled using interpolation techniques, ensuring data continuity and temporal consistency across all variables.

Regional and aquifer zoning

The Najafabad aquifer, like many complex groundwater systems, exhibits significant spatial heterogeneity due to the combined influence of natural and anthropogenic factors such as groundwater abstraction, precipitation, riverbed infiltration, and temperature variability. Owing to the physiographic diversity of the plain, these factors impact different regions of the aquifer to varying degrees. For example, in the western parts of the aquifer—located at a considerable distance from the Zayandeh-Rud River—the influence of river discharge on groundwater levels is minimal to negligible. To improve the spatial resolution and accuracy of groundwater level modeling, the aquifer was subdivided into five distinct zones. This zoning was based on a conceptual understanding of the system, as well as the physical and geographical characteristics of the area, particularly in relation to the presence of the river and irrigation canals. The spatial layout and boundaries of these zones are depicted in Fig. 1, and their key characteristics are summarized in Table 2.

Table 2 Description of groundwater zones based on dominant influencing factors in the Najafabad aquifer.

Zone 1 – River Corridor Zone: This zone includes areas adjacent to the Zayandeh-Rud River, where groundwater levels are directly influenced by river discharge. Piezometric data from this region reflect the dynamic interaction between surface water and groundwater.
Zones 2 and 3 – Nekouabad Irrigation Zones: These zones encompass agricultural areas affected by the Nekouabad irrigation network. Based on their geographical orientation, they are divided into: Zone 2: Right side of the Nekouabad irrigation system, and Zone 3: Left side of the Nekouabad irrigation system.
Zone 4 – Khamiran Irrigation Zone: Located in the western part of the aquifer, this zone is mainly influenced by groundwater abstraction for agriculture under the Khamiran irrigation scheme.
Zone 5 – Western Highland Zone: This zone includes piezometers situated in the elevated western highlands. Here, groundwater dynamics are primarily governed by natural recharge from precipitation, with limited anthropogenic impact.

Descriptive statistics of input variables

To provide an overview of the characteristics of the modeling inputs, descriptive statistics were calculated for all variables in each hydrogeological zone during the training and testing phases. The analyzed variables included precipitation (mm/month), temperature (°C), river discharge (m³/s), groundwater abstraction (m), irrigation volume (MCM/month), and groundwater level (m). For each variable, the minimum (Min), maximum (Max), mean (Mean), kurtosis, skewness, and standard deviation (SD) were computed separately for the training, testing, and total datasets.

These statistics help to identify the distribution patterns, variability, and potential anomalies in the input data, which may influence model performance. For instance, high kurtosis and skewness values in precipitation indicate occasional extreme rainfall events, while the relatively low SD in groundwater levels reflects gradual changes over time. The detailed descriptive statistics for each zone are presented in Table 3.

Table 3 Descriptive statistics of input variables for training, testing, and total datasets in the five hydrogeological zones.

Exploratory data analysis of groundwater level fluctuations

To gain deeper insights into the temporal behavior of groundwater levels across the Najafabad Plain, an exploratory data analysis (EDA) was conducted using violin plots. These plots effectively combine boxplots and kernel density estimates to depict the distribution and variability of groundwater levels on both monthly and seasonal scales, across the five delineated observation zones (Z1 to Z5). The monthly and seasonal distributions are presented in Figs. 2 and 3, respectively.

The monthly violin plots (Fig. 2) reveal distinct fluctuation patterns among the zones. Zone 1 (Z1), located near the river corridor, exhibits pronounced variability during both summer and winter months, as indicated by the wider spread of values. Zone 2 (Z2) shows a gradual rise in groundwater levels from April to December, along with considerable dispersion during the transitional months, potentially due to irrigation cycles. Zone 3 (Z3) maintains relatively stable levels, with the most notable fluctuations occurring between June and October—likely reflecting seasonal agricultural demand. In Zone 4 (Z4), groundwater levels remain consistent throughout the year except for December, which displays a sharp increase in variability. Zone 5 (Z5), located in the western highlands, demonstrates a gradual rise in groundwater levels beginning in June and peaking around October, possibly due to delayed recharge from precipitation or irrigation return flows.

Seasonal violin plots (Fig. 3) provide a comprehensive visualization of the temporal dynamics of groundwater levels across the five hydrogeological zones. Zones Z1, Z2, and Z3 exhibit wider distributions during spring and summer, indicating higher variability likely associated with intensified evapotranspiration, peak agricultural water demand, and potential seasonal recharge events. In contrast, Z4 and Z5 display comparatively narrower and more stable distributions across most seasons, with only a modest increase in variability observed during fall and winter.

Median groundwater levels tend to be elevated in spring and summer, particularly in Z1 and Z2, suggesting the combined effects of irrigation inputs and seasonal recharge from precipitation or upstream inflows. These patterns highlight the spatial heterogeneity in seasonal groundwater behavior, underscoring the greater sensitivity of some zones to environmental and anthropogenic drivers. Such insights are valuable for guiding zone-specific feature engineering, optimizing model calibration, and improving the robustness of predictive groundwater models.

Groundwater level modeling approach

Machine learning algorithms

In this study, three state-of-the-art supervised machine learning algorithms—Random Forest (RF), Extreme Gradient Boosting (XGBoost), and Support Vector Machine (SVM)—were implemented to model and forecast groundwater level variations across the Najafabad aquifer. These algorithms are widely recognized for their robustness in handling nonlinear relationships, high-dimensional input spaces, and missing or noisy data, making them well-suited for hydrological applications^16,25. They were specifically selected to represent three distinct modeling paradigms—ensemble bagging (RF), ensemble boosting (XGBoost), and kernel-based learning (SVM)—allowing a robust comparative assessment. This selection also reflects their proven capability in previous groundwater studies to perform effectively under heterogeneous aquifer conditions and limited-resolution anthropogenic datasets.

This ensemble learning method is based on the bagging (bootstrap aggregating) technique. Multiple regression trees are constructed using bootstrapped samples of the training data. Each tree h_i(x) makes a prediction, and the final RF output is the average of predictions from all trees:

$$\:\widehat{y}=\frac{1}{M}\sum\:_{i=1}^{M}{h}_{i}\left(x\right)$$

(1)

RF minimizes the mean squared error (MSE) across trees and uses random subsets of features at each split to reduce correlation among trees. The objective function focuses on reducing variance while maintaining low bias. Feature importance is computed based on the mean decrease in impurity or prediction accuracy across all trees. In this study, the RF model was implemented using the scikit-learn library with hyperparameter tuning (e.g., number of trees, max depth) via grid search^26,27.

Gradient Boosting builds models sequentially by fitting each new model to the residual errors of the ensemble prediction so far. The objective function combines a loss function $\:L(y,\widehat{y})$ (e.g., squared error loss) and a regularization term $\:{\Omega\:}\left({f}_{t}\right)$ to penalize model complexity:

$$\:{L}^{\left(t\right)}=\sum\:L\left({y}_{i},{\widehat{y}}_{i}^{(t-1)}+{f}_{t}\left({x}_{i}\right)\right)+{\Omega\:}\left({f}_{t}\right)$$

(2)

In this study, the eXtreme Gradient Boosting (XGBoost) version was used, which improves efficiency and regularization. XGBoost uses second-order Taylor approximation of the loss function for optimization and supports shrinkage (learning rate) and column subsampling to reduce overfitting. The additive model is updated as:

$$\:{\widehat{y}}^{\left(t\right)}={\widehat{y}}^{(t-1)}+\eta\:{f}_{t}\left(x\right)$$

(3)

where is the learning rate. The model was implemented using the XGBoost library with hyperparameter tuning on parameters such as the number of estimators, learning rate, and maximum tree depth^28,29.

Support Vector Regression (SVR) aims to find a function that approximates the underlying relationship between input features and target values while balancing model complexity and prediction accuracy. The SVR function is expressed as:

$$\:f\left(x\right)=\langle w,\varphi\:\left(x\right)\rangle+b$$

(4)

where: () maps the input vector into a high-dimensional feature space, and and are the model parameters.

The SVR formulation attempts to minimize the model complexity (by keeping ∥∥ small) while ensuring that the predictions are within an ε-insensitive margin. The optimization problem is defined as:

$$\:{min}_{w,b,{\xi\:}_{i},{\xi\:}_{i}^{*}}=\frac{1}{2}{\parallel w\parallel}^{2}+C\sum\:_{i=1}^{n}\left({\xi\:}_{i}+{\xi\:}_{i}^{*}\right)$$

(5)

Subject to:

$$\:\left\{\begin{array}{c}{y}_{i}-\langle w,\varphi\:\left({x}_{i}\right)\rangle-b\le\:\epsilon\:+{\xi\:}_{i}\\\:\langle w,\varphi\:({x}_{i}\rangle+b-{y}_{i}\le\:\epsilon\:+{\xi\:}_{i}^{*}\\\:{\xi\:}_{i},{\xi\:}_{i}^{*}\ge\:0\end{array}\right.$$

(6)

where: is the regularization parameter that controls the trade-off between the flatness of the function and the amount up to which deviations larger than ε are tolerated. $\:{\xi\:}_{i},{\xi\:}_{i}^{*}$ are slack variables that allow violations of the ε margin.

To model nonlinear relationships, a kernel function $\:K\left({x}_{i},{x}_{j}\right)=\langle\varphi\:\left({x}_{i}\right),\varphi\:\left({x}_{j}\right)\rangle$ is employed. In this study, the Radial Basis Function (RBF) kernel was used:

$$\:K\left({x}_{i},{x}_{j}\right)=\text{e}\text{x}\text{p}(-\gamma\:{\parallel{x}_{i}-{x}_{j}\parallel}^{2})$$

(7)

The RBF kernel enables the model to learn complex, nonlinear patterns in the data.

The SVR model was implemented using the scikit-learn library in Python. The key hyperparameters— (penalty), (margin width), and (kernel coefficient)—were optimized through a grid search approach combined with cross-validation to ensure robust performance^30,31.

All models were trained using the historical time series of meteorological variables (e.g., precipitation, temperature), surface water inflow, and groundwater levels across five hydrogeological zones (Sect. “Input variables per zone”). A combination of grid search and five-fold cross-validation was applied to fine-tune model hyperparameters and ensure robust generalization. Model performance was evaluated using standard statistical metrics, as detailed in Sect. “Model evaluation metrics”.

The complete implementation codes, including data preprocessing, model training, and evaluation scripts, are provided in Supplementary Material (S2) for reference and reproducibility.

Input variables per zone

To account for the spatial heterogeneity of hydrological and anthropogenic influences across the Najafabad aquifer, groundwater level modeling was conducted separately for five hydrogeological distinct zones (Z1–Z5). This zonal approach allows for a more accurate representation of local dynamics by tailoring model inputs to the specific conditions of each area.

Table 4 Input variables for groundwater level change modeling in the defined Zones.

The selection of input variables in each zone was guided by both data availability and their relevance to groundwater level fluctuations. The monthly change in groundwater level (GwL) was considered the output variable across all zones, while the predictor variables included monthly precipitation (P), average temperature (T), river discharge (RD), irrigation volume (IV), and groundwater abstraction (GwA). Table 4 provides an overview of the input variables used in each zone. The inclusion or exclusion of specific variables was influenced by each zone’s proximity to surface water resources, irrigation infrastructure, and elevation. For example, zones located downstream of the Nekouabad canal included irrigation volume as a model input, while zones without reliable abstraction data excluded that variable.

To validate the selection of input variables and to explore the interdependencies among climatic, hydrological, and anthropogenic drivers, Pearson correlation heatmaps were generated for each hydrogeological zone (Fig. 4). These heatmaps quantitatively depict the strength and direction of linear associations between groundwater levels (and their decline) and the predictor variables.

The results highlight substantial spatial variability across the zones. For instance, Zone Z1 displayed a moderate positive correlation between groundwater level decline and abstraction (r = 0.44), confirming the strong impact of pumping activities. In Zone Z2, a distinct negative correlation between temperature and groundwater levels (r = − 0.51) indicated the dominant role of evapotranspiration in intensifying depletion. Zone Z3 demonstrated a mixed influence of both climatic (precipitation and temperature) and anthropogenic (irrigation and abstraction) factors, reflecting the complex hydro–climatic setting. In comparison, Zones Z4 and Z5 exhibited weaker correlations overall, implying relatively stable conditions or the presence of additional drivers not captured in the available datasets.

This correlation-based diagnostic analysis is not only scientifically relevant but also essential for justifying the zonal modeling framework. By identifying zone-specific dominant drivers, the heatmaps strengthen the rationale for input selection and demonstrate the necessity of considering both climatic and human-induced factors when predicting groundwater dynamics in the Najafabad Plain.

Model training and validation strategy

To ensure robust and generalizable model performance, the available dataset for each hydrogeological zone was split into training and testing subsets. 70% (70%) of the data was used for training the machine learning models, while the remaining 30% was reserved for testing and evaluation. The time series were divided chronologically to preserve the temporal structure and prevent data leakage. To fine-tune the model parameters and enhance prediction accuracy, a five-fold cross-validation (CV) strategy was employed during the training phase. In this method, the training data was partitioned into five equal subsets, and each subset was used once for validation while the remaining four subsets were used for training. The final model performance was averaged across the five folds, ensuring a balanced evaluation of bias and variance.

Hyperparameter optimization was performed for each algorithm using a grid search approach, where multiple combinations of key parameters were systematically tested. The optimal hyperparameter set was selected based on minimizing the validation error, specifically the Root Mean Square Error (RMSE). To further prevent overfitting, each model incorporated algorithm-specific regularization mechanisms. For example, the Random Forest model utilized a maximum tree depth constraint and minimum sample split size, the Gradient Boosting model implemented learning rate and subsampling adjustments, and the Support Vector Machine model optimized the kernel function and regularization coefficient (C).

This training-validation strategy ensured that the models were both accurate and resilient to noise and overfitting, and it enabled fair comparisons between algorithms across different zones of the Najafabad Plain.

Model evaluation metrics

To evaluate the predictive accuracy of the developed machine learning models, five widely used performance metrics were employed: Coefficient of Determination (R²), Root Mean Square Error (RMSE), Mean Absolute Error (MAE), Willmott’s Index of Agreement (WI), and Percent Bias (PBIAS). These metrics provide complementary insights into model performance in terms of accuracy, bias, agreement, and error distribution.

$$\:{R}^{2}=1-\frac{\sum\:_{i=1}^{n}{\left({O}_{i}-{P}_{i}\right)}^{2}}{\sum\:_{i=1}^{n}{\left({O}_{i}-\stackrel{-}{O}\right)}^{2}}$$

(8)

$$\:RMSE=\sqrt{\frac{1}{n}\sum\:_{i=1}^{n}{\left({O}_{i}-{P}_{i}\right)}^{2}}$$

(9)

$$\:MAE=\frac{1}{n}\sum\:_{i=1}^{n}\left|{O}_{i}-{P}_{i}\right|$$

(10)

$$\:WI=1-\frac{\sum\:_{i=1}^{n}{\left({P}_{i}-{O}_{i}\right)}^{2}}{\sum\:_{i=1}^{n}{\left(\left|{P}_{i}-\stackrel{-}{O}\right|+\left|{O}_{i}-\stackrel{-}{O}\right|\right)}^{2}}$$

(11)

Values range from 0 (no agreement) to 1 (perfect agreement).

The Percent Bias (PBIAS) measures the average tendency of predictions to overestimate or underestimate observations:

$$\:PBIAS=\frac{\sum\:_{i=1}^{n}\left({P}_{i}-{O}_{i}\right)}{\sum\:_{i=1}^{n}{O}_{i}}\times\:100$$

(12)

Positive values indicate a model underestimates observations, while negative values indicate overestimation.

Where and are the observed and predicted values, respectively, and $\:\stackrel{-}{O}$ is the mean of the observed values. In summary, higher R² and WI values closer to 1, together with lower RMSE, MAE, and ∣PBIAS∣, indicate better model performance. ∣PBIAS∣ below 10–15% is generally considered satisfactory in hydrological applications.

These metrics were computed for each model (RF, GB, SVM) and each hydrogeological zone to enable detailed comparison and identify the best-performing algorithm in each context. The selected metrics have been widely applied in hydrological and groundwater modeling studies^{32,33,34,35,36}.

For a detailed overview of the methodological framework including data preprocessing, zone delineation, model training, and evaluation steps, refer to the Methodology Flowchart provided in Supplementary Material (S1).

Source link