Study area and dataset
This study focuses on the Boluo watershed, a sub-watershed of the Dongjiang River, which is a primary tributary of the Pearl River. Originating in Jiangxi Province, the Dongjiang River flows southwest through Guangdong Province, serving as one of the most critical water sources for the Guangdong-Hong Kong-Macao Greater Bay Area. The Dongjiang River watershed encompasses approximately 35,340 km², with reservoir-controlled areas comprising 33.2% of the total watershed area. The Boluo watershed, serving as the outlet watershed of the Dongjiang River system, covers a drainage area of 3941.3 km² with a shape ratio of 1.06, indicating a relatively compact watershed configuration. The watershed is significantly influenced by upstream flow from the Lingxia station and the Baipenzhu Reservoir, the latter being an artificially regulated reservoir capable of flood control and water storage operations (Fig. 1). The watershed is characterized by a subtropical monsoon climate with a mean annual temperature of approximately 21 °C. Mean annual precipitation ranges from 1500 to 2400 mm, with approximately 80% concentrated during the wet season from April to September. Frontal and typhoon rainfall events are predominant in this region, generating rapid streamflow responses and high flood peaks.

Location and topographic characteristics of the Boluo watershed in the Dongjiang River watershed, southern China.
The modeling framework employs meteorological forcing data and upstream flow measurements to predict daily streamflow at the Boluo outlet. Precipitation data were obtained from a network of 14 rain gauge stations strategically distributed across diverse topographical zones (mountainous, plain, and valley regions) within the watershed, effectively capturing spatial precipitation heterogeneity. Additionally, watershed-averaged rainfall and evapotranspiration were calculated from national meteorological control stations. Streamflow data comprised daily measurements from three hydrological stations (Boluo, Lingxia, and Baipenzhu), providing comprehensive coverage of the watershed’s hydrological dynamics.
During data preprocessing, missing daily observations (< 3% missing rate) were interpolated using linear interpolation methods. For continuous gaps exceeding five days, missing values were replaced with climatological means from corresponding periods (same month and hydrological year type). Outlier detection was performed using the 3σ criterion to identify anomalous data. Prior to model training, all input features were normalized using MinMaxScaler to transform values into the range [0, 1]. This normalization is critical for LSTM, ANN and LR models, as these algorithms are sensitive to feature scales and may produce suboptimal results when input variables have vastly different magnitudes. For gradient-based optimization methods, normalized inputs ensure faster convergence and more stable training. To prevent data leakage, the scaling parameters were computed exclusively from the train period and subsequently applied to the test period using the same transformation.
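The leakage-free scaling step can be sketched in a few lines of NumPy. The helper names (`fit_minmax`, `apply_minmax`) and the array values are ours for illustration, not code or data from the study:

```python
import numpy as np

def fit_minmax(train):
    """Compute per-feature min/max from the train period only."""
    return train.min(axis=0), train.max(axis=0)

def apply_minmax(data, lo, hi):
    """Scale into [0, 1] using the train-period parameters (avoids leakage)."""
    return (data - lo) / np.where(hi > lo, hi - lo, 1.0)

# Toy record with 2 features; the split is chronological.
train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
test = np.array([[2.5, 25.0], [12.0, 15.0]])  # test values may exceed the train range

lo, hi = fit_minmax(train)
train_scaled = apply_minmax(train, lo, hi)
test_scaled = apply_minmax(test, lo, hi)  # same transformation, train parameters
```

Because the test period reuses the train-period minima and maxima, out-of-range test values legitimately map outside [0, 1], which is the intended behavior rather than a bug.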
Given varying record lengths across observation stations, the study period was divided into a train period (1985–2004, 70%) and a test period (2005–2013, 30%). This temporal division ensures adequate representation of different hydrological regimes (wet, normal, and dry years) in both datasets, enhancing model generalization capability. The train period was used for model calibration and hyperparameter optimization through cross-validation, while the test period provided independent evaluation of model performance.
Models
Seven machine learning models representing different algorithmic families were employed for streamflow prediction: LR, GBR, ANN, RF, ETR, XGB, and LSTM. These models encompass linear methods, ensemble tree-based approaches, and neural networks, enabling comprehensive comparison of data-driven modeling strategies.
LR
LR represents the simplest approach, establishing linear relationships between input features and target streamflow28. The model assumes that the target variable can be expressed as a weighted sum of input features.
$$\begin{array}{*{20}c} {\hat{y} = \beta_{0} + \mathop \sum \limits_{j = 1}^{p} \beta_{j} x_{j} } \\ \end{array}$$
(1)
where \(\:\widehat{\text{y}}\) is the predicted streamflow, \(\beta _{0}\) is the intercept, \(\beta _{j}\) represents the regression coefficient for feature \(\:{\text{x}}_{\text{j}}\), and p is the total number of input features. The coefficients are estimated by minimizing the sum of squared residuals between observed and predicted values. Hyperparameter tuning examined fit intercept [True, False] and positive constraint [True, False] to ensure physically plausible non-negative streamflow predictions, and regularization strategies including None (ordinary least squares), Ridge (L2 regularization with alpha values of 0.1, 1.0, 10.0), and Lasso (L1 regularization with alpha values of 0.1, 1.0, 10.0) to prevent overfitting in high-dimensional feature spaces. The optimal LR configuration (fit intercept=True, positive constraint=False, regularization=Ridge, alpha = 1.0) was selected. Despite its simplicity, LR provides a benchmark for assessing whether more complex models offer substantial performance improvements and serves as an interpretable baseline for understanding feature-target relationships.
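The least-squares estimate behind Eq. (1) can be reproduced directly with NumPy. The feature values and coefficients below are synthetic and chosen so the fit is exact; this is a minimal sketch, not the study's LR pipeline:

```python
import numpy as np

# Synthetic illustration: target as a weighted sum of two features plus intercept.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 2))
beta_true = np.array([2.0, -1.0])
y = 0.5 + X @ beta_true  # noiseless, so least squares recovers the coefficients

# Minimizing the sum of squared residuals via least squares,
# with an explicit intercept column (fit_intercept=True analogue).
Xb = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)
beta0, beta = coef[0], coef[1:]
```

Ridge regularization, as in the selected configuration, would add a penalty term `alpha * ||beta||^2` to the minimized objective, shrinking the coefficients toward zero.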
GBR
GBR employs sequential ensemble learning, where decision trees are added iteratively to correct residuals from previous iterations29. The model builds trees in a stage-wise manner, with each subsequent tree fitted to the negative gradient of the loss function.
$$\begin{array}{*{20}c} {F_{m} \left( x \right) = F_{m – 1} \left( x \right) + \gamma_{m} h_{m} (x)} \\ \end{array}$$
(2)
where \(\:{\text{F}}_{\text{m}}\left(\text{x}\right)\) is the ensemble prediction at iteration m, \(F_{{m – 1}} \left( x \right)\) is the prediction from previous iterations, \(\gamma _{m}\) is the learning rate, and \(\:{\text{h}}_{\text{m}}\text{(}\text{x}\text{)}\) is the newly added decision tree. Hyperparameter tuning explored learning rates (0.01, 0.1, 0.2), maximum tree depths (3, 5, 7), and numbers of estimators (50, 100, 200). The optimal GBR configuration (learning rate = 0.1, max depth = 5, n estimators = 200) was selected. This boosting strategy enables effective capture of nonlinear patterns while maintaining reasonable computational efficiency. GBR is particularly effective for structured tabular data common in hydrological applications.
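The stage-wise update of Eq. (2) can be traced with depth-1 trees (stumps) fitted to residuals, which are the negative gradient of the squared loss. This one-dimensional toy is our illustration of the boosting mechanics, not the GBR implementation used in the study:

```python
import numpy as np

def fit_stump(x, r):
    """Fit a depth-1 regression tree (stump) to residuals r by exhaustive split search."""
    best = None
    for thr in np.unique(x):
        left, right = r[x <= thr], r[x > thr]
        if len(left) == 0 or len(right) == 0:
            continue
        pred = np.where(x <= thr, left.mean(), right.mean())
        sse = ((r - pred) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, thr, left.mean(), right.mean())
    return best[1:]  # (threshold, left value, right value)

def boost(x, y, n_trees=50, lr=0.1):
    """Stage-wise update F_m = F_{m-1} + lr * h_m, as in Eq. (2)."""
    F = np.full_like(y, y.mean())
    for _ in range(n_trees):
        thr, lv, rv = fit_stump(x, y - F)  # h_m fits the current residual
        F = F + lr * np.where(x <= thr, lv, rv)
    return F

x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x)
F = boost(x, y)
```

Each added stump only nudges the ensemble by the learning-rate fraction of its fit, which is why many weak trees combine into a smooth approximation.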
ANN
ANN utilizes a multi-layer perceptron architecture with nonlinear activation functions, enabling approximation of complex nonlinear mappings between inputs and outputs12. The network consists of an input layer receiving hydrological and meteorological variables, one or more hidden layers for feature transformation, and an output layer producing streamflow predictions. For each neuron in the network, the output is computed as follows.
$$\begin{array}{*{20}c} {a = f(\mathop \sum \limits_{i = 1}^{n} w_{i} x_{i} + b)} \\ \end{array}$$
(3)
where \(\:\text{a}\) is the neuron output, f is the nonlinear activation function, \(\:{\text{w}}_{\text{i}}\) represents the weight for input \(\:{\text{x}}_{\text{i}}\), n is the number of inputs, and b is the bias term. Architecture search explored hidden layer configurations [(64,), (128,), (128, 64)], activation functions (ReLU, tanh), optimization solvers (Adam, SGD), and learning rate strategies (constant, adaptive). Training was conducted using mini-batch gradient descent with a batch size of 32. To prevent overfitting, early stopping was implemented with a patience of 10 iterations. The L2 regularization parameter alpha was tuned from [0.0001, 0.001, 0.01]. The maximum number of training iterations was set to 200. The optimal ANN configuration (hidden layer sizes=(128, 64), activation=relu, solver=adam, learning rate=adaptive, alpha = 0.0001, batch size = 32, early stopping=True, n iter no change = 10, max iter = 200) was selected. The network’s layered structure allows hierarchical feature learning, potentially capturing intricate rainfall-streamflow relationships that simpler models may overlook.
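Eq. (3) for a single neuron is a one-liner; the weights, inputs, and bias below are arbitrary values chosen for the example:

```python
import numpy as np

def relu(z):
    """Rectified linear activation, one of the candidates in the search."""
    return np.maximum(z, 0.0)

def neuron(x, w, b, f=relu):
    """Eq. (3): a = f(sum_i w_i * x_i + b)."""
    return f(np.dot(w, x) + b)

x = np.array([0.5, -1.0, 2.0])   # illustrative inputs
w = np.array([1.0, 0.5, 0.25])   # illustrative weights
a = neuron(x, w, b=0.5)          # weighted sum is 0.5, plus bias 0.5
```

A hidden layer simply applies this computation for every neuron in parallel (a matrix-vector product followed by the activation), and layers are stacked to form the full network.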
RF
RF constructs an ensemble of decision trees trained on bootstrapped samples with random feature selection at each split30. The final prediction is obtained by averaging the predictions from all individual trees.
$$\begin{array}{*{20}c} {\hat{y} = \frac{1}{T}\mathop \sum \limits_{t = 1}^{T} f_{t} (x)} \\ \end{array}$$
(4)
where \(\:\widehat{\text{y}}\) is the ensemble prediction, T is the total number of trees, and \(\:{\text{f}}_{\text{t}}\text{(}\text{x}\text{)}\) is the prediction from the tth decision tree. Each tree is trained on a bootstrap sample of the original dataset, and at each node split, only a random subset of features is considered. Hyperparameter tuning examined ensemble sizes (50, 100, 200 trees), maximum depths (None, 10, 20), and minimum samples for splits and leaves (2, 5, 10). The optimal RF configuration (n estimators = 200, max depth = 20, min samples split = 5, min samples leaf = 2) was selected. This bagging approach reduces overfitting and improves generalization through averaging predictions across multiple trees. RF naturally handles feature interactions and provides built-in feature importance measures.
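The bootstrap-and-average mechanics of Eq. (4) can be sketched with stumps standing in for full decision trees; this toy omits the random feature subsetting and is our illustration, not the study's RF setup:

```python
import numpy as np

def fit_stump(x, y):
    """Depth-1 tree: best threshold split minimizing squared error."""
    best = None
    for thr in np.unique(x)[:-1]:  # exclude max so the right side is nonempty
        pred = np.where(x <= thr, y[x <= thr].mean(), y[x > thr].mean())
        sse = ((y - pred) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, thr, y[x <= thr].mean(), y[x > thr].mean())
    return best[1:]

def bagged_predict(x_train, y_train, x_new, T=100, seed=0):
    """Eq. (4): average the predictions of T trees, each fit on a bootstrap sample."""
    rng = np.random.default_rng(seed)
    preds = np.zeros_like(x_new, dtype=float)
    for _ in range(T):
        idx = rng.integers(0, len(x_train), len(x_train))  # bootstrap resample
        thr, lv, rv = fit_stump(x_train[idx], y_train[idx])
        preds += np.where(x_new <= thr, lv, rv)
    return preds / T

x_train = np.linspace(0, 1, 40)
y_train = (x_train > 0.5).astype(float)  # toy step-function target
yhat = bagged_predict(x_train, y_train, np.array([0.2, 0.8]))
```

Averaging across trees trained on different resamples is what damps the variance of any single overfit tree.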
ETR
ETR extends the random forest concept by introducing additional randomness into tree construction; its ensemble-averaging notation follows that of RF31. Unlike RF, ETR uses the entire training sample for each tree and selects split thresholds randomly rather than searching for optimal splits. The key difference therefore lies in the tree construction process, where split thresholds, not just feature subsets, are randomized. This enhanced randomization can further reduce overfitting while potentially increasing computational efficiency through simplified split selection. Hyperparameter tuning examined ensemble sizes (50, 100, 200 trees), maximum depths (None, 10, 20), and minimum samples for splits and leaves (2, 5, 10). The optimal ETR configuration (n estimators = 200, max depth = 20, min samples split = 5, min samples leaf = 2) was selected.
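The contrast with RF shows up in a sketch where the split threshold is drawn at random over the full (non-bootstrapped) sample, in the extra-trees style; again a toy illustration rather than the study's ETR configuration:

```python
import numpy as np

def extra_stump(x, y, rng):
    """Extra-trees style split: the threshold is drawn at random, not optimized."""
    thr = rng.uniform(x.min(), x.max())
    left, right = y[x <= thr], y[x > thr]
    lv = left.mean() if len(left) else y.mean()
    rv = right.mean() if len(right) else y.mean()
    return thr, lv, rv

def extra_ensemble_predict(x_train, y_train, x_new, T=200, seed=0):
    """Average T randomized stumps, each trained on the full sample (no bootstrap)."""
    rng = np.random.default_rng(seed)
    preds = np.zeros_like(x_new, dtype=float)
    for _ in range(T):
        thr, lv, rv = extra_stump(x_train, y_train, rng)
        preds += np.where(x_new <= thr, lv, rv)
    return preds / T

x_train = np.linspace(0, 1, 40)
y_train = (x_train > 0.5).astype(float)  # toy step-function target
yhat = extra_ensemble_predict(x_train, y_train, np.array([0.1, 0.9]))
```

Individual randomized stumps are weaker than optimized ones, so the ensemble's predictions are smoother and less sharply located at the true breakpoint, which is the variance-reduction trade-off the text describes.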
XGB
XGB is a widely adopted machine learning technique, valued for combining computational efficiency with high accuracy in training and prediction. XGB is a tree ensemble method: it constructs a collection of distinct decision trees whose outputs are combined for prediction. Each tree assigns a score to a sample at its leaf nodes based on the sample’s features, and the predicted value is obtained by summing the scores across all trees32.
$$\begin{array}{*{20}c} {\emptyset = \mathop \sum \limits_{k = 1}^{K} L(y_{k} , \hat{y}_{k} ) + \mathop \sum \limits_{i = 1}^{N} \Omega (g_{i} )} \\ \end{array}$$
(5)
$$\begin{array}{*{20}c} {\Omega \left( {g_{i} } \right) = \gamma V + 0.5\lambda \left\| \omega \right\|^{2} } \\ \end{array}$$
(6)
where \(\emptyset\) is the objective function, K is the total number of samples, \(\:{\text{y}}_{\text{k}}\) and \(\:{\widehat{\text{y}}}_{\text{k}}\) represent the observed and predicted values for sample k, N is the number of trees in the ensemble, and \(\:{\text{g}}_{\text{i}}\) denotes the ith tree. There are two components in the XGB objective function: L denotes the loss function that measures the difference between the observed and predicted outcomes, while Ω represents the regularization term that applies a penalty. Furthermore, γ and λ are regularization parameters, V is the number of leaves in the tree, and ω is the vector of leaf scores. This regularization component plays a crucial role in controlling model complexity and preventing overfitting33. Hyperparameter tuning explored learning rates (0.01, 0.1, 0.2), maximum tree depths (3, 5, 7), numbers of estimators (50, 100, 200), and subsample ratios (0.8, 1.0). The optimal XGB configuration (learning rate = 0.1, max depth = 5, n estimators = 200, subsample = 0.8) was selected to maintain high-flow event sensitivity while preventing overfitting.
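As a small numeric check of the per-tree penalty in Eq. (6), with γ, λ, and the leaf scores chosen arbitrarily for the example:

```python
import numpy as np

def tree_penalty(leaf_scores, gamma, lam):
    """Eq. (6): Omega(g) = gamma * V + 0.5 * lambda * ||w||^2,
    where V is the leaf count and w the vector of leaf scores."""
    w = np.asarray(leaf_scores, dtype=float)
    return gamma * len(w) + 0.5 * lam * np.sum(w ** 2)

# A hypothetical tree with three leaves scoring -1, 0.5 and 2:
omega = tree_penalty([-1.0, 0.5, 2.0], gamma=0.1, lam=1.0)
# 0.1 * 3 + 0.5 * (1 + 0.25 + 4) = 0.3 + 2.625 = 2.925
```

Larger γ penalizes growing extra leaves, while larger λ shrinks the leaf scores themselves, which is how the two parameters jointly restrain model complexity.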
LSTM
LSTM networks represent a specialized recurrent neural network architecture designed to capture long-term temporal dependencies in sequential data15. Unlike feedforward neural networks, LSTM incorporates explicit memory mechanisms through gating structures that selectively retain or discard information across time steps, making it particularly suitable for hydrological time series where antecedent conditions significantly influence current streamflow. At each time step t, the LSTM cell updates its hidden state ht and cell state ct following:
$$\begin{array}{*{20}c} {f_{t} = \sigma (W_{f} \cdot [h_{t - 1} ,x_{t} ] + b_{f} ),} \\ \end{array}$$
(7)
$$\begin{array}{*{20}c} {i_{t} = \sigma (W_{i} \cdot [h_{t - 1} ,x_{t} ] + b_{i} ),} \\ \end{array}$$
(8)
$$\begin{array}{*{20}c} {\tilde{c}_{t} = \tanh (W_{c} \cdot [h_{t - 1} ,x_{t} ] + b_{c} ),} \\ \end{array}$$
(9)
$$\begin{array}{*{20}c} {c_{t} = f_{t} \odot c_{t - 1} + i_{t} \odot \tilde{c}_{t} ,} \\ \end{array}$$
(10)
$$\begin{array}{*{20}c} {o_{t} = \sigma (W_{o} \cdot [h_{t - 1} ,x_{t} ] + b_{o} ),} \\ \end{array}$$
(11)
$$\begin{array}{*{20}c} {h_{t} = o_{t} \odot \tanh (c_{t} ),} \\ \end{array}$$
(12)
where xt is the input vector at time step t; ht is the hidden state at time step t; ct is the cell state at time step t; ft, it, and ot are the forget, input, and output gate activations, respectively; \(\:\stackrel{\sim}{{\text{c}}_{\text{t}}}\) is the candidate cell state; Wf, Wi, Wc, Wo are the weight matrices for the gates and candidate cell state; bf, bi, bc, bo are the corresponding bias terms; σ is the sigmoid activation function; and ⊙ denotes element-wise multiplication.
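The gate updates in Eqs. (7)–(12) can be traced with a single NumPy cell step. Here the input vector at time t is written `x_t`, and all parameter values are random stand-ins rather than trained weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM cell update (Eqs. 7-12). W and b hold the four gate
    parameters keyed 'f', 'i', 'c', 'o'; each W acts on [h_prev, x_t]."""
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W["f"] @ z + b["f"])        # forget gate, Eq. (7)
    i = sigmoid(W["i"] @ z + b["i"])        # input gate, Eq. (8)
    c_tilde = np.tanh(W["c"] @ z + b["c"])  # candidate cell state, Eq. (9)
    c = f * c_prev + i * c_tilde            # new cell state, Eq. (10)
    o = sigmoid(W["o"] @ z + b["o"])        # output gate, Eq. (11)
    h = o * np.tanh(c)                      # new hidden state, Eq. (12)
    return h, c

# Toy dimensions: 3 inputs, 2 hidden units; random parameters for illustration.
rng = np.random.default_rng(0)
n_in, n_hid = 3, 2
W = {k: rng.standard_normal((n_hid, n_hid + n_in)) * 0.1 for k in "fico"}
b = {k: np.zeros(n_hid) for k in "fico"}
h, c = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.standard_normal((5, n_in)):  # unroll five time steps
    h, c = lstm_step(x_t, h, c, W, b)
```

The cell state `c` is the memory channel: the forget gate decides how much of it to keep, and the input gate decides how much new candidate information to write.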
The network architecture consisted of a dual-layer stacked LSTM with hidden sizes of 128 and 64 units respectively, followed by a fully connected output layer. Input sequences were constructed using a sliding window of 7 lag days, resulting in input tensors of shape (samples, 7, 19) where 19 represents the number of features per time step. Hyperparameter tuning explored hidden layer configurations [(64, 32), (128, 64), (256, 128)], dropout rates (0.1, 0.2, 0.3), learning rates (0.001, 0.01), and batch sizes (32, 64, 128). The model was trained using the Adam optimizer to minimize the RMSE loss shared by all models. Early stopping with patience of 30 epochs was implemented to prevent overfitting, with maximum training epochs set to 200. The optimal LSTM configuration (hidden sizes=(128, 64), dropout = 0.2, learning rate = 0.001, batch size = 32) was selected.
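The sliding-window construction described above can be sketched as follows; the random arrays simply stand in for the real 19-feature daily record, and `make_sequences` is our illustrative helper:

```python
import numpy as np

def make_sequences(features, target, lag=7):
    """Build (samples, lag, n_features) input tensors and aligned targets:
    the window covering days t-lag .. t-1 predicts streamflow on day t."""
    X, y = [], []
    for t in range(lag, len(features)):
        X.append(features[t - lag:t])
        y.append(target[t])
    return np.stack(X), np.array(y)

# Illustrative record: 100 days, 19 features per day (as in the text).
rng = np.random.default_rng(0)
feats = rng.random((100, 19))
flow = rng.random(100)
X, y = make_sequences(feats, flow, lag=7)  # X.shape == (93, 7, 19)
```

Note that the first `lag` days cannot be predicted, so a record of n days yields n minus 7 training samples.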
Experimental setup
To ensure fair comparison across all models, a unified training framework was established for model development and evaluation. The loss function for all models was defined as the Root Mean Square Error (RMSE) between observed and predicted streamflow values:
$$\begin{array}{*{20}c} {RMSE = \sqrt {\frac{1}{n}\mathop \sum \limits_{i = 1}^{n} (Q_{obs,i} – Q_{pred,i} )^{2} } } \\ \end{array}$$
(13)
where \(\:{\text{Q}}_{\text{obs}\text{,}\text{i}}\) and \(\:{\text{Q}}_{\text{pred}\text{,}\text{i}}\) denote observed and predicted streamflow at time i, n is the number of observations, and \(\:\stackrel{-}{{\text{Q}}_{\text{obs}}}\) (used in Eq. 15) represents the mean observed flow. Hyperparameter optimization was conducted using GridSearchCV, which performs exhaustive search over specified parameter grids. The search ranges for hyperparameters were determined based on recommended values from the scikit-learn, torch and xgboost library documentation, and established practices in hydrological machine learning literature14,34. To prevent data leakage and ensure proper temporal ordering in time series prediction, TimeSeriesSplit cross-validation with 5 folds was employed instead of standard k-fold cross-validation. This approach ensures that training data always precedes validation data chronologically, maintaining the temporal integrity of hydrological predictions. In each fold, the model was trained on historical data and validated on subsequent time periods, mimicking real-world forecasting scenarios. Model selection was based on the negative RMSE score averaged across all cross-validation folds, ensuring that the optimal hyperparameter configuration for each model was identified through consistent evaluation criteria. All model implementations were conducted using Python (version 3.12) with scikit-learn library (version 1.7.2) for LR, GBR, ANN, RF, and ETR models, xgboost library (version 3.0.5) for XGB model, and torch (version 2.8.0) for LSTM model.
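The expanding-window logic can be sketched with a simplified pure-Python analogue of scikit-learn's TimeSeriesSplit (the real class handles remainder samples and configurable test sizes; this sketch assumes evenly divisible folds):

```python
def time_series_folds(n_samples, n_splits=5):
    """Expanding-window folds: each validation block follows all of its
    training data chronologically, so no future information leaks backward."""
    fold_size = n_samples // (n_splits + 1)
    for k in range(1, n_splits + 1):
        train_idx = list(range(0, k * fold_size))                  # all history so far
        val_idx = list(range(k * fold_size, (k + 1) * fold_size))  # next block in time
        yield train_idx, val_idx

folds = list(time_series_folds(120, n_splits=5))
```

Unlike shuffled k-fold, every fold here trains on a prefix of the record and validates on the block immediately after it, which is what makes the cross-validation scores a fair proxy for forecasting skill.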
Model performance was assessed using four complementary metrics (including RMSE) providing comprehensive evaluation of predictive accuracy and hydrological fidelity:
$$\begin{array}{*{20}c} {MAE = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} |Q_{obs,i} – Q_{pred,i} |} \\ \end{array}$$
(14)
$$\begin{array}{*{20}c} {NSE = 1 – \frac{{\mathop \sum \nolimits_{i = 1}^{n} (Q_{obs,i} – Q_{pred,i} )^{2} }}{{\mathop \sum \nolimits_{i = 1}^{n} (Q_{obs,i} – \overline{{Q_{obs} }} )^{2} }}} \\ \end{array}$$
(15)
$$\begin{array}{*{20}c} {KGE = 1 – \sqrt {(r – 1)^{2} + (\frac{{\sigma_{pred} }}{{\sigma_{obs} }} – 1)^{2} + (\frac{{\mu_{pred} }}{{\mu_{obs} }} – 1)^{2} } } \\ \end{array}$$
(16)
where r is the correlation coefficient, σ denotes standard deviation, and µ represents mean values.
Mean Absolute Error (MAE) quantifies absolute prediction errors in original units (m³/s). Nash–Sutcliffe Efficiency (NSE) assesses model skill relative to a naive mean predictor, with values approaching 1 indicating excellent performance. Kling–Gupta Efficiency (KGE) provides a decomposed evaluation considering correlation, bias, and variability ratio, offering more balanced assessment than NSE alone. Together, these metrics enable comprehensive characterization of model strengths and weaknesses across different flow regimes and hydrological conditions.
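The four metrics of Eqs. (13)–(16) can be computed in a few lines of NumPy; `evaluate` is our illustrative helper, not code from the study:

```python
import numpy as np

def evaluate(q_obs, q_pred):
    """RMSE, MAE, NSE, and KGE as defined in Eqs. (13)-(16)."""
    err = q_obs - q_pred
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    nse = 1.0 - np.sum(err ** 2) / np.sum((q_obs - q_obs.mean()) ** 2)
    r = np.corrcoef(q_obs, q_pred)[0, 1]  # correlation component of KGE
    kge = 1.0 - np.sqrt((r - 1) ** 2
                        + (q_pred.std() / q_obs.std() - 1) ** 2
                        + (q_pred.mean() / q_obs.mean() - 1) ** 2)
    return rmse, mae, nse, kge

# Sanity check: a perfect prediction gives zero error and NSE = KGE = 1.
q = np.array([10.0, 50.0, 200.0, 80.0])
rmse, mae, nse, kge = evaluate(q, q.copy())
```

Because NSE normalizes by the variance of observations, squared errors on high flows dominate it, whereas KGE's separate bias and variability terms can flag a model that matches flood peaks but drifts on the mean.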
Feature importance analysis
To interpret model predictions and identify key hydrological drivers, feature importance analysis was conducted using multiple approaches. For LR, standardized regression coefficients were used to quantify variable contributions. For XGB, built-in feature importance measures based on information gain or impurity reduction were employed. For ANN and LSTM, SHapley Additive exPlanations (SHAP) was applied to interpret the black-box predictions. SHAP is a unified approach to explain individual predictions based on game-theoretic Shapley values35. The SHAP value for each feature represents its contribution to the difference between the actual prediction and the average prediction across all samples. For a given prediction, the SHAP value of feature j is calculated as follows.
$$\begin{array}{*{20}c} {\phi_{j} = \mathop \sum \limits_{{S \subseteq F\backslash \{ j\} }} \frac{{\left| S \right|!\left( {\left| F \right| – \left| S \right| – 1} \right)!}}{\left| F \right|!}[f_{{S \cup \left\{ j \right\}}} \left( {x_{{S \cup \left\{ j \right\}}} } \right) – f_{S} (x_{S} )]} \\ \end{array}$$
(17)
where F is the set of all input features, S is a subset of features excluding feature j, |S| and |F| represent the number of features in sets S and F respectively, \(\:{\text{f}}_{\text{S}}\text{(}{\text{x}}_{\text{S}}\text{)}\) is the model prediction using only features in subset S, and \(f_{{S \cup \left\{ j \right\}}}\) is the prediction when feature j is added to subset S. The equation computes a weighted average of the marginal contributions of feature j across all possible feature combinations, ensuring fair attribution of prediction contributions among all features.
SHAP values satisfy three desirable properties: local accuracy (the sum of all feature SHAP values equals the difference between the prediction and the base value), missingness (features with no impact have zero SHAP values), and consistency (if a feature contributes more in one model, its SHAP value should not decrease). This method enables both global interpretation through aggregated SHAP values across all samples and local interpretation for individual predictions, which is particularly valuable for understanding model behavior during high-flow flood events.
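Eq. (17) can be evaluated exactly for a small model by enumerating all feature subsets; the two-feature linear model and zero baseline below are a toy stand-in (in practice the SHAP library uses efficient approximations rather than this brute-force enumeration):

```python
from itertools import combinations
from math import factorial

def shap_values(predict, x, n_features):
    """Exact Shapley values via Eq. (17): weighted marginal contributions of
    each feature over all subsets S of the remaining features."""
    phi = []
    for j in range(n_features):
        others = [k for k in range(n_features) if k != j]
        val = 0.0
        for size in range(len(others) + 1):
            for S in combinations(others, size):
                w = (factorial(len(S)) * factorial(n_features - len(S) - 1)
                     / factorial(n_features))
                val += w * (predict(x, set(S) | {j}) - predict(x, set(S)))
        phi.append(val)
    return phi

def predict(x, subset, baseline=(0.0, 0.0), coefs=(2.0, 3.0)):
    """Toy linear model f(x) = 2*x0 + 3*x1; features outside `subset`
    are replaced by their baseline values."""
    vals = [x[k] if k in subset else baseline[k] for k in range(len(x))]
    return sum(c * v for c, v in zip(coefs, vals))

phi = shap_values(predict, x=(1.0, 1.0), n_features=2)
base = predict((1.0, 1.0), set())     # prediction with no features present
full = predict((1.0, 1.0), {0, 1})    # prediction with all features present
```

For a linear model with a zero baseline, each feature's Shapley value reduces to its own coefficient contribution, and local accuracy holds: the values sum to `full - base`.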
