Prediction of the monthly river water level by using ensemble decomposition modeling

Machine Learning


Data collection and analysis

The monthly stream flow and water level data were collected from the station for 2014 to 2023, and these datasets were pre-processed before the models were applied. The river water level prediction models were built under two scenarios in order to select the best model for future river water level conditions. In the first scenario, the inputs comprised four variables, namely water level lags 1 to 3 (selected with the ACF and PACF methods) and stream flow; in the second scenario, decomposition modeling was used to predict the river water level from five input variables, IMF-1 to IMF-5, with stream flow also included as an input. In this research, the three lags and the monthly stream flow time series are used as predictors (inputs), and the observed river water level time series is used as the target (output) for the decomposition ML modeling.
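As an illustration of the first-scenario inputs, the following sketch builds the lag-1 to lag-3 features and plots the ACF/PACF used to guide lag selection; the file and column names are hypothetical, and the pandas and statsmodels packages are assumed.

```python
# A sketch of building the lag-1 to lag-3 inputs and inspecting ACF/PACF
# (file and column names are hypothetical).
import pandas as pd
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

df = pd.read_csv("station_monthly.csv", parse_dates=["date"], index_col="date")

# Scenario 1 inputs: water-level lags 1-3 plus the concurrent stream flow
for k in (1, 2, 3):
    df[f"level_lag{k}"] = df["water_level"].shift(k)
df = df.dropna()

# ACF/PACF plots used to confirm how many lags carry predictive information
plot_acf(df["water_level"], lags=24)
plot_pacf(df["water_level"], lags=24)
```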

ML modeling development for predicting river water level

Estimating accurate river water levels is vital for flood and drought risk management, water resources planning, electrical energy production, urban infrastructure design, and irrigation planning. Machine learning algorithms are widely used for estimating river water levels because they extract useful information from data. These models stand out by providing simple and promising inferences where physical models cannot be used or are too complex49,50. In this paper, one station was selected for the prediction of monthly river water level using eight ML and decomposition models. To ensure data quality, missing values were removed from the river water level datasets using the dropna() function of the Pandas library, and outliers were removed using the interquartile range (IQR) method. On this basis, multiple models were developed to predict river water levels from the nine-year datasets. A multiplicative seasonal decomposition model was used to produce the trend, seasonal, and residual plots of the river water level and stream flow datasets. Selecting among the many candidate input variables (feature selection) is an important step in developing ML models for predicting the target variable; in this study, best subset regression analysis was used to identify the best input variable combination for river water level prediction. The river water level datasets were divided into 80% training and 20% testing subsets for the prediction modeling. During the training phase, the ML models used river water level data from 01-06-2014 to 20-07-2020, and the testing phase used data from 01-08-2020 to 31-12-2022; together these datasets were used to develop the monthly river water level prediction models. Finally, eight powerful models were developed using different input combinations, namely the three lags and the CEEMDAN components (IMF-1 to IMF-5), in two scenarios. The details of the adopted input and output variables are presented in Table 2, and Table 3 presents the structures of the ML and decomposition models under the first and second input scenarios; the listed hyper-parameters helped to improve model accuracy and produce correct prediction values. The adopted methodology is shown in Fig. 2, and details of the input variables of the second combination are presented in Fig. 3. All prediction models and data processing were implemented in the Python programming language and its packages.
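The preprocessing and data-split steps described above can be sketched as follows, assuming illustrative file and column names rather than the study's actual ones; the steps mirror the dropna() cleaning, IQR outlier removal, multiplicative seasonal decomposition, and chronological 80/20 split.

```python
# A sketch of the preprocessing and data-split steps (illustrative names).
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

df = pd.read_csv("station_monthly.csv", parse_dates=["date"], index_col="date")
df = df.dropna()                                    # remove missing records

# Interquartile-range (IQR) rule for outlier removal on the target column
q1, q3 = df["water_level"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["water_level"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Multiplicative seasonal decomposition for trend/seasonal/residual plots
seasonal_decompose(df["water_level"], model="multiplicative", period=12).plot()

# Chronological 80/20 train-test split
split = int(len(df) * 0.8)
train, test = df.iloc[:split], df.iloc[split:]
```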

Table 2 The Input–output structures adopted for model development.
Table 3 Details of ML and decomposition models structures for predicting the river water level.
Fig. 2

Adopted methodology framework for monthly forecasting of the river water level values.

Fig. 3

The workflow of the CEEMDAN-ML model.

SVM-linear

The SVM algorithm was developed by51 and further refined by52; it relies on structural risk minimization and statistical learning principles53. Its primary objective is to reduce both model complexity and error. SVM achieves this by projecting the data into a higher-dimensional feature space to identify an optimal separating hyperplane from the training data54. In practice, SVM effectively captures the nonlinear relationships among variables by creating linear boundaries using a kernel function55. The algorithm constructs straightforward classifications by establishing hyperplanes, and the kernel function mathematically represents this relationship55. By projecting a separating hyperplane from the origin between points belonging to two classes within a specified error threshold, SVM delineates the relationship between the xi parameters in the original space with n coordinates. Considering input and output variables as x and y, respectively, where xi belongs to the set Rn, yi belongs to the set {1, −1}, and the index i ranges from 1 to n, the optimal separating hyperplane is expressed by the following equation53:

$$g(x)=\text{sgn}\left(\sum_{i=1}^{n} {y}_{i}{\alpha }_{i}K\left({x}_{i},{x}_{j}\right)+b\right)$$

(1)

Here, n is the number of input variables, αi are the Lagrange multipliers, K(xi, xj) is the kernel function, and b is the offset of the hyperplane from the origin.

The kernel has several options: linear, polynomial, RBF, or sigmoid. The linear kernel is used to separate the input data of the SVM linearly and express it with a hyperplane. The efficacy of the linear-kernel support vector machine is notable, particularly in cases where the dataset is linearly separable. However, where the dataset is not linearly separable and possesses a complex structure, it is recommended to employ the RBF kernel.
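A minimal scikit-learn sketch of a linear-kernel SVM regressor is given below; the feature list and the C and epsilon values are illustrative, not the study's tuned settings, and the train/test frames are assumed to come from the split sketched earlier.

```python
# A minimal linear-kernel SVM regression sketch (scikit-learn).
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Assumed input/target arrays built from the earlier (hypothetical) train/test split
features = ["level_lag1", "level_lag2", "level_lag3", "stream_flow"]
X_train, y_train = train[features], train["water_level"]
X_test = test[features]

svm_linear = make_pipeline(StandardScaler(), SVR(kernel="linear", C=1.0, epsilon=0.1))
svm_linear.fit(X_train, y_train)
y_pred_linear = svm_linear.predict(X_test)
```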

SVM-RBF

In contrast to the linear kernel, the RBF kernel facilitates the modeling of nonlinear relationships between class labels and features. The distinctive shape of the RBF kernel, demonstrated by56, underscores its effectiveness in capturing such nonlinearities. Additionally, the difficulty of model selection is influenced by the number of tuning parameters, and the RBF kernel requires fewer parameters than the polynomial and sigmoid kernels57. Moreover, RBF kernels demonstrate robust performance under common smoothness assumptions58. Due to its simple design, strong generalization capability, high tolerance to input noise, and efficient online learning capability, the RBF kernel is preferred. The RBF kernel function is defined by Eq. (2)55,59:

$$K\left({x}_{i},{x}_{j}\right)=\text{exp}\left(-\gamma {\Vert {x}_{i}-{x}_{j}\Vert }^{2}\right)$$

(2)

Here, the parameter γ governs the level of nonlinearity in the SVM model.
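A brief sketch of selecting γ (and C) for the RBF-kernel SVM with a cross-validated grid search is shown below; the candidate values are illustrative, and X_train, y_train are the same assumed training arrays as in the linear-kernel sketch.

```python
# A sketch of tuning the RBF kernel's gamma (and C) by cross-validated grid search.
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

param_grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1, 1.0]}   # illustrative values
search = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=5,
                      scoring="neg_root_mean_squared_error")
search.fit(X_train, y_train)        # assumed training arrays from the split above
svm_rbf = search.best_estimator_
```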

Random forest (RF)

The RF algorithm, proposed by60, is based on combining many decision trees. The RF algorithm helps solve many engineering problems by performing prediction and inference operations. RFs have a significant advantage in modeling complex structures with small samples and high-dimensional feature spaces61. The mathematical expression of RF is presented in Eq. (3)62.

$$T_{j} = \sum\limits_{i = 1}^{n} {g_{j} \left( {X_{ij} } \right)s\left( {Y_{i} } \right)}$$

(3)

The RF model operates under the assumption that s(Yi) ∈ R, where Yi ∈ {1, 2, 3, …, k} signifies the ordinal response of observation i with covariates Xij. Here, j = {1, 2, 3, …, p} indexes the predictor variables. A test statistic is used to evaluate the association between the ordinal response and the predictor variable Xj. The function gj: Xj → Rpj denotes a deterministic transformation of the predictor variable Xj, converting it from a one-dimensional vector space to a pj-dimensional vector space63.
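For completeness, a minimal random forest regression sketch with scikit-learn follows; the hyper-parameter values are illustrative and not the settings reported in Table 3.

```python
# A minimal random forest regression sketch (hyper-parameters are illustrative).
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=500, random_state=42)
rf.fit(X_train, y_train)            # assumed training arrays from the split above
y_pred_rf = rf.predict(X_test)
```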

Random subspace

In RS, sampling and combining methods similar to bagging are used to create a prediction model. In contrast to bagging, the RS algorithm employs bootstrapping from the feature space rather than from the training samples64. RS stands out in effectively solving both regression and classification problems. The algorithm comprises several elements, primarily the training dataset x, the number of subspaces L, the classifier or regressor w, and the number of features ds65. In the RS model, random subsets with ds features are generated and stored in the L subspaces. During the second phase, a distinct regressor is created for each subset by training each base regressor. The combination of these elements results in the formation of an ensemble regressor E66.
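Random subspace regression can be approximated with scikit-learn's BaggingRegressor by resampling the feature space instead of the training samples, as sketched below; the number of base regressors L and the feature fraction ds are illustrative, not the study's configuration.

```python
# Random subspace approximated with BaggingRegressor: bootstrap the feature
# space (bootstrap_features=True) rather than the training samples.
from sklearn.ensemble import BaggingRegressor

rs = BaggingRegressor(
    n_estimators=50,          # L base regressors (illustrative)
    max_features=0.5,         # ds: fraction of features per random subspace
    bootstrap=False,          # keep all training samples
    bootstrap_features=True,  # sample the feature space instead
    random_state=42,
)
rs.fit(X_train, y_train)      # assumed training arrays from the split above
```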

Complete ensemble empirical mode decomposition adaptive noise (CEEMDAN) technique

CEEMDAN is a time–frequency analysis method; the concept is that additional finite-amplitude white noise, added in pairs, is uniformly distributed over the whole time–frequency space of the original signal, and this space is combined with components of different scales at various frequencies. The technique reduces the residual error of the reconstruction procedure by adding paired white noise with positive and negative signs, which yields components with less noise and more physical significance67,68. EMD and EEMD can decompose data into high-frequency signals, the IMFs, over many iterations; however, they always leave a certain amount of white noise. To resolve this issue,69 developed a novel data decomposition technique called CEEMDAN. The steps to implement CEEMDAN are as follows (a brief code sketch is given after the list):

(1) White noise $w^{i}(t)$ is added to the original signal $x(t)$ to construct a set of noisy realizations:

$$x^{i} \left( t \right) = x\left( t \right) + \varepsilon_{0} w^{i} \left( t \right),\quad i = 1,2, \ldots ,K$$

(4)

where K is the number of noise realizations and ε0 is the amplitude coefficient of the added noise.

(2) The collection of signals undergoes an EMD decomposition, after which the first components from every decomposition are averaged:

$$\overline{{IMF_{1} }} \left( t \right) = K^{ - 1} \sum\limits_{i = 1}^{K} {IMF_{1}^{i} \left( t \right)}$$

(5)

(3) The residual of the first stage is calculated:

$$r_{1} \left( t \right) = x\left( t \right) - \overline{{IMF_{1} }} \left( t \right)$$

(6)

(4) The signal $r_{1}(t) + \varepsilon_{1} EMD_{1}\left( w^{i}(t) \right)$ is decomposed, and the second component is obtained by averaging:

$$\overline{{IMF_{2} }} \left( t \right) = K^{ - 1} \sum\limits_{i = 1}^{K} {EMD_{1} \left( {r_{1} \left( t \right) + \varepsilon_{1} EMD_{1} \left( {w^{i} \left( t \right)} \right)} \right)}$$

(7)

where EMDk(·) represents the k-th IMF mode decomposed by the EMD algorithm.

(5) In the subsequent stages, the k-th residual and the (k + 1)-th component are computed according to the following formulas:

$$r_{k} \left( t \right) = r_{k - 1} \left( t \right) - \overline{{IMF_{k} }} \left( t \right)$$

$$\overline{{IMF_{k + 1} }} \left( t \right) = K^{ - 1} \sum\limits_{i = 1}^{K} {EMD_{1} \left( {r_{k} \left( t \right) + \varepsilon_{k} EMD_{k} \left( {w^{i} \left( t \right)} \right)} \right)}$$

(8)

where rk(t) is the k-th residual. The procedure is repeated until the residual can no longer be decomposed, so that the original signal is reconstructed as the sum of all $\overline{{IMF_{k} }}(t)$ components and the final residue R(t).
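A brief sketch of the CEEMDAN decomposition used to generate IMF-1 to IMF-5 for the second scenario is given below, assuming the PyEMD package (installed as EMD-signal); the number of noise realizations and the series name are illustrative.

```python
# A sketch of decomposing the water-level series with CEEMDAN (PyEMD package).
import numpy as np
from PyEMD import CEEMDAN

signal = train["water_level"].to_numpy()   # monthly series from the split above
ceemdan = CEEMDAN(trials=100)              # K noise realizations (illustrative)
imfs = ceemdan(signal)                     # each row is one IMF component
imf_inputs = imfs[:5]                      # IMF-1 to IMF-5 as second-scenario inputs
```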