Monthly stream flow and water level data were collected at the station from 2014 to 2023, and the datasets were pre-processed before the models were applied. River water level prediction was set up under two scenarios in order to select the best model for future river water level conditions. The first scenario uses four input variables, namely lags 1 to 3 of the river water level, identified with the ACF and PACF methods, together with stream flow. The second scenario uses decomposition modeling, with the five components IMF-1 to IMF-5 and stream flow as input variables. In this research, the three lags and the monthly stream flow time series are used as predictors (inputs), and the observed river water level time series is used as the target (output) for the ML and decomposition modeling.
ML modeling development for predicting river water level
Accurate estimation of river water levels is vital for flood and drought risk management, water resources planning, electrical energy production, urban infrastructure design, and irrigation planning. Machine learning algorithms are widely used to estimate river water levels because they provide simple and promising inferences where physical models cannot be used or are too complex49,50. In this paper, one station was selected for monthly river water level prediction using eight ML and decomposition models. To ensure data quality, missing values were removed from the river water level datasets using the dropna() function of the Pandas library, and outliers were removed using the interquartile range (IQR) method. Multiple models were then developed to predict river water levels from the nine-year datasets. A seasonal decomposition (multiplicative) model was used to produce the trend, seasonal, and residual plots of the river water level and stream flow datasets, which help in understanding the data. Because input variable and feature selection is an important step in developing ML models, best subset regression analysis was used to identify the best combination of input variables for river water level prediction. The river water level datasets were divided into 80% for training and 20% for testing. The training phase used data from 01-06-2014 to 20-07-2020, and the testing phase used data from 01-08-2020 to 31-12-2022.
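The preprocessing steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' script: the column names, the synthetic values, and the conventional 1.5 × IQR fences are all hypothetical choices, since the source only names dropna(), the IQR method, and the 80/20 split.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly records; column names and values are illustrative only.
dates = pd.date_range("2014-06-01", periods=24, freq="MS")
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "stream_flow": rng.uniform(5.0, 50.0, size=24),
        "water_level": rng.uniform(1.0, 3.0, size=24),
    },
    index=dates,
)
df.iloc[3, 1] = np.nan   # simulate a missing water level reading
df.iloc[10, 1] = 25.0    # simulate an outlier

# 1) Remove rows with missing values.
clean = df.dropna()

# 2) Remove outliers with the IQR rule (1.5 * IQR fences are an assumed convention).
q1 = clean["water_level"].quantile(0.25)
q3 = clean["water_level"].quantile(0.75)
iqr = q3 - q1
inside = clean["water_level"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
clean = clean[inside]

# 3) Chronological 80/20 split: no shuffling, so the test period follows training.
n_train = int(len(clean) * 0.8)
train, test = clean.iloc[:n_train], clean.iloc[n_train:]
```

Splitting by position rather than at random keeps the test months strictly after the training months, which matches the date ranges given in the text.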
Finally, eight models were developed using different input combinations, namely three lags and the CEEMDAN components (IMF 1 to 5), in two scenarios. Details of the input and output variables are presented in Table 2. Table 3 presents the structures of the ML and decomposition models for the first and second combination scenarios; tuning these hyper-parameters helped to improve model accuracy. The adopted methodology is shown in Fig. 2, and the input variables of the second combination are detailed in Fig. 3. All prediction models were developed, and all datasets processed, in Python and its packages.
Table 2 The Input–output structures adopted for model development.
Table 3 Details of ML and decomposition models structures for predicting the river water level.
Fig. 2
Adopted methodology framework for monthly forecasting of river water level.
Fig. 3
The workflow of CEEMDAN-ML model.
SVM-linear
The SVM algorithm, developed by51 and further refined by52, relies on structural risk minimization and statistical learning principles53. Its primary objective is to reduce both model complexity and error. SVM achieves this by projecting the data into a higher-dimensional feature space in order to identify an optimal separating hyperplane from the training data54. In practice, SVM captures nonlinear relationships among variables by constructing linear boundaries in the feature space defined by a kernel function55. The algorithm builds simple classifiers by establishing hyperplanes, and the kernel function represents this relationship mathematically55. By fitting a separating hyperplane, offset from the origin, between points belonging to two classes within a specified error threshold, SVM delineates the relationship between the xi parameters in the original space with n coordinates. Considering input and output variables as x and y, respectively, where xi belongs to the set Rn, yi belongs to the set {1, −1}, and the value of i ranges from 1 to n, the optimal separating hyperplane is expressed by the following equation53:
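A standard kernel form of this decision function, written with the symbols used in this section, is given here as an assumed textbook formulation and not necessarily the exact expression of ref. 53:

$$f(x_{j}) = \operatorname{sign}\left( \sum_{i=1}^{n} \alpha_{i}\, y_{i}\, K(x_{i}, x_{j}) + b \right)$$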
Here, n is the number of input variables, αi are the Lagrange multipliers, K(xi, xj) is the kernel function, and b is the offset of the hyperplane from the origin.
The kernel function has several options: linear, polynomial, RBF, or sigmoid. The linear kernel separates the input data of the SVM linearly and expresses the separation as a hyperplane. The linear-kernel support vector machine is notably effective in cases where the dataset is linearly separable. However, in cases where the dataset is not linearly separable and has a complex structure, the RBF kernel is recommended.
SVM-RBF
In contrast to the linear kernel, the RBF kernel allows nonlinear relationships between class labels and features to be modeled. The distinctive shape of the RBF kernel, demonstrated by56, underscores its effectiveness in capturing such nonlinearities. Additionally, the complexity of model selection is influenced by the number of tuning parameters, and the RBF kernel requires fewer parameters than the polynomial and sigmoid kernels57. Moreover, RBF kernels perform robustly under common smoothness assumptions58. The RBF kernel is preferred because of its simple design, strong generalization capability, high tolerance to input noise, and efficient online learning capability. The RBF kernel function is defined by Eq. (3)55,59:
$$K(x_{i}, x_{j}) = \exp\left( -\gamma \lVert x_{i} - x_{j} \rVert^{2} \right) \tag{3}$$
Here, the parameter γ governs the level of nonlinearity in the SVM model.
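To make the difference between the two kernels concrete, the short sketch below evaluates a linear kernel (a dot product) and an RBF kernel of the usual form exp(−γ‖x − z‖²) on toy vectors. The vectors and the γ values are arbitrary illustrative choices, not the tuned settings of this study.

```python
import math

def linear_kernel(x, z):
    """Linear kernel: the plain dot product of two feature vectors."""
    return sum(a * b for a, b in zip(x, z))

def rbf_kernel(x, z, gamma=0.5):
    """RBF kernel exp(-gamma * ||x - z||^2); gamma=0.5 is an arbitrary example value."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

x1 = [1.0, 2.0, 3.0]
x2 = [1.0, 2.0, 3.0]
x3 = [2.0, 0.0, 3.0]

dot = linear_kernel(x1, x3)            # 1*2 + 2*0 + 3*3
same = rbf_kernel(x1, x2)              # identical points give similarity 1.0
far = rbf_kernel(x1, x3)               # squared distance is 5, so exp(-2.5)
wide = rbf_kernel(x1, x3, gamma=0.1)   # a smaller gamma decays more slowly
```

A larger γ makes the similarity fall off faster with distance, which is the sense in which γ controls the degree of nonlinearity.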
Random forest (RF)
The RF algorithm, proposed by60, is based on combining many decision trees. It helps solve many engineering problems by performing prediction and inference. RFs are particularly strong at modeling complex structures with small samples and high-dimensional feature spaces61. The mathematical expression of RF is presented in Eq. (1)62.
The RF model operates under the assumption that s(Yi) ∈ R, where Yi ∈ {1, 2, 3, …, k} denotes the ordinal response of observation i with covariates Xij. Here, j = {1, 2, 3, …, p} is the index of the predictor variables. A test statistic is used to evaluate the association between the ordinal response and the predictor variable Xj. The function gj: Xj → Rpj denotes a deterministic transformation of the predictor variable Xj that maps it from a one-dimensional vector space to a pj-dimensional vector space63.
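The idea of combining many decision trees can be illustrated with a small regression sketch. It assumes scikit-learn's RandomForestRegressor as the implementation, which the source does not name, and uses synthetic stand-ins for the four predictors; n_estimators and random_state are illustrative values rather than the settings of Table 3.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
# Synthetic stand-ins for the predictors (three lags plus stream flow); not the study data.
X = rng.uniform(0.0, 1.0, size=(120, 4))
y = 1.5 * X[:, 0] + 0.5 * X[:, 3] + rng.normal(0.0, 0.05, size=120)

# Chronological 80/20 split, as in the text.
n_train = int(0.8 * len(X))
X_train, X_test = X[:n_train], X[n_train:]
y_train, y_test = y[:n_train], y[n_train:]

# Illustrative hyper-parameters, not the tuned values of the study.
rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)

# The forest prediction is the average of the individual tree predictions.
tree_mean = np.mean([tree.predict(X_test) for tree in rf.estimators_], axis=0)
```

Averaging over trees is what reduces the variance of any single tree, which is the mechanism behind the resistance to over-fitting discussed later in the paper.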
Random subspace
The random subspace (RS) method builds a prediction model using sampling and combining steps similar to bagging. In contrast to bagging, the RS algorithm bootstraps from the feature space rather than from the training samples64. RS is effective for both regression and classification problems. The algorithm comprises several elements: the training dataset x, the number of subspaces L, the classifier or regressor w, and the number of features ds65. In the first phase, random subsets of ds features are generated and stored as the L subspaces. In the second phase, a distinct base regressor is trained on each subset. Combining these regressors forms the ensemble regressor E66.
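The two phases can be sketched in a few lines. This is a minimal illustration that assumes an ordinary least-squares model as the base regressor w, which the source does not specify; the function names and the synthetic data are hypothetical.

```python
import numpy as np

def fit_random_subspace(X, y, n_subspaces=10, n_features=2, seed=0):
    """Fit one least-squares regressor per random feature subset.

    n_subspaces plays the role of L and n_features the role of ds in the text.
    """
    rng = np.random.default_rng(seed)
    members = []
    for _ in range(n_subspaces):
        # Sample features without replacement; all training rows are kept.
        cols = rng.choice(X.shape[1], size=n_features, replace=False)
        A = np.column_stack([X[:, cols], np.ones(len(X))])  # add intercept column
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        members.append((cols, coef))
    return members

def predict_random_subspace(members, X):
    """Average the predictions of all base regressors (the ensemble E)."""
    preds = []
    for cols, coef in members:
        A = np.column_stack([X[:, cols], np.ones(len(X))])
        preds.append(A @ coef)
    return np.mean(preds, axis=0)

# Synthetic data with four predictors (illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, -0.5, 0.25, 2.0]) + rng.normal(scale=0.1, size=100)

ensemble = fit_random_subspace(X, y, n_subspaces=20, n_features=3)
y_hat = predict_random_subspace(ensemble, X)
```

Because every member sees only part of the feature set, the ensemble's members are diverse even though they are trained on the same rows, which is the contrast with bagging drawn above.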
CEEMDAN is a time–frequency analysis method. The idea is that pairs of finite-amplitude white noise are added so that they are evenly distributed over the whole time–frequency space of the original signal, and the signal components at different scales are projected onto the corresponding frequency bands. The technique reduces the residual error in the reconstruction by adding paired white noise, that is, positive and negative noise signals, and yields components with less noise and more physical meaning67,68. EMD and EEMD can decompose data into IMFs, from high to low frequency, over many iterations. However, they always leave a certain amount of white noise. To resolve this issue,69 developed a novel data decomposition technique called CEEMDAN. Steps to implement CEEMDAN:
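(1) White noise \(w^{i}(t)\) is added to the original signal \(X(t)\) to create the noisy realizations:
$$X^{i}(t) = X(t) + \varepsilon_{0}\, w^{i}(t), \quad i = 1, 2, \ldots, K \tag{4}$$
where \(\varepsilon_{0}\) is the noise amplitude and K is the number of realizations.
(2) The collection of signals undergoes an EMD decomposition, after which the first components from every decomposition are averaged:
$$\overline{IMF_{1}}(t) = K^{-1} \sum\limits_{i = 1}^{K} IMF_{1}^{i}(t) \tag{5}$$
(3) The residual of the first stage is calculated:
$$r_{1}(t) = X(t) - \overline{IMF_{1}}(t) \tag{6}$$
(4) The signal \(r_{1}(t) + \varepsilon_{1} EMD_{1}(w^{i}(t))\) is decomposed to obtain the second component:
$$\overline{IMF_{2}}(t) = K^{-1} \sum\limits_{i = 1}^{K} EMD_{1}\left( r_{1}(t) + \varepsilon_{1} EMD_{1}(w^{i}(t)) \right) \tag{7}$$
where \(EMD_{k}(\cdot)\) represents the k-th IMF mode obtained by the EMD algorithm.
(5) In the subsequent stages, the k-th residual and the (k + 1)-th component are calculated according to the following formula:
$$r_{k}(t) = r_{k-1}(t) - \overline{IMF_{k}}(t), \qquad \overline{IMF_{k+1}}(t) = K^{-1} \sum\limits_{i = 1}^{K} EMD_{1}\left( r_{k}(t) + \varepsilon_{k} EMD_{k}(w^{i}(t)) \right) \tag{8}$$
(6) Eqs. 7 and 8 are repeated until the residual component \(r_{k}\) no longer satisfies the decomposition conditions, at which point the decomposition stops. Ultimately, the original signal can be represented by Eq. 9:
$$X(t) = \sum\limits_{k = 1}^{M} \overline{IMF_{k}}(t) + R(t) \tag{9}$$
where \(R(t)\) is the final residual and M is the total number of IMFs.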
CEEMDAN is used for data decomposition and denoising. Applying CEEMDAN to every model is crucial (Yan et al. 2023). Figure 3 shows the workflow of the CEEMDAN-ML model.
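The hand-off from decomposition to the ML model can be sketched as below. The helper name build_hybrid_inputs is hypothetical, and the five components are synthetic stand-ins; in practice they would come from a CEEMDAN implementation (for example the PyEMD package, an assumption, since the source does not name the library used).

```python
import numpy as np

def build_hybrid_inputs(imfs, stream_flow):
    """Stack IMF components and stream flow into a (n_samples, n_imfs + 1) input matrix."""
    imfs = np.asarray(imfs, dtype=float)
    stream_flow = np.asarray(stream_flow, dtype=float)
    if imfs.shape[1] != stream_flow.shape[0]:
        raise ValueError("IMFs and stream flow must cover the same time steps")
    return np.column_stack([imfs.T, stream_flow])

# Synthetic stand-ins for five components: four oscillations of decreasing
# frequency plus a slow trend (illustrative only, not CEEMDAN output).
t = np.arange(96)
imfs = np.vstack(
    [np.sin(2 * np.pi * t / period) for period in (2.5, 6.0, 12.0, 36.0)] + [0.01 * t]
)
water_level = imfs.sum(axis=0)   # the components add back up to the signal
stream_flow = 10.0 + 2.0 * np.sin(2 * np.pi * t / 12.0)

X_hybrid = build_hybrid_inputs(imfs, stream_flow)
```

Each row of X_hybrid then holds the component values and the stream flow for one month, and the observed water level for that month is the regression target.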
Model comparison statistical analysis
In this research, model performance was assessed using three statistical metrics: MSE, RMSE, and R2. Using several metrics allowed prediction accuracy to be evaluated from multiple perspectives. The metrics were calculated with the equations listed below.
Here, WL and PR are the observed and predicted water levels, and \(WL_{i}\) and \(PR_{i}\) are the ith observed and predicted values. Error values near 0 and R2 values close to 1 indicate the highest prediction accuracy.
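A small worked example may help fix the definitions. MSE and RMSE follow their usual forms; for R2 the sketch assumes the coefficient of determination, 1 − SS_res / SS_tot, which is one common definition and may differ from the one used by the authors. The sample values are hypothetical.

```python
import math

def mse(observed, predicted):
    """Mean squared error between observed and predicted values."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)

def rmse(observed, predicted):
    """Root mean squared error: the square root of the MSE."""
    return math.sqrt(mse(observed, predicted))

def r2(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

wl = [2.0, 2.5, 3.0, 3.5]   # hypothetical observed water levels
pr = [2.1, 2.4, 3.2, 3.4]   # hypothetical predicted water levels

print(round(mse(wl, pr), 4), round(rmse(wl, pr), 4), round(r2(wl, pr), 3))
# prints: 0.0175 0.1323 0.944
```

Since RMSE is the square root of MSE, the two values reported for any one model should satisfy RMSE² ≈ MSE up to rounding.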
Results and discussion
In the initial phase of the adopted methodology, the original river water level data were decomposed using CEEMDAN. Figure 1 illustrates the results of this decomposition. As depicted in Fig. 1, the CEEMDAN process ordered the IMFs from the highest frequency to the lowest. The first and second components (IMF 1 and IMF 2) exhibited a highly irregular pattern, while IMF 3 and IMF 4 displayed periodic and more consistent patterns. The final component (IMF 5) depicted the overall trend of the data. All five IMFs were used to predict monthly river water levels in the decomposition modeling approach.
ACF and PACF analysis for river water level
The ACF and PACF graphs are presented in Fig. 4. Seasonal patterns are absent in the data for most of the years. The ACF showed a significant spike at lag 1, with no significant residual correlation after lag 1. However, the PACF showed significant spikes at lags 1, 14, and 16. Hence, the current research considered three lags as input variables in the first combination of ML modeling, while the second combination was based on the IMF input variables.
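For readers who want to reproduce this kind of lag screening, the sketch below computes the sample ACF and the PACF (via the Durbin-Levinson recursion) and flags lags outside an approximate 95% bound of ±1.96/√n. It runs on a synthetic AR(1) series, not on the study data, and the bound is a common rule of thumb assumed here.

```python
import numpy as np

def acf(x, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)
    return np.array([np.dot(x[: len(x) - k], x[k:]) / denom for k in range(max_lag + 1)])

def pacf(x, max_lag):
    """Partial autocorrelation for lags 0..max_lag via the Durbin-Levinson recursion."""
    rho = acf(x, max_lag)
    out = np.zeros(max_lag + 1)
    out[0] = 1.0
    phi_prev = np.array([])  # AR coefficients of order k - 1
    for k in range(1, max_lag + 1):
        num = rho[k] - np.dot(phi_prev, rho[k - 1:0:-1])
        den = 1.0 - np.dot(phi_prev, rho[1:k])
        phi_kk = num / den
        # Update the lower-order coefficients, then append the new one.
        phi_prev = np.append(phi_prev - phi_kk * phi_prev[::-1], phi_kk)
        out[k] = phi_kk
    return out

# Synthetic AR(1) series with coefficient 0.7 (illustrative only).
rng = np.random.default_rng(0)
noise = rng.normal(size=500)
series = np.zeros(500)
for t in range(1, 500):
    series[t] = 0.7 * series[t - 1] + noise[t]

r = acf(series, max_lag=16)
p = pacf(series, max_lag=16)
bound = 1.96 / np.sqrt(len(series))   # approximate 95% significance bound
significant_lags = [k for k in range(1, 17) if abs(p[k]) > bound]
```

For an AR(1) process the PACF is expected to cut off after lag 1, so lag 1 should appear in significant_lags while higher lags mostly fall inside the bound.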
Fig. 4
ACF and PACF lags of river water level.
Trend analysis based on the seasonal decompose multiplicative model
This section presents the outcomes of the stream flow and water level trend analysis based on the multiplicative seasonal decomposition method. Figure 5 provides valuable insights into the seasonal and trend variability of stream flow and water level across the years. Both data series fluctuate strongly, with sharp spikes mostly during monsoon periods. The trend analysis helps in understanding the input and output datasets of the basin before they are used in ML modeling.
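The multiplicative model assumes that the series is the product of trend, seasonal, and residual components. The sketch below is a simplified classical decomposition written in NumPy to show the mechanics; it is not the library routine used in the study, and it assumes an even period (12 for monthly data) and strictly positive values. The input series is synthetic.

```python
import numpy as np

def multiplicative_decompose(x, period=12):
    """Classical multiplicative decomposition: x = trend * seasonal * residual.

    Assumes an even period (12 for monthly data) and strictly positive values.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    half = period // 2
    # Centered moving average for an even period: half weights at both ends.
    weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
    trend = np.full(n, np.nan)
    trend[half:n - half] = np.convolve(x, weights, mode="valid")
    # Seasonal index: average detrended ratio for each position in the cycle.
    ratio = x / trend
    index = np.array([np.nanmean(ratio[m::period]) for m in range(period)])
    index /= index.mean()  # normalise so the indices average to 1
    seasonal = np.tile(index, n // period + 1)[:n]
    residual = x / (trend * seasonal)
    return trend, seasonal, residual

# Synthetic monthly series (illustrative only): rising trend times a 12-month cycle.
months = np.arange(96)
true_seasonal = 1.0 + 0.3 * np.sin(2 * np.pi * months / 12)
series = (2.0 + 0.01 * months) * true_seasonal

trend, seasonal, residual = multiplicative_decompose(series, period=12)
```

The first and last six trend values are undefined because the centered window does not fit there; residual values close to 1 indicate that trend and seasonality explain the series well.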
Fig. 5
Seasonal and trend components of stream flow and water level.
ML & decomposition ensemble model development and performance evaluation
The ML and decomposition models for river water level prediction were developed with a univariate modeling approach, in which past observed values of river water level were used as input variables. In this research, two types of input variables were used. In the first combination, three lags (lags 1 to 3) of the time series were selected on the basis of the ACF and PACF. In the second combination, the five IMFs and the original stream flow were used as input variables in the CEEMDAN-based decomposition modeling. The performance metrics of both input combinations were compared to determine which model better predicts river water level in the study area. Conditions at this station vary from year to year under climate change because of river features, drought, air pollution, and weather phenomena.
The present study proposes several ML and decomposition models and compares their accuracy with popular ML and advanced models from the literature for river water level and stream flow prediction. The performance of the first and second combination models was measured using the well-known statistical metrics R2, RMSE, and MSE. For the selected station, river water level was predicted with the two input–output (I/O) combinations over the training and testing phases; the corresponding input–output and model structures are shown in Tables 2 and 3, respectively. Based on the ML modeling outcomes for the training and testing phases, the second input–output combination showed higher prediction accuracy than the first (Table 4). Among the four ML models (SVM-Linear, SVM-RBF, RF, and Random Subspace) applied with the two input–output combinations, the SVM-Linear model in the first combination, and the CEEMDAN-SVM (Linear) and CEEMDAN-RF models in the second combination, showed better prediction performance than the other models (Table 4).
Table 4 Results of prediction of water level for Sg Muar at Buloh Kasap Johor station.
The SVM-Linear model was the best model in the first combination during the testing phase, with R2 = 0.93, RMSE = 0.17, and MSE = 0.03. The RF model showed better results than the other models in the training phase, with R2 = 0.97, RMSE = 0.11, and MSE = 0.015 for the first input–output combination. During the testing phase, the suggested models showed similar results, with slightly better accuracy from the RF model (R2 = 0.97, RMSE = 0.11, and MSE = 0.015) in the first input–output combination. Line and scatter plots are shown in Fig. 6 (a to d) and Fig. 7 (a to d) for a closer look at the performance of the ML models at this station. These plots give a more intuitive view of the predicted data against the observed river water level datasets.
Fig. 6
Comparison Line plots of predicted and observed river water level for test and train period for (a) SVM-Linear, (b) SVM-RBF, (c) RF, (d) Random Subspace.
Fig. 7
Scatter plots of forecast and observed river water level for test and train period for (a) SVM-Linear, (b) SVM-RBF, (c) RF, (d) Random subspace.
The prediction results of the CEEMDAN-based models and their comparison with the standalone models are displayed in Table 4, which indicates that the CEEMDAN method improved model accuracy. In the second combination, the CEEMDAN-RF model performed better than the other seven models during the training phase, with R2 = 0.98, RMSE = 0.08, and MSE = 0.01, compared with SVM-Linear (R2 = 0.84, RMSE = 0.27, MSE = 0.07), SVM-RBF (R2 = 0.87, RMSE = 0.24, MSE = 0.06), RF (R2 = 0.97, RMSE = 0.11, MSE = 0.015), Random Subspace (R2 = 0.86, RMSE = 0.25, MSE = 0.06), CEEMDAN-SVM-Linear (R2 = 0.87, RMSE = 0.26, MSE = 0.07), CEEMDAN-SVM-RBF (R2 = 0.91, RMSE = 0.23, MSE = 0.01), and CEEMDAN-RS (R2 = 0.88, RMSE = 0.24, MSE = 0.06). The CEEMDAN-RF model also outperformed all other models during the testing phase, with the highest R2 (0.94) and the lowest RMSE (0.13) and MSE (0.02). Among all the developed models, the RS models showed lower performance during the training and testing phases, which indicates that they did not capture the pattern and behavior of the data and are therefore unsuitable for predicting river water level. Decomposing the river water level series improved the performance of the models. Combining CEEMDAN with RF significantly improved R2, RMSE, and MSE in both the training and testing phases compared with the standalone RF model. The other CEEMDAN-based hybrid models likewise performed better in training and testing than their standalone counterparts.
Among all the standalone models, the RF model performed best for river water level prediction. The line and scatter plots of the CEEMDAN hybrid models are presented in Fig. 8 (a to d) and Fig. 9 (a to d). The observed and predicted river water levels of the CEEMDAN-RF model are shown in Fig. 8c and Fig. 9c. Fig. 9 (a to d) plots the predicted river water levels of the four CEEMDAN models against the best-fit line, which further confirms the superiority of the CEEMDAN-RF model in predicting river water levels. The models of the first and second combinations were also examined using violin plots, which make it easier to see which model performs better. Fig. 10a and Fig. 10b show the observed and predicted values for the eight models under the first and second combinations of input and output variables. Finally, the violin plots show that the predictions of the SVM-Linear and CEEMDAN-RF models are the most accurate during the training and testing phases.
Fig. 8
Comparison line plots of predicted and observed river water level for test and train period for (a) CEEMDAN-SVM-Linear, (b) CEEMDAN-SVM-RBF, (c) CEEMDAN-RF, (d) CEEMDAN-Random subspace.
Fig. 9
Scatter plots of forecast and observed river water level for test and train period for (a) CEEMDAN-SVM-Linear, (b) CEEMDAN-SVM-RBF, (c) CEEMDAN-RF, (d) CEEMDAN-Random subspace.
Fig. 10
Violin plots displaying the performance of ML models for (a) First I/O combination, and (b) Second I/O combination.
The best model was selected on the basis of several statistical performance metrics, which is the standard procedure for model selection. On these metrics, CEEMDAN-RF significantly outperformed the seven other models: it reached R2 = 0.98 and R2 = 0.94, together with the lowest RMSE and MSE values, during the training and testing phases, respectively. These results indicate higher predictive accuracy and generalization ability. In contrast, standalone ML models such as SVM and Random Subspace showed comparatively lower R2 and larger errors, mainly during the testing phase. The advantage of the CEEMDAN-RF model lies in its hybrid structure: CEEMDAN decomposes the complex, non-stationary hydrological time series into IMFs, each capturing a different temporal pattern, and feeding these components into the RF model improves learning by reducing noise and sharpening the signal.
Compared with the other CEEMDAN-based models, CEEMDAN-RF performed best because of the ensemble strength of the RF model, which handles non-linearity efficiently and limits over-fitting by averaging across decision trees. The real-world implications for river basin management and hydrological forecasting are important in areas of heavy rainfall. The high-accuracy CEEMDAN-RF hybrid model offers a more reliable method for flood forecasting, river water level prediction, water resources development, and reservoir management and planning. Its robustness can support policymaking by governments and experts in critical situations of climate variability and extreme events. Hence, adopting hybrid models can lead to more adaptive, data-driven, and sustainable water development and management policies.
Fig. 11 (a to d) uses Taylor diagrams to assess the performance of the river water level prediction models SVM-Linear, SVM-RBF, RF, and Random Subspace and of the hybrid models CEEMDAN-SVM-Linear, CEEMDAN-SVM-RBF, CEEMDAN-RF, and CEEMDAN-Random Subspace. The diagrams show model performance during training and testing in terms of R2, RMSE, and standard deviation. In the testing phase, the Taylor diagrams identify SVM-Linear and CEEMDAN-RF as the best models in the first and second input combinations, respectively. The performance of the ML and CEEMDAN models under the different input combinations is shown as radar plots in Fig. 12, which help to compare model accuracy and to judge which model is suitable for prediction. Fig. 13 shows heat maps of the training and testing accuracy of the models in both combinations.
Fig. 11
Visualization of Taylor diagrams of the models: (a) Training data of first combination, (b) Testing data of first combination, (c) Training data of second combination, (d) Testing data of second combination.
Fig. 12
Visualization of radar plots: (a) Training data of first combination, (b) Testing data of first combination, (c) Training data of second combination, (d) Testing data of second combination.
Fig. 13
Heat maps of the ML models: (a) Training, (b) Testing.
Future work and limitations of CEEMDAN and ML models for river water level prediction
River water level is strongly affected by external factors such as stream flow, rainfall, and evaporation. Long historical datasets are important for ML-based prediction of river water level70. CEEMDAN was applied to decompose the observed data into several IMF components70. The adopted methodology addresses the limitations of traditional EEMD models and mitigates interference problems. CEEMDAN-based hybrid models have shown a robust ability to predict river water levels accurately, which matters because flood risk is a serious concern for water resource management, early warning systems, and sustainable surface water resources development71. However, the CEEMDAN model, while effective in decomposing complex and non-stationary time series into IMFs, is computationally intensive and may produce low-variance IMFs that contribute little predictive value. Without careful selection and preprocessing, these IMFs can introduce noise and increase model complexity. Standalone ML models such as SVM-Linear, SVM-RBF, RF, and RS are also constrained by the need for good-quality, well-organized datasets. Their accuracy suffers from missing data, errors, and outliers, so the first priority before applying ML models is to check that clean, high-resolution datasets are available, which is frequently not feasible in river monitoring systems plagued by missing values, station or instrument faults, and irregular data collection. Additionally, their purely data-driven nature limits physical interpretability and transferability across different river basins. These models also struggle during extreme events, for example heavy rainfall, and under man-made influences such as dam operations and urban runoff, which are not properly captured in historical datasets.
Therefore, while hybrid methods such as CEEMDAN-SVM and CEEMDAN-RF improve river water level prediction, their practical deployment requires long time series, deep learning models, rigorous data preprocessing, advanced model tuning, and integration with climate and hydrological domain knowledge for operational reliability. In future work, we aim to incorporate larger datasets related to the river basin system and longer river water level records, and to apply deep learning models, in order to improve both the accuracy and the interpretability of river water level forecasting for operational use. This is particularly relevant because the study area regularly faces flood risk, heavy rainfall, and sudden rises in river water level.