To clarify the proposed methodology, the study’s overall workflow is shown in Fig. 2. The process starts with data collection and preprocessing. Next, the dataset is divided into three scenarios: M1, M2, and M3. A baseline LSTM model is developed first. Then, its hyperparameters are optimized using various metaheuristic algorithms: APO, DBO, GWO, and HHO. Using these optimized parameters, hybrid models are created and trained. Finally, the models’ performance is assessed using statistical metrics, including RMSE, MAE, R², and NSE. A thorough comparison is carried out.

Workflow of the proposed methodology.
Data preprocessing
Before model development, the dataset went through several preprocessing steps to ensure quality and consistency. First, the raw meteorological variables, including minimum temperature (Tmin), maximum temperature (Tmax), extraterrestrial radiation (Ra), and pan evaporation (Epan), were checked for completeness and consistency. Any irregular or inconsistent records were addressed to prevent bias during model training. To improve numerical stability and the learning efficiency of the LSTM models, all input variables were normalized to a common scale. This step is important in deep learning because it stops larger variables from dominating the training process.
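As an illustration of the normalization step described above, the following minimal sketch rescales each variable to a common (0, 1) range using min-max scaling; the use of pandas, the function name, and the column names are assumptions for the example rather than a description of the original implementation.

```python
import pandas as pd

def minmax_normalize(df: pd.DataFrame, feature_range=(0.0, 1.0)) -> pd.DataFrame:
    """Rescale every column to a common range so that no variable
    dominates LSTM training merely because of its magnitude."""
    lo, hi = feature_range
    scaled = (df - df.min()) / (df.max() - df.min())  # per-column min-max scaling
    return scaled * (hi - lo) + lo

# Hypothetical usage with the variables described above:
# data = pd.read_csv("station_data.csv")  # columns: Tmin, Tmax, Ra, Epan
# data_scaled = minmax_normalize(data[["Tmin", "Tmax", "Ra", "Epan"]])
```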
Additionally, various input combinations were created using different sets of climatic variables, such as Tmin, Tmax, and Ra, to study their effect on model performance. These combinations aimed to evaluate how well the models perform with limited data. Finally, the preprocessed dataset was divided into three data-splitting scenarios (M1, M2, and M3) to test the model’s performance under different training and testing conditions.
Selection of input variables
The choice of input variables depended on data availability and their relevance to evaporation processes. Minimum temperature (Tmin) and maximum temperature (Tmax) were chosen as main inputs because they directly affect vapor pressure deficit and thermal conditions, which drive evaporation. Extraterrestrial radiation (Ra) was added as another variable to represent the theoretical solar energy incident at the top of the atmosphere.
To evaluate the effect of input structure with limited data, we created several input scenarios using different combinations of these variables, namely Tmin and Tmax; Tmin and Ra; Tmax and Ra; and Tmin, Tmax, and Ra. This method helps assess model performance and the contribution of each variable to evaporation prediction. This approach is vital in data-scarce environments, where identifying the most informative, non-redundant variables is crucial for building strong, dependable, and adaptable models.
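For clarity, the sketch below shows how these four input combinations could be assembled programmatically; the scenario labels and column names are illustrative assumptions, not identifiers taken from the study.

```python
import pandas as pd

# Illustrative labels for the four input combinations described above;
# the target variable is pan evaporation (Epan).
INPUT_SCENARIOS = {
    "C1": ["Tmin", "Tmax"],
    "C2": ["Tmin", "Ra"],
    "C3": ["Tmax", "Ra"],
    "C4": ["Tmin", "Tmax", "Ra"],
}

def build_scenario(data: pd.DataFrame, scenario: str):
    """Return the predictor matrix X and target vector y for one scenario."""
    X = data[INPUT_SCENARIOS[scenario]].to_numpy()
    y = data["Epan"].to_numpy()
    return X, y
```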
Long short term memory
The LSTM represents an advanced variant of the RNN, widely used for modeling sequential and temporal data. LSTMs were first introduced by Hochreiter and Schmidhuber (1997) as a remedy for the vanishing gradient problem33. However, LSTM performance is highly sensitive to hyperparameter selection, including the learning rate, number of hidden units, and dropout rate. Recent studies have demonstrated that systematic hyperparameter optimization can significantly enhance LSTM predictive capabilities across diverse applications34. The LSTM has become a powerful tool used by researchers in various time series prediction applications, owing to its capability of preserving important information over extended time spans35,36,37,38,39. The LSTM introduces a cell state, which serves as a memory unit that captures and stores relevant information over long time intervals. This cell state is regulated by three gating mechanisms: the input, forget, and output gates.
The forget gate regulates the degree to which components of the previous cell state vector (Ct−1) are discarded.
$$f_{t} = \sigma\left(W_{f} x_{t} + U_{f} h_{t-1} + b_{f}\right)$$
(1)
Where ft is the resultant vector, whose values lie inside the interval (0,1), σ is the sigmoid function, Wf and Uf are two trainable weight matrices, and bf is the bias vector. Next, in the input gate, a candidate update vector for the cell state is computed from the current input (xt) and the previous hidden state (ht−1), as given by Eq. (2).
$$\tilde{c}_{t} = \tanh\left(W_{\tilde{c}} x_{t} + U_{\tilde{c}} h_{t-1} + b_{\tilde{c}}\right)$$
(2)
Where \(\tilde{c}_{t}\) is a vector whose values lie in the interval (−1, 1), and \(W_{\tilde{c}}\), \(U_{\tilde{c}}\) and \(b_{\tilde{c}}\) are another set of weight matrices and a bias vector. In this stage, the input gate is also computed using the following equation:
$$i_{t} = \sigma\left(W_{i} x_{t} + U_{i} h_{t-1} + b_{i}\right)$$
(3)
Where \(i_{t}\) is a vector in the interval (0,1), and Wi, Ui and bi are the corresponding weight matrices and bias vector. Based on the results of the above three equations, the cell state is updated following Eq. (4):
$$C_{t} = f_{t} \odot C_{t-1} + i_{t} \odot \tilde{c}_{t}$$
(4)
Where ⊙ denotes element-wise multiplication.
According to Eq. (4), the information in Ct−1 is either forgotten (ft = 0) or retained (ft = 1). The same applies to the candidate state \(\tilde{c}_{t}\): it is discarded when \(i_{t}\) = 0 and added to the cell state when \(i_{t}\) = 1.
Finally, the output gate regulates the flow of information from the cell state to the new hidden state using the following equation:
$$O_{t} = \sigma\left(W_{o} x_{t} + U_{o} h_{t-1} + b_{o}\right)$$
(5)
Where Wo and Uo are the weight matrices and bo is the bias vector of the output gate.
The new hidden state is then computed from Eq. (4) and Eq. (5) as follows:
$$h_{t} = {\text{tanh}}\left( {C_{t} } \right) \odot O_{t}$$
(6)
Figure 3 illustrates the architecture of the network. These gating mechanisms enable LSTMs to selectively remember and forget information, allowing them to capture both short- and long-term dependencies within the data. LSTMs have several advantages over traditional RNN architectures. Firstly, they can handle long sequences without suffering from the vanishing gradient problem, which makes them particularly effective for tasks that involve long-term dependencies. Secondly, LSTMs can learn, recognize, and remember important patterns in the data, which makes them useful for tasks that require understanding and reasoning. The interaction between the gating components and the memory cell enables the network to dynamically regulate information flow, retaining relevant inputs while discarding less useful ones, thereby ensuring strong adaptability across diverse sequence modeling applications.
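To make the gate interactions concrete, the following minimal NumPy sketch implements a single forward step of the LSTM cell exactly as written in Eqs. (1)-(6); the parameter container and variable names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x_t, h_prev, c_prev, p):
    """One forward step of an LSTM cell following Eqs. (1)-(6).
    p holds the weight matrices W_*, U_* and bias vectors b_* of the
    forget (f), candidate (c), input (i) and output (o) gates."""
    f_t = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["b_f"])      # Eq. (1) forget gate
    c_tilde = np.tanh(p["W_c"] @ x_t + p["U_c"] @ h_prev + p["b_c"])  # Eq. (2) candidate state
    i_t = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev + p["b_i"])      # Eq. (3) input gate
    c_t = f_t * c_prev + i_t * c_tilde                                # Eq. (4) cell state update
    o_t = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev + p["b_o"])      # Eq. (5) output gate
    h_t = np.tanh(c_t) * o_t                                          # Eq. (6) new hidden state
    return h_t, c_t
```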

The architecture of LSTM.
Optimization algorithms
One of the challenges in training LSTM models is finding the optimal set of hyperparameters that leads to better performance. This is where optimization algorithms come into play. In this study, several optimization algorithms are used to enhance LSTM performance, namely Harris Hawks Optimization (HHO), Grey Wolf Optimization (GWO), the Artificial Protozoa Optimizer (APO), and the Dung Beetle Optimizer (DBO).
Harris Hawks Optimization (HHO)
The HHO algorithm, introduced by Heidari et al. in 2019, is a nature-inspired metaheuristic derived from the cooperative hunting strategies of Harris’s hawks, a raptor species native to the southwestern United States. The algorithm emulates the collective and dynamic hunting tactics of the hawks to explore and exploit the search space efficiently when solving complex optimization problems. By leveraging the collective intelligence of the flock, the algorithm aims to find the best solution to the optimization problem.
Similar to other population-based methods such as PSO and GWO, HHO relies on the collaborative intelligence of multiple search agents to locate near-optimal solutions. This algorithm has been successfully employed in diverse areas, including engineering design41, classification42 and ML optimization43,44. Owing to its hierarchical design and adaptive switching between exploration and exploitation, HHO has demonstrated superior convergence speed and robustness compared to several conventional algorithms. Its conceptual workflow is illustrated in Fig. 4, and further algorithmic details can be found in40.

The exploration and exploitation of the HHO algorithm7.
Grey Wolf Optimization
The GWO, introduced by Mirjalili et al. (2014)45, is another bio-inspired optimization approach modeled after the leadership hierarchy and cooperative hunting behavior of grey wolves. GWO’s appeal lies in its conceptual simplicity, few control parameters, and high search efficiency, which have led to successful applications in image processing46,47 and forecasting tasks48,49.
In this algorithm, each candidate solution—referred to as a wolf—represents a potential position in the search landscape. The hierarchy consists of four ranks: alpha (α), beta (β), delta (δ) and omega (ω) wolves as shown in Fig. 5a and b. These wolves collaborate in a hierarchical manner to hunt their prey effectively. The main stages of the algorithm are summarized as follows:

a Hierarchy of grey wolves, b location update of ω wolves according to other wolves (α, β and δ)17.
- Initialization: Initialize the positions and fitness of the wolves randomly within the search space.
- Prey search: Each wolf searches for prey by updating its position based on its current position and the positions of the other wolves. The updated position is determined by the formula:
$$X(t+1) = X_{P}(t) - A \cdot D$$
(7)
Where t refers to the current iteration, X(t+1) is the new position of the wolf, XP(t) is the position of the prey, and A is a coefficient vector computed as:
$$A = 2a \cdot r_{1} - a$$
(8)
$$D = \left| C \cdot X_{P}(t) - X(t) \right|$$
(9)
with the coefficient vector C given by:
$$C = 2 \cdot r_{2}$$
(10)
Where a decreases linearly from 2 to 0 over the iterations (i.e., it ranges in (0–2)) and r1 and r2 are random vectors in (0–1); a minimal code sketch of this position update is given after the list.
- Update the hierarchy: After the prey search, the hierarchy is updated based on the fitness values of the wolves. The alpha, beta, and delta wolves are updated based on their fitness values, while the omega wolves are updated according to the other three wolves.
- Boundary handling: If a wolf moves outside the search space, its position is adjusted to the boundary.
- Fitness evaluation: Calculate the fitness values of the wolves based on the objective function of the optimization problem.
- Termination: The process continues iteratively through Steps 2–5 until a predefined stopping condition is fulfilled, such as achieving the maximum iteration count or attaining a satisfactory fitness value.
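As a complement to the steps above, the following sketch implements the position update of Eqs. (7)-(10) in the standard GWO fashion, where each wolf moves relative to the alpha, beta and delta leaders; the function and variable names are illustrative.

```python
import numpy as np

def gwo_position_update(X, X_alpha, X_beta, X_delta, a, rng):
    """Update one wolf's position by applying Eqs. (7)-(10) with respect to
    the alpha, beta and delta wolves and averaging the three guided moves."""
    new_pos = np.zeros_like(X)
    for leader in (X_alpha, X_beta, X_delta):
        r1, r2 = rng.random(X.shape), rng.random(X.shape)
        A = 2.0 * a * r1 - a           # Eq. (8): coefficient vector A
        C = 2.0 * r2                   # Eq. (10): coefficient vector C
        D = np.abs(C * leader - X)     # Eq. (9): distance to the leader (prey)
        new_pos += leader - A * D      # Eq. (7): move relative to the leader
    return new_pos / 3.0

# The control parameter a decreases linearly from 2 to 0, e.g.
# a = 2.0 * (1.0 - t / max_iter) at iteration t.
```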
Artificial Protozoa Optimizer (APO)
The APO is a recent bio-inspired metaheuristic introduced by Wang and collaborators in 202450. The algorithm draws on the survival strategies of euglena protozoa and simulates key behaviors, namely (i) foraging, (ii) dormancy, and (iii) reproduction, to maintain an effective balance between exploration and exploitation during optimization. In general, protozoa, as simple unicellular organisms, exhibit autotrophic (photosynthesis-based) and heterotrophic (absorption-based) foraging, dormancy during stress, and asexual reproduction through binary fission50,51. In this study, the APO is used for hyperparameter optimization, refining deep learning model hyperparameters (here those of the LSTM) such as the learning rate, number of hidden units, and dropout rate so that the model better captures complex nonlinear dependencies. The key to the APO lies in its mathematical abstraction of protozoan movement. Foraging is divided into autotrophic and heterotrophic modes, with the position update of the ith protozoan in autotrophic mode given by:
$$X_{i}^{new} = X_{i} + f \cdot \left( X_{j} - X_{i} + \frac{1}{np} \sum_{k=1}^{np} w_{a} \cdot \left( X_{k-} - X_{k+} \right) \right) \odot M_{f}$$
(11)
where f is the foraging factor, np is the number of neighbor pairs, wa is the autotrophic weight, and Mf is a mapping vector. Heterotrophic foraging refines the search locally by tracking the closest nutrient-rich areas, dormancy replaces stressed protozoa with new random solutions to promote exploration, and reproduction simulates binary fission with controlled perturbations that maintain population diversity while supporting convergence. So far, APO has shown competitive capabilities in benchmark tests, including CEC2022, and in real-world applications, where it achieved greater accuracy and solution stability than GWO and HHO25. By adaptively balancing global search (through dormancy and autotrophic foraging) with local refinement (through heterotrophic foraging and reproduction), APO can effectively tune ML model hyperparameters even with small datasets.
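As a small illustration of Eq. (11), the following sketch computes the autotrophic foraging update for one protozoan; the argument names are illustrative and the neighbor pairs are assumed to be supplied by the caller.

```python
import numpy as np

def apo_autotrophic_update(X_i, X_j, neighbor_pairs, f, w_a, M_f):
    """Autotrophic foraging update of one protozoan following Eq. (11).
    X_i            : position of the i-th protozoan
    X_j            : position of a randomly selected protozoan
    neighbor_pairs : list of (X_k_minus, X_k_plus) neighbor position pairs
    f, w_a         : foraging factor and autotrophic weight
    M_f            : binary mapping vector selecting the dimensions to move"""
    drift = sum(w_a * (x_m - x_p) for x_m, x_p in neighbor_pairs) / len(neighbor_pairs)
    return X_i + f * (X_j - X_i + drift) * M_f
```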
Dung Beetle Optimizer (DBO)
The DBO was first introduced by Xue and Shen in 202326. The algorithm takes inspiration from the broad range of activities performed by dung beetles, comprising five stages: (i) ball-rolling, (ii) dancing, (iii) foraging, (iv) stealing, and (v) reproduction. By design, the DBO combines exploration and exploitation procedures; it mimics the dung beetles’ ecological decomposer role by balancing global exploration (rolling and dancing) with local exploitation (foraging and stealing). In this study, the DBO is used to tune hyperparameters, e.g., the learning rate and the number of hidden layers of the LSTM, for monthly pan evaporation forecasting, improving the model’s ability to learn the non-linear relationships between the hydrological target and the temperature and solar radiation inputs52,53. The DBO assigns the beetle population to different roles: rolling beetles for navigation, brood balls for reproduction, small beetles for foraging, and thieves for competition. These roles help the DBO improve convergence speed and solution accuracy. The mathematical framework of the DBO implements these behaviors by iteratively updating positions. In the case of ball-rolling, which imitates navigation using celestial cues, the position of the ith beetle is updated as:
$$x_{i}(t+1) = x_{i}(t) + \alpha \cdot k \cdot x_{i}(t-1) + b \cdot \Delta x$$
(12)
where
$$\Delta x = \left| x_{i}(t) - X^{w} \right|$$
(13)
Here, α equals 1 or −1 depending on whether the beetle deviates from its rolling direction, k ∈ (0, 0.2] is a deflection coefficient, b is a constant in (0, 1), and \(X^{w}\) denotes the global worst position, so that Δx simulates changes in light intensity.
Reproduction defines spawning areas around the local best positions X*, foraging refines the search around the global best Xb, and stealing updates the thieves’ positions so that they compete for resources. Through these mechanisms, the algorithm keeps adapting throughout the optimization process. In the final stage of the DBO, after all positions are updated and their fitness values evaluated, the algorithm updates the global best solution Xb and checks the termination criteria.
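A minimal sketch of the ball-rolling update of Eqs. (12)-(13) is shown below; the default parameter values are illustrative assumptions rather than the settings used in this study.

```python
import numpy as np

def dbo_ball_rolling(x_t, x_prev, X_worst, k=0.1, b=0.3, alpha=1):
    """Ball-rolling update of one dung beetle following Eqs. (12)-(13).
    x_t, x_prev : positions at iterations t and t-1
    X_worst     : global worst position (the simulated light-intensity cue)
    k           : deflection coefficient, assumed in (0, 0.2]
    b           : constant assumed in (0, 1)
    alpha       : 1 to keep the rolling direction, -1 to deviate from it"""
    delta_x = np.abs(x_t - X_worst)                  # Eq. (13)
    return x_t + alpha * k * x_prev + b * delta_x    # Eq. (12)
```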
Deep learning integration with optimization algorithms
LSTM-GWO
The combination of LSTM and GWO, known as LSTM-GWO, has gained attention in recent years for its ability to enhance the performance of LSTM models. By incorporating the GWO algorithm into the training process of LSTM, it is possible to improve the convergence speed and the quality of the learned representations.
The core concept of the LSTM-GWO is to employ the GWO for tuning the weights and biases of the LSTM model during the training phase54. The optimization process begins with an initial population of grey wolves, each representing a potential candidate solution. Through iterative position updates inspired by the wolves’ cooperative hunting mechanism, the algorithm explores the search domain to identify improved parameter configurations.
In this hybrid framework, a fitness function is formulated according to the validation performance of the LSTM model. The wolves adjust their locations within the search space based on their fitness scores, progressively converging toward the optimal set of weights and biases that minimize the prediction error between simulated and observed values.
The advantages of using LSTM-GWO include improved convergence speed, enhanced generalization ability, and effective handling of complex, high-dimensional datasets. By integrating the GWO algorithm, the model mitigates common drawbacks of conventional gradient-based optimizers, particularly issues related to local minima entrapment and sensitivity to initialization. The overall workflow of the hybrid architecture is illustrated in Fig. 6. The computational workflow of the hybrid LSTM-GWO model is given in the supplementary materials (Table S1).
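To make the fitness formulation concrete, the sketch below scores one candidate solution by the validation RMSE of the LSTM it parameterizes; apply_candidate_to_lstm is a hypothetical placeholder for the routine that builds and evaluates the model, not part of the original implementation.

```python
import numpy as np

def validation_rmse(y_obs, y_sim):
    """Prediction error between observed and simulated values."""
    residuals = np.asarray(y_obs, dtype=float) - np.asarray(y_sim, dtype=float)
    return float(np.sqrt(np.mean(residuals ** 2)))

def wolf_fitness(candidate, apply_candidate_to_lstm, y_obs):
    """Fitness of one grey wolf: apply its candidate solution to the LSTM
    (placeholder routine) and return the validation RMSE to be minimized."""
    y_sim = apply_candidate_to_lstm(candidate)
    return validation_rmse(y_obs, y_sim)
```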

The flowchart of LSTM—GWO21.
LSTM-HHO
LSTM-HHO, an abbreviation for Long Short Term Memory with Harris Hawks Optimizer, is a hybrid deep learning model that combines the power of LSTM and the optimization technique of Harris Hawks55. This unique combination aims to enhance the performance of LSTM models by leveraging the benefits of the Harris Hawks optimization algorithm.
The main idea behind LSTM-HHO is to improve the training efficiency and convergence speed of LSTM models by employing the Harris Hawks optimization algorithm. This hybrid model not only benefits from the powerful memory retention capabilities of LSTM but also takes advantage of the optimization capabilities of the Harris Hawks optimizer.
The training process of LSTM-HHO involves two main steps: the forward pass and the backward pass. During the forward pass, the input data is fed into the LSTM layers, and the model makes predictions based on the current weights and biases. The backward pass utilizes the Harris Hawks optimization algorithm to update the weights and biases of the LSTM model, aiming to minimize the loss function. Figure 7 shows the flowchart of the hybrid LSTM-HHO model. The computational workflow of the hybrid LSTM-HHO model is given in the supplementary materials (Table S2).

The flowchart of LSTM—HHO model22.
LSTM-APO
The integration of LSTM and APO (the so-called LSTM-APO developed in this research) represents a novel hybrid deep learning model. Similar to the other hybrid LSTM models applied here, the bio-inspired APO algorithm is integrated into the LSTM architecture to address the potential shortcomings of the standalone LSTM model, such as sensitivity to hyperparameter selection and potential overfitting in nonlinear hydrological processes. As previously described, the APO, inspired by protozoan behaviors like foraging, dormancy, and reproduction, provides a robust mechanism for global and local search. This allows efficient tuning of the LSTM hyperparameters, including the learning rate, number of hidden units, batch size, and dropout rate. The hybridization leverages APO’s ability to balance exploration (through autotrophic foraging and dormancy) and exploitation (via heterotrophic foraging and reproduction), resulting in improved model generalization and stability.
Figure 8a depicts the general flowchart of the LSTM-APO procedure applied in this study. As can be seen in Fig. 8a, the optimization process in LSTM-APO begins with the initialization of a population of protozoa, where each agent represents a candidate set of LSTM hyperparameters. The APO, which is embedded with the LSTM, iteratively updates these solutions by simulating protozoan survival mechanisms. Autotrophic foraging guides the population toward promising regions by modulating movement intensity (see Eq. 11), while heterotrophic foraging refines local searches around nutrient-rich areas. Dormancy replaces suboptimal solutions with random ones to escape local optima, and reproduction introduces perturbations for fine-tuning. The fitness function, typically based on metrics such as RMSE or NSE during training, evaluates each configuration, and the best hyperparameters are selected to train the LSTM model on the input variables (such as Tmin, Tmax, and Ra). The iterative process terminates after reaching a maximum number of evaluations or converging to an error threshold, yielding an optimized LSTM model for evaporation modeling. The computational workflow of the hybrid LSTM-APO model is given in the supplementary materials (Table S3).

Schematic structure of a LSTM-APO and b LSTM-DBO models developed in this study.
Long Short Term Memory with Dung Beetle Optimizer (LSTM-DBO)
The Long Short-Term Memory with Dung Beetle Optimizer (LSTM-DBO) is another integrated model developed in this research. The structure of the hybrid LSTM-DBO model proposed in this study is similar to that of the LSTM-APO model. As can be observed in Fig. 8b, the LSTM hyperparameters are adjusted by applying the dung beetle behaviors, namely ball-rolling for navigation, dancing for reorientation, foraging for resource acquisition, stealing for competition, and reproduction for population diversity. After setting up the initial population, the DBO divides it into specialized roles (rolling beetles, brood balls, small beetles, and thieves) to effectively balance global exploration and local exploitation. This integration allows the DBO to optimize the LSTM hyperparameters, including the learning rate, number of hidden units, batch size, and dropout rate, addressing challenges such as overfitting and suboptimal convergence when modeling nonlinear evaporation processes with limited climatic inputs (e.g., Tmin, Tmax, Ra).
In the LSTM-DBO framework, the optimization initializes a population of dung beetles, each encoding a candidate LSTM hyperparameter set, and iteratively updates their positions based on the behavioral simulations. Ball-rolling updates, which mimic celestial navigation, are modeled by Eq. (12). Fitness is assessed using metrics such as RMSE during training on the input variables (e.g., Tmin, Tmax, Ra). The optimization ends once the training residuals reach the desired level or the maximum number of iterations is reached. The computational workflow of the hybrid LSTM-DBO model is given in the supplementary materials (Table S4).
In this study, the optimization algorithms are employed to determine the optimal values of the LSTM hyperparameters. These algorithms search for the best combination of parameters by minimizing the objective function (RMSE). The optimized hyperparameter settings obtained using different algorithms are summarized in Table 2.
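The following sketch outlines this hyperparameter search in a generic, optimizer-agnostic form: candidate hyperparameter sets are sampled within bounds, scored by the RMSE objective, and the best configuration is retained. The bounds, the objective callable, and the random perturbation standing in for the specific GWO/HHO/APO/DBO update rules are illustrative assumptions.

```python
import numpy as np

def optimize_hyperparameters(objective, bounds, n_agents=20, max_iter=50, seed=0):
    """Generic population-based search for LSTM hyperparameters.
    objective(position) should return the RMSE of the LSTM encoded by the
    position vector; bounds is an array of (lower, upper) rows.  The random
    perturbation below stands in for the position update rules of GWO, HHO,
    APO or DBO."""
    rng = np.random.default_rng(seed)
    bounds = np.asarray(bounds, dtype=float)
    lower, upper = bounds[:, 0], bounds[:, 1]
    pop = rng.uniform(lower, upper, size=(n_agents, lower.size))
    best_pos, best_fit = None, np.inf
    for _ in range(max_iter):
        for pos in pop:
            fit = objective(pos)
            if fit < best_fit:                      # keep the lowest RMSE so far
                best_pos, best_fit = pos.copy(), fit
        step = rng.normal(scale=0.1, size=pop.shape) * (upper - lower)
        pop = np.clip(pop + step, lower, upper)     # illustrative update rule
    return best_pos, best_fit
```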
