Flood-prone area mapping using a synergistic approach with swarm intelligence and gradient boosting algorithms

Case study

Shushtar County is located in the southwest of Iran in Khuzestan Province (Fig. 4). The county lies between 48°34′ and 49°12′ east longitude and between 31°36′ and 32°8′ north latitude. The study area has an average rainfall of 294.8 mm and an average temperature of 26.8 °C. The geological structure of the region is a sequence of the Zagros mountain range, which stretches from north to southeast and forms a wide range of mountains in western Iran. The significant formations in this area include the Asmari, Gachsaran, Mishan, Aghajari, and Bakhtiari formations and Quaternary sediments. According to soil science studies, the lands located within the boundaries of Shushtar include mountains, hills, sedimentary plains, and pebble-shaped Babzani alluviums, and most of the land is arable and irrigated based on the usual soil science standards.

Methodology

The study employs a workflow comprising five primary stages (Fig. 5): (1) Gathering flood samples, determining the associated flood conditioning factors, and partitioning the dataset into training and testing subsets using a 70:30 ratio. (2) Applying the Frequency Ratio (FR) approach to determine scores for each class of factors and running a multicollinearity test to find correlated conditioning factors. (3) Creating and optimizing a CatBoost model using swarm-based metaheuristic algorithms (WOA and ZOA). (4) Constructing flood susceptibility maps (FSMs) using the three developed models: CatBoost, CatBoost-WOA, and CatBoost-ZOA. (5) Assessing the prediction capabilities of these models with diverse performance indicators.

Data

Flood dataset

Satellite imagery has been used to identify the study zone’s flood spots. Sentinel-1 images were used in the Google Earth Engine (GEE) system (https://earthengine.google.com/) to monitor floods between 2017 and 2022. To prepare the flood dataset for modeling, the spatial distribution of flood locations has been represented as individual points. In addition, an equivalent number of non-flood locations have been chosen to train the flood model. The values 1 and 0 represent flood events and non-flood events, respectively. The complete dataset has been partitioned into two segments: the training and testing sets. The training dataset comprises 70% (273) flood and non-flood points for model training, while the remaining 30% (117) points constitute the testing dataset used for model validation (Fig. 4). This 70:30 ratio is a widely accepted convention in supervised learning tasks, particularly when working with moderately sized datasets, as it offers a balanced trade-off between two objectives: (1) providing the model with a sufficient number of samples to learn complex patterns in the training phase, and (2) retaining enough independent data to ensure robust and unbiased model validation during testing.
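The 70:30 partition can be illustrated with a minimal sketch. It assumes the 390 points consist of 195 flood and 195 non-flood locations (inferred from the "equivalent number" statement) and uses a simple shuffled split; the function name, the seed, and the placeholder point identifiers are hypothetical, and the text does not say whether the split was stratified.

```python
import random

def split_points(points, train_fraction=0.7, seed=42):
    """Shuffle labelled points and split them into training and testing lists."""
    rng = random.Random(seed)
    shuffled = points[:]              # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]

# Placeholder dataset: (point id, label), with 1 = flood and 0 = non-flood.
points = [(i, 1) for i in range(195)] + [(i, 0) for i in range(195, 390)]
train, test = split_points(points)
print(len(train), len(test))
```

With 390 points, this reproduces the 273 and 117 sample counts reported above.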

Flood condition factors

Natural disaster research, including flood susceptibility modeling, considers various factors that might cause or mitigate disasters. Consequently, choosing this data is the most crucial step in developing flood susceptibility models since it will significantly influence the study’s quality and the conclusions’ correctness⁴⁶. According to flood susceptibility research, many factors impact the occurrence, development, and progression of floods. In this regard, prior research has been considered when choosing the controlling parameters. For this purpose, 13 spatial factors affecting flooding were considered in this study^3,47,48 (Fig. 6a–m).

Topographical parameters affect how water flows over the ground, accumulates, and finally drains, and they significantly impact the occurrence of floods⁴. This study extracted topographic parameters, including Topographic Wetness Index (TWI), slope, elevation, Stream Power Index (SPI), aspect, plan curvature, and profile curvature, from Shuttle Radar Topography Mission (SRTM) images with a pixel size of 30 × 30 m in GEE. Then, ArcGIS 10.8 and SAGA GIS 8.2.1 software were used to process and prepare the parameters. Land cover and the Normalized Difference Vegetation Index (NDVI) significantly affect flood susceptibility by influencing runoff production, water infiltration and storage, and surface runoff⁵⁷. These two parameters were prepared in the GEE system using Landsat-8 images from 2017 to 2022 with a pixel size of 30 m × 30 m. The Random Forest (RF) classification method was used for land cover. Rainfall is a crucial factor in flood research as it directly and indirectly impacts other flooding-related variables. Important determinants of flood frequency include the location, severity, and total amount of rainfall¹¹. The rainfall map was prepared using the average rainfall data between 2017 and 2022 at 60 rain gauge stations in Khuzestan Province. These data were obtained from the Iranian Meteorological Organization, and the kriging interpolation method was used in ArcGIS 10.8 to prepare a rainfall raster map. Calculating the distance from rivers to each pixel is essential because waterways and their tributaries serve as the primary routes for flooding⁶⁹. The Euclidean distance approach was employed to compute the spatial distance between raster pixels and rivers. The river layer was obtained from the Natural Resources Organization of Iran at a scale of 1:50,000. Lithological factors affect flood susceptibility through infiltration, runoff, and erosion⁴⁶. This factor was extracted from the Iranian geological layer at a scale of 1:100,000. The soil texture factor affects the occurrence of floods through infiltration, water-holding capacity, and erodibility⁵⁵. Two factors, lithology and soil texture, were processed and prepared using ArcGIS 10.8 software.

Methods

Multicollinearity analysis

Multicollinearity is a strong correlation between two or more predictive variables in multivariate regression. This condition can lead to inaccurate statistical inferences (Bui et al. 2011). The Variance Inflation Factor (VIF) is used to measure multicollinearity among the conditioning factors (Pradhan et al. 2017; Shogrkhodaei et al. 2021). VIF values greater than 10 indicate multicollinearity between factors, so any factor whose VIF exceeds 10 should be excluded from modeling (Razavi-Termeh et al. 2020).
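As an illustration of the VIF screening, the sketch below relies on the standard identity that the VIF of factor j equals the jth diagonal entry of the inverse of the factors' correlation matrix. The column values and names are invented toy data, and the helper functions are hypothetical; this is not the authors' implementation.

```python
def correlation_matrix(columns):
    """Pearson correlation matrix of equally long numeric columns."""
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    centered = [[v - m for v in c] for c, m in zip(columns, means)]
    norms = [sum(v * v for v in c) ** 0.5 for c in centered]
    return [[sum(a * b for a, b in zip(ci, cj)) / (ni * nj)
             for cj, nj in zip(centered, norms)]
            for ci, ni in zip(centered, norms)]

def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting."""
    n = len(matrix)
    aug = [row[:] + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def vif(columns):
    """VIF of each column: diagonal of the inverse correlation matrix."""
    inv = invert(correlation_matrix(columns))
    return [inv[j][j] for j in range(len(columns))]

# Toy factors: the first two are nearly collinear, the third is unrelated.
factor_a = [1.0, 2.0, 3.0, 4.0]
factor_b = [1.1, 1.9, 3.2, 3.8]
factor_c = [1.0, -1.0, -1.0, 1.0]
vifs = vif([factor_a, factor_b, factor_c])
flagged = [j for j, v in enumerate(vifs) if v > 10]   # candidates for exclusion
print(vifs, flagged)
```

In this toy case the two nearly collinear factors exceed the threshold of 10, while the unrelated factor does not.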

Frequency ratio (FR)

FR is a widely employed technique in evaluating flood susceptibility⁵⁹. FR measures the likelihood of an event happening based on all the factors that influenced a similar event in the past compared to the possibility of it not happening¹⁶. The flood susceptibility assessment considers both the locations with high flood severity and the extent of the areas affected by the parameters employed in the research area (Shafapour Tehrany et al. 2019) (Eq. 1).

$${\text{FR}} = \frac{{\text{X}}}{{\text{Y}}}$$

(1)

X is the proportion of the total flood area that falls within a given subclass of a conditioning factor, and Y is the proportion of the study area occupied by that subclass.
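A small worked sketch of Eq. 1 follows. The class labels and pixel counts are invented for illustration, and the function name is hypothetical; values above 1 indicate a subclass where floods are over-represented relative to its share of the area.

```python
def frequency_ratio(flood_counts, class_counts):
    """FR per subclass: share of flood pixels divided by share of area."""
    total_flood = sum(flood_counts.values())
    total_area = sum(class_counts.values())
    return {
        cls: (flood_counts.get(cls, 0) / total_flood) / (class_counts[cls] / total_area)
        for cls in class_counts
    }

# Invented pixel counts for three slope subclasses (degrees).
class_counts = {"0-5": 6000, "5-15": 3000, ">15": 1000}   # all pixels per subclass
flood_counts = {"0-5": 150, "5-15": 40, ">15": 10}        # flood pixels per subclass
fr = frequency_ratio(flood_counts, class_counts)
print(fr)
```

Here the flattest subclass holds 75% of the flood pixels but only 60% of the area, giving FR = 1.25.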

CatBoost algorithm

In 2017, the Russian search engine Yandex debuted CatBoost, an algorithm for enhancing the search results. Owing to its improved handling of features and its resolution of prediction shift, CatBoost outperforms conventional Gradient Boosting Decision Tree (GBDT) methods⁷. This advancement guards against overfitting, strengthening the model’s ability to generalize and producing more precise prediction outcomes⁷². Features of the CatBoost algorithm include ordered boosting to overcome target leakage, suitability for small datasets, native handling of categorical features, and support for various data types and formats^28,30. The output of the CatBoost algorithm’s estimation is described as follows¹⁹ (Eq. 2):

$${\text{Z}} = {\text{H}}\left( {{\text{x}}_{{\text{i}}} } \right) = \mathop \sum \limits_{{{\text{j}} = 1}}^{{\text{J}}} {\text{c}}_{{\text{j }}} 1_{{\left\{ {{\text{x}} \in {\text{R}}_{{\text{j}}} } \right\}}}$$

(2)

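where $\text{H}\left({\text{x}}_{\text{i}}\right)$ is a decision tree function of the explanatory variables ${\text{x}}_{\text{i}}$, ${\text{R}}_{\text{j}}$ is the disjoint region corresponding to the jth leaf of the tree, ${\text{c}}_{\text{j}}$ is the value assigned to that leaf, J is the number of leaves, and $1_{\{\cdot\}}$ is the indicator function.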
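To make Eq. 2 concrete, the sketch below evaluates a single tree as a sum of leaf values weighted by indicator functions. It assumes a one-dimensional input and leaf regions given as half-open intervals; the regions and leaf values are invented.

```python
import math

def tree_predict(x, regions, leaf_values):
    """H(x) = sum_j c_j * 1{x in R_j} for disjoint half-open intervals R_j."""
    return sum(c * (lo <= x < hi) for (lo, hi), c in zip(regions, leaf_values))

# Three disjoint regions covering the real line, with one value per leaf.
regions = [(-math.inf, 2.0), (2.0, 5.0), (5.0, math.inf)]
leaf_values = [0.1, 0.4, 0.9]
print(tree_predict(3.0, regions, leaf_values))
```

Because the regions are disjoint, exactly one indicator is 1 for any input, so the sum returns the value of the single leaf that contains x.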

The CatBoost algorithm processes samples with random permutations and mean-label value calculation methods. Additionally, it effectively reduces the impact of noise from low-frequency categorical data by employing a prior distribution term. This approach optimizes processing capacity for high-dimensional sparse data using a base model of a fully symmetric tree³³.
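The mean-label calculation with a prior term can be sketched as an ordered target statistic: each row's category is encoded using only the labels of rows that precede it in a permutation, smoothed toward a prior. This is a simplified illustration, not the library's internal code; the weight `a`, the prior of 0.5, and the soil-texture values are assumptions, and the rows are taken to be already in random-permutation order.

```python
def ordered_target_stats(categories, labels, prior, a=1.0):
    """Encode each categorical value from the labels of earlier rows only."""
    sums, counts, encoded = {}, {}, []
    for cat, y in zip(categories, labels):
        s, n = sums.get(cat, 0.0), counts.get(cat, 0)
        encoded.append((s + a * prior) / (n + a))   # smoothed mean of past labels
        sums[cat] = s + y                           # update only after encoding
        counts[cat] = n + 1
    return encoded

categories = ["clay", "sand", "clay", "clay"]
labels = [1, 0, 1, 0]
encoded = ordered_target_stats(categories, labels, prior=0.5)
print(encoded)
```

A category seen for the first time receives the prior itself, which is how rare categories are kept from producing noisy encodings.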

Whale optimization algorithm (WOA)

The WOA, a swarm intelligence optimization algorithm, was first proposed by Mirjalili and Lewis in 2016. It is inspired by the natural hunting mechanism of humpback whales and mimics their shrinking encirclement of prey, spiral position updates, and random search behavior⁷⁰. In the WOA, each candidate solution is represented as a whale, which tries to move to a new position in the search space using the best member of the group as a reference. Whales use two mechanisms to search for prey and attack: encircling prey and creating bubble nets. In optimization terms, exploration of the search space occurs when whales search for prey, and exploitation happens during attack behavior²³. The steps of the WOA algorithm are described below⁵:

a) During the initial hunting phase, whales encircle the prey once it has been located. The algorithm treats the current best position as the location of the prey, and the whales move toward it using Eqs. 3–6²³:

$$\overrightarrow {{\text{X}}} \left( {{\text{t}} + 1} \right) = \overrightarrow {{\text{X}}}^{*} \left( {\text{t}} \right) - \overrightarrow {{\text{A}}} \cdot \overrightarrow {{\text{D}}}$$

(3)

$$\overrightarrow {{\text{D}}} = \left| {\overrightarrow {{\text{C}}} \cdot \overrightarrow {{\text{X}}}^{*} \left( {\text{t}} \right) - \overrightarrow {{\text{X}}} \left( {\text{t}} \right)} \right|$$

(4)

$$\overrightarrow {{\text{A}}} = 2\overrightarrow {{\text{a}}} \cdot \overrightarrow {{\text{r}}} - \overrightarrow {{\text{a}}}$$

(5)

$$\overrightarrow {{\text{C}}} = 2\overrightarrow {{\text{r}}}$$

(6)

The position of the whale in the next iteration, $\overrightarrow{\text{X}} \left(\text{t}+1\right)$, is determined by the position of the best solution, ${\overrightarrow{\text{X}}}^{*}$, along with the coefficient vectors $\overrightarrow{\text{A}}$ and $\overrightarrow{\text{C}}$. These coefficients depend on a random vector $\overrightarrow{\text{r}}$ in the range [0,1] and a parameter $\overrightarrow{\text{a}}$ that decreases from 2 to 0 over the iterations. Adjusting the vectors $\overrightarrow{\text{A}}$ and $\overrightarrow{\text{C}}$ in Eq. 3 lets the whales reach different positions around the best solution, thereby refining it²³.

b) The second stage is exploration. Efficient convergence toward the global optimum requires both the exploitation and exploration phases. During the exploration phase, the search agents do not move toward the best solution found so far. Instead, they randomly choose another search agent and update their position relative to it. Vector $\overrightarrow{\text{A}}$ controls this movement (Eqs. 7–8)²³:

$$\overrightarrow {{\text{X}}} \left( {{\text{t}} + 1} \right) = \overrightarrow {{\text{X}}}_{{{\text{rand}}}} - \overrightarrow {{\text{A}}} \cdot \overrightarrow {{\text{D}}}$$

(7)

$$\overrightarrow {{\text{D}}} = \left| {\overrightarrow {{\text{C}}} \cdot \overrightarrow {{\text{X}}}_{{{\text{rand}}}} - \overrightarrow {{\text{X}}} } \right|$$

(8)

The search agents keep updating their positions until the termination condition is met, at which point the best solution found so far is returned as the estimate of the global optimum.
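The encircling and exploration updates (Eqs. 3–8) can be sketched as a short loop. Several choices here are assumptions rather than details given in the text: A and C are drawn as scalars per whale, the switch between the two phases uses |A| < 1 as in the original WOA formulation, the spiral bubble-net update is omitted because its equations are not listed above, positions are clipped to the bounds, and a sphere function stands in for the real objective.

```python
import random

def sphere(x):
    """Toy objective with its minimum of 0 at the origin."""
    return sum(v * v for v in x)

def woa_minimize(objective, dim, bounds, n_whales=20, n_iter=100, seed=0):
    rng = random.Random(seed)
    lo, hi = bounds
    whales = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_whales)]
    best = min(whales, key=objective)[:]
    best_score = objective(best)
    for t in range(n_iter):
        a = 2.0 * (1 - t / n_iter)            # decreases linearly from 2 toward 0
        for i, x in enumerate(whales):
            A = 2 * a * rng.random() - a      # Eq. 5 (scalar form)
            C = 2 * rng.random()              # Eq. 6 (scalar form)
            if abs(A) < 1:
                target = best                 # exploitation: Eqs. 3-4
            else:
                target = rng.choice(whales)   # exploration: Eqs. 7-8
            new_x = [min(hi, max(lo, tj - A * abs(C * tj - xj)))
                     for tj, xj in zip(target, x)]
            whales[i] = new_x
            score = objective(new_x)
            if score < best_score:            # remember the best position found
                best, best_score = new_x[:], score
    return best, best_score

best, best_score = woa_minimize(sphere, dim=3, bounds=(-5.0, 5.0))
print(best, best_score)
```

In a hyperparameter search, each whale's coordinates would encode candidate hyperparameter values and the objective would be the model's validation error.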

Zebra optimization algorithm (ZOA)

The ZOA, a metaheuristic algorithm that debuted in 2022, takes its cues from how zebras act in the wild⁶³. Two behaviors are essential in the social life of zebras: searching for food and defending against attackers. The zebra leader allows the rest of the herd to follow in its footsteps to get closer to the food source²¹. Zebras have two defense techniques against their enemies: the first is a zigzag escape pattern, and the second is gathering together to confuse or frighten the hunter⁶³. ZOA mimics the actions of zebras as they forage for food and defend themselves from predators. Finding the right balance between exploration and exploitation could be the key to using the ZOA to solve real-world optimization challenges^21,63. The following presents mathematical simulations of natural zebra behavior for the ZOA model²².

Initialization

Every zebra represents a candidate solution, and the area in which the zebras are situated represents the search space of the problem. A single vector is sufficient to represent each zebra. The ZOA population matrix is given by Eq. 9²².

$${\text{P}} = \left[ {\begin{array}{c} {{\text{P}}_{1} } \\ \vdots \\ {{\text{P}}_{{\text{i}}} } \\ \vdots \\ {{\text{P}}_{{\text{N}}} } \\ \end{array} } \right] = \left[ {\begin{array}{ccccc} {{\text{P}}_{1,1} } & \cdots & {{\text{P}}_{{1,{\text{j}}}} } & \cdots & {{\text{P}}_{{1,{\text{m}}}} } \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ {{\text{P}}_{{{\text{i}},1}} } & \cdots & {{\text{P}}_{{{\text{i}},{\text{j}}}} } & \cdots & {{\text{P}}_{{{\text{i}},{\text{m}}}} } \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ {{\text{P}}_{{{\text{N}},1}} } & \cdots & {{\text{P}}_{{{\text{N}},{\text{j}}}} } & \cdots & {{\text{P}}_{{{\text{N}},{\text{m}}}} } \\ \end{array} } \right]$$

(9)

P, ${\text{P}}_{\text{i}}$, and ${\text{P}}_{\text{i},\text{j}}$ are the zebra population, the ith zebra candidate, and the value of the jth problem variable suggested by the ith zebra candidate, respectively. N represents the number of zebras (search agents) in the population, and m represents the number of variables to be set. The values of the fitness function are described by Eq. 10²².

$${\text{F}} = \left[ {\begin{array}{*{20}c} {\begin{array}{*{20}c} {{\text{F}}_{1} } \\ \vdots \\ \end{array} } \\ {\begin{array}{*{20}c} {{\text{F}}_{{\text{i}}} } \\ \vdots \\ \end{array} } \\ {{\text{F}}_{{\text{N}}} } \\ \end{array} } \right] = \left[ {\begin{array}{*{20}c} {\begin{array}{*{20}c} {{\text{F}}({\text{P}}_{1} )} \\ \vdots \\ \end{array} } \\ {\begin{array}{*{20}c} {{\text{F}}({\text{P}}_{{\text{i}}} )} \\ \vdots \\ \end{array} } \\ {{\text{F}}\left( {{\text{P}}_{{\text{N}}} } \right)} \\ \end{array} } \right]{ }$$

(10)

F is the column vector of fitness function values, and ${\text{F}}_{\text{i}}$ is the fitness function value obtained for the ith zebra.

Foraging activity

The best-performing member of the zebra population becomes the leader and guides the other members of the group toward its position in the search space. The following equations model the zebras’ position update during the foraging phase²²:

$${\text{P}}_{{{\text{i}},{\text{j}}}}^{{{\text{new}},{\text{S}}1}} = {\text{P}}_{{{\text{i}},{\text{j}}}} + {\text{r}} \cdot \left( {{\text{ZL}}_{{\text{j}}} - {\text{I}} \cdot {\text{P}}_{{{\text{i}},{\text{j}}}} } \right)$$

(11)

$${\text{P}}_{{\text{i}}} = \left\{ {\begin{array}{*{20}l} {{\text{P}}_{{\text{i}}}^{{{\text{new}},{\text{S}}1}} ,} \hfill & {{\text{F}}_{{\text{i}}}^{{{\text{new}},{\text{S}}1}} < {\text{F}}_{{\text{i}}} } \hfill \\ {{\text{P}}_{{\text{i}}} ,} \hfill & {{\text{else}}} \hfill \\ \end{array} } \right.$$

(12)

${\text{P}}_{\text{i}}^{\text{new},\text{S}1}$ represents the updated position of the ith zebra in the first stage, ${\text{P}}_{\text{i},\text{j}}^{\text{new},\text{S}1}$ is the value of its jth dimension, and ${\text{F}}_{\text{i}}^{\text{new},\text{S}1}$ is its fitness function value. ZL represents the zebra leader, and ${\text{ZL}}_{\text{j}}$ is its jth dimension. r is a random value between 0 and 1, and $\text{I}=\text{round}(1+\text{rand})$, where rand is also a random value between 0 and 1, so I takes the value 1 or 2.
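One foraging step (Eqs. 11–12) can be sketched as follows. The sketch assumes a minimization problem, draws r independently for each dimension, and uses a sphere function as a stand-in objective; the function names, herd size, and bounds are hypothetical.

```python
import random

def sphere(x):
    """Toy objective with its minimum of 0 at the origin."""
    return sum(v * v for v in x)

def zoa_foraging_step(population, objective, rng):
    """Move every zebra relative to the leader and keep only improvements."""
    fitness = [objective(p) for p in population]
    leader = population[fitness.index(min(fitness))]      # ZL: best zebra so far
    updated = []
    for p, f in zip(population, fitness):
        I = round(1 + rng.random())                       # I is 1 or 2
        candidate = [pj + rng.random() * (zl_j - I * pj)  # Eq. 11
                     for pj, zl_j in zip(p, leader)]
        # Eq. 12: accept the new position only if it lowers the fitness value.
        updated.append(candidate if objective(candidate) < f else p)
    return updated

rng = random.Random(1)
herd = [[rng.uniform(-5, 5) for _ in range(3)] for _ in range(10)]
new_herd = zoa_foraging_step(herd, sphere, rng)
print(min(map(sphere, herd)), min(map(sphere, new_herd)))
```

The greedy acceptance in Eq. 12 means no zebra's fitness can get worse from one step to the next.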

Anti-Predator defense technique

Here, we update the search space placements of the ZOA population’s individuals by mimicking zebras’ defensive strategies²². This stage includes two techniques: The defensive technique against the lion and the defensive technique against other predators.

Validation

Several assessment measures were employed to assess and validate the created models and FSMs. These metrics were evaluated in two categories: evaluation of the developed models (Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Square Error (RMSE) indices) and evaluation of flood susceptibility maps (Receiver Operating Characteristic (ROC)). Equations 13–15 can be used to determine the models’ evaluation metrics by comparing the actual values with their predictions^12,47.

$${\text{MSE}} = \frac{1}{{\text{n}}}\mathop \sum \limits_{{{\text{i}} = 1}}^{{\text{n}}} \left( {{\text{y}}_{{\text{i}}} - {\text{y}}_{{\text{i}}}^{\prime } } \right)^{2}$$

(13)

$${\text{RMSE}} = \sqrt {\frac{1}{{\text{n}}}\mathop \sum \limits_{{{\text{i}} = 1}}^{{\text{n}}} \left( {{\text{y}}_{{\text{i}}} - {\text{y}}_{{\text{i}}}^{\prime } } \right)^{2} }$$

(14)

$${\text{MAE}} = \frac{1}{{\text{n}}}\mathop \sum \limits_{{{\text{i}} = 1}}^{{\text{n}}} \left| {{\text{y}}_{{\text{i}}} - {\text{y}}_{{\text{i}}}^{\prime } } \right|$$

(15)

The predicted value is denoted by ${\text{y}}^{\prime }$, the actual value by y, and the number of samples by n. The ROC curve and the Area Under the Curve (AUC) are commonly employed in natural disaster research to evaluate the efficacy of susceptibility models³⁵. The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR), which allows the effectiveness of binary classification models to be evaluated (Eq. 16)¹².

$$\left\{ {\begin{array}{*{20}c} {{\text{TPR}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FN}}}}} \\ {{\text{FPR}} = \frac{{{\text{FP}}}}{{{\text{FP}} + {\text{TN}}}}} \\ \end{array} } \right.$$

(16)

The AUC is a numerical value that measures the performance of a classification model. It runs from 0 to 1 and is calculated using Eq. 17⁴⁷.

$${\text{AUC}} = \frac{{\sum {\text{TP}} + \sum {\text{TN}}}}{{{\text{P}} + {\text{N}}}}$$

(17)

P is the total number of flood points, and N is the total number of non-flood points.
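The validation measures can be illustrated with a short sketch on invented labels and predicted probabilities. It assumes a classification threshold of 0.5, uses the standard definition FPR = FP / (FP + TN), evaluates Eq. 17 exactly as written at that single threshold, and adds the usual rank-based estimate of the area under the ROC curve for comparison; all function names are hypothetical.

```python
import math

def error_metrics(y_true, y_pred):
    """MSE, RMSE, and MAE as in Eqs. 13-15."""
    n = len(y_true)
    mse = sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n
    mae = sum(abs(y - p) for y, p in zip(y_true, y_pred)) / n
    return {"MSE": mse, "RMSE": math.sqrt(mse), "MAE": mae}

def confusion_rates(y_true, y_pred, threshold=0.5):
    """TPR, FPR, and the Eq. 17 ratio at one classification threshold."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p >= threshold)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p < threshold)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p >= threshold)
    tn = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p < threshold)
    return {"TPR": tp / (tp + fn),
            "FPR": fp / (fp + tn),
            "Eq17": (tp + tn) / (tp + fn + fp + tn)}

def rank_auc(y_true, y_pred):
    """Share of (flood, non-flood) pairs ranked correctly; ties count half."""
    pos = [p for y, p in zip(y_true, y_pred) if y == 1]
    neg = [p for y, p in zip(y_true, y_pred) if y == 0]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
errors = error_metrics(y_true, y_pred)
rates = confusion_rates(y_true, y_pred)
auc = rank_auc(y_true, y_pred)
print(errors, rates, auc)
```

Sweeping the threshold from 1 to 0 and plotting the resulting (FPR, TPR) pairs traces the ROC curve, and the rank-based value equals the area under that curve.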

Model implementation

Models for flood susceptibility were created in the Google Colab environment using Python. The models take the values of the thirteen conditioning factors at the sample locations as input and produce a likelihood prediction for each pixel in the study region. The CatBoost model was optimized using two metaheuristic algorithms (WOA and ZOA) with a population of 100 and 50 iterations. The control parameters of the metaheuristic algorithms were determined through trial and error. The objective function for optimizing the hyperparameters of the CatBoost model is the NRMSE (Normalized Root Mean Squared Error) index (Eq. 18)³².

$${\text{NRMSE}} = \frac{{{\text{RMSE}}}}{{{\text{y}}_{{{\text{max}}}} - {\text{y}}_{{{\text{min}}}} }}$$

(18)

where ${\text{y}}_{\text{max}}$ and ${\text{y}}_{\text{min}}$ are the maximum and minimum observed values, respectively. The two swarm-based algorithms search for the CatBoost hyperparameters that minimize this index over successive iterations.
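A minimal sketch of the NRMSE objective in Eq. 18 is given below, using invented labels and predictions. The function name is hypothetical, and how the authors wired this index into the WOA and ZOA search loops is not specified in the text.

```python
import math

def nrmse(y_true, y_pred):
    """RMSE divided by the range of the observed values (Eq. 18)."""
    n = len(y_true)
    rmse = math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n)
    return rmse / (max(y_true) - min(y_true))

y_true = [0, 1, 1, 0]
y_pred = [0.1, 0.8, 0.6, 0.3]
score = nrmse(y_true, y_pred)
print(score)
```

Because the flood labels are 0 and 1, the observed range is 1 whenever both classes are present, so NRMSE coincides with RMSE for this dataset.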
