Machine learning model optimization for flood susceptibility zonation over the Kosi megafan, Himalayan foreland basin, India

Machine Learning


Data

The data used in this study to generate the 19 conditioning factors, and the flood inventory (dependent variable) were obtained from three primary sources: 1) ALOS PALSAR digital elevation model (DEM), 2) optical and synthetic aperture radar (SAR) satellite imagery, and 3) ancillary geospatial datasets covering soil, rainfall, LULC, and lithology. Table 1 shows the data source, type, and other relevant information. The complete methodology framework is provided through a flowchart illustrated in Fig. 2.

Table 1 Details of the data sources and their usage in the study.
Fig. 2 Flowchart of methodology.

ALOS PALSAR DEM

The ALOS PALSAR DEM, with a spatial resolution of 12.5 m, was downloaded from the Alaska Satellite Facility (https://search.asf.alaska.edu/). This DEM provided the topographic data necessary to derive several key conditioning factors, including altitude, slope, aspect, curvature (longitudinal, plan, and profile), TWI, and others.

Satellite imagery and ancillary data

Landsat 5 Thematic Mapper (TM) imagery, ENVISAT-1 Advanced Synthetic Aperture Radar (ASAR) Image Mode Medium Resolution (IMM) data, and Sentinel-2A imagery, along with additional auxiliary information, were employed in this study. Landsat 5 TM optical imagery, acquired on August 8th and 9th, 2008, was used to pinpoint flood pixels and was sourced from the USGS Earth Explorer portal (https://earthexplorer.usgs.gov/). For details on the spectral features and applications of the Landsat 5 TM imagery, refer to Markham & Barker (1985). The ENVISAT-1 ASAR IMM data, collected between September 2nd and 5th, 2008, were retrieved from the ESA online dissemination portal (https://esar-ds.eo.esa.int/oads/access/). Because SAR can image through cloud cover, the ENVISAT-1 ASAR data played a vital role in delineating flood extent, especially in regions obscured by clouds in the Landsat imagery. The specific dataset utilized was the “Image Mode Medium Resolution Image (stripline)” (ESA-ENVISAT, 2012), which features a swath width ranging from 5 to 1150 km. Auxiliary data for the Land Use Land Cover (LULC) map, soil map, rainfall, lithology, and lineament were obtained from various sources. Sentinel-2A imagery was used to generate the LULC map, which was categorized into eight classes: shrubs, grassy areas, cropland, developed lands, sparse vegetation, water bodies, wetlands, and forests. The soil map was developed using the Harmonized World Soil Database (HWSD), sourced from the Food and Agriculture Organization (FAO) of the United Nations. This dataset is available in different soil classification classes. Average Annual Rainfall (AAR) data were gathered from the Climate Forecast System Reanalysis (CFSR) of the National Centers for Environmental Prediction (NCEP). This dataset offered long-term average precipitation data for the study region.
Geomorphological data were sourced from geological maps published by the Geological Survey of India (GSI). This information illustrated the types of geomorphological units and their likely effects on hydrological processes. Lineament data were also obtained from the GSI.

Flood inventory

Flood extent polygons, representing the dependent variable for model training and validation, were derived through a combination of techniques. The Normalized Difference Water Index (NDWI) was calculated from the Landsat 5 TM imagery and used to delineate open water areas60. However, due to cloud cover in the Landsat imagery, the ENVISAT-1 ASAR data for the year 2008 was utilized to refine the flood extent mapping, particularly in areas obscured by clouds61. A thresholding approach was applied to the NDWI and SAR backscatter values to identify flooded pixels62. The derived flood polygons represent the spatial extent of flooding during the period of image acquisition63. The derivation of the flood inventory is provided in Fig. 3.

Fig. 3 Flood inventory preparation.

The ENVISAT-1 ASAR dataset used for the flood inventory has the following characteristics:

Mode: Image Mode Medium Resolution (IMM).

Polarization: VV (vertical transmit, vertical receive).

Resolution: 150 m × 150 m (azimuth × range).

Incidence angle: 23°–46°. As an active microwave sensor, ASAR can image through cloud cover.
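The thresholding step described for the flood inventory can be sketched as follows. This is a minimal illustration, not the authors' workflow: it assumes the McFeeters form NDWI = (Green − NIR)/(Green + NIR) (bands 2 and 4 on Landsat 5 TM), and the threshold values `ndwi_min` and `sar_max_db`, the cloud mask, and the toy arrays are all hypothetical placeholders, since the study does not report them.

```python
import numpy as np

def ndwi(green, nir):
    """NDWI = (Green - NIR) / (Green + NIR); NaN where the denominator is zero."""
    green = np.asarray(green, dtype=float)
    nir = np.asarray(nir, dtype=float)
    denom = green + nir
    out = np.full(green.shape, np.nan)
    np.divide(green - nir, denom, out=out, where=denom != 0)
    return out

def flood_mask(green, nir, sar_db, cloud_mask, ndwi_min=0.0, sar_max_db=-15.0):
    """Use the optical water test on clear pixels and the SAR test under cloud."""
    water_optical = ndwi(green, nir) > ndwi_min                 # hypothetical threshold
    water_sar = np.asarray(sar_db, dtype=float) < sar_max_db    # hypothetical threshold (dB)
    return np.where(np.asarray(cloud_mask, dtype=bool), water_sar, water_optical)

# Toy 2 x 2 scene: reflectances, SAR backscatter in dB, and a cloud mask.
green = np.array([[0.3, 0.1], [0.1, 0.1]])
nir = np.array([[0.1, 0.3], [0.3, 0.3]])
sar_db = np.array([[-5.0, -20.0], [-5.0, -5.0]])
cloud = np.array([[False, True], [True, False]])
mask = flood_mask(green, nir, sar_db, cloud)
```

Smooth open water tends to return low backscatter, which is why the SAR test uses an upper bound; in practice both thresholds would be tuned against reference data for the scene.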

Flood conditioning factors: Selection and significance

The selection of appropriate conditioning factors is crucial for developing accurate and reliable flood susceptibility models. These factors represent the various environmental, topographical, and anthropogenic characteristics that influence the occurrence and severity of flooding. Our selection process was guided by a comprehensive literature review, coupled with expert knowledge of the specific hydrological and geomorphological conditions of the Kosi Megafan64. The 19 factors chosen for this study are categorized into four groups (Anthropogenic, Environmental, Hydrological, and Topographical), as illustrated in Fig. 4. Each factor is described below:

Fig. 4 Maps of the conditioning factors. (A–I) Altitude, Distance to Lineament, Distance to River, Distance to Road, Geomorphology, Longitudinal Curvature, Land Use Land Cover, Normalized Difference Vegetation Index, Plan Curvature. (J–S) Profile Curvature, Average Annual Rainfall, Slope (Degree), Slope Aspect, Soil Type, Stream Density, Stream Power Index, Topographic Position Index, Topographic Ruggedness Index, Topographic Wetness Index. (NOTE: These conditioning factors’ maps were generated by the corresponding author (MP) when he was working at UCRD, Chandigarh University, Mohali, Punjab, India, and he thanks the organisation (CU) for providing the lab facilities, e.g. licensed version of ArcGIS 10.8.)

Conditioning factors

Knowledge of the field conditions of the Kosi Megafan, combined with extensive literature consultation, has helped us to outline a list of important flood conditioning factors (CgFs) that directly or indirectly contribute to flood inundation. All CgFs have been grouped and explained in four categories: Anthropogenic Factors, Environmental Factors, Hydrological Factors, and Topographical Factors.

Anthropogenic factors

Distance from Road (Dis2Road): Roads and transportation infrastructure can significantly alter natural drainage patterns and increase surface runoff, thereby influencing flood risk65. Proximity to roads was calculated using the Euclidean distance method in ArcMap 10.8 – Spatial Analyst Tools. The resulting Dis2Road layer represents the distance, in meters, from each pixel to the nearest road. Values in the study area range from 0 to 10,879.9 m (Fig. 4D). Areas closer to roads are generally considered more susceptible to flooding due to disrupted drainage and increased impervious surface area66.

Land Use Land Cover (LULC): LULC significantly impacts hydrological processes such as infiltration, runoff, and evapotranspiration, thereby influencing flood susceptibility67. A LULC map (Fig. 4G) was derived from Sentinel-2A satellite imagery using a supervised classification approach. Eight distinct LULC classes were identified: shrubs, grassy areas, cropland, developed areas, sparse vegetation, water bodies, wetlands, and forests68. The classification achieved an overall accuracy of 80.7% based on a confusion matrix assessment. Different LULC classes exhibit varying degrees of flood susceptibility. For example, developed areas with high imperviousness tend to have higher flood risk compared to forested areas69.

Environmental factors

Normalized Difference Vegetation Index (NDVI): NDVI is a widely used indicator of vegetation density and health, derived from the reflectance difference between near-infrared and red bands of satellite imagery. Vegetation plays a crucial role in regulating hydrological processes, influencing interception, infiltration, and soil erosion70. Higher NDVI values generally indicate denser and healthier vegetation, which can contribute to reduced flood risk. NDVI (Fig. 4H) was calculated using the standard formula: NDVI = (NIR − Red)/(NIR + Red), where NIR is the near-infrared band and Red is the red band of the Sentinel-2A imagery71.

Soil Type: Soil properties, such as texture, porosity, and permeability, significantly influence water infiltration, retention, and runoff generation, thereby affecting flood susceptibility72. The soil classification map (Fig. 4N) was obtained from the Food and Agriculture Organization (FAO) of the United Nations (http://www.fao.org)18,19. Different soil types exhibit varying hydrological responses, with coarser-textured soils generally having higher infiltration rates and lower runoff potential compared to fine-textured soils73.

Hydrological factors

Distance to River: Distance to River is a fundamental conditioning factor in flood susceptibility modeling, representing a location’s proximity to a river or stream, the primary source of floodwater during inundation events. Areas situated closer to rivers are inherently more susceptible to flooding due to their location within the floodplain and increased exposure to overbank flow, channel avulsions, and bank erosion74. In this study, the Distance to River layer (Fig. 4C) was generated using the Euclidean distance method embedded in ArcMap 10.8 – Spatial Analyst Tools.
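The distance layers (to river, road, and lineament) were produced with the ArcMap Euclidean distance tool. For readers without ArcMap, the idea can be sketched with NumPy as below. This is an illustrative brute-force version suitable only for small rasters; the function name, the 12.5 m cell size (borrowed from the DEM resolution), and the toy mask are assumptions, not part of the study.

```python
import numpy as np

def euclidean_distance(feature_mask, cell_size=12.5):
    """Distance (in map units) from every cell to the nearest True cell.

    Brute force: compares every cell with every feature cell, so memory
    grows as (number of cells) x (number of feature cells).
    """
    mask = np.asarray(feature_mask, dtype=bool)
    sources = np.argwhere(mask).astype(float)
    if sources.size == 0:
        raise ValueError("feature_mask contains no feature cells")
    rows, cols = np.indices(mask.shape)
    cells = np.stack([rows.ravel(), cols.ravel()], axis=1).astype(float)
    diff = cells[:, None, :] - sources[None, :, :]
    nearest = np.sqrt((diff ** 2).sum(axis=2)).min(axis=1)
    return nearest.reshape(mask.shape) * cell_size

# Toy raster with a single "river" cell in the top-left corner.
river = np.zeros((3, 3), dtype=bool)
river[0, 0] = True
dist = euclidean_distance(river)
```

For full-size rasters a dedicated distance transform (for example the one in SciPy's `ndimage` module) would be the practical choice.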

Stream Density: Stream density reflects the drainage network’s density within a given area, indicating the efficiency of surface water removal. Higher stream density generally corresponds to faster runoff and potentially higher flood risk75. Stream density (Fig. 4O) was calculated as the total length of streams within a defined area, divided by the area: Stream Density = ΣL/A where ΣL is the total length of streams and A is the area76.

Stream Power Index (SPI): SPI quantifies the erosive power of flowing water, which is related to sediment transport and channel instability, both of which can influence flood dynamics. It is calculated as: SPI = As * tan(β), where As is the specific catchment area and β is the slope gradient. Higher SPI values indicate areas with greater potential for erosion and sediment transport77. The SPI map is shown in Fig. 4P.

Topographic Wetness Index (TWI): TWI is a widely used indicator of the potential for water accumulation based on topographic characteristics. It reflects the tendency of water to accumulate at a given location due to gravitational forces78. TWI is calculated as: TWI = ln(As/tan(β)), where As is the specific catchment area and β is the slope gradient. Higher TWI values indicate areas more likely to be saturated with water6,7. The TWI map is shown in Fig. 4S.
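The SPI and TWI formulas above translate directly into array operations. The sketch below assumes the specific catchment area and the slope (in degrees) are already available as arrays; the function name and the small floor `min_tan` used to keep TWI finite on perfectly flat cells are illustrative choices, not values from the study.

```python
import numpy as np

def spi_twi(specific_catchment_area, slope_deg, min_tan=1e-6):
    """Return (SPI, TWI) with SPI = As * tan(beta) and TWI = ln(As / tan(beta))."""
    a_s = np.asarray(specific_catchment_area, dtype=float)
    tan_beta = np.tan(np.radians(np.asarray(slope_deg, dtype=float)))
    spi = a_s * tan_beta
    # Flat cells have tan(beta) = 0; clamp only inside the logarithm.
    twi = np.log(a_s / np.maximum(tan_beta, min_tan))
    return spi, twi

# Two cells with the same catchment area: one steep (45 degrees), one flat.
spi, twi = spi_twi(np.array([100.0, 100.0]), np.array([45.0, 0.0]))
```

The flat cell receives the larger TWI, which matches the interpretation in the text that low-gradient cells with large contributing areas are the most likely to saturate.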

Topographical factors

Altitude: Elevation plays a crucial role in controlling drainage patterns, flow direction, and water accumulation. Lower-lying areas are generally more susceptible to flooding79. Altitude data were derived from the ALOS PALSAR DEM (12.5 m resolution). The altitude in the study area ranges from 23 to 147 m (Fig. 4A).

Distance to Lineament: Lineaments, representing linear geological features such as faults and fractures, can influence groundwater flow and surface water drainage, potentially affecting flood dynamics80. Distance to lineament (Fig. 4B) was calculated using the Euclidean distance method present in ArcMap 10.8 – Spatial Analyst Tools, measuring the distance from each pixel to the nearest mapped lineament.

Geomorphology: Geomorphological units, representing distinct landforms shaped by various geomorphic processes, provide insights into the landscape’s susceptibility to flooding81. A geomorphological map of the study area was obtained from the Geological Survey of India and classified into distinct geomorphological units, each with varying flood susceptibility characteristics. The geomorphological unit map is shown in Fig. 4E.

Longitudinal Curvature: Longitudinal curvature measures the curvature of the terrain along the slope’s direction, which significantly impacts the convergence and divergence of flow paths. This curvature can help identify areas where water is likely to accumulate or flow rapidly, thus influencing flood susceptibility. Ref. 82 highlights that understanding topographic features, including longitudinal curvature, is essential for effective flood risk mapping and management. Its integrated framework combines machine learning models with terrain analysis to enhance flood susceptibility assessments, demonstrating how longitudinal curvature can be a critical factor in predicting flood-prone areas. The longitudinal curvature map is shown in Fig. 4F.

Plan Curvature: Plan curvature, which measures the curvature of the terrain perpendicular to the slope, also plays a crucial role in flood susceptibility mapping. It affects lateral water movement across the landscape, influencing how water collects in certain areas. Ref. 83 emphasizes the importance of plan curvature in flood impact assessments, noting that it helps visualize areas affected by extreme flood events. That study illustrates how spatial analysis techniques, including plan curvature, can provide valuable insights into flood dynamics and susceptibility. Furthermore, Ref. 84 discusses how plan curvature contributes to understanding urban pluvial flooding characteristics, reinforcing its relevance in flood susceptibility mapping. Plan curvature is illustrated in Fig. 4I.

Profile Curvature: Profile curvature measures the rate of change of slope along a flow path, affecting flow acceleration and deceleration. This curvature is essential for identifying areas where water may either speed up, increasing flood risk, or slow down, potentially leading to accumulation. Ref. 85 emphasizes the integration of profile curvature in flood risk assessments, noting that it can significantly influence the identification of flood-prone zones. By analyzing various terrain parameters, including profile curvature, that study provides a comprehensive understanding of flood susceptibility in the Hunza-Nagar Valley, Pakistan. Profile curvature is shown in Fig. 4J.

Slope (Degree): Slope is a fundamental topographic parameter that influences runoff velocity, infiltration, and erosion potential86. Slope (Fig. 4L) was derived from the DEM and expressed in degrees, ranging from 0 to 36.24 degrees in the study area.

Slope Aspect: Aspect represents the compass direction a slope faces, influencing solar radiation exposure, which can affect evapotranspiration and snowmelt patterns, indirectly influencing flood dynamics87. Aspect was derived from the DEM and categorized into standard compass directions; it is shown in Fig. 4M.

Topographic Position Index (TPI): TPI compares the elevation of a cell to the mean elevation of its surrounding neighborhood, highlighting relative topographic position. Positive TPI values generally indicate ridges or hilltops, while negative values indicate valleys or depressions. The TPI was calculated by subtracting the mean elevation of the neighborhood from the elevation of the central cell88. The TPI map is shown in Fig. 4Q.

Topographic Ruggedness Index (TRI): TRI quantifies the variability in elevation within a defined neighborhood, reflecting the roughness or complexity of the terrain (Duan et al., 2014). Higher TRI values indicate more rugged terrain, which can influence flow paths and runoff patterns. TRI was computed using the ‘Terrain Ruggedness Index’ tool in SAGA GIS 7.8.2 with a 3 × 3 pixel neighborhood. This neighborhood size is commonly used for calculating TRI (Fig. 4R). The following equation, based on89, was used: TRI = √(|x₁² − x₀²| + |x₂² − x₀²| + … + |x₈² − x₀²|), where x₀ is the elevation of the central cell and x₁ to x₈ are the elevations of the eight neighboring cells90.
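The TPI and TRI definitions above can be sketched on a small DEM array as follows. This is an illustration of the formulas as written in the text, not the SAGA GIS implementation; the 3 × 3 window with the centre excluded, the edge-replication padding at the raster border, and the function names are assumptions.

```python
import numpy as np

def neighbor_stack(dem):
    """Stack the eight 3 x 3 neighbours of every cell (edges are replicated)."""
    dem = np.asarray(dem, dtype=float)
    padded = np.pad(dem, 1, mode="edge")
    rows, cols = dem.shape
    shifts = [(r, c) for r in range(3) for c in range(3) if (r, c) != (1, 1)]
    return np.stack([padded[r:r + rows, c:c + cols] for r, c in shifts])

def tpi(dem):
    """Cell elevation minus the mean elevation of its eight neighbours."""
    dem = np.asarray(dem, dtype=float)
    return dem - neighbor_stack(dem).mean(axis=0)

def tri(dem):
    """sqrt(sum_i |x_i^2 - x_0^2|) over the eight neighbours, as in the text."""
    dem = np.asarray(dem, dtype=float)
    return np.sqrt(np.abs(neighbor_stack(dem) ** 2 - dem ** 2).sum(axis=0))

# A single raised cell surrounded by flat terrain.
dem = np.array([[1.0, 1.0, 1.0],
                [1.0, 2.0, 1.0],
                [1.0, 1.0, 1.0]])
tpi_map = tpi(dem)
tri_map = tri(dem)
```

The raised centre cell gets a positive TPI (a local high), consistent with the interpretation of positive values as ridges or hilltops.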

Test of multicollinearity

To ensure the independence of the conditioning factors, a multicollinearity test was performed using the Variance Inflation Factor (VIF) and Tolerance. Multicollinearity occurs when two or more predictor variables are highly correlated, which can destabilize model estimations. Generally, a VIF value greater than 5 or 10 (depending on the source) and a Tolerance value less than 0.1 or 0.2 indicate problematic multicollinearity.
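As a rough illustration of the test, VIF for factor j can be computed as 1/(1 − R²ⱼ), where R²ⱼ comes from regressing factor j on all the other factors, and Tolerance is its reciprocal. The sketch below uses ordinary least squares from NumPy; the function name and the synthetic data are hypothetical, and the study's own software for this step is not stated.

```python
import numpy as np

def vif_and_tolerance(X):
    """Return (VIF, Tolerance) for each column of the predictor matrix X."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vif = np.empty(p)
    for j in range(p):
        y = X[:, j]
        # Regress column j on the remaining columns plus an intercept.
        design = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        residual = y - design @ coef
        r_squared = 1.0 - (residual @ residual) / ((y - y.mean()) ** 2).sum()
        vif[j] = 1.0 / (1.0 - r_squared)
    return vif, 1.0 / vif

# Synthetic example: x3 is almost a copy of x1, x2 is independent.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + 0.01 * rng.normal(size=200)
vif, tolerance = vif_and_tolerance(np.column_stack([x1, x2, x3]))
```

In this toy case the near-duplicate pair (x1, x3) is flagged by very large VIF values, while x2 stays close to 1. Exactly collinear columns would give R² = 1 and an undefined VIF, so they should be removed beforehand.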

Variable significance test

The significance levels of the flood predictors, obtained from the Random Forest algorithm, have been arranged sequentially as depicted in the figure. Random Forest, a method based on ensemble learning, is frequently used in feature selection to pinpoint the most pertinent predictors of the variable under investigation91. In this study, the significance and ranking of attributes were also determined using Random Forest methods. The methodology and the equations employed to calculate the weights and the resulting rankings are elaborated in Ref. 92. The application of Random Forest in flood susceptibility studies has been well-documented, demonstrating its effectiveness in identifying critical predictors that influence flood events. For instance, recent studies have utilized Random Forest to assess flood vulnerability in various regions, highlighting its ability to handle complex datasets and provide reliable predictions44,93. Additionally, the integration of Random Forest with other machine learning techniques has shown promise in enhancing model accuracy and robustness94,95.

In the context of flood risk management, understanding the relative importance of different predictors is essential for developing effective mitigation strategies. The use of Random Forest allows for a comprehensive analysis of various conditioning factors, such as topography, hydrology, and land use, which are crucial for accurate flood modeling96,97. By leveraging the strengths of Random Forest, researchers can improve flood susceptibility assessments and contribute to more informed decision-making processes in flood-prone areas98,99.

Flood susceptibility prediction models

Artificial neural network (ANN)

Artificial Neural Networks (ANNs) are advanced computational models inspired by the neural architecture of the human brain, designed to address complex problems through parallel distributed processing100,101. These networks have gained prominence in various fields, particularly in pattern recognition, where six primary ANN models are frequently utilized: Hamming network, Carpenter/Grossberg classifier, Hopfield network, Kohonen’s self-organizing feature maps, single-layer perceptron, and multi-layer perceptron (MLP)102. Each of these models employs distinct learning methodologies, including feed-forward backpropagation, gradient descent with momentum, adaptive learning rate backpropagation, radial basis function, and Levenberg–Marquardt optimization75,103. The MLP, in particular, has emerged as the most widely adopted model in remote sensing and predictive analytics due to its effectiveness in learning complex mappings from inputs to outputs76,77.

The architecture of an MLP typically consists of three interconnected layers: the input layer, hidden layer(s), and output layer. The input layer receives the data, while the hidden layer processes this information to identify patterns and relationships, ultimately passing the results to the output layer104. The number of hidden layers and neurons can be adjusted based on the complexity of the problem; however, a single hidden layer is often sufficient for many applications105. The hidden layer is crucial for the network’s ability to learn from data, as it facilitates the transformation of input signals into meaningful outputs79.

In the MLP training process, neuron weights are adjusted through forward and backward propagation methods. The backpropagation algorithm is particularly significant, as it allows for the systematic updating of weights based on the error between predicted and actual outputs106. The mathematical representation of the MLP function can be expressed as follows:

$${\text{y}}_{\text{j}}=\text{f}\left(\sum_{\text{i}=1}^{\text{N}}{\text{w}}_{\text{ji}}{\text{x}}_{\text{i}}+{\text{b}}_{\text{j}}\right)$$

(1)

where \({\text{x}}_{\text{i}}\) and \({\text{y}}_{\text{j}}\) are the values of the ith node in the previous layer and the jth node in the present layer, respectively, and \({\text{b}}_{\text{j}}\) is the bias of the jth node in the present layer. \({\text{w}}_{\text{ji}}\) is the weight connecting \({\text{x}}_{\text{i}}\) and \({\text{y}}_{\text{j}}\), N is the total number of nodes in the previous layer, and f is the activation function of the present layer80, here the logistic sigmoid:

$$f\left(z\right)=\frac{1}{1+{e}^{-z}}$$

(1a)
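Equations (1) and (1a) amount to a matrix-vector product followed by an element-wise sigmoid. The sketch below shows one forward pass through a network with 19 inputs and 39 hidden nodes (the sizes reported for this study); the single output node, the random weights, and the function names are assumptions made purely for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(x, W, b):
    """y_j = f(sum_i w_ji * x_i + b_j) for every node j of the present layer."""
    return sigmoid(W @ x + b)

rng = np.random.default_rng(0)
n_inputs, n_hidden = 19, 39              # sizes taken from the study
x = rng.random(n_inputs)                 # one pixel's (scaled) conditioning factors
W1 = rng.normal(scale=0.1, size=(n_hidden, n_inputs))   # placeholder weights
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(1, n_hidden))          # assumed single output node
b2 = np.zeros(1)

hidden = layer_forward(x, W1, b1)
susceptibility = layer_forward(hidden, W2, b2)
```

Because the output passes through the sigmoid, the result lies strictly between 0 and 1 and can be read as a susceptibility score.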

During training, weights are updated using the backpropagation algorithm. An improved weight update rule that incorporates momentum can be expressed as:

$$w_{ji}(t+1)=w_{ji}(t)-\eta \frac{\partial E}{\partial w_{ji}}+\alpha \Delta w_{ji}(t)$$

(2)

where \(\eta\) is the learning rate, \(\alpha\) is the momentum factor, \(E\) is the error function (commonly measured as the mean squared error), and \(\Delta {w}_{ji}\left(t\right)\) is the previous weight change. The training process aims to minimize the root mean squared error (RMSE), calculated by:

$$RMSE = \sqrt {\frac{1}{n}\sum\nolimits_{i = 1}^{n} {(c_{i} - \widehat{{c_{i} }})^{2} } }$$

(3)

where \(n\) is the number of flood sample points, and \({c}_{i}\) and \(\widehat{{c}_{i}}\) refer to the observed and modelled flood susceptibility values, respectively81.

In the current study, the MLP model was configured with 19 input neurons corresponding to the 19 conditioning factors. Following the guidelines proposed by Sheela and Deepa107, the first hidden layer was designed to contain 39 neurons. The MLP was trained using the backpropagation algorithm with the Levenberg–Marquardt optimization method, dividing the dataset into 70% for training and 30% for validation. The training was conducted over a maximum of 1000 epochs, utilizing a learning rate of 0.01 and a momentum of 0.9 to enhance convergence108,109.
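The momentum update and the RMSE criterion can be illustrated in a few lines. The sketch assumes the standard momentum form, in which the new weight change is −η·∂E/∂w plus α times the previous change, and uses the learning rate (0.01) and momentum (0.9) reported above. The toy error function E(w) = 0.5·w² and the function names are hypothetical.

```python
import numpy as np

def momentum_step(w, grad, prev_delta, eta=0.01, alpha=0.9):
    """One weight update: delta = -eta * dE/dw + alpha * previous delta."""
    delta = -eta * grad + alpha * prev_delta
    return w + delta, delta

def rmse(observed, modelled):
    """Root mean squared error between observed and modelled values."""
    observed = np.asarray(observed, dtype=float)
    modelled = np.asarray(modelled, dtype=float)
    return float(np.sqrt(np.mean((observed - modelled) ** 2)))

# Toy error surface E(w) = 0.5 * w**2, whose gradient is simply w.
w, delta = 1.0, 0.0
history = []
for _ in range(2):
    w, delta = momentum_step(w, grad=w, prev_delta=delta)
    history.append(w)

error = rmse([1.0, 0.0, 1.0], [0.5, 0.5, 0.5])
```

The second step moves further than the first because the momentum term carries over 90% of the previous change, which is the mechanism intended to speed up convergence along consistent gradient directions.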

Biogeography-based optimization (BBO)

The Biogeography-Based Optimization (BBO) algorithm, introduced by Simon in 2008, is a novel optimization technique inspired by evolutionary biology concepts such as migration, speciation, and extinction100,101,102. These concepts are fundamental to biogeography, which studies the spatial distribution of biological species and the factors influencing this distribution75,103. BBO shares similarities with other optimization algorithms, including Genetic Algorithms (GAs) and Particle Swarm Optimization (PSO), leveraging the principles of natural selection and adaptation to solve complex optimization problems76,77.

The implementation of BBO consists of two primary stages: migration and mutation. The migration stage is a probabilistic operation that utilizes both emigration and immigration rates to facilitate the sharing of features between solutions, or habitats104,105. In this context, let \({y}_{k}\) represent a solution chosen for modification, and \({y}_{j}\) be another solution from which a feature \(\left(S\right)\) is selected based on its emigration rate. The migration operation can be mathematically represented as follows:

$${y}_{k}\leftarrow {y}_{j}(S)$$

(4)

In addition to migration, a mutation step is applied to introduce random changes that help maintain diversity. This mutation can be modeled as:

$${y}_{k}^{new}= {y}_{k}+m\cdot \xi$$

(4a)

where m is the mutation rate and ξ is a stochastic variable representing random perturbations.

The selection probabilities for both the solutions and features are determined by their respective emigration and immigration rates, which are calculated based on the number of species present in each habitat77,101. Generally, a higher number of species correlates with a higher emigration rate and a lower immigration rate, reflecting the dynamics of species distribution in nature77.

The mutation stage introduces random alterations to the solutions, which is essential for maintaining diversity within the population. The mutation rate m is typically inversely related to the fitness of the solution, ensuring that less fit solutions undergo more significant changes to enhance their potential for improvement79,102.

In the present study, the BBO algorithm was employed to predict flood susceptibility. The parameters set for the BBO algorithm included a population size of 50 habitats, 100 iterations, a mutation rate of 0.01, and an elitism parameter of 2. These settings were chosen to balance exploration and exploitation within the optimization process, allowing for effective convergence towards optimal solutions80,106.
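To make the migration, mutation, and elitism steps concrete, the following is a minimal BBO sketch using the reported settings (50 habitats, 100 iterations, mutation rate 0.01, 2 elites). It is not the authors' implementation: the objective is a stand-in sphere function rather than a flood model, the linear rank-based migration rates are one common choice, and mutation is applied with probability `mutation_rate` as a clipped Gaussian perturbation. All names and the search bounds are hypothetical.

```python
import numpy as np

def bbo_minimize(cost, dim, bounds=(-5.0, 5.0), pop_size=50, n_iter=100,
                 mutation_rate=0.01, n_elite=2, seed=0):
    rng = np.random.default_rng(seed)
    low, high = bounds
    pop = rng.uniform(low, high, size=(pop_size, dim))
    # Linear migration model on fitness rank (rank 0 = best habitat):
    # good habitats emigrate (share) more and immigrate (accept) less.
    emigration = 1.0 - np.arange(pop_size) / (pop_size - 1)
    immigration = 1.0 - emigration
    share_prob = emigration / emigration.sum()
    for _ in range(n_iter):
        pop = pop[np.argsort([cost(h) for h in pop])]    # best habitat first
        elites = pop[:n_elite].copy()
        new_pop = pop.copy()
        for k in range(pop_size):
            for d in range(dim):
                if rng.random() < immigration[k]:        # migration: y_k <- y_j(S)
                    j = rng.choice(pop_size, p=share_prob)
                    new_pop[k, d] = pop[j, d]
                if rng.random() < mutation_rate:         # mutation: random perturbation
                    step = rng.normal(scale=0.1 * (high - low))
                    new_pop[k, d] = np.clip(new_pop[k, d] + step, low, high)
        # Elitism: the previous best habitats replace the current worst ones.
        worst = np.argsort([cost(h) for h in new_pop])[-n_elite:]
        new_pop[worst] = elites
        pop = new_pop
    costs = np.array([cost(h) for h in pop])
    return pop[np.argmin(costs)], float(costs.min())

def sphere(h):
    """Stand-in objective; a real application would score a model instead."""
    return float(np.sum(h ** 2))

best, best_cost = bbo_minimize(sphere, dim=3)
```

Because the elites are copied back every generation, the best cost can never get worse from one iteration to the next, which is the practical role of the elitism parameter.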

J48 decision tree

The J48 decision tree algorithm, also known as C4.5, is a widely recognized machine learning algorithm employed for classification tasks. This algorithm constructs a hierarchical tree-like structure where each internal node signifies a decision based on an attribute, while each leaf node indicates the class label. The recursive partitioning of the dataset into subsets is based on the attribute values, with the objective of maximizing information gain or minimizing impurity at each step60,110. One of the significant advantages of the J48 model is its ability to handle mixed data types, as highlighted in Ref. 62. Additionally, it incorporates an automatic feature selection method, which is beneficial for reducing dimensionality and improving model performance63. The robustness of the J48 algorithm to noise is another critical advantage, making it suitable for real-world applications where data may be imperfect64. Furthermore, its divide-and-conquer approach contributes to its scalability, allowing it to efficiently manage large datasets65.

In practical applications, the J48 model has been implemented using the WEKA data mining software, where it is accessible under the classifier name “weka.classifiers.trees.J48.” This open-source software provides various parameterization options for the J48 classifier. For the current work, specific parameters were set: a confidence factor (C) of 0.25, a minimum number of instances per leaf (M) of 2, and the unpruned option set to false. This configuration allows the model to randomize the data to mitigate bias without eliminating smaller values, thus enhancing the robustness of the classification66. The model was trained using a tenfold cross-validation technique, which is a standard method for assessing the performance of machine learning models by ensuring that the model is tested on unseen data67.

The efficacy of the J48 algorithm has been demonstrated across various domains. For instance, it has been successfully applied in medical informatics for predicting conditions such as diabetes and autism spectrum disorder, showcasing its versatility and effectiveness in handling diverse datasets69,111. Furthermore, studies have indicated that the J48 classifier often outperforms other algorithms in terms of accuracy and reliability, particularly in scenarios involving complex datasets71,72. The algorithm’s ability to produce interpretable models is also a significant advantage, as it allows practitioners to understand the decision-making process behind the classifications73.

The decision criterion at each node is often based on the entropy measure:

$$Entropy \left(S\right)= - {\sum }_{i=1}^{c}{p}_{i}{\log}_{2}({p}_{i})$$

(5)

where S is a subset of samples, c is the number of classes, and \({p}_{i}\) is the proportion of samples in S belonging to class i. The information gain for a split on attribute A is given by:

$$Gain \left(S,A\right)=Entropy \left(S\right)- {\sum }_{v\in Values \left(A\right)}\frac{\left|{S}_{v}\right|}{\left|S\right|}Entropy ({S}_{v})$$

(6)

where Sv represents the subset of samples where attribute A takes the value v. This measure guides the tree construction and pruning process, which is performed using techniques like tenfold cross-validation to enhance generalization.
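Equations (5) and (6) can be checked on a toy sample with a few lines of Python. The labels (1 = flood, 0 = non-flood) and the attribute values below are made up for illustration, and the function names are not taken from WEKA.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum_i p_i * log2(p_i) over the classes present in S."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())

def information_gain(labels, attribute_values):
    """Gain(S, A) = Entropy(S) - sum_v |S_v| / |S| * Entropy(S_v)."""
    n = len(labels)
    subsets = {}
    for label, value in zip(labels, attribute_values):
        subsets.setdefault(value, []).append(label)
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

labels = [1, 1, 0, 0]                                   # flood / non-flood
gain_perfect = information_gain(labels, ["low", "low", "high", "high"])
gain_useless = information_gain(labels, ["a", "b", "a", "b"])
```

An attribute that separates the classes perfectly recovers the full 1 bit of entropy, while one that leaves each subset as mixed as the parent yields a gain of 0, so the first attribute would be chosen for the split.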

In this work, the J48 model was run in the WEKA data mining software, where the classifier is available as “weka.classifiers.trees.J48”. Among the parameterization options offered by the module, the seed option was used with the unpruned parameter disabled. In other words, the model randomizes the data to reduce bias without removing the smaller values. The classifier was used with a confidence factor (C) of 0.25 and a minimum of 2 instances per leaf (M), and was trained using tenfold cross-validation on the training dataset.

Maximum entropy (MaxEnt) model

The Maximum Entropy (MaxEnt) model, introduced by Phillips, Anderson, and Schapire in 2006, is a powerful tool for ecological modeling and species distribution assessment. This model is particularly effective in making predictions from incomplete data, which is a common challenge in ecological studies86. The MaxEnt model operates on the principle of maximizing entropy, which allows it to derive a probability distribution that reflects the constraints imposed by the available environmental data112.

The MaxEnt model begins with a uniform distribution and iteratively adjusts this distribution based on significant conditioning factors derived from the observed data87. The mathematical formulation of the MaxEnt model can be expressed as follows:

$$\text{P}\left(\text{y}=1|\text{x}\right)=\frac{\text{P}\left(\text{y}=1\right)\text{P}(\text{x}|\text{y}=1)}{\text{P}(\text{x})}=\frac{\text{P}(\text{y}=1)\upphi (\text{x})}{1/\left|\text{x}\right|}$$

(7)

In this equation, \(P(y = 1|x)\) represents the probability of an event occurring at a specific location \(x\), while \(P(y = 1)\) denotes the prevalence of the event across the study area. The term \(|x|\) indicates the total number of pixels in the study area, and \(\phi(x)\) is a function that incorporates the conditioning factors relevant to the model88.

The model’s primary goal is to estimate the probability distribution of an event, such as flood occurrence, by maximizing entropy subject to the constraints derived from environmental data. The probability of flood occurrence at a location (x) can be mathematically represented as:

$$\Pr\left(y=1|x\right)=\frac{\exp\left(\lambda \cdot f\left(x\right)\right)}{Z\left(\lambda \right)}$$

(8)

Where:

  • Pr(y = 1|x) is the probability of flood occurrence at location x.

  • λ is a vector of weights for the features.

  • f(x) is a vector of features (conditioning factors) at location x.

  • Z(λ) is the normalizing constant (partition function), which is a function of the weight vector λ.

The training of the MaxEnt model involves optimizing the values of λ to maximize the likelihood of the observed data, which is crucial for accurate predictions113. In the present study, the MaxEnt model was implemented using the MaxEnt software (version 3.4.1) with specific settings: a random test percentage of 30%, a regularization multiplier of 1, a maximum of 500 iterations, a convergence threshold of 0.00001, and an output format set to logistic.
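The exponential form with a normalizing constant can be illustrated as below. This sketch only evaluates the distribution for a given weight vector λ; it does not reproduce the fitting procedure of the MaxEnt software. It assumes that Z(λ) is the sum of exp(λ·f(x)) over all pixels, and the feature values and weights are invented for the example.

```python
import numpy as np

def gibbs_distribution(features, lam):
    """exp(lam . f(x)) / Z(lam) for every pixel, with Z summed over all pixels."""
    scores = np.asarray(features, dtype=float) @ np.asarray(lam, dtype=float)
    scores = scores - scores.max()        # stabilise the exponentials
    weights = np.exp(scores)
    return weights / weights.sum()

# Four pixels described by two (scaled) conditioning factors.
features = np.array([[0.1, 0.9],
                     [0.4, 0.5],
                     [0.8, 0.2],
                     [0.9, 0.1]])
uniform = gibbs_distribution(features, lam=[0.0, 0.0])
fitted = gibbs_distribution(features, lam=[2.0, -1.0])   # hypothetical weights
```

With λ = 0 the result is the uniform distribution that MaxEnt starts from; non-zero weights then shift probability toward pixels whose features agree with the sign of each weight.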

Random subspace

The Random Subspace (RSP) ensemble method, first introduced in Ref. 114, is a widely utilized sampling technique in various fields, including banking, computer science, and medical science115. This method has also found applications in earth sciences, enhancing the performance of weak classifiers and improving their accuracy116,117. The RSP method operates by randomly sampling a high-dimensional feature space to create low-dimensional subsets, known as subspaces, which are then used to train multiple classifiers. The final decision is made based on the majority votes from these classifiers118.

The RSP method can be summarized through a systematic approach. Let X = {x_1, x_2, x_3, …, x_n} represent a feature set with n features, where X is the vector of independent variables (the conditioning factors). The process begins by drawing L samples, each of size M, without replacement. Each subset drawn represents a subspace of cardinality M. Subsequently, classifiers are trained using either the entire feature set X or a subset (subspace) of it. The final classification decision is determined by the majority voting among the classifiers trained on these subspaces119.

In this study, the Random Subspace method was implemented using WEKA data mining software, with the REPTree algorithm serving as the base classifier. The parameters set for the implementation included:

  • Number of iterations L : 10.

  • Subspace size M : 50% of the total number of attributes (i.e., 9 or 10 attributes were randomly selected for each subspace).

  • Base classifier: REPTree12.

The RSP ensemble method is particularly effective in high-dimensional spaces, where it helps mitigate the curse of dimensionality by reducing the correlation among base learners through random feature selection120,121. This characteristic not only enhances the robustness of the model but also improves its generalization capabilities across various applications, including classification tasks in complex datasets122.
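The random subspace procedure described above can be sketched in a few lines of plain Python. This is an illustration only: the study used WEKA with REPTree as the base classifier, whereas the sketch substitutes a one-level decision stump as a stand-in base learner. The function names and the toy data are hypothetical.

```python
import random
from collections import Counter

def train_stump(X, y, feature_ids):
    """Stand-in base learner (not REPTree): the single-feature threshold
    rule with the fewest training errors among `feature_ids`."""
    best = None
    for j in feature_ids:
        for t in sorted({row[j] for row in X}):
            for polarity in (0, 1):
                preds = [polarity if row[j] > t else 1 - polarity for row in X]
                errors = sum(p != label for p, label in zip(preds, y))
                if best is None or errors < best[0]:
                    best = (errors, j, t, polarity)
    return best[1:]  # (feature index, threshold, polarity)

def predict_stump(stump, row):
    j, t, polarity = stump
    return polarity if row[j] > t else 1 - polarity

def random_subspace_fit(X, y, n_learners=10, subspace_frac=0.5, seed=0):
    """Train L learners, each on M features drawn without replacement."""
    rng = random.Random(seed)
    n_features = len(X[0])
    m = max(1, round(subspace_frac * n_features))
    return [train_stump(X, y, rng.sample(range(n_features), m))
            for _ in range(n_learners)]

def random_subspace_predict(ensemble, row):
    """Majority vote; on a tie, the label seen first among the votes wins."""
    votes = Counter(predict_stump(s, row) for s in ensemble)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: 4 conditioning factors, 1 = flood, 0 = non-flood.
X = [[1.0, 0.2, 5.0, 0.9], [1.2, 0.1, 4.5, 0.8], [0.9, 0.3, 5.5, 0.7],
     [3.0, 0.8, 1.0, 0.2], [3.5, 0.9, 1.5, 0.1], [2.8, 0.7, 0.5, 0.3]]
y = [1, 1, 1, 0, 0, 0]
ensemble = random_subspace_fit(X, y, n_learners=10, subspace_frac=0.5)
flood_vote = random_subspace_predict(ensemble, [1.1, 0.25, 4.8, 0.85])
dry_vote = random_subspace_predict(ensemble, [3.2, 0.85, 1.2, 0.15])
```

Here `n_learners` plays the role of L and `subspace_frac` determines M, mirroring the 10 iterations and 50% subspace size reported above.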

Model performance evaluation

The performance evaluation of the models involved in this study was conducted using both cut-off-independent and cut-off-dependent methods. The Receiver Operating Characteristics (ROC) curve is a cut-off-independent evaluation method that is widely recognized for its reliability and robustness in assessing model performance123,124. In contrast, cut-off-dependent evaluation metrics, such as accuracy, F-score, sensitivity, specificity, odds ratio, and Cohen’s Kappa, were utilized alongside ROC to provide a comprehensive assessment of the models’ performance125,126. A thorough model evaluation necessitates the use of both dependent and independent metrics, as highlighted by127, who reviewed various metrics and clarified their significance.

The ROC curve is particularly useful for understanding a model’s ability to discriminate between positive and negative classes. For instance, if an end-user agency is interested in how often the model incorrectly flags non-flood locations as flooded, they would focus on the false positive rate (FPR), which is calculated as \(FPR = 1 - \text{Specificity}\)128. Conversely, if the agency seeks to understand the overall error rate, they would examine the misclassification rate derived from the cut-off-dependent indices.

To facilitate the evaluation process, a confusion matrix was constructed for both training and validation datasets, organized in a 2 × 2 format to analyze four possible outcomes: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). These outcomes are critical for calculating various performance metrics, including sensitivity, specificity, FPR, false discovery rate (FDR), false negative rate (FNR), accuracy, precision, True Skill Statistics (TSS), and F1-score. The equations for these metrics are as follows:

$$TPR= Sensitivity=\frac{TP}{(TP+FN)}$$

(9)

$$Specificity= \frac{TN}{(TN+FP)}$$

(10)

$$FPR = \frac{FP}{(TN+FP)}=(1-Specificity)$$

(11)

$$FDR = \frac{FP}{\left(TP+FP\right)}=(1-Precision)$$

(12)

$$FNR = \frac{FN}{\left(FN+TP\right)} =(1-Sensitivity)$$

(13)

$$Accuracy=\frac{TP+TN}{(TP+TN+FN+FP)}$$

(14)

$$Precision=\frac{TP}{TP+FP}$$

(15)

$$TSS= Sensitivity + Specificity - 1$$

(16)

$$F1\text{-}Score = 2\times \frac{(Precision\times Sensitivity)}{(Precision+Sensitivity)}$$

(17)
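Equations 9 to 17 translate directly into code. The sketch below computes them from the four confusion-matrix counts; the function name and the example counts are hypothetical and are not results from this study.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Cut-off-dependent metrics from a 2 x 2 confusion matrix (Eqs. 9-17)."""
    sensitivity = tp / (tp + fn)   # Eq. 9 (TPR)
    specificity = tn / (tn + fp)   # Eq. 10
    precision = tp / (tp + fp)     # Eq. 15
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "fpr": fp / (tn + fp),                        # Eq. 11
        "fdr": fp / (tp + fp),                        # Eq. 12
        "fnr": fn / (fn + tp),                        # Eq. 13
        "accuracy": (tp + tn) / (tp + tn + fn + fp),  # Eq. 14
        "precision": precision,
        "tss": sensitivity + specificity - 1,         # Eq. 16
        "f1": 2 * precision * sensitivity / (precision + sensitivity),  # Eq. 17
    }

# Hypothetical counts for illustration only.
m = confusion_metrics(tp=80, fp=10, tn=90, fn=20)
```

With these counts, sensitivity is 0.8, specificity is 0.9, accuracy is 0.85, and TSS is 0.7. The function does not guard against empty classes (zero denominators), which a production implementation would need to handle.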

In this study, the Area Under the Receiver Operating Characteristics (AUROC) curve was employed to evaluate the predictive value of the models, with values ranging from 0.5 (indicating no discrimination) to 1.0 (indicating perfect discrimination)129. The AUROC values can be categorized into four classes: excellent (0.9–1.0), good (0.8–0.9), fair (0.7–0.8), and poor (0.6–0.7) (Medrano et al., 2010). The calculation of AUROC is expressed as follows:

$$AUROC= \frac{\sum TP+\sum TN}{P+N}$$

(18)

where (P) represents the predicted cases, (O) refers to observed values, and (N) is the total number of cases130.
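For comparison, the area under the ROC curve can also be computed directly from predicted scores using its standard rank-based interpretation: the probability that a randomly chosen flood location receives a higher score than a randomly chosen non-flood location. The sketch below implements that textbook definition, not necessarily the exact expression above; the labels and scores are hypothetical.

```python
def auroc(labels, scores):
    """Rank-based AUROC: fraction of (flood, non-flood) pairs in which the
    flood location has the higher score; ties count as half a pair."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical susceptibility scores for three flood and three non-flood cells.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
area = auroc(labels, scores)
```

In this example, 8 of the 9 flood/non-flood pairs are ranked correctly, giving an AUROC of 8/9. The pairwise loop is quadratic in the sample size, so it suits illustration rather than large rasters.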

Additionally, Cohen’s Kappa statistic was employed to measure the agreement between the two classification sets while accounting for randomness in classification. The Kappa statistic is calculated as follows:

$$K= \frac{{P}_{obs}-{P}_{exp}}{1-{P}_{exp}}$$

(19)

where \({P}_{obs}\) is the observed agreement and \({P}_{exp}\) is the expected agreement based on chance131. The Kappa value typically ranges from 0 to 1, with lower values indicating less agreement and higher values indicating a near-perfect prediction132; negative values are possible and indicate agreement worse than chance.
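As a short sketch of Eq. 19 for the binary case, the function below derives the observed and expected agreement from the confusion-matrix counts. The expected agreement uses the usual product of marginal proportions; the function name and the example counts are hypothetical.

```python
def cohens_kappa(tp, fp, tn, fn):
    """Cohen's Kappa for a 2 x 2 confusion matrix (Eq. 19)."""
    n = tp + fp + tn + fn
    p_obs = (tp + tn) / n
    # Chance agreement: product of predicted and observed marginals per class.
    p_exp = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical counts for illustration only.
kappa = cohens_kappa(tp=80, fp=10, tn=90, fn=20)
```

With these counts the observed agreement is 0.85 and the expected agreement is 0.5, so Kappa is 0.7.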

To further assess the classification accuracy of the models, the Seed Cell Area Index (SCAI) method was utilized. This index is calculated as the ratio of each classified class to the susceptible seed cell percentage values:

$$SCAI(\%)=\frac{\frac{{N}_{pix}({X}_{j})}{{\sum }_{j=1}^{n}{N}_{pix}({X}_{j})} \times 100 \ \text{(area ratio)}}{\frac{{N}_{pix}({SX}_{i})}{{\sum }_{i=1}^{m}{SX}_{i}} \times 100 \ \text{(flood susceptible occurrence ratio)}}$$

(20)

where \({N}_{pix}\left({SX}_{i}\right)\) is the number of pixels with flood occurrence cases within class i of factor variable X, and \({N}_{pix}\left({X}_{j}\right)\) refers to the number of pixels within the factor variable \({X}_{j}\) (Fernando et al., 2019). Here, m indicates the number of classes in the parameter variable \({X}_{i}\), and n represents the number of factors in the study area.

A low SCAI value indicates a high susceptibility class, while high SCAI values represent low susceptibility classes, thereby confirming the accuracy of the model’s classification133.
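The SCAI calculation can be illustrated with a small sketch that divides the area percentage of each susceptibility class by the percentage of flood (seed) pixels falling in that class. The class names and pixel counts below are hypothetical and do not come from this study.

```python
def scai(class_pixels, flood_pixels):
    """SCAI per class: area percentage divided by flood-pixel percentage."""
    total_area = sum(class_pixels.values())
    total_flood = sum(flood_pixels.values())
    result = {}
    for name, area in class_pixels.items():
        area_pct = 100 * area / total_area
        seed_pct = 100 * flood_pixels[name] / total_flood
        # A class with no flood pixels has an undefined ratio; report infinity.
        result[name] = area_pct / seed_pct if seed_pct > 0 else float("inf")
    return result

# Hypothetical pixel counts per susceptibility class.
class_pixels = {"very low": 4000, "low": 3000, "moderate": 1500,
                "high": 1000, "very high": 500}
flood_pixels = {"very low": 10, "low": 40, "moderate": 150,
                "high": 300, "very high": 500}
scai_values = scai(class_pixels, flood_pixels)
```

In this example SCAI falls from 40 for the very low class to 0.1 for the very high class, which is the decreasing pattern expected of a well-behaved susceptibility map.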


