All computations were performed on a system equipped with an Intel Core i7 processor (11th Gen) running at 3.2 GHz, 16 GB RAM, and a 1 TB SSD for efficient processing. The software environment included Python (version 3.10), QGIS (version 3.28), and specialized libraries such as Scikit-learn and TensorFlow for machine learning. ArcGIS Pro was utilized for geospatial analysis and mapping. These hardware and software configurations ensured smooth execution of flood susceptibility modeling and data processing tasks, contributing to the accuracy and efficiency of the study’s results. The ANN and RF models used in this study exhibit high adaptability and generalizability, allowing their application to varied geographic regions and environmental conditions. Their ability to incorporate diverse flood-conditioning parameters ensures robust modeling across different terrains.
However, successful generalization depends on regional calibration using localized data to account for unique climatic, hydrological, and topographical features. By tailoring input parameters and validating results with regional datasets, these models can effectively predict flood susceptibility in other areas, supporting their broader application in global flood risk management and environmental planning. According to the methodological flow chart in Fig. 2, the current study comprised the flood inventory map, the development of flood conditioning factors, the evaluation of the flood conditioning factors using the information gain ratio and a multi-collinearity test, and flood susceptibility modelling using machine learning techniques (ANN and RF)11. Preparing a flood inventory map for the study region is the first stage in creating a map of flood vulnerability. The conditions at the flooded areas were recorded and investigated using readily available GIS technology. Developing an accurate flood susceptibility model is frequently difficult because numerous geographic, topographical, and hydrological parameters are required. Although SHAP analysis offers a comprehensive and model-agnostic interpretation of feature importance, the current study employed the information gain ratio due to its simplicity, computational efficiency, and successful application in prior flood susceptibility research. SHAP was considered; however, it was not implemented to maintain model simplicity and focus on interpretable, well-established methods. Future studies may incorporate SHAP for deeper insights into model behavior.

The entire methodology flow chart for Flood Susceptibility Mapping for Chennai Metropolitan Area, Tamil Nadu, India.
Determining the causes of flooding is therefore crucial, and systematically selected factors confirm the correctness of flood susceptibility maps. Based on the currently available flood susceptibility literature, twelve flood-conditioning factors were selected for the study area: elevation, lithology, slope, aspect, topographic wetness index (TWI), terrain ruggedness index (TRI), sediment transport index (STI), stream power index (SPI), land use/land cover (LULC), distance from the river (DR), soil type, and rainfall. Their selection was based on their established relevance to flooding in Chennai’s urban landscape, and they were statistically validated using multicollinearity diagnostics (VIF) and the information gain ratio to confirm their predictive strength.
While over 20 factors are generally used, the selected parameters capture critical aspects of topography, hydrology, and land use, ensuring robust and efficient modeling. This focused approach enhances computational efficiency without compromising the accuracy or reliability of the flood susceptibility analysis. Ethical considerations were carefully addressed in this study. The data used, including flood records and geospatial parameters, were sourced from publicly accessible and authorized government repositories, ensuring compliance with legal and ethical standards. The study did not involve personal or sensitive data, protecting individual privacy and community interests. Additionally, the research adhered to ethical guidelines for data use and analysis, ensuring no harm or misuse of information. These measures ensure the integrity and transparency of the research process while contributing to sustainable urban planning and disaster management efforts.
All influencing factors were converted to raster format at a spatial resolution of 30 m. Topographic factors must be taken into account when modelling floods since they have a direct and indirect impact on the hydrological features of the study region. First, a Digital Elevation Model (DEM) of the research basin was produced in the ArcGIS 10.8 environment using the ASTER GDEM (Version 2). These data were processed to create a comprehensive representation of the Earth’s surface, enabling the visualization and analysis of terrain features for applications in cartography, engineering, and environmental studies. The topographic parameters slope, aspect, TWI, SPI, STI, and TRI were all derived from the DEM in the ArcGIS environment.
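As an illustration of how terrain derivatives follow from a DEM grid, the sketch below computes slope and aspect with finite differences in numpy, roughly analogous to the ArcGIS Slope and Aspect tools. The tiny DEM, the 30 m cell size, and the aspect convention (degrees clockwise from north of the steepest downhill direction) are illustrative assumptions, not the study’s actual workflow.

```python
import numpy as np

def slope_aspect(dem, cellsize=30.0):
    """Derive slope (degrees) and aspect (degrees clockwise from north) from a DEM grid."""
    # Gradients along rows (y, increasing southward) and columns (x, eastward),
    # in elevation units per metre of ground distance.
    dz_dy, dz_dx = np.gradient(dem, cellsize)
    slope = np.degrees(np.arctan(np.hypot(dz_dx, dz_dy)))
    # Bearing of the steepest downhill direction: east component is -dz_dx,
    # north component is +dz_dy (row axis points south).
    aspect = (np.degrees(np.arctan2(-dz_dx, dz_dy)) + 360.0) % 360.0
    return slope, aspect

# Tiny synthetic DEM (metres) standing in for an ASTER GDEM tile:
# elevation rises uniformly to the south, so the terrain drains north.
dem = np.array([[10.0, 10.0, 10.0],
                [20.0, 20.0, 20.0],
                [30.0, 30.0, 30.0]])
slope, aspect = slope_aspect(dem)
```

With a uniform 10 m rise per 30 m cell, every pixel gets the same slope (about 18.4 degrees) and a north-facing aspect.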
Elevation: Flooding and elevation are inversely associated; the higher the elevation, the less likely flooding is to occur, and vice versa. The Chennai Metropolitan Area’s elevation information was derived from the DEM data.
Slope: Another crucial factor influencing floods is the slope, which controls the speed of the water’s flow. With increasing slope angle, the likelihood of water stagnation is reduced, infiltration decreases, and flow velocity increases. The slope for the research area was generated from the DEM using the Slope tool of the Spatial Analyst toolbox in the ArcGIS software.
Aspect: Aspect determines the direction in which flood water moves and also influences soil humidity; it therefore has an indirect impact on flooding. Consider the shaded section of a slope, where the soil has a high relative humidity and runoff is significant. The Aspect tool in the Surface toolset was used to generate the aspect map from the DEM, analogous to the slope map.
Rainfall: Rainfall has been identified as one of the key factors influencing the likelihood of flooding, because flooding may follow even a brief period of heavy rain. We employed Inverse Distance Weighted (IDW) interpolation and rainfall data from four Chennai meteorological stations to produce rainfall maps in the ArcGIS 10.8 environment. The IDW approach was chosen because data were available for only four locations, and this method is strongly recommended when the amount of data is very small.
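The IDW scheme itself is simple enough to sketch directly. The function below is a minimal illustration (not the ArcGIS implementation), and the four gauge coordinates and rainfall totals are hypothetical values, not the study’s station data.

```python
import math

def idw(stations, target, power=2.0):
    """Inverse-distance-weighted estimate at `target` from (x, y, value) stations."""
    num = den = 0.0
    for x, y, value in stations:
        d = math.hypot(x - target[0], y - target[1])
        if d == 0.0:                 # target coincides with a gauge: return it directly
            return value
        w = 1.0 / d ** power         # closer gauges receive larger weights
        num += w * value
        den += w
    return num / den

# Hypothetical annual rainfall (mm) at four gauges (projected coordinates, km).
stations = [(0, 0, 1200.0), (10, 0, 1400.0), (0, 10, 1300.0), (10, 10, 1500.0)]
estimate = idw(stations, (5, 5))   # equidistant from all four gauges -> 1350.0
```

At the centre point all four distances are equal, so the estimate collapses to the plain average of the gauge values.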
TRI: The TRI is one of the most important factors driving flood events. It is based on the local topography of the research area; the chance of a flood increases with decreasing TRI values. The Focal Statistics tool was used to create minimum, maximum, and mean raster files from the DEM, and the TRI for the study area was then generated in the Raster Calculator tool using the following equation,
$$\text{TRI}=\frac{\text{Mean}-\text{Minimum}}{\text{Maximum}-\text{Minimum}}$$
(1)
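Eq. (1) can be evaluated with moving-window focal statistics, mirroring the Focal Statistics/Raster Calculator workflow. The numpy sketch below is a didactic stand-in (window size and edge padding are illustrative choices), not the ArcGIS tool itself.

```python
import numpy as np

def tri(dem, size=3):
    """TRI = (mean - min) / (max - min) over a moving window, as in Eq. (1)."""
    pad = size // 2
    padded = np.pad(dem, pad, mode='edge')   # replicate edges so every pixel has a full window
    out = np.zeros_like(dem, dtype=float)
    rows, cols = dem.shape
    for i in range(rows):
        for j in range(cols):
            win = padded[i:i + size, j:j + size]
            rng = win.max() - win.min()
            out[i, j] = (win.mean() - win.min()) / rng if rng > 0 else 0.0
    return out

# Synthetic 3x3 DEM: a single 9 m peak in a flat plain.
dem = np.array([[0.0, 0.0, 0.0],
                [0.0, 9.0, 0.0],
                [0.0, 0.0, 0.0]])
ruggedness = tri(dem)   # centre window: mean 1, min 0, max 9 -> TRI = 1/9
```

Flat neighbourhoods (range zero) are assigned TRI = 0 here; how flats are handled is an implementation choice.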
TWI: The TWI expresses the spatial variation of wetness within a basin and is a significant determinant of flood likelihood. This index indicates the amount of water present in each individual pixel of the area. The TWI is computed using the following equation,
$$TWI = \text{ln}\left( \frac{{A}_{s}}{\tan \beta } \right)$$
(2)
As and β represent, respectively, the specific catchment area (m2 m−1) and the slope gradient (in degrees). In general, there is a direct correlation between floods and high TWI values.
SPI: The SPI has a substantial effect on the fluvial system. It is determined using the equation below,

$$SPI = {A}_{s} \times \tan \beta$$

(3)

where As stands for the specific catchment area and β denotes the slope gradient (in radians). Overall, the SPI describes both the bed’s erodibility and the stream’s capacity to transport sediment.
STI: Another factor that might cause flooding is the STI, which can increase the frequency of flooding and cause damage to foundations. The STI is derived from the DEM using the following equation,
$$STI={\left(\frac{{A}_{s}}{22.13}\right)}^{2} \times {\left(\frac{\text{sin}\,\beta }{0.0896}\right)}^{2}$$
(4)
where As stands for the upstream contributing area and β designates the slope of each pixel. The hydro-climatic and geomorphologic parameters of the basin region are used to calculate the STI. As sediment is deposited, the channel’s bed shifts, reducing the channel’s capacity to hold water and leading to flooding.
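Given per-pixel grids of specific catchment area and slope, the three hydrological indices can be computed together. The sketch below assumes the commonly used SPI form As·tanβ and keeps the STI exponents as printed above; the flat-terrain guards and the scalar test values are illustrative assumptions.

```python
import numpy as np

def terrain_indices(catchment_area, slope_deg):
    """Per-pixel TWI, SPI and STI from specific catchment area As (m^2/m)
    and slope beta (degrees)."""
    beta = np.radians(slope_deg)
    tan_b = np.maximum(np.tan(beta), 1e-6)   # guard against division by zero on flats
    As = np.maximum(catchment_area, 1e-6)    # guard against log of zero
    twi = np.log(As / tan_b)                 # TWI = ln(As / tan(beta))
    spi = As * tan_b                         # SPI, common As * tan(beta) form (assumed)
    sti = (As / 22.13) ** 2 * (np.sin(beta) / 0.0896) ** 2   # STI, exponents as printed
    return twi, spi, sti

# One illustrative pixel: As = 100 m^2/m on a 45-degree slope.
twi, spi, sti = terrain_indices(np.array([100.0]), np.array([45.0]))
```

At 45 degrees tanβ = 1, so TWI reduces to ln(100) and SPI to 100, which makes the pixel easy to check by hand.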
Land use/Land cover: Flood frequency is directly impacted by LULC because it affects sediment transportation and surface runoff; the formation and infiltration of surface runoff are directly controlled by the LULC. Where the land cover prevents infiltration and promotes the accumulation of surface water, flooding occurs more frequently. In the present study, a Land Use Land Cover (LULC) map was created using Landsat OLI (Operational Land Imager) satellite images from 13.06.2023. The artificial neural network (ANN) technique was employed for this purpose, and the analysis was conducted using ENVI software (version 5.3) at a spatial resolution of 30 m. This approach allowed for accurate classification and mapping of the different land cover types in the study area, providing valuable information for environmental and land management applications. The LULC map was divided into five categories: agricultural land, waste land, urban area, grassland, and water body.
Distance to the river: The majority of flooded places are typically found close to rivers. Because the distance from the river affects the likelihood of flooding, it is a crucial element in delineating the research area’s flood-prone zones; the likelihood of flood events decreases with increasing distance from the river. In the current study, we digitized the main river in Google Earth Pro, converted the KML file into a shapefile in the ArcGIS environment, and created buffers around the main river to calculate the distance from the river.
Soil: Soil is one of the major factors determining how rainfall-runoff behaves. While other factors, such as the local climate and the erosion process, also influence how rainfall-runoff forms, soil properties directly control water infiltration. The higher the rate of soil infiltration, the less frequently flooding happens. The soil map used for this investigation was digitised from the map provided by the National Bureau of Soil Survey.
Lithology: Lithology, the study of rock properties, affects flood behaviour. Sand and other porous lithologies reduce flood risk by absorbing water, whereas impermeable, clay-like rocks promote surface runoff, which heightens floods. By understanding lithology, flood extent can be predicted and effective drainage systems can be planned for better flood control. The lithology map was prepared using the Geological Survey of India lithology map.
Method for flood influencing factors using Information gain ratio and multicollinearity test
It is essential to evaluate the importance of each flood-affecting parameter for the probability of flooding before beginning the model’s training stage. Based on each parameter’s statistical traits and connection to the floods, its relative importance has been determined. The Information Gain Ratio (InGR) approach has been used to determine the influential factors for FSM prediction. An InGR value is assigned to each influencing element in order to quantify its significance, with higher InGR values indicating more pertinent influencing elements. The decision to use the InGR model in the current experiment was based on its simplicity and effectiveness. The InGR model is well-suited to the research objectives and provides valuable insights into the relevant influencing factors, making it an appropriate choice for the study. It is computed using the following equation:
$$Gain\, Ratio \left(x, Z\right)=\frac{Entropy \left(Z\right)- {\sum }_{i=1}^{n}\frac{\left|{Z}_{i}\right|}{\left|Z\right|}\, Entropy \left({Z}_{i}\right)}{-\sum_{i=1}^{n}\frac{\left|{Z}_{i}\right|}{\left|Z\right|}\,\text{log}\frac{\left|{Z}_{i}\right|}{\left|Z\right|}}$$
(5)
where the attribute x splits the training set Z into subsets Zi (i = 1, 2, 3, …, n). Influencing factors have been evaluated for all probability models using a range of multicollinearity tests, such as variance decomposition proportions, condition index, VIF, and tolerance. We utilised Pearson’s correlation coefficient and the VIF to determine the respective weights of the twelve flood-conditioning factors in this investigation. A VIF greater than 9 indicates a multicollinearity problem among the factors. Therefore, if a conditioning factor’s VIF value is more than 9, it is highly recommended to leave it out of the model.
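Both screening steps can be sketched compactly. The first function implements the gain ratio of Eq. (5) for a categorical factor; the second estimates each factor’s VIF as 1/(1 − R²) from a least-squares regression on the remaining factors. The toy inputs are hypothetical, and the study’s own factors are continuous rasters that would first be binned or sampled.

```python
import math
from collections import Counter
import numpy as np

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(feature_values, labels):
    """Information gain ratio of a factor against flood/non-flood labels, as in Eq. (5)."""
    n = len(labels)
    subsets = {}
    for v, y in zip(feature_values, labels):
        subsets.setdefault(v, []).append(y)
    cond = sum(len(s) / n * entropy(s) for s in subsets.values())        # weighted child entropy
    split_info = -sum(len(s) / n * math.log2(len(s) / n) for s in subsets.values())
    return (entropy(labels) - cond) / split_info if split_info > 0 else 0.0

def vif(X):
    """VIF of each column of a factor matrix X, via R^2 against the other columns."""
    X = np.asarray(X, dtype=float)
    out = []
    for k in range(X.shape[1]):
        y = X[:, k]
        A = np.column_stack([np.delete(X, k, axis=1), np.ones(len(y))])  # others + intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        r2 = 1.0 - (y - A @ coef).var() / y.var()
        out.append(1.0 / (1.0 - r2) if r2 < 1.0 else float('inf'))
    return out

# Toy check: a factor that perfectly separates flood (1) from non-flood (0) points.
gr = gain_ratio(['low', 'low', 'high', 'high'], [1, 1, 0, 0])   # -> 1.0
# Toy check: two uncorrelated binary columns give VIF = 1 for both.
vifs = vif([[1, 0], [0, 1], [1, 1], [0, 0]])
```

A factor whose VIF came out above 9 here would be dropped, per the rule stated above.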
Flood susceptibility modelling-ANN
The three-layer ANN model (input, hidden, and output layers) utilised in the current work, trained with Back-Propagation (BP) and error-correction learning, has been applied effectively in flood susceptibility modelling. In this investigation, the input layer was configured with the same number of nodes as the conditioning parameters, and ten hidden nodes were used (Table 1). The output layer, on the other hand, uses a single node, coded as 1 for flood occurrences and 0 for non-flood events. Although there are other techniques for training ANN models, BP is the most commonly used one. Thus, the BP-based ANN model has been used to estimate the nonlinear relationship between the essential parameters and the flood occurrences. First, BP selects the starting weights at random. Calculated and observed values are then compared; errors are defined as the discrepancies between them and have been analysed using several error measures, including mean squared error (MSE) and root mean square error (RMSE). The initial weights are adjusted based on the generalised delta rule to distribute the total error across the network’s neurons.
$${Z}_{j}= \sum_{i=1}^{n}{w}_{ij}\,{x}_{i}+{b}_{j}$$
(6)
From Eq. (6), Zj is the input to neuron j in the hidden layer, xi are the input features, wij are the weights connecting input neuron i to hidden neuron j, and bj is the bias term for hidden neuron j.
$$f(z)= \frac{1}{1+{e}^{-z}}, \quad \text{tanh}(z), \quad \text{max}(0, z)$$

(7)

From Eq. (7), the activation function f(z) may be the Sigmoid, Tanh, or ReLU function, respectively.
$${Y}_{k}= \sum_{j=1}^{m}{w}_{jk}\,{a}_{j}+{b}_{k}$$
(8)
Eq. (8) gives the output-layer calculation, where aj are the activated outputs of the hidden-layer neurons, wjk are the weights connecting hidden neuron j to output neuron k, and bk is the bias term for output neuron k.
$$MSE= \frac{1}{2}\sum_{i=1}^{N}{\left({Y}_{i}-{y}_{i}\right)}^{2}$$
(10)
$$Cross\, Entropy\, Loss= -\frac{1}{N}\sum_{i=1}^{N}\left({Y}_{i}\,\text{log}\left({y}_{i}\right)+\left(1-{Y}_{i}\right)\text{log}\left(1-{y}_{i}\right)\right)$$
(11)
$${w}_{ij}={w}_{ij}-\eta\, \frac{\partial L}{\partial {w}_{ij}}$$
(12)
Equations (9) to (12) give the activation output, the training loss functions (MSE and Cross-Entropy Loss), and the backpropagation weight update, where yi are the actual values and Yi are the predicted values, η is the learning rate, and L is the loss function.
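The forward pass and weight update in Eqs. (6)–(8), (11) and (12) can be sketched in numpy for a single training sample. The layer sizes follow the configuration described above (twelve inputs, ten hidden nodes, one output), but the sample, the learning rate, and the iteration count are synthetic, illustrative assumptions rather than the study’s training setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):                       # one of the activations in Eq. (7)
    return 1.0 / (1.0 + np.exp(-z))

n_in, n_hid = 12, 10                  # 12 conditioning factors, 10 hidden nodes
W1, b1 = rng.normal(0, 0.1, (n_in, n_hid)), np.zeros(n_hid)   # random starting weights
W2, b2 = rng.normal(0, 0.1, (n_hid, 1)), np.zeros(1)
eta = 0.5                             # learning rate (illustrative)

x = rng.random(n_in)                  # one synthetic, normalised training sample
t = np.array([1.0])                   # observed label: flood occurrence

for _ in range(200):
    z = x @ W1 + b1                   # Eq. (6): hidden-layer input
    a = sigmoid(z)                    # hidden activation
    y = sigmoid(a @ W2 + b2)          # Eq. (8) plus output activation
    # Cross-entropy gradient (Eq. 11) back-propagated via the delta rule (Eq. 12):
    d_out = y - t                     # gradient at the output pre-activation
    d_hid = (d_out @ W2.T) * a * (1 - a)
    W2 -= eta * np.outer(a, d_out)
    b2 -= eta * d_out
    W1 -= eta * np.outer(x, d_hid)
    b1 -= eta * d_hid
```

After a couple of hundred updates on this single sample, the network’s prediction approaches the observed label, which is the behaviour the delta rule is designed to produce.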
Random forest
The groundbreaking random forest (RF) approach combines classification and regression decision trees to make accurate predictions and is a popular ensemble-learning method. Ho’s “random subspace” method and the random selection of features are essential components of the RF, which proceeds in two stages. The random subspace is an ensemble machine learning technique that generates multiple classifiers in order to boost the prediction accuracy of a weak classifier from the outset. To predict the data classification, the RF performs numerous regression-tree training stages and generates diverse sets of samples via sampling with replacement. The final classification chosen by the RF is based on the voting outcomes of the individual classifiers, with each tree in the forest contributing a vote. During the regression tree’s training phase, the observation datasets are split using rules based on the response parameters until the prediction achieves the lowest possible node deviance. In the RF algorithm, during the training of each regression tree, a random subset of input records and predictor factors is chosen as input. Using the total sample to train the decision trees is not recommended, as it disregards the importance of local samples. In flood susceptibility analysis, the RF model serves as a benchmark for comparing outcomes with those of a new hybrid model, highlighting its usefulness in such applications.
$${x}_{i} =\left[{x}_{i1},{x}_{i2},\dots ,{x}_{in}\right]$$
(13)
From Eq. (13), for each data point i, a feature vector xi is created consisting of n features such as rainfall intensity, slope, land use, soil type, and elevation.
RF builds m decision trees. For each tree j, a bootstrap sample of the data is taken, and a subset of features is selected for splitting nodes. A decision tree splits data based on feature fk and threshold θ to minimize impurity (e.g., Gini impurity or entropy for classification):
$$Split=\underset{{f}_{k},\,\theta}{\text{argmin}} \left({Impurity}_{left} + {Impurity}_{right}\right)$$
(14)
Each tree j makes a prediction Yij for data point i by traversing the tree structure based on the feature values.
$${Y}_{ij}={T}_{j}\left({x}_{i}\right)$$
(15)
The final prediction Yi for data point i is obtained by averaging the predictions (regression) or taking a majority vote (classification) from all trees. For regression (e.g., predicting flood susceptibility score):
$${Y}_{i}=\frac{1}{m}\sum_{j=1}^{m}{Y}_{ij}$$
(16)
$${Y}_{i}=\text{mode} \left({\left({Y}_{ij}\right)}_{j=1}^{m}\right)$$
(17)
The importance of each feature fk is calculated to understand its contribution to the model’s predictions. This can be done by measuring the decrease in impurity from all nodes where fk is used:
$$Importance \left({f}_{k}\right)=\frac{1}{m} \sum_{j=1}^{m}\sum_{t\, \in\, {T}_{j}:\ t\ \text{uses}\ {f}_{k}}\Delta Impurity\left(t\right)$$
(18)
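The RF mechanics of Eqs. (13)–(17) — bootstrap sampling, random feature subspaces, impurity-minimising splits, and majority voting — can be illustrated with a stripped-down forest of single-split stumps. The flood labels below are synthetic (driven by two informative factors plus one noise factor), so this is a didactic sketch, not the study’s model.

```python
import numpy as np

rng = np.random.default_rng(42)

def gini(y):
    if len(y) == 0:
        return 0.0
    p = np.bincount(y, minlength=2) / len(y)
    return 1.0 - np.sum(p ** 2)

def fit_stump(X, y, feat_ids):
    """Best single split over a random feature subset, minimising impurity (Eq. 14)."""
    best = (None, None, np.inf)
    for k in feat_ids:
        for theta in np.unique(X[:, k]):
            left, right = y[X[:, k] <= theta], y[X[:, k] > theta]
            imp = len(left) / len(y) * gini(left) + len(right) / len(y) * gini(right)
            if imp < best[2]:
                best = (k, theta, imp)
    k, theta, _ = best
    left_lbl = np.bincount(y[X[:, k] <= theta], minlength=2).argmax()
    right_lbl = np.bincount(y[X[:, k] > theta], minlength=2).argmax()
    return k, theta, left_lbl, right_lbl

def fit_forest(X, y, m=25):
    trees, (n, p) = [], X.shape
    for _ in range(m):
        idx = rng.integers(0, n, n)                     # bootstrap sample, with replacement
        feats = rng.choice(p, size=2, replace=False)    # random subspace: 2 of the 3 factors
        trees.append(fit_stump(X[idx], y[idx], feats))
    return trees

def predict(trees, x):
    votes = [l if x[k] <= t else r for k, t, l, r in trees]   # Eq. (15): per-tree prediction
    return np.bincount(votes, minlength=2).argmax()           # Eq. (17): majority vote

# Synthetic pool: flood (1) when the two informative factors jointly exceed a threshold.
X = rng.random((200, 3))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)
trees = fit_forest(X, y)
```

Points far from the decision boundary should be classified correctly by the majority vote even though individual stumps are weak, which is the ensemble effect the section describes.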
The spatial datasets were standardized to a common resolution of 30 m × 30 m using GIS-based resampling techniques to maintain consistency across layers. A total of 5000 sample points were generated through stratified random sampling, ensuring a balanced representation of flooded and non-flooded areas. These samples were used for training and validating the machine-learning models. To ensure data quality, preprocessing steps included the removal of null values, noise filtering, layer alignment, and accuracy checks via ground truth points and high-resolution imagery. The predictive strength and independence of the conditioning factors were statistically validated using Variance Inflation Factor (VIF) to detect multicollinearity, and Information Gain Ratio to assess their relevance to flood susceptibility.
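The balanced sampling step can be sketched directly: group candidate pixels by flood/non-flood label and draw an equal number from each stratum. The pool sizes and class proportions below are hypothetical; the study’s actual sampling used 5000 points over the classified rasters.

```python
import random

random.seed(7)

def stratified_sample(points, labels, n_per_class):
    """Draw an equal number of points from each class (flooded = 1, non-flooded = 0)."""
    by_class = {}
    for p, y in zip(points, labels):
        by_class.setdefault(y, []).append(p)
    sample = []
    for members in by_class.values():
        sample.extend(random.sample(members, n_per_class))   # without replacement per stratum
    return sample

# Hypothetical pool of pixel IDs: 1000 flooded and 4000 non-flooded candidates.
points = list(range(5000))
labels = [1] * 1000 + [0] * 4000
train = stratified_sample(points, labels, 500)   # 500 per class -> 1000 balanced samples
```

Sampling each stratum separately guarantees the balanced flood/non-flood representation described above, regardless of how skewed the candidate pool is.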
