Water body extraction from high spatial resolution remote sensing images based on enhanced U-Net and multi-scale information fusion



The overall structure of the EU-Net model

This section details the proposed EU-Net network model and its innovative features. The key to water body extraction is obtaining features from images that discriminate water bodies well. A straightforward approach classifies each pixel by feeding a convolutional neural network an image block (patch) surrounding that pixel; however, this patch-based training and prediction has several limitations.

Firstly, the storage overhead is significant: a window slides over the image and an image block is extracted and classified for every pixel, so the convolutional neural network must process one block per pixel. Secondly, the computation is inefficient: adjacent pixel blocks overlap almost entirely, yet each block is convolved independently. Lastly, the size of the pixel block limits the size of the receptive field: a block that is too small captures insufficient local context, while a block that is too large loses edge information.

Based on these challenges, the EU-Net network structure has been designed, as illustrated in Fig. 1. The network comprises two main parts: the upper part for down-sampling operations and the lower part for up-sampling operations.

Figure 1. EU-Net structure diagram.

In the upper part of EU-Net, each module (D1_Block to D5_Block) consistently involves two 3 × 3 convolutions followed by a ReLU activation function and a 2 × 2 max-pooling layer for downsampling. With each downsampling iteration, the number of feature channels gradually doubles until reaching the lowest resolution. Multi-scale dilated convolutions are inserted into the middle three layers of the downsampling process (D2_Block to D4_Block). Dilated convolutions, by introducing gaps (dilation factors) between convolution kernel elements, can expand the receptive field without increasing the number of parameters. This allows the model to capture a wider range of contextual information, facilitating the processing of global features. Additionally, a channel attention mechanism is implemented in the middle part of the D3_Block. The channel attention mechanism, by assigning different weights to different channels, can emphasize important feature channels and suppress less important ones. This allows the model to better focus on features crucial for the current task (e.g., classification, detection, segmentation), thereby improving model performance.
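For concreteness, a minimal PyTorch sketch of one such down-sampling block follows. It is a reading of the description above; the class name `DownBlock` and the exact layer ordering are assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class DownBlock(nn.Module):
    """One EU-Net encoder block: two 3x3 conv + ReLU, then 2x2 max-pooling."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.pool = nn.MaxPool2d(kernel_size=2)  # halves the spatial resolution

    def forward(self, x):
        features = self.convs(x)          # kept for the decoder's skip connection
        return self.pool(features), features

# Channel counts double with each downsampling step, e.g. 64 -> 128 -> 256 ...
block = DownBlock(in_ch=64, out_ch=128)
pooled, skip = block(torch.randn(1, 64, 256, 256))  # pooled: (1, 128, 128, 128)
```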

In the lower part of EU-Net, each module (U1_Block to U5_Block) retains a large number of feature channels during the upsampling process, allowing the network to transfer contextual information to higher resolution layers. Each layer in this region includes a 2 × 2 transposed convolution operation for upsampling, effectively halving the number of feature channels. It also involves the fusion of feature maps from the corresponding downsampling layer, followed by two 3 × 3 convolutions and ReLU activations. Multi-scale dilated convolution modules and multi-scale feature fusion modules are included from U2_Block to U4_Block. Different scale feature maps can capture targets of different sizes; small-scale feature maps are suitable for capturing large targets, while large-scale feature maps are suitable for capturing small targets. By incorporating multi-scale feature fusion modules, the model can simultaneously handle targets of various sizes, improving detection and segmentation accuracy. A spatial attention mechanism is implemented in the middle part of the U4_Block. The spatial attention mechanism, by assigning different weights to different positions in the image, can emphasize crucial regions in the image. This allows the model to better focus on key areas in the image, enhancing task performance. Finally, the network is connected to a segmentation head for classification.
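A matching sketch of one up-sampling block, again a hedged reading of the text rather than the authors' implementation:

```python
import torch
import torch.nn as nn

class UpBlock(nn.Module):
    """One EU-Net decoder block: 2x2 transposed conv (halving channels),
    fusion with the corresponding encoder features, then two 3x3 conv + ReLU."""
    def __init__(self, in_ch: int):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, in_ch // 2, kernel_size=2, stride=2)
        self.convs = nn.Sequential(
            nn.Conv2d(in_ch, in_ch // 2, kernel_size=3, padding=1),  # in_ch after concatenation
            nn.ReLU(inplace=True),
            nn.Conv2d(in_ch // 2, in_ch // 2, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.up(x)                    # upsample, halving the channel count
        x = torch.cat([x, skip], dim=1)   # fuse the encoder feature map
        return self.convs(x)

up = UpBlock(in_ch=256)
out = up(torch.randn(1, 256, 64, 64), torch.randn(1, 128, 128, 128))  # (1, 128, 128, 128)
```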

EU-Net integrates an improved residual block throughout its architecture. This residual block adjusts feature map values by incorporating a Sigmoid activation function into the residual connection. The Sigmoid function maps inputs to the range (0, 1), effectively controlling information flow by enhancing significant features and suppressing irrelevant ones. The improved residual connection maintains the basic characteristics of a residual block, namely enhancing gradient flow through skip connections, while simultaneously adjusting the input for more stable and effective gradient propagation.

Residual connections for feature enhancement

In remote sensing, especially when extracting water bodies, it is crucial to capture intricate details such as small water bodies. The original U-net architecture, despite its successes, exhibited limitations when applied to complex aquatic environments. The complexity of the GID-Water and WHDLD datasets, characterized by numerous small water bodies interspersed with various land cover types, posed a significant challenge. The inherent downsampling process in U-net, while effective for capturing broader contextual information, resulted in a substantial loss of these crucial fine details. As a result, the model's ability to delineate small and intricate water bodies was compromised, reducing the accuracy and precision of the water body extraction task.

To address this issue, this study proposes the integration of residual connections into the U-net architecture, a modification aimed at preserving the high-resolution features essential for detecting small water bodies. Residual connections, or skip connections, are a fundamental component of the ResNet34 architecture, which has seen tremendous success in various image recognition tasks. These connections allow the direct flow of information across layers, effectively creating shortcuts in the network. Through this approach, they address the issue of data loss during the downsampling stage and maintain the essential finer details needed for precise delineation of water bodies.

Incorporating residual connections also tackles the problem of vanishing gradients. This issue often arises in deep neural networks where, with increasing network depth, the gradients essential for training become extremely small, leading to a stagnation in the network’s learning process. By providing alternate pathways for the gradient flow, residual connections ensure a more robust and stable training process, allowing the construction of deeper networks capable of capturing more complex features without the risk of training stagnation.

As shown in Fig. 1, in the EU-Net model with residual connections, each convolutional block is supplemented with a shortcut that bypasses one or more layers. In the forward pass, the output of a block is combined with its input, and this occurs prior to the activation step. This allows the network to learn modifications to the identity mapping rather than the entire transformation, which has been shown to be easier and more effective for deep networks. Consequently, the network focuses on learning the residual mappings that enhance the features, improving its capacity to capture the intricate details of small water bodies. Experiments show that introducing residual connections into the U-net architecture yields significant improvements in accuracy metrics. The improvements are particularly pronounced in challenging scenarios where a traditional U-net would struggle, such as areas with dense vegetation near water bodies, narrow streams, or water bodies with irregular shapes. The residual U-Net's efficacy in preserving and enhancing intricate details throughout the network renders it a dependable solution for water body extraction, particularly in high-resolution remote sensing imagery where each pixel holds significant value.

We have made improvements to the details of the residual connections. Figure 2 displays the original and the improved residual connections. As illustrated in Fig. 2b, we introduce a gating mechanism controlled by the Sigmoid activation function in the skip connection of the residual block. This mechanism regulates the information flow through the skip connection by element-wise multiplying the direct path of the skip connection with the output of the Sigmoid function. Dynamically regulating the information flow in the skip connections in this way increases the model's complexity and flexibility, allowing the network to adaptively balance the information ratio between the direct path and the nonlinear path; it also enhances the model's robustness to changes in the input data, helping to mitigate overfitting. Furthermore, by providing an additional dynamic element, this mechanism helps stabilize the training process, reducing the risk of vanishing or exploding gradients when training deep networks. Experiments show that the improved residual blocks outperform traditional residual blocks on our metrics.
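Under one plausible reading of Fig. 2b, the gated residual block can be sketched in PyTorch as follows; the placement of batch normalization and the choice of the gate's input are assumptions where the text is silent.

```python
import torch
import torch.nn as nn

class GatedResidualBlock(nn.Module):
    """Residual block whose skip path is modulated by a Sigmoid gate (Fig. 2b)."""
    def __init__(self, ch: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(ch),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        gate = torch.sigmoid(x)  # maps to (0, 1): significant features pass, weak ones are damped
        # The gated identity path is added before the final activation,
        # preserving the gradient-flow benefit of an ordinary residual block.
        return self.act(self.body(x) + gate * x)
```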

Figure 2. The original (a) and improved (b) residual connections.

Attention mechanism for feature refinement

Channel attention

While the addition of residual connections can enhance the precision of water body extraction, this modification alone is not adequate for attaining optimal results. Although residual connections alleviate the problem of information loss to some extent, they treat all feature channels equally rather than distinguishing the informative ones. We therefore introduce a channel attention mechanism. Channel attention plays a crucial role in deep learning, dynamically reinforcing or suppressing the responses of the various channels in a convolutional neural network. It allows the model to adapt its attention dynamically and prioritize the aspects crucial to the current task. By aggregating global information, for example through global average pooling or max pooling, this mechanism captures global statistics to identify and emphasize important channels, providing an effective way to refine and enhance the model's feature representation [21]. Assume there is a feature map \(F \in R^{H \times W \times C}\), where H and W denote the feature map's height and width, respectively, and C is the number of channels. Initially, the spatial information is aggregated using global average pooling to compress each channel into a single descriptor. For channel c, the global average pooling result is denoted Zc, as defined in Eq. (1).

$$Z_{c} = \frac{1}{H \times W}\sum\limits_{i = 1}^{H} {\sum\limits_{j = 1}^{W} {F_{i,j,c} } }$$

(1)

where \(\frac{1}{H\times W}\) is the normalization factor used to compute the average over the spatial dimensions, \(\sum\limits_{i = 1}^{H} \sum\limits_{j = 1}^{W}\) denotes the double summation over all elements across the height and width, and \({F}_{i,j,c}\) is the value of the original feature map at row i, column j, and channel c. Subsequently, to produce the weights for each channel as defined in Eq. (2), a fully connected layer, a ReLU activation function, and a Sigmoid activation function are employed.

$$s=\sigma \left(g\left(z,W\right)\right)=\sigma \left({W}_{2}*ReLU\left({W}_{1}*z\right)\right)$$

(2)

where s contains the weights for each channel, \(\sigma\) denotes the Sigmoid activation function, and \(g(z,W)\) represents a fully connected transformation that takes the channel descriptor z and a learned weight matrix W as inputs and outputs a transformed vector. \({W}_{1}\) and \({W}_{2}\) are the weight matrices of two fully connected layers, where \({W}_{1}\) acts as the dimension-reduction layer and \({W}_{2}\) as the dimension-increase layer. Ultimately, the learned weights s are applied to the initial feature map. The output feature map of the c-th channel after channel attention processing is defined in Eq. (3). Figure 3 shows the structure of the channel attention mechanism.

Figure 3. Structure of the channel attention mechanism.

$${\widetilde{F}}_{c}={s}_{c}\cdot {F}_{c}$$

(3)
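For concreteness, Eqs. (1)–(3) can be realised as a squeeze-and-excitation-style module. The following PyTorch sketch is illustrative; the reduction ratio r is a common choice not specified in the text.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention following Eqs. (1)-(3)."""
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r),  # W1: dimension-reduction layer
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),  # W2: dimension-increase layer
            nn.Sigmoid(),                        # sigma, yielding weights s in (0, 1)
        )

    def forward(self, F):
        B, C, H, W = F.shape
        z = F.mean(dim=(2, 3))          # Eq. (1): global average pooling per channel
        s = self.fc(z)                  # Eq. (2): s = sigma(W2 * ReLU(W1 * z))
        return F * s.view(B, C, 1, 1)   # Eq. (3): F~_c = s_c * F_c
```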

Spatial attention

The channel attention mechanism extracts valuable information by weighting feature channels; the same principle can be applied along the spatial dimensions. The resulting mechanism, known as spatial attention, incorporates weight information from the various regions of the feature map. Spatial attention can identify and concentrate on the most critical parts of an image, significantly enhancing the model's efficiency and accuracy in processing visual information. It uses attention maps generated by convolutional layers to assign a weight to each spatial location, reinforcing the response to important features and suppressing irrelevant areas [21]. This allows the network to focus on task-relevant signals in complex backgrounds. Assuming there is a feature map \(F\in {R}^{H\times W\times C}\), the spatial attention mechanism can be defined as shown in Eq. (4).

$$A=\sigma \left({f}_{\text{att}}\left(F\right)\right)$$

(4)

Here, A is the generated spatial attention map, whose dimensions match those of the input feature map; \(\sigma\) denotes the Sigmoid activation function, which introduces non-linearity; and \({f}_{att}\) is the convolution operation that learns the attention weights from the feature map. Finally, a new weighted feature map is obtained through the element-wise (Hadamard) product, as defined in Eq. (5). Figure 4 shows this structure.

Figure 4. Structure of the spatial attention mechanism.

$$\widetilde{F} = A \odot F$$

(5)
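A corresponding sketch of Eqs. (4)–(5) follows. The text specifies only that \(f_{att}\) is a convolution, so realising it as channel-pooling followed by a 7 × 7 convolution (as in CBAM) is an assumption.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention following Eqs. (4)-(5)."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, F):
        avg = F.mean(dim=1, keepdim=True)    # per-position channel average
        mx, _ = F.max(dim=1, keepdim=True)   # per-position channel maximum
        A = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # Eq. (4): A = sigma(f_att(F))
        return A * F                          # Eq. (5): F~ = A ⊙ F (broadcast over channels)
```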

Design of channel attention mechanism and spatial attention mechanism

The previous sections introduced the channel and spatial attention mechanisms; this section describes how these modules are placed. In the baseline U-net model, the flow of an image can be summarized as the extraction of progressively deeper features followed by the re-modeling of those deep features. In the final output, specifically the classification of individual pixels, accuracy essentially depends on the modeling of the deep features: under the same downsampling and upsampling conditions, the finer the deep features, the more precise the re-modeling, and hence the more accurate the per-pixel classification. Considering that high-resolution remote sensing imagery allows similar water bodies to be modeled more clearly, this paper integrates the channel attention mechanism after the third downsampling. This placement balances the extraction of deep features with the selection of shallower ones. Although the channel attention mechanism does not directly affect contextual information, weighting features across channels lets the model pay more attention to the channels most important for the current task, indirectly influencing the extraction of contextual information.

The spatial attention mechanism, by focusing on key areas within an image, enables the network to more accurately identify water bodies and their boundaries. This mechanism enhances water body detection accuracy in complex environments by allocating higher weights to pixels related to water features while suppressing interference from the background or non-water regions. The penultimate layer of upsampling in U-net retains high-level abstract features extracted during the downsampling process and integrates shallow information through skip connections. Therefore, the penultimate layer of upsampling is rich in semantic information, prompting the decision to incorporate spatial attention mechanisms at this stage. Through this approach, the network can capture more detailed features of water bodies’ boundaries and internal textures, which is particularly important for accurately depicting small water bodies or edges of water bodies in high-resolution remote sensing images.

Multi-scale dilated convolution module for spatial contextual relationship modeling

Due to the structure of U-net, significant detail is lost as feature maps are transferred to the right, reducing accuracy in the recognition of small water bodies. To address this issue, this paper designs a Multi-Scale Dilated Convolution (MSDC) module containing multiple dilated convolutions with different dilation rates [35]. Standard convolution operates on the input image via a sliding window (the convolutional kernel), where each window performs a weighted summation of the covered pixels to generate the pixel values of the output feature map. Dilated convolution introduces an additional parameter, the dilation rate, which defines the spacing between elements of the convolutional kernel. In standard convolution, the receptive field grows linearly with network depth; dilated convolution allows the receptive field to grow much faster without a significant increase in parameters or computational load, so the network can capture a broader context at deeper levels.

The MSDC module designed in this study is illustrated in Fig. 5. To enlarge the receptive field, a higher dilation rate is used in the shallow layers of both the encoder and decoder of the U-net architecture, where the feature maps are larger. Conversely, to preserve intricate details in the more abstract feature maps, the dilation rate is reduced in the deeper encoder and decoder modules, down to that of a standard convolution for the most abstract features. The three branches in Fig. 5 (from top to bottom) represent the data flow of EU-Net. The left side can be understood as low-level features, with features becoming more advanced (deeper downsampling layers) towards the right. The shallow-level features require a larger dilation rate; as the data flow moves towards the deeper layers of the network, the required dilation rate gradually decreases. Batch normalization (BN) is then applied, followed by concatenation to obtain the merged features. Let each branch's feature map be \(F_{i}\), where \(i\) denotes the branch index, with \(i = 1\) being the leftmost (low-level features) and larger \(i\) corresponding to higher-level features. For each branch \(i\), a dilation operation \(D_{i}\) with dilation rate \(r_{i}\) is applied to the feature map \(F_{i}\); \(r_{i}\) is larger for lower-level features (smaller \(i\)) and decreases for higher-level features (larger \(i\)). This can be defined as Eq. (6):

$$F_{i}^{\prime} = D_{i} \left( {F_{i} ,r_{i} } \right)$$

(6)

where \(F_{i}^{\prime}\) is the feature map after applying dilation, and \(D_{i}(\cdot, r_{i})\) represents the dilation operation on feature map \(F_{i}\) with dilation rate \(r_{i}\).

Figure 5. Diagram illustrating the expansion of the receptive field by dilated convolution.

After the dilation operation, each feature map \(F_{i}^{\prime}\) undergoes batch normalization (BN), denoted as \(BN(F_{i}^{\prime} )\).

Finally, the processed feature maps from all branches are concatenated to form the merged feature map, defined as Eq. (7):

$$F_{merged} = Concat(BN(F_{1}^{\prime} ),BN(F_{2}^{\prime} ), \ldots, BN(F_{n}^{\prime} ))$$

(7)

where \(Concat(\cdot)\) represents the concatenation operation over all processed feature maps from the branches, and \(n\) is the total number of branches in the EU-Net architecture.
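A compact PyTorch sketch of Eqs. (6)–(7) is given below. For simplicity, the branches here share one input feature map, whereas in EU-Net each branch corresponds to a feature map at a different level; the example dilation rates (4, 2, 1) merely illustrate the decrease from shallow to deep features.

```python
import torch
import torch.nn as nn

class MSDC(nn.Module):
    """Multi-Scale Dilated Convolution: per-branch dilated conv + BN, then concat."""
    def __init__(self, channels: int, rates=(4, 2, 1)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                # padding = rate keeps the spatial size fixed for a 3x3 kernel
                nn.Conv2d(channels, channels, 3, padding=r, dilation=r),  # Eq. (6): F'_i = D_i(F_i, r_i)
                nn.BatchNorm2d(channels),                                 # BN(F'_i)
            )
            for r in rates
        ])

    def forward(self, F):
        # Eq. (7): F_merged = Concat(BN(F'_1), ..., BN(F'_n)) along the channel axis
        return torch.cat([branch(F) for branch in self.branches], dim=1)
```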

Therefore, using convolutions with different dilation rates at the various sampling layers of U-net effectively solves the problem of an insufficient receptive field. Overall, the MSDC module extracts spatial context over a larger receptive field while also integrating multi-scale spatial relationships.

Multi-scale feature fusion module

In the realm of remote sensing and particularly in water body extraction tasks, capturing and understanding the diverse range of spatial details is crucial. Traditional Convolutional Neural Networks (CNNs) often struggle to adapt to the diversity of objects in natural scenes, especially the more specific descriptions provided by high-resolution satellite imagery. Therefore, the fusion of multi-scale features becomes crucial, as this operation can distinguish more complex scenes.

Integration of features at multiple scales is not a mere aggregation of features; it is a sophisticated strategy that intelligently combines and refines information from different layers of the network. Each layer of a CNN captures features at a different level of abstraction. The earlier layers tend to capture fine details such as edges and textures, while deeper layers capture higher-level semantic information such as shapes and object categories. In the context of water body extraction, the finer details might include the ripples and edges of water bodies, while the higher-level features might represent the overall shape and extent of lakes or rivers. Through the integration of these multi-level features, the model achieves a more comprehensive perception of the image, allowing it to identify the locations and boundaries of water bodies more precisely. This is particularly beneficial in complex scenarios where the water body is surrounded by similar textures or where the water body's shape is irregular and intertwined with other land cover types.

The proposed EU-Net incorporates a multi-scale feature fusion process in which the feature maps from the last three layers of the decoder are concatenated together, as illustrated in Fig. 6. In the concatenation stage, lower-level feature maps are combined with higher-level feature maps. The fused feature map then flows into a spatial attention mechanism, enabling the network to depict details more completely. Subsequently, a dilated convolution is applied to the fused feature map to expand the receptive field and perceive contextual relationships. These three types of features transition sequentially from lower-level feature maps (large water bodies) to higher-level feature maps (detailed contours). The last three layers of the decoder are rich in semantic information, having gone through multiple stages of feature refinement; by fusing them, the model leverages both detailed textural information and broader contextual understanding, essential for accurately delineating water bodies of various sizes and shapes.

The implications of multi-scale feature fusion are profound. For larger water bodies, which may span many pixels and exhibit complex interactions with their surroundings, the model can use the broader semantic information to understand the overall context and make more informed decisions. For smaller water bodies, which might be only a few pixels wide and easily confused with similar features, the model relies on the finer textural details to make precise localizations.
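The fusion step described above might be sketched as follows; the channel counts, the interpolation mode, and the single-layer attention convolution are illustrative assumptions rather than the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as Fn

class MultiScaleFusion(nn.Module):
    """Concatenate the last three decoder feature maps, apply spatial attention,
    then a dilated convolution to widen the receptive field."""
    def __init__(self, channels=(256, 128, 64), out_ch: int = 64, rate: int = 2):
        super().__init__()
        fused = sum(channels)
        self.attn_conv = nn.Conv2d(fused, 1, kernel_size=7, padding=3)
        self.dilated = nn.Conv2d(fused, out_ch, kernel_size=3, padding=rate, dilation=rate)

    def forward(self, f_coarse, f_mid, f_fine):
        # Resize the coarser decoder maps to the finest map's resolution.
        size = f_fine.shape[2:]
        f_coarse = Fn.interpolate(f_coarse, size=size, mode="bilinear", align_corners=False)
        f_mid = Fn.interpolate(f_mid, size=size, mode="bilinear", align_corners=False)
        x = torch.cat([f_coarse, f_mid, f_fine], dim=1)  # multi-scale concatenation
        A = torch.sigmoid(self.attn_conv(x))             # spatial attention map
        x = A * x                                        # emphasise key regions
        return self.dilated(x)                           # capture wider context
```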

Figure 6. Multi-scale feature fusion module.

Moreover, multi-scale feature fusion also lends itself to robustness. In real-world scenarios, the conditions under which images are captured can vary widely: lighting conditions, seasons, cloud cover, and so on all introduce variability into the data. A model that relies on a single scale of features is more susceptible to being misled by these variations. In contrast, a model that fuses multiple scales of features is more resilient, drawing on a wider range of cues to make its predictions.

To summarize, the integration of multi-scale features marks a notable advancement in remote sensing and the extraction of water bodies. The proposed method acknowledges and addresses the inherent complexity and variability of natural scenes, providing a more sophisticated and comprehensive understanding of the data. By doing so, it not only enhances the performance of models on current tasks but also opens up new avenues for future exploration and innovation.


