Revolutionizing Urban Mapping: Deep Learning and Data Fusion Strategies for Accurate Building Footprint Segmentation

The literature on building footprint segmentation spans three main families of methods: rule-based methods, which rely on predefined rules and thresholds; machine learning methods, which classify images using extracted features; and deep learning methods, which leverage convolutional neural networks. In addition, data fusion integrates different data sources to increase accuracy during building segmentation.

Rule-Based Approach

In building footprint segmentation, rule-based approaches have traditionally relied on predefined rules and thresholds, leveraging spectral and geometric features.^6,22 An important early study by Huertas and Nevatia (1988) outlined a methodology for detecting buildings in aerial photographs. The method combined edge detection, shadow analysis using the illumination direction, and shape analysis using a rectangular building model, which made it possible to separate a building from its surroundings.²³

Rule-based building detection faces challenges in adaptability and accuracy when applied to high-resolution optical remote sensing imagery. Diverse urban structures lead to errors, as evidenced by the Vaihingen 2D labeling competition,²⁴ in which rule-based methods performed poorly compared with deep learning strategies. Their limited adaptability and reliance on simple models make them less favorable, although they retain potential as a post-processing complement to more sophisticated methods.²⁵

Machine learning approach

In recent years, machine learning has become an essential approach for building detection from remotely sensed orthophotos, employing a variety of supervised and unsupervised algorithms for pixel- or object-based classification. These methods can draw on features such as color, texture, or shape.²⁶ Various classifiers have been used: Support Vector Machines (SVMs), for example, for texture-based aerial image segmentation,⁷ and Random Forest (RF) classifiers for spectral-based structural segmentation of satellite imagery.^8,9 Building on this foundation, one study investigated the integration of DSM data and orthophotos by applying five different algorithms, with Random Forest emerging as the best method.⁹

Additionally, the integration of LiDAR data and high-resolution imagery (HRI) has been explored to enrich the feature representation used for building extraction. Combining a building extraction layer with HRI data, random forest classification has been used to differentiate building types in urban areas. However, challenges remain in harmonizing diverse data sources and managing the computational demands of processing multidimensional data.¹² Persistent obstacles in this field include sparse point clouds, high spectral variability, differences among urban objects, complex surroundings, and data inconsistencies.¹¹ Furthermore, issues with feature selection and extraction can hinder machine learning approaches.²⁷ The complexity of building footprints in traditional orthophotos can make model training difficult, requiring a large number of variables and leading to segmentation inaccuracies.⁹ For high-rise buildings in particular, relief displacement, which causes a misalignment between the roof contour and the actual building footprint, adds complexity that affects the learning ability of segmentation models.³ Addressing these challenges will be crucial for advancing building detection and segmentation in complex urban environments.
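To give a sense of scale for the roof-to-footprint misalignment mentioned above, the classic relief displacement relation for a vertical aerial photograph, d = r * h / H, can be evaluated directly. The sketch below is an illustration only and is not taken from the cited studies; the function name and the sample values are assumptions.

```python
def relief_displacement(radial_distance, building_height, flying_height):
    """Estimate the displacement of a roof point relative to its base.

    Uses the standard relation d = r * h / H for a vertical photograph:
      radial_distance  r: distance from the nadir point to the displaced
                          roof point (result is in the same units as r)
      building_height  h: height of the roof above the ground datum
      flying_height    H: camera height above the same datum (same units as h)
    """
    if flying_height <= 0:
        raise ValueError("flying_height must be positive")
    return radial_distance * building_height / flying_height


# Hypothetical numbers: a 60 m tower whose roof appears 400 m from nadir,
# photographed from 1200 m above ground.
shift = relief_displacement(400.0, 60.0, 1200.0)
print(f"roof displaced by about {shift:.1f} m")  # prints 20.0 m
```

The displacement grows with both building height and distance from nadir, which is consistent with the observation that high-rise buildings are the hardest case.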

Deep Learning Approach

In building footprint segmentation, deep learning approaches employing Convolutional Neural Networks (CNNs) have become crucial, demonstrating remarkable capabilities in pixel- or object-based semantic segmentation of orthophotos.^10,28,29 A wide range of deep learning architectures, including AlexNet, fully convolutional networks, U-Net, VGG, GoogLeNet, ResNet, DenseNet, LinkNet, pyramid scene parsing networks (PSPNet), feature pyramid networks with bottom-up and top-down pathways, DeepLabv3, and DeepLabv3+, have demonstrated both accuracy and robustness in building footprint segmentation.³⁰

Combining Mask R-CNN with building boundary regularization improves the accuracy of building polygons, but its ability to generalize to other contexts remains limited.³¹ Incorporating multi-source data, such as very-high-resolution aerial imagery and multi-source GIS data, presents both challenges and opportunities that require careful consideration for optimal results.³²

Most approaches that use RGB orthophotos as the primary input in deep learning processes overlook the richness that elevation information brings, especially when acquired from multiple sources. Conversely, the richness of detail in multiple source data poses challenges to developing accurate deep learning models for building footprint extraction.³³ DeepLabv3, known for its edge accuracy and multi-scale context, has advantages when applied to a combination of RGB and DSM data, as highlighted in the MAP-Net comparison.^34,35

Although Transformers have shown promise for building detection and segmentation tasks, they have limitations that must be considered. For example, the complexity of Transformer models can increase computational requirements and training times.³⁶ Additionally, Transformers may struggle to capture fine details in building structures, especially in scenarios with limited data or diverse building types.³⁷

The evaluation of deep learning-based methods used as building-background discriminators traditionally prioritizes metrics that reward extracting the majority of building footprints. However, these metrics do not yet account for computation time and resource requirements, which places a premium on a comprehensive evaluation framework.³⁸ The application of deep learning models to remote sensing for building extraction has inspired many researchers to explore advanced techniques that can handle the computational complexity inherent in such tasks.¹
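As a concrete reference point for the accuracy-oriented metrics discussed above, the following minimal sketch computes precision, recall, F1, and intersection over union (IoU) for a binary building/background mask. These are commonly used segmentation metrics; the text does not say which ones the cited studies report, so the selection, the function name, and the toy masks are assumptions. Runtime and memory use, which the paragraph notes are not covered by such metrics, would have to be measured separately.

```python
def segmentation_scores(pred, truth):
    """Score a predicted binary mask against a ground-truth mask.

    Both masks are 2D lists of 0/1 values on the same grid, where 1 marks
    a building pixel and 0 marks background.
    """
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1          # building correctly detected
            elif p and not t:
                fp += 1          # background labeled as building
            elif t and not p:
                fn += 1          # building pixel missed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}


# Toy 2 x 3 masks (hypothetical data).
predicted = [[1, 1, 0],
             [0, 1, 0]]
reference = [[1, 0, 0],
             [0, 1, 1]]
print(segmentation_scores(predicted, reference))
```

In this toy case two pixels are true positives, one is a false positive, and one is a false negative, so the IoU is 2 / 4 = 0.5 while precision and recall are both 2 / 3.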

Data fusion approach

Data fusion combines data from different sources to create a new dataset that can provide better information than either source alone. Data fusion can be performed at different levels: pixel, feature, or decision level.³⁹ In this paper, we focus on pixel-level data fusion, combining pixel values from different images to create a new dataset with more bands or higher resolution.⁴⁰ In particular, fusing imagery with elevation information enhances the contrast between buildings and the background, which makes building boundaries easier to delineate and improves the segmentation of building footprints.
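Pixel-level fusion of the kind described above can be illustrated by stacking an elevation band onto an RGB image so that each pixel carries four values instead of three. The sketch below is a minimal illustration in plain Python, assuming both rasters are already co-registered on the same grid; the function name, the min-max scaling of heights, and the sample values are assumptions rather than the procedure of any cited study. In practice an array library would normally be used for this.

```python
def fuse_rgb_dsm(rgb, dsm):
    """Stack a DSM band onto an RGB image, pixel by pixel.

    rgb: rows of (r, g, b) tuples with values in 0..255
    dsm: rows of elevations (e.g. metres) on the same grid as rgb
    Returns rows of (r, g, b, h) tuples with every channel scaled to 0..1.
    """
    if len(rgb) != len(dsm) or any(len(a) != len(b) for a, b in zip(rgb, dsm)):
        raise ValueError("RGB and DSM rasters must share the same grid")

    # Min-max scale the heights so the new band is comparable to the colors.
    heights = [h for row in dsm for h in row]
    lowest, highest = min(heights), max(heights)
    span = (highest - lowest) or 1.0  # avoid dividing by zero on flat terrain

    fused = []
    for rgb_row, dsm_row in zip(rgb, dsm):
        fused.append([
            (r / 255, g / 255, b / 255, (h - lowest) / span)
            for (r, g, b), h in zip(rgb_row, dsm_row)
        ])
    return fused


# Hypothetical 1 x 2 scene: a red roof at 30 m next to green ground at 10 m.
image = [[(255, 0, 0), (0, 255, 0)]]
surface = [[30.0, 10.0]]
print(fuse_rgb_dsm(image, surface))
```

The fourth channel separates the roof (scaled height 1.0) from the ground (0.0) even where colors alone would be ambiguous, which is the contrast enhancement the paragraph refers to.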

Historically, many methods using DSM data for building extraction did not incorporate RGB data, limiting their effectiveness. One study⁴¹ used a two-stage global optimization process on the DSM but faced challenges with low-rise and non-rectilinear buildings. Tian et al.⁴² used a DSM-based approach that relies on height information and the Kullback–Leibler divergence index to detect urban change, but it lacked the richness of context that RGB data can provide. Bittner et al.⁴³ applied a fully convolutional network (FCN) to the DSM for building mask extraction; augmenting it with RGB data allows for better material classification and feature extraction.

In contrast, recent studies⁴⁴ integrated RGB data with DSM data and the Visible-band Difference Vegetation Index (VDVI) to significantly improve the accuracy of building extraction, especially in complex areas where buildings are hidden by vegetation. This fusion allows for better discrimination between buildings and ground objects. Despite these advances, the evolving field of deep learning requires more robust algorithms that can capture multi-contextual details to further improve segmentation accuracy in complex urban environments.
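For reference, the VDVI is computed from the visible bands alone as (2G - R - B) / (2G + R + B), so it can be derived from the same RGB orthophoto without a near-infrared band. The sketch below shows the index and a simple vegetation mask built from it; the function names and the 0.1 threshold are assumptions chosen for illustration, not values taken from the cited studies.

```python
def vdvi(r, g, b):
    """Visible-band Difference Vegetation Index for one pixel.

    Returns a value in [-1, 1]; greener pixels score higher.
    """
    denominator = 2 * g + r + b
    if denominator == 0:
        return 0.0  # pure black pixel: index is undefined, treat as neutral
    return (2 * g - r - b) / denominator


def vegetation_mask(rgb, threshold=0.1):
    """Mark pixels whose VDVI exceeds a (hypothetical) threshold as vegetation."""
    return [[1 if vdvi(r, g, b) > threshold else 0 for (r, g, b) in row]
            for row in rgb]


# Hypothetical pixels: a tree canopy and a grey roof.
print(vdvi(50, 200, 50))     # 0.6
print(vdvi(120, 120, 120))   # 0.0
print(vegetation_mask([[(50, 200, 50), (120, 120, 120)]]))
```

A mask like this can be supplied as an additional band alongside RGB and DSM, helping a model tell tall vegetation apart from buildings of similar height.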

A study by Marmanis et al.⁴⁵ acknowledged the challenges of using boundary detection to improve the accuracy of semantic image segmentation of man-made structures while simultaneously handling vegetation classes. The authors proposed a nuanced approach, since this may affect the generalizability of results in urban environments in data fusion scenarios.

Further research¹⁵ highlighted a key achievement: the successful fusion of aerial imagery and LiDAR data through the application of an active contour segmentation algorithm. Even here, however, multi-source data brings substantial tasks, such as ensuring compatibility between different data formats and calibrating for the variations in resolution and accuracy associated with data fusion.

The literature also highlights challenges associated with data fusion techniques. For example, one study³ highlighted the difficulty of accurately extracting building footprints, especially for high-rise buildings, due to the misalignment of roof contours and building footprints in conventional orthophotos. Another study⁴⁶ demonstrated how to overcome missing and incomplete modalities using generative adversarial networks applied to building footprint segmentation.

While showing the complexities involved in fusing diverse data modalities, one study³³ demonstrated that incorporating additional height information improves the overall segmentation quality of building footprint extraction and significantly increases prediction accuracy.

Data fusion has emerged as a pivotal technique in building footprint segmentation, leveraging information from multiple sources to create enriched datasets at the pixel, feature, and decision levels.³⁹ This paper focuses on pixel-level data fusion, specifically blending pixel values from different images to generate a new image with additional bands or increased resolution.⁴⁰ Highlighting the importance of data fusion, especially with the elevation information in Digital Surface Models (DSMs), we demonstrate how it can enhance and refine building footprint segmentation. Studies show that fusing RGB and DSM orthophotos outperforms using RGB orthophotos alone, improving accuracy and boundary delineation.³²

Despite advances in data fusion techniques, challenges remain. As demonstrated by studies merging RGB and LiDAR data or employing advanced algorithms such as gated residual refinement networks, integrating multi-source data presents obstacles such as compatibility issues, resolution variability, and the need for extensive labeled training datasets.^13,47 In particular, traditional orthophotos suffer from misalignment of roof contours and building footprints, which hinders accurate extraction, especially for tall buildings.³ Missing or incomplete modalities add further complexity, which has led to innovative solutions such as Generative Adversarial Networks for building footprint segmentation.⁴⁶

To overcome these challenges, recent studies have highlighted the transformative potential of incorporating additional height information into the fusion process. Integrating height data improves the overall segmentation quality and significantly increases prediction accuracy.³³ As building footprint segmentation continues to evolve, careful consideration of data fusion methodologies, their challenges, and innovative solutions is coming to the forefront, driving advances in urban planning and change detection.
