Dataset descriptions
The datasets utilized in this study19,20,21 are structured to support the development and evaluation of deep learning models for radio spectrum sensing. They comprise three primary components: synthetic training data, real-world captured data22,23 (data collected from actual 5G/LTE networks or environments, representing real signal conditions, noise, and interference, typically gathered from sensors, devices, or real-time communication systems), and a pre-trained model. The synthetic dataset23 (data generated artificially, often through simulation tools or models, to represent various scenarios and network conditions in a controlled environment; it may not capture the full complexity or variability of real-world data) consists of 128 × 128 spectrogram images generated using MATLAB’s 5G Toolbox™ and LTE Toolbox™, which were used to simulate New Radio (NR) and LTE signals, respectively. These signals were passed through standardized ITU and 3GPP channel models to reflect realistic propagation effects. Furthermore, random shifts were applied in both the time and frequency domains to emulate practical deployment conditions such as user mobility, frequency offsets, and timing variations. This variability exposes the model to diverse signal patterns and improves its ability to generalize beyond controlled environments.
Each training frame represents a 40-millisecond segment containing either NR, LTE, or a combination of both, categorized into three semantic classes: LTE, NR, and Noise. Pixel-wise semantic segmentation was employed for annotation, whereby each pixel in the spectrogram is labeled based on the presence and location of signal energy in the time-frequency domain. Active signal regions were labeled according to their corresponding protocol type (LTE or NR), while inactive or non-signal areas were marked as Noise. In addition to synthetic data, a set of captured spectrograms—acquired using RF receivers—was included to evaluate model robustness in real-world conditions and mitigate potential domain shifts between simulated and practical environments.
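The pixel-wise annotation scheme described above can be sketched as follows. This is an illustrative reconstruction, not the paper's labeling code: the class indices, mask-building helper, and band positions are hypothetical, and real annotations follow the actual signal energy rather than rectangular bands.

```python
import numpy as np

# Hypothetical integer indices for the three semantic classes.
NOISE, LTE, NR = 0, 1, 2

def make_label_mask(height=128, width=128, lte_cols=None, nr_cols=None):
    """Build a pixel-wise label mask for one spectrogram frame.

    Columns stand in for frequency bins; every pixel not covered by an
    active LTE or NR allocation is labeled Noise, mirroring the
    three-class annotation scheme (rectangular bands for simplicity).
    """
    mask = np.full((height, width), NOISE, dtype=np.uint8)
    if lte_cols is not None:
        mask[:, lte_cols[0]:lte_cols[1]] = LTE
    if nr_cols is not None:
        mask[:, nr_cols[0]:nr_cols[1]] = NR
    return mask

# One toy frame: an LTE allocation in bins 10-40 and an NR allocation
# in bins 60-110; the remaining bins are Noise.
mask = make_label_mask(lte_cols=(10, 40), nr_cols=(60, 110))
```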
Deep learning approaches were investigated to learn the spatial and spectral structure of the signals. The study utilized transfer learning with the DeepLabv3+ semantic segmentation architecture, employing a ResNet-50 backbone pretrained on large-scale image datasets and fine-tuned using 256 × 256 RGB inputs. This approach leverages pre-learned features to accelerate training and enhance performance, particularly when labeled data is limited.
Given the computational demands associated with training deep models on large and complex datasets, cloud computing infrastructure was employed to enable efficient training workflows and systematic hyperparameter tuning. This hybrid framework—integrating synthetic and captured data, advanced segmentation techniques, and scalable computation—provides a robust foundation for deploying real-time, intelligent spectrum sensing models in dynamic 5G and LTE wireless communication environments.
These datasets19,20,21 relate to the field of spectrum sensing and contain training data (128 × 128 spectrograms), predictions from custom models, and captured data for analysis. They can be used to train machine learning models for radio spectrum detection and to analyze their performance.
In the deep learning domain, one significant advantage of using wireless signals is the ability to synthesize high-quality training data, which is crucial for effective model training. In this specific application, 5G New Radio (NR) signals were generated using the 5G Toolbox™, while LTE signals were produced using LTE Toolbox™ functions. As described in Sect. 3.1, the dataset includes both synthetic signals generated via MATLAB toolboxes and real-world captured spectrograms.
These signals were then passed through channel models specified by the relevant standards to create a robust and realistic training dataset. Each training frame, with a duration of 40 milliseconds, was designed to contain either 5G NR signals, LTE signals, or a combination of both, with the signals being randomly shifted in frequency within the specified band of interest. This random shifting helps simulate real-world scenarios and enhances the network’s ability to generalize. Regarding the network architecture, a semantic segmentation network was employed to identify spectral content within the spectrograms generated from the signals. Two different approaches were explored in this context. The first approach involved training a custom network from scratch using 128 × 128 RGB images as input. This method allows for the design of a network tailored specifically to the characteristics of the dataset. The second approach utilized transfer learning, leveraging the DeepLabv3+ architecture with a ResNet-50 base network. This network was trained on 256 × 256 RGB images. Transfer learning is particularly beneficial as it allows the model to take advantage of pre-trained weights, typically learned from large datasets, thereby speeding up the training process and potentially improving performance, especially when labeled data is limited.
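The random frequency shifting described above can be sketched as a simple augmentation step. This is a hedged illustration only: the paper's shifts were applied to the underlying waveforms via MATLAB toolboxes, whereas this sketch shifts a finished spectrogram array, and the shift range is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_freq_shift(spectrogram, max_shift=16):
    """Randomly shift a (time x frequency) spectrogram along the
    frequency axis, emulating a carrier-frequency offset within the
    band of interest. max_shift (in bins) is an illustrative value."""
    shift = int(rng.integers(-max_shift, max_shift + 1))
    # np.roll wraps around; a real pipeline would instead place the
    # signal at a random offset inside a wider observed band.
    return np.roll(spectrogram, shift, axis=1), shift

# Toy 4x4 "spectrogram": shifting permutes frequency bins but
# preserves the signal energy.
spec = np.arange(16.0).reshape(4, 4)
shifted, k = random_freq_shift(spec, max_shift=2)
```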

The sample frame of the data.

Spectral Analysis and Classification of LTE.
The previous Fig. 2 shows the sample frame of the data, and Fig. 3 presents a detailed analysis of a received wireless signal, represented through two subplots. The top subplot, labeled “Received Spectrogram,” illustrates the signal’s intensity over time and across a frequency range from 2320 MHz to 2380 MHz. The color scale varies from blue to yellow, where yellow indicates regions of higher signal intensity, particularly around 2350 MHz. The bottom subplot, titled “Estimated Signal Labels,” categorizes the signal into three distinct types: noise, LTE (Long-Term Evolution), and NR (New Radio, i.e., 5G). Most of the frequency spectrum is identified as noise (in pink), while a specific segment around 2350 MHz is classified as LTE (blue) and NR (cyan), suggesting active communication signals in that frequency range. Together, these plots provide a comprehensive view of the signal’s behavior and its classification within the observed frequency band.

Spectral analysis and classification of NR.
For the training dataset, both synthetic 5G NR signals and LTE signals were generated using the 5G Toolbox™ and LTE Toolbox™, respectively. These signals were passed through channel models specified by relevant standards, simulating real-world conditions. Each frame in Fig. 4 had a duration of 40 milliseconds and contained either 5G NR, LTE, or a combination of both signals, randomly shifted in frequency within the specified band of interest.

Data generation for spectrum sensing
The training process was conducted using the stochastic gradient descent with momentum (SGDM) optimization algorithm. This algorithm is known for its efficiency in converging to an optimal solution by accelerating the training process using momentum.
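The SGDM update rule can be written out explicitly: a velocity term accumulates a decaying average of past gradients and is added to the weights, which accelerates convergence along consistent descent directions. The following sketch uses illustrative hyperparameter values, not the ones used in the paper, and a toy quadratic loss.

```python
import numpy as np

def sgdm_step(w, grad, velocity, lr=0.01, momentum=0.9):
    """One stochastic-gradient-descent-with-momentum update:
        v <- momentum * v - lr * grad
        w <- w + v
    The learning rate and momentum here are common illustrative
    defaults, not values reported in the study."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

# Minimize the toy loss L(w) = w**2 (gradient 2w) for a few steps.
w, v = np.array([1.0]), np.zeros(1)
for _ in range(3):
    grad = 2 * w
    w, v = sgdm_step(w, grad, v)
```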

The training and validation accuracy over Epochs.
Table 2 and Fig. 5 show how the training dataset was divided: 80% was allocated for training, 10% for validation, and the remaining 10% was reserved for testing. This split ensures that the model is properly evaluated and fine-tuned throughout the training process. Additionally, class weighting was implemented to address imbalances in the dataset, ensuring that the model does not become biased toward any class. By applying these techniques, the network was trained to effectively identify and segment different types of spectral content, improving its ability to operate in complex wireless environments.
Data provided from the linked repositories were used as follows: the training set19 received 80% of the data; the validation set21 was used to monitor the model’s performance during training; and the test set20 was used to evaluate the model after training was complete. The data was split using random partitioning to ensure balanced representation of all patterns across the different sets. This splitting is vital for obtaining accurate and systematic results when evaluating model performance.
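The random 80/10/10 partitioning can be sketched as below. The seed and the frame count are arbitrary; the paper specifies only that the partitioning was random.

```python
import numpy as np

def split_indices(n, train=0.8, val=0.1, seed=42):
    """Randomly partition n frame indices into 80/10/10
    train/validation/test subsets (proportions configurable)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train = int(n * train)
    n_val = int(n * val)
    return (idx[:n_train],
            idx[n_train:n_train + n_val],
            idx[n_train + n_val:])

# Example with a hypothetical dataset of 1000 frames.
tr, va, te = split_indices(1000)
```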
Performance metrics of the data
Given the difficulty of obtaining sufficient and diverse natural data due to data collection constraints and information privacy, we generated synthetic data that accurately mimicked the characteristics of natural data using advanced scientific methods. Specialized tools and simulators, such as the 5G Toolbox and LTE Toolbox from the MATLAB environment, were used, which generate digital signals that adhere to official standards and protocols for telecommunications networks. After generating these signals, they were passed through simulated wireless channel models to reflect the effects of the real environment. This approach ensures that the synthetic data exhibits the same intrinsic characteristics as real data, enabling efficient model training and enhancing their generalizability and practicality.
The performance of the trained network was evaluated using several metrics:
Table 3 summarizes the performance of the trained network on synthetic data, assessed using several key metrics. Global accuracy, which reflects the overall proportion of correctly classified pixels across all classes, achieved a value of 91.802%, indicating that the network performed well on the synthetic dataset. Mean IoU (Intersection over Union), a crucial metric for semantic segmentation tasks, was 63.619%. This value suggests that while the network could correctly classify most of the pixels, there were still areas for improvement, particularly in distinguishing between different classes. The Weighted IoU was considerably higher at 84.845%, which accounts for class imbalance by giving more weight to frequently occurring classes. Lastly, the Mean BF (Boundary F1) Score was 73.257%, indicating that the network performed moderately well in accurately segmenting the boundaries between different spectral content within the synthetic data.
When evaluating the network on captured data in Table 4, the performance metrics showed significant improvements. Global Accuracy reached 97.702%, demonstrating that the network was highly effective in correctly classifying the spectral content in real-world conditions. The Mean Accuracy, which provides an average of the accuracies for each class, was 96.717%, indicating consistent performance across different classes. The Mean IoU also improved substantially to 90.162%, showing that the network was more adept at correctly identifying and segmenting the various classes within the captured data. The Weighted IoU increased to 95.796%, reflecting the network’s enhanced ability to manage class imbalance. The Mean BF Score rose to 88.535%, suggesting that the network was better at precisely delineating the boundaries between different classes in real-world data compared to synthetic data.
For the captured dataset, approximately 60% of frames—primarily those characterized by low SNR and high temporal redundancy—were excluded, as they did not contribute meaningful diversity to the training data. Following this filtering, the network’s performance reached near-perfect levels. As shown in Table 5, the Global Accuracy increased to 99.924%, indicating highly accurate spectral classification. The Mean Accuracy reached 99.83%, reflecting consistent performance across all classes in the absence of excessive noise. The Mean IoU climbed to 99.56%, demonstrating excellent segmentation capability, while the Weighted IoU reached 99.849%, indicating strong handling of class balance. Finally, the Mean BF Score of 99.792% confirmed the model’s precision in delineating class boundaries under cleaner conditions. These results highlight the network’s potential for robust performance generalizability.
Network architectures and models
This comprehensive evaluation allowed us to identify the strengths and weaknesses of each architecture, guiding the selection of the most effective model for practical deployment. Algorithm 2 represents all the network model development and evaluation steps during model training and testing. This approach involved dividing the dataset into three subsets, where each subset was used once as a validation set while the remaining two subsets were used for training. The model training was implemented using the Keras library, while data splitting and cross-validation were managed using scikit-learn. The choice of 3 folds was motivated by the large size of the dataset and the considerable computational resources required for training deep learning models. This method helps mitigate the risk of overfitting and reduces the impact of randomness by providing a more comprehensive assessment of the model’s performance across different data splits.
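The 3-fold procedure described above, implemented in the paper with scikit-learn, can be sketched in plain NumPy as follows; the fold assignment and seed here are illustrative, mirroring what `KFold(n_splits=3, shuffle=True)` produces.

```python
import numpy as np

def three_fold_splits(n, seed=0):
    """Yield (train_idx, val_idx) pairs for 3-fold cross-validation:
    each fold serves once as the validation set while the remaining
    two folds are used for training."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    folds = np.array_split(idx, 3)
    for k in range(3):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(3) if j != k])
        yield train, val

# Example with a hypothetical 9-sample dataset.
splits = list(three_fold_splits(9))
```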

Network model development and evaluation.
ResNet-18
Among the various architectures evaluated, ResNet-18 serves as a foundational baseline model due to its simplicity and proven effectiveness in image classification tasks. It is a relatively shallow but effective convolutional neural network architecture that belongs to the ResNet (Residual Network) family24. ResNet-18 utilizes residual connections to address the vanishing/exploding gradient problem common in deep neural networks, enabling high performance on various computer vision tasks despite having only 18 weighted layers.
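The residual connection that mitigates vanishing gradients can be shown schematically: the block's output is the sum of a learned transformation and the unchanged input. The sketch below strips out convolutions and batch normalization for clarity and uses plain matrix weights, so it is a conceptual illustration rather than the actual ResNet-18 block.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Identity residual block in schematic form:
        y = ReLU(W2 @ ReLU(W1 @ x) + x)
    The skip connection (+ x) lets gradients flow past the weight
    layers, which is what eases training of deep ResNets."""
    return relu(w2 @ relu(w1 @ x) + x)

# With the residual branch zeroed out, the block reduces to the
# identity on non-negative inputs - the "easy to learn" default.
x = np.ones(4)
w_zero = np.zeros((4, 4))
y = residual_block(x, w_zero, w_zero)
```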
With approximately 11.7 million parameters, ResNet-18 is notably lightweight compared to deeper networks, making it a popular choice as a baseline or feature extractor, especially in applications requiring efficient inference or transfer learning. The performance of the trained network was evaluated using several metrics: global accuracy, mean accuracy, mean Intersection over Union (IoU), weighted IoU, and mean Boundary F1 (BF) score25.

Architecture of the ResNet-18 model.
The ResNet-18 model is a deep convolutional neural network designed for image classification, as illustrated in Fig. 6. It begins by accepting an input image, which is passed through a 7 × 7 convolutional layer with 64 filters and a stride of 2. This reduces the spatial dimensions of the image. Following this, a 3 × 3 max pooling layer with a stride of 2 further decreases the image size while retaining important features. The network continues with several blocks of 3 × 3 convolutional layers, each progressively increasing the number of filters from 64 to 512. These convolutional layers extract increasingly abstract and complex features from the input image. After these convolutional layers, average pooling is applied to summarize the learned features and reduce the spatial dimensions further26. The output of this pooling layer is then fed into a fully connected layer, which forms the final prediction. Finally, a softmax layer is used to classify the input image into one of two possible categories, resulting in binary classification outputs27.
ResNet-50
ResNet-50 is a 50-layer variant of the ResNet deep learning architecture, introduced in 2015 by researchers at Microsoft, that utilizes residual connections to enable the training of very deep neural networks. It takes 224 × 224 pixel RGB images as input and has around 25 million trainable parameters. The model offers strong performance on various computer vision tasks such as image classification, achieving around 76% top-1 accuracy on ImageNet, and is widely used as a backbone or feature-extraction network in state-of-the-art deep learning models due to its balance of complexity, efficiency, and performance.

Architecture of the ResNet-50 model.
The diagram in Fig. 7 represents the architecture of a deep convolutional neural network resembling a ResNet-style model. The process begins with an input image that passes through a zero-padding (Zero Pad) layer to ensure consistent dimensions for convolution. In Stage 1, the image goes through a convolutional layer (Conv) to extract low-level features, followed by batch normalization (Batch Norm) to stabilize and speed up training28. A ReLU activation function introduces non-linearity, and max pooling (Max Pool) reduces the spatial dimensions while preserving key features. In Stage 2, two convolutional blocks (Conv Block ID x2) are applied for deeper feature extraction. Stage 3 increases the complexity of features with three convolutional blocks (Conv Block ID x3), while Stage 4 further extracts high-level abstractions using five convolutional blocks (Conv Block ID x5). Stage 5 utilizes two more convolutional blocks (Conv Block ID x2) to refine the extracted features. The image is then passed through an average pooling (AVG Pool) layer, which reduces the spatial dimensions by averaging pixel values in each region, summarizing the feature maps. The output is flattened into a one-dimensional vector and passed through a fully connected layer (FC) for classification. Finally, the output layer produces the network’s prediction29.
MobileNetv2
The following diagram represents the architecture of a deep convolutional neural network, likely based on the MobileNetV2 framework. The model begins by taking an input image with dimensions 128 × 128 × 3, where the image has a resolution of 128 × 128 pixels and three color channels (RGB). The image first undergoes preprocessing, which may include operations like resizing, normalization, or augmentation to prepare the data for further processing.
After preprocessing, the image is passed through a 3 × 3 convolutional layer that uses the ReLU activation function. This layer extracts basic features from the image, such as edges and textures. Following this, a 2 × 2 max-pooling layer is applied, which reduces the spatial dimensions of the feature map from 128 × 128 to 64 × 64, all while retaining the most significant information from the image30.

Architecture of the MobileNetV2 blocks.
The core of the model is built on MobileNetV2 blocks, as illustrated in Fig. 8, starting with a block that uses 32 filters to process the feature maps. This first block produces a feature map of size 64 × 64. The network continues to process these features in a second block, which applies 96 filters, further reducing the spatial size of the feature map to 32 × 32. Finally, the third block in the network consists of 1280 filters, which processes the data and outputs a smaller 4 × 4 feature map31.
Once these layers have processed the image, the output is flattened into a one-dimensional vector containing 1280 elements. This vector is passed into a fully connected layer, where the features are mapped to output classes. The final step involves a softmax classifier, which converts the fully connected layer’s output into probabilities that correspond to the different possible classes (C1, C2, … Cn). This probability distribution helps classify the input image into one of the predefined categories. This entire structure allows the network to efficiently process and classify images, balancing performance and computational efficiency32.
EfficientNet
EfficientNet is a family of highly efficient convolutional neural network (CNN) architectures developed by researchers at Google Brain. It is characterized by a compound scaling method that uniformly scales the network’s depth, width, and resolution based on computational constraints, coupled with innovative building blocks such as mobile inverted bottleneck convolutions, squeeze-and-excitation optimization, and the swish activation function. The result is a range of model variants, EfficientNet-B0 to EfficientNet-B7, that offer state-of-the-art performance on various image classification tasks with exceptional accuracy-efficiency tradeoffs, making them well-suited for deployment on resource-constrained devices while also serving as powerful feature extractors for transfer learning in a wide array of computer vision applications33.
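The compound scaling method can be expressed in a few lines: depth, width, and resolution are scaled jointly by powers of fixed constants. The constants below are the ones published for the EfficientNet-B0 baseline (alpha = 1.2, beta = 1.1, gamma = 1.15, chosen so that alpha · beta² · gamma² ≈ 2); the sketch itself is illustrative and not taken from this paper.

```python
# EfficientNet compound scaling: for a chosen coefficient phi,
#   depth      *= alpha ** phi
#   width      *= beta  ** phi
#   resolution *= gamma ** phi
# subject to alpha * beta**2 * gamma**2 ~= 2, so each unit increase in
# phi roughly doubles the FLOPs of the scaled model.
alpha, beta, gamma = 1.2, 1.1, 1.15

def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for coefficient phi."""
    return alpha ** phi, beta ** phi, gamma ** phi

d, w, r = compound_scale(1)  # one step up from the B0 baseline
```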

Architecture of the EfficientNet blocks.
The previous Fig. 9 illustrates the architecture of a deep convolutional neural network optimized for image processing, utilizing EfficientNet blocks for efficient feature extraction. The network begins by accepting an input image with dimensions 224 × 224 × 3, representing an RGB image. The first layer is a 3 × 3 convolutional layer, which extracts low-level features like edges and textures. Following this, an MBConv1 block (mobile inverted bottleneck convolution) with a 1× expansion factor and a 3 × 3 filter processes the features, ensuring low computational cost. The network then deepens with a series of MBConv6 layers, which apply a 6× expansion factor and 3 × 3 filters to extract more complex features while maintaining efficiency. This configuration is repeated multiple times, progressively capturing more detailed and abstract information. Next, the architecture shifts to MBConv6 layers with 5 × 5 filters, which, with their larger receptive field, are adept at detecting larger spatial features and refining the understanding of complex patterns. Several of these 5 × 5 MBConv6 layers are applied to further enhance feature extraction. Finally, the network produces a condensed feature map (7 × 7 × 320), which encapsulates essential image information, ready for classification or further processing tasks.
Table 6 summarizes the main features of the ResNet-18, MobileNetV2, ResNet-50, and EfficientNet models discussed above, and also explains how their features can be used in DenseNet121 and InceptionV3 for feature extraction.
