An efficient deep learning approach with frequency and channel optimization for underwater acoustic target recognition


In this section, we conduct two experiments to validate the efficiency and superiority of our proposed method. The first is an ablation experiment to verify the impact of key model components. The second is a comparative experiment to evaluate our method against state-of-the-art approaches in terms of accuracy and efficiency.

Ablation experiment

In this section, we conduct a comprehensive ablation study to investigate the key design factors contributing to the performance of FCResNet5. Specifically, we compare time-frequency and non-time-frequency input features, assess the influence of frequency bandwidth selection, examine the effect of window overlap during feature extraction, evaluate the role of frequency channelization, and explore suitable network architectures. These analyses collectively demonstrate how FCResNet5 achieves a balance between accuracy and computational efficiency, making it well-suited for real-world ship-radiated noise classification.

Comparison between time-frequency and non-time-frequency features

To evaluate the effectiveness of different input representations for ship-radiated noise classification, we conduct a comparative experiment using ResNet18 across seven feature types. These include four time-frequency features (STFT, Mel, CQT, and Gamma-tone) and three non-time-frequency features (MFCC, Wavelet, and Cepstrum). Each feature is evaluated using five randomly generated data splits, and the results are averaged over 10 repeated runs. The average classification accuracies are summarized in Table 4.
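The "mean ± std" figures reported throughout this section can be produced with a few lines of code. The sketch below is illustrative only: the accuracy values are hypothetical, and it assumes the sample standard deviation over all runs pooled across splits, which the text does not specify.

```python
from statistics import mean, stdev


def summarize(accuracies):
    """Format per-run accuracies (in %) as 'mean ± std' (sample std assumed)."""
    return f"{mean(accuracies):.2f} ± {stdev(accuracies):.2f}"


# Hypothetical accuracies: 2 splits x 3 runs here, instead of 5 splits x 10 runs.
runs_per_split = [[70.0, 72.0, 74.0], [71.0, 72.0, 73.0]]

# Pool every run from every split before summarizing (an assumption).
pooled = [acc for split in runs_per_split for acc in split]
print(summarize(pooled))
```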

Table 4 Average classification accuracy (%) of ResNet18 using seven feature representations.

As shown in Table 4, the time-frequency representations consistently outperform the non-time-frequency counterparts. Among them, STFT achieves the highest average accuracy (72.38%), followed closely by Mel (71.20%), while Wavelet and Cepstrum exhibit the lowest performance. These findings suggest that time-frequency features preserve more discriminative information crucial for ship-radiated noise classification.

Fig. 7

t-SNE visualizations of learned features using seven different feature types: Mel, CQT, Gamma-tone, STFT, MFCC, Wavelet, and Cepstrum.

To further evaluate feature separability, we visualize the t-SNE embeddings of the extracted features in Fig. 7. Subfigures (a) to (d), which correspond to the time-frequency representations, exhibit more distinct and compact clusters, reflecting stronger inter-class separability. In contrast, the non-time-frequency features in subfigures (e) to (g), particularly Wavelet and Cepstrum, show more diffuse and overlapping distributions, indicating limited discriminative capability. These qualitative observations are consistent with the classification accuracy results summarized in Table 4.

Based on this analysis, we adopt the four time-frequency features (STFT, Mel, CQT, and Gamma-tone) as the primary input representations for subsequent experiments.

Effective bandwidth

In this section, we validate the rationale behind selecting the 2 kHz bandwidth. In our experiments, we use ResNet18 as the classifier to test the impact of different upper and lower frequency limits on the model’s performance.

The experimental results are illustrated in Fig. 8 and Table 5, which jointly present the classification accuracy (mean ± std over 10 trials) of ResNet18 under various frequency input configurations across four spectral features: CQT, Mel-spectrogram, STFT, and Gamma-tone. Figure 8 visualizes the accuracy trends for different maximum frequency cutoffs (green lines) and segmented frequency bands (orange lines), while Table 5 reports the detailed numerical values.

As shown by the green line, the model’s classification performance generally decreases as the upper frequency limit increases, indicating that raising the upper bandwidth limit does not improve the model’s performance but instead introduces interference. Furthermore, the orange line shows that higher frequency bands correspond to lower classification accuracy, suggesting that these bands likely contain unwanted noise rather than signals useful for classification. Thus, this experiment supports our choice of limiting the data bandwidth to within 2 kHz.
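Band-limiting a spectrogram amounts to keeping only the frequency bins that fall inside the chosen band. The sketch below is a minimal illustration under stated assumptions: it assumes an STFT whose bin spacing is the reciprocal of the window length, and it uses a 200 ms window. The function names `band_bins` and `crop` are hypothetical, and the code does not check the band against the Nyquist frequency.

```python
def band_bins(f_low_hz: int, f_high_hz: int, window_ms: int) -> range:
    """Bin indices k whose centre frequency k * (1000 / window_ms) Hz
    lies in the half-open band [f_low_hz, f_high_hz)."""
    # Ceiling division with integers avoids floating-point rounding issues.
    first = -(-f_low_hz * window_ms // 1000)
    last = -(-f_high_hz * window_ms // 1000)
    return range(first, last)


def crop(spectrogram, bins):
    """Keep only the selected frequency rows of a [frequency][time] array."""
    return [spectrogram[k] for k in bins]


# A 200 ms window gives 5 Hz bin spacing, so 0-2 kHz spans 400 bins.
low_band = band_bins(0, 2000, 200)
next_band = band_bins(2000, 4000, 200)
print(len(low_band), next_band[0], next_band[-1])
```

Under these assumptions the 0 to 2 kHz band covers 400 bins. That count is at least consistent with the 400 channels reported for the non-CQT features after frequency channelization, although the text does not spell out this derivation.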

Fig. 8

Accuracy (mean ± std) of different frequency ranges on four different types of features. The green lines represent the results under different maximum frequency cutoffs, while the orange lines correspond to different frequency band segments.

Table 5 Accuracy (mean ± std) of FCResNet5 under different frequency range and band settings across four types of spectral features.

Analyzing window overlap in spectral feature extraction

To investigate the influence of windowing strategies on model performance and training efficiency, we conduct an ablation study on the use of overlapping windows during time-frequency feature extraction. In this experiment, we fix the window length at 200 ms and vary the overlap ratio between adjacent frames. Specifically, we consider four overlap settings: 0% (no overlap), 25%, 50%, and 75%, which correspond to hop sizes of 200 ms, 150 ms, 100 ms, and 50 ms, respectively. For each setting, the model is trained and evaluated 10 times using different random seeds, and we report the average classification accuracy and standard deviation across these runs in Table 6.
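The relation between overlap ratio and hop size is hop = window × (1 − overlap). The following sketch reproduces the four hop sizes above and shows how the number of frames grows with overlap. The 30 s clip length is a hypothetical value chosen for illustration, and the frame count assumes no padding.

```python
def hop_ms(window_ms: float, overlap: float) -> float:
    """Hop size implied by a window length and a fractional overlap in [0, 1)."""
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    return window_ms * (1.0 - overlap)


def num_frames(clip_ms: float, window_ms: float, overlap: float) -> int:
    """Number of full windows that fit in a clip (no padding assumed)."""
    if clip_ms < window_ms:
        return 0
    hop = hop_ms(window_ms, overlap)
    return int((clip_ms - window_ms) // hop) + 1


for overlap in (0.0, 0.25, 0.5, 0.75):
    # Hypothetical 30 s clip with the 200 ms window used in the experiment.
    print(overlap, hop_ms(200, overlap), num_frames(30_000, 200, overlap))
```

At 75% overlap the same clip yields roughly four times as many frames as at 0% overlap, which is why denser framing increases the input size and, with it, the training cost.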

Table 6 Performance comparison under different overlap settings (accuracy: mean ± std; training time in seconds).

The results indicate that applying overlap generally improves classification performance for most feature types. For example, STFT accuracy improves from 78.03% (±1.21) without overlap to 78.78% (±1.16) with 75% overlap. A similar pattern is observed for the Gamma-tone feature, which increases from 69.61% (±1.02) to 71.46% (±1.24). Mel features also show moderate gains with overlap, although the improvement is less consistent. In contrast, the CQT feature shows a decrease in accuracy when the overlap exceeds 25%.

However, the benefit of overlapping comes at the cost of increased training time. Due to the denser frame segmentation, total training time grows significantly as overlap increases. For instance, STFT training time increases from 364 s at 0% overlap to 1229 s at 75% overlap. This highlights a trade-off between performance and computational cost. While overlapping can marginally enhance performance, particularly for STFT and Gamma-tone, it also leads to substantially higher resource consumption. As such, the no-overlap setting is adopted in our default configuration to ensure a good balance between accuracy and efficiency.

Effectiveness of frequency channelization

In this section, we evaluate the effectiveness of Frequency Channelization (FC) by applying it to three representative models: ResNet18, RCMoE-balance, and CFTAnet. The “FC” suffix indicates that the model employs frequency channelization. A comprehensive comparison of model complexity (parameters and MFLOPs), average training time, and classification accuracy (mean ± std over 10 trials) before and after applying FC is presented in Table 7 and visualized in Fig. 9.

Table 7 Comparison of model complexity (parameters and FLOPs), training time, and accuracy between FC and Non-FC versions of different methods across spectral features.
Fig. 9

Comparison of parameters (a, b), FLOPs (c, d), training time (e–h), and accuracy (i–l) for three methods with and without frequency channelization.

Table 7 and Fig. 9 demonstrate that introducing frequency channelization leads to a marginal increase in parameter count, ranging from approximately 0.4 to 1.25 million across all models and feature types. Despite this minor increase in model size, the reduction in computational cost is substantial. For example, the number of FLOPs drops by over 90% for ResNet18 and RCMoE-balance under all feature types, and by more than 50% for CFTAnet. These reductions translate into significantly lower training times. ResNet18 sees the largest improvement, with training time reduced by more than 660 seconds, while CFTAnet also benefits with reductions of over 230 seconds across most settings. These results confirm that FC enhances computational efficiency considerably, while introducing only a modest increase in model complexity.
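One way to see why parameters can rise while FLOPs fall is to count the cost of a single convolution. The sketch below is a back-of-the-envelope illustration, not the layer configuration used in the paper. It assumes that a channelized network treats the frequency bins as input channels and convolves only along time; the 400 × 100 spectrogram size, kernel sizes, and filter counts are all hypothetical.

```python
def conv_cost(c_in, c_out, kernel_elems, out_positions):
    """Return (parameters, multiply-accumulates) of one bias-free convolution."""
    params = c_in * c_out * kernel_elems
    macs = params * out_positions
    return params, macs


# Hypothetical spectrogram: 400 frequency bins x 100 time frames.
F, T = 400, 100

# Conventional 2D input: 1 channel, 3x3 kernel, 64 filters,
# evaluated at every (frequency, time) position.
params_2d, macs_2d = conv_cost(1, 64, 3 * 3, F * T)

# Channelized input (assumed): 400 channels, kernel of 3 along time,
# 64 filters, evaluated at every time position only.
params_fc, macs_fc = conv_cost(F, 64, 3, T)

# A later 64 -> 64 layer: 2D feature map versus time-only feature map.
_, macs_2d_deep = conv_cost(64, 64, 3 * 3, F * T)
_, macs_fc_deep = conv_cost(64, 64, 3, T)

print(params_2d, params_fc)        # first-layer parameters
print(macs_2d, macs_fc)            # first-layer multiply-accumulates
print(macs_2d_deep, macs_fc_deep)  # later-layer multiply-accumulates
```

In this toy setting the first layer gains parameters because it has many more input channels, while every layer is evaluated at far fewer output positions. That is the same qualitative pattern as in Table 7, though the actual magnitudes depend on the real architectures.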

In terms of classification accuracy, FC yields consistent performance improvements for CFTAnet across all spectral features. Notably, the gain is particularly pronounced for the Gamma-tone input, where the accuracy improves by 5.63%. In contrast, ResNet18 and RCMoE-balance exhibit slight drops in accuracy (generally within 2–4%), suggesting that the standard ResNet18 backbone may not be optimally aligned with frequency-channelized inputs. Since RCMoE-balance also builds on ResNet18, this pattern reinforces the hypothesis that architectural compatibility plays a key role in leveraging the benefits of FC.

Overall, these results confirm that frequency channelization significantly enhances training efficiency while preserving, and in some cases improving, classification accuracy, particularly when paired with lightweight models such as CFTAnet. This highlights the importance of designing network architectures that are structurally adapted to frequency-segmented inputs.

Exploring optimal network architectures

In this section, we aim to explore network architectures that are better suited for frequency channelization. The network architecture of ResNet18 is given in Fig. 10, where the numbers 64, 128, 256, and 512 represent the feature channels generated at each layer. The diagram shows a progressive increase in the number of channels across the layers. However, after applying frequency channelization, the number of channels is 132 for CQT and 400 for other features, both significantly higher than the initial 64 channels after the first convolutional layer of ResNet18. This suggests that our input features undergo a rapid decrease in channel numbers before gradually increasing again, which we suspect may negatively impact the performance of frequency channelization. To further investigate, we conduct comparative experiments between descending and ascending channel configurations and evaluate different layer depths to assess their impact on performance.

In our experiments, the initial convolution layer is treated as the first layer, with the subsequent four groups of residual blocks treated as layers 2, 3, 4, and 5. The number of channels is adjusted uniformly for each group, and network depth is controlled by reducing the number of residual blocks. We use STFT as input features and set the maximum channel number to 256, since the final channel number in ResNet18 (512) exceeds our input’s channel number (400). Tables 8 and 9 compare accuracy (mean ± std over 10 trials), parameters, MFLOPs, and average training time under the ascending and descending channel configurations, respectively. In both tables, rows where the model achieves relatively higher accuracy are highlighted in bold.
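The two channel configurations can be thought of as the same list of layer widths read in opposite directions. The sketch below assumes that widths change by a factor of two between consecutive layers, as they do in ResNet18 (64, 128, 256, 512). The exact schedules behind Tables 8 and 9 are not reproduced here, and `channel_schedule` is a hypothetical helper.

```python
def channel_schedule(max_channels, num_layers, descending=True):
    """Layer widths that halve from max_channels (descending) or
    double up to max_channels (ascending). Halving is an assumption."""
    widths = [max_channels // (2 ** i) for i in range(num_layers)]
    return widths if descending else widths[::-1]


print(channel_schedule(512, 4, descending=False))  # ResNet18-style ascending widths
print(channel_schedule(256, 3, descending=True))   # a descending variant
print(channel_schedule(128, 2, descending=True))   # a two-layer descending variant
```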

Fig. 10

The network architecture of ResNet18.

The results show that the descending channel configuration consistently achieves higher accuracy than the ascending configuration under the same conditions, with the difference becoming more pronounced as the number of layers increases. However, the descending configuration requires substantially more parameters and higher computational complexity, suggesting a trade-off between accuracy and efficiency.

Table 8 Accuracy, parameters, FLOPs, and average training time for different maximum channel numbers and layer counts under the ascending channel configuration. Accuracy greater than 76% is highlighted.
Table 9 Accuracy, parameters, FLOPs, and average training time for different maximum channel numbers and layer counts under the descending channel configuration. Accuracy greater than 76% is highlighted.

From Table 9, we conclude that maximum channel numbers of 256 and 128 yield the best accuracy. Furthermore, when the number of layers exceeds 2, accuracy differences are minimal. Therefore, to balance accuracy and efficiency, we select the 128\(\rightarrow\)64 configuration for further experiments, considering that the CQT feature input has 132 channels.

In analyzing the initial convolution layer, we find that its kernel size has a significant impact on network performance. The original ResNet18 uses a kernel size of 7 with a stride of 2. We adjust the stride to 1 and compare the impact of different kernel sizes on network performance.

Figure 11 compares different kernel sizes in terms of accuracy (mean over 10 trials), parameters, MFLOPs, and average training time. A kernel size of 3 achieves the best performance with the fewest parameters and the lowest complexity.

Based on these results, we name our optimal network FCResNet5. The network has 5 convolutional layers: an initial convolution layer followed by a single residual group. The group comprises two residual blocks, and each block contains two convolutional layers (1 + 2 × 2 = 5). The name FCResNet5 reflects its simplified ResNet structure, lightweight design, and frequency channelization.
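The layer count can be checked with simple arithmetic. The sketch below counts only the main-path convolutions (the initial convolution plus those inside residual blocks) and ignores shortcut projections. Under that convention, ResNet18 has 17 convolutional layers, and its name additionally counts the final fully connected layer. The helper name is hypothetical.

```python
def conv_layer_count(num_groups, blocks_per_group=2, convs_per_block=2):
    """Initial convolution plus the convolutions inside the residual groups."""
    return 1 + num_groups * blocks_per_group * convs_per_block


print(conv_layer_count(1))  # single residual group, as described for FCResNet5
print(conv_layer_count(4))  # four residual groups, as in ResNet18
```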

Fig. 11

Comparison of different kernel sizes. (a) accuracy, (b) number of parameters, (c) computational complexity measured in MFLOPs, and (d) average training time.

Comparative experiment

To comprehensively assess the effectiveness of the proposed FCResNet5, we design two complementary comparative experiments. The first focuses on evaluating classification performance using four types of time-frequency spectral features: STFT, Mel, CQT, and Gamma-tone. The second assesses the robustness of different models under varying signal-to-noise ratio (SNR) levels, simulating real-world scenarios with degraded acoustic quality. These two perspectives jointly offer a holistic evaluation of both discriminative capacity and noise resilience.

Performance across spectral features

In the first part of the comparative study, we evaluate the classification performance of FCResNet5 against several representative methods, including RCMoE-balance [37], CFTAnet [24], and the widely used backbone network ResNet18 [8, 26, 37, 42]. The dataset is divided into five distinct training and testing folds, as detailed in Table 10, ensuring that the evaluation covers diverse data distributions. We report the average classification accuracy along with standard deviation over ten independent runs, as well as model complexity metrics including parameter count, MFLOPs, and average training time. Furthermore, to support the reliability of the comparisons, we perform statistical significance tests using independent two-sample t-tests (\(n=50\)), with a Bonferroni-corrected threshold of \(p=0.0167\).
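The Bonferroni-corrected threshold follows from dividing the significance level by the number of comparisons. The sketch below assumes a base level of 0.05 and three comparisons (one per baseline), which reproduces 0.0167. It also computes a pooled-variance two-sample t statistic; whether the paper used the pooled or the Welch variant is not stated, so that choice is an assumption. The toy samples are hypothetical, and the p-value lookup is omitted because it needs the t-distribution, which the standard library does not provide.

```python
from math import sqrt
from statistics import mean, variance


def bonferroni_threshold(alpha, num_comparisons):
    """Per-comparison significance level after Bonferroni correction."""
    return alpha / num_comparisons


def pooled_t_statistic(a, b):
    """Student's two-sample t statistic (equal variances assumed)
    and its degrees of freedom."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / sqrt(pooled_var * (1 / na + 1 / nb))
    return t, na + nb - 2


print(round(bonferroni_threshold(0.05, 3), 4))

# Hypothetical toy accuracies; the paper compares groups of 50 values each.
t_stat, dof = pooled_t_statistic([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
print(t_stat, dof)
```

If each group holds 50 accuracy values, the pooled test has 50 + 50 − 2 = 98 degrees of freedom.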

Table 10 Dataset split summary.
Table 11 Comparison of classification accuracy across different spectral features and methods without window overlap during feature extraction.
Table 12 Comparison of classification accuracy across different spectral features and methods with 50% window overlap during feature extraction.

As shown in Tables 11 and 12, we present the classification performance of FCResNet5 compared to three representative baselines under 0% and 50% window overlap settings. Columns 3 to 7 report the accuracy (mean ± std over 10 trials) and training time for each of the five dataset folds using four types of spectral features. Column 8 summarizes the average accuracy across all folds, while Column 9 provides the p-values from two-sample t-tests between FCResNet5 and other competing methods under each feature. The last two columns display the model complexity in terms of MFLOPs and parameter count, offering a complete view of model efficiency and cost.

The results in Tables 11 and 12 reveal that ResNet18 performs best under STFT and Mel inputs, while FCResNet5 consistently ranks second. However, when using CQT and Gamma-tone features, FCResNet5 achieves the highest accuracy across both overlap settings, outperforming all baseline models. For instance, with no overlap, FCResNet5 achieves 66.38% on CQT and 66.98% on Gamma-tone, surpassing ResNet18 by 1.39% and 2.63%, respectively. Under 50% overlap, this advantage is maintained, with FCResNet5 reaching 66.98% and 66.90% on CQT and Gamma-tone, respectively.

While performance differences on STFT and Mel are relatively small, FCResNet5 holds a consistent advantage on CQT and Gamma-tone, which demonstrates its adaptability across diverse input features. Given the small gaps on STFT and Mel and the identical feature extraction settings, those variations are more likely attributable to statistical fluctuation than to systematic feature dependence.

In addition to accuracy, FCResNet5 maintains significant efficiency benefits. It offers a substantial reduction in the number of parameters compared to ResNet18 and RCMoE-balance, with MFLOPs reduced by over 90% and shorter training times across all features. These results confirm that FCResNet5 achieves competitive performance while substantially reducing computational cost, making it a strong candidate for deployment in resource-constrained environments.

To assess performance differences, we conduct independent two-sample t-tests (Bonferroni-corrected, \(p < 0.0167\)) to compare FCResNet5 with baseline models under both non-overlap and 50% overlap conditions. Under the non-overlap setting, FCResNet5 shows significant advantages over ResNet18 and RCMoE-balance under the Gamma-tone feature (\(p < 0.001\)), while differences under STFT, Mel, and CQT are not statistically significant (\(p > 0.0167\)) despite variations in average accuracy. Under the 50% overlap setting, FCResNet5's average accuracy under the Mel feature is only marginally lower than that of ResNet18, yet the difference is statistically significant. For STFT, although ResNet18 again yields higher mean accuracy, the difference remains insignificant (\(p > 0.0167\)). These results indicate that while FCResNet5 offers competitive or superior performance across various scenarios, its improvements are feature-dependent and not always statistically significant. Combined with its lower computational cost, this supports its practical value in resource-constrained environments, where balancing accuracy and efficiency is critical.

To further interpret the classification behavior, Fig. 12 presents the normalized confusion matrices of FCResNet5 using the four different spectral features. These matrices reflect the average performance over 10 repeated runs on the test set. Among the four target classes, Tug is generally recognized with the highest recall across all feature types. STFT provides the most balanced classification, yielding strong performance for Passengership (0.77), while Mel leads in accuracy for the Cargo class (0.74). In contrast, CQT and Gamma-tone representations result in relatively higher inter-class confusion, especially for Cargo and Tanker. For instance, under Gamma-tone, the recall for Cargo and Tanker drops to 0.58 and 0.66, respectively, reflecting a noticeable decline in discriminability.
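A normalized confusion matrix of this kind is obtained by dividing each row of raw counts by its row total, so that the diagonal entries are per-class recall. The sketch below assumes rows index the true class and columns the predicted class. The counts are hypothetical and cover only two classes for brevity.

```python
def row_normalize(counts):
    """Divide each row of a confusion matrix by its total (rows = true class)."""
    normalized = []
    for row in counts:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized


def recalls(counts):
    """Per-class recall, i.e. the diagonal of the row-normalized matrix."""
    norm = row_normalize(counts)
    return [norm[i][i] for i in range(len(norm))]


# Hypothetical counts for two classes (not taken from Fig. 12).
toy_counts = [[8, 2],
              [1, 9]]
print(row_normalize(toy_counts))
print(recalls(toy_counts))
```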

Fig. 12

Confusion matrices of FCResNet5 using: (a) STFT, (b) Mel, (c) CQT, and (d) Gamma-tone on the test set.

Robustness evaluation under different SNR conditions

To evaluate the robustness of different models in noisy environments, we simulate additive Gaussian noise at various SNR levels: clean (no noise), 10 dB, 5 dB, 0 dB, \(-5\) dB, and \(-10\) dB. Noise is added directly to the raw audio signals before feature extraction. We compare our proposed model FCResNet5 with a standard ResNet18 baseline, as well as two competitive methods from the literature: RCMoE-balance and CFTAnet. Each setting is evaluated over 10 independent runs, and we report the average classification accuracy across these runs.

Figure 13 illustrates the classification performance of different models under varying SNR conditions. As the noise level increases, all models exhibit a consistent decline in accuracy, indicating sensitivity to noise interference. FCResNet5 achieves the highest accuracy when the SNR is above 0 dB, suggesting its suitability for clean to moderately noisy environments. However, as the SNR drops below 0 dB, its performance deteriorates more noticeably, and CFTAnet slightly surpasses it in extreme noise conditions (e.g., \(-5\) dB and \(-10\) dB). These results indicate that while all models are vulnerable to severe noise, the overall robustness across models remains comparable. Enhancing low-SNR resilience without sacrificing model efficiency remains an important direction for future investigation.
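Adding Gaussian noise at a target SNR is usually done by scaling the noise power relative to the measured signal power, using noise power = signal power / 10^(SNR/10). The sketch below illustrates this with the standard library. The function names and the sine-wave test signal are hypothetical, and this is one common recipe rather than the paper's exact procedure.

```python
import math
import random


def add_gaussian_noise(signal, snr_db, rng=None):
    """Return signal plus white Gaussian noise scaled to the target SNR (dB)."""
    rng = rng or random.Random()
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def measured_snr_db(clean, noisy):
    """Empirical SNR in dB between a clean signal and its noisy version."""
    signal_power = sum(c * c for c in clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy))
    return 10 * math.log10(signal_power / noise_power)


# Hypothetical test signal: a sine wave with 20,000 samples.
clean = [math.sin(2 * math.pi * 0.01 * i) for i in range(20_000)]
for snr in (10, 5, 0, -5, -10):
    noisy = add_gaussian_noise(clean, snr, random.Random(0))
    print(snr, round(measured_snr_db(clean, noisy), 2))
```

The measured SNR differs slightly from the target because the noise is a finite random sample. The gap shrinks as the signal gets longer.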

Fig. 13

Classification accuracy of four models under different SNR levels.


