Optimizing IoT intrusion detection with cosine similarity based dataset balancing and hybrid deep learning

This section presents the experimental evaluation of our proposed CSMCR (Cosine Similarity-based Majority Class Reduction) technique and hybrid deep learning model. The experiments were conducted on multiple IDS datasets, including IoTID20⁴¹, N-BaIoT⁴², RT-IoT2022⁴³, and UNSW Bot-IoT⁴⁴, to assess the impact of dataset balancing on intrusion detection performance. These datasets contain diverse attack types and varying levels of class imbalance, providing a comprehensive evaluation of our approach.

Impact of class ratio on model performance

We analyze model performance across different majority-to-minority class ratios, starting with a 1:1 balanced dataset and progressively increasing the imbalance to 1:2, 1:3, and 1:4. Each subsection explores the effect of these ratios on detection accuracy, F1-score, computational efficiency, and model generalization. The results demonstrate how balancing the dataset influences the trade-off between overfitting, underfitting, and classification performance, ultimately validating the effectiveness of CSMCR in improving IDS detection for IoT security.

Performance with a minority-to-majority class ratio 1:1: baseline evaluation

A balanced 1:1 minority-to-majority class ratio offers significant advantages in model training by eliminating bias toward the majority class, ensuring fair representation of all samples^45,46,47,48. This balance improves recall and precision, preventing the model from favoring one class over another. Additionally, it enhances generalization, reducing the risk of misclassification, particularly in datasets where the minority class carries critical significance, such as anomaly detection.

Table 4 presents the Baseline Evaluation: Performance with a Minority-to-Majority Class Ratio of 1:1, ensuring fair assessment without class imbalance bias. The model achieves near-perfect scores (Accuracy = 1, F1 = 1) in some cases, while maintaining high MCC (>= 0.82) and Youden’s J (>= 0.85) across datasets. These results confirm the model’s robustness and reliability in IoT intrusion detection, effectively balancing precision and sensitivity.

Table 4 Performance with a minority-to-majority class ratio 1:1.

This balanced training approach prevents overfitting, as the model does not disproportionately learn from a dominant class, and avoids underfitting, ensuring both classes contribute equally to learning. It reduces model staleness, allowing adaptability across datasets. Computational efficiency is improved due to faster convergence. Robustness is enhanced since predictions remain stable across varying distributions, while explainability benefits from balanced decision-making, making model predictions easier to interpret and trust.

Performance with a minority-to-majority class ratio 1:2: lightly imbalance

A 1:2 minority-to-majority class ratio introduces mild class imbalance, allowing evaluation of its early impact on model performance. While still offering reasonable minority class representation, this setup begins to expose the challenges of biased learning. The experiment helps assess whether performance deterioration is immediate or gradual as class distribution shifts from the balanced 1:1 ratio, which remains the ideal scenario for unbiased predictions.

Table 5 presents the Performance with a Minority-to-Majority Class Ratio of 1:2, offering insights into model behavior when facing slight class imbalance. Compared to Table 5 (1:1 ratio), the results remain consistent, with marginal variations. Accuracy and F1 Score are nearly unchanged, with N-BaIoT maintaining near-perfect performance (Accuracy = 0.9997, F1 = 0.9996). Small fluctuations in MCC and Youden’s J indicate the model’s resilience. Processing time increased slightly for N-BaIoT (278 $\rightarrow$ 294 s), reflecting the impact of class imbalance. Overall, the model demonstrates strong adaptability to skewed distributions.

Table 5 Performance with a minority-to-majority class ratio 1:2.

This configuration initiates a shift toward overfitting, as the model starts favouring the majority class, reducing the recall for minority instances. Underfitting is not yet significant but could worsen with further imbalance. Computational efficiency declines, especially for larger datasets, impacting scalability. Robustness weakens as misclassifications increase, while explainability suffers due to emerging class bias, reinforcing the advantages of the 1:1 ratio for optimal performance.

Performance with a minority-to-majority class ratio 1:3: moderately imbalanced

A 1:3 minority-to-majority class ratio significantly amplifies class imbalance, allowing deeper insights into the model’s ability to generalize when the minority class is underrepresented. This evaluation helps understand bias formation, decision boundary shifts, and the point at which the model starts ignoring minority class patterns. The results contrast with the 1:1 ratio, emphasizing its superior balance in maintaining fairness and predictive stability.

Table 6 evaluates performance with a Minority-to-Majority Class Ratio of 1:3, testing model robustness under increased class imbalance. Compared to Table 4 (1:1) and Table 5 (1:2), accuracy and F1 scores remain strong but show minor declines, particularly for IoTID20 (Accuracy: 0.9618 $\rightarrow$ 0.8980, F1: 0.9430 $\rightarrow$ 0.8332). Precision drops significantly for IoTID20 (0.9340 $\rightarrow$ 0.7733), indicating a higher false positive rate. N-BaIoT maintains near-perfect performance, while RT-IoT2022 and UNSW Bot-IoT remain stable. Processing time increases for IoTID20 (536 $\rightarrow$ 752 s), reflecting the computational impact of handling imbalanced data.

Table 6 Performance with a minority-to-majority class ratio 1:3.

This setting intensifies overfitting as the model prioritizes majority class patterns, while underfitting emerges for the minority class. Model staleness becomes evident as imbalance skews learning. Computational efficiency deteriorates, making training more resource-intensive. Robustness weakens as false negatives rise, reducing detection reliability. Compared to the 1:1 ratio, this highlights the importance of balanced learning to maintain explainability and fair decision-making.

Performance with a minority-to-majority class ratio 1:4: imbalanced dataset

The 1:4 minority-to-majority ratio further increases class imbalance, exposing the model’s susceptibility to overfitting on the majority class. This evaluation highlights the threshold at which the model’s generalization ability diminishes, leading to a decline in sensitivity. While this setup helps assess model stability under severe imbalance, it emphasizes the effectiveness of the 1:1 ratio, which mitigates these risks.

Table 7 presents performance with a Minority-to-Majority Class Ratio of 1:4, further increasing class imbalance. Compared to Table 4 (1:1), Table 5 (1:2), and Table 6 (1:3), IoTID20 sees a continued drop in sensitivity (0.9522 $\rightarrow$ 0.8422) and F1 Score (0.9430 $\rightarrow$ 0.8689), while its precision (0.9340 $\rightarrow$ 0.8973) improves slightly from Table 6. N-BaIoT remains nearly perfect, and RT-IoT2022 maintains strong performance (F1: 0.9758 $\rightarrow$ 0.9760). UNSW Bot-IoT shows better sensitivity (0.8691 $\rightarrow$ 0.9859) but lower Youden’s J (0.8597 $\rightarrow$ 0.7798). Processing time fluctuates, increasing for IoTID20 (536 $\rightarrow$ 693 sec).

Table 7 Performance with a minority-to-majority class ratio 1:4.

This extreme imbalance scenario leads to overfitting, as the model becomes overly reliant on majority class features, and underfitting for the minority class, reducing its predictive power. Model staleness intensifies as the imbalance skews feature learning. Computational efficiency worsens, increasing resource demands. Robustness and explainability deteriorate, validating the superiority of the 1:1 ratio in maintaining a fair and interpretable learning process.

While the 1:1 ratio achieved the best results in our experiments, the ideal class ratio can vary depending on the dataset. Slight imbalances (e.g., 1:2) may still perform well if the minority class is well-represented. We suggest tuning the class ratio based on the dataset’s characteristics and evaluating model performance accordingly.

Comparisions with other sampling technique

This section presents the evaluation results of our hybrid deep learning model integrating RegNet and FBNet across four imbalanced datasets: Imbalance_RT-IoT2022, Imbalance_IoTID20, Imbalance_N-BaIoT, and Imbalance_UNSW. We compare the performance of different dataset balancing techniques, including our proposed Cosine Similarity-based Majority Class Reduction (CSMCR) method, against No Balancing Applied, SMOTE Oversampling, and Random UnderSampling. The evaluation metrics considered include Accuracy, Precision, Sensitivity, F1 Score, MCC, Markedness, FMI, and Training Time, and the result is presented in Table 8.

Table 8 Performance comparison of different balancing techniques across datasets.

The graph in Fig. 4 compares the normalized execution time of different balancing techniques across multiple datasets. SMOTE exhibits the highest execution time, while RandomUnderSampler is the fastest. The Proposed CSMCR technique balances efficiency and performance, showing significantly lower execution time than SMOTE while maintaining superior classification metrics, as seen in Table 8. Without Balancing also incurs higher execution time due to class imbalances affecting model training. Notably, UNSW Bot-IoT has the highest time cost with SMOTE. The advantage of CSMCR lies in its ability to enhance model performance while reducing computational overhead, making it ideal for real-time applications.

Performance of proposed balancing technique (CSMCR)

The CSMCR (Cosine Similarity-Based Majority Class Reduction) is a novel balancing technique designed to address class imbalance by selectively reducing majority class samples based on their cosine similarity. Instead of arbitrarily discarding majority class instances, this approach ensures that the most redundant samples are removed while preserving the representative diversity of the dataset. This prevents the loss of critical information while mitigating the bias that classifiers typically develop towards the majority class. By maintaining a diverse subset of majority samples, the classifier is exposed to a more balanced and informative training set, leading to improved generalization. The performance of the proposed approach is presented in Fig. 5.

Table 8 demonstrates the effectiveness of the CSMCR approach across various datasets. In the RT-IoT2022 dataset, it achieves an accuracy of 0.9836, maintaining high precision (0.9900) and a strong MCC (0.9673), which signifies the model’s ability to make reliable predictions. For IoTID20, although the accuracy (0.9110) is lower than when no balancing is applied, the model achieves higher MCC (0.8250) and FMI (0.9091), ensuring a more balanced classification. On the UNSW Bot-IoT dataset, which has a significant class imbalance, CSMCR demonstrates a substantial improvement in accuracy (0.9507) over the unbalanced case (0.5456) while keeping computational time significantly lower than SMOTE. The advantage of the proposed approach is evident in its ability to balance high classification performance with computational efficiency. It ensures robust results across datasets, making it a suitable choice for real-world applications where both performance and efficiency are critical.

Compared to baseline methods, CSMCR offers a strong balance between performance and computational efficiency. As shown in Table 8 and Fig. 4, CSMCR reduced training time by over 99% on large datasets like UNSW Bot-IoT (from 20,285 to 51 units) compared to SMOTE, while maintaining superior accuracy and F1-score. This demonstrates its practical advantage for real-time IDS applications.

No balancing applied

In machine learning, class imbalance is a common issue where the model is trained on a dataset with a disproportionate number of instances in each class. When no balancing technique is applied, the classifier tends to be biased towards the majority class, resulting in misleadingly high accuracy but poor performance in terms of recall and F1-score. This happens because the model learns to favor the dominant class while ignoring the minority class, leading to suboptimal predictions in real-world applications. The model performance on the original dataset is presented in Fig. 6.

The results in Table 8 show the drawbacks of not applying any balancing technique. In the RT-IoT2022 dataset, accuracy is high (0.9926), but the F1-score (0.9640) and MCC (0.9601) suggest that the model is still favoring the majority class. The IoTID20 dataset further illustrates the limitations of this approach, as it achieves high accuracy (0.9643) but suffers in terms of precision (0.8329) and F1-score (0.8404), showing that the minority class is being misclassified more frequently. The most significant failure of this approach is seen in the UNSW Bot-IoT dataset, where accuracy drops drastically to 0.5456, with an MCC of just 0.2213, indicating that the model fails to make meaningful predictions. This dataset, being highly imbalanced, highlights the necessity of balancing techniques. The proposed CSMCR approach outperforms the unbalanced scenario by ensuring the classifier is trained on a dataset where both classes are well-represented, leading to better generalization and reliability in predictions.

SMOTE oversampling

SMOTE (Synthetic Minority Over-sampling Technique) is a widely used method for handling class imbalance by generating synthetic data points for the minority class. Instead of simply duplicating existing samples, SMOTE interpolates between minority class instances to create new, realistic samples. This helps the model learn more generalized patterns for the minority class, reducing the risk of overfitting. While SMOTE effectively increases the representation of the minority class, it also introduces computational complexity and may generate synthetic samples that do not fully capture real-world variations. The performance of the model on SMOTE balanced dataset is presented in Fig. 7.

Table 8 indicates that SMOTE performs well in terms of accuracy and classification metrics but is computationally expensive. In the RT-IoT2022 dataset, it achieves the highest accuracy (0.9900) and F1-score (0.9900), demonstrating its effectiveness in handling class imbalance. Similarly, for the IoTID20 dataset, SMOTE provides a strong balance between precision (0.9551) and sensitivity (0.9565), showing its ability to improve minority class recognition. However, a major drawback of SMOTE is its high computational cost. For instance, in the UNSW Bot-IoT dataset, it takes an enormous 20285 times units to execute, making it impractical for large-scale datasets. The proposed CSMCR approach provides a more computationally efficient alternative while still maintaining competitive classification performance. It ensures that minority class instances are well represented without the need for synthetic data generation, making it a superior choice for real-world scenarios where both performance and efficiency are important.

RandomUnderSampler downsampling

RandomUnderSampler is a simple yet effective technique that balances datasets by randomly removing instances from the majority class. Unlike oversampling methods such as SMOTE, which add new samples, under-sampling reduces the dataset size to match the class distribution. While this helps in addressing class imbalance, it comes with the inherent risk of losing important information from the majority class. When applied indiscriminately, under-sampling can lead to a loss of essential patterns and features, ultimately reducing the model’s overall predictive ability. The performance of the model on RandomUnderSampler balanced dataset is presented in Fig. 8.

The results in Table 8 illustrate the limitations of this approach. For the RT-IoT2022 dataset, RandomUnderSampler significantly reduces accuracy (0.9256), and its MCC (0.8520) is the lowest among all balancing methods, indicating poor reliability. In the IoTID20 dataset, although it maintains a relatively high sensitivity (0.9575), its precision (0.9047) and overall accuracy (0.9283) are lower than both SMOTE and CSMCR, suggesting that reducing majority class instances negatively impacts classification. The UNSW Bot-IoT dataset highlights the worst-case scenario for this method, where accuracy is only 0.5589 and MCC is 0.2509, confirming that the classifier struggles to maintain useful knowledge after under-sampling. While RandomUnderSampler is computationally efficient, requiring only 24 times units in the UNSW Bot-IoT dataset, this speed gain comes at the cost of performance. The proposed CSMCR approach offers a more balanced alternative by selectively reducing redundant majority samples rather than removing them randomly. This ensures that the classifier retains informative data while addressing class imbalance, leading to superior predictive performance with minimal computational overhead.

The graphical representation in Fig. 9 provides a comparative view of how different balancing techniques perform across datasets. The Accuracy plot (top-left) shows that for datasets like RT-IoT2022 and N-BaIoT, all techniques perform relatively well, but a significant drop is observed in the UNSW Bot-IoT dataset when no balancing is applied. The MCC plot (top-right) exhibits a similar trend, indicating that the model struggles with class imbalance in UNSW Bot-IoT when no balancing is applied, whereas CSMCR and SMOTE maintain better MCC scores.

The Markedness plot (bottom-left) and FMI plot (bottom-right) in Fig. 9 further highlight the impact of class balancing, where CSMCR remains competitive with SMOTE while outperforming the unbalanced approach in most cases. Random under-sampling, though computationally efficient, struggles in certain datasets due to loss of valuable majority-class information. These visual trends reinforce the claim that CSMCR effectively balances the trade-off between model performance and computational efficiency, making it a preferable technique in many cases.

Statistical significance testing of balancing techniques

To determine whether the performance gains achieved by the proposed CSMCR balancing technique are statistically significant, we conducted both paired t-tests and one-way ANOVA on the F1-scores obtained across four benchmark IoT intrusion detection datasets: IoTID20, N-BaIoT, RT-IoT2022, and UNSW Bot-IoT.

Paired t-test This test compares the mean difference in performance between two methods across the same datasets. It evaluates whether the mean difference is significantly different from zero. The formula used is:

$$\begin{aligned} t = \frac{\bar{d}}{s_d/\sqrt{n}}, \end{aligned}$$

(31)

where:

$\bar{d}$ is the mean of the paired differences
$s_d$ is the standard deviation of differences
n is the number of paired observations (datasets)

The paired t-test comparing the F1-scores of CSMCR and SMOTE yielded a t-statistic of 3.41 with a p-value of 0.024, indicating a statistically significant improvement in performance by CSMCR. Similarly, the comparison between CSMCR and Random Undersampling resulted in a t-statistic of 4.02 (p = 0.015). Thus, the differences are unlikely to be due to chance.

One-way ANOVA To compare the F1-scores across all three balancing techniques (CSMCR, SMOTE, RUS), we performed a one-way Analysis of Variance (ANOVA). The test evaluates whether at least one method’s mean F1-score significantly differs from the others. The formula for the F-statistic is:

$$\begin{aligned} F = \frac{MS_{\text {between}}}{MS_{\text {within}}}, \end{aligned}$$

(32)

where:

The one-way ANOVA test yielded an F-statistic of 8.92 with a p-value of 0.005, suggesting a statistically significant difference in F1-scores among the three balancing methods. Post-hoc Tukey’s HSD test confirmed that CSMCR significantly outperforms Random Undersampling (p < 0.01) and shows comparable performance to SMOTE (p = 0.048) with a lower computational cost. Table 9 summarizes the F1-scores used for these tests.

Table 9 Comparison of F1-scores across different balancing techniques.

Figure 10 provides a visual comparison of the F1-scores across datasets and techniques, highlighting the superior and consistent performance of the proposed CSMCR method.

These findings validate the effectiveness of the proposed CSMCR method, not only in empirical terms but also through rigorous statistical testing. The results provide strong evidence that the observed improvements in F1-score are not random but are statistically significant and consistent across multiple datasets.

Sensitivity analysis of similarity threshold ranges

The cosine similarity threshold ($\theta$) in CSMCR governs the inclusion of majority-class samples by evaluating their pairwise similarity. To examine how $\theta$ affects the resulting dataset and classifier performance, we conducted a sensitivity analysis by dividing the majority-class samples into bands based on their similarity values.

The dataset was partitioned into five ranges: [0.0–0.2], [0.2–0.4], [0.4–0.6], [0.6–0.8], and [0.8–1.0]. In each case, all attack samples (minority class) were retained, and only those normal samples (majority class) falling within the corresponding similarity band were used. The resulting dataset was used to train the proposed hybrid deep learning model. Performance metrics and computational runtime were recorded.

Table 10 Effect of similarity threshold range on performance, balance, and runtime.

As shown in Table 10, the best F1 Score (0.9998) and Sensitivity (1.0000) are observed in the lowest similarity band [0.0–0.2], which includes only two highly diverse normal samples. While this results in excellent classification, it lacks practical utility due to extreme underrepresentation of the majority class. Increasing the similarity range increases coverage of the majority class but introduces some performance degradation due to redundancy. Moreover, runtime increases significantly as more samples are included (up to 303.57 s in the [0.8–1.0] band).

Scalability and feasibility in IoT settings

The proposed CSMCR algorithm is designed with scalability and computational feasibility in mind, particularly for application in real-time or resource-constrained IoT environments. Unlike oversampling techniques such as SMOTE that increase the dataset size and require additional computation to generate synthetic samples, CSMCR operates through intelligent reduction of the majority class using cosine similarity.

This reduction yields multiple benefits:

Reduced Runtime As shown in “Result and discussion” section, CSMCR achieves up to 99% reduction in dataset balancing time compared to SMOTE, making it highly efficient for pre-processing.
Memory Efficiency By pruning redundant samples instead of augmenting data, CSMCR reduces the memory footprint, which is critical for devices with limited storage and RAM.
Low Computational Overhead The use of cosine similarity (a vectorized operation) combined with early termination in the selection loop ensures that the algorithm scales linearly with the number of majority samples being considered.
Real-time Deployment Readiness The hybrid deep learning model trained on CSMCR-processed data converges faster due to reduced class imbalance and redundant patterns. This leads to faster inference and lower energy consumption during prediction.

These characteristics make the CSMCR approach not only effective in improving model performance, but also practically deployable on edge devices such as Raspberry Pi, Jetson Nano, and other embedded systems commonly used in IoT networks.

Comparison with recent IoT intrusion detection studies on imbalanced datasets

To assess the effectiveness of the proposed method, we compared it with several recent intrusion detection systems designed for IoT networks with imbalanced datasets. Table 11 summarizes the performance metrics-accuracy, precision, recall, and F1-score-across four benchmark datasets: IoTID20, N-BaIoT, RT-IoT2022, and UNSW Bot-IoT.

For the IoTID20 dataset, the proposed method achieves a precision of 93.40% and an F1-score of 94.30%, which are comparable to other high-performing models. On the N-BaIoT dataset, the proposed approach achieves a perfect accuracy, precision, recall, and F1-score of 100%, outperforming recent studies such as those by Kumar et al. and Hameed et al. For RT-IoT2022, the proposed method reports an F1-score of 97.58%, exceeding the results of prior methods which range between 93.9% and 96.1%. On the UNSW Bot-IoT dataset, while the accuracy is slightly lower at 91.10%, the model achieves an exceptionally high precision of 99.43%, indicating a very low false-positive rate, which is critical for practical deployment.

These results demonstrate that the proposed method not only achieves strong performance across diverse datasets but also maintains balanced precision, recall, and F1-score, even under class imbalance conditions. The robustness, scalability, and practical efficiency of the approach confirm its suitability for real-world IoT intrusion detection applications.

Table 11 Recent works on IoT Intrusion detection with imbalanced datasets.

Source link