Performance metrics
The conventional metrics of (1) accuracy, (2) precision, (3) specificity, (4) sensitivity, and (5) F1 score are used to assess the suggested Fuzzy-CNN model. Performance measures are typically built on the four primary outcomes of a binary classification test (true positive, false positive, true negative, and false negative). Let X and Y denote the two possible predicted classes. As Table 4 shows, Tp corresponds to positive samples classified as positive, Fn to positive samples classified as negative, Fp to negative samples classified as positive, and Tn to negative samples classified as negative. For a problem with n classes, these outcomes generalize to the following multi-class confusion matrix.
$$C=\left[\begin{array}{ccc}{c}_{11}& \cdots & {c}_{1n}\\ \vdots & \ddots & \vdots \\ {c}_{n1}& \cdots & {c}_{nn}\end{array}\right]$$
(6)
Here the rows of C correspond to the actual class and the columns to the classified class.
The confusion components for each class i are obtained as follows, where the true positives of class i are the diagonal entries of C.
$${T}_{pi}={c}_{ii}$$
$${F}_{pi}=\sum_{l=1}^{n}{c}_{li}-{T}_{pi}$$
$${F}_{ni}=\sum_{l=1}^{n}{c}_{il}-{T}_{pi}$$
$${T}_{ni}=\sum_{l=1}^{n}\sum_{k=1}^{n}{c}_{lk}-{T}_{pi}-{F}_{pi}-{F}_{ni}$$
$${ACC}_{i}=\frac{{T}_{pi}+{T}_{ni}}{{T}_{pi}+{F}_{ni}+{F}_{pi}+{T}_{ni}}$$
$$ACC=\frac{1}{n}\sum_{i=1}^{n}{ACC}_{i}$$
(7)
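The per-class components and the averaged accuracy of Eq. (7) can be sketched in a few lines of Python. This is a minimal illustration, assuming a square confusion matrix whose rows are actual classes and whose columns are classified classes; the function names and the 3-class matrix are hypothetical and are not taken from the study.

```python
def per_class_counts(c):
    """Return (Tp, Fp, Fn, Tn) for every class of a square confusion matrix.

    c[i][j] counts samples whose actual class is i and classified class is j.
    """
    n = len(c)
    total = sum(sum(row) for row in c)
    counts = []
    for i in range(n):
        tp = c[i][i]                                  # diagonal entry
        fp = sum(c[l][i] for l in range(n)) - tp      # column sum minus diagonal
        fn = sum(c[i][l] for l in range(n)) - tp      # row sum minus diagonal
        tn = total - tp - fp - fn                     # everything else
        counts.append((tp, fp, fn, tn))
    return counts


def macro_accuracy(c):
    """Unweighted mean of the per-class accuracies ACC_i."""
    accs = [(tp + tn) / (tp + fp + fn + tn)
            for tp, fp, fn, tn in per_class_counts(c)]
    return sum(accs) / len(accs)


# Hypothetical 3-class confusion matrix, for illustration only.
example = [
    [50, 2, 1],
    [3, 40, 2],
    [0, 1, 30],
]
print(per_class_counts(example))
print(macro_accuracy(example))
```

Note that the denominator of each per-class accuracy is the total number of samples, so every class is scored against the same population.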
The aforementioned calculations define the performance evaluation criteria, and a total of ten runs is used for MPA.
Average Accuracy (AVGAcc): ACC is the proportion of samples for which the classifier output matches the sample label. It is obtained by computing each class’s accuracy independently and then averaging the outcomes. The average accuracy (AVGAcc) of the best solution (the one with the lowest root mean square error across 100 iterations) is then determined with the following formula.
$${AVG}_{Acc}=\frac{1}{{N}_{r}}\sum_{k=1}^{{N}_{r}}{ACC}_{best}^{\left(k\right)}$$
(8)
Here, ACCbest(k) is the best accuracy over 100 iterations in run k, n is the number of classes, and Nr = 10 is the total number of runs.
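The run-level averaging of Eq. (8) can be sketched as follows. This is only an illustration under stated assumptions: each run is represented as a list of hypothetical (RMSE, accuracy) pairs, one per iteration, and the "best" iteration is taken to be the one with the lowest RMSE, as described in the text. The data layout and function names are not from the study.

```python
def best_accuracy(history):
    """Accuracy recorded at the iteration with the lowest RMSE in one run.

    history: list of (rmse, accuracy) pairs, one per iteration (assumed layout).
    """
    _, acc = min(history, key=lambda pair: pair[0])
    return acc


def average_best(runs):
    """Mean of the best accuracy over all runs (Nr = len(runs))."""
    if not runs:
        raise ValueError("at least one run is required")
    best = [best_accuracy(history) for history in runs]
    return sum(best) / len(best)


# Two hypothetical runs with a few iterations each.
runs = [
    [(0.30, 0.95), (0.12, 0.99), (0.20, 0.97)],
    [(0.25, 0.96), (0.10, 0.98)],
]
print(average_best(runs))
```

The same averaging applies unchanged to the sensitivity, specificity, precision, and F1 values defined in the following equations.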
Average Sensitivity (AVGSn): Sensitivity (Sn) measures the rate at which positive samples are correctly predicted. It is calculated for each class independently, and the results are then averaged as follows:
$${Sn}_{i}=\frac{{T}_{pi}}{{T}_{pi}+{F}_{ni}}$$
$$Sn=\frac{1}{n}\sum_{i=1}^{n}{Sn}_{i}$$
(9)
The following formula is used to determine AVGSn from the best solution of each run:
$${AVG}_{Sn}=\frac{1}{{N}_{r}}\sum_{k=1}^{{N}_{r}}{Sn}_{best}^{\left(k\right)}$$
(10)
Average Specificity (AVGSp): Specificity (Sp) measures the rate at which negative samples are correctly predicted. It is determined for each class independently, and the results are then averaged as follows:
$${Sp}_{i}=\frac{{T}_{ni}}{{F}_{pi}+{T}_{ni}}$$
$$Sp=\frac{1}{n}\sum_{i=1}^{n}{Sp}_{i}$$
(11)
AVGSp is determined as follows:
$${AVG}_{Sp}=\frac{1}{{N}_{r}}\sum_{k=1}^{{N}_{r}}{Sp}_{best}^{\left(k\right)}$$
(12)
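The class-averaged sensitivity and specificity of Eqs. (9) and (11) can be sketched as below. This is a minimal illustration that takes per-class (Tp, Fp, Fn, Tn) tuples as input and assumes every class has at least one positive and one negative sample, so no denominator is zero. The function names and counts are hypothetical.

```python
def macro_sensitivity(counts):
    """Unweighted mean of Tp / (Tp + Fn) over all classes."""
    values = [tp / (tp + fn) for tp, fp, fn, tn in counts]
    return sum(values) / len(values)


def macro_specificity(counts):
    """Unweighted mean of Tn / (Fp + Tn) over all classes."""
    values = [tn / (fp + tn) for tp, fp, fn, tn in counts]
    return sum(values) / len(values)


# Hypothetical per-class (Tp, Fp, Fn, Tn) counts for three classes.
counts = [(50, 3, 3, 73), (40, 3, 5, 81), (30, 3, 1, 95)]
print(macro_sensitivity(counts))
print(macro_specificity(counts))
```

Because the average is unweighted, a rare class contributes as much to Sn and Sp as a frequent one.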
Average Precision (AVGPr): Precision (Pr) is the proportion of samples assigned to a class that truly belong to it. It is determined for each class independently, and the results are then averaged as follows:
$${Pr}_{i}=\frac{{T}_{pi}}{{T}_{pi}+{F}_{pi}}$$
$$Pr=\frac{1}{n}\sum_{i=1}^{n}{Pr}_{i}$$
(13)
AVGPr is determined as follows:
$${AVG}_{Pr}=\frac{1}{{N}_{r}}\sum_{k=1}^{{N}_{r}}{Pr}_{best}^{\left(k\right)}$$
(14)
Average F1 Score (AVGF1): The F1 score (F1) is the harmonic mean of precision and sensitivity. It is determined for each class separately, and the results are then averaged as follows:
$${F1}_{i}=\frac{2\,{Pr}_{i}\,{Sn}_{i}}{{Pr}_{i}+{Sn}_{i}}=\frac{2{T}_{pi}}{2{T}_{pi}+{F}_{pi}+{F}_{ni}}$$
$$F1=\frac{1}{n}\sum_{i=1}^{n}{F1}_{i}$$
(15)
AVGF1 is determined as follows:
$${AVG}_{F1}=\frac{1}{{N}_{r}}\sum_{k=1}^{{N}_{r}}{F1}_{best}^{\left(k\right)}$$
(16)
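Class-averaged precision and F1 can be sketched in the same way. The example below uses the standard definition of F1 as the harmonic mean of precision and sensitivity, takes hypothetical per-class (Tp, Fp, Fn, Tn) tuples as input, and assumes every class has at least one true positive so that no denominator is zero. The function names are illustrative only.

```python
def macro_precision(counts):
    """Unweighted mean of Tp / (Tp + Fp) over all classes."""
    values = [tp / (tp + fp) for tp, fp, fn, tn in counts]
    return sum(values) / len(values)


def macro_f1(counts):
    """Unweighted mean of the per-class harmonic mean of precision and sensitivity."""
    values = []
    for tp, fp, fn, tn in counts:
        precision = tp / (tp + fp)
        sensitivity = tp / (tp + fn)
        values.append(2 * precision * sensitivity / (precision + sensitivity))
    return sum(values) / len(values)


# Hypothetical per-class (Tp, Fp, Fn, Tn) counts for three classes.
counts = [(50, 3, 3, 73), (40, 3, 5, 81), (30, 3, 1, 95)]
print(macro_precision(counts))
print(macro_f1(counts))
```

The per-class F1 computed here is algebraically equal to 2Tp / (2Tp + Fp + Fn), which gives a quick way to check individual values by hand.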
Three distinct datasets of ECG signals are used to train and evaluate the suggested model. The classification technique is very effective on both training and test signals. With the maximum number of layers formed from the ECG signals in the datasets and the learning rate parameter tuned by the POA method, the suggested system achieves average accuracies of 99.32%, 99.76%, and 99.47% on the three datasets. The optimization is terminated after 100 iterations. Combining the suggested CNN and fuzzy models allows the network to diagnose problems more accurately than either model alone, which improves the model’s ability to categorize cardiac signals with varying sequence lengths.
Figures 10 and 11 compare the accuracy and speed of the suggested approach with those of CNN. Fuzzy-CNN outperforms the standard CNN after five training epochs. This indicates that the optimization has found parameters for Fuzzy-CNN that train effectively and yield good results, and the model also takes less time to learn than CNN. Fuzzy-CNN’s average optimization computation time is 2547 s. Table 5 shows that all of the criteria are higher than 97.96%. At the class level, the ACC and Sn values are extremely high for all datasets (ACC > 99.14%, Sn > 97.96%). The model’s classification accuracy is nearly equal across classes and greatly enhanced. Class N has the lowest accuracy (99.14%, on the MIT-BIH dataset), while class F has the highest (99.81%, on the EDB dataset). The misclassification rate is 0.76% for class VEB and 0.41% for class S. The results for classes S and VEB are very encouraging and improve on previous comparable research. Notably, the AAMI criteria concentrate on classifying VEB and class S heartbeats (Table 6).
Table 5 also shows the advantages of the MPA-CNN model over the CNN model without parameter optimization. On the MIT-BIH dataset, MPA-CNN increases AVGAcc by 6.45% and AVGSn by 11.58%.

Fig. 10. Accuracy optimization using the MIT-BIH dataset.

Fig. 11. Optimizing learning time in Fuzzy-CNN using the MIT-BIH dataset.
Table 7 displays the confusion matrix, which visualizes the classification algorithm’s performance. The numbers on the main diagonal of this matrix are the true positives. The suggested technique is evaluated using the sensitivity and specificity metrics, which are unaffected by the number of segments. As seen in Fig. 12, the suggested model’s specificity (its capacity to correctly recognize the other rhythms when a given beat is considered) is greater than 90% for all seven arrhythmias and normal rhythms.

Fig. 12. Results of the proposed technique.
A brief examination of the false negative rates for the different arrhythmias in Fig. 12 indicates that the arrhythmias of interest are identified acceptably well. It was frequently difficult to ascertain whether the cardiologists or the annotation algorithm were correct because of the absence of context, the short signal duration, or the existence of only a single clue, which restricted the conclusions that could be drawn from the data. The model’s accuracy shows how well it worked.
To statistically evaluate the significance of the model performance, all metrics (overall accuracy, precision, recall, and F1 score) were calculated in 10 independent runs with a random partition of 70-15-15 (training-validation-test), and the mean and standard deviation results were reported. Paired t-test and one-way ANOVA on macro F1 values and sensitivity of critical classes V and S showed that the performance improvement of the proposed model over the comparative baselines (such as standard CNN, CNN-LSTM, and MPA-CNN) is statistically significant (p-value < 0.001 in all cases). This high stability of the model is due to the use of global POA optimization and the integration of visual and temporal-spectral features, which makes it robust to changes in the data distribution, such as different noise or new patients. To assess the generalizability of this method, future studies intend to use it to process patient signals in real environments and add the resulting data to the existing dataset.
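The paired comparison described above can be illustrated with a short sketch of the paired t statistic. It assumes that two models are evaluated on the same random partitions, so that their per-run macro F1 values form matched pairs. The values below are hypothetical, and the p-value lookup from the t distribution is deliberately omitted (a statistics package would normally supply it).

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Paired t statistic and degrees of freedom for matched samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)        # sample standard deviation (n - 1)
    # Raises ZeroDivisionError if all differences are identical (sd_d == 0).
    t = mean_d / (sd_d / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Hypothetical per-run macro F1 values for two models (not results from the study).
proposed = [0.97, 0.96, 0.98]
baseline = [0.95, 0.95, 0.95]
t_value, dof = paired_t_statistic(proposed, baseline)
print(t_value, dof)
```

A larger absolute t value with the same degrees of freedom corresponds to a smaller p-value, which is how a threshold such as p < 0.001 would be checked.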
Comparative study
We compared the proposed method with previous studies in terms of the dataset used, the feature extraction method, the classification model, and the classification outcome. As Table 8 illustrates, the results reported in those papers cover only five classes (four recognized classes and one unknown class). The suggested method achieves average accuracies of 99.31%, 99.76%, and 99.47% on the evaluated datasets. Fuzzy-CNN achieves the highest ACC and Sn among the compared methods. The comparison shows that the suggested method significantly improves the classification metrics, and the outcomes in Table 8 support its efficacy.
According to this table, the proposed algorithm, while simple to design and model, achieves good accuracy compared with complex methods such as convolutional neural networks [5, 28, 29, 30] and deep neural networks [31]. The ROC curve is an evaluation criterion that gives a graphical representation of the classifier’s detection capability. It is produced by plotting the true positive rate (TPR), also known as sensitivity, against the false positive rate (FPR), equal to 1 − specificity, at different threshold settings. The area under the curve (AUC) measures how well the classifier discriminates between classes: the closer the AUC is to 1, the better the model’s performance. As shown in Fig. 13, the model achieved almost perfect AUC in distinguishing between each arrhythmia class.
The Receiver Operating Characteristic (ROC) curve is a fundamental evaluation tool that illustrates the diagnostic ability of a classifier across all possible discrimination thresholds. Each point on the ROC curve represents a pair of sensitivity (true positive rate, TPR) and 1 − specificity (false positive rate, FPR) values corresponding to a particular decision threshold. The Area under the ROC Curve (AUC) quantifies the overall ability of the model to discriminate between classes: an AUC of 1.0 indicates perfect separation, whereas an AUC of 0.5 represents random guessing.
Although the ROC curve was originally introduced for binary classification problems, its extension to multiclass classification is well established and widely used in the biomedical literature. In this study, we employed one of the most widely used and well-validated approaches for multiclass ROC analysis, namely the one-versus-rest (OvR) macro-averaging method. For each of the seven classes, an independent binary ROC curve is drawn with that class as the positive class and all other classes as negative. The resulting AUC values are then averaged without any weighting to obtain an overall AUC value. This approach treats all classes equally, regardless of their size or frequency; hence, it is a suitable and reliable choice for imbalanced datasets such as MIT-BIH.
As shown in Fig. 13, the proposed POA-optimized Fuzzy-CNN model achieved a near-perfect macro-averaged AUC of 0.9994 for the 7-class task. Specifically, the clinically critical classes V (ventricular ectopic) and S (supraventricular ectopic) attained individual AUCs of 0.9996 and 0.9987, respectively, in the 7-class setting. These values confirm the excellent discriminative capability of the model even in highly imbalanced multi-class scenarios and validate the ROC curve as a highly appropriate and informative metric for both binary and multi-class arrhythmia classification problems.
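The one-versus-rest macro-averaging procedure can be sketched as follows. This is a minimal illustration, not the evaluation code of the study: the binary AUC is computed with the pairwise (Mann-Whitney) formulation, which equals the area under the ROC curve, and the labels and class scores are hypothetical.

```python
def binary_auc(scores, is_positive):
    """AUC as the probability that a positive sample outscores a negative one.

    Ties count as half a win. This O(P*N) form is fine for small examples.
    """
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    if not pos or not neg:
        raise ValueError("both positive and negative samples are required")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def macro_ovr_auc(y_true, y_score, n_classes):
    """Unweighted mean of the one-versus-rest AUCs, plus the per-class values."""
    aucs = []
    for c in range(n_classes):
        scores = [row[c] for row in y_score]      # score assigned to class c
        labels = [y == c for y in y_true]         # class c versus the rest
        aucs.append(binary_auc(scores, labels))
    return sum(aucs) / n_classes, aucs


# Hypothetical labels and class scores for six samples and three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_score = [
    [0.8, 0.1, 0.1],
    [0.4, 0.5, 0.1],
    [0.2, 0.7, 0.1],
    [0.5, 0.3, 0.2],
    [0.1, 0.2, 0.7],
    [0.2, 0.2, 0.6],
]
macro_auc, per_class_auc = macro_ovr_auc(y_true, y_score, 3)
print(macro_auc, per_class_auc)
```

Because each class contributes equally to the mean, a weak AUC on a rare class lowers the macro value as much as a weak AUC on a frequent class would.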

Fig. 13. Receiver operating characteristic curves for the predictions of the proposed model in 7 arrhythmia classes.
To enable direct comparison with the majority of published works, the proposed model was also evaluated using the standard AAMI EC57 grouping. Table 9 shows the results for the widely used 4-class task (N, S, V, F) in which Q beats are merged into the N category. On the 4-class problem, our model achieved 99.97% overall accuracy, 99.82% sensitivity for ventricular ectopic beats (V), and 99.61% sensitivity for supraventricular ectopic beats (S). These results either surpass or match the current state-of-the-art methods reported on the same 4-class MIT-BIH benchmark [10, 16, 30, 38].
Significance and statistical validation of extracted features
The arrhythmia classes selected in this study fully adhere to the AAMI EC57 standard and represent a comprehensive set of rhythms with well-established clinical implications. Among these, ventricular ectopic beats (VEB/V) are widely recognized as the most life-threatening class because frequent or complex VEBs are independent predictors of ventricular tachycardia, ventricular fibrillation, and sudden cardiac death, particularly in patients with structural heart disease or post-myocardial infarction.
Supraventricular extrasystoles (SVEB/S) are also of considerable clinical importance, as they often precede or accompany atrial fibrillation and other supraventricular tachyarrhythmias and significantly increase the risk of thromboembolic events, including stroke. They can also contribute to the development of tachycardia-induced cardiomyopathy. Left and right bundle branch block (LBBB and RBBB) beats are also recognized as important indicators of cardiac conduction system damage; their presence is influential in the risk stratification of patients with heart failure, influences decisions about cardiac resynchronization therapy, and has a prognostic role in acute coronary syndromes. Fusion (F) and unknown (Q) beats, although less common, pose significant diagnostic challenges because they may mimic pathological morphologies and their misclassification can lead to missed diagnoses or unnecessary interventions. The proposed Fuzzy-CNN model in this study was able to achieve 98.2% sensitivity for VEB and 97.8% for SVEB, while maintaining 99.9% specificity across all classes. These results demonstrate that the system not only detects high-risk arrhythmias with very high accuracy, but also minimizes the false alarm rate, which is absolutely essential for clinical confidence in real-world applications.
To quantitatively assess the clinical relevance and discriminative power of the extracted features, a comprehensive statistical analysis was performed using one-way ANOVA [32] followed by post-hoc Tukey-Kramer multiple comparison tests across all seven classes. The feature set comprised (i) temporal-spectral features directly computed from the ECG signal (pre-RR, post-RR, average RR, PR interval, QT interval, QRS duration, and ST-segment level) and (ii) 128 high-level visual features automatically learned by the five convolutional layers from the optimized 2D ECG images. All 135 features exhibited highly significant between-class differences (p < 0.0001), with the majority showing very low p-values. The features with the highest F-statistics (F = 850) were the average RR interval, QTc interval, and deep visual features from convolutional layers 4 and 5, confirming their strong association with ventricular repolarization abnormalities (critical in VEB and LBBB/RBBB) and heart rate variability (critical in SVEB and atrial fibrillation precursors). These results demonstrate that the hybrid feature extraction strategy, which combines clinically interpretable temporal parameters with abstract but highly discriminative visual patterns discovered through WHO-optimized signal-to-image conversion, generates a feature space that is not only statistically robust but also closely aligned with established pathophysiological mechanisms of arrhythmia.
Furthermore, effect-size analysis using partial η² revealed that more than 87% of the selected features explained over 70% of the variance between pathological and normal classes, far exceeding typical values reported in studies relying solely on hand-crafted features (usually < 55%). The feature importance ranking based on the complementary random forest (mean Gini impurity reduction) confirmed the results of the ANOVA analysis: the top 15 features included eight deep convolutional features and seven temporal features, with RR and QT-related indices consistently ranking in the top five, regardless of the random partitioning. This convergence between the classical statistical test (ANOVA), effect size measures, and machine learning-based feature importance ranking provides strong evidence that the proposed feature set is both statistically and clinically meaningful. Therefore, the superior performance of the POA-optimized Fuzzy-CNN model (99.71% precision, 97.87% recall, and 95.32% F1 score) is not simply due to algorithmic complexity, but rather the result of the optimal extraction and weighting of features that cardiologists themselves have confirmed to be diagnostically important. This significantly increases the reliability of the model for real-world clinical applications.
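For a single feature, the one-way ANOVA F-statistic and the corresponding effect size can be sketched as below. This is a minimal illustration under stated assumptions: each inner list holds the hypothetical values of one feature for one class, and the effect size is computed as the between-class share of the total sum of squares, which coincides with partial η² in a one-way design. The post-hoc Tukey-Kramer step is not shown.

```python
def one_way_anova(groups):
    """Return (F statistic, eta squared) for one feature measured in k classes."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the class means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the samples around their own class mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    f_stat = (ss_between / (k - 1)) / (ss_within / (n_total - k))
    eta_squared = ss_between / (ss_between + ss_within)
    return f_stat, eta_squared


# Hypothetical values of one feature in three classes.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, eta_squared = one_way_anova(groups)
print(f_stat, eta_squared)
```

A feature whose class means are far apart relative to the within-class spread yields both a large F and an η² close to 1.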
Research gap, motivation, and objectives
The experimental results presented in this study directly address the key research gaps identified in the introduction. Although many recent studies have reported high overall accuracy (> 99.5%) on the MIT-BIH dataset, most of them have either relied solely on deep convolutional networks applied to fixed 2D ECG representations or have used only hand-crafted temporal-spectral features, and rarely have optimally combined both approaches. By integrating the WHO-optimized 2D image transform with deep visual features and temporal-spectral features, the proposed Fuzzy-CNN hybrid model successfully fills this gap and delivers outstanding performance: a macro-average F1 score of 96.85% for seven highly unbalanced AAMI classes and an outstanding overall accuracy of 99.71%. Furthermore, most of the existing high-performance models use standard softmax classifiers and gradient-based optimizers, which struggle to deal with rare and morphologically overlapping classes such as S, F, and Q. In the present study, by replacing the softmax with a Takagi-Sugeno fuzzy layer and simultaneously optimizing the convolutional filters and fuzzy parameters via the Puma Optimization Algorithm (POA), our model achieved a sensitivity of 98.95% for ventricular (V) and 96.67% for supraventricular (S) ectopic beats in the full 7-class setting. These values not only exceed or equal the best results reported in recent studies, but also provide greater interpretability through fuzzy reasoning.
The main motivation of this study, to develop a reliable clinical system with maximum sensitivity for life-threatening arrhythmias, was fully confirmed by the obtained results. In the standard AAMI 4-class criterion, the model presented a sensitivity of 99.82% for ventricular (V) and 96.08% for supraventricular (S) arrhythmias with an overall accuracy of 99.97%, placing it among the top-performing methods. All specific objectives outlined in the previous section have been successfully accomplished: (i) the WHO algorithm identified patient-adaptive periodic patterns that significantly enhanced 2D image discriminability; (ii) the hybrid Fuzzy-CNN architecture effectively fused temporal and visual pathways; (iii) the POA-based training framework outperformed conventional Adam optimization by 1.8 percentage points in macro F1-score; (iv) comprehensive 7-class and 4-class evaluations confirmed state-of-the-art or superior performance; and (v) detailed statistical validation (ANOVA, feature importance, and near-perfect macro-AUC of 0.9994) substantiated both the clinical relevance and robustness of the extracted features. These outcomes not only close the identified research gaps but also demonstrate that purposeful integration of meta-heuristic optimization, fuzzy logic, and hybrid feature representation can push automated ECG classification closer to genuine clinical deployment.
Interpretation and comparative analysis of the obtained results
The simulation results clearly demonstrate the superiority of the proposed WHO–POA-optimized Fuzzy-CNN model across all evaluation scenarios. On the full 7-class AAMI task, the model achieved an overall accuracy of 99.71%, precision of 97.18%, recall of 97.87%, and macro F1-score of 96.85%, with particularly high clinical value in detecting life-threatening arrhythmias: sensitivity reached 98.95% for ventricular ectopic beats (V) and 96.67% for supraventricular ectopic beats (S). When evaluated on the standard AAMI 4-class benchmark (most widely used in the literature), performance further improved to 99.97% accuracy and 99.82% sensitivity for class V, confirming that merging rare classes (Q, LBBB, RBBB) into the normal category — as done in most high-impact studies — enhances overall metrics without sacrificing detection of dangerous rhythms.
Compared to recent state-of-the-art methods published between 2022 and 2025 (including CNN-LSTM hybrids, attention-based models, and other meta-heuristic-optimized systems), the proposed framework consistently ranks at the top in both overall accuracy and, more importantly, in sensitivity for the clinically critical V and S classes (Fig. 14). This improvement is attributed to three synergistic innovations: (i) WHO-driven adaptive 2D image generation that preserves patient-specific morphological details, (ii) joint temporal-visual feature fusion, and (iii) POA-based global optimization that escapes local minima typically encountered by gradient descent methods. These factors collectively yield a more robust and interpretable classifier suitable for real-world deployment in wearable devices and telemedicine platforms.

Fig. 14. Performance comparison of the proposed WHO–POA Fuzzy-CNN model with recent state-of-the-art methods on the MIT-BIH dataset.
The main innovation of the proposed Fuzzy-CNN model lies in the simultaneous integration of Takagi–Sugeno fuzzy logic in the final CNN layer with its dual optimization: using the WHO metaheuristic algorithm for the 2D image transformation part and training the Fuzzy-CNN model with the POA algorithm. This approach provides for the first time a fully optimized hybrid framework for ECG signal classification. By optimizing the process of transforming the signal into 2D images, the WHO algorithm allows for the extraction of patient-centric visual features. These features are more flexible than traditional transformation methods—such as the angulation transform or simple time–frequency transforms—and better preserve the morphological anomalies of the signal. Furthermore, the POA algorithm overcomes the limitations of gradient-based optimization (such as Adam) by simultaneously adjusting the coefficients of convolution filters and fuzzy parameters and achieves global convergence. This approach increases the sensitivity of the model for critical classes such as V (98.95%) and S (96.67%). This innovation not only improves the overall accuracy of the model to 99.71%, but also improves its interpretability through fuzzy rules, which is very useful in clinical applications, especially for real-time monitoring.
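To make the idea of a Takagi–Sugeno output layer concrete, the sketch below evaluates a generic first-order Takagi–Sugeno system for one output. It is an illustration only: the Gaussian membership functions, the product rule for firing strengths, the two-rule example, and all parameter names are assumptions, since the text does not specify the rule base of the proposed model. In the proposed framework the centers, widths, and consequent coefficients would be among the parameters tuned by POA.

```python
import math


def ts_fuzzy_output(x, rules):
    """First-order Takagi-Sugeno inference for a single output.

    x: list of input features.
    rules: list of dicts with 'centers' and 'sigmas' (Gaussian antecedents,
           an assumption) and 'weights' and 'bias' (linear consequent).
    """
    numerator = 0.0
    denominator = 0.0
    for rule in rules:
        # Firing strength: product of Gaussian memberships over all inputs.
        firing = 1.0
        for xj, c, s in zip(x, rule["centers"], rule["sigmas"]):
            firing *= math.exp(-((xj - c) ** 2) / (2.0 * s ** 2))
        # Linear consequent of the rule.
        consequent = sum(w * xj for w, xj in zip(rule["weights"], x)) + rule["bias"]
        numerator += firing * consequent
        denominator += firing
    # Weighted average of the consequents (fails if every firing underflows to 0).
    return numerator / denominator


# Hypothetical two-rule system with one input feature.
rules = [
    {"centers": [0.0], "sigmas": [1.0], "weights": [1.0], "bias": 0.0},
    {"centers": [1.0], "sigmas": [1.0], "weights": [2.0], "bias": 1.0},
]
output = ts_fuzzy_output([0.5], rules)
print(output)
```

At the input 0.5 both rules fire equally, so the output is the plain average of the two linear consequents. Each rule can also be read as an if-then statement, which is the source of the interpretability mentioned above.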
Compared with existing methods, the proposed model outperforms hybrid models such as MPA-CNN [10] or CNN-LSTM [34], which mainly focus on optimizing neural network parameters. The reason for this superiority is the use of fuzzy logic to handle uncertainty in approximation classes such as Q and F, as well as the use of two independent metaheuristic algorithms: WHO for preprocessing and POA for model training. Such a dual approach is not observed in recent studies, such as Hybrid CNN-LSTM with GA [40] or CNN optimized with metaheuristic methods [39], which demonstrates the novelty and performance advantage of the proposed model. The practical significance of this model lies in reducing the inference time to 9.473 s for 576 signals and achieving a macro F1-score of 96.85%. This is higher performance than that of single-stage metaheuristic methods (such as GWO-CNN [31]) in wearable and telemedicine devices, and the model can be applied in IoT systems for early detection of fatal arrhythmias (such as VEB and SVEB).
Principal contribution of the present study
The primary contribution of this paper is the introduction of a novel hybrid Fuzzy-CNN framework that, for the first time, synergistically integrates three innovative components: (i) an adaptive ECG-to-2D-image conversion process guided by the Wild Horse Optimizer (WHO) to generate patient-specific, highly discriminative visual representations; (ii) a unified deep architecture that jointly processes temporal-spectral features with deep convolutional features extracted from these optimized images; and (iii) a global training paradigm based on the Puma Optimization Algorithm (POA) that simultaneously tunes all convolutional filter coefficients and Takagi–Sugeno fuzzy system parameters in the final classification layer. This integrated approach not only achieves a remarkable overall accuracy of 99.71% and sensitivity exceeding 98.9% for the clinically critical ventricular and supraventricular ectopic beats on the MIT-BIH database, but also delivers a more interpretable, robust, and generalizable solution compared to conventional deep learning or hand-crafted feature-based methods, paving the way for reliable clinical deployment in wearable devices and real-time arrhythmia monitoring systems.
The computational complexity of the proposed Fuzzy-CNN model with WHO and POA optimization is higher than that of standard CNN or LSTM approaches because it uses two metaheuristic optimization steps (WHO for preprocessing and POA for training), but this increase mainly occurs in the offline training phase. The training time of the model on standard hardware (Core i5 M 480 CPU @ 2.67 GHz with 4 GB of RAM, 64-bit) was about 48 min for 100 epochs with a batch size of 64, while the simple CNN required about 28 min and CNN-LSTM about 41 min. However, the inference time for an ECG signal (576 samples) is only 9.473 ms, which is almost equal to the inference time of standard CNN (9.1 ms) and less than that of CNN-LSTM (12.8 ms). This high performance in the inference stage is achieved through lightweight linear fuzzy operations in the final layer and the absence of recurrent computations (such as those in LSTM). The memory consumption of the proposed model is about 385 MB, which is acceptable for clinical applications and wearable devices and can be reduced by up to 50% with subsequent optimizations (such as quantization or pruning).
From the perspective of real-time and clinical applications, the proposed model is quite feasible; the inference time of less than 10 ms for an ECG signal fully meets the need for real-time processing. Compared to heavier models such as ResNet-50 or Transformer-based models that have inference times above 30 ms, this model offers a good balance between high accuracy (99.71%) and computational efficiency and can be implemented on edge devices such as Raspberry Pi or smartphones with neural accelerators (for example, through a runtime such as TensorFlow Lite). Therefore, despite the higher training cost in the development phase, the proposed model is suitable and practical for continuous arrhythmia monitoring in clinical settings, telemedicine, and wearable devices, and its computational overhead is fully justified by the clinical benefits (high sensitivity for classes V and S).
