In this section, we discuss the results produced by the different machine learning models, aiming to compare them and determine the most effective model for predicting cerebral aneurysm rupture based on 35 morphological and 3 clinical inputs. Evaluation criteria include accuracy on the training and test datasets; precision and recall on the test dataset; and receiver operating characteristic (ROC) curves. Following these evaluations for each model, we discuss the most important features identified by the models, aiming to clarify the correlation between each parameter and the rupture status of cerebral aneurysms. This analysis provides a comprehensive understanding of the factors that contribute to the accurate prediction of aneurysm rupture.
Accuracy
The main metric used to evaluate the models and to compare them against one another is accuracy, measured on both the training and testing datasets. Accuracy is defined as the ratio of correctly predicted cases to all predicted cases. Although high accuracy is desirable, achieving 100% accuracy is not optimal, as it may indicate overfitting and a lack of generalization to unseen data. Ideally, the accuracies on the training and testing datasets should be comparable, with a maximum recommended difference of 10%. Figure 5 shows the accuracy results of all models. All models achieve an accuracy above 0.70; XGB shows the highest accuracy at 0.91, while KNN shows the lowest at 0.74. In terms of generalizability to new data, both MLP and SVM performed well, each achieving an accuracy of 0.82 on the testing dataset and thereby outperforming the other models in prediction accuracy for unseen data.
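Expressed in the standard confusion-matrix notation (true positives TP, true negatives TN, false positives FP, and false negatives FN), this definition of accuracy can be written as

\[
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.
\]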

Accuracy on training and testing datasets.
Precision and Recall
In addition to accuracy, we also included precision and recall as important metrics to comprehensively evaluate the models' performance. We made this decision in view of the sensitive nature of the medical data under consideration and the importance of timely disease recognition. Simply put, recall measures the model's ability to correctly identify the presence of a disease; it is defined as the ratio of true positive predictions to the total number of actual positive cases. Precision, in turn, reflects the model's ability to accurately predict positive occurrences; it is defined as the ratio of true positive predictions to the total number of predicted positive cases.
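In the same confusion-matrix notation, these two definitions read

\[
\text{Recall} = \frac{TP}{TP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP}.
\]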
While recall is particularly important in the medical context, accuracy and precision should not be overlooked, as they collectively determine the overall effectiveness of the model. Figure 6 shows the results of evaluating all three metrics (accuracy, precision, and recall) on the test dataset, focusing on the ruptured class, which represents the positive (disease occurrence) class in this study. SVM and MLP are again the best performing models: they achieve high recall values of 0.92 and 0.90, respectively, in predicting the occurrence of cerebral aneurysm rupture. SVM also reaches an accuracy and precision of 0.82, while MLP reaches an accuracy of 0.83 and a precision of 0.82. In contrast, RF performed relatively poorly on all three criteria. Nevertheless, even for RF, all performance metrics on the test dataset remain above 0.75, indicating a high level of predictive ability.

Accuracy, precision, and recall on the test dataset.
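As an illustration of how these test-set metrics can be computed, the following minimal sketch uses scikit-learn and treats the ruptured class (coded here as label 1) as the positive class; the function and variable names are placeholders and do not refer to the study's actual pipeline.

```python
# Minimal sketch (not the study's code): accuracy, precision, and recall on a
# held-out test set, with the ruptured class (label 1) as the positive class.
from sklearn.metrics import accuracy_score, precision_score, recall_score

def rupture_metrics(y_true, y_pred):
    # y_true: true labels (0 = unruptured, 1 = ruptured); y_pred: model predictions
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, pos_label=1),
        "recall": recall_score(y_true, y_pred, pos_label=1),
    }
```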
ROC Curve
Another metric used in the evaluation is the ROC curve, which plots the true positive rate against the false positive rate. A diagonal line, with equal true positive and false positive rates, corresponds to a random classifier; as a model improves, its curve shifts towards the top-left corner. An ideal model would have a true positive rate of 1 and a false positive rate of 0. The area under the curve (AUC) summarizes the model's performance, where an AUC of 0.5 indicates a random classifier and an AUC of 1 indicates an ideal classifier. Figure 7 shows the ROC curves for each model and their corresponding AUC values. Based on these criteria, SVM and MLP are again the best performing models, with very similar results: their ROC curves follow a favorable trajectory, and their AUC values confirm their superior performance. Conversely, RF performs somewhat worse than the other models. In summary, all models show acceptable performance, and optimizing them to improve their reliability and validity in predicting cerebral aneurysm rupture is a worthwhile endeavor.

Receiver operating characteristic (ROC) curves for all models.
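For completeness, the sketch below shows one way to obtain the ROC curve and AUC for a single trained classifier with scikit-learn; the arguments are placeholders, and it assumes the classifier exposes class probabilities.

```python
# Minimal sketch: ROC curve coordinates and AUC for one trained classifier.
from sklearn.metrics import roc_curve, roc_auc_score

def roc_summary(model, X_test, y_test):
    # Probability assigned to the ruptured class (label 1) for each test case.
    y_score = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, y_score)   # false/true positive rates
    auc = roc_auc_score(y_test, y_score)       # 0.5 = random, 1.0 = ideal
    return fpr, tpr, auc
```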
Major Features
Since each machine learning model employs its own algorithms and mathematical relationships, differences in the weights assigned to each parameter in the final classification decision are to be expected. Figure 8 shows the parameter weights for the two best performing models in this study. The SVM model identifies the top five features as EI (Ellipticity Index), SR (Size Ratio), I (Irregularity), UI (Undulation Index), and IR (Ideal Roundness), a new parameter introduced in this study. The MLP model, on the other hand, prioritizes EI, I, Location, NA (Neck Area), and IR, with IR again showing a significant influence.

Key features of the two best performing models.
Other new parameters introduced in this study include NC, IS, ON, IRR, COD, ISR, and IOR, which occupy positions 6, 9, 13, 19, 27, 30, and 36, respectively, in the SVM ranking. For the MLP model, the ranking of the new parameters is IR (5), NC (7), ON (18), IS (24), ISR (27), COD (32), IOR (34), and IRR (38). Notably, some parameters in the MLP model take negative values, indicating that they are inversely correlated with the output and have an inverse effect on the model's predictions. It should be noted that this pattern may vary depending on the architecture used for the MLP model.
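The exact procedure used to extract the parameter weights in Figure 8 is model-specific; as one model-agnostic alternative, permutation importance can produce comparable rankings for both SVM and MLP, as in the sketch below (all names are placeholders, and this is not the study's actual weighting method; small negative values can likewise occur when shuffling a feature slightly improves the score).

```python
# Minimal sketch: ranking input parameters of a fitted classifier (e.g. SVM or
# MLP) by permutation importance. Not the study's actual weighting method.
import numpy as np
from sklearn.inspection import permutation_importance

def rank_features(model, X_test, y_test, feature_names):
    result = permutation_importance(model, X_test, y_test,
                                    scoring="accuracy", n_repeats=30,
                                    random_state=0)
    order = np.argsort(result.importances_mean)[::-1]  # largest importance first
    return [(feature_names[i], result.importances_mean[i]) for i in order]
```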
One question that may arise from this study concerns the clinical expectation, based on physicians' experience, that bifurcation aneurysms are more likely to rupture than lateral aneurysms. Our study, however, did not show a significant contribution of this factor. This discrepancy does not mean that bifurcation versus lateral status is unimportant. Rather, it indicates that when other features are considered alongside this parameter, their correlation with rupture status is stronger than that of this particular one. Essentially, expanding the input variables and making decisions based on more comprehensive information reveals the importance of parameters that were not considered before. Modern machine learning models make it possible to compare multiple parameters simultaneously and to quantify the contribution of each parameter relative to the others. This approach allows for more reliable decision-making by considering a wider range of factors and by better understanding the complex interplay of variables that contribute to the prediction of cerebral aneurysm rupture.
Here, we briefly compare previous and current studies, with a particular focus on the test datasets used. Table 3 presents the results of six similar studies alongside our own findings. As mentioned above, we sought to incorporate comprehensive morphological parameters to ensure the robustness of our findings.
As the range of parameters considered expands, the relative importance assigned to each parameter is expected to change. Furthermore, increasing the size of the dataset can increase the reliability of the results. Among the critical parameters, size ratio emerged as a recurring focus across studies, highlighting its essential role in the assessment of rupture risk. Given the sensitivity inherent in medical data, we again emphasize the importance of the recall score. It is noteworthy that our study achieved an excellent recall score; unfortunately, this metric was not reported in previous studies, limiting direct comparisons.