Estimation of sexual dimorphism of adult human mandibles of South Indian origin using non-metric parameters and machine learning classification algorithms

Descriptive statistics

The frequency distribution of each variable for the parameters observed is provided in Table 1. Cohen’s kappa value of 0.81 to 1.00 was considered a very good strength of agreement for intra-observer variability. The Fleiss kappa test revealed k values ranging from 0.4 to 1.00 indicating moderate to very good strength of agreement in inter-observer variability. A few variables showed significant differences in their occurrence between male and female mandibles. The lower border of the mandible was predominantly rocker in males and straight in female mandibles (p < 0.001). The gonial angle was found to be everted in males (p < 0.001), followed by a straight profile. Partial and complete mylohyoid bridges were observed significantly more in male mandibles (p = 0.021). A flexure of the posterior border of the mandibular ramus was found to be a trait observed in male mandibles (p < 0.001). A detailed description of the occurrence of these variables between male and female mandibles is given in Table 2.

Table 1 Frequency distribution of all the variables observed.

Table 2 Comparison of the occurrence of variables between male and female mandibles.

Correlation and confusion matrix

The correlation matrix for the variables used in this study is shown in Fig. 4. Correlation values range from +1 to −1, where +1 indicates a perfect positive correlation, 0 indicates no correlation, and −1 indicates a perfect negative correlation between the variables.

The confusion matrix for the four ML algorithms used in this work is shown in Figs. 5 and 6, respectively, for SMOTE and ROS.

The comparison of classification models using weighted average precision and recall provides valuable insights into both the accuracy and reliability of each model. RF stands out with the highest weighted average precision of 0.94 and recall of 0.93 for SMOTE and ROS, indicating that it not only makes highly accurate positive predictions but also captures the majority of true positives across all classes. This strong performance underscores RF’s capability to balance precision and completeness in predictions, making it highly suitable for applications that require both high accuracy and thorough identification. SVM and DT both achieved a weighted average precision and recall of 0.87 and 0.85 for SMOTE and average precision and recall of 0.91 and 0.90 for ROS, suggesting that they are similarly effective in balancing false positives and false negatives. These models offer a strong balance among interpretability, computational efficiency, and predictive performance. Although they do not reach the performance level of RF, they still represent a good trade-off, making them suitable options when model simplicity or speed is prioritized. On the other hand, KNN recorded the lowest weighted precision and recall of 0.81 and 0.80 for SMOTE and 0.91 and 0.90 for ROS. Although the differences may seem marginal, they reflect that KNN is less capable of accurately identifying all classes, especially in scenarios with overlapping class boundaries or imbalanced distributions. This lower performance suggests KNN may be more affected by local sample noise and might benefit from feature scaling or dimensionality reduction.

Overall, the RF model stands out as the most balanced and effective in terms of both precision and recall, followed by SVM and DT. The KNN model falls slightly behind in performance. These results highlight the effectiveness of ensemble methods in achieving robust and reliable classification, especially when dealing with real-world datasets that feature varied and complex patterns.
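The weighted averages discussed above can be reproduced from a confusion matrix. The following is a minimal sketch, not the authors' code; the 2×2 matrix is hypothetical and is not taken from Figs. 5 or 6. Each class's precision and recall is weighted by its support (the number of true samples of that class).

```python
def weighted_precision_recall(cm):
    """cm[i][j] = number of samples of true class i predicted as class j."""
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    w_precision = 0.0
    w_recall = 0.0
    for k in range(n_classes):
        tp = cm[k][k]
        support = sum(cm[k])                                 # true members of class k
        predicted = sum(cm[i][k] for i in range(n_classes))  # samples predicted as class k
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        w_precision += precision * support / total
        w_recall += recall * support / total
    return w_precision, w_recall


# Hypothetical confusion matrix (rows: true Male, true Female)
cm = [[18, 3],
      [1, 19]]
precision, recall = weighted_precision_recall(cm)
print(round(precision, 3), round(recall, 3))
```

With support weighting, the weighted recall always equals the overall accuracy, which is why the two quantities coincide whenever both are reported for the same predictions.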

Jaccard score, F1 score, accuracy, and specificity

The Jaccard index is a metric for assessing the similarity between predicted and actual classifications. A higher Jaccard index indicates greater overlap between the predictions and the true labels, and hence better model performance. The F1 score, the harmonic mean of precision and recall, is useful for appraising model performance, particularly when class distributions are imbalanced. An F1 score approaching 1 reflects a good balance between precision and recall. Table 3 presents the consolidated Jaccard index and F1 scores for the four ML algorithms used in this study.

Table 3 Summary of Jaccard index and F1 scores.

It can be seen from Table 3 that the KNN algorithm exhibits the least effective performance among the evaluated models, with Jaccard and F1 scores recorded at 0.67 and 0.80, respectively, for SMOTE and Jaccard and F1 scores of 0.78 and 0.87, respectively, for ROS. This indicates that KNN struggles with noisy features, as it is particularly sensitive to local distributions and has difficulty generalizing effectively in scenarios characterized by imbalanced or high-dimensional datasets. The DT demonstrates superior performance in comparison to KNN, achieving Jaccard and F1 scores of 0.74 and 0.85, respectively, for SMOTE and Jaccard and F1 scores of 0.82 and 0.90, respectively, for ROS. This finding shows that DT establishes a more systematic and rule-based decision boundary by executing hierarchical splits grounded in feature thresholds. Such a structured methodology facilitates enhanced generalization capabilities for the DT, especially in instances where the dataset exhibits non-linear separability. The SVM achieves a Jaccard score that is similar to that of the decision tree (DT); however, its F1 score is slightly lower. This suggests that while the SVM may demonstrate higher precision, it also has lower recall, meaning it is able to accurately identify fewer true positive cases. The RF model distinctly surpasses all other models, attaining the highest Jaccard index, which signifies a superior degree of prediction overlap with the true labels. Additionally, it showcases the most favorable F1 score, indicative of an optimal equilibrium between precision and recall. This remarkable performance can be ascribed to the ensemble nature of RF, which mitigates the risk of overfitting, adeptly captures intricate feature interactions, and generalizes effectively across diverse subsets of the dataset.
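As an illustration of how the two scores relate, the sketch below computes both from the counts of true positives, false positives, and false negatives for a single class. The counts are hypothetical, and the averaging scheme behind Table 3 is not stated in the text, so this is only an indicative example.

```python
def jaccard_and_f1(tp, fp, fn):
    """Jaccard index and F1 score for one class from its error counts."""
    jaccard = tp / (tp + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return jaccard, f1


# Hypothetical counts for one class
j, f1 = jaccard_and_f1(tp=18, fp=1, fn=3)
print(round(j, 3), round(f1, 3))

# For a single class the two scores are tied together by F1 = 2J / (1 + J)
print(abs(f1 - 2 * j / (1 + j)) < 1e-12)
```

Because F1 = 2J / (1 + J) for a single class, F1 is never lower than the Jaccard index, and the two always rank models in the same order; for averaged scores the relation holds only approximately.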

Tables 4 and 5 present the accuracy and class-wise specificity for all four models used in this study, based on results obtained after applying the SMOTE and ROS techniques to the original dataset.

From Table 4, it is observed that the accuracy of KNN increased from 0.80 using SMOTE to 0.87 using ROS. The confidence interval for the accuracy of KNN was about [0.75, 0.97], which suggests with 95% confidence that the model’s actual accuracy falls between 75% and 97%. The accuracy of DT and SVM increased from about 0.82 (82%) for SMOTE to 0.90 (90%) for ROS, with confidence intervals of [0.70, 0.92] for SMOTE and [0.80, 0.97] for ROS. RF exhibited the highest accuracy for both SMOTE and ROS when compared to the other algorithms. With SMOTE, the accuracy of RF is notably 8–15% greater than that of KNN, DT, and SVM. It is also worth noting that the accuracy of RF was the same for SMOTE and ROS.
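The text does not state how the confidence intervals were obtained. One common choice for a proportion such as accuracy is the Wilson score interval, sketched below as an assumption rather than as the method actually used. The test-set size of 40 and the 36 correct predictions are hypothetical.

```python
import math


def wilson_interval(correct, n, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 gives about 95%)."""
    p = correct / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical: 36 of 40 test mandibles classified correctly (accuracy 0.90)
lo, hi = wilson_interval(correct=36, n=40)
print(round(lo, 2), round(hi, 2))
```

The width of the interval shrinks roughly with the square root of the test-set size, which is why small test sets yield intervals as wide as those reported here.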

The superior performance of RF can be attributed to its ensemble structure, which effectively reduces overfitting, captures complex feature interactions, and ensures better generalization across different data folds.

Table 5 Specificity per class.

With SMOTE, KNN achieved a specificity of 0.76 for Males and 0.85 for Females, which shows a better performance in correctly rejecting Male instances when evaluating Female samples.

For KNN with ROS, Male specificity remains at 0.76, while Female specificity improves to a perfect 1.00, showing an enhanced ability to identify Female negatives. Decision Tree (DT) under SMOTE records the same specificity for Males (0.76) and a high specificity for Females (0.95). ROS notably increases Male specificity to 0.86, while Female specificity stays stable at 0.95, suggesting an improved ability to identify Male negatives. SVM yields a moderate specificity with SMOTE: 0.81 for Males and 0.85 for Females. However, the specificity improves to 0.86 and 0.95, respectively, with ROS, indicating better class distinction. Random Forest (RF) achieves the highest baseline specificity with SMOTE: 0.86 for Males and a perfect 1.00 for Females. Notably, these values remain consistent with ROS, indicating robustness across oversampling methods in accurately identifying negatives for both classes.

Both SMOTE and ROS effectively mitigate class imbalance and enhance model specificity; however, ROS frequently produces superior or equivalent specificity metrics, particularly for the Male class across various algorithms. The Random Forest algorithm exhibits the most robust overall performance, demonstrating resilience to the oversampling method, achieving near-optimal specificity for both classes. This finding suggests that Random Forest is a dependable option for classification tasks involving this imbalanced dataset, with ROS potentially augmenting performance in other modelling scenarios.
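Class-wise specificity can be derived from a confusion matrix by treating each class in turn as the positive class. The sketch below shows one way to do this; the matrix is hypothetical and does not reproduce the values in Table 5.

```python
def specificity_per_class(cm, labels):
    """Specificity TN / (TN + FP) for each class, one-vs-rest.

    cm[i][j] = number of samples of true class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    result = {}
    for k, label in enumerate(labels):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp
        fp = sum(cm[i][k] for i in range(n)) - tp
        tn = total - tp - fn - fp
        result[label] = tn / (tn + fp) if (tn + fp) else 0.0
    return result


# Hypothetical confusion matrix (rows: true Male, true Female)
cm = [[18, 3],
      [1, 19]]
spec = specificity_per_class(cm, ["Male", "Female"])
print(spec)
```

In a two-class problem, the specificity for one class is the recall of the other, so the Male and Female specificities together describe the same errors as the two class-wise recalls.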

McNemar’s test is a non-parametric statistical test used to assess differences in paired nominal data, in particular when comparing the proportions of two related groups. We used McNemar’s test to evaluate whether there is a significant difference in accuracy between pairs of classification algorithms; the results are shown in Table 6 for SMOTE and Table 7 for ROS. The test statistic is based on the cases in which the two classifiers disagree and approximately follows a chi-square distribution under the null hypothesis; a higher value indicates a greater difference. The p-value is the probability of observing a difference at least this large if the two classifiers had equal accuracy, with values below 0.05 taken as significant. Together, they help determine whether the classifiers differ significantly in accuracy.

Table 6 McNemar’s test for accuracy with SMOTE.

Table 7 McNemar’s test for accuracy with ROS.

It can be inferred from Tables 6 and 7 that the p-values exceed typical significance levels (e.g., 0.05), which indicates that there is no statistically significant difference in accuracy between the compared algorithms. All pairs of algorithms tested perform similarly in terms of accuracy with both SMOTE and ROS, and none of the differences are statistically significant according to McNemar’s test.
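For readers who wish to see the mechanics of McNemar’s test, the sketch below uses the chi-square form with a continuity correction. Whether the authors applied the correction or an exact binomial variant is not stated, so that choice is an assumption, and the predictions are hypothetical. Only the samples on which the two classifiers disagree enter the statistic.

```python
import math


def mcnemar(y_true, pred_a, pred_b):
    """McNemar's chi-square test (continuity-corrected) for two classifiers."""
    # b: A correct, B wrong; c: A wrong, B correct
    b = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a == t and p != t)
    c = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a != t and p == t)
    if b + c == 0:
        return 0.0, 1.0  # the classifiers never disagree
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of a chi-square with 1 degree of freedom
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical labels and predictions for 20 test samples
y_true = [0, 1] * 10
pred_a = list(y_true)
pred_a[0] = 1 - pred_a[0]            # classifier A makes one error
pred_b = list(y_true)
for i in range(1, 6):
    pred_b[i] = 1 - pred_b[i]        # classifier B makes five errors

stat, p_value = mcnemar(y_true, pred_a, pred_b)
print(round(stat, 2), round(p_value, 3))
```

Even though classifier B makes five times as many errors as A in this toy example, the difference is not significant at the 0.05 level, which illustrates how little power the test has on small test sets.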

ROC curve and AUC

The ROC (Receiver Operating Characteristic) curve and the AUC (Area Under the Curve) are essential tools for evaluating the performance of binary classifiers. A model separates the classes better when its ROC curve lies closer to the top-left corner. The AUC summarizes the ROC curve in a single number between 0 and 1. Figs. 7 and 8 summarize the ROC curves and AUC values obtained in this work for SMOTE and ROS, respectively.

The Area Under the Curve (AUC) scores provide a measure of each model’s ability to distinguish between classes across all threshold settings. It can be noted from Fig. 7 that RF achieved the highest AUC of 0.83, indicating strong discriminatory power and reliable performance across different classification thresholds. DT and SVM followed with AUC scores of 0.86 and 0.83, respectively, showing moderate effectiveness. KNN recorded the lowest AUC of 0.81, suggesting it is less capable of consistently separating the classes. Figure 8 shows that all four models exhibit strong classification performance, with AUC values over 0.85. Random Forest (RF) leads at 0.93, indicating superior accuracy and fewer false positives. DT and SVM each score 0.90, while KNN has a slightly lower AUC of 0.88. The ROC curves visually support these results, with RF, DT, and SVM clearly outperforming KNN. These results highlight Random Forest as the most robust model in terms of overall class separation, while KNN may struggle to handle overlapping class distributions or more complex decision boundaries.
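The AUC has a direct probabilistic reading: it is the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it that way on hypothetical scores; it is an illustration of the definition, not the routine used to produce Figs. 7 and 8.

```python
def auc_score(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = positive class) and predicted scores
y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]
auc = auc_score(y_true, scores)
print(round(auc, 3))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so values are usually read relative to those two reference points.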

Bootstrap AUROC difference is a statistical method for comparing the Area Under the Receiver Operating Characteristic Curve (AUROC or AUC) of two classification models by estimating the variability of their difference. The dataset is repeatedly sampled with replacement to produce bootstrapped samples. For each of these samples, the difference in AUROC is calculated and recorded. Variability and confidence intervals are then estimated from the distribution of AUROC differences. By analysing this distribution, it is possible to determine whether the performance difference is statistically significant or due to chance. Tables 8 and 9 show the AUROC difference with SMOTE and ROS, respectively.

Table 8 AUROC difference with SMOTE.

Table 9 AUROC difference with ROS.

Both tables compare pairs of algorithms using the bootstrap method to estimate differences in the Area Under the Receiver Operating Characteristic Curve (AUROC), along with 95% confidence intervals (CIs). In Table 8, which covers the Synthetic Minority Over-sampling Technique (SMOTE), the AUROC differences between models range from −0.15 to 0.12, with all confidence intervals including zero. Similarly, Table 9, which presents the ROS data, shows differences ranging from −0.04 to 0.06, with all confidence intervals also containing zero. Since zero falls within every confidence interval, we can conclude that there is no significant difference in AUROC between any pair of compared algorithms under either condition. This indicates that all models perform comparably in their ability to discriminate between the classes.
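The bootstrap procedure described above can be sketched as follows. This is a minimal percentile-bootstrap illustration under stated assumptions: the number of resamples, the seed, and the scores are hypothetical, and the authors' exact settings are not given in the text. Both models are evaluated on the same resampled indices so that the comparison stays paired.

```python
import random


def auc_score(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_difference(y_true, scores_a, scores_b, n_boot=500, seed=0):
    """Observed AUC difference (A - B) and a 95% percentile bootstrap CI."""
    rng = random.Random(seed)
    n = len(y_true)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        y = [y_true[i] for i in idx]
        if len(set(y)) < 2:
            continue  # AUC is undefined without both classes; redraw
        a = [scores_a[i] for i in idx]
        b = [scores_b[i] for i in idx]
        diffs.append(auc_score(y, a) - auc_score(y, b))
    diffs.sort()
    lo = diffs[int(0.025 * n_boot)]
    hi = diffs[int(0.975 * n_boot) - 1]
    observed = auc_score(y_true, scores_a) - auc_score(y_true, scores_b)
    return observed, lo, hi


# Hypothetical labels and scores from two models on the same 12 test samples
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
scores_a = [0.1, 0.3, 0.2, 0.6, 0.4, 0.35, 0.7, 0.8, 0.5, 0.9, 0.65, 0.45]
scores_b = [0.2, 0.4, 0.1, 0.5, 0.6, 0.3, 0.6, 0.7, 0.4, 0.8, 0.55, 0.65]
observed, lo, hi = bootstrap_auc_difference(y_true, scores_a, scores_b)
print(round(observed, 3), round(lo, 3), round(hi, 3))
```

If the resulting interval contains zero, the difference between the two models is not distinguishable from resampling noise at the 95% level, which is the criterion applied to Tables 8 and 9.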

Balanced accuracy and Matthews correlation coefficient (MCC)

When the dataset is imbalanced, balanced accuracy is a widely used metric for evaluating classification models. Balanced accuracy averages the recall scores of all classes, giving equal importance to each. For binary classification, it is the mean of the true positive rate (sensitivity) and true negative rate (specificity), ensuring fair assessment for both majority and minority classes. Balanced accuracy for the ML algorithms used in this work is presented in Table 10.

Table 10 Balanced accuracy.

As can be seen from Table 10, balanced accuracy increases for most of the ML algorithms when ROS is used for class balancing. KNN showed an increase of 2.5%, whereas DT and SVM showed increases of 9.7% and 5.8%, respectively, for ROS when compared to SMOTE. In contrast, RF showed no change in balanced accuracy between the two oversampling methods, which demonstrates its robustness to the choice of oversampling method. Overall, ROS seems to offer a slight advantage for most models, except for Random Forest, which performs equally well with both techniques.
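To make the definition of balanced accuracy concrete, the sketch below averages the per-class recalls of a confusion matrix. Both matrices are hypothetical; the second shows why plain accuracy can mislead on imbalanced data.

```python
def balanced_accuracy(cm):
    """Mean of per-class recalls; cm[i][j] = true class i predicted as j."""
    recalls = [cm[k][k] / sum(cm[k]) for k in range(len(cm)) if sum(cm[k])]
    return sum(recalls) / len(recalls)


# Hypothetical, roughly balanced test set
balanced_case = balanced_accuracy([[18, 3],
                                   [1, 19]])

# Hypothetical imbalanced test set where every sample is assigned to class 0:
# plain accuracy is 0.90, yet the minority class is never detected
degenerate_case = balanced_accuracy([[90, 0],
                                     [10, 0]])
print(round(balanced_case, 3), degenerate_case)
```

A classifier that always predicts the majority class scores 0.5 on balanced accuracy regardless of how skewed the class frequencies are.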

The Matthews Correlation Coefficient (MCC) is a metric used to assess the performance of binary (two-class) classifications. When the dataset is imbalanced, MCC offers a balanced metric for accuracy evaluation. The Matthews Correlation Coefficient for the ML algorithms used in this work is presented in Table 11.

Table 11 shows that ROS generally increases the MCC values when compared to SMOTE. KNN improves from 0.61 with SMOTE to 0.78 with ROS, SVM rises from 0.65 to 0.80, and DT increases from 0.72 to 0.80. RF achieves the highest MCC of 0.86, showing consistent performance with both SMOTE and ROS. Overall, ROS enhances the correlation between predicted and actual outcomes for most algorithms, with Random Forest consistently performing well.

Table 11 Matthews correlation coefficient (MCC).
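The MCC combines all four cells of the binary confusion matrix into a single correlation-like value between −1 and +1. The sketch below implements the standard formula; the counts are hypothetical and are not those underlying Table 11.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal total is zero
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical counts
value = mcc(tp=18, tn=19, fp=1, fn=3)
print(round(value, 3))
```

A value of +1 corresponds to perfect prediction, 0 to predictions no better than chance, and −1 to complete disagreement, which makes the MCC harder to inflate through class imbalance than plain accuracy.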

Permutation feature importance and Gini index

The permutation feature importance for the Random Forest model is shown in Fig. 9 for SMOTE and ROS. The most critical features for SMOTE are N3 Lower border_0, N6 AR profile_2, and N6 Gonial angle_2, with importance scores ranging between 0.09 and 0.12. Other features, such as N12 Flexure ramal post border_0 and N1 Shape of chin_0, also have high importance scores, suggesting that both angular and shape-based measurements are important for classification. The distribution of importance scores reveals that a few features have a strong influence, while many others contribute moderately, highlighting SMOTE’s ability to maintain variability across predictors.

For the ROS technique, N6 Gonial angle_0, N12 Lower border_2, and N3 Lower border_0 are dominant features, with the top-ranked feature showing slightly higher importance than in SMOTE. Many of the top features overlap between SMOTE and ROS, suggesting a stable set of core predictors regardless of the balancing method. The ROS curve indicates a steep decline in importance after the top few features, suggesting that ROS-trained models rely more on a limited set of predictors. This may result from ROS duplicating minority samples without adding synthetic variability, leading to a focus on the strongest discriminative features.

Both methods highlight that certain anatomical measurements, particularly gonial angles and lower border profiles, are crucial for classification. The model’s robustness is enhanced by SMOTE, which distributes the importance across a wide range of features. On the other hand, ROS focuses on fewer, highly discriminative features, which may increase sensitivity to changes in those measurements.
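Permutation importance measures how much a model's score drops when the values of a single feature are shuffled, which breaks that feature's link to the label while leaving everything else intact. The sketch below illustrates the idea with accuracy as the score. The toy model, data, repeat count, and seed are hypothetical stand-ins, not the Random Forest or the mandible features used in the study.

```python
import random


def accuracy(model, X, y):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(1 for row, t in zip(X, y) if model(row) == t) / len(y)


def permutation_importance(model, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_model(row):
    """Hypothetical classifier that looks only at feature 0."""
    return row[0]


# Hypothetical one-hot style features and labels
X = [[0, 1], [1, 0], [0, 0], [1, 1], [0, 1], [1, 0], [0, 0], [1, 1]]
y = [0, 1, 0, 1, 0, 1, 0, 1]
imp = permutation_importance(toy_model, X, y)
print(imp)
```

The feature the toy model ignores receives an importance of exactly zero, while the feature it relies on shows a clear drop; this is the behaviour that the rankings in Fig. 9 summarize for the real model.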

The plot for the Gini index of the decision tree is shown in Fig. 10 for SMOTE and ROS. Gini importance measures each feature’s impact on reducing node impurity, with higher values indicating a stronger role in classification.

For the SMOTE, N6 Gonial angle_0 and N12 Flexure ramal post border_0 are the dominant features, having importance scores above 0.10. Other contributors include N3 Lower border_0, N7 Post Edge of ramus_0, and N5 AR Profile_1. The importance values are relatively varied across features, suggesting that SMOTE enables the model to utilize a broader set of predictors.

For ROS, N12 Flexure ramal post border_0 and N6 Gonial angle_0 remain the top features, but with even higher importance, particularly for the top feature, which exceeds 0.19. The ranking also shows a slightly steeper drop after the top few features, indicating stronger reliance on key variables such as N4 AR Shape_0 and N7 Post Edge of ramus_0. The overlap in leading features between SMOTE and ROS reflects their consistent predictive value, while the sharper concentration in ROS suggests a heavier dependence on a smaller feature set.
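The quantity underlying Gini importance is the reduction in node impurity achieved by a split. The sketch below computes that reduction for a single split on a binary (one-hot) feature. The labels and features are hypothetical, and a full tree's importance additionally sums these reductions over all nodes that use the feature, weighted by node size, and is typically normalized.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def gini_decrease(feature, labels):
    """Impurity reduction from splitting on a binary (0/1) feature."""
    left = [t for f, t in zip(feature, labels) if f == 0]
    right = [t for f, t in zip(feature, labels) if f == 1]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted


# Hypothetical labels and two candidate one-hot features
labels = ["M", "M", "M", "F", "F", "F"]
informative = [1, 1, 1, 0, 0, 0]   # separates the classes perfectly
weak = [1, 0, 1, 0, 1, 0]          # barely related to the label

strong_gain = gini_decrease(informative, labels)
weak_gain = gini_decrease(weak, labels)
print(round(strong_gain, 3), round(weak_gain, 3))
```

A feature that separates the classes cleanly removes all of the parent node's impurity, whereas a weakly related feature removes very little, which is what the bar heights in Fig. 10 reflect.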

Model performance variability

The plot for the standard deviation for the Jaccard Index, F1 score, and accuracy score for all the ML algorithms in this study is presented in Fig. 11 for both SMOTE and ROS methods.

Standard deviation is used to measure performance variability, with lower values indicating more consistent results across multiple runs. For SMOTE, SVM records the highest variability for the Jaccard Index (0.111%) and Accuracy Score (0.086%), while DT shows the highest variability for the F1 Score (0.149%). KNN shows the lowest variability for the Jaccard Index, F1 score, and accuracy (0.073%, 0.052%, and 0.052%, respectively), which reflects its strong stability. RF and DT show moderate variability, with RF performing better than DT for F1 Score stability but slightly worse for Jaccard Index consistency.

For ROS, RF exhibits the highest variability for Jaccard Index (0.103%) and Accuracy Score (0.073%), while DT again shows the largest fluctuation for the F1 Score (0.173%). The most stable model recorded was KNN, having the lowest variability score for F1 (0.035%) and Accuracy score (0.05%), whereas SVM records the lowest Jaccard Index variability (0.065%).

Overall, KNN exhibits the least sensitivity to variations in training data from resampling methods, performing consistently with both SMOTE and ROS. Conversely, DT and RF are more affected by small perturbations during oversampling, particularly in the F1 Score, due to their inherent instability. SVM, on the other hand, exhibits moderate variability but can spike for certain metrics with SMOTE, likely due to changes in support vectors from synthetic samples. Thus, simpler algorithms like KNN are more robust to resampling variability, while more complex models may require additional tuning for stability.

SHAP waterfall and SHAP summary visualization

SHapley Additive exPlanations (SHAP) is a method for interpreting the predictions of machine learning models. A SHAP waterfall plot visualizes an individual prediction by showing how each feature’s SHAP value contributes from the baseline (average prediction) to the final output. It reveals which features push the prediction higher or lower, offering detailed interpretability.

In contrast, a SHAP summary plot provides a global view of feature importance across the dataset, illustrating the magnitude and direction of each feature’s impact. This plot combines feature importance with SHAP value distribution, helping us identify the most significant features and their influence on predictions. Figs. 12 and 13 show the SHAP waterfall and summary plots, respectively, for KNN, DT, and SVM under both SMOTE and ROS.

The SHAP waterfall plots in Fig. 12 indicate how individual features contribute to specific model predictions for different algorithms and resampling techniques. In each plot, the baseline prediction E[f(X)] is adjusted in a stepwise manner by the features that influence the most, with red bars indicating features that push the prediction higher and blue bars showing those that push it lower. In the KNN model with SMOTE (a) and Random Over-Sampling (ROS) (d), features like N6 Gonial angle_3 and N12 Flexure ramal post border_0 show significant effects. In the Support Vector Machine (SVM) models (c, f), N12 Flexure ramal post border_0 has a strong positive influence. Decision Tree models (b, e) are simpler, often dominated by a single feature like N1 Shape of chin, indicating lower complexity and more discrete decision rules.

Figure 13 presents SHAP summary plots, which display the overall importance and direction of influence of each feature across all predictions for different models and resampling strategies. In KNN and SVM models with both SMOTE and ROS (a, c, d, f), features such as N3 Lower border_0, N6 Gonial angle, and N12 Flexure ramal post border emerge as consistently high-impact predictors, with their SHAP values indicating whether higher feature values push predictions toward positive or negative outcomes. In contrast, Decision Tree models (b, e) show a more simplified influence landscape, often dominated by a single categorical feature like N1 Shape of chin, resulting in minimal variation in SHAP values across instances. The color gradient reflects the feature value (red for high, blue for low), while the horizontal spread indicates the strength and variability of each feature’s contribution to the model output.
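The additive structure that a waterfall plot displays can be verified on a small example by computing Shapley values exactly. The sketch below enumerates all feature subsets and replaces "absent" features with values from a background set. It is a brute-force illustration of the definition, not the estimator used by the SHAP library, and the scoring function, background rows, and instance are all hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(f, x, background):
    """Exact Shapley values of f at x; absent features come from background."""
    n = len(x)

    def value(subset):
        # Mean output when only the features in `subset` are fixed to x
        total = 0.0
        for z in background:
            mixed = [x[j] if j in subset else z[j] for j in range(n)]
            total += f(mixed)
        return total / len(background)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contrib = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                contrib += weight * (value(s | {i}) - value(s))
        phi.append(contrib)
    return value(set()), phi


def toy_score(row):
    """Hypothetical model output over three one-hot features."""
    return 0.2 + 0.5 * row[0] + 0.2 * row[1] * row[2]


background = [[0, 0, 0], [1, 0, 1], [0, 1, 1], [1, 1, 0]]
x = [1, 1, 1]
base, phi = shapley_values(toy_score, x, background)
print(round(base, 3), [round(p, 3) for p in phi])

# Additivity: the baseline plus all contributions recovers the prediction
print(abs(base + sum(phi) - toy_score(x)) < 1e-9)
```

The baseline plays the role of E[f(X)] in a waterfall plot, and each Shapley value is one bar; stacking the bars on the baseline reproduces the model output for that instance. Exact enumeration scales exponentially with the number of features, so practical tools rely on approximations.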