Source of the data
The data for this study were obtained from the Somaliland National Examination Database, which is managed and certified by the Somaliland National Examination and Certification Board (SLNECB) under the supervision of the Ministry of Education and Science, Somaliland. The dataset includes information on various variables, such as place of residence (districts and regions), school location (urban or rural), students’ gender, school type (public or private), and school accommodation (boarding or day). The dataset included 20,950 students, including 9,111 girls and 11,839 boys, who participated in the 2023 national primary school exams.
Data preprocessing and handling
To ensure full reproducibility, the dataset underwent a rigorous preprocessing workflow before model development. Examination records were complete, as the Somaliland National Examination Database includes only students who sat for the examination; therefore, no students absent from the exam appeared in the dataset. Categorical variables such as region, district, residence, and school type were encoded using one-hot encoding, while binary variables, including gender and school ownership, were retained as dichotomous. For continuous variables related to subject performance (Somali, Arabic, English, and Science), each was converted into a binary “pass/fail” indicator based on national grading criteria: scores ≥ 50 were coded as “pass,” and scores < 50 as “fail.” There were no other continuous predictors in the dataset. The dependent variable (mathematics pass/fail) presented a natural imbalance of 64/36; therefore, no resampling techniques were used. Instead, we applied stratified train–test splitting to preserve the class distribution and evaluated models using imbalance-sensitive metrics such as F1-score, balanced accuracy, and AUC.
Hyperparameter optimization was conducted via a grid search embedded within 10-fold stratified cross-validation to enhance robustness and reduce overfitting. For Decision Trees, parameters such as maximum depth, minimum samples per leaf, and splitting criteria were tuned. At the same time, Random Forest models were optimized using the number of estimators, the maximum number of features, and the maximum depth. SVM models were evaluated using both linear and RBF kernels, with penalty parameter (C) and kernel coefficient (γ) tuned. KNN optimization included testing a range of k values (3–15) and alternative distance-weighting schemes. Naïve Bayes and Logistic Regression required minimal parameter adjustments and were implemented using standard configurations. All preprocessing, variable transformations, and hyperparameter-tuning procedures were executed with fixed random seeds to guarantee methodological transparency and full reproducibility.
Study area
This study focuses on assessing primary school students’ mathematics achievement in Somaliland using datasets corresponding to the 2022/2023 academic year. It is concerned with students’ mathematics performance, particularly in the context of the Somaliland National Primary School Examination (SNPE). To ensure methodological rigor and the acquisition of a representative dataset for subsequent analysis and evaluation, the scope of the data was extended to include all primary education institutions in the 14 regions and 23 constituencies of Somaliland, as shown in Fig. 1. This comprehensive approach ensured a sophisticated examination of the educational panorama across diverse administrative and geographical strata in Somaliland. The methodology builds upon previous studies, such as34, which utilized similar approaches to predict student dropout rates in Somaliland, and35, which examined the determinants of student academic performance using multi-level logistic regression to estimate unobserved effects at both the student and school levels. Furthermore36, machine learning-based analysis of academic performance determinants, using insights from the 2021/2022 National Secondary School Exams, underscores the value of integrating advanced analytical techniques to uncover critical educational patterns, which this study also endeavors to achieve.

Somaliland map (regions and districts).
Variables of study
Dependent variable
The dependent variable in this study was students’ mathematics performance in primary school examinations, categorized as pass (coded as 1) or fail (coded as 0). The dichotomous classification signified a binary outcome, with one category designated as the event (coded as 1) and the other as the reference level (coded as 0). In the context of this research, success in mathematics is defined as students scoring 50 and above, which is termed a pass (coded 1), whereas those scoring below 50 are classified as failures (coded zero). Therefore, the variable ‘Student Math Score’ takes the value of 1 for scores equal to or above 50, denoting a pass, and 0 for scores below 50, denoting a failure in the primary examinations.
Independent variables
Various factors are expected to be associated with students’ mathematics performance in primary school examinations, as shown in Table 2. In this study, the following independent variables were examined as potential influences on students’ mathematics performance, particularly in the context of primary mathematics examinations:
Specification of supervised machine learning models
Logistic regression
Logistic regression (LR) is a widely used classification algorithm that predicts the probability of binary outcomes by modeling the relationship between independent variables and the dependent variable using logistic functions. It is a popular choice among data analysts and statisticians because of its simplicity and effectiveness36. The formula for the LR is
$$\:P\left(Y=1|X\right)=\:\frac{1}{\left(1\:+\text{exp}\left(-z\right)\right)},\:$$
where \(\:P(Y=1|X)\) represents the probability of the positive class for a given instance, \(\:X\) represents the input variables, and \(\:z\) is the linear combination of the input variables and their respective coefficients.
Decision tree
A decision tree (DT) is a type of non-parametric supervised machine learning algorithm that constructs a tree-like model of decisions and their possible outcomes. It uses a data-splitting process based on different attribute values to form branches and leaf nodes, and the decision-making process is guided by a sequence of if-else conditions. The Decision Tree algorithm relies on various metrics, such as information gain, Gini index, and entropy, to identify the optimal attribute for partitioning the data35.
Random forest
Random forests (RFs) are ensemble learning methods that combine multiple decision trees to improve prediction accuracy and reduce overfitting. A collection of Decision Trees was generated using random subsets of the training data and features. The final prediction was made by pooling the predictions of each tree. There is no single formula for random forests because they involve the integration of Decision Trees37.
Naïve Bayes
Naive Bayes (NB) is a probabilistic classifier that uses Bayes’ theorem and assumes independence between features. It determines the probability of each class based on a set of input features and selects the class with the highest probability35. More precisely, the following formula for NB is derived from Bayes’ theorem:
$$\:P\left(C|X\right)=\:\frac{\left(P\left(X|C\right)\:P\left(C\right)\right)}{P\left(X\right)},$$
where \(\:P\left(C\right|X)\:\)represents the probability of the class \(\:C\) Given the input features \(\:X\), \(\:P\left(X\right|C)\) is the probability of features \(\:X\) given class \(\:C\), \(\:P\left(C\right)\:\)is the prior probability of class \(\:C\), and \(\:P\left(X\right)\) is the prior probability of features \(\:X\).
Support vector machine
Support vector machines (SVMs) are widely employed for both classification and regression tasks because they identify an optimal hyperplane that effectively separates data points from different classes. The SVM formula involves determining the decision boundary by solving a quadratic optimization problem38.
K nearest neighbors
The K-nearest neighbors (KNN) algorithm is a simple yet highly efficient method for categorizing an instance based on its nearest neighbors in the feature space. The KNN algorithm can make accurate predictions by determining the majority class of k-nearest neighbors. The KNN algorithm involves calculating the distances between the target instance and all other instances in the training set to identify the K nearest neighbors39.
These six supervised machine learning models offer a diverse set of techniques for predicting mathematics performance using the 2022/2023 Somaliland National Primary Examination Results. Each model has its own strengths, assumptions, and formulas, enabling a comprehensive assessment of its predictive ability.
Model adequacy measures
The use of model adequacy measures provides essential information about a model’s performance, precision, and reliability. Our analysis included several measures, such as accuracy, sensitivity, specificity, F1 score, precision, and recall. Each of these measures plays a vital role in assessing the predictive ability of the models and their effectiveness in correctly categorizing the instances.
Accuracy
Accuracy is a measure of the proportion of correctly predicted instances relative to the total number of cases. This value was calculated using the following equation:
$$\:Accuracy\:=\:\frac{(TP\:+\:TN)\:}{(TP\:+\:TN\:+\:FP\:+\:FN)},$$
Where TP (True Positives) represents the number of correctly predicted positive instances, TN (True Negatives) represents the number of correctly predicted negative instances, FP (False Positives) represents the number of incorrectly predicted positive instances, and FN (False Negatives) represents the number of incorrectly predicted negative instances.
A higher accuracy score indicates a more dependable model.
Sensitivity (recall)
Sensitivity, often referred to as recall or true positive rate, measures the proportion of correctly predicted positive instances (i.e., identifying students who performed well in mathematics) among all actual positive instances. The formula used to calculate this value is as follows
$$\:Sensitivity\:=\frac{\:TP\:}{(TP\:+\:FN)},$$
Where TP and FN are as described previously.
Specificity
Specificity is a metric that assesses the proportion of correctly predicted negative instances (i.e., identifying students who did not perform well in mathematics) to all actual negative instances. The following formula was used for the calculation:
$$\:Specificity\:=\:\frac{TN}{\:(TN\:+\:FP)},\:$$
Where TN and FP are the same as those described above.
F1 score
The F1 score is a combined measure of precision and recall. It provides a balanced assessment of the model’s performance by considering both false positives and false negatives. This was calculated using the following formula.
$$\:F1\:Score\:=\:\frac{2\:*\:\left(Precision\:*\:Recall\right)\:}{\:(Precision\:+\:Recall)},$$
Precision is the proportion of correctly predicted positive instances relative to the total number of predicted positive cases, and recall, also known as sensitivity, measures the accuracy of predicting positive instances by calculating the proportion of correctly predicted positive instances among all actual positive cases.
A higher F1 score indicates better precision and recall, which reflects a more accurate and reliable model.
Precision
Precision is a metric that assesses a model’s ability to identify positive instances among all accurately predicted positive instances. This was determined using the following equation:
$$\:Precision\:=\frac{\:TP\:}{\left(TP\:+\:FP\right)},$$
Where TP and FP are as described above.
A higher precision score indicates fewer false positives and a more dependable model.
Area under the curve
The area under the curve (AUC) is a commonly used measure of model performance that assesses a supervised machine learning model’s overall effectiveness across a range of classification thresholds. This is particularly useful for evaluating the predictive ability of models and comparing their performance.
In the context of our study, which assessed the ability to predict mathematical performance using data from the 2022/2023 Somaliland National Primary Examination Results, the AUC served as a valuable tool for evaluating models’ ability to discriminate between positive and negative outcomes.
A higher AUC score indicates that the model has a superior ability to correctly classify instances, as it assigns higher probabilities to positive than to negative cases.
Using these model adequacy measures, we gained valuable insights into the performance and effectiveness of supervised machine learning models for predicting mathematical performance using the 2022/2023 Somaliland National Primary Examination Results.
Figure 2 Illustrates the methodology employed in this study. This study used the Somaliland National Examination Result 2022/2023 dataset, which was split into training and test sets at 80% and 20%, respectively. A suite of machine learning models, including Logistic Regression, Decision Tree, Random Forest, Naive Bayes, SVM, and KNN, were developed and trained on the training data. Subsequently, the performance of each model was evaluated based on the test data using a comprehensive set of accuracy metrics, including accuracy, Sensitivity, F1-score, Precision, specificity, and AUC. This rigorous methodology enabled the identification of the most suitable model for predicting student performance in the Somaliland National Examination.

Model validation strategy
To evaluate robustness, we implemented 10-fold stratified cross-validation in addition to the conventional 80/20 train–test split method. The dataset was divided into 10 folds, preserving the pass/fail ratio within each fold. Each model was trained on nine folds and tested on the remaining fold, with each fold serving as the test set in turn. The performance metrics were averaged across folds, and their standard deviations were reported. This approach mitigated the risk of overfitting and enabled us to assess the model’s stability.
Hyperparameter tuning and implementation
The hyperparameters were tuned using a grid search within the cross-validation framework. For the Decision Tree (DT) and Random Forest (RF), the maximum depth, number of estimators, and minimum samples per split were optimized. For Support Vector Machines (SVM), both linear and radial basis function kernels were tested by tuning the penalty parameter (C) and kernel coefficient (γ). For K-Nearest Neighbors (KNN), the number of neighbors (k) was tuned to 3–15, with alternative distance weighting schemes tested. Naïve Bayes and Logistic Regression required minimal tuning and were implemented with standard parameterization.
