A framework for comparing model accuracy based on cross-validation
We first describe our framework for assessing the statistical significance of accuracy differences between two classification models evaluated by “repeated” CVs22,23, a practice that has been shown to be problematic but is still frequently adopted by researchers. In repeated CV, both models are trained and evaluated using a K-fold (stratified) CV that is repeated M times. The resulting \(K \times M\) accuracy scores of the two models are then compared by a statistical test. We investigate whether this testing procedure quantifies the statistical significance of the difference in classification accuracy consistently across different choices of K and M.
In designing this framework, we first note that the accuracy of ML models generally depends on the dataset and sample size (e.g., training non-linear models generally requires more data than training linear models, so non-linear models are only more predictive when training data is sufficient). This dependence makes it difficult to disentangle the impact of CV setups from genuine accuracy differences between models. Therefore, we refrain from comparing models with different underlying algorithms and instead propose a framework to construct two classifiers with the same “intrinsic” predictive power (Fig. 1); that is, for any dataset, neither model has a theoretical algorithmic advantage over the other, and any observed accuracy difference between the two models arises only by chance. Specifically, we create two classifiers by executing the following steps:
- Step 1: Randomly choose N samples from each class;
- Step 2: Create a random zero-centered Gaussian vector with standard deviation \(\frac{1}{E}\), where E is a predefined parameter called the perturbation level. The dimension of the vector equals the number of features;
- Step 3: In each of the \(K \times M\) validation runs, train a linear Logistic Regression (LR) on the training data;
- Step 4: Create a perturbed model by adding the random vector to the linear coefficients of its decision boundary;
- Step 5: Create a second perturbed model by subtracting the random vector from the decision boundary;
- Step 6: Evaluate the accuracy of the two perturbed models on the testing data;
- Step 7: Use a hypothesis testing procedure (e.g., paired t-test) to produce a p-value quantifying the significance of the difference in prediction accuracy across the \(K \times M\) testing folds.
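The seven steps can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the logistic regression is a plain gradient-descent fit, the fold construction is a simple hand-written stratified split, the random vector is assumed to be drawn once and shared across all \(K \times M\) runs, and the helper names (`fit_logreg`, `stratified_folds`, `compare_perturbed`) are hypothetical. Step 1 (sampling N subjects per class) is assumed to have happened before `X` and `y` are passed in.

```python
import numpy as np
from scipy import stats


def fit_logreg(X, y, lr=0.1, n_iter=300):
    """Plain gradient-descent logistic regression; returns (weights, bias)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b


def stratified_folds(y, K, rng):
    """Return K test-index arrays with roughly equal class proportions."""
    folds = [[] for _ in range(K)]
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        for k, chunk in enumerate(np.array_split(idx, K)):
            folds[k].extend(chunk)
    return [np.array(f) for f in folds]


def compare_perturbed(X, y, K=5, M=2, E=5.0, seed=0):
    """Run Steps 2-7 and return both accuracy vectors and the t-test p-value."""
    rng = np.random.default_rng(seed)
    # Step 2: random vector with standard deviation 1/E (assumed shared by all runs).
    v = rng.normal(0.0, 1.0 / E, size=X.shape[1])
    acc_plus, acc_minus = [], []
    for _ in range(M):                                 # M repetitions
        for test in stratified_folds(y, K, rng):       # K folds per repetition
            train = np.setdiff1d(np.arange(len(y)), test)
            w, b = fit_logreg(X[train], y[train])      # Step 3
            for sign, out in ((+1, acc_plus), (-1, acc_minus)):  # Steps 4 and 5
                pred = (X[test] @ (w + sign * v) + b) > 0
                out.append(np.mean(pred == y[test]))   # Step 6
    acc_plus, acc_minus = np.array(acc_plus), np.array(acc_minus)
    # Step 7: uncorrected paired t-test over the K*M accuracy scores.
    p_value = stats.ttest_rel(acc_plus, acc_minus).pvalue
    return acc_plus, acc_minus, p_value
```

Because the standard deviation is \(\frac{1}{E}\), a larger E corresponds to a smaller perturbation. Calling `compare_perturbed` repeatedly with different seeds and different K, M values gives the kind of p-value distributions examined in the following sections.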
In this framework, the perturbations along two strictly opposite directions ensure that the magnitude of the discrepancy between the two models is strictly linked to the perturbation level E. As a result, the observed accuracy differences between the two perturbed models are due simply to chance rather than to intrinsic differences (e.g., one model having a superior algorithm design or being better suited to a specific sample size than the other). Ideally, one would want to quantify the statistical significance of that difference consistently, regardless of the choices of K and M. In the following sections, we demonstrate that in practice, one model can appear statistically significantly better than another based solely on variations in the choices of K and M.
Model comparison using paired t-test
We applied the above framework to compare model accuracy in three neuroimaging-based classification tasks: 1) classifying 222 healthy control subjects vs. 222 patients with Alzheimer’s disease based on T1-weighted MRI released by the Alzheimer’s Disease Neuroimaging Initiative (ADNI)24 study; 2) distinguishing 391 individuals with autism spectrum disorders (ASD) from 458 typically developing controls based on resting-state functional MRI released by the Autism Brain Imaging Data Exchange (ABIDE I) Dataset25; and 3) identifying the sex of 6125 boys and 5600 girls based on (head size corrected) T1-weighted MRI released by the Adolescent Brain Cognitive Development (ABCD) study26. Neuroimaging data of all three datasets were preprocessed into tabular measurements that served as the input features for classification (see Methods).
A commonly misused procedure for comparing model accuracy is to use a paired t-test to compare the two sets of \(K \times M\) accuracy scores from two models. To further illustrate this flaw, we applied the proposed framework (Fig. 1) to each of the three neuroimaging datasets to investigate the outcomes of the t-test based on various CV setups with different K, M combinations. In each K, M setup, we repeated the framework 100 times and recorded the average p-value of the corresponding statistical test.
In this experiment, we focused on balanced classification by setting the number of random samples to \(N=500\) for ABCD, \(N=300\) for ABIDE, and \(N=222\) for ADNI. We chose E (in Step 2) for each dataset such that the resulting p-values were roughly on the same level. Supplement Fig. S1.1 a-f confirms that in all three classification tasks, the (unperturbed) Logistic Regression classifier achieved a classification accuracy significantly higher than chance in all K, M setups. Notably, changing from 2-fold CV to 50-fold CV resulted in higher average classification accuracy and larger variance in accuracy over folds. Next, we compared the accuracy of the two perturbed classifiers in all three datasets. Based on the proposed comparison framework, Fig. 2a-c shows the range of p-values (quantifying the significance of accuracy differences) based on 2-fold or 50-fold CV, either without repetition (\(M=1\)) or repeated up to 10 times (\(M=10\)). We observe an undesired artifact: test sensitivity increased (p-values decreased) with the number of CV repetitions M and the number of folds K. Furthermore, Fig. 3a-c shows the average p-value for more K, M combinations. Using \(p<0.05\) as the significance threshold, Fig. 3d-f shows the “Positive Rate”, i.e., how often the two models were found to have significantly different accuracy based on K-fold CV repeated M times. Despite applying two classifiers of the same intrinsic predictive power to the same dataset, the outcome of the model comparison depended largely on the CV setup, with a higher likelihood of detecting a significant accuracy difference when both K and M were large. For example, in the ABCD dataset, the positive rate increased on average by 0.49 from \(M=1\) to \(M=10\) across different K settings, and by 0.07 on average from \(K=2\) to \(K=50\) across different M settings, which highlights the dependence of the p-value on the choice of K and M.
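The two summary quantities used here, the average p-value and the positive rate per K, M setup, are simple to compute once the p-values of the repeated runs are collected. The following is a minimal sketch; the function name `summarize_setups` and the dictionary layout keyed by `(K, M)` are assumptions made for illustration, and the numbers in the usage example are placeholders rather than results from the study.

```python
import numpy as np


def summarize_setups(p_values_by_setup, alpha=0.05):
    """Map each (K, M) setup to (average p-value, positive rate at level alpha)."""
    summary = {}
    for setup, p_values in p_values_by_setup.items():
        p = np.asarray(p_values, dtype=float)
        # Positive rate: fraction of runs whose p-value falls below alpha.
        summary[setup] = (float(p.mean()), float(np.mean(p < alpha)))
    return summary


# Placeholder p-values for two hypothetical setups (not data from the paper).
example = {(2, 1): [0.50, 0.04, 0.20, 0.01], (50, 10): [0.01, 0.03, 0.20, 0.02]}
for (K, M), (avg_p, rate) in summarize_setups(example).items():
    print(f"K={K}, M={M}: average p={avg_p:.3f}, positive rate={rate:.2f}")
```

In the experiments described above, each list would hold the 100 p-values obtained by rerunning the framework for one K, M combination.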
As already pointed out in many studies, one major issue of repeated CV is that the \(K\times M\) accuracy scores are highly dependent due to the overlap between the test (or training) folds of different validation runs22,27. This violates the assumption of sample independence in the t-test. To resolve this issue, a “corrected” version of the paired t-test22,27 has been proposed to control for the dependency across accuracy scores. We therefore examined whether the corrected t-test could remove the dependency of test sensitivity on K and M. Results in Fig. 2d-f show that the correction indeed yielded more conservative p-values than the regular t-test, but test sensitivity was still largely influenced by the CV setup. For example, 50-fold CV still resulted in a lower range of p-values than 2-fold CV in all three datasets, and a large number of CV repetitions resulted in the lowest p-values in ADNI. Fig. 3j-l suggests that the highest positive rate occurred under a combination of large K and M.
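To make the difference between the two t-tests concrete, the sketch below computes both from the same pair of accuracy vectors. It assumes that the “corrected” test is the variance correction commonly used for repeated CV, in which the term \(1/n\) in the variance of the mean difference is replaced by \(1/n + n_{test}/n_{train}\), with \(n = K \times M\) and \(n_{test}/n_{train} = 1/(K-1)\) for K-fold CV. The exact form used in the paper is given in its cited references, so this is an assumption, and the function name `paired_ttests` is hypothetical.

```python
import numpy as np
from scipy import stats


def paired_ttests(acc_a, acc_b, K):
    """Return (uncorrected p-value, corrected p-value) for K-fold repeated CV scores."""
    d = np.asarray(acc_a, dtype=float) - np.asarray(acc_b, dtype=float)
    n = d.size                                # n = K * M accuracy differences
    mean, var = d.mean(), d.var(ddof=1)
    # Uncorrected: treats the n differences as independent samples.
    t_plain = mean / np.sqrt(var / n)
    # Corrected (assumed form): inflates the variance by n_test / n_train = 1 / (K - 1).
    t_corr = mean / np.sqrt(var * (1.0 / n + 1.0 / (K - 1)))
    p_plain = 2.0 * stats.t.sf(abs(t_plain), df=n - 1)
    p_corr = 2.0 * stats.t.sf(abs(t_corr), df=n - 1)
    return p_plain, p_corr
```

Under this form, the extra variance term \(1/(K-1)\) shrinks as K grows, and the term \(1/n\) shrinks as either K or M grows, so a residual dependence of the p-value on the CV setup is at least plausible.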

Statistical significance of comparing the accuracy of two Logistic Regression classifiers with the same intrinsic predictive power via cross-validation: (a–f) In each K, M setup, the framework of Fig. 1 was executed 100 times. In each run, a paired t-test compared the \(K \times M\) accuracy scores of the two perturbed Logistic Regression models. We record box-plots of the resulting p-values for (a–c) uncorrected t-test and (d–f) corrected t-test.

The average p-value and positive rate (how frequently the two perturbed Logistic Regression models had a significant accuracy difference based on the threshold of \(p<0.05\) in the 100 runs) were recorded for each K, M combination for uncorrected t-test (a–f) and corrected t-test (g–l).
Reproducibility of results
First, we examined whether the observed pattern in Fig. 2 was due to the relatively low classification accuracy in the neuroimaging applications (\(N<1000\)). We applied the proposed comparison framework of Fig. 1 to two synthetic classification datasets with \(N=10,000\) and \(N=100,000\), with a known Bayes error of 5% (see Methods for dataset creation). Supplement Figure S1.2 shows that the classifiers achieved high accuracy (92%) in both synthetic datasets. When \(N=10,000\) (Figure S1.2 a-d), the observed patterns aligned with the neuroimaging-based results, where higher K, M combinations resulted in lower p-values (greater chance of detecting significant accuracy differences between two perturbed classifiers). When increasing the sample size to \(N=100,000\), Supplement Figure S1.2 e-h suggests that, compared to \(N=10,000\), the dependency of the p-value on M was less pronounced, but higher K still resulted in lower p-values 80% of the time.
Next, to investigate whether the dependency of p-values on CV setups was specific to linear models, we repeated the neuroimaging-based experiments to compare accuracy scores between two perturbed Multi-Layer Perceptrons (MLP) (Supplement Fig. S1.1 g-l and Fig. S1.3), where the perturbation (of Step 4) was applied by adding the random vector to the weights of the last fully connected layer. Supplement Fig. S1.3 suggests similar patterns as in Fig. 2, where the range of p-values depended on both K and M. Lastly, instead of applying positive and negative perturbations as in the framework of Fig. 1, we perturbed the model with two entirely different random Gaussian vectors with standard deviation \(\frac{1}{E}\). We also replaced t-tests with permutation tests to handle the potential non-Gaussian distribution of accuracy scores. Supplement Figs. S1.4 and S1.5 largely replicate the finding that test sensitivity increased with K and M.
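A permutation test of this kind can be sketched as follows. The specific permutation scheme is not described in this section, so the sketch assumes a paired sign-flip test on the per-fold accuracy differences, which is one common choice for paired scores; the function name and default number of permutations are likewise illustrative.

```python
import numpy as np


def sign_flip_permutation_test(acc_a, acc_b, n_perm=10000, seed=0):
    """Two-sided paired permutation p-value for the mean accuracy difference."""
    rng = np.random.default_rng(seed)
    d = np.asarray(acc_a, dtype=float) - np.asarray(acc_b, dtype=float)
    observed = abs(d.mean())
    # Under the null, swapping the two models' labels in a fold flips the sign of d.
    signs = rng.choice([-1.0, 1.0], size=(n_perm, d.size))
    null = np.abs((signs * d).mean(axis=1))
    # Add-one correction keeps the estimated p-value strictly positive.
    return (1 + np.sum(null >= observed)) / (1 + n_perm)
```

Like the t-test, this procedure treats the \(K \times M\) differences as exchangeable, so it removes the Gaussian assumption but not the dependence between overlapping folds.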

The distribution of p-values and positive rates when applying 4 testing procedures to compare two perturbed models in 3 neuroimaging-based classification tasks.
Rank of sensitivity across t-tests, McNemar’s test, and DeLong’s test
In addition to the t-tests, two other commonly used testing procedures for comparing model accuracy are McNemar’s test28 and DeLong’s test29. Unlike the t-tests, McNemar’s and DeLong’s tests typically only require one round of CV (\(M=1\)). Specifically, the classification results are pooled together from the K runs, and the number of correctly classified samples and the area-under-the-ROC-curve (AUC) are compared between two models by Chi-squared statistics. Based on the framework of Fig. 1, we then examined whether the sensitivity of McNemar’s test and DeLong’s test depended on the number of folds K. To do so, we repeated the model comparison framework 100 times for each K setting (similar to the previous experiment) and recorded the distribution of p-values and positive rates of McNemar’s test and DeLong’s test. Supplementary Fig. S1.6 shows that these sensitivity levels were more invariant to K compared to the two variants of the t-test.
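Both tests operate on predictions pooled over the K folds. The sketch below shows textbook forms of the two procedures and is not taken from the paper: McNemar's test uses the discordant counts with a continuity correction (an assumption), and DeLong's test uses the standard structural-components estimate of the covariance between two correlated AUCs. The DeLong p-value is computed from a z statistic, which is equivalent to a Chi-squared statistic with one degree of freedom. Function names are hypothetical.

```python
import math
import numpy as np


def mcnemar_test(y_true, pred_a, pred_b):
    """Return (chi-squared statistic, p-value) from pooled class predictions."""
    y_true, pred_a, pred_b = map(np.asarray, (y_true, pred_a, pred_b))
    a_ok, b_ok = pred_a == y_true, pred_b == y_true
    b = int(np.sum(a_ok & ~b_ok))      # model A correct, model B wrong
    c = int(np.sum(~a_ok & b_ok))      # model A wrong, model B correct
    if b + c == 0:
        return 0.0, 1.0                # no discordant samples: no evidence of a difference
    stat = max(abs(b - c) - 1, 0) ** 2 / (b + c)
    # Survival function of a chi-squared variable with 1 degree of freedom.
    return stat, math.erfc(math.sqrt(stat / 2.0))


def delong_test(y_true, score_a, score_b):
    """Return (AUC of A, AUC of B, two-sided p-value) from pooled scores."""
    y_true = np.asarray(y_true).astype(bool)
    aucs, v10, v01 = [], [], []
    for s in (np.asarray(score_a, dtype=float), np.asarray(score_b, dtype=float)):
        pos, neg = s[y_true], s[~y_true]
        # psi[i, j] = 1 if positive i outranks negative j, 0.5 on ties, else 0.
        psi = (pos[:, None] > neg[None, :]) + 0.5 * (pos[:, None] == neg[None, :])
        v10.append(psi.mean(axis=1))   # one component per positive sample
        v01.append(psi.mean(axis=0))   # one component per negative sample
        aucs.append(float(psi.mean()))
    m, n = len(v10[0]), len(v01[0])
    s10, s01 = np.cov(np.vstack(v10)), np.cov(np.vstack(v01))
    var = ((s10[0, 0] + s10[1, 1] - 2 * s10[0, 1]) / m
           + (s01[0, 0] + s01[1, 1] - 2 * s01[0, 1]) / n)
    if var <= 0:
        return aucs[0], aucs[1], 1.0   # degenerate case, e.g. identical scores
    z = (aucs[0] - aucs[1]) / math.sqrt(var)
    return aucs[0], aucs[1], math.erfc(abs(z) / math.sqrt(2.0))
```

Because each sample appears in exactly one test fold when \(M=1\), the pooled predictions contain one entry per sample regardless of K, which is consistent with these tests being less affected by the number of folds than the t-tests.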
Next, we examined the relative sensitivity among the t-tests, McNemar’s test, and DeLong’s test (i.e., which procedure resulted in the most conservative p-values). We recorded the distribution of p-values and positive rates over all K, M settings in Fig. 4. This figure shows that the average positive rates for the three datasets were not in the same range and also varied among different hypothesis testing procedures. For example, in the ABCD classification task, the difference in the average positive rate between the most sensitive test and the least sensitive test was 0.46 for the Logistic Regression model (Fig. 4c) and 0.51 for the MLP model (Fig. 4d). Another observation from Fig. 4 is that the overall rank of sensitivity among the 4 procedures was the same across the 3 datasets: the uncorrected t-test was always the most sensitive procedure (highest positive rate), followed by McNemar’s test and the corrected t-test, and the least sensitive procedure was DeLong’s test.
We then investigated whether the rank of sensitivity among the 4 testing procedures remained constant or, alternatively, varied with CV setups. To do so, we repeated the comparison of the two perturbed Logistic Regression models in ADNI under two perturbation levels. Fig. 5 plots the average p-value over 100 runs for each E, K, M combination, and the shape of each radar plot encodes the rank of sensitivity of the 4 procedures. For example, Fig. 5a suggests that when choosing one-time 2-fold CV to compare two models perturbed at level \(E=6\), the rank of sensitivity was the same as in Fig. 4, with the uncorrected t-test being the most sensitive procedure (smallest p-values closest to the center) and DeLong’s test being the least sensitive (largest p-values farthest away from the center). According to the shape changes of the radar plots in Fig. 5, the rank of sensitivity among the 4 test procedures was not constant. The two variants of the t-test were the most sensitive tests (lowest p-value) when \(M=1, K=50\) (Fig. 5c) but were less sensitive than DeLong’s test and McNemar’s test under perturbation level \(E=3\) in Fig. 5a,b.

4 testing procedures were applied to compare two perturbed MLP models for classifying 222 controls and 222 patients from the ADNI dataset. The test was repeated 100 times at two different perturbation levels (E), number of folds (K), and number of CV repetitions (M). Each radar plot records the average p-value over the 100 runs.
Variability of test outcomes in comparing different ML models

We compared classification accuracy across 5 classifiers on the 3 datasets. For a model pair, we conducted the comparison using 4 testing procedures in 25 different CV setups and recorded the positive rate at the \(p<0.05\) level for each procedure.
Variations in the sensitivity of test procedures based on CV setups may contribute to p-hacking, where one could search through CV setups and testing procedures to pursue statistical significance of accuracy differences between two models with different methodological designs. To show this, we estimated the accuracy of 5 classifiers, i.e., Multilayer Perceptron (MLP), Logistic Regression (LR), Random Forest (RF), Support Vector Machine (SVM), and K-Nearest Neighbor (KNN), on the three datasets. For each classifier, we evaluated the accuracy using 25 different CV setups (based on different choices of K and M; see Methods for details). The average classification accuracy for each classifier is shown in Table 1. Apart from KNN consistently achieving the lowest accuracy across all three datasets, the accuracy difference among the remaining 4 classifiers was small, with the gap between the most and least accurate classifier being 6%, 3%, and 2% for the ABCD, ABIDE, and ADNI datasets, respectively.
Next, we aimed to detect statistical differences in the accuracy of the 5 classifiers. There were 10 pairs of models to be compared. For each pair of models, we conducted 100 comparisons by combining the 4 testing procedures with the 25 CV setups. For each testing procedure, we recorded the positive rate, i.e., the percentage of the 25 CV setups that reached a significance level of \(p<0.05\) (Fig. 6). Across all methods and datasets, McNemar’s test, with an average positive rate of 0.44, was the most sensitive method, followed by the uncorrected t-test at 0.37, the corrected t-test at 0.33, and finally DeLong’s test, which was the least sensitive, with an average of 0.30. Critically, the test outcomes varied substantially with CV setups and testing procedures, making it difficult to draw consistent conclusions about whether one classifier was significantly more accurate than another. For example, in the ABCD classification, only the comparison between RF and LR using DeLong’s test was consistently insignificant (positive rate = 0). For all other model pairs, there was at least one combination of CV setup and testing procedure that resulted in a statistically significant accuracy difference between the two models. However, none of the comparisons were significant for all 100 tests, although the comparisons of 3 model pairs (KNN vs. RF, KNN vs. SVM, KNN vs. LR) had a positive rate > 0.8 across all 4 testing procedures. These results on ABCD were largely replicated on ABIDE, where the majority of model comparisons showed inconsistent outcomes; i.e., only a subset of CV setups resulted in significant outcomes (0
