Fashion MNIST data
The Fashion-MNIST dataset is a collection of 70,000 grayscale images, each measuring 28 \(\times\) 28 pixels. The images are divided into ten categories, such as ‘shirt’ and ‘bag’, with each category containing 7,000 images; this balance is crucial for fair machine learning training. The dataset is split into 60,000 training images and 10,000 test images. The Fashion-MNIST dataset is available at https://github.com/zalandoresearch/fashion-mnist. To mimic the model training outlined in Kossen et al.7, we selected 250 data points from the 60,000 training images using stratified sampling to obtain equal sizes across the 10 categories. We built a ResNet-18 model to classify the clothing categories using this training set of size 250 and then evaluated its performance on the test data.
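The stratified selection of 250 points (25 per class) can be sketched as follows; the function name, seed, and toy labels are illustrative, not from the original code:

```python
import numpy as np

def stratified_subsample(labels, n_total, n_classes, seed=0):
    """Select n_total indices with equal counts per class
    (assumes n_total is divisible by n_classes)."""
    rng = np.random.default_rng(seed)
    per_class = n_total // n_classes
    chosen = []
    for c in range(n_classes):
        idx = np.flatnonzero(labels == c)
        chosen.append(rng.choice(idx, size=per_class, replace=False))
    return np.concatenate(chosen)

# Toy labels standing in for the 60,000 Fashion-MNIST training labels
labels = np.repeat(np.arange(10), 600)
sel = stratified_subsample(labels, n_total=250, n_classes=10)
```

Sampling without replacement within each class guarantees exactly 25 images per category, matching the balanced design described above.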
CIFAR-10 data
The CIFAR-10 dataset is a collection of 60,000 color images, each measuring 32\(\times\)32 pixels. These images are evenly distributed across ten categories such as ‘airplane’ and ‘dog’, with 6000 images in each. The dataset is partitioned into two sets: 50,000 images for training and 10,000 images designated for testing. The CIFAR-10 dataset can be accessed at https://www.cs.toronto.edu/%7Ekriz/cifar.html. A ResNet-18 model for image classification was trained using the training data and then its performance was assessed on the test data.
Drug consumption data
The drug consumption dataset holds information on 1,885 respondents, each characterized by 12 attributes. These include personality measures for neuroticism, extraversion, agreeableness, impulsivity, and sensation seeking, as well as demographic details: education level, age, gender, country of residence, and ethnicity. Participants also provided information on their consumption of 18 drugs, both legal and illegal, such as alcohol, cannabis, and cocaine. Notably, a fictitious drug named ‘Semeron’ was included to spot respondents exaggerating their drug use. For each drug, there are seven possible usage labels, ranging from ‘Never Used’ to ‘Used in Last Day’. We used the 12 attributes as predictors and usage of the Semeron drug as the outcome. We randomly selected 150 data points to build a random forest classifier for Semeron usage and used the remaining data as test data for model evaluation. The dataset can be accessed freely at https://archive.ics.uci.edu/dataset/373/drug+consumption+quantified.
Non-alcoholic fatty liver disease (NAFLD) data
The non-alcoholic fatty liver disease (NAFLD) dataset is composed of electronic health records (EHRs) from patients seen at the University of Pennsylvania. 300 patients and 635 patients were identified as biopsy-confirmed and imaging-confirmed NAFLD cases, respectively. 2805 patients who were not diagnosed with NAFLD were drawn from the Penn Medicine Biobank database as control subjects. 41 variables, including demographics, laboratory measurements, and medications, were considered as predictors in the model. For missing value imputation, we used the mean for continuous variables and the mode for categorical variables. To reflect the real-world setting where the distribution of test data differs from that of the training data, we created two datasets: a training dataset with EHR records from the 300 biopsy-confirmed NAFLD cases and 1000 controls, and a testing dataset with EHR records from the 635 imaging-identified cases and the remaining 1905 controls. We built a random forest classifier for NAFLD using the training dataset and validated its performance using the testing dataset.
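The imputation rule described above (mean for continuous variables, mode for categorical variables) can be sketched with pandas; the column names below are hypothetical stand-ins for EHR variables, not fields from the actual dataset:

```python
import numpy as np
import pandas as pd

def impute(df, continuous, categorical):
    """Mean-impute continuous columns and mode-impute categorical columns."""
    out = df.copy()
    for col in continuous:
        out[col] = out[col].fillna(out[col].mean())
    for col in categorical:
        out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

# Hypothetical EHR fragment: a lab value and a medication flag
ehr = pd.DataFrame({"alt": [31.0, np.nan, 45.0, 40.0],
                    "statin": ["yes", "no", None, "no"]})
clean = impute(ehr, continuous=["alt"], categorical=["statin"])
```

The mean is computed over the observed values only (pandas skips NaN by default), so the filled value for `alt` is (31 + 45 + 40)/3.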
Trained models to be assessed
For the Fashion-MNIST and CIFAR-10 datasets, we normalized the pixel values of each training image to a range between 0 and 1 and built the ResNet-18 models using the code shared in Kossen et al.7. We employed a stochastic gradient descent optimizer with a learning rate of 0.1, weight decay of 5 \(\times\) \(10^{-4}\), and momentum of 0.9. The batch size was set to 128, and we used a cosine annealing schedule for the learning rate. For the drug consumption and NAFLD datasets, we trained random forest classifiers using the ‘randomForest’ package12 in the R programming language. We treated the training datasets as inaccessible during model evaluation and considered the test datasets as unlabeled data.
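The cosine annealing schedule with initial learning rate 0.1 follows the standard formula \(\eta_t = \eta_{\min} + \frac{1}{2}(\eta_0 - \eta_{\min})\{1 + \cos(\pi t/T)\}\); the horizon \(T = 200\) epochs below is an assumption for illustration, not a value stated above:

```python
import math

def cosine_lr(t, t_max, lr_init=0.1, lr_min=0.0):
    """Cosine-annealed learning rate at epoch t (standard schedule;
    lr_init matches the stated initial rate of 0.1)."""
    return lr_min + 0.5 * (lr_init - lr_min) * (1.0 + math.cos(math.pi * t / t_max))

# Assumed 200-epoch horizon: rate decays smoothly from 0.1 to 0
schedule = [cosine_lr(t, t_max=200) for t in range(201)]
```

The schedule starts at the full rate, decays slowly at first, fastest at mid-training, and flattens out near zero at the end.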
Problem setup and LUR estimator
Let Y be an outcome variable and \(\textbf{X}\) be the p-dimensional vector of covariates. Let \(g(\textbf{X})\) denote a model trained on external data. In this work, we aim to evaluate the performance of the model \(g(\textbf{X})\) using a performance metric defined through a loss function,
$$\begin{aligned} \mathscr {M} \equiv \textbf{E}[\mathscr {L}\{g(\textbf{X}),Y\}]. \end{aligned}$$
(1)
Depending on the definition of \(\mathscr {L}(\cdot )\), the quantity \(\mathscr {M}\) can represent performance metrics such as the mean squared error, cross-entropy, and the Akaike information criterion. We assume that the test data for model evaluation include only observations of \(\textbf{X}\), \(\{\textbf{X}_i, i = 1,…,N\}\), where the \(\textbf{X}_i\)’s are independent and identically distributed; data on Y, however, are not available. Since it is infeasible to annotate all outcomes in the test data for large N, AT selects a subset of the test data for outcome labeling.
In the active testing algorithm, the subsample labeled in previous steps can be treated as sampled with probability 1 in subsequent sampling steps, since it is utilized for model validation along with the newly labeled subsample. Let \(\delta ^s_i\) indicate whether the ith data point is selected for labeling at the sth sampling step (\(\delta ^s_i = 1\): labeled; \(\delta ^s_i = 0\): unlabeled). Let \(\pi {\{g^s(\textbf{X}_i), \delta ^{s-1}_i\}}\equiv \delta ^{s-1}_i + (1-\delta ^{s-1}_i)P\{\delta ^s_i =1 | g^s(\textbf{X}_i) \}\) denote the sampling probability for the ith data point, where \(0 < \pi {\{g^s(\textbf{X}_i), \delta ^{s-1}_i\}} \le 1\). If the ith data point was selected in a previous step, \(\pi \{g^s(\textbf{X}_i), \delta ^{s-1}_i = 1\} = 1\), as it is automatically included in the labeled test data; we can therefore carry the data points labeled from the 1st to the \((s-1)\)th sampling into the labeled test data at the sth sampling step. If the ith data point is in the remaining unlabeled test data (i.e., \(\delta ^{s-1}_i = 0\)), we sample it with probability \(P\{\delta ^s_i =1 | g^s(\textbf{X}_i) \}\) under a Poisson sampling scheme. Let \(n^*_s\) be the size of the newly labeled data drawn at the sth sampling from the remaining unlabeled data. Then the cumulative labeled data size after the sth sampling is \(\sum ^N_{i=1}\delta ^s_i = \sum ^N_{i=1}\delta ^{s-1}_i + n^*_s\), so the labeled test data consist of the data labeled in previous sampling steps together with the data newly labeled at the sth step. Under the Poisson sampling scheme, the newly labeled data size \(n^*_s\) is random with \(\textbf{E}(n^*_s) = n_s\). The data available after the sth sampling can be represented as \(\{\delta ^s_i, \delta ^s_iY_i, \textbf{X}_i, i = 1,…,N\}\).
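One step of the Poisson sampling scheme described above can be sketched as follows; the uniform probabilities are purely illustrative stand-ins for \(P\{\delta^s_i = 1 | g^s(\textbf{X}_i)\}\):

```python
import numpy as np

def poisson_sampling_step(delta_prev, p_new, rng):
    """One AT sampling step: already-labeled points stay labeled (prob. 1);
    each remaining unlabeled point i is labeled independently with prob. p_new[i].
    Returns the updated indicators delta^s and probabilities pi{g^s(X_i), delta^{s-1}_i}."""
    draw = rng.random(delta_prev.shape[0]) < p_new
    delta_s = np.where(delta_prev == 1, 1, draw.astype(int))
    pi_s = np.where(delta_prev == 1, 1.0, p_new)
    return delta_s, pi_s

rng = np.random.default_rng(1)
N = 1000
delta0 = np.zeros(N, dtype=int)
# Assumed uniform per-step probabilities, for illustration only
delta1, pi1 = poisson_sampling_step(delta0, np.full(N, 0.05), rng)
delta2, pi2 = poisson_sampling_step(delta1, np.full(N, 0.05), rng)
```

Because each unlabeled point is an independent Bernoulli draw, the newly labeled size \(n^*_s\) is random with expectation \(\sum_i (1-\delta^{s-1}_i) p_i\), matching the Poisson sampling description above.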
In the AT setting, the distribution of the labeled data differs from that of the unlabeled data due to selective labeling. To remove this sampling bias, the LUR estimator9, an existing method, provides unbiased estimation of \(\mathscr {M}\) from the labeled data at the sth step,
$$\begin{aligned} \widehat{\mathscr {M}}^{\text {LUR}}_s = \frac{1}{sN}\sum ^{s}_{j=1}w_j\sum ^{N}_{i=1}\frac{\delta ^{j}_i\mathscr {L}\{g(\textbf{X}_i),Y_i\}}{\pi {\{g^j(\textbf{X}_i), \delta ^{j-1}_i\}}}, \end{aligned}$$
(2)
where the weight \(w_j = N(N-s)/\{(N-j)(N-j+1)\}\) is adjusted at each sampling step to remove the statistical bias for \(\mathscr {M}\). However, such estimators based on true inverse probability weighting (IPW) may be inefficient, since the estimation variability for \(\mathscr {M}\) can be inflated when data points are selected with small sampling probabilities13.
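A minimal sketch of the LUR estimator in Eq. (2), with a sanity check that it reduces to the plain average loss when every point is labeled with probability 1 at the first step:

```python
import numpy as np

def lur_estimate(deltas, pis, losses):
    """LUR estimator (Eq. 2). deltas, pis: (s, N) arrays of labeling indicators
    and sampling probabilities for steps 1..s; losses: (N,) array of
    L{g(X_i), Y_i} (only entries with delta = 1 contribute)."""
    s, N = deltas.shape
    j = np.arange(1, s + 1)
    w = N * (N - s) / ((N - j) * (N - j + 1.0))   # per-step debiasing weights w_j
    terms = (deltas * losses / pis).sum(axis=1)   # inner sums over i
    return (w * terms).sum() / (s * N)

# Sanity check: s = 1, every point labeled with probability 1
# => w_1 = 1 and the estimator is the plain mean loss.
losses = np.array([0.2, 0.4, 0.6, 0.8])
m_hat = lur_estimate(np.ones((1, 4)), np.ones((1, 4)), losses)
```

Storing the full \((s, N)\) arrays of indicators and weights is exactly the memory burden the AIIPW estimator below is designed to avoid.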
AILUR estimator
Recall that the sampling probability at the sth sampling step is \(\pi {\{g^s(\textbf{X}_i), \delta ^{s-1}_i\}} = \delta ^{s-1}_i + (1-\delta ^{s-1}_i)P\{\delta ^s_i =1 | g^s(\textbf{X}_i) \}\), so \(\pi {\{g^s(\textbf{X}_i), \delta ^{s-1}_i= 0\}} = P\{\delta ^s_i =1 | g^s(\textbf{X}_i) \}\) is the probability used at the sth sampling to acquire new labeled data from the remaining unlabeled test data. To build the AILUR estimator, we focus on estimating \(P\{\delta ^s_i =1 | g^s(\textbf{X}_i)\}\). With the notation introduced above, the new labeled data can be expressed as \(\{\delta ^{s}_i(1-\delta ^{s-1}_i)(\textbf{X}_i, Y_i), \delta ^{s}_i, i = 1,…,N\}\) and the remaining unlabeled test data as \(\{(1-\delta ^{s-1}_i)\textbf{X}_i, i = 1,…,N\}\). We consider a kernel smoothing estimator for \(P\{\delta ^s_i =1 | g^s(\textbf{X}_i)\}\),
$$\begin{aligned} \hat{P}\{\delta ^s_i =1 | g^s(\textbf{X}_i)=z\} =\frac{\sum ^{N}_{i=1}\delta ^s_i(1-\delta ^{s-1}_i) K_b\{g^s(\textbf{X}_i)-z\}}{\sum ^{N}_{i=1}(1-\delta ^{s-1}_i)K_b\{g^s(\textbf{X}_i)-z\}} \end{aligned}$$
where \(K_b\) is a kernel function with bandwidth b. Let \(\hat{\pi }{\{g^s(\textbf{X}_i), \delta ^{s-1}_i\}} = \delta ^{s-1}_i + (1-\delta ^{s-1}_i)\hat{P}\{\delta ^s_i =1 | g^s(\textbf{X}_i)\}\) be the estimator of \(\pi {\{g^s(\textbf{X}_i), \delta ^{s-1}_i\}}\). By replacing \(\pi (\cdot )\) in Eq. (2) with \(\hat{\pi }(\cdot )\), we propose the AILUR estimator of \({\mathscr {M}}\) at the sth step,
$$\begin{aligned} \widehat{\mathscr {M}}^{\text {AILUR}}_s = \frac{1}{sN}\sum ^{s}_{j=1}w_j\sum ^{N}_{i=1}\frac{\delta ^{j}_i\mathscr {L}\{g(\textbf{X}_i),Y_i\}}{\hat{\pi }{\{g^j(\textbf{X}_i), \delta ^{j-1}_i\}}}. \end{aligned}$$
(3)
By the properties of nonparametric kernel smoothing estimation of \(P\{\delta ^s_i =1 | g^s(\textbf{X}_i)\}\), the difference between \(\pi (\cdot )\) and \(\hat{\pi }(\cdot )\) can be made very small under some regularity assumptions. Hence, \(\widehat{\mathscr {M}}^{\text {AILUR}}\) is an asymptotically unbiased estimator of \(\mathscr {M}\).
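A sketch of the kernel smoothing estimator \(\hat{P}\{\delta^s_i = 1 | g^s(\textbf{X}_i)\}\) evaluated at each point's own score; the Gaussian kernel and bandwidth \(b = 0.1\) are illustrative choices, as the text does not fix them:

```python
import numpy as np

def gaussian_kernel(u, b):
    return np.exp(-0.5 * (u / b) ** 2) / (b * np.sqrt(2 * np.pi))

def p_hat(scores, delta_s, delta_prev, b=0.1):
    """Kernel-smoothed P(delta^s_i = 1 | g^s(X_i)): ratio of kernel-weighted
    counts of newly labeled points to previously unlabeled points."""
    K = gaussian_kernel(scores[:, None] - scores[None, :], b)   # K[z, i]
    num = (K * (delta_s * (1 - delta_prev))[None, :]).sum(axis=1)
    den = (K * (1 - delta_prev)[None, :]).sum(axis=1)
    return num / den

# Sanity check: identical scores, 3 of 10 unlabeled points newly labeled
# => the estimate is the empirical fraction 0.3 everywhere.
scores = np.full(10, 0.5)
delta_prev = np.zeros(10, dtype=int)
delta_s = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
p = p_hat(scores, delta_s, delta_prev)
```

When all scores coincide, the kernel weights are constant and the estimator collapses to the raw labeling fraction, which is the intended degenerate case.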
AIIPW estimator
One challenge in constructing the estimators in Eqs. (2) and (3) is that they require all true (or estimated) sampling weights used throughout the sampling steps, which increases memory costs. We propose the AIIPW estimator to alleviate this problem. To build the estimator, we first investigate the conditional expectation of the indicator given \(g(\textbf{X})\), \(\textbf{E}\{\delta ^s_i|g(\textbf{X})\}\). By the law of iterated expectations, we have
$$\begin{aligned} \textbf{E}\{\delta ^s_i|g(\textbf{X})\}&= {\left\{ \begin{array}{ll} P\{\delta ^1_i =1 | g(\textbf{X})\}, &{} \text {if s= 1},\\ \textbf{E}\{\delta ^{s-1}_i|g(\textbf{X})\} + P\{\delta ^s_i =1 | g(\textbf{X})\}\prod ^{s-1}_{j=1}[1-P\{\delta ^j_i =1 | g(\textbf{X})\}], &{} \text {if s > 1}. \end{array}\right. } \end{aligned}$$
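The recursion above is equivalent to accumulating, step by step, the probability of having been labeled by step s; a minimal sketch:

```python
def inclusion_expectation(step_probs):
    """E{delta^s | g(X)} from per-step conditional labeling probabilities
    p_1, ..., p_s via the recursion above."""
    e = 0.0
    survive = 1.0          # probability of remaining unlabeled so far
    for p in step_probs:
        e += p * survive   # newly labeled at this step, given unlabeled before
        survive *= (1.0 - p)
    return e

e2 = inclusion_expectation([0.3, 0.5])   # 0.3 + 0.5 * (1 - 0.3) = 0.65
```

Equivalently, \(\textbf{E}\{\delta^s_i|g(\textbf{X})\} = 1 - \prod_{j=1}^{s}[1 - P\{\delta^j_i = 1 | g(\textbf{X})\}]\), the complement of never being selected.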
Using the true weight, we can construct an IPW estimator of \({\mathscr {M}}\),
$$\begin{aligned} \widetilde{\mathscr {M}}^{\text {IPW}} = \frac{1}{N}\sum ^{N}_{i=1}\frac{\delta ^{s}_i\mathscr {L}\{g(\textbf{X}_i), Y_i\}}{\textbf{E}\{\delta ^s_i|g(\textbf{X})\}}. \end{aligned}$$
(4)
However, this estimator still depends on the historical sampling probabilities. To avoid this, we construct a kernel smoothing estimator to replace \(\textbf{E}\{\delta ^s_i|g(\textbf{X})\}\),
$$\begin{aligned} \hat{\textbf{E}}\{\delta ^s_i|g(\textbf{X}_i)=z\} =\frac{\sum ^{N}_{i=1}\delta ^s_i K_b\{g(\textbf{X}_i)-z\}}{\sum ^{N}_{i=1}K_b\{g(\textbf{X}_i)-z\}} \end{aligned}$$
(5)
By replacing \(\textbf{E}\{\delta ^s_i|g(\textbf{X})\}\) in Eq. (4) with \(\hat{\textbf{E}}\{\delta ^s_i|g(\textbf{X}_i)\}\) from Eq. (5), we propose the AIIPW estimator of \({\mathscr {M}}\) at the sth step,
$$\begin{aligned} \widehat{\mathscr {M}}^{\text {AIIPW}}_s = \frac{1}{N}\sum ^{N}_{i=1}\frac{\delta ^{s}_i\mathscr {L}\{g(\textbf{X}_i),Y_i\}}{\hat{\textbf{E}}\{\delta ^s_i|g(\textbf{X}_i)\}}. \end{aligned}$$
(6)
The AIIPW estimator of \({\mathscr {M}}\) is memory-efficient since the estimator at the sth step does not rely on all true (or estimated) sampling weights used from the first to the sth step, \(\pi {\{g^j(\textbf{X}_i), \delta ^{j-1}_i\}}\) (or \(\hat{\pi }{\{g^j(\textbf{X}_i), \delta ^{j-1}_i\}}\)) for \(j=1,…,s\).
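A minimal sketch of the AIIPW estimator in Eq. (6), using the kernel-smoothed weights of Eq. (5); the Gaussian kernel and bandwidth are again illustrative choices:

```python
import numpy as np

def aiipw_estimate(scores, delta_s, losses, b=0.1):
    """AIIPW estimator (Eq. 6): the kernel-smoothed E{delta^s | g(X)} (Eq. 5)
    replaces the historical sampling weights, so only the current
    indicators delta^s need to be stored."""
    K = np.exp(-0.5 * ((scores[:, None] - scores[None, :]) / b) ** 2)
    e_hat = (K * delta_s[None, :]).sum(axis=1) / K.sum(axis=1)
    return (delta_s * losses / e_hat).mean()

# Sanity check: with every point labeled, e_hat = 1 everywhere and the
# estimator reduces to the plain average loss.
losses = np.array([0.1, 0.3, 0.5, 0.7])
m = aiipw_estimate(np.array([0.2, 0.4, 0.6, 0.8]), np.ones(4), losses)
```

Note the contrast with the LUR computation: here only `delta_s` and the current scores are needed, not the per-step history of indicators and probabilities.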

Algorithm 1. Actively improved model evaluation algorithm.
Active model evaluation algorithm
Specifying the sampling probability in active testing is important for acquiring more informative labeled data and improving estimation efficiency. One approach for deriving sampling probabilities when the outcome is unlabeled is to use the expected true loss conditional on \(\textbf{X}\), \(P\{\delta ^s =1 | \textbf{X}\} \propto \textbf{E}[\mathscr {L}\{g(\textbf{X}),Y\}|\textbf{X}]\)7: the larger the expected loss, the more likely the corresponding data point is to be selected for labeling. For example, for binary classification models, we can consider the expected cross-entropy conditional on \(\textbf{X}\),
$$\begin{aligned} P(\delta ^s =1 | \textbf{X}) \propto -[P(Y=1|\textbf{X})\log {g(\textbf{X})} + \{1-P(Y=1|\textbf{X})\}\log {\{1-g(\textbf{X})\}}]. \end{aligned}$$
By replacing the true \(P(Y=1|\textbf{X})\) with \(g(\cdot )\), the sampling probability can be approximated. Since the performance of this sampling scheme depends on the prediction accuracy of the trained model, we consider model re-calibration with the labeled data \(\{\delta ^s_i, \delta ^s_iY_i, \textbf{X}_i, i = 1,…,N\}\) using the estimating equation
$$\begin{aligned} \sum ^{N}_{i=1}\frac{\delta ^s_ig(\textbf{X}_i)[Y_i- h\{\theta g(\textbf{X}_i)\}]}{\hat{\textbf{E}}\{\delta ^s_i|g(\textbf{X})\}} = 0, \end{aligned}$$
(7)
where \(h(\cdot )\) is an arbitrary function (e.g., \(h\{\theta g(\textbf{X}_i)\} = [1+\exp \{-\theta g(\textbf{X}_i)\}]^{-1}\) for a logistic regression and \(h\{\theta g(\textbf{X}_i)\} = \theta g(\textbf{X}_i)\) for a linear regression). We can also conduct re-calibration of multinomial logistic regressions in the form of Eq. (7) (see Appendix). Let \(\hat{\theta }_s\) be the solution of Eq. (7) and \(h\{\hat{\theta }_s g(\textbf{X}_i)\}\) be the re-calibrated model. Based on the re-calibrated model, we update the sampling probability. Algorithm 1 provides a summary of our general framework.
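Solving Eq. (7) for \(\theta\) with the logistic \(h\) can be sketched as follows; bisection is an illustrative solver choice, not prescribed by the text, and the simulated scores and outcomes are hypothetical:

```python
import numpy as np

def recalibrate_logistic(g, y, delta, e_hat, lo=-10.0, hi=10.0, iters=100):
    """Solve the estimating equation (7) for theta with logistic h by
    bisection (an illustrative root-finder; the score is monotone in theta)."""
    a = delta * g / e_hat   # IPW-weighted coefficients delta_i g(X_i) / E-hat_i

    def score(theta):
        return (a * (y - 1.0 / (1.0 + np.exp(-theta * g)))).sum()

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:   # score decreases in theta, so root is to the right
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical fully labeled data generated with theta = 2
rng = np.random.default_rng(0)
g = rng.uniform(0.1, 0.9, size=200)
y = (rng.random(200) < 1 / (1 + np.exp(-2 * g))).astype(float)
theta_hat = recalibrate_logistic(g, y, np.ones(200), np.ones(200))
```

Because the weighted score is strictly decreasing in \(\theta\), the root is unique and bisection converges reliably; with partial labeling, `delta` and `e_hat` would come from the current sampling step and Eq. (5).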
Extension to predictive accuracy metrics
We now focus on evaluating classification models using a general form of predictive accuracy metrics,
$$\begin{aligned} \mathscr {D} \equiv \frac{\textbf{E}[d_{1}\{g(\textbf{X}),Y\}]}{\textbf{E}[d_{2}\{g(\textbf{X}),Y\}]}. \end{aligned}$$
(8)
The quantity \(\mathscr {D}\) varies depending on the definitions of \(d_1\) and \(d_2\) and can represent metrics such as the true positive rate, false positive rate, positive predictive value, and \(\text {F}_1\) score (see Table 1).
Based on the labeled data \(\{\delta ^s_i, \delta ^s_iY_i, \textbf{X}_i, i = 1,…,N\}\) and the estimated weights from Eq. (5) after labeling at the sth sampling, we can construct the AIIPW estimator of the metric \({\mathscr {D}}\),
$$\begin{aligned} \widehat{\mathscr {D}}^{\text {AIIPW}}_s = \frac{\sum ^{N}_{i=1}\delta ^{s}_id_{1}\{g(\textbf{X}_{i}),Y_{i}\}/\hat{\textbf{E}}\{\delta ^s_i|g(\textbf{X}_i)\}}{\sum ^{N}_{i=1}\delta ^{s}_id_{2}\{g(\textbf{X}_{i}),Y_{i}\}/\hat{\textbf{E}}\{\delta ^s_i|g(\textbf{X}_i)\}}. \end{aligned}$$
(9)
For example, the AIIPW estimator for the true positive rate from Eq. (9) can be written as
$$\begin{aligned} \widehat{\text {TPR}}^{\text {AIIPW}}_s = \frac{\sum ^{N}_{i=1}\delta ^{s}_i\textbf{I}\{g(\textbf{X}_i) > c\}Y_{i}/\hat{\textbf{E}}\{\delta ^s_i|g(\textbf{X}_i)\}}{\sum ^{N}_{i=1}\delta ^{s}_iY_{i}/\hat{\textbf{E}}\{\delta ^s_i|g(\textbf{X}_i)\}}, \end{aligned}$$
where c is the risk cutoff.
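The ratio-form estimator in Eq. (9) and its TPR instance can be sketched as follows; in the fully labeled case the estimator reduces to the empirical TPR:

```python
import numpy as np

def d_hat(d1_vals, d2_vals, delta, e_hat):
    """Ratio-form AIIPW estimator (Eq. 9) for a metric D = E[d1] / E[d2]."""
    num = (delta * d1_vals / e_hat).sum()
    den = (delta * d2_vals / e_hat).sum()
    return num / den

def tpr_hat(scores, y, delta, e_hat, c=0.5):
    """AIIPW true positive rate at risk cutoff c: d1 = I{g(X_i) > c} Y_i, d2 = Y_i."""
    return d_hat((scores > c) * y, y, delta, e_hat)

# Fully labeled sanity check: 3 positives, scores 0.9, 0.8, 0.2
# => 2 of 3 exceed the cutoff, so TPR = 2/3.
scores = np.array([0.9, 0.8, 0.2, 0.6, 0.1])
y = np.array([1, 1, 1, 0, 0])
tpr = tpr_hat(scores, y, np.ones(5), np.ones(5))
```

Other entries of Table 1 plug in their own \(d_1, d_2\) pairs to the same `d_hat` ratio; only the numerator and denominator functions change.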
