Labeling of standard proteomic datasets
The numbers of PSMs identified at 1% FDR at the peptide level for DS-Schmidt, DS-Hultin, and DS-NCI-7 are 12,340 (from 10,479 peptides), 10,645 (from 3965 peptides), and 266,367 (from 165,152 peptides), respectively. Statistics of the labeling results on the three standard proteomic datasets are shown in Table 1. Although QUPs constitute a relatively minor fraction of the total PSMs, they exhibit substantially larger AREs, ranging from 0.360 to 0.416 across the three datasets. In contrast, the AREs of QRPs fall within a range of 0.100–0.122. This contrast supports our presumption that the QUP and QRP labels reflect the quantitation accuracy of PSMs. The ratios between the numbers of QUPs and QRPs for DS-Schmidt, DS-Hultin, and DS-NCI-7 are 1:14.3, 1:3.4, and 1:6.8, respectively. This indicates heterogeneous signal quality at the PSM level across the three standard proteomic datasets, which is reasonable given the variations in experimental protocols, sample preparation techniques, and mass spectrometer settings in the respective studies.
Prediction results of cross validation
The evaluation results of ML models across the three datasets (DS-Schmidt, DS-Hultin, and DS-NCI-7) are shown in Table 2. The top-performing models (GBC for DS-Schmidt, and CatBoost for DS-Hultin and DS-NCI-7) achieved MCCs of 0.661, 0.665, and 0.579 for the respective datasets, indicating accurate overall predictions. These models demonstrated high performance metrics, with AUCs ranging from 0.920 to 0.967 and accuracies between 0.886 and 0.960. The elevated AUC and accuracy can be attributed to the models’ capability in correctly identifying QRPs, i.e., a high true negative rate, which is common for highly unbalanced datasets. Precision values for the best models are 0.757 (DS-Schmidt), 0.790 (DS-Hultin), and 0.783 (DS-NCI-7), while recall values are 0.615, 0.685, and 0.492 for the three datasets, respectively. The lower MCC for DS-NCI-7 (compared to the other datasets) appears to stem from reduced model sensitivity, as evidenced by its recall of 0.492, which is 12.3 and 19.3 percentage points lower than the recall values of DS-Schmidt and DS-Hultin, respectively. This substantial difference in recall suggests that the models struggle more with identifying true positives (actual QUPs) in the DS-NCI-7 dataset. Since the main dataset of DS-NCI-7 is more than 20-fold larger than those of the other two datasets, its PSMs are likely more heterogeneous, which may hinder the generalization capability of ML models.
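To make the relationship between these metrics concrete, the following minimal sketch computes accuracy, precision, recall, and MCC from confusion-matrix counts, treating QUP as the positive class. The counts are hypothetical and are not taken from Table 2; they only illustrate how a high true negative rate on an unbalanced dataset can yield high accuracy while recall and MCC remain moderate.

```python
import math

def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall, and MCC from confusion-matrix counts.

    QUP is treated as the positive class and QRP as the negative class.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "mcc": mcc}

# Hypothetical unbalanced test set: 100 actual QUPs and 900 actual QRPs.
metrics = classification_metrics(tp=50, fp=15, fn=50, tn=885)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this toy case only half of the QUPs are recovered, yet accuracy exceeds 0.93 because most QRPs are classified correctly, which is why MCC and recall are the more informative metrics here.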
It is worth noting that boosting models, particularly GBC for DS-Schmidt and CatBoost for DS-Hultin and DS-NCI-7, consistently outperform statistical methods such as QDA and NB across all performance metrics. This superiority is attributed to boosting algorithms’ ability to sequentially combine multiple weak learners, effectively capturing non-linear relationships and complex feature interactions that traditional statistical methods assume to be linear or independent. Additionally, boosting models can automatically handle feature selection and importance weighting during training, whereas statistical methods like QDA and NB rely on strong distributional assumptions that may not hold for proteomics data. These findings underscore the superiority of advanced ML techniques in handling complex, large-scale datasets in this domain.
Prediction results of independent test
Evaluation results for independent tests are shown in Table 3. The top-performing models (Gradient Boosting Classifier (GBC) for DS-Schmidt, LightGBM for DS-Hultin, and CatBoost for DS-NCI-7) demonstrated robust predictive capabilities. These boosting models achieved accuracies ranging from 0.883 to 0.966, AUCs ranging from 0.924 to 0.963, and MCCs ranging from 0.596 to 0.691 on the three independent test sets. On the other hand, statistical approaches (NB and QDA) yielded accuracies ranging from 0.825 to 0.934, AUCs ranging from 0.783 to 0.907, and MCCs ranging from 0.424 to 0.522 on the three independent test sets. The improvement of roughly 0.17 in MCC (from 0.424–0.522 to 0.596–0.691) achieved by boosting models indicates substantially better discrimination between QUPs and QRPs. This translates to fewer false positives and false negatives propagating through downstream protein quantification and differential expression analyses. The performance gap demonstrates the clear advantage of ensemble methods over traditional statistical approaches for QUP identification tasks. Notably, the performance metrics obtained from independent tests are comparable to those from cross-validation, suggesting that the random partitioning of data into main and test datasets did not introduce significant bias. This consistency lends credibility to the generalizability of our findings.
We further analyzed the correlation between the prediction probability output by each model and the true positive rate (TPR), calculated as the number of actual QUPs divided by the total number of PSMs whose prediction probability falls within a given range. A well-calibrated model should exhibit a strong positive correlation between prediction probability and TPR. The results are shown in Fig. 1. In general, ML models such as GBC, LightGBM, and CatBoost demonstrate positive correlations between TPR and prediction probability. The curves for NB and QDA show significant fluctuations in Fig. 1A compared to other models, which likely explains their inferior prediction performance on DS-Schmidt. In addition, NB and QDA also have lower TPR for prediction probabilities exceeding 0.7 in Figs. 1B and C, indicating a higher propensity for false positives.

The PSM number and true positive rate with respect to different ranges of the prediction probability of PSMs for (A) DS-Schmidt, (B) DS-Hultin, and (C) DS-NCI-7 based on independent tests.
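The per-range PSM count and TPR described above can be sketched as follows. This is a minimal illustration under stated assumptions: the bin width, the function name, and the toy probabilities and labels are hypothetical, and the binning used to produce Fig. 1 may differ.

```python
def binned_true_positive_rate(probabilities, labels, n_bins=10):
    """Group PSMs by predicted QUP probability and report, for each bin,
    the PSM count and the fraction of PSMs that are actual QUPs (label 1)."""
    counts = [0] * n_bins
    positives = [0] * n_bins
    for p, y in zip(probabilities, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        counts[idx] += 1
        positives[idx] += y
    rows = []
    for i in range(n_bins):
        rate = positives[i] / counts[i] if counts[i] else None  # None: empty bin
        rows.append((i / n_bins, (i + 1) / n_bins, counts[i], rate))
    return rows

# Toy predictions (hypothetical): probability of being a QUP and the true label.
probs = [0.05, 0.15, 0.15, 0.55, 0.65, 0.85, 0.95, 0.95]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
table = binned_true_positive_rate(probs, labels, n_bins=5)
for low, high, count, rate in table:
    print(f"[{low:.1f}, {high:.1f}): n={count}, TPR={rate}")
```

For a well-behaved model, the reported rate should rise from the lowest to the highest probability bin, as it does in this toy example.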
ARE distributions of predicted QUPs and QRPs
It was hypothesized that PSMs predicted as QUPs and QRPs should exhibit significantly different AREs if the models were properly trained. To validate the hypothesis, we analyzed the ARE distributions of PSMs predicted as QUPs and QRPs from the independent tests across the three datasets. The results are illustrated in Fig. 2. It can be seen that QUPs and QRPs have clearly separated distributions, and the majority of the predicted QUPs show relatively larger quantitation errors (ARE > 0.2). The presence of QUPs can introduce errors into the downstream quantitation process, such as peptide-level quantitation. By correctly identifying and removing QUPs, we can enhance the accuracy of the subsequent quantitative analysis. For DS-Hultin, the number of predicted QUPs with ARE > 0.25 is much larger than in the other two datasets, reflecting its inferior signal quality at the PSM level and its smaller QRP-to-QUP ratio (3.4) as opposed to the higher ratios for DS-Schmidt (14.3) and DS-NCI-7 (6.8).

Distributions of AREs for PSMs predicted as QUP (Pred_QUP) and QRP (Pred_QRP) for (A) DS-Schmidt, (B) DS-Hultin, and (C) DS-NCI-7 based on independent tests. The prediction results of the best-performing models on the three test sets are used.
Feature importance
The t-SNE visualizations in Supplementary Fig. S2 illustrate the clear separation between QUPs and QRPs in all three datasets, suggesting that the encoded features are highly discriminative. This separation enables ML models to effectively capture the underlying differences between the two classes based on the extracted features. To investigate the contribution of respective features, we obtained the permutation importance from the top-performing models of the three datasets, i.e., GBC, LightGBM, and CatBoost for DS-Schmidt, DS-Hultin, and DS-NCI-7, respectively, as illustrated in Fig. 3. The permutation feature importance, obtained via the scikit-learn package, is defined as the average decrease in the model score (for classifiers, mean accuracy unless a scoring argument is supplied) when the values of a single feature are randomly shuffled, averaged over 30 repeats.

Permutation importance of features for GBC trained on DS-Schmidt, LightGBM trained on DS-Hultin, and CatBoost trained on DS-NCI-7. The three models yield the highest MCC on the test sets.
It can be seen from Fig. 3 that the feature importance profiles of the three datasets are highly similar. For example, distance-based features such as devManD, avgManD, and devCosD exhibit higher permutation importance. This is reasonable because larger values of distance-based features indicate that a PSM has ratios quite distinct from those of other PSMs, which often occurs when the PSM contains noisy reporter ion signals or is of lower signal quality. Charge state, PTM number, and PTM ratio generally show much lower importance for the three datasets. Despite the similarities, a few features exhibit varying importance across datasets. For example, the F-value shows much higher feature importance for DS-Hultin than for the other two datasets, and average reporter ion intensity (ARE Intensity) demonstrates higher feature importance for DS-Schmidt and DS-Hultin than for DS-NCI-7. The commonalities in permutation importance indicate consistency in the underlying patterns across different datasets.
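The idea behind permutation importance can be illustrated with a small, dependency-free sketch. This is not the scikit-learn implementation used in this study; the scoring function (accuracy), the toy threshold model, and the two-feature toy data are assumptions made purely for illustration.

```python
import random

def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def permutation_importance(predict, X, y, n_repeats=30, seed=0):
    """Mean drop in accuracy when one feature column is shuffled.

    predict: callable mapping a list of feature rows to predicted labels
    X: list of feature rows (lists of floats); y: list of true labels
    """
    rng = random.Random(seed)
    baseline = accuracy(y, predict(X))
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the label
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(y, predict(X_perm)))
        importances.append(sum(drops) / n_repeats)
    return importances

# Toy "model" (hypothetical): calls a PSM a QUP when the first feature
# exceeds 0.5; the second feature is ignored by construction.
def toy_predict(rows):
    return [1 if row[0] > 0.5 else 0 for row in rows]

X = [[0.9, 0.1], [0.8, 0.7], [0.7, 0.3], [0.2, 0.9], [0.1, 0.4], [0.3, 0.6]]
y = [1, 1, 1, 0, 0, 0]
importances = permutation_importance(toy_predict, X, y)
print(importances)
```

Shuffling the feature the toy model relies on lowers its accuracy, whereas shuffling the ignored feature leaves the score unchanged, mirroring the contrast between distance-based features and features such as charge state in Fig. 3.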
We implemented naïve algorithms utilizing a single feature for QUP removal to benchmark against the proposed ML-based models. Based on feature importance analysis, we selected four key features: devManD, avgManD, devCosD, and devPCC. For each selected feature, PSMs in the test sets were ranked in descending order according to their feature values, and the algorithm predicted predetermined percentages of QUPs (top 1%, 3%, and 5%) from the ranked list. The evaluation results for naïve algorithms across different features and percentages are presented in Supplementary Table S1. The best MCCs achieved among the naïve algorithms are 0.534, 0.343, and 0.456 for DS-Schmidt, DS-Hultin, and DS-NCI-7, respectively. These performance metrics are substantially lower than those achieved by most ML models (Table 3), suggesting that ML approaches leveraging multiple features can identify intricate patterns and interdependencies among variables that single-feature thresholds cannot capture. Furthermore, naïve algorithms require predefined percentages or numbers of PSMs for removal, which are difficult to optimize for individual datasets.
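The single-feature baseline described above can be sketched as follows. The function names, the toy feature values, and the tie-handling and rounding choices (flooring the cutoff and keeping at least one PSM) are assumptions for illustration; the source does not specify them.

```python
import math

def top_percent_predictions(feature_values, percent):
    """Label the top `percent`% of PSMs (largest feature values) as QUPs (1)."""
    n = len(feature_values)
    k = max(1, int(n * percent / 100))  # assumed rounding: floor, at least one PSM
    ranked = sorted(range(n), key=lambda i: feature_values[i], reverse=True)
    predicted = [0] * n
    for i in ranked[:k]:
        predicted[i] = 1
    return predicted

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (1 = QUP)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Toy values of one distance-based feature (hypothetical) and true labels.
dev_man_d = [0.05, 0.9, 0.1, 0.2, 0.8, 0.15, 0.12, 0.3, 0.07, 0.11]
true_labels = [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]
predicted = top_percent_predictions(dev_man_d, percent=20)
score = mcc(true_labels, predicted)
print(predicted, round(score, 3))
```

The sketch also makes the limitation visible: the cutoff percentage fixes the number of predicted QUPs in advance, so a QUP with a moderate feature value is missed regardless of how the percentage is tuned.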
Construction of a generalized predictor
Previous prediction results were based on the main and test datasets derived from a single proteomic experiment. The three proteomic datasets considered in this study are based on different sample preparation protocols and MS instruments, and are from different laboratories. These discrepancies can result in distinct spectrum characteristics and signal patterns across datasets. Thus, it would be insightful to build generalized ML models based on a combination of the three proteomic datasets, and investigate whether these models can learn the generalized patterns of QUPs within the feature space across datasets. Unfortunately, the sizes of the three main datasets vary drastically. For example, the number of QUPs in DS-Schmidt is 655 (the smallest among the three), while that in DS-NCI-7 is 27,471. To build a blended dataset in which the three proteomic datasets contribute equally, we included the minimum number of QUPs and QRPs among the three main datasets. To be specific, 655 QUPs and 6598 QRPs from each of the main datasets were included, resulting in a blended dataset of 21,759 spectra (in total 1965 QUPs and 19,794 QRPs). This also means that approximately 65.8% and 97.6% of the QUPs from the main datasets of DS-Hultin and DS-NCI-7, respectively, were not included in the blended training set. Similarly, 28.4% and 96.4% of the QRPs from the main datasets of DS-Schmidt and DS-NCI-7, respectively, were not included.
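The equal-contribution blending can be sketched as below. This is a minimal illustration: the source does not state how the PSMs were chosen from the larger datasets, so uniform random sampling without replacement is an assumption, and the dataset names and sizes in the toy example are hypothetical.

```python
import random

def build_blended_dataset(datasets, seed=0):
    """Take the same number of QUPs and QRPs from every dataset.

    datasets: dict mapping a dataset name to {"QUP": [...], "QRP": [...]}
    The per-class sample size is the smallest class size across datasets.
    Random sampling is an assumption, not the documented procedure.
    """
    rng = random.Random(seed)
    n_qup = min(len(d["QUP"]) for d in datasets.values())
    n_qrp = min(len(d["QRP"]) for d in datasets.values())
    blended = []
    for name, d in datasets.items():
        for label, n in (("QUP", n_qup), ("QRP", n_qrp)):
            for psm in rng.sample(d[label], n):
                blended.append((name, label, psm))
    return blended

# Toy datasets with very different sizes (PSMs are represented by integers).
toy = {
    "A": {"QUP": list(range(2)), "QRP": list(range(10))},
    "B": {"QUP": list(range(5)), "QRP": list(range(4))},
    "C": {"QUP": list(range(50)), "QRP": list(range(300))},
}
blended = build_blended_dataset(toy)
n_qups = sum(1 for _, label, _ in blended if label == "QUP")
print(len(blended), n_qups)
```

With the counts reported above, the same rule gives 3 × 655 = 1965 QUPs and 3 × 6598 = 19,794 QRPs, i.e., 21,759 spectra in total.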
The six ML models were trained on the blended dataset and subsequently evaluated on the test sets of DS-Schmidt, DS-Hultin, and DS-NCI-7. Implementation details such as feature encoding, normalization, and hyperparameter optimization were identical to the previous experiments. The evaluation results are shown in Table 4. It can be seen that the models trained with the blended dataset yield decreased performance compared to the original models trained on individual datasets. Compared with the best-performing models from Table 3, the best-performing models trained with the blended dataset exhibit absolute decreases of 0.043 in AUC and 0.110 in MCC on the DS-Schmidt test set, 0.037 in AUC and 0.059 in MCC on the DS-Hultin test set, and 0.084 in AUC and 0.119 in MCC on the DS-NCI-7 test set. Incorporating dataset-specific features in the future may improve model generalizability. That being said, the best models trained with the blended dataset achieve accuracies ranging from 0.874 to 0.949, AUCs from 0.840 to 0.920, and MCCs from 0.477 to 0.581 for the three datasets, indicating decent generalizability and prediction accuracy. It is worth noting that the blended dataset consists of only 655 (or 2.4%) out of 27,471 QUPs and 6598 (or 3.55%) out of 185,622 QRPs from the DS-NCI-7 main dataset. Despite the limited training data, CatBoost achieves an accuracy of 0.886, an AUC of 0.840, and an MCC of 0.477 on the DS-NCI-7 test set. This outcome suggests that QUPs share a certain level of similarity in the feature space across proteomic datasets of different samples, mass spectrometers, and experimental protocols. Moreover, the analysis demonstrates that advanced ML models such as CatBoost and GBC can generate effective predictions with the blended dataset.
Prediction of QUPs from quantitation results of MaxQuant
To assess the impact of QUP prediction on the quantitation of existing proteomic software tools, we used MaxQuant [37] to process the three datasets and evaluated the quantitation results of the three independent test sets. Due to different selections of database search engines (Andromeda for MaxQuant; Comet and X!Tandem for our analyses), MaxQuant yielded lower identification coverage compared to our results. To ensure comparability, spectra from the test datasets were filtered to include only those identified by MaxQuant that were not associated with reversed sequences (indicated by “+” in the “Identified” column and empty entries in the “Reverse” column from msmsScans.txt). This filtering process resulted in 1945, 1809, and 26,993 spectra for the test sets of DS-Schmidt, DS-Hultin, and DS-NCI-7, respectively. The predicted QUPs and QRPs were obtained with the top-performing models on the three datasets, i.e., GBC, LightGBM, and CatBoost for DS-Schmidt, DS-Hultin, and DS-NCI-7, respectively.
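The filtering rule can be sketched as follows, assuming msmsScans.txt is read as a tab-delimited table. Only the “Identified” and “Reverse” columns are taken from the description above; the remaining columns and all rows in the toy table are illustrative, and a real file contains many more columns.

```python
import csv
import io

def filter_identified_forward(msms_scans_file):
    """Keep scans marked "+" in "Identified" with an empty "Reverse" entry."""
    reader = csv.DictReader(msms_scans_file, delimiter="\t")
    kept = []
    for row in reader:
        identified = (row.get("Identified") or "").strip()
        reverse = (row.get("Reverse") or "").strip()
        if identified == "+" and not reverse:
            kept.append(row)
    return kept

# Toy stand-in for msmsScans.txt (hypothetical rows).
toy_text = (
    "Scan number\tIdentified\tReverse\tScore\n"
    "101\t+\t\t85.2\n"    # identified, forward hit: kept
    "102\t-\t\t0\n"       # not identified: dropped
    "103\t+\t+\t40.1\n"   # reversed-sequence hit: dropped
    "104\t+\t\t120.7\n"   # identified, forward hit: kept
)
kept_rows = filter_identified_forward(io.StringIO(toy_text))
kept_scans = [row["Scan number"] for row in kept_rows]
print(kept_scans)
```

For a file on disk, the same function could be called on a handle opened with `open(path, newline="")`.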
The evaluation results are shown in Supplementary Table S2. It can be seen that the predicted QRPs have AREs between 0.099 and 0.124, while the predicted QUPs show significantly higher AREs, between 0.339 and 0.389. Supplementary Fig. S3 also demonstrates that AREs for the predicted QUPs and QRPs exhibit great differences across the three datasets, consistent with the ARE distributions of QUPs and QRPs in Fig. 2. The above results suggest that the removal of QUPs from quantitation has great potential in improving the quantitation accuracy of MaxQuant. Notably, the average identification scores (reported in the “Score” column of msmsScans.txt) of QUPs and QRPs exhibit minor differences for DS-Hultin and DS-NCI-7, indicating that traditional metrics to assess identification confidence do not directly translate to quantification accuracy.
IQUP-assisted peptide quantitation
Traditional isobaric-labeling quantitation pipelines consider all PSMs satisfying the 1% FDR criterion for peptide-level quantitation. With IQUP, PSMs predicted to have high quantitation error (predicted QUPs) can be excluded from peptide-level quantitation. We employed MedianPsmRatio, a common peptide-level quantitation algorithm that assigns each peptide ratio as the median of its PSM ratios. Only peptides with complete PSM assignments in the test set were considered. Moreover, a total of 1, 11, and 59 peptides were excluded from DS-Schmidt, DS-Hultin, and DS-NCI-7, respectively, because all their associated PSMs were predicted as QUPs. As a result, we analyzed 67, 86, and 2047 peptides from the test sets of DS-Schmidt, DS-Hultin, and DS-NCI-7, respectively.
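A minimal sketch of median-based peptide quantitation with optional removal of predicted QUPs is given below. It simplifies each PSM to a single ratio, whereas real isobaric-labeling data carry one ratio per reporter channel; the function name, peptide names, and values are hypothetical.

```python
from collections import defaultdict
from statistics import median

def median_psm_ratio(psms, exclude_predicted_qups=False):
    """Assign each peptide the median of its PSM ratios.

    psms: iterable of (peptide, ratio, predicted_label) tuples, where
    predicted_label is "QUP" or "QRP". Peptides whose PSMs are all
    excluded do not appear in the result.
    """
    ratios = defaultdict(list)
    for peptide, ratio, label in psms:
        if exclude_predicted_qups and label == "QUP":
            continue
        ratios[peptide].append(ratio)
    return {peptide: median(values) for peptide, values in ratios.items()}

# Toy PSMs (hypothetical): PEPTIDEA has one reliable and one noisy PSM,
# and PEPTIDEB has only a PSM predicted as a QUP.
toy_psms = [
    ("PEPTIDEA", 0.98, "QRP"),
    ("PEPTIDEA", 1.90, "QUP"),
    ("PEPTIDEB", 2.00, "QUP"),
]
all_psms = median_psm_ratio(toy_psms)
qrp_only = median_psm_ratio(toy_psms, exclude_predicted_qups=True)
print(all_psms, qrp_only)
```

In the toy example, removing the predicted QUP shifts the ratio of PEPTIDEA from the midpoint of the two PSMs to the reliable PSM alone, while PEPTIDEB drops out entirely, which corresponds to the peptides excluded above because all their PSMs were predicted as QUPs.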
The evaluations of peptide-level quantitation are illustrated in Fig. 4. It can be seen that quantitation using only QRPs results in a significant decrease in the number of peptides with larger AREs. Specifically, the numbers of peptides with ARE > 0.2 based on quantitation using all PSMs are 6, 13, and 144 for DS-Schmidt, DS-Hultin, and DS-NCI-7, respectively, whereas those based on quantitation using only QRPs are 1 (83.3% decrease), 5 (61.5% decrease), and 122 (15.3% decrease) for the three datasets. Quantitation using only QRPs also leads to an increase in the number of peptides with smaller AREs. To be specific, the numbers of peptides with ARE < 0.1 based on quantitation using all PSMs are 41, 47, and 944 for DS-Schmidt, DS-Hultin, and DS-NCI-7, respectively, whereas those based on quantitation using only QRPs are 45 (9.8% increase), 59 (25.5% increase), and 973 (3.1% increase) for the three datasets.

Peptide numbers for different ranges of peptide AREs for (A) DS-Schmidt, (B) DS-Hultin, and (C) DS-NCI-7 with quantitation using all PSMs and using only predicted QRPs.
A single peptide within the DS-Schmidt test set has two associated PSMs, both of which are classified as QUPs. The two PSMs have AREs of 0.64 and 0.22, respectively, indicating relatively large quantitation errors. There are in total 11 and 59 such peptides in DS-Hultin and DS-NCI-7 test sets, respectively, and the distributions of their peptide AREs are shown in Supplementary Fig. S4. It can be seen that 9 out of 11 peptides and 48 out of 59 peptides have their AREs greater than 0.2 in DS-Hultin (Fig. S4A) and DS-NCI-7 (Fig. S4C), respectively. This suggests that the peptides with all their PSMs predicted as QUPs are mostly peptides of larger quantitation errors. Such peptides from DS-Hultin and DS-NCI-7 test sets are associated with 22 and 122 PSMs, respectively. Similarly, 16 out of 22 PSMs and 111 out of 122 PSMs from DS-Hultin (Fig. S4B) and DS-NCI-7 (Fig. S4D) test sets, respectively, have their PSM AREs greater than 0.2. The results demonstrate that IQUP can be used not only to predict QUPs but also to identify peptides with larger quantitation errors.
It is worth noting that there are several alternative algorithms to calculate peptide ratios from PSMs [11]. The optimized usage of IQUP in consideration of various peptide ratio calculation algorithms requires thorough and systematic analyses, which are beyond the scope of this study. Nevertheless, the current experimental results highlight the great potential of machine learning models in enhancing peptide-level quantitation.
