Datasets
The statistics of the datasets used for training and validation for different tasks are shown in Supplementary Table 4. The following is a detailed dataset description.
MHC-II antigen presentation data
To train the antigen presentation model, we used the large-scale antigen presentation data collected in refs. 17,18,30,62, comprising three data types: BA, SA and MA EL datasets (Fig. 1b and Supplementary Fig. 9a,b). All data were filtered to remove possible contaminants and MHC class I-restricted peptides, retaining peptides of 12–21 amino acids (AAs) in length26. The EL datasets were then enriched with negative samples by uniformly sampling five times as many random natural peptides of 12–21 AAs. The datasets were divided into five subsets for cross-validation using the common-motif method, ensuring that peptides sharing a subsequence of nine or more AAs were grouped into the same subset63. The final SA dataset contains 246,590 positive and 2,448,316 negative samples, whereas the MA dataset includes 432,255 positive and 4,467,755 negative samples, covering 142 MHC-II molecules. Additionally, the BA dataset comprises 129,110 data points across 80 class II molecules.
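The negative-sample enrichment described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the published pipeline: `sample_negatives` and the toy background proteome are hypothetical.

```python
import random

def sample_negatives(proteome, n_positives, ratio=5, min_len=12, max_len=21, seed=0):
    """Uniformly sample random natural peptides of 12-21 AAs as negatives,
    at `ratio` times the number of positive samples."""
    rng = random.Random(seed)
    negatives = []
    for _ in range(ratio * n_positives):
        length = rng.randint(min_len, max_len)          # uniform over 12-21 AAs
        start = rng.randint(0, len(proteome) - length)  # uniform start position
        negatives.append(proteome[start:start + length])
    return negatives

# toy background "proteome" of random amino acids, for illustration only
toy_proteome = "".join(random.Random(1).choices("ACDEFGHIKLMNPQRSTVWY", k=500))
negs = sample_negatives(toy_proteome, n_positives=10)
```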
CD4+ epitope benchmark
The CD4+ epitope benchmark18, compiled by Nilsson et al. in 2023, was assembled following a specific protocol. Initially, positive CD4+ T cell epitopes ranging from 12 AAs to 21 AAs, without post-translational modifications and with complete four-digit MHC-II typing, were selected from the Immune Epitope Database (IEDB; https://www.iedb.org/)64. Only epitopes associated with well-documented source proteins were considered. Subsequently, the corresponding negative samples were generated based on the source protein sequences retrieved from the UniProt database (https://www.uniprot.org/)65. Each {epitope, allele, protein} triplet was then segregated into a distinct test subset. Within each subset, using a sliding window of the same length as the epitope, overlapping peptides were generated from the source protein sequence and designated as negative samples, excluding the epitope itself. Furthermore, it was ensured that none of the samples in the test set had previously appeared in the MHC-II antigen presentation training data. Ultimately, the test set comprised 842 {epitope, allele, protein} triplets, encompassing 40 HLA-DR, 13 HLA-DQ and 4 HLA-DP molecules.
Immunogenicity data
We curated immunogenicity assay data from IEDB64 and integrated it with the MHCBN66 dataset, following the methodology described in DeepNeo37,38. This dataset contains records up to 14 May 2024. Specifically, we selected the data of T cell reactivity based on IFN-γ secretion. Furthermore, we refined the dataset to include only entries with full MHC-II restriction and peptide lengths ranging from 12 AAs to 21 AAs. Given the variable nature of pMHC-II immunogenicity experiments, we followed the method in ref. 37 to classify pMHC-II with contradictory results as binding pairs. Moreover, we identified proteins with sequence similarity below 0.5 in the RCSB Protein Data Bank67 and generated ten times as many negative samples by randomly extracting peptides of the same length as the positive samples from these proteins. This strategy aligns with the approaches used in IEPAPI68 and MHCflurry 2.0 (ref. 69). Subsequently, we randomly divided the data into training/validation and test sets at an 8:2 ratio. Consequently, the training/validation set comprised 71,584 data points, and the test set included 17,897 data points for our immunogenicity analysis.
Simulated MA dataset
Due to the absence of precise labels in the MA data, we constructed a simulated MA dataset from the SA dataset, which had been divided into fivefold cross-validation subsets, to evaluate the capability of the MIL module in detecting positive pMHC-II samples within bags. The process was as follows: we selected four of the five folds as the training set. These data were then randomly shuffled and organized into bags, each containing ten samples. Subsequently, we randomly sampled negative instances to form additional negative bags, achieving a 1:3 ratio of positive to negative bags.
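The bag-construction procedure can be sketched as below, assuming labelled SA instances are available as `(feature, label)` pairs; the helper name and toy data are illustrative, not part of the released code.

```python
import random

def build_simulated_bags(instances, negatives, bag_size=10, neg_ratio=3, seed=0):
    """Shuffle SA instances into bags of `bag_size`; a bag is positive if it
    holds at least one positive instance.  Extra all-negative bags are then
    sampled so that positive:negative bags reach ~1:`neg_ratio`."""
    rng = random.Random(seed)
    pool = instances[:]
    rng.shuffle(pool)
    bags = [pool[i:i + bag_size]
            for i in range(0, len(pool) - bag_size + 1, bag_size)]
    labelled = [(b, int(any(lab for _, lab in b))) for b in bags]
    n_pos = sum(lab for _, lab in labelled)
    n_extra = neg_ratio * n_pos - (len(labelled) - n_pos)
    for _ in range(max(0, n_extra)):
        labelled.append(([(x, 0) for x in rng.sample(negatives, bag_size)], 0))
    rng.shuffle(labelled)
    return labelled

# 20 positive and 80 negative toy instances; negatives pool for extra bags
instances = [(i, 1) for i in range(20)] + [(i, 0) for i in range(20, 100)]
out = build_simulated_bags(instances, list(range(1000, 2000)))
```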
Melanoma neoantigen data
The melanoma neoantigen data were obtained from ref. 51, in which 13 HLA class II-restricted neoantigens were identified and functionally characterized in two melanoma patients (Pt-C and Pt-D). The corresponding epitope information is available in the IEDB (reference IRI: http://www.iedb.org/reference/1042469). We included only neoantigens with reported TCR reactivity and quantitative avidity measurements. Each selected neoantigen was directly compared with its wild-type counterpart, enabling reliable immunogenicity assessment based on EC50 values.
SARS-CoV-2 immunogenic epitope data
The SARS-CoV-2 immunogenic epitope data were curated from the IEDB database64 (accessed 2 April 2025) and relevant primary literature (Supplementary Tables 4 and 5). Data from IEDB were retrieved using the following query parameters: disease set to COVID-19, full MHC class II restriction, source limited to peer-reviewed journal articles and T cell reactivity measured by IFN-γ secretion. Additional epitopes were manually extracted from selected primary publications. The resulting dataset comprises immunogenic epitopes derived from SARS-CoV-2 structural proteins (S, E, M and N) and the non-structural protein nsp12. All epitopes were clustered using MMseqs2 (ref. 70) with a sequence identity threshold of 0.5. The resulting clusters were then split into training and test sets at a 6:4 ratio, yielding 6,181 and 3,763 samples, respectively. To improve model generalizability, the training set was augmented with a general immunogenicity dataset containing non-SARS-CoV-2-derived epitopes, resulting in a total of 95,237 training samples.
ImmuScope architecture
MA and SA data representation
In this study, our model processed two predominant forms of mass spectrometry immunopeptidomics data: MA and SA data. Following the paradigm of MIL, we treated each MA sample as a ‘bag’ containing multiple instances, specifically pMHC-II pairs (Supplementary Fig. 9c). A positive bag suggests that the peptides are presented by at least one of the MHC molecules expressed in that sample. Conversely, a negative bag indicates that all pMHC-II pairs are negative instances. Similarly, for SA data, we defined each pMHC-II sample as either a positive bag with a single positive instance or a negative bag with a single negative instance. This consistent representation of MA and SA data enabled our framework to simultaneously learn from both data types and make predictions, thereby facilitating its application across diverse immunopeptidomics datasets.
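This unified representation can be sketched with a simple data structure; `Bag`, `sa_bag` and `ma_bag` are illustrative names under the assumptions above, not the released code.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Bag:
    """A MIL bag of pMHC-II instances; label 1 if at least one is presented."""
    instances: List[Tuple[str, str]]  # (peptide, MHC-II allele) pairs
    label: int

def sa_bag(peptide, allele, label):
    # an SA sample is a single-instance bag
    return Bag([(peptide, allele)], label)

def ma_bag(peptide, alleles, label):
    # an MA sample pairs one eluted peptide with every allele in the sample
    return Bag([(peptide, a) for a in alleles], label)

ma = ma_bag("GELIGILNAAKVPAD", ["DRB1*01:01", "DRB1*03:01"], 1)
sa = sa_bag("GELIGILNAAKVPAD", "DRB1*01:01", 0)
```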
Attention-based MIL aggregator
In branch a of Fig. 1a, we employed an attention-based MIL pooling mechanism71,72 to aggregate instance features within each bag. This mechanism not only enhances interpretability for predicting bag labels but also enables the identification and prioritization of the instances most critical for the final prediction. Let Z = ℇ(X; θ) denote the embeddings of pMHC-II instances obtained from the ImmuScope backbone ℇ, parameterized by θ, and let zk denote the kth instance in the bag Z = {z1, …, zK}. We implemented the following gated attention aggregator:
$${{{\rm{att}}}}_{k}=\frac{\exp \{{{w}}^{{\mathsf{T}}}(\tanh ({V}{{z}}_{k}^{{\mathsf{T}}})\odot {{\rm{sigm}}}({U}{{z}}_{k}^{{\mathsf{T}}}))\}}{{\sum }_{j=1}^{K}\exp \{{{w}}^{{\mathsf{T}}}(\tanh ({V}{{z}}_{j}^{{\mathsf{T}}})\odot {{\rm{sigm}}}({U}{{z}}_{j}^{{\mathsf{T}}}))\}},$$
where w, V and U denote the model parameters and ⊙ represents an element-wise multiplication. The function tanh(·) refers to the hyperbolic tangent activation function and sigm(·) denotes the sigmoid nonlinearity.
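The gated attention aggregator can be sketched in NumPy as follows — a minimal re-implementation of the equation above, not the trained module; the dimensions and random parameter values are illustrative.

```python
import numpy as np

def gated_attention(Z, w, V, U):
    """Gated attention weights over a bag of K instance embeddings.

    Z: (K, d) embeddings; V, U: (h, d) projections; w: (h,) scoring vector.
    Returns a length-K weight vector that sums to 1.
    """
    gate = 1.0 / (1.0 + np.exp(-(U @ Z.T)))      # sigm(U z_k^T), shape (h, K)
    hidden = np.tanh(V @ Z.T) * gate             # element-wise gating
    scores = w @ hidden                          # one score per instance, (K,)
    scores -= scores.max()                       # numerical stability
    att = np.exp(scores)
    return att / att.sum()                       # softmax over the bag

rng = np.random.default_rng(0)
K, d, h = 5, 8, 4
att = gated_attention(rng.normal(size=(K, d)), rng.normal(size=h),
                      rng.normal(size=(h, d)), rng.normal(size=(h, d)))
```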
High-confidence positive pseudo-labels selection module
We introduced high-confidence positive pseudo-labels to improve the accuracy of antigen presentation prediction. The number of positive samples in MA data is approximately twice that in SA data, and allele coverage is 2.2 times larger. This difference is particularly evident at the HLA-DP and HLA-DQ loci, where MA data substantially supplement coverage gaps in SA data. Notably, these weakly annotated positive MA samples contain multiple pMHC-II pairs, of which at least one exhibits a positive signal. Such characteristics pose challenges for directly incorporating MA data into model training. To address this, we developed a high-confidence positive pseudo-label selection module, which self-iteratively incorporates pseudo-labels from MA data to refine our predictive model (Fig. 1c).
High-confidence sample selection is performed using the trained backbone of ImmuScope, with the training process detailed in the ‘Antigen presentation prediction’ section. MA data are first input into the ImmuScope backbone, which incorporates Monte Carlo dropout73 to assess variability and enhance reliability. An attention-based MIL aggregation module is then used to estimate the uncertainty distribution of the MA samples, enabling the identification of high-confidence positive samples. Specifically, we iteratively select high-confidence samples by controlling the confidence ratio (Top R%) based on the attention scores. Samples already showing high confidence within the antigen presentation prediction branch are excluded. The selected samples are then integrated into the SA data for model fine tuning. Throughout this iterative process, we progressively adjust confidence thresholds to incorporate a broader range of positive MA samples, thereby improving model generalization. The optimal ratio of positive pseudo-labelled samples is determined based on validation performance.
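The selection step can be sketched as below, given per-sample attention scores and Monte Carlo dropout statistics (mean probability and variance over repeated forward passes). The default thresholds mirror those used for antigen presentation prediction (top 8% by attention, lowest 80% of variances, 0.95 probability cut-off, lowest 40% of variances for exclusion); the function name and toy inputs are illustrative.

```python
import numpy as np

def select_pseudo_labels(attention, mean_prob, variance,
                         top_r=0.08, var_keep=0.80,
                         prob_cut=0.95, var_exclude=0.40):
    """Return indices of high-confidence positive pseudo-labels: samples in
    the top-R% by attention whose MC-dropout variance lies in the lowest
    `var_keep` fraction, excluding samples the model already calls positive
    (probability > `prob_cut`) with variance in the lowest `var_exclude`."""
    n = len(attention)
    by_attention = np.argsort(-attention)                  # high attention first
    candidates = set(by_attention[: max(1, int(top_r * n))])
    by_variance = np.argsort(variance)                     # low variance first
    low_var = set(by_variance[: int(var_keep * n)])
    already = set(by_variance[: int(var_exclude * n)]) & set(
        np.where(mean_prob > prob_cut)[0])
    return sorted((candidates & low_var) - already)

attention = np.array([0.9, 0.8, 0.1, 0.2, 0.05])
mean_prob = np.array([0.99, 0.5, 0.5, 0.5, 0.5])
variance = np.array([0.01, 0.02, 0.03, 0.04, 0.05])
# sample 0 is excluded (already confidently positive); sample 1 is selected
chosen = select_pseudo_labels(attention, mean_prob, variance, top_r=0.4)
```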
Positive-anchor triplet loss
MHC-II molecules exhibit extensive diversity, exemplified by the human HLA-DR, HLA-DQ and HLA-DP loci, which collectively comprise 11,674 allelic variants according to the IPD-IMGT/HLA database74. Additionally, the peptides themselves show notable variability in sequence and length. The peptide-binding groove of MHC-II is highly specific for binding AAs in peptides75, determining which peptides can be bound and presented. Triplet loss76 enhances the model’s ability to perceive these subtle differences by minimizing the distance between similar samples (positive samples) and maximizing the distance between dissimilar samples (negative samples). This loss is particularly suitable for predicting pMHC-II BA and antigen presentation, as it improves learning on challenging-to-discriminate pMHC-II samples and facilitates the discovery of nuanced binding patterns between peptides and specific MHC-II molecules.
In the experimental setup, triplet loss was calculated using only positive samples as anchors. This strategy enabled the model to better distinguish crucial binding features within pMHC-II complexes. The positive-to-negative sample ratio in the antigen presentation dataset was 1:10; using negative samples as anchors would increase computational costs and could distract from the model’s primary goals by unnecessarily optimizing distances between negative samples. Such optimization does not enhance discrimination and reduces learning efficiency. To address these challenges and align with the critical learning objectives, we formulated the triplet loss for each mini-batch as follows:
$${{\mathcal{L}}}_{{\rm{triplet\_loss}}}\left(a,{\rm{pos}},{\rm{neg}}\right)=\frac{1}{N}\sum _{i=1}^{N}\max \{d\left({a}_{i},{{\rm{pos}}}_{i}\right)-d\left({a}_{i},{{\rm{neg}}}_{i}\right)+{\rm{margin}},0\},$$
where d(xi, yi) = ||xi − yi||p is the distance metric; we used the Euclidean distance, setting p = 2. Here, i indexes samples within a mini-batch, N is the batch size and a exclusively denotes the positive samples used as anchors; pos and neg indicate the positive and negative samples within the mini-batch, respectively; margin is a threshold defining the minimum distance by which the negative sample must be farther from the anchor than the positive sample to avoid incurring a loss.
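The positive-anchor triplet loss can be sketched in NumPy as a direct transcription of the formula with p = 2; the function name, toy inputs and margin value are illustrative (the paper does not report the margin).

```python
import numpy as np

def positive_anchor_triplet_loss(anchors, positives, negatives, margin=0.2):
    """Mean of max(d(a_i, pos_i) - d(a_i, neg_i) + margin, 0) over a
    mini-batch, using Euclidean distance; anchors are positive samples only."""
    d_pos = np.linalg.norm(anchors - positives, axis=1)  # d(a_i, pos_i)
    d_neg = np.linalg.norm(anchors - negatives, axis=1)  # d(a_i, neg_i)
    return float(np.maximum(d_pos - d_neg + margin, 0.0).mean())

# anchor coincides with the negative and is unit distance from the positive,
# so the loss is 1 - 0 + 0.2 = 1.2
loss = positive_anchor_triplet_loss(np.zeros((1, 2)),
                                    np.array([[1.0, 0.0]]),
                                    np.zeros((1, 2)))
```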
ImmuScope training process
ImmuScope backbone training process
The backbone of ImmuScope is a pretrained model for other downstream tasks. Initially, we loaded the SA and MA data, and then we computed the positive-anchor triplet loss for the embeddings of pMHC-II instances, denoted as \({{\mathcal{L}}}_{{{\rm{triplet\_loss}}}}\). In branch a, the bag labels for SA and MA data were optimized using the binary cross-entropy loss function, represented as \({{\mathcal{L}}}_{{{\rm{MIL\_SA}}}}\) and \({{\mathcal{L}}}_{{{\rm{MIL\_MA}}}}\), respectively. Concurrently, in branch b, the SA data were optimized using the binary cross-entropy loss \({{\mathcal{L}}}_{{{\rm{instance\_SA}}}}\). The composite loss function for the backbone is defined as
$${{\mathcal{L}}}_{{{\rm{ImmuScope}}\; {\rm{backbone}}}}=\tau \times {{\mathcal{L}}}_{{{\rm{triplet\_loss}}}}+{{\mathcal{L}}}_{{\rm{MIL\_MA}}}+{{\mathcal{L}}}_{{\rm{{MIL\_SA}}}}+{{\mathcal{L}}}_{{{\rm{instance\_SA}}}},$$
where τ is the weighting factor for the triplet loss, set to 0.1. Throughout the training process, the parameters of the ImmuScope backbone network were refined by synergistically combining individual instance learning, aggregated label optimization and metric learning strategies. This integrative approach ensured a robust optimization of model parameters, effectively capturing both micro- and macro-level data characteristics. The Adam optimizer with a learning rate of 1 × 10−3 was used to train the backbone of ImmuScope for up to 20 epochs, with the final model being selected based on the best performance on the validation set.
Antigen presentation prediction
On the basis of the backbone of ImmuScope, we gradually introduced high-quality positive pseudo-labels from MA data to construct an antigen presentation prediction model. In each epoch, we first obtained the predicted antigen presentation probability on branch b, the attention score in the MIL aggregator and the corresponding bag score through forward calculations. To ensure prediction stability and accurately gauge model uncertainty, we employed an architecture with Monte Carlo dropout to perform ten forward passes and analysed both mean and variance of these predictions. Initially, we selected the top 8% of samples with high attention weights and whose variances ranked in the top 80% (from lowest to highest). These thresholds (8% and 80%) were determined through preliminary experiments and an examination of the distribution of attention scores, ensuring that we focused on high-confidence, relatively low-variance samples. We also excluded samples with predicted antigen presentation probabilities exceeding 0.95 and those whose variances ranked in the top 40% (from lowest to highest), as they were already reliably identified by the model.
As the iterations progressed and the model’s internal representations became more stable, we gradually relaxed the threshold on attention weights from the top 8% to 12%. This step—commonly employed in self-training approaches—aims to broaden the scope of positive pseudo-labelled samples, thereby enriching the training dataset with more diverse pMHC-II binding candidates and further enhancing the model’s learning capacity. At the same time, we utilized this expanded dataset, SA-extend (EL), for incremental fine tuning of the backbone model. Finally, we fine-tuned the ImmuScope backbone with the final SA-extend (EL) dataset over ten additional epochs using the Adam optimizer (learning rate = 3 × 10−5), yielding the optimized ImmuScope-EL model for antigen presentation prediction.
CD4+ T cell epitope prediction
Antigen presentation is a prerequisite for the CD4+ T cell immune response. In line with the methodology of NetMHCIIpan-4.3, our CD4+ T cell epitope prediction model, ImmuScope, similarly incorporated both BA and EL data. Specifically, the antigen presentation model ImmuScope-EL was fine-tuned using BA data, employing a learning rate of 2 × 10−5, and leveraging the Adam optimizer to minimize the mean squared error loss over 20 epochs. To balance the influence of BA and EL data on CD4+ T cell epitope prediction, we set an 8:2 weighting ratio for BA and SA data in the validation set. This ratio was determined based on preliminary experiments and data correlation: although BA data provide precise BA information, SA data capture actual antigen presentation events in vivo. The final validation metrics were calculated as follows:
$${{\rm{AUPR}}}_{{\rm{Val}}}=0.8\times {{\rm{AUPR}}}_{{\rm{BA}}}+0.2\times {{\rm{AUPR}}}_{{\rm{SA}}},$$
where AUPRBA and AUPRSA denote the AUPR values of the BA and SA subsets, respectively, within the validation set. Finally, we evaluated the performance of CD4+ T cell epitope prediction on the CD4+ epitope benchmark.
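The weighted validation metric can be sketched with scikit-learn, which the study uses for its performance metrics; here `average_precision_score` serves as the AUPR estimate, and the function name is illustrative.

```python
from sklearn.metrics import average_precision_score

def weighted_val_aupr(y_ba, p_ba, y_sa, p_sa, w_ba=0.8, w_sa=0.2):
    """AUPR_Val = 0.8 x AUPR_BA + 0.2 x AUPR_SA over the validation subsets."""
    return (w_ba * average_precision_score(y_ba, p_ba)
            + w_sa * average_precision_score(y_sa, p_sa))

# perfect rankings on both subsets give the maximum score of 1.0
score = weighted_val_aupr([0, 1], [0.1, 0.9], [0, 1], [0.2, 0.8])
```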
MHC-II epitope immunogenicity prediction
Immunogenicity is crucial as it determines the efficacy and safety of vaccines and therapies by triggering immune responses. We refined ImmuScope-EL further with immunogenicity data to develop the ImmuScope-IM model, tailored to immunogenicity prediction. The ImmuScope-IM model was optimized by an Adam optimizer with a learning rate of 1 × 10−3 and binary cross-entropy loss, for a maximum of 20 epochs. For the application of the ImmuScope-IM model in SARS-CoV-2 epitope discovery and dynamic escape mechanism studies, we excluded the epitope binding data pertaining to SARS-CoV-2 from our initial immunogenicity dataset to construct a dedicated SARS-CoV-2 immunogenicity benchmark dataset, ensuring unbiased benchmarking. This benchmark dataset was then used to train the ImmuScope-IM model for assessing the immunogenicity of SARS-CoV-2 epitopes.
All deep learning models were developed using PyTorch v. 1.12.1 and trained on an NVIDIA GeForce RTX 4090 GPU. Details of the algorithm and model hyperparameters are provided in Supplementary Tables 6 and 7, respectively. Computational efficiency and scalability are described in Supplementary Note 8.
Analysis of motif deconvolution
We employed the trained ImmuScope-EL model to perform motif deconvolution and obtain the set of binding peptide sequences for different MHC-II allomorphs. Specifically, a subset of MA data was fed into ImmuScope-EL, and the attention weights from the attention-based MIL aggregator and the antigen presentation probabilities from branches a and b were obtained. To ensure high-quality deconvolution, we selected presented peptides with an antigen presentation probability greater than 0.8 and an attention weight exceeding the reciprocal of the number of MHC-II categories in the bag. We then employed Seq2Logo to visualize the motif logos of different MHC-II allomorphs based on the sequences of the selected peptides.
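The peptide-selection filter can be sketched as follows; the record layout, function name and toy peptides are illustrative.

```python
def select_deconvolution_peptides(records, prob_cut=0.8):
    """Keep peptides with presentation probability > 0.8 and attention weight
    above 1 / (number of MHC-II categories in the bag).

    `records`: (peptide, probability, attention, n_alleles) tuples."""
    return [pep for pep, prob, att, n in records
            if prob > prob_cut and att > 1.0 / n]

records = [
    ("AAGELIGILNAAKVP", 0.95, 0.60, 4),  # passes both filters
    ("LLDFVRFMGVVSILT", 0.95, 0.20, 4),  # attention 0.20 <= 1/4, rejected
    ("GELIGILNAAKVPAD", 0.50, 0.60, 4),  # probability too low, rejected
]
kept = select_deconvolution_peptides(records)
```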
Quantification of MHC-II binding specificity
We first calculated the antigen presentation score by inputting 100,000 random human peptide sequences and the alleles to be assessed into ImmuScope-EL. Then, the samples with the top 1% of the predicted scores were selected for cluster analysis using GibbsCluster77, and the optimal number of clusters, that is, binding specificity, was determined based on the lowest average KLD. Finally, we evaluated the MHC binding specificity quantified by ImmuScope-EL by comparing the KLD with the PSFM matrix based on the peptidomics data. The prediction results of NetMHCIIpan were obtained by predicting the top 1% of random human peptides using the NetMHCIIpan-4.3 software package, whereas those of MixMHC2pred were obtained using the MixMHC2pred-2.0 web server.
Measuring the similarity of MHC binding motifs
To evaluate the similarity between sequence motifs generated by various algorithms and those obtained from peptidomics data, we first represented each set of peptide-binding cores with PSFMs. Each PSFM was then converted into a single vector by concatenating the frequency values at its nine positions, with each position containing 20 values corresponding to the 20 standard AAs. Finally, we calculated the symmetric KLD18 for any two PSFMs, denoted as a and b, using the following formula:
$${{{\rm{KLD}}}}_{a,b}=\sum _{i=1}^{N}\left[{a}_{i}\,\mathrm{ln}\left(\frac{{a}_{i}+\varepsilon }{{b}_{i}+\varepsilon }\right)+{b}_{i}\,\mathrm{ln}\left(\frac{{b}_{i}+\varepsilon }{{a}_{i}+\varepsilon }\right)\right],$$
where ε is a small positive constant, set to 1 × 10−10, to prevent division by zero, and N is the length of the concatenated PSFM vectors (180 for nine positions and 20 AAs).
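The symmetric KLD can be sketched directly from the formula; the vectors below stand in for concatenated 180-element PSFMs and are illustrative.

```python
import numpy as np

def symmetric_kld(a, b, eps=1e-10):
    """Symmetric KLD between two flattened PSFMs (e.g. 9 positions x 20 AAs)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(a * np.log((a + eps) / (b + eps))
                        + b * np.log((b + eps) / (a + eps))))

p = np.full(180, 0.05)                   # uniform PSFM: every AA frequency 1/20
q = p.copy()
q[:20] = np.linspace(0.005, 0.095, 20)   # perturb the first position
```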
Calculation of binding core alignment scores for epitopes
In our analysis of the melanoma neoepitope and SARS-CoV-2 epitope binding cores, we employed the ImmuScope-EL model to analyse the binding cores of various epitopes and to examine changes upon mutation. Initially, we used ImmuScope-EL to predict 100,000 random human peptides and selected the top 1% with the highest binding scores to create a position-specific scoring matrix for specific alleles (such as HLA-DRB1*01:01 in the SARS-CoV-2 epitope analysis). Subsequently, we calculated the matching degree of each 9-mer window of the candidate peptides against the position-specific scoring matrix. The alignment score for each window was then computed to assess how well it matched the binding pattern defined by the position-specific scoring matrix.
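The window-scoring step can be sketched as below, with a log-frequency position-specific scoring matrix built from 9-mer binding cores; `build_pssm` and `best_binding_core` are illustrative helpers, not the published implementation.

```python
import numpy as np

AAS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AAS)}

def build_pssm(cores, eps=1e-10):
    """Log-frequency position-specific scoring matrix from 9-mer cores."""
    counts = np.zeros((9, 20))
    for pep in cores:
        for pos, aa in enumerate(pep):
            counts[pos, AA_INDEX[aa]] += 1
    freqs = counts / counts.sum(axis=1, keepdims=True)
    return np.log(freqs + eps)

def best_binding_core(peptide, pssm):
    """Score every 9-mer window of `peptide`; return (best score, best core)."""
    windows = [peptide[i:i + 9] for i in range(len(peptide) - 8)]
    scores = [sum(pssm[p, AA_INDEX[aa]] for p, aa in enumerate(w))
              for w in windows]
    k = int(np.argmax(scores))
    return scores[k], windows[k]

pssm = build_pssm(["AAAAAAAAA"] * 5)            # toy cores, all poly-alanine
score, core = best_binding_core("CC" + "A" * 9 + "CC", pssm)
```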
Statistical analyses
Error bars depicted in the bar plots indicate 95% CIs, unless specified otherwise. Performance benchmarks such as AUC and AUPR were computed using the scikit-learn Python package (v. 1.3.0). UMAP analysis was conducted with the umap-learn Python package (v. 0.5.3). The predicted binding peptide ligands were further clustered using the GibbsCluster tool (v. 2.0). Sequence motifs were generated and visualized using the Seq2Logo tool (v. 2.0). Additionally, the three-dimensional structures of pMHC-II complexes were visualized using PyMOL (v. 2.5.7).
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
