In this section, we tune the hyperparameters of the proposed MTLPT model, compare it with commonly used single-task learning (STL) models and state-of-the-art vulnerability prediction approaches, and finally carry out ablation studies.
Experimental environment
The proposed framework was implemented in Python 3.8, and the experiments were run on the Microsoft Windows 10 operating system. All computations were performed on a CPU with GPU acceleration. The specific parameters are listed in Table 1.
Dataset
The publicly available Draper dataset, compiled by Russell et al.19, reflects the complexity and diversity of real programs by analysing software packages at the function level, which captures the overall flow of each subprogram. It consists of millions of function-level C and C++ code samples drawn from the SATE IV Juliet Test Suite, Debian Linux distributions, and public Git repositories on GitHub. The dataset contains 1.27 million entries, the large majority of which are non-vulnerable, and each CWE type accounts for less than 10% of the total. Addressing data imbalance is therefore a priority for the real-world Draper vulnerability dataset. In this experiment, we undersampled the original dataset: we randomly selected 10% of the samples without any CWE vulnerability and recombined them with all samples that contain CWE vulnerabilities, producing the undersampled Draper subset. The distribution of the data after sampling is shown in Table 2. After sampling, the share of each vulnerability category in the total data volume increased: CWE-120 by approximately 20%, CWE-other by approximately 15%, CWE-119 by approximately 10%, CWE-476 by approximately 5%, and CWE-469 by approximately 1%.
Fig. 3 shows scatter plots of the CWE category counts in the original Draper dataset (a) and in the undersampled subset (b). Before sampling, the CWE categories were extremely imbalanced, with very low proportions across all categories. Undersampling greatly alleviated the imbalance but did not remove it. For instance, the proportion of the CWE-469 category remains very low, so further measures are needed to mitigate the imbalance. In this paper, we split the undersampled subset into a training dataset (80%), a validation dataset (10%), and a test dataset (10%) to train and evaluate the model.

Fig. 3. Scatter plot distribution of CWE category numbers in the Draper dataset and the undersampled subset.
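The undersampling and the 80/10/10 split described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' script: the function name `undersample_and_split`, the fixed seed, the toy data, and the plain (non-stratified) random split are all hypothetical choices, since the text does not say how the split was drawn.

```python
import random

def undersample_and_split(samples, is_vulnerable, keep_ratio=0.10, seed=42):
    """Keep every vulnerable sample, keep a random `keep_ratio` share of the
    non-vulnerable ones, then split the result 80/10/10."""
    rng = random.Random(seed)
    positives = [s for s in samples if is_vulnerable(s)]
    negatives = [s for s in samples if not is_vulnerable(s)]

    # Undersample only the majority (non-vulnerable) group.
    kept_negatives = rng.sample(negatives, int(len(negatives) * keep_ratio))

    subset = positives + kept_negatives
    rng.shuffle(subset)

    n_train = int(0.8 * len(subset))
    n_val = int(0.1 * len(subset))
    train = subset[:n_train]
    val = subset[n_train:n_train + n_val]
    test = subset[n_train + n_val:]
    return train, val, test

# Toy data (hypothetical): (function_id, has_any_cwe_label), 10% vulnerable.
toy = [(i, i % 10 == 0) for i in range(1000)]
train, val, test = undersample_and_split(toy, lambda s: s[1])

n_vulnerable = sum(1 for s in train + val + test if s[1])
print(len(train), len(val), len(test), n_vulnerable)
```

In the toy data the vulnerable share rises from 10% to roughly 53% of the subset, because all 100 vulnerable samples are kept while only 90 of the 900 non-vulnerable ones survive. The same mechanism raises the share of each CWE category in the real subset.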
Data preprocessing
In the data preprocessing phase, we prepared and optimized the dataset so that the MTLPT model can effectively learn and predict source code vulnerabilities. The process has three main stages: data cleaning, tokenization and serialization of the text data, and preparation of the target variables. These steps are detailed below.
First, the dataset was cleaned by removing redundant data such as duplicate functions and garbled code, and the non-numerical logical values in the training, validation, and test datasets were converted into numerical values that the model can process, improving the consistency of the data. Next, the source code was tokenized by word, that is, each text was converted into a sequence of tokens from which the numerical model inputs are built, and a vocabulary of maximum size \(L_{max}\), sorted by word frequency, was created from all source code texts in the training dataset. Each dataset (training, validation, and test) was then transformed, according to its tokens, into an equal-length input matrix of shape \((M, I_{max})\), where M is the number of samples and \(I_{max}\) is the fixed input length, which keeps the model input consistent. Finally, the target columns in the dataset were converted into one-hot encoding, with \(N_c\) categories. Each category has a clear target vector representation \(V_1,…,V_c\) (c=5), which lays the foundation for subsequent model training and evaluation.
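The tokenization, vocabulary construction, padding, and one-hot steps can be illustrated with a small standard-library sketch. The regular-expression tokenizer, the reserved indices 0 (padding) and 1 (out-of-vocabulary), and the helper names are assumptions made for illustration; the text only states that tokenization is word-based and that the vocabulary is ordered by frequency. The two-class one-hot call at the end is likewise only an example, since the text does not spell out how the encoding is applied to the five targets.

```python
import re
from collections import Counter

def tokenize(code):
    # Identifiers/keywords, integer literals, or single punctuation characters.
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code)

def build_vocab(train_texts, max_size):
    """Frequency-sorted vocabulary built from the training texts only."""
    counts = Counter(tok for text in train_texts for tok in tokenize(text))
    # Index 0 is reserved for padding and 1 for out-of-vocabulary tokens.
    most_common = [tok for tok, _ in counts.most_common(max_size - 2)]
    return {tok: i + 2 for i, tok in enumerate(most_common)}

def encode(text, vocab, max_len):
    """Map tokens to ids, truncate to max_len, and right-pad with zeros."""
    ids = [vocab.get(tok, 1) for tok in tokenize(text)][:max_len]
    return ids + [0] * (max_len - len(ids))

def one_hot(label, num_classes):
    return [1 if i == label else 0 for i in range(num_classes)]

train_texts = ["int x = 0;", "int y = x;"]
vocab = build_vocab(train_texts, max_size=10)

# Every row has the same length, giving an (M, max_len) input matrix.
matrix = [encode(t, vocab, max_len=6) for t in train_texts + ["int z;"]]
target = one_hot(1, num_classes=2)
print(matrix, target)
```

With the settings reported in the paper, `max_size` would be 10000 and `max_len` would be 500.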
Parameter settings
For our MTLPT method, we set the vocabulary size to 10000 and fixed the length of the input samples at 500. We used the Adam optimizer with a learning rate of 0.001 for parameter updates. For the MTLPT architecture, we used a custom lightweight Transformer block with a dropout rate of 0.5, multi-head attention with 4 heads, and an internal feed-forward network of dimension 512, together with an embedding layer and a position encoding layer of dimension 128, and a one-dimensional convolutional layer with 256 filters and convolution kernels of size 3. For the hyperparameters of the baseline methods, we followed the best settings specified by the original authors.
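To give a feel for what these settings imply, the following back-of-the-envelope sketch counts trainable parameters for layers of the stated sizes. It rests on several assumptions that the text does not confirm: a standard Transformer encoder block whose attention heads split the 128-dimensional model width, a learned position table, a kernel width of 3 for the one-dimensional convolution, and a convolution that takes the 128-dimensional sequence as input. The figures are therefore illustrative and are not meant to reproduce Table 4.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def transformer_block_params(d_model, d_ff):
    # Q, K, V and output projections, each d_model -> d_model. When the heads
    # split d_model between them, the head count does not change this total.
    attention = 4 * dense_params(d_model, d_model)
    # Position-wise feed-forward network: d_model -> d_ff -> d_model.
    feed_forward = dense_params(d_model, d_ff) + dense_params(d_ff, d_model)
    # Two layer normalizations, each with a scale and a shift vector.
    layer_norms = 2 * 2 * d_model
    return attention + feed_forward + layer_norms

VOCAB_SIZE, SEQ_LEN = 10_000, 500
D_MODEL, D_FF = 128, 512
CONV_FILTERS, KERNEL_SIZE = 256, 3

token_embedding = VOCAB_SIZE * D_MODEL
position_embedding = SEQ_LEN * D_MODEL  # assumes a learned position table
block = transformer_block_params(D_MODEL, D_FF)
conv1d = KERNEL_SIZE * D_MODEL * CONV_FILTERS + CONV_FILTERS

for name, count in [
    ("token embedding", token_embedding),
    ("position embedding", position_embedding),
    ("transformer block", block),
    ("conv1d", conv1d),
]:
    print(f"{name:>20}: {count:,}")
```

Under these assumptions the token embedding dominates the count, which is why a small embedding dimension and a small feed-forward width keep such a model light.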
To evaluate the effectiveness of the proposed MTLPT model, we compared it against traditional machine learning methods, deep learning baselines, and state-of-the-art approaches for vulnerability prediction. Random Forest (RF), a widely used ensemble learning method, was implemented with 100 estimators and serves as the benchmark for non-deep learning methods. For the deep learning baselines, we employed CNN, RNN, and LSTM models with consistent global hyperparameters, including a vocabulary size of 10,000, a sequence length of 500, and a fixed random seed for reproducibility. The CNN consists of an embedding layer followed by a convolutional layer with 512 filters, max pooling, and dense layers for output. The LSTM model has a single LSTM layer with 128 units followed by a dense layer for classification, and the RNN uses a SimpleRNN layer with the same structure. All models were trained for 40 epochs using the Adam optimizer with a categorical cross-entropy loss function. In addition, the state-of-the-art methods AST + ML16, Code2Vec44, and Code2Vec + MLP16 were included using the performance data reported in the literature. These approaches provide a strong benchmark for assessing MTLPT’s ability to predict vulnerabilities in real-world datasets.
Basic model comparison experiment

Fig. 4. MTLPT model training loss curve.
We trained the MTLPT model on the undersampled Draper subset using our proposed loss function \(L_t\) based on dynamic weights. As shown in Fig. 4, Task1 to Task5 are the training curves of the prediction tasks for the vulnerability categories CWE-119, CWE-120, CWE-469, CWE-476, and CWE-other. In the first 15 iterations, the total loss of the proposed method dropped quickly and then became relatively stable, and the loss of each task also stabilized within 15 iterations. This result demonstrates the adaptability of our learning model to the real-world imbalanced vulnerability dataset (RIV) environment.
(1) For RQ1: How is the performance of our MTLPT method based on multi-task learning for predicting the five most common types of vulnerabilities in unbalanced real-world code?
Because RIV is an unbalanced dataset, we evaluated the performance of the proposed MTLPT method on real-world vulnerability data from several angles. We compared it against a traditional, widely used machine learning model (RF30), deep learning models (LSTM24, RNN26, CNN25), and state-of-the-art approaches in vulnerability prediction (AST + ML16, Code2Vec44, and Code2Vec + MLP16). The comparison covers the vulnerability categories CWE-119, CWE-120, CWE-469, CWE-476, and CWE-other, using Recall, F1-score, Area Under the Curve (AUC), and Matthews Correlation Coefficient (MCC)45 as key indicators. Recall measures the ability of the model to recognize all relevant instances. F1-score balances precision and recall, which is valuable on unbalanced datasets. AUC summarizes performance over all classification thresholds. MCC is particularly informative because it yields a high score only when the prediction does well in all four confusion matrix categories (true positives, false negatives, true negatives, and false positives), which makes it well suited to unbalanced datasets. Together, these four indicators allow us to evaluate how well the proposed method predicts the five most common types of vulnerabilities in unbalanced real-world code. The evaluation results are listed and compared in Table 3.
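The four indicators follow directly from their standard definitions, and a small sketch shows why accuracy alone is misleading on a rare class. The confusion-matrix counts and scores below are invented for illustration and are not taken from Table 3; the function names are hypothetical.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Accuracy, recall, precision, F1 and MCC from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when a row or column of the matrix is empty; 0 is the
    # usual convention in that case.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f1": f1, "mcc": mcc}

def auc(pos_scores, neg_scores):
    """Probability that a random positive is scored above a random negative
    (ties count half), which equals the area under the ROC curve."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

# A rare class: 20 positives among 1000 samples (invented counts).
rare = binary_metrics(tp=8, fp=40, tn=940, fn=12)
# A model that never predicts the rare class at all.
trivial = binary_metrics(tp=0, fp=0, tn=980, fn=20)
ranking_quality = auc([0.9, 0.4], [0.5, 0.3, 0.1])

print(rare)
print(trivial)
print(ranking_quality)
```

The second model reaches 98% accuracy while its recall, F1-score, and MCC are all zero, which is the situation these indicators are meant to expose.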
As can be seen from Table 3, MTLPT outperforms the other models on most indicators across the CWE category prediction tasks, including the traditional, widely used machine learning model (RF30), the deep learning models (LSTM24, RNN26, CNN25), and the state-of-the-art approaches (AST+ML16, Code2Vec44, Code2Vec+MLP16). For the five vulnerability prediction tasks, the recall of MTLPT is significantly higher than that of the single-task baselines. For example, in the CWE-119 prediction task, the recall of MTLPT reaches approximately 79%, while CNN achieves only 36.59%, a difference of more than 40 percentage points. Among the state-of-the-art methods, Code2Vec+MLP achieves a recall of 87.3%, but MTLPT provides a more balanced performance across metrics such as F1-score and MCC, showing its robustness and adaptability.
In addition to recall, MTLPT surpasses other models in terms of F1-score and AUC. For instance, in the CWE-120 vulnerability category prediction task, MTLPT achieves an F1-score approximately 43% higher than CNN and an AUC of 90%, highlighting its ability to maintain high classification performance under varying thresholds. In terms of the MCC indicator, MTLPT also demonstrates superior performance. For example, in the CWE-469 vulnerability category, which accounts for only 1.3% of the dataset, MTLPT achieves an MCC of about 30%, outperforming all other baseline methods, including state-of-the-art approaches.
RF represents traditional machine learning models, which are effective in handling moderately imbalanced datasets due to their ensemble learning capabilities. However, RF struggles with high-dimensional and unstructured data, such as source code representations, resulting in low recall, F1-score, and MCC values in all vulnerability categories.
Deep learning models, including LSTM, RNN, and CNN, improve upon traditional methods by leveraging their ability to learn sequential and spatial patterns. LSTM and RNN are capable of capturing temporal dependencies, while CNN is effective in extracting local spatial features through convolutional layers. Despite these advantages, these models cannot share information across tasks, limiting their effectiveness on highly imbalanced datasets. For instance, in the CWE-469 category, LSTM, RNN, and CNN fail to detect any vulnerabilities, indicating their inability to leverage correlations from other tasks or effectively learn features of rare classes.
State-of-the-art approaches, such as AST+ML and Code2Vec, focus on structural and semantic code analysis. AST+ML uses abstract syntax trees to extract structural features, while Code2Vec encodes syntactic and semantic information into vector representations for classification tasks. Code2Vec+MLP enhances Code2Vec by introducing additional layers to improve feature learning. However, these methods still struggle with extreme data imbalance, as shown in the CWE-469 category, where they fail to achieve significant recall or MCC. MTLPT, by contrast, leverages a multi-task learning framework, which enables it to capture correlations across vulnerability categories and perform well even on highly imbalanced datasets.
In the CWE-469 vulnerability category, despite its extreme data imbalance (only 1.3% of the dataset), MTLPT effectively learns its features and achieves high recall and MCC, whereas single-task deep learning models and state-of-the-art approaches fail to perform well. This is because MTLPT integrates a custom lightweight Transformer block and position encoding layer, allowing it to better capture contextual and structural information from the data. Additionally, by sharing information across tasks, MTLPT alleviates the class imbalance issue, improving its ability to predict rare vulnerabilities.
In summary, MTLPT demonstrates strong robustness and adaptability to varying data distributions and outperforms all baseline models. Traditional machine learning models, such as RF30, lack the capability to process high-dimensional and unstructured data effectively. Deep learning models (LSTM24, RNN26, CNN25) show improvements in learning sequential and spatial patterns but struggle with extreme class imbalance due to their single-task nature. State-of-the-art approaches, including AST+ML16 and Code2Vec+MLP16, advance vulnerability prediction by leveraging structural and semantic information, yet they remain limited in handling underrepresented classes in imbalanced datasets. MTLPT addresses these shortcomings by integrating a custom lightweight Transformer block and position encoding layer into a multi-task learning framework, enabling it to capture both local and global dependencies, share information across tasks, and improve prediction accuracy for rare vulnerabilities. These advantages make MTLPT a powerful and generalizable solution for real-world vulnerability prediction tasks.

Fig. 5. Confusion matrices of the binary classification prediction tasks on the Draper subset.
Figure 5 shows the confusion matrices of the five vulnerability category prediction tasks on the Draper subset. The MTLPT model achieves balanced precision and recall while maintaining high accuracy. MTLPT uses a multi-task learning framework and a loss function based on dynamic weight adjustment to exploit the correlation between tasks through the shared representation learning layer, which enhances the model’s prediction ability for the different vulnerability categories (CWE-119, CWE-120, CWE-469, CWE-476, CWE-other). In addition, as the confusion matrix of CWE-469 in Fig. 5c shows, the MTLPT model reduces false positives (FP) and increases true negatives (TN), which lowers the false alarm rate. In Fig. 5d and e, the model shows high TN values, indicating that MTLPT can effectively identify and exclude instances that do not contain the specific vulnerability.
Table 4 compares the total parameter counts of the basic models with that of our model, which can effectively predict most CWE vulnerability categories. The proposed MTLPT model has about 19% fewer parameters than the CNN model and about 73% fewer than the RF model. This is because MTLPT uses a custom lightweight Transformer block, which reduces the parameter count through a lower feed-forward dimension and fewer attention heads while maintaining performance. The MTLPT model therefore stays lightweight while achieving high performance, which reduces the number of parameters to be trained and improves the efficiency of the model.
In terms of inference time, the MTLPT model demonstrates nearly a 50% improvement compared to the RF model, showcasing its efficiency in handling complex data with reduced computational complexity. While the inference time is slightly higher than that of the CNN model, the MTLPT achieves a balanced trade-off between computational cost and performance through its custom lightweight Transformer blocks, enabling efficient processing of real-world vulnerability datasets.
The memory usage of the MTLPT model exhibits a remarkable advantage, consuming less than 1% of the memory required by the RF model and reducing memory usage by over 90% compared to the CNN model. This improvement is primarily attributed to the optimized design of the MTLPT model, which leverages reduced dimensions in multi-head attention mechanisms and efficient weight-sharing strategies, significantly lowering memory consumption without compromising predictive accuracy.
Result: In summary, MTLPT can capture complex patterns and dependencies between the five most common types of vulnerabilities on unbalanced real-world vulnerability datasets, fully learn vulnerability features, and achieve a performance that is 10% to 50% higher than traditional single-task learning and ensemble learning methods. It also handles underrepresented classes better than state-of-the-art approaches such as AST+ML16 and Code2Vec+MLP16. Moreover, the MTLPT model is substantially more lightweight: it reduces the total number of parameters by approximately 19% compared to CNN and by over 73% compared to RF, cuts inference time by nearly 50% compared to RF, and uses less than 1% of the memory required by RF and over 90% less than CNN. These features make the MTLPT model well suited for deployment in resource-constrained environments, such as embedded systems, and for large-scale vulnerability detection tasks.
(2) For RQ2: How effective is the dynamic weight-based loss function in the multi-task framework of the MTLPT method in alleviating the imbalance problem of real-world vulnerability data?
We trained and evaluated two models on the same preprocessed, undersampled Draper subset: MTLPT without the dynamic weight-based loss function \(L_t\), and our proposed MTLPT. As shown in Table 3, for the CWE-469 prediction task MTLPT scores about 10% higher in MCC than MTLPT without \(L_t\), and it reaches a recall of about 58% and an F1-score of 26%, which is far better than MTLPT (without \(L_t\)) in the overall evaluation. The comparison between MTLPT and MTLPT (without \(L_t\)) in Table 3 shows that \(L_t\) plays an important role in enhancing model performance. With \(L_t\), MTLPT consistently obtains better F1-score and MCC, which is crucial when dealing with highly unbalanced data such as that shown in Table 2. This is because the proposed dynamic weight-based loss function \(L_t\) guides the MTLPT model to pay more attention to underrepresented classes, which alleviates the impact of data imbalance while maintaining the performance of the other category prediction tasks.
Result: In general, the dynamic weight-based loss function \(L_t\) guides the MTLPT model to pay more attention to prediction tasks with less data and insufficient representation, and thereby alleviates the imbalance problem of real-world vulnerability data while maintaining the performance of the other prediction tasks.
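The general idea of a dynamically weighted multi-task loss can be sketched as follows. This is not the definition of \(L_t\), which is given elsewhere in the paper. The weighting rule here (a softmax over the current per-task losses, rescaled so that the weights sum to the number of tasks) is one generic scheme chosen purely for illustration, and the function names, the `temperature` parameter, and the loss values are hypothetical.

```python
import math

def dynamic_weights(task_losses, temperature=1.0):
    """Illustrative rule: tasks with a larger current loss get a larger weight.
    The weights are a softmax over the losses, scaled to sum to the task count."""
    exps = [math.exp(loss / temperature) for loss in task_losses]
    total = sum(exps)
    n_tasks = len(task_losses)
    return [n_tasks * e / total for e in exps]

def weighted_total_loss(task_losses, weights):
    """Total loss as the weighted sum of the per-task losses."""
    return sum(w * loss for w, loss in zip(weights, task_losses))

# Hypothetical per-task losses for five tasks; the third is the hardest.
losses = [0.40, 0.35, 0.90, 0.50, 0.45]
weights = dynamic_weights(losses)
total = weighted_total_loss(losses, weights)

print([round(w, 3) for w in weights])
print(round(total, 3))
```

Recomputing the weights at every training step shifts the optimization effort toward whichever task is currently learned worst, which is the behaviour the text attributes to \(L_t\) for underrepresented categories.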
Ablation experiment
(3) For RQ3: What is the contribution of the components of our proposed MTLPT method to dealing with unbalanced data?
An ablation experiment removes parts of a model and observes the effect on final performance, which explains the importance and role of each part. In this section, we use ablation experiments to evaluate how three elements of the MTLPT model contribute to its performance: the multi-task learning framework, the key component PT (the custom lightweight Transformer block and position encoding layer), and the dynamic weight-based loss function \(L_t\).
We compared the following four model configurations on the same preprocessed dataset: (1) PT without MTL: a single-task learning model that contains only the custom lightweight Transformer block and position encoding layer. (2) MTLPT without PT: a basic MTL model without the custom lightweight Transformer block and position encoding layer. (3) MTLPT without Dynamic weight: the MTLPT model with the dynamic weight strategy, that is, the dynamic weight-based loss function \(L_t\), removed. (4) MTLPT model: our proposed MTL model based on a custom lightweight Transformer block and position encoding layer.

Fig. 6. Comparative stacked graph of MTLPT model components for predictive performance on the imbalanced vulnerability category CWE-469.
In Fig. 6, we compare the performance of the MTLPT model with PT without MTL. We observe significant improvements in all metrics for the model with a multi-task learning framework, except for the AUC. This indicates that our proposed multi-task learning framework enables the sharing of information across various vulnerability prediction tasks, enhancing the predictive ability for minority categories. Consequently, it improves the model’s generalization capability and adaptability to imbalanced data.
The comparison between the MTLPT model and MTLPT without PT reveals substantial improvements across all metrics. Particularly, the MTLPT model outperforms MTLPT without PT by 30.1% in Recall and 6.7% in F1-score. This highlights the effectiveness of our custom lightweight Transformer blocks and position encoding layer in enhancing the model’s understanding of sequential data. It also improves the recognition of vulnerability categories with limited data, which is crucial for handling imbalanced datasets. Specifically, the introduction of custom lightweight Transformer blocks and position encoding layers enhances the model’s ability to capture complex patterns in vulnerability data. Furthermore, the sequence information introduced by the position encoding layer further improves the model’s contextual understanding, effectively enhancing vulnerability prediction accuracy.
Additionally, comparing the MTLPT model with MTLPT without Dynamic Weight demonstrates that the inclusion of \(L_t\) results in better performance in MCC and F1-score, with improvements of 30.0% and 26.4%, respectively. However, there is a slight decrease in AUC and Recall. This suggests that \(L_t\) contributes to better balancing of the model’s performance across different vulnerability category prediction tasks, especially in highly imbalanced data environments. This observation is consistent with the results from the previous comparative experiments.
Result: The ablation experiments in this section confirm the importance of the MTLPT model’s multi-task learning framework, key components PT (custom lightweight Transformer blocks and position encoding layer), and \(L_t\) for handling imbalanced data. The synergistic effect of each component leads to near-optimal performance across all evaluation metrics, making this improvement highly effective and relevant for real-world vulnerability category prediction in imbalanced scenarios.
