Dataset description
In this work, we choose various open-source projects from the PROMISE repository (https://openscience.us/repo/defect/); a description of the selected projects is provided in Table 1. All projects in the PROMISE dataset used in this study include CSV data and Java source code files. The first column in the CSV files contains the path to the Java code; we use these paths to read the Java files and extract their ASTs. We chose these seven projects to cover several kinds of data types, classes, functions, control flow, etc. The PROMISE dataset has been utilized extensively in several studies to create practical SDP models12,38,43,54,55. We chose it because of its wide use, which indirectly enables comparison of the proposed approach with previous research.
First, we use each project’s version numbers to retrieve the corresponding source-code archive, from which we extract AST nodes and then generate the integer vectors that feed our CNN model. In this work, we extracted four categories of AST nodes: method declarations and invocations, class instance creations, control-flow nodes, and other nodes such as Formal Parameter, Basic Type, and Member Reference. Table 2 shows the details of these categories. After that, we use Word2vec embedding to generate the token sequences from the extracted AST nodes and then use these token sequences to create semantic features for training the CNN model, as described in Section “CNN-MLP structure”.
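To make the extraction step concrete, the sketch below shows one possible way to collect these four node categories with the Javalang library used in our implementation (Section “Parameters setting”); the node-to-category mapping and the token naming are illustrative assumptions rather than our exact procedure.

```python
# Illustrative sketch: collecting the four AST node categories from a Java file
# with javalang and emitting a token sequence. The node-to-category mapping and
# the token naming are assumptions for demonstration only.
import javalang

CATEGORIES = {
    "method": (javalang.tree.MethodDeclaration, javalang.tree.MethodInvocation),
    "instance_creation": (javalang.tree.ClassCreator,),
    "control_flow": (javalang.tree.IfStatement, javalang.tree.ForStatement,
                     javalang.tree.WhileStatement, javalang.tree.SwitchStatement),
    "other": (javalang.tree.FormalParameter, javalang.tree.BasicType,
              javalang.tree.MemberReference),
}

def extract_tokens(java_path):
    """Parse one Java file and return a token sequence built from selected AST nodes."""
    with open(java_path, encoding="utf-8", errors="ignore") as f:
        tree = javalang.parse.parse(f.read())
    tokens = []
    for _, node in tree:  # depth-first traversal over all AST nodes
        for node_types in CATEGORIES.values():
            if isinstance(node, node_types):
                # Prefer an identifier (method/field name) when available,
                # otherwise fall back to the node type itself.
                tokens.append(getattr(node, "name", None)
                              or getattr(node, "member", None)
                              or type(node).__name__)
                break
    return tokens
```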
The second part of our dataset preparation handles the traditional features; we consider 20 traditional metrics, the details of which are listed in Table 3. On the PROMISE data, we applied the imbalanced-data handling approach outlined in Section “Data preprocessing”.
Data preprocessing
Word embedding (Word2vec)
CNN is designed to process inputs as numerical vectors, with the prerequisite that these input vectors have uniform lengths. An initial step in incorporating semantic features into CNN involves establishing a mapping between tokens (semantic units in the source code) and integers, effectively transforming token vectors into corresponding integer vectors. Each distinct token is assigned a unique integer identifier to facilitate this conversion. However, the challenge of varying lengths in these integer vectors persists, necessitating a solution to standardize the input size. The Word2vec technique is employed to address this issue and ensure the compatibility of the semantic features with CNN’s input requirements. Word2vec56,57 plays a pivotal role in creating a consistent mapping between tokens and integers; this mapping overcomes the obstacle of disparate vector lengths and generates fixed-size numerical inputs suitable for CNN processing. By leveraging Word2vec, the rich semantic information embedded within the source code can be converted into a standardized numerical format, enabling the effective application of CNN for tasks that require the analysis of semantic features.
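A minimal sketch of the token-to-integer step is shown below (assuming a simple vocabulary dictionary and post-padding to a fixed length `max_len`); it is a common way to obtain uniform-size input vectors and does not reproduce our exact code.

```python
# Minimal sketch (not our exact code): map each distinct token to a unique integer
# and pad every sequence to a fixed length so the CNN receives uniform-size inputs.
from tensorflow.keras.preprocessing.sequence import pad_sequences

def build_vocabulary(token_sequences):
    """Assign each distinct token a unique positive integer (0 is reserved for padding)."""
    vocab = {}
    for sequence in token_sequences:
        for token in sequence:
            vocab.setdefault(token, len(vocab) + 1)
    return vocab

def to_integer_vectors(token_sequences, vocab, max_len):
    """Encode token sequences as integers and pad/truncate them to max_len."""
    encoded = [[vocab[token] for token in sequence] for sequence in token_sequences]
    return pad_sequences(encoded, maxlen=max_len, padding="post", truncating="post")
```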
Word2vec represents a method for word embedding within the realm of Natural Language Processing (NLP), facilitating the conversion of words into computationally manageable and systematically organized vectors. It offers two distinct approaches for constructing word embeddings: Continuous Bag-of-Words (CBOW) and Skip-Gram. This study uses the CBOW model to derive integer vectors from token vectors. The design of the CBOW model is centered on predicting the target word based on the context provided by surrounding words. This approach effectively encapsulates semantic information within numerical vectors, laying the groundwork for further processing and analysis.
Figure 5 shows the CBOW example; Fig. 5a presents the general mechanism of CBOW, and Fig. 5b presents the detailed steps of CBOW. As depicted in Fig. 5b, when analyzing token vectors such as “ChunkedIntArray if appendSlot readEntry if specialFind slotsUsed discardLast writeEntry if writeSlot if readSlot if,” the tokens can function either within a context window or as target words. Using a context window of size 4 enables the model to predict the target word based on the surrounding context words. This predictive mechanism is fundamental to how the word2vec model operates, leveraging the immediate linguistic environment to understand and predict word usage. In this study, we configure the word2vec model with a vector size of 100 and a context window size of 5. This configuration is chosen to balance the granularity of the semantic representation with computational efficiency, allowing a nuanced capture of semantic relationships within a manageable computational framework. By adjusting these parameters, we aim to improve the model’s ability to capture linguistic patterns and relationships, thereby improving the overall effectiveness of the semantic feature extraction process.

Continuous bag-of-words example. (a) The CBOW model architecture and (b) the CBOW (context, target) example.
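For illustration, the Gensim call below reproduces the settings stated above (vector size 100, context window 5, CBOW); the toy token sequences, `min_count`, and `workers` values are assumptions for demonstration only.

```python
# Illustrative Gensim configuration matching the settings described above
# (vector size 100, context window 5, CBOW); min_count and workers are assumed values.
from gensim.models import Word2Vec

token_sequences = [
    ["ChunkedIntArray", "if", "appendSlot", "readEntry", "if", "specialFind"],
    ["slotsUsed", "discardLast", "writeEntry", "if", "writeSlot", "if", "readSlot", "if"],
]

w2v = Word2Vec(
    sentences=token_sequences,
    vector_size=100,  # dimensionality of the embedding
    window=5,         # context window size
    sg=0,             # 0 selects the CBOW architecture
    min_count=1,      # assumed minimum word frequency
    workers=4,
)
vector = w2v.wv["readEntry"]  # 100-dimensional embedding of one token
```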
Handling imbalance
Datasets for SDP frequently exhibit class imbalance, where buggy instances constitute only a minor fraction of the total dataset. This imbalance ratio varies, directly correlating with the defect rate within the dataset. For instance, among the projects detailed in Table 1, the ’ant’ project displays the most significant imbalance, with a buggy rate of only 22.2%. Such imbalance poses challenges to model performance, particularly affecting its proficiency in accurately identifying defective instances, as highlighted by comprehensive studies in the field58,59.
Addressing the issues arising from imbalanced data is crucial for improving model accuracy and reliability. As detailed in60,61, two prevalent strategies for mitigating these challenges are Oversampling and Undersampling. Oversampling duplicates instances from the minority class (defective files) to achieve a balanced dataset representation, while Undersampling reduces the number of instances from the majority class (non-defective files).
In this study, we opt for the Undersampling approach. This preference is guided by the rationale that Undersampling maintains the integrity of the original dataset by using only genuine instances, thus avoiding the potential introduction of artificial bias that might occur with Oversampling. This approach ensures that our training sets accurately reflect real-world conditions, providing a more reliable basis for model training and evaluation. By prioritizing authenticity in our dataset composition, we aim to enhance the model’s predictive performance in a practical and effective manner for software defect prediction tasks.
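A minimal sketch of random undersampling with Pandas is given below, assuming a DataFrame `df` whose binary column `bug` marks defective files; our actual procedure may differ in detail.

```python
# A minimal sketch of random undersampling with Pandas, assuming a DataFrame `df`
# whose binary column `bug` marks defective files; the actual procedure may differ.
import pandas as pd

def undersample(df, label="bug", seed=42):
    buggy = df[df[label] == 1]
    clean = df[df[label] == 0]
    # Keep only as many majority-class (non-defective) rows as minority-class rows.
    clean_down = clean.sample(n=len(buggy), random_state=seed)
    # Shuffle the balanced set before training.
    return pd.concat([buggy, clean_down]).sample(frac=1, random_state=seed)
```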
Parameter settings
We divide our experiments into two main implementation steps: semantic feature extraction and defect classification. In the first phase, to generate the semantic features, we use Word2vec in deeplearning4j3 to construct a group of word embeddings by changing the size of the context window and the dimensionality. We leverage the studies of57,62,63 to set values for the context window size, dimension size, batch size, negative sampling, minimum word frequency, and iterations. Table 4 shows the details of these parameters. In the second phase, the parameters for the classifier model are defined; we assign values for parameters such as the number of input layers, hidden layers, and nodes in each layer. We additionally consider the batch size, epochs, the activation functions used in the input and hidden layers (CNN-MLP activation), the fully connected activation, the merged activation function, the optimizer, and the learning rate. Table 4 shows the details of these parameters. We build our CNN-MLP model using Python 3.9.6 with TensorFlow 2.5.0 and Keras 2.5.0. Other implementations use Gensim for the word2vec embedding, Pandas for processing the dataset, and Javalang and NLTK for generating the ASTs. The code was run on an Intel® Core™ i7 CPU with an NVIDIA GeForce MX250 (CUDA 11.2).
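The simplified Keras sketch below illustrates the two-branch idea behind CNN-MLP: a CNN branch over the integer-encoded token sequences and an MLP branch over the 20 traditional metrics, merged into one classifier. The layer sizes, activations, optimizer, and learning rate shown are illustrative assumptions, not the exact values listed in Table 4.

```python
# Simplified two-branch CNN-MLP sketch; sizes and hyperparameters are assumptions.
from tensorflow.keras import layers, models, optimizers

MAX_LEN, VOCAB_SIZE, EMBED_DIM, N_METRICS = 2000, 5000, 100, 20  # assumed sizes

# CNN branch for the semantic features
semantic_in = layers.Input(shape=(MAX_LEN,), name="token_ids")
x = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(semantic_in)
x = layers.Conv1D(filters=10, kernel_size=5, activation="relu")(x)
x = layers.GlobalMaxPooling1D()(x)

# MLP branch for the traditional metrics
metrics_in = layers.Input(shape=(N_METRICS,), name="traditional_metrics")
y = layers.Dense(64, activation="relu")(metrics_in)

# Merge both branches and classify
merged = layers.concatenate([x, y])
merged = layers.Dense(100, activation="relu")(merged)
output = layers.Dense(1, activation="sigmoid")(merged)

model = models.Model(inputs=[semantic_in, metrics_in], outputs=output)
model.compile(optimizer=optimizers.Adam(learning_rate=0.001),
              loss="binary_crossentropy", metrics=["accuracy"])
```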
According to our findings in previous studies38,64, validation techniques like k-fold cross-validation often introduce significant bias when evaluating SDP models, leading to inaccurate assessments. In this work, we combined two feature types (semantic and traditional) and performed several procedures on the dataset, including handling imbalanced data and integrating semantic features with traditional features. To avoid the issues associated with k-fold cross-validation, we did not use it in this study; instead, we evaluated the performance of our CNN-MLP model by building prediction models across different releases (see Table 1). We also employed various performance measures to assess the model’s performance in non-effort-aware and effort-aware scenarios.
In this study, we employed several baseline methods and compared their performance against our proposed model, as described in Section “Baseline methods”. The hyperparameters for each technique were carefully tuned based on the recommendations from their respective literature. Specifically, for the CNN model, we utilized 10 hidden layers, each comprising 100 nodes; additionally, we set the number of filters to 10 and the filter length to 5. The DBN model consisted of 10 hidden layers, with 100 nodes in each layer. The LSTM model was configured with 16 LSTM units per layer, an attention width of 250, and a vector dimension of 16 for calculating the attention. The DP-HNN model had 5 hidden layers, each containing 100 nodes; the AdaMax optimizer was employed with a default learning rate of 0.002. For the SDP-BB model, we utilized two Bidirectional LSTM (BiLSTM) layers, each comprising 128 units; furthermore, seven hidden layers with hidden sizes of 8, 16, 32, 48, 64, 128, and 256 were incorporated, and the Adam optimizer was used with a fixed learning rate of 0.001. The ACGDP model consisted of 5 layers, with a hidden size of 249 and a dropout rate of 0.361.
To ensure a fair comparison, the number of epochs was set to 100 for all models, including our proposed method. This consistent epoch setting allowed for a comprehensive evaluation of the models’ performance under identical training conditions.
Baseline methods
This section introduces the baseline methodologies utilized in our study. To ascertain the efficacy of our newly proposed model, we have chosen seven distinct methods to serve as our comparative baselines. These baseline models incorporate various features, spanning traditional metrics, semantic attributes, simple feature-integration approaches, and source-code changes, paired with a diverse array of classifiers ranging from conventional machine learning algorithms to more advanced deep learning frameworks. This selection is designed to provide a comprehensive benchmark, allowing us to thoroughly evaluate the performance of our proposed model against established methods that utilize different combinations of features and classification techniques. By doing so, we aim to highlight our model’s unique strengths and potential advantages in accurately predicting outcomes based on the analyzed features.
Traditional (TR)65
TR is a method that uses 20 traditional handcrafted code metrics shown in Table 3 as input to train a classifier (Naïve Bayes and Random Forest).
CNN43
CNN serves as a predictive model for detecting software defects. It utilizes ASTs as input data to identify semantic elements within the source code. This approach amalgamates the semantic features extracted by CNN with traditional software metrics through simple concatenation, aiming to enhance the overall predictive performance.
DBN38
A defect prediction model that utilizes semantic features and features derived from source code changes, created using a Deep Belief Network.
LSTM39
An SDP framework that leverages LSTM networks to extract syntactic features directly from program file ASTs. These extracted syntactic features are then utilized as inputs to predict the presence of software defects within the codebase.
SDP-BB48
An SDP approach using BiLSTM and BERT to predict defects in software code effectively. This model enhances defect prediction accuracy by leveraging two deep learning models to understand the semantic features of code.
DP-HNN50
A defect prediction framework based on the Hierarchical Neural Network. This model capitalizes on the hierarchical nature of ASTs, strategically segmenting extensive file-level ASTs into multiple subtrees centered around key AST nodes pivotal to the SDP task.
ACGDP66
An Augmented-Code Graph Defect Prediction model that extracts features from the code’s graph representation. Subsequently, graph neural networks are applied to these extracted features to capture intricate patterns and make predictions regarding defects within software modules.
CNN-MLP
It is the prediction model introduced in this study.
For the experiment’s integrity and to validate the outcomes, we implement undersampling as described in Section “Data preprocessing” to address imbalances in the dataset, employing it as our chosen technique for balanced learning. Each experiment is conducted 30 times to ensure reliability and consistency in the results.
Performance measures
This study evaluates the performance of the proposed approach under non-effort-aware and effort-aware scenarios.
Non-effort-aware evaluation measures
In this scenario, it is presumed that sufficient resources are available to facilitate testing based on the outcomes of the defect prediction model, meaning that every predicted defective instance can undergo verification. SDP models determine the outcome of a code modification through four possible predictions: (1) correctly identifying a defective code change as defective (True Positive, TP), (2) inaccurately identifying a defective code change as non-defective (False Negative, FN), (3) correctly identifying a non-defective code change as non-defective (True Negative, TN), and (4) inaccurately identifying a non-defective code change as defective (False Positive, FP).
Given these four outcomes, the predictive model computes key performance metrics within the test dataset, including recall, F1 scores, and precision. In this study, we have selected F1 and AUC as the performance indicators to demonstrate the efficacy of our approach under conditions that do not take effort into account.
The following are the detailed definitions:
Recall refers to the ratio of all correctly classified faults to all faults.
$$\begin{aligned} \text{ Recall } =\frac{T P}{T P+F N}. \end{aligned}$$
(3)
Precision is defined as the proportion of changes correctly identified as faulty relative to the total number of changes classified as faulty, which is given as
$$\begin{aligned} \text{ Precision } =\frac{T P}{T P+F P}. \end{aligned}$$
(4)
F1 score An integrated metric that combines the recall and precision rates, representing the harmonic mean of precision and recall, which is defined as
$$\begin{aligned} F1=\frac{2 \times \text{ precision } \times \text{ recall } }{ \text{ precision } + \text{ recall } }. \end{aligned}$$
(5)
AUC The Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) is a critical evaluation metric in SDP research. When assessing the performance of a classifier, the ROC curve is constructed by varying the classification threshold. The x-axis (abscissa) of the ROC curve denotes the false positive rate (FP rate), while the y-axis (ordinate) represents the true positive rate (TP rate). The ROC curve is composed of the coordinate points derived from each classification threshold’s pair of FP and TP rates. The AUC, the area beneath the ROC curve, varies from 0 to 1, with higher values indicating better model performance.
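For illustration, the snippet below computes these non-effort-aware measures on a toy example; scikit-learn is used here purely for demonstration and is not necessarily part of our implementation.

```python
# Illustrative computation of the non-effort-aware measures on toy data.
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                    # ground-truth labels
y_prob = [0.9, 0.2, 0.7, 0.4, 0.6, 0.1, 0.8, 0.3]    # predicted defect probabilities
y_pred = [int(p >= 0.5) for p in y_prob]             # threshold at 0.5

print("Recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN), Eq. (3)
print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP), Eq. (4)
print("F1       :", f1_score(y_true, y_pred))         # Eq. (5)
print("AUC      :", roc_auc_score(y_true, y_prob))    # threshold-independent
```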
Effort-aware evaluation measures
Effort-aware conditions are implemented when testing resources are constrained or deadlines are imminent, representing the typical context in which defect prediction techniques are applied in real-world scenarios. Under such conditions, only a limited number of the predicted defect instances can be examined. Hence, in effort-aware scenarios, assessing the predictive performance using measures specifically tailored to these circumstances is essential. In this study, we utilize the PofB20 metric as the evaluation criterion for the effort-aware condition.
PofB2067 is a measure designed to quantify the proportion of defects a programmer can identify by examining 20% of the Lines of Code (LOC). This metric becomes applicable once the programmer has inspected 20% of the LOC within the test dataset. At this juncture, the PofB20 scores are expressed as the percentage of faults uncovered due to the inspection process. The possible range for PofB20 values lies between 0 and 1, where a higher value signifies a more efficient model performance. Essentially, this metric offers a focused lens on the model’s capability to prioritize and reveal the most significant defects early in the inspection process, thus serving as an essential indicator of the model’s practical utility in streamlining defect detection efforts under constrained conditions.
To calculate the PofB20 metric, we initiate the process by arranging the instances within the test files in descending order according to their confidence levels, i.e., the model’s assessed probabilities that each instance is defective; a higher confidence level suggests a greater likelihood of an instance being defective. Subsequently, we tally both the lines of code that have been scrutinized and the defects that have been uncovered in the process. The inspection halts once 20% of the Lines of Code (LOC) in the test dataset have been reviewed. At this point, the proportion of detected defects relative to this 20% examination is recorded as the PofB20 score. Essentially, a superior PofB20 score signifies the model’s enhanced efficiency in uncovering a larger number of bugs by examining a constrained segment of the LOC, highlighting the model’s effectiveness in prioritizing code segments that are most likely to contain defects.
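The sketch below shows one common way to compute PofB20 following the steps just described; the variable names and the treatment of the file that crosses the 20% LOC boundary are illustrative assumptions.

```python
# Sketch of a PofB20 computation: rank instances by predicted defect probability,
# inspect files until 20% of the total LOC is covered, and report the fraction of
# all defects found. Variable names are illustrative.
def pofb20(loc, labels, probs, budget=0.20):
    """loc: lines of code per instance; labels: true defect label (0/1);
    probs: predicted defect probability per instance."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    loc_budget = budget * sum(loc)
    inspected_loc, found = 0, 0
    for i in order:
        if inspected_loc >= loc_budget:
            break
        inspected_loc += loc[i]
        found += labels[i]
    total_defects = sum(labels)
    return found / total_defects if total_defects else 0.0
```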