To demonstrate the effectiveness of our GCN-based link prediction method, we have selected literature data from the Web of Science to conduct experiments, as described in Data acquisition. We conduct experiments to answer the following questions:
- Does the GCN-based method perform better than traditional machine learning models, including Naive Bayes, Logistic Regression, Random Forest, XGBoost, and SVM?
- Does the proposed heterogeneous GCN-based model more efficiently preserve structural information to achieve better performance than existing GCN models?
To answer the first question, we utilize two types of keyword features to construct the binary classification prediction model: TF-IDF features and LDA semantic features. The TF-IDF feature matrix, which is interchangeable with the word-document graph in Fig. 1, is generated from the collection of literature data. Given that the performance of traditional machine learning methods relies heavily on the quality of feature engineering, we use matrix factorization to decompose the word-document matrix and generate a word-topic matrix as intermediate features. Specifically, we use the LDA model to extract topic correlations for each keyword as semantic features. These two types of keyword features are created to demonstrate that GCN-based graph embeddings better preserve structural information and achieve superior performance compared to traditional machine learning models.
To answer the second question, we treat the informative text graph in Fig. 1 as a homogeneous network and apply a trivial GCN-based model, as an alternative to decomposing the text graph into relational subgraphs. It should be noted that the most recent work on co-word link prediction8, which achieves state-of-the-art performance, uses only TF-IDF features and LDA semantic features. By comparing models that share these features, we can analyze in depth the ability of the GCN model to capture the semantic information of textual graphs. The detailed settings and experimental process are elaborated in Experiment settings. After that, performance metrics of all models, including the GCN-based model and traditional methods, are reported in Results and analysis.
Experiment settings
The purpose of this experiment is to evaluate the performance of our heterogeneous GCN-based model in the task of co-word link prediction. In the comparative experiment, we use TF-IDF and LDA features and compare against the following methods: traditional machine learning methods such as Logistic Regression, Random Forest, and XGBoost; TabNet48, a deep learning model; AutoGluon47, an ensemble-based automated machine learning framework that stacks the above-mentioned base models; and Sentence-Bert49, a transformer-based model. In addition, we compare against a trivial GCN model. The experiment was conducted on a server equipped with two NVIDIA RTX 3090 GPUs (24GB of GDDR6X memory each), a 12th Gen Intel® Core™ i9-12900K CPU (16 cores, 24 threads), 64GB of system memory, and CUDA version 12.3.
For data collection, we selected documents downloaded from the Web of Science website. Only the abstracts were exported for analysis, as outlined in Data acquisition. Two datasets of documents were collected for evaluation: the GCN-related literature and the LLM-related literature. We collected 5,196 GCN-related documents. Literature from the first five years (2018-2022, with a total of 3,818 records) was used for training, and literature from the subsequent year (2023, with 1,378 records) was used for testing. Based on our statistics, a total of 46,069 words were identified in the literature published from 2018 to 2023. For this study, we used Yake to extract keywords to construct the co-word network, incorporating informative textual information. Specifically, we selected a total of 500 keywords. Literature from 2018-2022 was used for constructing the textual graph. The constructed textual graph for co-word link prediction had 121,842 co-word links, which are employed as positive samples for training. Co-word links among the 500 selected keywords in 2023, totaling 94,894, were used as the test set. We performed negative sampling at a 1:1 ratio (positive:negative instances). The dataset of LLM-related literature spans the same period: the 4,215 records from 2018-2022 were used for the training stage, while the remaining 1,412 records from 2023 were used for the prediction stage. Yake was also employed to extract keywords. Specifically, we selected 500 keywords to construct the textual graph, where 137,291 co-word links from 2018-2022 were used as positive samples for training, and 105,938 co-word links among these 500 selected keywords in 2023 were used as the test set. The negative sampling again applied a 1:1 ratio.
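The construction of positive samples described above can be sketched in a few lines of pure Python. The toy documents and keyword sets below are illustrative assumptions, not records from our corpus:

```python
from itertools import combinations

def build_coword_links(doc_keywords):
    """Collect undirected co-word links: every unordered pair of
    keywords that co-occurs in at least one document's keyword set."""
    links = set()
    for keywords in doc_keywords:
        # sort so (a, b) and (b, a) map to the same undirected edge
        for a, b in combinations(sorted(set(keywords)), 2):
            links.add((a, b))
    return links

docs = [
    ["graph", "convolution", "network"],
    ["graph", "embedding"],
]
links = build_coword_links(docs)  # 4 undirected links across the two toy documents
```

In the actual experiments, each element of `doc_keywords` would be the Yake-extracted keyword set of one abstract, restricted to the 500 selected keywords.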
We utilized Python 3.7 and the Scikit-learn package to build a TF-IDF feature matrix with dimensions of \(500\times 3818\). This TF-IDF feature matrix was used as both the word-document relations in Fig. 1 and the TF-IDF features for traditional models including Naive Bayes, Logistic Regression, Random Forest, XGBoost, and SVM. The TF-IDF feature matrix is further fed into TabNet48 and AutoGluon47 to enhance the performance. To evaluate recent transformer-based methods, we use S-Bert49 to generate word embeddings as features for testing. In addition, we utilized the Deep Graph Library (DGL) framework to process heterogeneous graph data and build GNNs. We compared against trivial GCN15 and Metapath2vec39, where the former is proposed for homogeneous networks and the latter is designed for heterogeneous networks. We employed two-layer GCNs as convolutional encoders for relational subgraphs, with a hidden layer size of 128, to generate embeddings.
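The keyword-by-document TF-IDF matrix can be sketched as follows. This is a minimal pure-Python stand-in for the Scikit-learn pipeline; the toy corpus and the (unsmoothed) IDF variant are illustrative assumptions:

```python
import math

def tfidf_matrix(keywords, docs):
    """Rows = keywords, columns = documents; entry = tf * idf,
    mirroring the 500 x 3818 word-document matrix described above."""
    n_docs = len(docs)
    # document frequency of each keyword
    df = {w: sum(1 for d in docs if w in d) for w in keywords}
    matrix = []
    for w in keywords:
        idf = math.log(n_docs / df[w]) if df[w] else 0.0
        matrix.append([d.count(w) * idf for d in docs])
    return matrix

docs = [["graph", "network", "graph"], ["topic", "model"], ["graph", "model"]]
m = tfidf_matrix(["graph", "model"], docs)
```

Because the rows are keywords and the columns are documents, the same matrix doubles as the weighted adjacency of the word-document relation in Fig. 1.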
The TF-IDF features have been directly employed as input for traditional machine learning methods. However, TF-IDF has its limitations: it only serves as a lexical-level feature and fails to capture semantic information. To address this problem, we also incorporate LDA semantic features46 as in8. LDA is a topic generation model that learns the probability distribution of topics over words from texts. It is commonly used in topic modeling across various disciplines, including social networks and language science. We define the LDA semantic feature for traditional machine learning methods as the cosine similarity between the corresponding two topic distribution vectors in the topic-word matrix. Cosine similarity is the most frequently employed approach for measuring directional similarity between two vectors. The formula can be expressed as follows:
$$\begin{aligned} \mathrm{{LDA}}(v_i,v_j)=\frac{\sum _{k=1}^{K} {v_{ki}\times v_{kj}} }{\sqrt{\sum _{k=1}^{K}v_{ki}^2}\times \sqrt{\sum _{k=1}^{K}v_{kj}^2}}. \end{aligned}$$
(10)
where \(v_i\) and \(v_j\) denote two topic distribution vectors and \(i\ne j\). We evaluated the number of topics K at values of 10, 20, 30, 40, and 50, observing AUC gains of between 1.1 and 1.3 percentage points compared with using no LDA features. We ultimately selected the best-performing value \(K=20\) as the number of topics in the LDA model and obtained the topic-word distribution matrix after training the algorithm until convergence. The resulting topic-word matrix is a \(500\times 20\) matrix with continuous values, roughly \(95\%\) of which are close to zero. We therefore select a small threshold (0.01) and assign a value of 1 to any element greater than the threshold and 0 to all other elements, interpreting a value above the threshold as indicating that the keyword belongs to the corresponding topic. Discretizing the matrix values in this way enhances the model's ability to accurately classify words into their corresponding topics.
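The cosine similarity of Eq. (10) and the thresholding step can be sketched in pure Python; the toy topic distribution vectors below are illustrative assumptions:

```python
import math

def lda_similarity(v_i, v_j):
    """Cosine similarity between two topic distribution vectors (Eq. 10)."""
    dot = sum(a * b for a, b in zip(v_i, v_j))
    norm = math.sqrt(sum(a * a for a in v_i)) * math.sqrt(sum(b * b for b in v_j))
    return dot / norm if norm else 0.0

def discretize(topic_word_matrix, threshold=0.01):
    """Binarize the topic-word matrix: 1 if a keyword belongs to a topic."""
    return [[1 if x > threshold else 0 for x in row] for row in topic_word_matrix]

v1, v2 = [0.7, 0.2, 0.1], [0.6, 0.3, 0.1]
sim = lda_similarity(v1, v2)
binary = discretize([[0.95, 0.003, 0.02]])
```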
Finally, we selectively combine these two features, the TF-IDF and word-topic matrices, to construct the keyword feature matrix for training the traditional learning methods. In terms of node feature selection, the trivial GCN model uses the same node features (including TF-IDF and LDA features) as the traditional learning methods. We use three types of relations, \({R_{w\sim w}}\), \({R_{w\sim d}}\), and \({R_{w\sim t}}\), to construct the co-word networks for the evaluation of Metapath2vec and our proposed method. Our proposed model uses a random initialization method to obtain node features. These node features are updated along with the parameters of the whole model to ensure that the training process is strictly end-to-end.
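The decomposition of the heterogeneous text graph into relational subgraphs can be sketched as grouping typed edges by relation; the edge triples below are illustrative assumptions, not part of our dataset:

```python
def split_relations(edges):
    """Group heterogeneous (source, relation, destination) edges into
    relational subgraphs keyed by relation type: 'w~w', 'w~d', 'w~t'."""
    subgraphs = {}
    for src, rel, dst in edges:
        subgraphs.setdefault(rel, []).append((src, dst))
    return subgraphs

edges = [
    ("gcn", "w~w", "embedding"),    # co-word relation
    ("gcn", "w~d", "doc_17"),       # word-document relation
    ("gcn", "w~t", "topic_3"),      # word-topic relation
    ("embedding", "w~d", "doc_17"),
]
subs = split_relations(edges)
```

Each edge list in `subs` would then be handed to its own two-layer GCN encoder; in practice this grouping is what a DGL heterogeneous graph represents internally.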
To thoroughly evaluate the impact of sampling strategies on model performance, we also compare our approach under various negative sampling methods, including Power of Degree50, RNS51, PinSAGE52, WARP53, and IRGAN54. Random negative sampling (uniformly selecting unconnected keyword pairs) may fail to capture challenging or informative negative samples. For example, randomly selected negative samples often include trivial cases (e.g., unrelated keywords in distant research domains), which provide limited learning signal.
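The simplest of these strategies, uniform random negative sampling at a 1:1 ratio, can be sketched as follows; the toy keyword list and positive links are illustrative assumptions:

```python
import random

def sample_negatives(keywords, positives, seed=0):
    """RNS-style uniform negative sampling at a 1:1 ratio:
    draw as many unconnected keyword pairs as there are positive links."""
    rng = random.Random(seed)
    negatives = set()
    while len(negatives) < len(positives):
        a, b = rng.sample(keywords, 2)
        pair = tuple(sorted((a, b)))
        if pair not in positives:  # keep only unconnected pairs
            negatives.add(pair)
    return negatives

keywords = ["gcn", "graph", "topic", "bert", "link"]
positives = {("gcn", "graph"), ("graph", "link")}
negs = sample_negatives(keywords, positives)
```

The degree-biased and adversarial strategies (Power of Degree, PinSAGE, WARP, IRGAN) replace the uniform draw with distributions that favor harder negatives.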
Results and analysis
We first conducted experiments to compare the performance of GCN-based methods with traditional machine learning methods in the task of co-word link prediction. All test results are averages over 1000 repetitions of training under the same experimental settings, ensuring the stability and reliability of the results. The experimental results are shown in Tables 1 and 2, which report the performance of the different methods in terms of accuracy, precision, recall, F1 score, and AUC.
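For completeness, the threshold-based metrics reported in Tables 1 and 2 can be computed from the binary predictions as follows (a minimal pure-Python sketch; the toy label vectors are illustrative assumptions):

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 from binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    acc = (tp + tn) / len(y_true)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, prec, rec, f1

y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
acc, prec, rec, f1 = classification_metrics(y_true, y_pred)
```

AUC, by contrast, is threshold-free: it is computed from the ranking of the predicted link scores rather than from hard 0/1 predictions.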
Experimental results on the dataset of GCN-related literature are depicted in Table 1. When traditional machine learning methods such as Naive Bayes, Logistic Regression, Random Forest, XGBoost, and SVM use only TF-IDF features, the best F1 and AUC values, \(80.16\%\) and \(88.87\%\), are both achieved by Random Forest. When LDA semantic features are added, Random Forest still attains the best F1 and AUC values among the traditional methods, \(80.53\%\) and \(89.12\%\), respectively. Even with LDA semantic features enriching the node information, the GCN-based methods outperform the traditional machine learning methods on almost all metrics; the only exception is the trivial GCN model, whose precision of \(82.59\%\) is slightly below Random Forest's \(83.24\%\).
Then, we conducted experiments to compare the performance of our proposed heterogeneous GCN-based model with the trivial GCN model in link prediction. As shown in Table 1, when we fused only two relations (\({R_{w\sim w}}\) and \({R_{w\sim d}}\)) in the heterogeneous GCN-based model, our proposed model achieved an AUC of \(92.14\%\), compared with \(89.62\%\) for the trivial GCN model. This shows that the heterogeneous GCN-based model fuses the information of different relations more effectively; in other words, it retains structural information more effectively than the trivial GCN model and thereby improves link prediction performance. To determine whether fusing an additional relation (in this case \({R_{w\sim t}}\)) contributes to the heterogeneous GCN-based model, we further combined it with \({R_{w\sim w}}\) and \({R_{w\sim d}}\) and repeated the test. The experimental results show that fusing one more relation improves the performance of link prediction to some extent: the AUC value increases from \(92.14\%\) to \(93.46\%\), and the F1 value increases from \(85.79\%\) to \(86.38\%\). Moreover, we compare the Precision-Recall curves of our method and several baseline methods, including trivial GCN, Metapath2vec, and TabNet. As illustrated in Fig. 2, our approach outperforms the other methods.
Experimental results on the dataset of LLM-related literature are illustrated in Table 2, and similar conclusions can be drawn. As depicted in Table 2, our method achieves the best performance, with an AUC of \(92.78\%\) and an F1 score of \(85.34\%\). Trivial GCN achieves an AUC of \(88.06\%\) and an F1 score of \(81.99\%\), second only to our approach among all methods. In summary, the GCN-based graph embeddings retain structural information better and thus perform better in capturing complex relationships between nodes and the global graph structure.
To evaluate the performance of different negative sampling algorithms on GNN models, we conducted experiments on three GNN-based methods: Metapath2vec, trivial GCN, and our proposed method. The dataset of GCN-related literature is employed for the evaluation. Five negative sampling methods have been tested, including RNS (Random Negative Sampling), Power of Degree, PinSAGE, WARP, and IRGAN. Experimental results are illustrated in Fig. 3. IRGAN achieves the best performance among all these negative sampling methods.
We evaluate the training time of GCN-based methods across varying scales of graph structures. As depicted in Fig. 4, we tested trivial GCN and our proposed method. For each test, we run the model for 1000 epochs and record the training time, with the number of keywords ranging from 1000 to 6000. Experimental results show that the trivial GCN achieves higher time efficiency than our method, because the trivial GCN treats the graph structure as a homogeneous network. Moreover, the GPU VRAM usage of trivial GCN and our method is illustrated in Fig. 5. As with the training time, our method requires more VRAM than the trivial GCN.

Comparison of the Precision-Recall curves of our method, trivial GCN, Metapath2vec, and TabNet. The experiment is conducted on the dataset of GCN-related literature.

Comparison of the AUC scores of three GCN-based models: Metapath2vec, trivial GCN, and our method, under a variety of negative sampling algorithms. The experiments are conducted on the GCN-related documents.

Comparison of the training time of trivial GCN and our method, under varying scales of textual graphs. The experiments are conducted on the GCN-related documents.

Comparison of the GPU’s VRAM usage of trivial GCN and our method, under varying scales of textual graphs. The experiments are conducted on the GCN-related documents.
Case examples of predictions confirm the validity of our approach. As an example from the dataset of GCN-related literature, the GCN-based methods successfully predicted a link between “drug-drug” and “pharmacology”, which the traditional Random Forest model failed to recognize. For the dataset of LLM-related literature, we present illustrative examples to elucidate the link prediction results. Our method identified co-word links between the keyword “language models” and the two keywords “psychosocial consequences” and “standard ontology”. With the advance of LLMs in domains such as healthcare and ontology, their impact on social relationships and individual psychology has become increasingly evident. Another example is the erroneous detection of potential co-word links between “deep learning” and the keywords “semantic interaction” and “language applicability.” We hypothesize that this false signal arose from weak historical correlations in past data, where the co-occurrence probability was low; consequently, similar associations did not emerge in the 2023 test set. Thus, despite being slightly inferior to traditional methods in some specific metrics, overall, the GCN-based methods have greater potential and advantages in dealing with complex graph data.
