Datasets
The data sets we employ come from TUDataset47 and the OGB datasets5. Table 1 presents the data of TUDataset, and Table 2 presents the data of the OGB datasets. TUDataset is a widely used dataset library for graph machine learning; its datasets cover a variety of domains, including chemistry5,48,49, biology50,51,52, and social networks53,54.
The data sets in TUDataset are graph-based: each sample is a graph object. Each graph consists of a set of nodes (vertices) and edges connecting them, along with optional node features and a graph label47. Node features are attribute or feature vectors associated with individual nodes, while the graph label is the category of the entire graph.
We tested the GNN models on selected datasets from TUDataset using random 10-fold cross-validation (split ratio 70/10/20) and observed the performance of stable learning on graph data drawn from a single distribution. In addition, we use the Grad-CAM55 technique to interpret MUTAG56, a commonly used graph classification data set consisting of 4337 molecular graphs, each classified into one of two categories according to its mutagenic effect.
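To make the evaluation protocol concrete, the random 70/10/20 split described above can be sketched as follows (a minimal illustration; the function name and seeding are ours, not from the original implementation):

```python
import random

def random_split(n_samples, seed=0, ratios=(0.7, 0.1, 0.2)):
    """Shuffle sample indices and split them 70/10/20 into
    train/validation/test; repeating with 10 different seeds
    gives the random 10-fold protocol described above."""
    rng = random.Random(seed)
    idx = list(range(n_samples))
    rng.shuffle(idx)
    n_train = int(ratios[0] * n_samples)
    n_val = int(ratios[1] * n_samples)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train, val, test = random_split(4337)  # e.g. the molecular graphs mentioned above
```

Each graph index appears in exactly one of the three subsets, so repeated runs with different seeds give independent folds.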
We used seven molecular property prediction datasets from the OGB collection5 to test the performance of the stable graph model in the OOD setting. All molecules were pre-processed using RDKit57. Each graph represents a molecule, where nodes are atoms and edges are chemical bonds. We used the 9-dimensional node features and 3-dimensional edge features provided by ref. 5, which yield better generalization performance. The task on these data sets is to predict target molecular properties, for example, whether a molecule inhibits HIV virus replication5. The input edge features are 3-dimensional and encode the bond type, bond stereochemistry, and an additional feature indicating whether the bond is conjugated.
Depending on the nature of the molecules, these data sets cover three kinds of tasks: binary classification, multilabel classification, and regression5. Unlike the commonly used random splitting, these datasets employ a scaffold splitting procedure58 that places molecules with different scaffolds into the training or test sets. All molecules sharing the same scaffold can be viewed as one environment, and scaffold splitting attempts to separate molecules with different scaffolds into different subsets59.
For example, the two molecules cyclopropanol (C3H6O) and 1,4-cyclohexanediol (C6H12O2) contain different scaffold patterns: the former's scaffold is a three-membered ring, while the latter's is a six-carbon ring. Although they are sampled from different distributions, both are easily soluble in water because the same invariant hydroxyl (–OH) subgraph is attached to the different scaffolds60. The environments of the training and test sets are different, resulting in different graph distributions. Therefore, this type of data can be used to test the generalizability of the stable GCN model on graphs with different distributions.
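The scaffold-splitting principle can be illustrated with a small sketch (the scaffold keys here are toy stand-ins for the Bemis-Murcko scaffolds RDKit would compute; the function and its grouping heuristic are ours):

```python
from collections import defaultdict

def scaffold_split(scaffolds, train_frac=0.8):
    """Assign whole scaffold groups to train or test so that no scaffold
    (environment) appears in both subsets, unlike a random split."""
    groups = defaultdict(list)
    for i, s in enumerate(scaffolds):
        groups[s].append(i)
    # Largest scaffold groups are assigned to training first.
    ordered = sorted(groups.values(), key=len, reverse=True)
    n_train = int(train_frac * len(scaffolds))
    train, test = [], []
    for g in ordered:
        (train if len(train) + len(g) <= n_train else test).extend(g)
    return train, test

# Toy molecules keyed by scaffold: cyclopropanol-like vs. cyclohexanediol-like.
keys = ["3-ring", "3-ring", "6C-ring", "6C-ring", "6C-ring", "6C-ring"]
train, test = scaffold_split(keys, train_frac=0.7)
```

Because whole scaffold groups move together, the test environment never overlaps the training environment, which is exactly what makes the split a distribution-shift benchmark.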
Weighted sampling module
The core concept of stable learning lies in forming a stable distribution through weighted sampling. The weighted sampling module used in this paper is made up of an RFF extractor and an LSWD optimizer26,29.
The RFF extractor uses the random Fourier transform to transform the feature sequence \(O \in {R^{l \times n}}\) output by the last graph convolutional layer of the GCN model into the RFF sequence \({\left\{ {H_i^j = (h_{i1}^j \ldots h_{in}^j) \in {R^{1 \times n}}} \right\} _{1 \le j \le r}}\), and then inputs the \({\left\{ {H_i^j} \right\} _{1 \le j \le r}}\) into the LSWD optimizer to complete the feature decorrelation in the random Fourier transform space and find the optimal sample weights that match it.
Finally, the statistical independence between any two different features \({O_i}\) and \({O_k}\) is maximized. This works because independence in the random Fourier transform space is equivalent to statistical independence in the original space28.
Random Fourier transforms and feature independence
A common approach to non-linearly mapping data into a higher-dimensional space is the inner product kernel map. Usually, we cannot directly obtain an explicit representation of the mapping in the high-dimensional space; for instance, the Gaussian kernel corresponds to an infinite-dimensional space, which is clearly not computable explicitly. The random Fourier transform, in contrast, is a low-dimensional method for approximating the kernel function.
The idea is to approximate the kernel function with d randomly parameterized Fourier transform functions \(z\left( \cdot \right)\): the raw D-dimensional data are mapped directly into a d-dimensional random feature space, in which the inner product of two mapped features is approximately equal to the value of the kernel function:
$$\begin{aligned} z(x)^{T}z(y) \approx k(x,y) = k(x - y) \end{aligned}$$
(1)
As shown in ref. 28, Eq. (1) indicates that the statistical independence between features x and y in the original feature space can be estimated by computing the cross-covariance matrix of the transformed representations z(x) and z(y) in the reproducing kernel Hilbert space (RKHS)61. If the cross-covariance matrix of z(x) and z(y) is the zero matrix, then the features x and y are statistically independent of each other.
As mentioned above, we can use the random Fourier transform to approximate the representations z(x) and z(y) in the RKHS, and thus reduce the covariance computation in the RKHS to an approximate computation in the low-dimensional random Fourier transform space.
Definition 1
The random Fourier transform function is given as follows:
$$\begin{aligned} \begin{aligned} {H_{RFF}} =&\{ h\;:x \rightarrow \sqrt{2} \cos \left( {\omega x + \phi } \right) |\omega \sim N(0,1), \\ &\phi \sim Uniform(0,2\pi ) \} \end{aligned} \end{aligned}$$
(2)
where \(\omega\) and \(\phi\) are a random frequency and a random phase sampled from the standard normal distribution \(N\left( {0,1} \right)\) and the uniform distribution \(Uniform\left( {0,2\pi } \right)\), respectively.
Let there be two one-dimensional random distributions \({D_x}\) and \({D_y}\); by randomly sampling each distribution n times, we obtain two n-dimensional random sequences \(X = \left\{ {{x_1},\ldots ,{x_n}} \right\}\) and \(Y = \left\{ {{y_1},\ldots ,{y_n}} \right\}\).
By randomly sampling k frequencies and k phases with the random Fourier transform function, we can further transform the sequences X and Y from the original space into \(\left\{ {H_{{x_1}}^{\omega ,\phi },\ldots ,H_{{x_n}}^{\omega ,\phi }} \right\}\) and \(\left\{ {H_{{y_1}}^{\omega ,\phi },\ldots ,H_{{y_n}}^{\omega ,\phi }} \right\}\), where \(H_{{x_n}}^{\omega ,\phi } = \left( {{H^{{\omega _1},{\phi _1}}}\left( {{x_n}} \right) ,\ldots ,{H^{{\omega _k},{\phi _k}}}\left( {{x_n}} \right) } \right)\) and \({H^{{\omega _k},{\phi _k}}}\left( {{x_n}} \right) = \sqrt{2} \cos {\left( {{\omega _k}{x_n} + {\phi _k}} \right) }\).
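In code, the transform of Definition 1 amounts to one cosine per (frequency, phase) pair and sample; a NumPy sketch under our own naming, where k is the number of sampled frequency/phase pairs:

```python
import numpy as np

def rff_transform(x, k, seed=0):
    """Map an n-sample sequence x to k random Fourier features per sample:
    H^{w_j, phi_j}(x_i) = sqrt(2) * cos(w_j * x_i + phi_j),
    with w_j ~ N(0, 1) and phi_j ~ Uniform(0, 2*pi)."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(k)            # random frequencies
    phi = rng.uniform(0.0, 2 * np.pi, k)  # random phases
    # Outer product pairs every sample x_i with every (w_j, phi_j).
    return np.sqrt(2.0) * np.cos(np.outer(x, w) + phi)  # shape (n, k)

H = rff_transform(np.linspace(-1.0, 1.0, 8), k=3)
print(H.shape)  # (8, 3)
```

Each row of H is the k-dimensional RFF representation of one sample, bounded by \(\sqrt{2}\) in absolute value.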
At this point, the estimated covariance matrix \({\Sigma _{xy}}\) with respect to z(x) and z(y) in the random Fourier transform space can be written as follows:
$$\begin{aligned} \begin{aligned} {\Sigma _{xy}} = {\frac{1}{n - 1}} \sum \limits _{i = 1}^n&{\left( H_{{x_i}}^{\omega ,\phi } - {\frac{1}{n}} \sum \limits _{j = 1}^n H_{{x_j}}^{\omega ,\phi }\right) ^T}\times \left( {H_{{y_i}}^{\omega ,\phi } - {\frac{1}{n}} \sum \limits _{j = 1}^n H_{{y_j}}^{\omega ,\phi }} \right) \end{aligned} \end{aligned}$$
(3)
The statistical independence measure \({I_{xy}}\) of X and Y is defined as the squared Frobenius norm62 of the covariance matrix \({\Sigma _{xy}}\), that is, \({I_{xy}} = \left\| {{\Sigma _{xy}}} \right\| _F^2\). As \({I_{xy}}\) decreases to 0, X and Y gradually tend to become independent of each other.
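Putting Eq. (3) and the measure \({I_{xy}}\) together, a self-contained NumPy sketch (naming is ours) behaves as expected: the measure is near zero for independent sequences and clearly larger for dependent ones:

```python
import numpy as np

def rff(x, k, seed):
    """Random Fourier features, as in Definition 1."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(k)
    phi = rng.uniform(0.0, 2 * np.pi, k)
    return np.sqrt(2.0) * np.cos(np.outer(x, w) + phi)

def independence_measure(x, y, k=5, seed=0):
    """I_xy = squared Frobenius norm of the cross-covariance (Eq. 3)
    between the RFF representations of x and y."""
    Hx, Hy = rff(x, k, seed), rff(y, k, seed + 1)
    Hx = Hx - Hx.mean(axis=0)  # center each RFF feature
    Hy = Hy - Hy.mean(axis=0)
    cov = Hx.T @ Hy / (len(x) - 1)  # (k, k) cross-covariance matrix
    return float(np.linalg.norm(cov, "fro") ** 2)

rng = np.random.default_rng(42)
x = rng.normal(size=5000)
i_indep = independence_measure(x, rng.normal(size=5000))  # x, y independent: small
i_dep = independence_measure(x, x)                        # y depends on x: larger
```

Because the RFFs capture nonlinear structure, this measure also flags nonlinear dependence that a plain Pearson correlation would miss.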
Feature decorrelation
In this section, we eliminate feature correlations by a sample weighting method, based on the RFF-space independence criterion \({I_{xy}} = 0\) introduced above. After sample weighting, the weighted covariance matrix of the random feature sequences X and Y in Eq. (3) can be re-expressed as follows:
$$\begin{aligned} \begin{aligned} {\Sigma _{xy}}\left( W \right) = {\frac{1}{n - 1}} \sum \limits _{i = 1}^n&{\left( {w_i}H_{{x_i}}^{\omega ,\phi } - {\frac{1}{n}} \sum \limits _{j = 1}^n {w_j}H_{{x_j}}^{\omega ,\phi }\right) ^T}\times \left( {{w_i}H_{{y_i}}^{\omega ,\phi } - {\frac{1}{n}} \sum \limits _{j = 1}^n {w_j}H_{{y_j}}^{\omega ,\phi }} \right) \end{aligned} \end{aligned}$$
(4)
Here, \(W = \left( {{w_1},\ldots ,{w_n}} \right) \in R_ + ^n\) denotes the sample weight vector, which satisfies \(\sum \limits _{i = 1}^n {w_i} = n\) and \({w_i} \ge 0\).
For m different features \({X^1},\ldots ,{X^m}\), we maximize the pairwise independence between features by minimizing the squared Frobenius norm of the weighted covariance matrices. The specific optimization formula is as follows:
$$\begin{aligned} \begin{aligned}&J\left( W \right) = \mathrm{{min}} \sum \limits _{i,j = 1,i \ne j}^m ||{\Sigma _{{X^i}{X^j}}}\left( W \right) ||_F^2 \\&\textit{s.t.}\left\{ \begin{array}{cl} \sum \limits _{i = 1}^n {w_i} = n; \\ {w_i} \ge 0. \end{array}\right. \end{aligned} \end{aligned}$$
(5)
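The weighted objective of Eqs. (4)-(5) can be sketched directly (NumPy, our own naming; `blocks` holds the RFF representation of each of the m features):

```python
import numpy as np

def weighted_cov(Hx, Hy, w):
    """Weighted cross-covariance of two RFF feature blocks (Eq. 4);
    w is a nonnegative weight vector summing to n."""
    n = len(w)
    Wx, Wy = w[:, None] * Hx, w[:, None] * Hy       # reweight each sample row
    return (Wx - Wx.mean(axis=0)).T @ (Wy - Wy.mean(axis=0)) / (n - 1)

def decorrelation_loss(blocks, w):
    """J(W): sum of squared Frobenius norms over all ordered pairs
    of distinct features (Eq. 5)."""
    total = 0.0
    for i, Hi in enumerate(blocks):
        for j, Hj in enumerate(blocks):
            if i != j:
                total += np.linalg.norm(weighted_cov(Hi, Hj, w), "fro") ** 2
    return float(total)

rng = np.random.default_rng(0)
blocks = [rng.normal(size=(50, 4)) for _ in range(3)]  # m = 3 feature blocks
w = np.ones(50)  # uniform weights satisfy sum(w) = n
loss = decorrelation_loss(blocks, w)
```

Minimizing this loss over w (subject to the simplex-like constraint) is exactly the job of the gradient update algorithm described in the next subsection.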

Figure 1. Framework for stable GNNs.
Gradient update algorithm for sampling weights
In the actual optimization process, we alternately optimize the weighted classification loss and the feature-independence loss of Eq. (5). Suppose that in the t-th iteration the feature extractor \({f^{\left( t \right) }}\) outputs the m features \({X^1}\left( t \right) ,\ldots ,{X^m}\left( t \right)\) and the predictor is \({g^{\left( t \right) }}\); the alternating optimization then proceeds as follows:
$$\begin{aligned} \begin{aligned} {W^{\left( t \right) }}&= \mathop {\min }\limits _{ \sum \limits _{i = 1}^n {w_i} = n,{w_i} \ge 0} \sum \limits _{i,j = 1,i \ne j}^m ||{\Sigma _{{X^i}\left( t \right) {X^j}\left( t \right) }}\left( W \right) ||_F^2 \\ {f^{\left( {t + 1} \right) }},{g^{\left( {t + 1} \right) }}&= \mathop {\min }\limits _{f,g} \sum \limits _{i = 1}^n w_i^{\left( t \right) }L\left( {g\left( {{X^i}\left( t \right) } \right) ,{y_i}} \right) \end{aligned} \end{aligned}$$
(6)
The optimization in Eq. (6) must satisfy the constraints \(\sum \limits _{i = 1}^n w_i^{\left( t \right) } = n\) and \(w_i^{\left( t \right) } \ge 0\). The usual gradient descent algorithm cannot enforce these constraints, so in this paper a gradient update algorithm is specially designed to satisfy them, as follows:

Algorithm 1. Gradient update algorithm for sampling weights.
This algorithm guarantees that the loss of Eq. (5) decreases while the updated w remains in the constraint set \({\Delta _n}\); the proof proceeds as follows:
- Step 1: Initialize \({w_i} = 1,1 \le i \le n\), with learning rate \(\lambda > 0\).
- Step 2: Compute the loss gradient \(\Delta w = {\frac{dL}{dw}}\) of Eq. (5). The convergence of the classical gradient descent algorithm is based on the first-order Taylor approximation:
$$\begin{aligned} \begin{aligned} L\left( w \right) \approx L\left( {w\left( 0 \right) } \right) + {\left( {{\frac{dL}{dw\left( 0 \right) }}} \right) ^T}\left( {w - w\left( 0 \right) } \right) \end{aligned} \end{aligned}$$
(7)
where \(w\left( 0 \right)\) is the weight vector obtained in the previous round and w is the weight vector to be determined by gradient descent in this round. According to the above equation, the negative gradient is a descent direction, so updating with \(w = w\left( 0 \right) - \lambda {\frac{dL}{dw\left( 0 \right) }}\) ensures that the loss decreases. Our modifications are in steps 3, 4, and 5.
- Step 3: Zero-mean the gradient, \(\Delta w = \Delta w - m\left( {\Delta w} \right)\) with \(m\left( {\Delta w} \right) = {\frac{1}{n}} \sum \limits _{i = 1}^n {\Delta w_i}\), and update \(w = w - \lambda \Delta w\). Updating w with the zero-meaned \(\Delta w\left( 0 \right)\) can be written as follows:
$$\begin{aligned} \begin{aligned} w = w\left( 0 \right) - \lambda \left( {{\frac{dL}{dw\left( 0 \right) }} - m\left( {\Delta w\left( 0 \right) } \right) e} \right) \end{aligned} \end{aligned}$$
(8)
where e is an n-dimensional all-ones vector. Furthermore, from Eq. (8) it can be computed that:
$$\begin{aligned} \begin{aligned} L&\left( w \right) - L\left( {w\left( 0 \right) } \right) \\&= - \lambda {\left( {{\frac{dL}{dw\left( 0 \right) }}} \right) ^T}\left( {{\frac{dL}{dw\left( 0 \right) }} - m\left( {\Delta w\left( 0 \right) } \right) e} \right) \\&= - \lambda \left( {{{\left( {{\frac{dL}{dw\left( 0 \right) }}} \right) }^T}{\frac{dL}{dw\left( 0 \right) }} - m\left( {\Delta w\left( 0 \right) } \right) {{\left( {{\frac{dL}{dw\left( 0 \right) }}} \right) }^T}e} \right) \\&= - \lambda \left( {{{\left( {{\frac{dL}{dw\left( 0 \right) }}} \right) }^T}{\frac{dL}{dw\left( 0 \right) }} - n{m^2}\left( {\Delta w\left( 0 \right) } \right) } \right) \end{aligned} \end{aligned}$$
(9)
Accordingly, letting the gradient vector be \({\frac{dL}{dw\left( 0 \right) }} = {({s_1},\ldots ,{s_n})^T}\), we have:
$$\begin{aligned} \begin{aligned} {m\left( {\Delta w\left( 0 \right) } \right) = {\frac{1}{n}} \sum \limits _{i = 1}^n {s_i}}\\ {{{\left( {{\frac{dL}{dw\left( 0 \right) }}} \right) }^T}{\frac{dL}{dw\left( 0 \right) }} = \sum \limits _{i = 1}^n s_i^2} \end{aligned} \end{aligned}$$
(10)
Subsequently, from the basic inequality, the following is obtained:
$$\mathop \sum \nolimits_{i = 1}^n s_i^2 \ge \frac{1}{n}{\left( {\mathop \sum \nolimits_{i = 1}^n {s_i}} \right)^2} = n{m^2}\left( {\Delta w\left( 0 \right)} \right)$$
(11)
Here, Eq. (11) shows that \({\left( {{\frac{dL}{dw\left( 0 \right) }}} \right) ^T}{\frac{dL}{dw\left( 0 \right) }} \ge n{m^2}\left( {\Delta w\left( 0 \right) } \right)\) always holds. Substituting this into Eq. (9) gives \(L\left( w \right) - L\left( {w\left( 0 \right) } \right) \le 0\). Therefore, we conclude that updating w with the zero-meaned \(\Delta w\left( 0 \right)\) still ensures that the loss in Eq. (5) decreases, as with the unmodified update.
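Eq. (11) is the power-mean (Cauchy-Schwarz) inequality and can be checked numerically for any gradient vector:

```python
import numpy as np

# Random gradient components s_1, ..., s_n as in Eq. (10).
s = np.random.default_rng(0).normal(size=50)
lhs = float(np.sum(s ** 2))           # (dL/dw(0))^T (dL/dw(0))
rhs = float(np.sum(s) ** 2 / len(s))  # n * m^2(delta w(0))
print(lhs >= rhs)  # True, so the descent term in Eq. (9) is non-positive
```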
- Step 4: Enforce non-negativity \(w \ge 0\): if \({w_i} < 0\), reset \({w_i} = 0\). Writing the gradient as \({\frac{dL}{dw\left( 0 \right) }} = {({s_1},\ldots ,{s_n})^T}\) with mean \(m\left( {\Delta w\left( 0 \right) } \right) = b\), the update step is \(\lambda ({\frac{dL}{dw\left( 0 \right) }} - m\left( {\Delta w\left( 0 \right) } \right) e) = \lambda {({s_1} - b,\ldots ,{s_n} - b)^T}\). By Eq. (11), \(\sum \limits _{i = 1}^n {s_i}\left( {{s_i} - b} \right) \ge 0\); we now show that this remains true after clipping.
Proof
Write \({w_i}\) as \({w_i} = {w_i}\left( 0 \right) - \lambda \left( {{s_i} - b} \right)\). If \({w_i} < 0\), then necessarily \({s_i} - b > 0\). The non-negativity operation is equivalent to reducing the step \({s_i} - b\) to some \({t_i}\) with \({t_i} \ge 0\).
If \({s_i} \ge 0\), then since \({t_i} \ge 0\), the modified sum \({s_1}\left( {{s_1} - b} \right) + \ldots + {s_i}{t_i} + \ldots + {s_n}\left( {{s_n} - b} \right)\) is still \(\ge 0\), which ensures that the loss of Eq. (5) decreases.
If \({s_i} < 0\), then \({s_i} - b> {t_i} \ge 0\) gives \({s_i}\left( {{s_i} - b} \right) \le {s_i}{t_i} \le 0\), so the term can only increase after clipping. This means that \({s_1}\left( {{s_1} - b} \right) + \ldots + {s_i}{t_i} + \ldots + {s_n}\left( {{s_n} - b} \right)\) is still \(\ge 0\), ensuring that the loss of Eq. (5) decreases. Thus, we conclude that enforcing non-negativity of w after the zero-meaning of \(\Delta w\left( 0 \right)\) still ensures a decrease of the loss in Eq. (5). \(\square\)
- Step 5: Normalize w: \(\overline{w} = {\frac{n}{ \sum \nolimits _{i = 1}^n {w_i}}} \times w\). Let \({w^T}\left( 0 \right) e = n\). First, note that updating \(w\left( 0 \right)\) with the zero-meaned gradient preserves the constraint, so the updated w still satisfies \({w^T}e = n\). The non-negativity operation then raises any \({w_i}\) below 0 up to 0, so afterwards \({w^T}e \ge n\). Finally, we normalize w so that it satisfies \(\alpha {w^T}e = n\) by scaling down with a factor \(\alpha \le 1\). Substituting into Eqs. (4), (5), and (6) shows that scaling w by \(\alpha \le 1\) also scales the matrix \({\Sigma _{{X^i}{X^j}}}\left( W \right)\), which further decreases \(L\left( w \right)\). The complexity of computing the sample weights is \(O(B \times N)\), where B is the number of epochs and N is the feature dimension. This completes the proof.
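Steps 1-5 combine into a single constrained update that can be sketched as follows (our own naming; in practice the gradient would come from the decorrelation loss of Eq. (5)):

```python
import numpy as np

def update_sample_weights(w, grad, lr=0.1):
    """One gradient update that keeps w in {w : w >= 0, sum(w) = n}:
    zero-mean the gradient (Step 3), descend, clip negative weights to
    zero (Step 4), then rescale so the weights again sum to n (Step 5)."""
    n = len(w)
    grad = grad - grad.mean()   # zero-meaning preserves sum(w) = n
    w = w - lr * grad
    w = np.maximum(w, 0.0)      # non-negativity
    return w * (n / w.sum())    # normalization (scaling factor <= 1)

w = np.ones(16)                 # Step 1: initialize w_i = 1
grad = np.random.default_rng(0).normal(size=16)
w = update_sample_weights(w, grad)
```

After every call, the weights remain nonnegative and sum to n, so the constraint set \({\Delta _n}\) is preserved throughout training.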
Stable-GNN
Finding the correct causal relationships between features and outcomes can improve the stability and generalization of a network43. A GNN alone can only capture correlations between sample features, not causality. To uncover causal relationships between features and outcomes, we must take measures to make the features as independent as possible across samples.
The overall architecture of Stable-GNN is shown in Fig. 1. Stable-GNN eliminates linear and nonlinear dependencies between features by exploiting the properties of RFF together with sample weighting26. The GNN extracts the feature map, which is then fed into the classifier to compute the prediction loss, while the dependencies are removed by the sample weighting process. The final loss is obtained by combining the learned sample weights with the prediction loss.
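The final loss combination amounts to reweighting the per-sample prediction loss by the learned weights; a minimal NumPy sketch (the cross-entropy form and all names are ours, for illustration only):

```python
import numpy as np

def weighted_cross_entropy(logits, labels, w):
    """Per-sample softmax cross-entropy, scaled by the sample weights
    produced by the weighted sampling module."""
    z = logits - logits.max(axis=1, keepdims=True)         # numerically stable
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    per_sample = -logp[np.arange(len(labels)), labels]
    return float(np.mean(w * per_sample))

logits = np.zeros((4, 2))                  # uninformative classifier
labels = np.array([0, 1, 0, 1])
loss = weighted_cross_entropy(logits, labels, np.ones(4))
print(round(loss, 4))  # 0.6931, i.e. log(2) under uniform weights
```

With non-uniform weights, samples from over-represented environments contribute less, which is what pushes the model toward the stable, decorrelated solution.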
Learning sample weights globally
Global learning of sample weights in deep learning tasks is frequently plagued by storage and computational costs. When using gradient optimisation, only a subset of the samples is observed in each batch, preventing the global weights of all samples from being learned directly. To address this difficulty, this study uses the weight saving and reloading technique introduced by Cui et al.28 to optimise sample weights, combining the features and sample weights obtained during training as global knowledge. In particular, assuming batch size B, a concatenation along the sample dimension combines the batch features and weights into a globally optimised feature Z and weight W:
$$W = {\text{Concat}}\left( {\omega _{1} ,\omega _{2} , \ldots ,\omega _{k} ,\omega _{l} } \right),\quad Z = {\text{Concat}}\left( {z_{1} ,z_{2} , \ldots ,z_{k} ,z_{l} } \right)$$
\({z_1},\;{z_2},\; \ldots ,\;{z_k}\) and \({\omega _1},\;{\omega _2},\; \ldots ,\;{\omega _k}\) stand for the global features and weights, respectively, which are dynamically updated during each training batch, while \({z_l}\) and \({\omega _l}\) are those of the current batch. This technique lowers the compute and storage overhead from \(O\left( N \right)\) to \(O\left( {kB} \right)\).
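The save-and-reload concatenation itself is simple; a sketch with our own naming, where k saved groups of batch size B are merged with the current batch along the sample dimension:

```python
import numpy as np

def merge_global(z_saved, w_saved, z_batch, w_batch):
    """Concatenate the k saved (feature, weight) groups with the current
    batch along the sample dimension, giving the global Z and W."""
    Z = np.concatenate(list(z_saved) + [z_batch], axis=0)
    W = np.concatenate(list(w_saved) + [w_batch], axis=0)
    return Z, W

B, d, k = 4, 3, 2                       # batch size, feature dim, saved groups
z_saved = [np.ones((B, d)) for _ in range(k)]
w_saved = [np.ones(B) for _ in range(k)]
Z, W = merge_global(z_saved, w_saved, np.zeros((B, d)), np.ones(B))
print(Z.shape, W.shape)  # (12, 3) (12,)
```

Only k saved groups plus the current batch ever reside in memory, which is the source of the \(O(kB)\) overhead stated above.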
