Decoding potential lncRNA and disease associations through graph representation learning and gradient boosting with histogram

Problem formulation

Consider two sets composed of m lncRNAs and n diseases, and let \(\varvec{Y} \in \mathbb {R}^{m \times n}\) represent the association matrix over all possible lncRNA-disease pairs (LDPs). For each LDP \((l_i,d_j )\), \(\varvec{Y}( l_i,d_j ) = 1\) denotes a verified linkage between lncRNA \(l_{i}\) and disease \(d_{j}\), and \(\varvec{Y}( l_i,d_j ) = 0\) otherwise. We aim to train a model that predicts which unverified LDPs are likely to be true associations.

Pipeline of LDA-GMCB

As shown in Fig. 6, LDA-GMCB mainly includes four stages: (a) Nonlinear feature learning based on graph representation learning with graph embedding and MSA-CNN. (b) Linear feature learning based on low-rank SVD. (c) Feature fusing based on concatenation operation. (d) LDP classification based on HGBoost.

Fig. 6. The illustration of LDA-GMCB.

Nonlinear feature extraction with graph representation learning

To learn nonlinear LDP representations, we combine biological similarity with graph representation learning. First, disease similarity and lncRNA similarity are computed. Then, a graph representation learning module learns deep latent nonlinear representations of lncRNAs and diseases by leveraging a graph embedding module and MSA-CNN. As shown in Fig. 6, each graph embedding module stacks one GCN layer, one GAT layer, and one further GCN layer. The MSA-CNN module learns node representations with different importance by integrating the outputs from different graph convolutional layers.

Disease semantic similarity

To build the disease similarity network, we employ MeSH descriptors to evaluate semantic similarities between diseases. Relationships between diseases are depicted by a directed acyclic graph (DAG), in which a node denotes the MeSH descriptor of a disease and an edge denotes the relationship between two diseases. The semantic similarity between \(d_i\) and \(d_j\) is then measured by Eq. (1):

$$\begin{aligned} \textrm{DSSM}(d_i,d_j)=\frac{\sum _{x\in {N}_{d_i}\cap {N}_{d_j}}({S}_{d_i}({x})+{S}_{d_j}({x}))}{\sum _{{x}\in {N}_{d_i}}{S}_{d_i}({x})+\sum _{{x}\in {N}_{d_j}}{S}_{d_j}({x})} \end{aligned}$$

(1)

where \({N}_{d_i}\) contains \(d_i\) and its ancestral diseases in DAG(\(d_i\)), and \({S}_{d_i}(x)\) is the semantic contribution of x to \(d_i\), given by Eq. (2):

$$\begin{aligned} {\left\{ \begin{array}{ll} S_{d_{i}}(x)=\max \left\{ (\Delta +\gamma _x)*S_{d_{i}}(x^{\prime })|x^{\prime }\in \text {children of } x\right\} & \text { if }~x\ne d_{i} \\ S_{d_{i}}(d_{i})=1 & \text { otherwise} \end{array}\right. } \end{aligned}$$

(2)

where \(\Delta\) represents the semantic contribution factor linking x and \(x^{\prime }\), and \(\gamma _x\) represents the information content (IC) contribution factor relating x to other diseases. \(\Delta\) was set to 0.5. For a disease x, the value of \(\gamma _x\) changes as MeSH is updated.
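As an illustration of Eqs. (1) and (2), the following is a minimal pure-Python sketch on a toy DAG. The disease names, the DAG itself, and the zero-valued \(\gamma _x\) placeholders are hypothetical; the recursion is read as running over the children of x that lie in DAG(\(d_i\)), as in the usual DAG-based formulation.

```python
# Hypothetical toy DAG: each term maps to its direct children (more specific terms).
CHILDREN = {
    "neoplasms": ["digestive_neoplasms"],
    "digestive_diseases": ["digestive_neoplasms"],
    "digestive_neoplasms": ["liver_neoplasms", "stomach_neoplasms"],
    "liver_neoplasms": [],
    "stomach_neoplasms": [],
}
DELTA = 0.5                               # semantic contribution factor (value from the text)
GAMMA = {term: 0.0 for term in CHILDREN}  # placeholder IC factors (assumption, not real MeSH values)


def descendants(x):
    """All terms strictly below x in the DAG."""
    found = set()
    for child in CHILDREN[x]:
        found.add(child)
        found |= descendants(child)
    return found


def dag_nodes(d):
    """N_d: the disease d together with all of its ancestors."""
    return {x for x in CHILDREN if x == d or d in descendants(x)}


def contribution(d, x):
    """S_d(x) from Eq. (2); x must belong to dag_nodes(d)."""
    if x == d:
        return 1.0
    nodes = dag_nodes(d)
    return max((DELTA + GAMMA[x]) * contribution(d, c)
               for c in CHILDREN[x] if c in nodes)


def dssm(di, dj):
    """Semantic similarity between two diseases, Eq. (1)."""
    ni, nj = dag_nodes(di), dag_nodes(dj)
    shared = sum(contribution(di, x) + contribution(dj, x) for x in ni & nj)
    total = (sum(contribution(di, x) for x in ni)
             + sum(contribution(dj, x) for x in nj))
    return shared / total
```

With these placeholder values, two sibling diseases share all of their ancestors, so their similarity is determined entirely by how much of each DAG's total contribution sits above the disease itself.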

lncRNA functional similarity

Since functionally similar lncRNAs tend to link with phenotypically similar diseases, functional similarity between \(l_i\) and \(l_j\) can be assessed via disease semantic similarity by Eq. (3):

$$\begin{aligned} \textrm{LFSM}(l_{i},l_{j})=\frac{\sum _{1\le {q}\le |D_{i}|}DS(d_{q},D_{j})+\sum _{1\le r\le |D_{j}|}DS(d_{r},D_{i})}{|D_{i}|+|D_{j}|}\quad \quad \end{aligned}$$

(3)

here

$$\begin{aligned} \textrm{DS}({d}_q,{D}_j)=\max _{1\le {t}\le |{D}_j|}(\textrm{DSSM}({d}_q,{d}_t)) \end{aligned}$$

(4)

where \(D_i\) denotes a set of diseases linking with \(l_i\), and \(\textrm{DS}(d_{r},D_{i})\) denotes the semantic similarity between \(d_r\) and \(D_i\).
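Eqs. (3) and (4) can be sketched with NumPy as follows. The function name `lfsm`, the index-list representation of \(D_i\) and \(D_j\), and the choice to return 0 when either set is empty are assumptions made for illustration; the disease similarity matrix is assumed to be symmetric.

```python
import numpy as np


def lfsm(dssm, diseases_i, diseases_j):
    """Functional similarity of two lncRNAs, Eqs. (3)-(4).

    dssm       : (n, n) symmetric disease semantic similarity matrix
    diseases_i : indices of diseases linked with l_i (the set D_i)
    diseases_j : indices of diseases linked with l_j (the set D_j)
    """
    if len(diseases_i) == 0 or len(diseases_j) == 0:
        return 0.0  # assumption: no associated diseases means no evidence
    # sub[q, t] = DSSM(d_q, d_t) for d_q in D_i and d_t in D_j
    sub = dssm[np.ix_(diseases_i, diseases_j)]
    ds_i_to_j = sub.max(axis=1)  # DS(d_q, D_j) for every d_q in D_i
    ds_j_to_i = sub.max(axis=0)  # DS(d_r, D_i) for every d_r in D_j
    return float((ds_i_to_j.sum() + ds_j_to_i.sum())
                 / (len(diseases_i) + len(diseases_j)))
```

Taking the best match in the other set for each disease, rather than averaging over all pairs, keeps one unrelated disease from diluting an otherwise strong overlap.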

Disease and lncRNA GAPK similarity

Since some diseases have no MeSH descriptors, and therefore no DAGs, their semantic similarity cannot be measured. We therefore exploit the topological structure of the LDA network and use GAPK to measure their similarity. Given the association profile \(\textrm{AP}(d_i)\) of \(d_i\), i.e., the i-th column of \(\varvec{Y}\), the GAPK similarity between \(d_i\) and \(d_j\) is measured by Eqs. (5) and (6):

$$\begin{aligned} \textrm{DGSM}(d_i,d_j)=\exp (-\mu ||\textrm{AP}(d_i)-\textrm{AP}(d_j)||^2) \end{aligned}$$

(5)

$$\begin{aligned} \mu =\frac{1}{\frac{1}{{n}}\sum _{{i=1}}^{{n}}||\textrm{AP}(d_{i})||^{2}} \end{aligned}$$

(6)

where \(\mu\) is used to control the kernel bandwidth. Similarly, GAPK similarity between \(l_i\) and \(l_j\) is measured by Eq. (7):

$$\begin{aligned} \textrm{LGSM}(l_i,l_j)=\exp (-\mu ||\textrm{AP}(l_i)-\textrm{AP}(l_j)||^2) \end{aligned}$$

(7)

$$\begin{aligned} \mu =\frac{1}{\frac{1}{{N_{l}}}\sum _{{i=1}}^{{N_{l}}}||\textrm{AP}(l_{i})||^{2}} \end{aligned}$$

(8)

where \(\textrm{AP}(l_i)\) denotes the association profile of \(l_i\), i.e., the i-th row of \(\varvec{Y}\).
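The kernel in Eqs. (5) to (8) can be sketched with NumPy as below. The function name `gapk` and the toy association matrix are hypothetical; the sketch assumes at least one known association, so that the mean squared norm is non-zero.

```python
import numpy as np


def gapk(profiles):
    """GAPK similarity between all pairs of rows in `profiles`.

    profiles : (num_entities, profile_length) array, one association
               profile per row.
    """
    sq_norms = (profiles ** 2).sum(axis=1)
    mu = 1.0 / sq_norms.mean()  # bandwidth, Eqs. (6) and (8)
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, computed for all pairs at once
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * profiles @ profiles.T
    return np.exp(-mu * np.maximum(sq_dists, 0.0))  # clip tiny negative round-off


# Toy association matrix: 4 lncRNAs (rows) x 3 diseases (columns).
Y = np.array([[1, 0, 1],
              [1, 0, 0],
              [0, 1, 0],
              [0, 1, 1]], dtype=float)
LGSM = gapk(Y)    # lncRNA profiles are the rows of Y
DGSM = gapk(Y.T)  # disease profiles are the columns of Y
```

Because the profiles come directly from \(\varvec{Y}\), this similarity is available for every lncRNA and disease, including those without MeSH descriptors.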

Similarity matrix fusion

To measure similarity from both biological characteristics and topological structure, we average functional similarity and GAPK similarity for lncRNAs, and semantic similarity and GAPK similarity for diseases, by Eq. (9):

$$\begin{aligned} {\left\{ \begin{array}{ll} & L_{ij} = ({\textrm{LFSM}(l_i,l_j)+\textrm{LGSM}(l_i,l_j)})/{2} \\ & D_{ij}= ({\textrm{DSSM}(d_i,d_j)+\textrm{DGSM}(d_i,d_j)})/{2} \end{array}\right. } \end{aligned}$$

(9)

Graph embedding module

Graph embedding techniques incorporate graph-based topological information and capture relationships between nodes through neighborhood aggregation. They are robust in learning discriminative node features, even when nodes have sparse or noise-contaminated features [22]. Here, we employ GCN to obtain representations of lncRNAs and diseases, respectively. Given the lncRNA similarity network \(G_l\) composed of \(N_l\) lncRNAs, its adjacency matrix \(\varvec{L} \in \mathbb {R}^{N_l \times N_l}\) (i.e., the similarity matrix), and input lncRNA representations \(\varvec{H} \in \mathbb {R}^{N_l \times F_l}\) with \(F_l\)-dimensional features, the output lncRNA representations \(\varvec{H}^{\textrm{new}}\) are computed by a GCN layer by Eqs. (10) and (11):

$$\begin{aligned} \varvec{H}^{\textrm{new}}=\textrm{GCN}(\varvec{L},\varvec{H}) \end{aligned}$$

(10)

$$\begin{aligned} \textrm{GCN}\left( \varvec{L},\varvec{H}\right) =\sigma \left( \varvec{A}^{-\frac{1}{2}}\widetilde{\varvec{L}}\varvec{A}^{-\frac{1}{2}}\varvec{HW}\right) \end{aligned}$$

(11)

where \(\widetilde{\varvec{L}}=\varvec{I}+\varvec{L}\) is the adjacency matrix with self-loops, \(\varvec{A}\) is the diagonal degree matrix with \(\varvec{A}_{ii}=\sum _j\widetilde{\varvec{L}}_{i,j}\), \(\varvec{W} \in \mathbb {R}^{F_l \times F_l}\) is the trainable weight matrix, and \(\sigma\) is the ReLU activation function.
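A single forward pass of Eq. (11) can be sketched with NumPy as follows. This is only an illustration of the propagation rule, not the authors' implementation; the function name `gcn_layer` is hypothetical, and similarities are assumed to be non-negative so that every degree is at least 1.

```python
import numpy as np


def gcn_layer(L, H, W):
    """One GCN propagation step, Eq. (11).

    L : (N, N) non-negative similarity (adjacency) matrix
    H : (N, F) input node features, one row per node
    W : (F, F_out) weight matrix
    """
    L_tilde = L + np.eye(L.shape[0])  # add self-loops
    degree = L_tilde.sum(axis=1)      # diagonal of the degree matrix A
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    # A^{-1/2} L_tilde A^{-1/2}, written with broadcasting instead of diag matrices
    norm_adj = d_inv_sqrt[:, None] * L_tilde * d_inv_sqrt[None, :]
    return np.maximum(norm_adj @ H @ W, 0.0)  # ReLU
```

The symmetric normalization keeps the scale of the aggregated features comparable across nodes with many strong neighbors and nodes with few.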

GAT assigns different weights to adjacent nodes according to their importance through the MSA mechanism. Hence, we introduce a GAT layer between the two GCN layers to help the following GCN layer learn more informative features for lncRNAs and diseases. For lncRNAs, the output node representations \(\varvec{H}^{\textrm{new}}\) of the GAT layer are given by Eqs. (12) and (13):

$$\begin{aligned} \varvec{H}^{\textrm{new}}=\textrm{GAT}(\varvec{L},\varvec{H}) \end{aligned}$$

(12)

$$\begin{aligned} \vec {\varvec{H}}_{i}^\textrm{new}=\sigma \left( \frac{1}{K}\sum _{k=1}^K\sum _{j\ne i}\phi _{ij}^k{\varvec{W}}_k\vec {\varvec{H}}_j\right) \end{aligned}$$

(13)

where \(\vec {\varvec{H}}_{i}^\textrm{new}\), K, \(\varvec{W}_k\), and \(\vec {\varvec{H}}_i\) denote the representation of \(l_i\) in \(\varvec{H}^{\textrm{new}}\), the number of attention mechanisms, the weight matrix of the k-th attention mechanism, and the input representation of \(l_i\), respectively. \(\phi _{ij}^k\) is the k-th attention coefficient between \(l_i\) and \(l_j\) and is computed by Eq. (14):

$$\begin{aligned} \phi _{{ij}}^{{k}}=\frac{\exp (LeakyReLU(a_{{k}}^{\top }[\varvec{W}_{{k}}\vec {\varvec{H}}_{i}||\varvec{W}_{k}\vec {\varvec{H}}_{j}||B_{k}\varvec{L}_{ij}]))}{\sum _{{t\ne i}}\exp (LeakyReLU(a_{{k}}^{\top }[\varvec{W}_{{k}}\vec {\varvec{H}}_{i}||\varvec{W}_{k}\vec {\varvec{H}}_{t}||B_{k}\varvec{L}_{it}]))} \end{aligned}$$

(14)

where \(a_k \in \mathbb {R}^{2F_l + 1}\) is a randomly initialized learnable weight vector for the k-th attention mechanism, || denotes the concatenation operation, \(B_k\) denotes the learnable weight of edge \(\varvec{L}_{ij}\), and LeakyReLU is an activation function with \(\textrm{LeakyReLU}(x)=\max (0.01x,x)\). The concatenation \([\varvec{W}_{{k}}\vec {\varvec{H}}_{i}||\varvec{W}_{k}\vec {\varvec{H}}_{j}||B_{k}\varvec{L}_{ij}]\) maps node pair features and edge features to the same space, enabling the attention mechanism to simultaneously capture the semantic similarity of nodes (\(\varvec{W}_k\vec {\varvec{H}}_i\) and \(\varvec{W}_k\vec {\varvec{H}}_j\)) and the association strength between nodes (\(B_k\varvec{L}_{ij}\)).
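The attention coefficients of Eq. (14) for a single attention mechanism can be sketched with NumPy as below. The function names are hypothetical, node features are stored as rows (so \(\varvec{W}_k\vec {\varvec{H}}_i\) becomes a row of `H @ W`), and at least two nodes are assumed so that each softmax has something to normalize over.

```python
import numpy as np


def leaky_relu(x, slope=0.01):
    return np.maximum(slope * x, x)


def attention_coefficients(L, H, W, a, b):
    """Edge-aware attention coefficients phi_ij for one head, Eq. (14).

    L : (N, N) similarity matrix (edge weights)
    H : (N, F) input node features, one row per node
    W : (F, F_out) weight matrix of this head
    a : (2 * F_out + 1,) attention weight vector
    b : scalar edge weight B_k
    """
    WH = H @ W  # transformed node features, (N, F_out)
    f_out = WH.shape[1]
    # a^T [WH_i || WH_j || b * L_ij] splits into three additive terms.
    src = WH @ a[:f_out]           # contribution of node i
    dst = WH @ a[f_out:2 * f_out]  # contribution of node j
    scores = leaky_relu(src[:, None] + dst[None, :] + a[-1] * b * L)
    np.fill_diagonal(scores, -np.inf)            # the sums in Eq. (14) exclude j = i
    scores -= scores.max(axis=1, keepdims=True)  # stabilize the softmax
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)
```

Each row of the result sums to 1 over the other nodes, so a node's coefficients can be read as a distribution over its neighbors.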

Graph embedding modules for lncRNAs and diseases learn feature representations from the corresponding similarity networks through GCN and GAT layers. Given the lncRNA similarity network \(G_l\), its adjacency matrix \(\varvec{L}\), and the input \(F_l\)-dimensional features \({\varvec{H}}_{l}^{(0)} \in \mathbb {R}^{N_l \times N_l}\) in \(G_l\), GCN and GAT are used alternately to learn graph representations of lncRNAs at different levels by Eq. (15):

$$\begin{aligned} {\left\{ \begin{array}{ll} & \varvec{H}_{l}^{(1)}= \textrm{GCN}(\varvec{L}, \varvec{H}_{l}^{(0)}) \\ & \varvec{H}_{l}^{(2)}= \textrm{GAT}(\varvec{L}, \varvec{H}_{l}^{(1)})\\ & \varvec{H}_{l}^{(3)}= \textrm{GCN}(\varvec{L}, \varvec{H}_{l}^{(2)}) \end{array}\right. } \end{aligned}$$

(15)

Similarly, given the adjacency matrix \(\varvec{D}\) and initial features \({\varvec{H}}_{d}^{(0)} \in \mathbb {R}^{N_d \times N_d}\) in disease similarity network \(G_d\), we employ GCN and GAT to capture multi-level node representations \(\varvec{H}_{d}^{(1)}\), \(\varvec{H}_{d}^{(2)}\) and \(\varvec{H}_{d}^{(3)}\) of diseases by Eq. (16):

$$\begin{aligned} {\left\{ \begin{array}{ll} & \varvec{H}_{d}^{(1)}= \textrm{GCN}(\varvec{D}, \varvec{H}_{d}^{(0)}) \\ & \varvec{H}_{d}^{(2)}= \textrm{GAT}(\varvec{D}, \varvec{H}_{d}^{(1)})\\ & \varvec{H}_{d}^{(3)}= \textrm{GCN}(\varvec{D}, \varvec{H}_{d}^{(2)}) \end{array}\right. } \end{aligned}$$

(16)

To boost their feature representations, we concatenate \(\varvec{H}^{(1)}\) and \(\varvec{H}^{(3)}\) of lncRNAs and diseases, respectively:

$$\begin{aligned} {\left\{ \begin{array}{ll} & \varvec{H}_{l}= \textrm{Concat}(\varvec{H}_{l}^{(1)},\varvec{H}_{l}^{(3)}) \\ & \varvec{H}_{d}= \textrm{Concat}(\varvec{H}_{d}^{(1)},\varvec{H}_{d}^{(3)}) \end{array}\right. } \end{aligned}$$

(17)

MSA mechanism

The MSA mechanism models complex relational patterns from multiple perspectives across different subspace projections through parallelized computation. Its multi-perspective, multi-granular structure balances model expressiveness, computational efficiency, and cross-task generalization performance [48]. Since node information from different layers contributes differently to predictions, we learn node representations with distinct importance through an MSA mechanism \(\text {MSA}(\cdot )\) followed by a 1D CNN \(\text {CNN}(\cdot )\) by Eq. (18):

$$\begin{aligned} {\left\{ \begin{array}{ll} & \varvec{Z}_{l}= \textrm{CNN}(\textrm{MSA}(\varvec{H}_{l})) \\ & \varvec{Z}_{d}= \textrm{CNN}(\textrm{MSA}(\varvec{H}_{d})) \end{array}\right. } \end{aligned}$$

(18)

Training

Based on the representations \(\varvec{Z}_{l}\) of lncRNAs and \(\varvec{Z}_{d}\) of diseases, the association matrix \(\varvec{R}\) between lncRNAs and diseases is computed by Eq. (19):

$$\begin{aligned} \varvec{R}={\varvec{Z}_{l}}^{\top } \varvec{Z}_{d} \end{aligned}$$

(19)

A higher \(\varvec{R}_{ij}\) indicates a greater association possibility between lncRNA \(l_i\) and disease \(d_j\). Binary cross-entropy is taken as the loss function to assess the difference between the predictions \(\varvec{R}\) and the original matrix \(\varvec{Y}\) when training the nonlinear representation learning model. Minimizing this loss yields the nonlinear representations \(\varvec{Z}_{l}\) and \(\varvec{Z}_{d}\) of lncRNAs and diseases. After the MSA-CNN operation, the obtained \(\varvec{Z}_l\) and \(\varvec{Z}_d\) have a stable data distribution, so they need no normalization. Moreover, the dot product is a common, general-purpose similarity measure. Compared with other similarity measures, it directly reflects the strength of association between lncRNA and disease representation vectors. It also has low computational complexity and scales to large datasets. Thus, we use the dot product to combine lncRNA and disease representations.
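The scoring and loss steps can be sketched with NumPy as below. Two assumptions are made that the text does not state: each row of \(\varvec{Z}_{l}\) and \(\varvec{Z}_{d}\) holds one node embedding, so the dot products of Eq. (19) are collected as `Z_l @ Z_d.T`, and a sigmoid maps raw scores to probabilities before the cross-entropy is taken. The function names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def association_scores(Z_l, Z_d):
    """Dot product between every lncRNA embedding and every disease embedding.

    Z_l : (N_l, F) lncRNA embeddings, one row per lncRNA (assumed layout)
    Z_d : (N_d, F) disease embeddings, one row per disease (assumed layout)
    Returns an (N_l, N_d) score matrix R.
    """
    return Z_l @ Z_d.T


def bce_loss(R, Y, eps=1e-12):
    """Mean binary cross-entropy between sigmoid(R) and the 0/1 matrix Y."""
    P = np.clip(sigmoid(R), eps, 1.0 - eps)  # avoid log(0)
    return float(-(Y * np.log(P) + (1.0 - Y) * np.log(1.0 - P)).mean())
```

In a trainable model, the gradient of this loss would be propagated back through the MSA-CNN and graph embedding layers; the sketch covers the forward computation only.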

Linear feature extraction

Recommendation systems [76] have demonstrated powerful linear feature learning ability in various supervised learning tasks. Low-rank SVD is an efficient approximation method that maps a high-dimensional matrix to a lower-dimensional subspace through random projection followed by exact decomposition. Here, we use a low-rank SVD algorithm to extract linear representations of lncRNAs and diseases.

Given \(\varvec{Y}\), we first generate a random Gaussian matrix \(\Omega \in \mathbb {R}^{n \times (q + k)}\) based on the given rank q and oversampling parameter k. Next, we obtain a more stable projection matrix \(\varvec{P} \in \mathbb {R}^{m \times (q + k)}\) by applying power iteration to the projection \(\varvec{Y}\Omega\). Finally, we compute an orthogonal basis matrix \(\varvec{Q} \in \mathbb {R}^{m \times (q + k)}\) based on QR decomposition by Eq. (20):

$$\begin{aligned} \varvec{P}= \varvec{QR},\qquad \varvec{Q}^{\top }\varvec{Q}=\varvec{I} \end{aligned}$$

(20)

According to the orthogonal basis matrix \(\varvec{Q}\) and original LDA matrix \(\varvec{Y}\), we construct a reduced matrix \({\varvec{B}}=\varvec{Q}^\top \varvec{Y}\) and perform full SVD on \({\varvec{B}}\) by Eq. (21):

$$\begin{aligned} {\varvec{B}} = \tilde{ \varvec{U}} \Sigma \varvec{V}^{\top } \end{aligned}$$

(21)

Finally, the low-rank approximation of \(\varvec{Y}\) is represented by Eq. (22):

$$\begin{aligned} \hat{\varvec{Y}}= \varvec{U} \Sigma \varvec{V}^{\top }, \varvec{U} = \varvec{Q} \tilde{ \varvec{U}} \end{aligned}$$

(22)

where \(\varvec{U} \in \mathbb {R}^{m \times q}\) and \(\varvec{V} \in \mathbb {R}^{n \times q}\) denote the linear embeddings of lncRNAs and diseases, respectively, and \(\Sigma \in \mathbb {R}^{q \times q}\) is a diagonal matrix containing singular values.
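The procedure of Eqs. (20) to (22) can be sketched with NumPy as follows. The function name, the default oversampling and iteration counts, and the QR re-orthonormalization inside the power iteration are illustrative choices rather than settings taken from the text; the sketch assumes \(q + k \le \min (m, n)\).

```python
import numpy as np


def low_rank_svd(Y, q, k=5, n_iter=2, seed=0):
    """Randomized rank-q SVD of Y.

    Returns U (m, q), the singular values s (q,), and V (n, q).
    """
    rng = np.random.default_rng(seed)
    n = Y.shape[1]
    Omega = rng.standard_normal((n, q + k))  # random Gaussian test matrix
    P = Y @ Omega                            # initial projection, (m, q + k)
    for _ in range(n_iter):                  # power iteration sharpens the subspace
        P, _ = np.linalg.qr(P)               # re-orthonormalize for numerical stability
        P = Y @ (Y.T @ P)
    Q, _ = np.linalg.qr(P)                   # orthonormal basis, Eq. (20)
    B = Q.T @ Y                              # reduced matrix, (q + k, n)
    U_tilde, s, Vt = np.linalg.svd(B, full_matrices=False)  # Eq. (21)
    U = Q @ U_tilde                          # lift back to the original space, Eq. (22)
    return U[:, :q], s[:q], Vt[:q].T         # keep the leading q components
```

The rows of the returned `U` and `V` would then serve as the linear embeddings of lncRNAs and diseases. The expensive decomposition is applied only to the small matrix `B`, which is where the savings over a full SVD of \(\varvec{Y}\) come from.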

LDA prediction

Through graph representation learning and low-rank SVD, we learn nonlinear and linear features of lncRNAs and diseases, and concatenate them to obtain the final hybrid feature matrices \(\varvec{X}_{l}\) and \(\varvec{X}_{d}\) for predictions. Consequently, the final descriptor of an LDP \((l_i,d_j)\) is given by Eq. (23):

$$\begin{aligned} {z}_{ij}=[\varvec{X}_{l}({i}),\varvec{X}_{d}({j})] \end{aligned}$$

(23)

where \(\varvec{X}_{l}({i})\) denotes the i-th row in \(\varvec{X}_{l}\) and \(\varvec{X}_{d}({j})\) denotes the j-th row in \(\varvec{X}_{d}\).
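Building the descriptors of Eq. (23) for a batch of pairs can be sketched with NumPy as below. The function name `pair_features` and the toy feature matrices are hypothetical.

```python
import numpy as np


def pair_features(X_l, X_d, pairs):
    """Concatenate lncRNA and disease feature rows for each (i, j) pair.

    X_l   : (N_l, F_l) hybrid lncRNA features
    X_d   : (N_d, F_d) hybrid disease features
    pairs : non-empty sequence of (i, j) index pairs
    Returns a (len(pairs), F_l + F_d) matrix whose rows are z_ij.
    """
    i, j = np.asarray(pairs).T
    return np.hstack([X_l[i], X_d[j]])


# Toy example: 3 lncRNAs with 2 features each, 2 diseases with 3 features each.
X_l = np.arange(6, dtype=float).reshape(3, 2)
X_d = np.arange(6, dtype=float).reshape(2, 3) + 10.0
Z = pair_features(X_l, X_d, [(0, 1), (2, 0)])
```

Each row of `Z` is the input that the classifier sees for one LDP, paired with the corresponding entry of \(\varvec{Y}\) as its label.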

HGBoost is a scalable ensemble learning model that combines gradient boosting with a histogram-based optimization algorithm. During each iteration, HGBoost bins feature values to build histograms, approximates the information gain of potential splits, and selects the optimal threshold for node splitting. This approximation avoids the cost of sorting feature values and accelerates training by searching for split points across multiple features in parallel. For an LDP \({z}_{ij}\) with true label \(y_t\) and predicted label \(\hat{y}_t\), HGBoost minimizes the loss function in Eq. (24):

$$\begin{aligned} \mathscr {L}(y, \hat{y}) = -\frac{1}{N_{ld}} \sum _{t=1}^{N_{ld}} \left[ y_t \ln (\hat{y}_t) + (1 - y_t) \ln (1 - \hat{y}_t) \right] \end{aligned}$$

(24)

where \(N_{ld}\) is the number of LDPs.
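The histogram-based split search can be illustrated for a single feature with the NumPy sketch below. It is not the authors' HGBoost implementation: the quantile binning, the regularized second-order gain formula, and all names and default values are assumptions commonly found in histogram-based boosters. The gradients and Hessians are those of the loss in Eq. (24) with respect to the raw score, i.e., \(\hat{y}_t - y_t\) and \(\hat{y}_t(1-\hat{y}_t)\).

```python
import numpy as np


def best_histogram_split(x, grad, hess, n_bins=16, lam=1.0):
    """Find the best threshold for one non-constant feature from binned statistics.

    x    : (N,) feature values
    grad : (N,) per-sample gradients of the loss
    hess : (N,) per-sample Hessians of the loss
    Returns (threshold, gain); samples with x <= threshold go to the left child.
    """
    # Candidate thresholds: interior quantiles of the feature.
    edges = np.unique(np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1)[1:-1]))
    bins = np.searchsorted(edges, x, side="left")  # bin b holds edges[b-1] < x <= edges[b]
    n = len(edges) + 1
    G = np.bincount(bins, weights=grad, minlength=n)  # gradient histogram
    H = np.bincount(bins, weights=hess, minlength=n)  # Hessian histogram
    G_tot, H_tot = G.sum(), H.sum()
    G_left, H_left = np.cumsum(G)[:-1], np.cumsum(H)[:-1]  # one entry per candidate edge
    G_right, H_right = G_tot - G_left, H_tot - H_left
    gain = (G_left ** 2 / (H_left + lam)
            + G_right ** 2 / (H_right + lam)
            - G_tot ** 2 / (H_tot + lam))
    best = int(np.argmax(gain))
    return float(edges[best]), float(gain[best])
```

Only `n_bins - 1` candidate thresholds are scored per feature instead of one per distinct value, and the per-feature searches are independent of one another, which is what allows them to run in parallel.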


