Integrating multimodal cancer data using deep latent variable path modelling

DLVPM

Here, we give a general introduction and a full technical treatment of the DLVPM method. DLVPM can be thought of as a generalization of PLS-PM²³. PLS-PM can be considered, in turn, to be a generalization of canonical correlation⁸¹. It is therefore natural to build an understanding of DLVPM with reference to these simpler methods. It is worth noting that we could have called our method Deep PLS-PM. However, we felt that DLVPM was more descriptive. We also wished to avoid confusion with the more popular PLS regression procedure.

The description of the DLVPM method we present here is broken into three basic parts:

1.

a description of shallow (that is, non-deep-learning-based) methods for establishing correlations between different data types;
2.

deep neural networks and notation;
3.

a description of DLVPM, and how deep learning can be used to identify complex, nonlinear associations between different data types.

Canonical correlation analysis

Canonical correlation analysis (CCA) is a statistical method used to identify linear relationships between two or more sets of variables⁸¹. This method can be thought of as a generalization of linear least squares regression. The objective of CCA is to identify a relationship between two (or more) sets of variables, where there is no distinction between which variables are considered dependent and which are considered independent. This method identifies weights for each variable, such that the weighted sum of variables in each set is maximally correlated with the weighted sum of variables from the opposite set, assuming a linear relationship⁸¹.

Consider two matrices X₁ and X₂, where each row denotes one of N observations, and each column denotes p₁ or p₂ features for X₁ and X₂, respectively. CCA is optimized to find weight vectors w₁ and w₂ that maximize the association

$$\max_{{w}_{1},{w}_{2}}{\rm{corr}}({X}_{1}{w}_{1},{X}_{2}{w}_{2}).$$

We assume that the columns of X₁ and X₂ have been standardized to have a mean of zero and a standard deviation of one. Using the equation used to find Pearson’s correlation coefficient, we get

$$\max_{{w}_{1},{w}_{2}}\frac{{w}_{1}^{{\rm{T}}}{{X}_{1}}^{{\rm{T}}}{X}_{2}{w}_{2}}{{\Vert {X}_{1}{w}_{1}\Vert }_{2}{\Vert {X}_{2}{w}_{2}\Vert }_{2}}.$$

Notice that the denominator is simply a normalization term. Therefore, the canonical correlation objective can also be written as

$$\max_{{w}_{1},{w}_{2}}{w}_{1}^{{\rm{T}}}{{X}_{1}}^{{\rm{T}}}{X}_{2}{w}_{2}$$

subject to the constraints

$${\Vert{X}_{1}{w}_{1}\Vert}_{2}^{2}=1\,{\rm{and}}\,{\Vert{X}_{2}{w}_{2}\Vert}_{2}^{2}=1.$$

Here the vectors X₁w₁ and X₂w₂ are referred to as canonical variates.

In the original formulation, the canonical weights that maximize the association between the two data views are normally found using eigenvalue decomposition. It is possible to find multiple modes of variation using this method. Here the correlation between subsequent canonical variates is maximized subject to their being uncorrelated with other canonical variates. A total of n_dims = min(p₁, p₂) canonical variates can be extracted in this way. This can be written as

$$\max_{{W}_{1},{W}_{2}}{\rm{tr}}({({X}_{1}{W}_{1})}^{T}{X}_{2}{W}_{2})$$

subject to the orthogonalization constraints

$${({X}_{1}{W}_{1})}^{T}{X}_{1}{W}_{1}=I$$

and

$${({X}_{2}{W}_{2})}^{T}{X}_{2}{W}_{2}=I$$

where W₁ and W₂ are p₁ × n_dims and p₂ × n_dims matrices, respectively; and I is an n_dims × n_dims identity matrix.

Generalizing CCA

Hotelling’s original formulation of CCA was designed to identify associations between two data views. Researchers have generalized CCA to more than two data views⁸². There are a number of different ways in which this can be done. One way to generalize the CCA approach is to optimize the sum of correlations between different data views. This involves maximizing the following criteria:

$$\max_{{W}_{1},{W}_{2},\ldots, {W}_{K}}\mathop{\sum }\limits_{i,j,i\ne j}^{K}{\rm{tr}}({({X}_{i}{W}_{i})}^{{\rm{T}}}{X}_{j}{W}_{j})$$

where K is the total number of data views under analysis, and i and j index different data views, subject to the orthogonalization constraints

$${({X}_{i}{W}_{i})}^{{\rm{T}}}{X}_{i}{W}_{i}=I\;\forall i.$$

PLS-PM

The canonical correlation procedures described in the text above can be used to identify latent variables that are highly correlated between multiple data types. In some cases, we may wish to identify associations between some—but not all—data types. For example, a particular disease phenotype may have both genetic and environmental causes. It does not make much sense to try to link these genetic and environmental causes as they should be independent. Any model that attempts to link these data types may end up highlighting spurious effects.

The mathematical framework above, described with relation to generalized CCA, can be used to formulate a kind of structural-equation-modelling procedure called PLS-PM^83,84. Using PLS-PM, it is possible to identify associations between prespecified data types. Utilizing this method, we specify which data types are connected with one another using a predefined adjacency matrix C. The adjacency matrix is a square matrix in which the elements c_ij represent connections between views i and j:

$$C=\left(\begin{array}{cc}\begin{array}{cc}{c}_{11} & {c}_{12}\\ {c}_{21} & {c}_{22}\end{array} & \begin{array}{cc}\cdots & {c}_{1K}\\ \cdot & {c}_{2K}\end{array}\\ \begin{array}{cc}\vdots & \cdot \\ {c}_{K1} & {c}_{K2}\end{array} & \begin{array}{cc}\cdot & \vdots \\ \cdots & {c}_{KK}\end{array}\end{array}\right)c_{ij}\in\{0,1\}\;\forall i,j$$

where K is again the total number of data types under analysis.

The optimization criteria can then be written as

$$\max_{{W}_{1},{W}_{2},\ldots, {W}_{K}}\mathop{\sum }\limits_{i,j,i\ne j}^{K}{c}_{ij}{\rm{tr}}\left({({X}_{i}{W}_{i})}^{{\rm{T}}}X_{j}W_{j}\right)$$

subject to the constraints

$${({X}_{i}{W}_{i})}^{{\rm{T}}}{X}_{i}{W}_{i}=I\;\forall i,$$

where c_ij represents the binary indexed elements of C.

Using PLS-PM, the full modelling process is normally referred to in two parts: the structural model and the measurement model. The structural model is the part of the model that defines which inter-relations are to be optimized between the data types; this information is stored in C. The measurement model is the part of each model (denoted by X_iW_i ∀ i) that links individual features to latent variables in the path model²³.

Deep neural networks and notation

Neural networks are computational models composed of layers of interconnected ‘neurons’ that perform calculations. During training, these networks adjust neuron connection weights via backpropagation, where they compute the gradient of a loss function (the difference between predicted and actual data) and iteratively update the network weights. The outputs of most neural networks can be written in the very general form:

where $\overline{Y}$ is the network output and F(X, U) is some function that takes an input X and passes it though sets of weights and biases U. This could be many kinds of neural network, for example, a feed-forward neural network, a convolutional network or a transformer. The network output can be written more simply still as $\overline{Y}(X,U\;)$.

Each of the methods described in the text below relies on the last layer of the neural network having a linear projection weight. Treating this weight differently in the notation is crucial to understand the mechanisms by which DLVPM functions. Therefore, neural networks processing individual data types in DLVPM are written as $\overline{Y}(X,U,W\;)$, where U represents all weights and biases in the network up to the penultimate layer and W represents the weights on the last layer of the network. We use this very simple notation to denote neural networks in the rest of the text.

Deep canonical correlation

Andrew et al. developed a two-view form of CCA, which they termed deep CCA⁸⁵. Deep CCA creates highly correlated representations of two data types by passing them through deep neural networks. The goal of the algorithm is to learn weights and biases of both data views such that we seek to maximize

$$\max_{{U}_{1},{U}_{2},{w}_{1}^{1},{w}_{1}^{2},\ldots, {w}_{1}^{{n}_{{\rm{dims}}}},{w}_{2}^{1},{w}_{2}^{2},\ldots, {w}_{2}^{{n}_{{\rm{dims}}}}}\mathop{\sum }\limits_{n=1}^{{n}_{{\rm{dims}}}}{\rm{corr}}\left(\;{\bar{y}}_{1}^{n}\left({X}_{1},{U}_{1},{w}_{1}^{n}\right),{\bar{y}}_{2}^{n}\left({X}_{2},{U}_{2},{w}_{2}^{n}\right)\right)$$

subject to the orthogonalization constraint

$${\left(\;{\overline{y}}_{1}^{\;n}\right)}^{\rm{T}}{\overline{y}}_{1}^{\;m}=1\,{\rm{when}}\,n=m\,{\rm{and}}\,{\left(\;{\overline{y}}_{1}^{\;n}\right)}^{\rm{T}}{\overline{y}}_{1}^{\;m}=0\,{\rm{when}}\,n\ne m,$$

where n_dims is the total number of canonical variates we wish to extract.

This optimization problem can be written in the matrix form as

$$ma{x}_{{U}_{1},{U}_{2},{W}_{1},{W}_{2}}tr({\overline{Y}}_{1}{({X}_{1},{U}_{1},{W}_{1})}^{\rm{T}}{{\overline{Y}}_{2}}({X}_{2},{U}_{2},{W}_{2}))$$

subject to the orthogonalization constraint

$${{\overline{Y}}_{1}}^{\rm{T}}{\overline{Y}}_{1}=I\,{\rm{and}}\,{{\overline{Y}}_{2}}^{\rm{T}}{\overline{Y}}_{2}=I$$

where ${\bar{Y}}_{i}$ is a column-wise concatenation of ${\bar{Y}}_{i}={\bar{y}}_{i}^{1}\leftrightarrow {\bar{y}}_{i}^{2}\leftrightarrow \ldots \leftrightarrow {\bar{y}}_{i}^{{n}_{\rm{dims}}}$ (↔ signifies the column-wise concatenation of CCA factors) and I is the identity matrix. Here W₁ and W₂ represent the set of all weights ${w}_{1}^{1},{w}_{1}^{2},\ldots ,{w}_{1}^{{n}_{\rm{dims}}}$ and ${w}_{2}^{1},{w}_{2}^{2},\ldots ,{w}_{2}^{{n}_{\rm{dims}}}$, respectively.

Andrew et al.’s formulation of this procedure operates by taking the derivative of the cross-covariance matrix between the data views. However, this approach is difficult to generalize to more than two data views. Wang et al.⁸⁶ formulated an iterative least squares approach to this method. This involves minimizing the loss

$$L({X}_{1},{X}_{2},{U}_{1},{U}_{2},{W}_{1},{W}_{2})=\frac{1}{2}{\left\Vert \frac{{\bar{Y}}_{1}({X}_{1},{U}_{1},{W}_{1})}{{\bar{Y}}_{1}{\Vert ({X}_{1},{U}_{1},{W}_{1})\Vert }}-\frac{{\bar{Y}}_{2}({X}_{2},{U}_{2},{W}_{2})}{{\Vert {\bar{Y}}_{2}({X}_{2},{U}_{2},{W}_{2})\Vert }}\right\Vert}_{F}^{2}$$

subject to the orthogonalization constraint given above. We use a similar iterative least squares regression approach in the present investigation.

DLVPM

The goal of the DLVPM algorithm is to identify orthogonal modes of association between data views connected by the user-defined adjacency matrix C. As before, the adjacency matrix is a square matrix in which the elements c_ij represent connections between views i and j. This method is essentially a deep analogue of PLS path modelling. This adjacency matrix is often referred to as the structural or path model.

We therefore seek to maximize

$$ma{x}_{{W}_{1},{W}_{2},\ldots, {W}_{K},{U}_{1},{U}_{2},\ldots, {U}_{K}}\mathop{\sum }\limits_{i,j,i\ne j}^{K}{c}_{ij}tr({\bar{Y}}_{i}{({X}_{i},{U}_{i},{W}_{i})}^{T}{\bar{Y}}_{j}(X_{j},U_{j},W_{j}))$$

subject to the orthogonalization constraint

$${{\bar{Y}}_{i}}^{\rm{T}}{\bar{Y}}_{i}=I\;\forall i.$$

Taking the iterative regression approach followed by Wang et al. and described with reference to classical canonical correlation and PLS-PM earlier in the text, we can maximize the association between network outputs by minimizing the loss

$$\begin{array}{l}L({X}_{1},{X}_{2},\ldots, {X}_{K},{W}_{1},{W}_{2},\ldots, {W}_{K},{U}_{1},{U}_{2},\ldots, {U}_{K})\\=\mathop{\sum }\limits_{i,j,i\ne j}^{K}{c}_{ij}\dfrac{1}{2}{\left\Vert\displaystyle\frac{{\bar{Y}}_{i}({X}_{i},{U}_{i},{W}_{i})}{{\Vert {\bar{Y}}_{i}({X}_{i},{U}_{i},{W}_{i})\Vert }}-\displaystyle\frac{{\bar{Y}}_{j}(X_{j},U_{j},W_{j})}{{\Vert {\bar{Y}}_{j}(X_{j},U_{j},W_{j})\Vert }}\right\Vert}_{F}^{2}.\end{array}$$

Orthogonalization

The DLVPM algorithm can be split into two fundamental parts: an optimization step aimed at finding factors that are strongly correlated between data views, and a constraint that ensures DLVPM factors are orthogonal to one another. It is possible to identify a single factor of shared variance between sets without the orthogonalization step. The loss associated with finding a single DLVPM factor can be written as

$$\begin{array}{l}L({X}_{1},{X}_{2},\ldots, {X}_{K},{w}_{1},{w}_{2},\ldots, {w}_{K},{U}_{1},{U}_{2},\ldots, {U}_{K})\\=\mathop{\sum }\limits_{i,j,i\ne j}^{K}{c}_{ij}\,\displaystyle\frac{1}{2}{\left\Vert \displaystyle\frac{{\bar{y}}_{i}({X}_{i},{U}_{i},{w}_{i})}{{\Vert {\;\bar{y}}_{i}({X}_{i},{U}_{i},{w}_{i})\Vert }}-\displaystyle\frac{{\bar{y}}_{j}(X_{j},U_{j},w_{j})}{{\Vert {\;\bar{y}}_{j}(X_{j},U_{j},w_{j})\Vert }}\right\Vert}_{2}^{2}.\end{array}$$

In cases where we wish to identify more than a single factor of shared variance between data views, an orthogonalization step is required to decorrelate factors. We used two different approaches to orthogonalization in the present investigation. We first introduce an orthogonalization procedure inspired by classical PLS, which is used in the main part of this investigation. We also compared this approach to a whitening procedure, similar to the approach used in Wang et al.⁸⁶.

Iterative orthogonalization

In the present investigation, we use a matrix deflation approach inspired by classical PLS-PM. This approach has the advantage that it maintains the proper ordering of DLVPM factors. During the forward pass through the network, data are orthogonalized with respect to previous DLVPM factors. Individual DLVPM factors are written as follows

$${\bar{y}}_{i}^{n}\left({X}_{i},{U}_{i},{w}_{i}^{n}\right).$$

The set of all DLVPM factors in a data view can be written as

$${\bar{Y}}_{i}({X}_{i},{U}_{i},{W}_{i}),$$

where ${\bar{Y}}_{i}$ is an N × n_dims matrix of DLVPM factors and W_i is a matrix of DLVPM weights. ${\bar{Y}}_{i}$ is a column-wise concatenation of ${\bar{Y}}_{i}={\bar{y}}_{i}^{1}\leftrightarrow {\bar{y}}_{i}^{2}\leftrightarrow {\ldots}\leftrightarrow {\bar{y}}_{i}^{{n}_{\rm{dims}}}$. Similarly, we define the matrix ${{\bar{Y}}_{i}}^{n}$ as the concatenation of all vectors from the first to the nth.

We denote the penultimate layer of the neural network with the notation F_i(X_i, U_i). It is a well-known property of regression that the residual features, denoted by ${F}_{i}({X}_{i},{U}_{i}|{{\overline{Y}}_{i}}^{n})$, found in the regression

$${F}_{i}({X}_{i},{U}_{i}|{\bar{Y}}_{i}^{n})={F}_{i}({X}_{i},{U}_{i})-{\bar{Y}}_{i}^{n}{\left({\bar{Y}}_{i}^{\;{n}^{\rm{T}}}{\bar{Y}}_{i}^{\;n}\right)}^{-1}{\bar{Y}}_{i}^{{\;n}^{\rm{T}}}{F}_{i}\left({X}_{i},{U}_{i}\right),$$

are orthogonal to ${\bar{Y}}_{i}^{\;n}$ (utilizing the Moore–Penrose pseudo-inverse). We use this mechanism to identify orthogonal modes of variation using DLVPM.

We can then write the loss for the nth extracted latent factor, for the ith data view as

$${L}_{i}^{n}\left({X}_{i},\,{U}_{i},{w}_{i}^{n}\right)=\mathop{\sum }\limits_{j,\;j\ne i}^{K}{c}_{ij}\frac{1}{2}{\left\Vert {\;\bar{y}}_{j}^{n}(X_{j},U_{j})-\left({F}_{i}\left({X}_{i},{U}_{i}\right)\left|{\bar{Y}}_{i}^{n-1}\right){w}_{i}^{n}\right.\right\Vert }_{2}^{2}$$

given that $\,{L}_{i}^{n}$ is a sum of regression problems. We can then write the total loss for the ith data view as follows

$${L}_{i}({X}_{i},{U}_{i},{W}_{i})=\mathop{\sum }\limits_{n=1}^{{n}_{\rm{dims}}}{L}_{i}^{n}=\mathop{\sum }\limits_{n=1}^{{n}_{\rm{dims}}}\mathop{\sum }\limits_{j,\;j\ne i}^{K}{c}_{ij}\frac{1}{2}{\left\Vert {\bar{\;y}}_{j}^{n}\left(X_{j},U_{j}\right)-\left({F}_{i}({X}_{i},{U}_{i})|{\bar{Y}}_{i}^{n-1}\right){w}_{i}^{n}\right\Vert }_{2}^{2}$$

L_i is, therefore, written as a sum of the mean squared error losses across latent factors.

Similarly, the total loss can be calculated as the sum of losses across all data views

$$\begin{array}{rcl}L({X}_{1},{X}_{2}\ldots {X}_{K},{W}_{1},{W}_{2}\ldots {W}_{K},{U}_{1},{U}_{2}\ldots {U}_{K})=&&\sum\limits_{i}^{K}{L}_{i} \\=&& \sum\limits_{i}^{K}\sum\limits_{n=1}^{n_{dims}}\sum\limits_{j,\,j\ne i}^{K} c_{ij}\frac{1}{2}\|{\bar{y}}_{j}^{n}(X_{j},U_{j})\\&&-(F_{i}(X_{i},U_{i})|{\bar{Y}}_{i}^{\,n-1})w_{i}^{\,n}{\|}_{2}^{2}\end{array}$$

Owing to the orthogonalization process introduced in the text above, this formulation meets the following constraints

$${\left(\;{\bar{y}}_{i}^{n}\right)}^{\rm{T}}{\bar{y}}_{i}^{m}=1\,{\rm{when}}\,{n}={m},{\left(\;{\bar{y}}_{i}^{n}\right)}^{\rm{T}}{\bar{y}}_{i}^{m}={0}\,{\rm{when}}\,{n}{\ne}{m},{\rm{for}}\,{\rm{i}}={1},{2},{\ldots},{K}.$$

It is worth noting at this point that due to these constraints,

$${\bar{Y}}_{i}^{{n}^{\rm{T}}}{\bar{Y}}_{i}^{n}=I,$$

where I is the identity matrix. This means that the orthogonalization procedure

$${F}_{i}({X}_{i},{U}_{i}|{\bar{Y}}_{i}^{\,n})={F}_{i}({X}_{i},{U}_{i})-{\bar{Y}}_{i}^{\,n}{\left({\bar{Y}}_{i}^{\,{n}^{\rm{T}}}{\bar{Y}}_{i}^{\,n}\right)}^{-1}{\bar{Y}}_{i}^{\,{n}^{\rm{T}}}{F}_{i}({X}_{i},{U}_{i})$$

simplifies to

$${F}_{i}\left({X}_{i},{U}_{i}|{\bar{Y}}_{i}^{\,n}\right)={F}_{i}({X}_{i},{U}_{i})-{\bar{Y}}_{i}^{\,n}{\bar{Y}}_{i}^{{\,n}^{\rm{T}}}{F}_{i}({X}_{i},{U}_{i}).$$

DLVPM minimizes this loss in an iterative fashion by calculating the gradients associated with each data view and updating the weights of these data views.

So far, the analysis of the DLVPM algorithm has preceded assuming that training is carried out on the entire dataset simultaneously. However, neural networks are usually trained on subsets or batches of data whereas orthogonality is a property of the full dataset. Orthogonalization requires estimating a covariance matrix.

$${\Sigma }_{{Y}_{i}{F}_{i}}=\,{{\bar{Y}}_{i}}^{\rm{T}}{F}_{i}\left({X}_{i},\,{U}_{i}\right)$$

We calculate the covariance matrices above during model training, by making an initial estimate of the covariance matrices using the first batch and then updating this estimate using parameter re-estimation with momentum for each batch. The batch-level covariance matrices for the first batch are written as follows

$${\Sigma }_{{Y}_{i}{F}_{i}{\rm{b}}_{0}}={{\bar{Y}}_{{i}_{{\rm{b}}_{0}}}}^{\rm{T}}{F}_{i}({X}_{{i}_{{\rm{b}}_{0}}},\,{U}_{i})$$

The global covariance matrices for the first batch are then initially estimated as

$${\Sigma }_{{Y}_{i}{F}_{i}}=\frac{N}{{\rm{b}}_{0}}{\Sigma }_{{Y}_{i}{F}_{i}{\rm{b}}_{0}},$$

where N is the total number of samples and b₀ is the size of the first batch. In subsequent batch updates, covariance matrices are calculated as

$${\Sigma }_{{Y}_{i}{F}_{i}}\leftarrow {{{\rho }}\times \Sigma }_{{Y}_{i}{F}_{i}}+\left(1-{{\rho }}\right)\frac{N}{{\rm{b}}_{{\rm{t}}}}\,{\Sigma }_{{Y}_{i}{F}_{i}{\rm{b}}_{{\rm{t}}}},$$

where ρ is the momentum of the update, N is the total number of samples under analysis and b_t is the size of the current batch and ρ is a hyperparameter that defines how quickly the covariance matrices are updated using the current batch. In the present investigation, we used a value of ρ = 0.95, which represents a trade-off between maintaining stable, smooth updates and allowing sufficient responsiveness to changes in newer data.

This algorithm allows us to learn global matrices for orthogonalization, which can then be used during inference. Nevertheless, we found that using these covariance matrices during training were ineffective at enforcing orthogonality. This is likely because using global covariance matrices does not enforce orthogonality at the batch-wise level. This means that gradient updates can also be non-orthogonal. However, if we use batch-wise orthogonalization, this condition is strongly enforced.

Consider the subloss for a particular data view, for a particular dimension of shared variance:

$${L}_{i}^{n}=\mathop{\sum }\limits_{j,\,j\ne i}^{K}{c}_{ij}\frac{1}{2}{\left\Vert\,{\bar{y}}_{j}^{\,n}\left(X_{j},U_{j}\right)-\left({F}_{i}\left({X}_{i},{U}_{i}\right)\left|{\bar{Y}}_{i}^{\,n-1}\right){w}_{i}^{n}\right.\right\Vert}_{2}^{2}.$$

Taking the gradient of ${L}_{i}^{n}$ with respect to the weight ${w}_{i}^{n}$ gives

$$\frac{\partial {L}_{i}^{n}}{\partial {w}_{i}^{n}}=-{\left({F}_{i}\left({X}_{i},{U}_{i}\right)\left|{\bar{Y}}_{i}^{\,n-1}\right.\right)}^{\rm{T}}\mathop{\sum }\limits_{j,\,j\ne i}^{K}{c}_{ij}\left(\,{\bar{y}}_{j}^{n}\left(X_{j},U_{j}\right)-\left({F}_{i}\left({X}_{i},{U}_{i}\right)\left|{\bar{Y}}_{i}^{\,n-1}\right){w}_{i}^{n}\right.\right).$$

Using batch-wise covariance matrices, we get

$$\bigl((F_{i}(X_{i},U_{i}) \mid \bar Y_{i}^{\,n-1})\,w_{i}^{\,n}\bigr)^{\!\top}\, (F_{i}(X_{i},U_{i}) \mid \bar Y_{i}^{\,m-1})\,w_{i}^{\,m} \;=\;0$$

and

$$\bar y_{j}^{\,n\!\top}\,\bar y_{j}^{\,m}=0.$$

Therefore, the gradients are orthogonal.

$${\left(\frac{{\partial L}_{i}^{n}}{\partial {w}_{i}^{n}}\right)}^{\rm{T}}\frac{{\partial L}_{i}^{m}}{\partial {w}_{i}^{m}}=0$$

For these reasons, the algorithm we used functions differently in training and testing. During model training, we implement orthogonalization using batch-wise covariance matrices. Global covariance matrices are used during testing. This different behaviour during training and testing, using batch-wise and global parameters, respectively, is similar in purpose and implementation to the batch normalization layer. The full algorithm specifying this method is shown in Fig. 1 and Extended Data Fig. 1. Pseudo-code illustrating how this algorithm works is shown below:

DLVPM with iterative orthogonalization

Input: data matrices ${{{X}}}_{{{i}}}\in {{\mathbb{R}}}^{{{{N}}\times {{p}}}_{{i}}}$ for i = 1, 2…K. Initialization of the weights W_i, U_i for each data view, momentum ρ and learning rate η. Randomly choose a mini-batch and extract data for each data view as ${{{{X}}}_{{{i}}}}_{{{{b}}}_{0}}$.

During training

For t = 0, 1, 2,…,T, do:

Forward propagate through the network:

For i = 1, 2,…,K, do:

$${\bar{Y}}_{i}={F}_{i}\left({X}_{{i}_{{\rm{b}}_{\rm{t}}}},\,{U}_{i},{W}_{i}\right)$$

$$\bar{Y}_{i}\;\leftarrow\;\dfrac{\bar{Y}_{i}}{\;\|\bar{Y}_{i}\dfrac{N}{b_{t}}\|}$$

For i = 1, 2,…,K, do:

Compute the batch mean and variance:

$${\mu }_{{\rm{b}}_{\rm{t}}}=\frac{1}{{\rm{b}}_{\rm{t}}}\mathop{\sum }\limits_{n=1}^{{\rm{b}}_{\rm{t}}}{F}_{i}\left({X}_{{i}_{{\rm{b}}_{\rm{t}}}},\,{U}_{i}\right)\quad{{\sigma }_{{\rm{b}}_{\rm{t}}}}^{2}=\frac{1}{{\rm{b}}_{\rm{t}}}\mathop{\sum }\limits_{n=1}^{{\rm{b}}_{\rm{t}}}{\left({F}_{i}\left({X}_{{i}_{{\rm{b}}_{\rm{t}}}},{U}_{i}\right)-{\mu }_{{\rm{b}}_{\rm{t}}}\right)}^{2}$$

$${F}_{i}\left({X}_{{i}_{{\rm{b}}_{\rm{t}}}},\,{U}_{i}\right){\rm{\leftarrow }}\left({F}_{i}\left({X}_{{i}_{{\rm{b}}_{\rm{t}}}},\,{U}_{i}\right)-{\mu }_{{\rm{b}}_{\rm{t}}}\right)/{\sigma }_{{\rm{b}}_{\rm{t}}}$$

For n = 1, 2…n_dims, do:

If n = 1:

$$\frac{\partial {L}_{i}}{\partial {w}_{i}^{1}}=\frac{\partial }{\partial {w}_{i}^{1}}\mathop{\sum }\limits_{n=1}^{{n}_{\rm{dims}}}\mathop{\sum }\limits_{j,\,j\ne i}^{K}{c}_{ij}\frac{1}{2}{\left\Vert \bar{y}_{j}^{1}\!\left(X_{j}, U_{j}\right)-{F}_{i}({X}_{i},\,{U}_{i}){w}_{i}^{1}\right\Vert }_{2}^{2}$$

$$\,{\bar{y}}_{i}^{1}={F}_{i}\left({X}_{{i}_{{\rm{b}}_{\rm{t}}}},{U}_{i}\right){w}_{i}^{1}$$

$$\,{\bar{y}}_{i}^{n}={\bar{y}}_{i}^{1}$$

Else

$${\Sigma \,}_{{Y}_{i}{F}_{i}{\rm{b}}_{\rm{t}}}^{n}=\,{\left({\bar{Y}}_{{{ib}}_{\rm{t}}}^{\,n-1}\right)}^{\rm{T}}{F}_{i}\left({X}_{{i}_{{\rm{b}}_{\rm{t}}}},\,{U}_{i}\right)$$

$${F}_{i}({X}_{i},\,{U}_{i})|{{\bar{Y}}_{i}}^{n-1}={F}_{i}({X}_{i},\,{U}_{i})-\,{\bar{Y}}_{{{ib}}_{\rm{t}}}^{n-1}{\Sigma \,}_{{Y}_{i}{F}_{i}{\rm{b}}_{\rm{t}}}^{n}\,$$

$$\frac{\partial {L}_{i}}{\partial {w}_{i}^{n}}=\frac{\partial }{\partial {w}_{i}^{n}}\mathop{\sum }\limits_{n=1}^{{n}_{\rm{dims}}}\mathop{\sum }\limits_{j,\,j\ne i}^{K}{c}_{ij}\frac{1}{2}{\left\Vert\bar{y}_{j}^{n}\!\left(X_{j}, U_{j}\right)-\left({F}_{i}\left({X}_{i},{U}_{i}\right)\left|{{\bar{Y}}_{i}}^{n-1}\right){w}_{i}^{n}\right.\right\Vert }_{2}^{2}$$

$${{\bar{y}}_{i}^{n}}=\left({F}_{i}\left({X}_{i},\,{U}_{i}\right)\left|{{\bar{Y}}_{i}}^{n-1}\right){w}_{i}^{n}\right.$$

$${{\bar{Y}}_{i}}^{n}={{\bar{Y}}_{i}}^{n-1}\leftrightarrow {\bar{y}}_{i}^{n}$$

If t = 0

Define global variables, moving mean and moving variance:

$${\sigma }^{2}={{\sigma }_{{\rm{b}}_{\rm{t}}}}^{2},\mu ={\mu }_{{\rm{b}}_{\rm{t}}}$$

Covariance matrix (for orthogonalization):

$${\Sigma }_{{Y}_{i}{F}_{i}}=\frac{N}{{\rm{b}}_{0}}\times {\Sigma }_{{Y}_{i}{F}_{i}{\rm{b}}_{\rm{t}}}$$

Else

For subsequent batches, update the batch moving mean and moving variance:

$$\sigma ={{\rho }}\times \sigma +\left(1-{{\rho }}\right)\times {\sigma }_{{\rm{b}}_{\rm{t}}}$$

$$\mu ={{\rho }}\times \mu +\left(1-{{\rho }}\right)\times {\mu }_{{\rm{b}}_{\rm{t}}}$$

Update the moving covariance matrices:

$${\Sigma }_{{Y}_{i}{F}_{i}}\leftarrow {{{\rho }}\times \Sigma }_{{Y}_{i}{F}_{i}}+\left(1-{{\rho }}\right)\times \frac{N}{{\rm{b}}_{\rm{t}}}\,\times {\Sigma }_{{Y}_{i}{F}_{i}{\rm{b}}_{\rm{t}}}$$

Update the weights:

$${{{W}}}_{{{i}}}{\rm{\leftarrow }}{{{W}}}_{{{i}}}-{{\eta }}\frac{\partial {L}_{i}}{\partial {W}_{i}}$$

$${{{U}}}_{{{i}}}{\rm{\leftarrow }}{{{U}}}_{{{i}}}-{{\eta }}\frac{\partial {L}_{i}}{\partial {U}_{i}}$$

During inference

For i = 1, 2…K, do:

Forward propagate through the network:

$${F}_{i}\left({X}_{{i}_{{\rm{b}}_{\rm{t}}}},\,{U}_{i}\right)$$

$${F}_{i}\left({X}_{{i}_{{\rm{b}}_{\rm{t}}}},\,{U}_{i}\right){\rm{\leftarrow }}\left({F}_{i}\left({X}_{{i}_{{\rm{b}}_{\rm{t}}}},\,{U}_{i}\right)-\mu \right)/\sigma$$

If n = 1:

$${{\bar{y}}_{i}^{1}=F}_{i}\left({X}_{{i}_{{\rm{b}}_{\rm{t}}}},\,{U}_{i}\right){w}_{i}^{1}$$

$${{\bar{Y}}_{i}}^{n}={\bar{y}}_{i}^{1}$$

Else

$${{\bar{y}}_{i}^{n}}=\left({F}_{i}\left({X}_{i},\,{U}_{i}\right)\left|{{\bar{Y}}_{i}}^{n-1}\right)\right.{w}_{i}^{n}=\left({F}_{i}\left({X}_{i},\,{U}_{i}\right)-\,{\bar{Y}}_{{{ib}}_{\rm{t}}}^{\,n}{\Sigma \,}_{{Y}_{i}{F}_{i}}^{n}\right){w}_{i}^{n}$$

$${{\bar{Y}}_{i}}^{n}={{\bar{Y}}_{i}}^{n-1}\leftrightarrow {\bar{y}}_{i}^{n}$$

Whitening

Whitening offers a different way of orthogonalizing DLVs. This method of orthogonalization was used by Wang et al. in their formulation of deep CCA⁸⁶.

Using the definitions of Y_i and W_i outlined earlier in the text, we can write the objective as

$$\begin{array}{l}L({X}_{1},{X}_{2},\ldots {X}_{k},{W}_{1},{W}_{2},\ldots {W}_{k},{U}_{1},{U}_{2}\ldots {U}_{K})=\\\mathop{\sum }\limits_{i,j,i\ne j}^{K}{c}_{ij}\displaystyle\frac{1}{2}{\left\Vert \frac{{F}_{i}({X}_{i},\,{U}_{i}){W}_{i}}{{\Vert {F}_{i}({X}_{i},{U}_{i}){W}_{i}\Vert }}-\,\frac{{F}_{j}(X_{j},\,U_{j})W_{j}}{{\Vert {F}_{j}(X_{j},U_{j})W_{j}\Vert }}\right\Vert}_{F}^{2}\end{array}$$

subject to

$${{\bar{Y}}_{i}}^{\rm{T}}{\bar{Y}}_{i}=I\;\forall i.$$

We note that if we multiply Y_i by the matrix square root of its inverse, we get the following.

$${\bar{Y}}_{i}\leftarrow {\left({{\bar{Y}}_{i}}^{\rm{T}}{\bar{Y}}_{i}\right)}^{-1/2}{\bar{Y}}_{i}$$

Then, the columns of Y_i, representing different modes of variation, are uncorrelated. In other words, the orthogonality condition is met.

We introduce another algorithm, which again minimizes the global loss by iteratively minimizing the sum of the squared loss between each data view and connected data views. Using the whitening approach, we iteratively minimize the loss as follows

$$\begin{array}{l}L({X}_{1},{X}_{2},\ldots {X}_{k},{W}_{1},{W}_{2},\ldots {W}_{k},{U}_{1},{U}_{2}\ldots {U}_{K})\\=\mathop{\sum }\limits_{i,\;j,i\ne j}^{K}{c}_{ij}\frac{1}{2}{\left\Vert {\left({\bar{Y}_{j}}^{\rm{T}}\bar{Y}_{j}\right)}^{-1/2}\bar{Y}_{\!j}-{F}_{i}({X}_{i},{U}_{i}){W}_{i}\right\Vert }_{F}^{2}.\end{array}$$

As was the case when minimizing the loss using the iterative orthogonalization approach specified above, when trained at the batch-wise level, we must estimate a global covariance matrix. We do this in a similar manner to the way in which we estimated global covariance matrices using the iterative orthogonalization approach. As noted in the explanation of the iterative orthogonalization algorithm, training using deep learning is generally carried out at the batch-wise level.

For the first batch, in each data view, the covariance matrix is initially estimated as follows.

$${\Sigma }_{{Y}_{i}{Y}_{i}}=\bar{Y}_{{i}_{{\rm{b}}_{0}}}^{\rm{T}}\bar{Y}_{{i}_{{\rm{b}}_{0}}}$$

Subsequent batches are then estimated as follows.

$${\Sigma }_{{Y}_{i}{Y}_{i}}\leftarrow {{{\rho }}\times \Sigma }_{{Y}_{i}{Y}_{i}}+\,(1\,-{{\rho }})\frac{N}{{\rm{b}}_{\rm{t}}}\,{\Sigma }_{{Y}_{i}{Y}_{i}{\rm{b}}_{\rm{t}}}$$

The full pseudo-code for estimating a DLVPM model using the whitening orthogonalization approach is given below. Note that finding the matrix inverse can be computationally intensive for large embedding sizes. Using the Cholesky decomposition can be substantially quicker in finding ${({\bar{Y}_{j}}^{\rm{T}}\bar{Y}_{j})}^{-1/2}\bar{Y}_{j}$. Therefore, this is offered as an option in the DLVPM toolbox.

DLVPM with whitening

Input: data matrices ${{{X}}}_{{{i}}}\in {{\mathbb{R}}}^{{{{N}}\times {{p}}}_{{i}}}$ for i = 1, 2…K. Initialization of the weights W_i, U_i for each data view, momentum ρ and learning rate η. Randomly choose a mini-batch and extract data for each data view as ${{X}_{i}}_{{\rm{b}}_{0}}$.

During training

For t = 0, 1, 2…T, do:

Forward propagate through the network:

For i = 1, 2…K, do:

$$\bar{Y}_{i}={F}_{i}\left({X}_{{i}_{{\rm{b}}_{\rm{t}}}},\,{U}_{i},{W}_{i}\right)$$

$$\,\bar{Y}_{i}\,\leftarrow {\,{\Sigma }_{{Y}_{i}{Y}_{i}}}^{-1/2}\bar{Y}_{i} $$

For i = 1, 2…K, do:

Compute the batch mean and variance:

$${F}_{i}({X}_{{i}_{{\rm{b}}_{\rm{t}}}},\,{U}_{i}){\rm{\leftarrow }}({F}_{i}({X}_{{i}_{{\rm{b}}_{\rm{t}}}},\,{U}_{i})-{\mu }_{{\rm{b}}_{\rm{t}}})/{\sigma }_{{\rm{b}}_{\rm{t}}}$$

$$\frac{\partial {L}_{i}}{\partial {W}_{i}}=\frac{\partial }{\partial {W}_{i}}\mathop{\sum }\limits_{j,\,j\ne i}^{K}{c}_{ij}\frac{1}{2}{\left\Vert \bar{Y}_{j}-{F}_{i}({X}_{{i}_{{\rm{b}}_{\rm{t}}}},{U}_{i}){W}_{i}\right\Vert}_{F}^{2}$$

If t = 0:

Define global variables, moving mean and moving variance:

$${\sigma }^{2}={{\sigma }_{{\rm{b}}_{\rm{t}}}}^{2},\mu ={\mu }_{{\rm{b}}_{\rm{t}}}$$

Covariance matrices (for orthogonalization):

$$\,{\Sigma }_{{Y}_{i}{Y}_{i}}=\frac{N}{{\rm{b}}_{0}}\times {\Sigma }_{{Y}_{i}{Y}_{i}{\rm{b}}_{\rm{t}}}$$

Else

For subsequent batches, update the batch moving mean and moving variance:

$$\sigma ={{\rho }}\times \sigma +\left(1-{{\rho }}\right)\times {\sigma }_{{\rm{b}}_{\rm{t}}}$$

$$\mu ={{\rho }}\times \mu +\left(1-{{\rho }}\right)\times {\mu }_{{\rm{b}}_{\rm{t}}}$$

Update the moving covariance matrices:

$${\Sigma }_{{Y}_{i}{Y}_{i}}\leftarrow {{{\rho }}\times \Sigma }_{{Y}_{i}{Y}_{i}}+\,(1\,-{{\rho }})\times \frac{N}{{\rm{b}}_{\rm{t}}}\,\times {\Sigma }_{{Y}_{i}{Y}_{i}{\rm{b}}_{\rm{t}}}$$

Update the weights:

$${{{W}}}_{{{i}}}{\rm{\leftarrow }}{{{W}}}_{{{i}}}-{{\eta }}\frac{\partial {L}_{i}}{\partial {W}_{i}}$$

$${{{U}}}_{{{i}}}{\rm{\leftarrow }}{{{U}}}_{{{i}}}-{{\eta }}\frac{\partial {L}_{i}}{\partial {U}_{i}}$$

During inference

For i = 1, 2…K, do:

Forward propagate through the network:

$${F}_{i}{(X}_{{i}_{{\rm{b}}_{\rm{t}}}},\,{U}_{i})$$

$$\bar{Y}_{i}=(({F}_{i}({X}_{{i}_{{\rm{b}}_{\rm{t}}}},\,{U}_{i})-\mu )/\sigma){W}_{i}$$

DLVPM-Twins

Although DLVPM is primarily designed to identify associations between multiple data views, it can also be used to find useful representations of a single data view, which can then be used for downstream tasks. When used in this manner, DLVPM falls into the class of methods called Siamese or twin networks. This class of methods has become popular across a wide range of fields in recent years. Twin architectures are trained by feeding a neural network with distorted versions of the same input. By using some kind of correlative loss on the output features (as is the case with DLVPM), the network is encouraged to learn representations that are invariant to these distortions, which can be useful in downstream analyses.

When using DLVPM in this way, the loss can be written as

$${\max }_{U,W}{\rm{tr}}\left({{\bar{Y}}_{A}({X}_{A},U,W\,)}^{\rm{T}}{\bar{Y}}_{B}({X}_{B},U,W\,)\right)$$

subject to the orthogonalization constraint

$${{\bar{Y}}_{A}}^{\rm{T}}{\bar{Y}}_{A}=I\,{\rm{and}}\,{{\bar{Y}}_{B}}^{\rm{T}}{\bar{Y}}_{B}=I.$$

Here X_A and X_B are augmented/distorted versions of the same input X. Note here that the weights and biases associated with the networks are the same for the entities we are seeking to maximize associations between. The network is then optimized to learn model outputs that are invariant to user-specified changes in the input.

DLVPM-Twins can be used with both iterative orthogonalization and whitening approaches. The algorithms defining these approaches are very similar to the full path-modelling algorithms. Therefore, they are not given here to avoid repetition.

Removing confounds

Data are often subject to unwanted confounds that can affect the validity and generalizability of inferences made on these data. When assessing linear effects, these confounds can be removed by including them as covariates of no interest in a general linear model, or preregressing these unwanted effects from data before the analysis. We took a similar approach to removing confounding effects in neural networks. The last layer of a DLVPM model is linear. Therefore, removing confounding contributions before this layer will remove them entirely.

Here a set of confounds is denoted by an N × D_c matrix C, where N is the number of samples, and D_c is the number of confounds. F(X, U) has the same definition, given earlier in the text.

We implement the operation

$$(F(X,U\,)|C)=F(X,U\,)-C{({C}^{\rm{T}}C\,)}^{-1}{C}^{\rm{T}}F(X,U\,).$$

The matrix (C^TC)⁻¹C^T is known as the Moore–Penrose pseudo-inverse. It is a well-known result that columns of the resulting matrix (F(X, U)|C) are orthogonal to the columns of the matrix C.

When using DLVPM, we can use this approach to orthogonalize neuronal outputs with respect to a set of confounds in the penultimate layer. As projection layers are linear, the outputs of the measurement model will be orthogonal to these confounds. As with the DLV orthogonalization described earlier in the documentation, we must adapt this orthogonalization so that it is possible to train models using this approach at the batch-wise level.

The matrix

$$A={\left({C}^{\rm{T}}C\right)}^{-1}{C}^{\rm{T}}F\left(X,U\,\right)$$

can be split into two covariance matrices, namely,

$${{\Sigma }_{{{\rm{CC}}}}=C}^{\rm{T}}C$$

and

$${{\Sigma }_{{{\rm{CF}}}}=C}^{\rm{T}}F\left(X,U\,\right).$$

Batch-wise estimates of these matrices can be used to estimate full sample matrices. Batch-wise covariance matrices are written as

$${{\Sigma }_{\rm{CC}{\rm{b}}_{\rm{t}}}={C}_{{\rm{b}}_{\rm{t}}}}^{\rm{T}}{C}_{{\rm{b}}_{\rm{t}}}$$

and

$${{\Sigma }_{{CF}{\rm{b}}_{{\rm{t}}}}={C}_{{\rm{b}}_{{\rm{t}}}}}^{\rm{T}}F({X}_{{\rm{b}}_{{\rm{t}}}},\,U\,).$$

We can then carry out orthogonalization at the batch-wise level using

$$(F({X}_{{\rm{b}}_{\rm{t}}},U\,)|C)=F({X}_{{\rm{b}}_{\rm{t}}},U\,)-{C}_{{\rm{b}}_{\rm{t}}}{{{\sum }}_{CC{\rm{b}}_{\rm{t}}}}^{-1}{{\sum }}_{CF{\rm{b}}_{\rm{t}}}.$$

As in the case of carrying out orthogonalization between DLVs, we must also estimate full-sample covariance matrices so that we can carry out orthogonalization with respect to these parameters in unseen test data.

Global covariance matrices for the first batch are estimated as

$${\Sigma }_{\rm{CC}}=\frac{N}{{\rm{b}}_{0}}{{C}_{{\rm{b}}_{0}}}^{\rm{T}}{C}_{{\rm{b}}_{0}}$$

and

$${\Sigma }_{{CF}}=\frac{N}{{\rm{b}}_{0}}{{C}_{{\rm{b}}_{0}}}^{\rm{T}}F\left({X}_{{\rm{b}}_{0}},\,U\right).$$

In subsequent batches, these covariance matrices are updated with momentum:

$${\Sigma }_{\rm{CC}}\leftarrow {{{\rho }}\times \Sigma }_{\rm{CC}}+\left(1-{{\rho }}\right)\frac{N}{{\rm{b}}_{\rm{t}}}\,{\Sigma }_{\rm{CC}{\rm{b}}_{\rm{t}}}$$

and

$${\Sigma }_{{CF}}\leftarrow {{{\rho }}\times \Sigma }_{{CF}}+\left(1-{{\rho }}\right)\frac{N}{{\rm{b}}_{\rm{t}}}{\Sigma }_{{CF}{\rm{b}}_{\rm{t}}},$$

where ρ denotes the momentum.

At the model test time, these covariance matrices are then used to orthogonalize the signal that is forward propagated through the network, with respect to the confounding variables.

$$(F({X}_{{\rm{b}}_{\rm{t}}},U\,)|C)=F({X}_{{\rm{b}}_{\rm{t}}},U\,)-{C}_{{\rm{b}}_{\rm{t}}}{{{\sum }}_{CC}}^{-1}{{\sum }}_{CF}$$

Full pseudo-code illustrating this process is given below.

Confound removal

Input: Data matrices ${{{X}}}\in {{\mathbb{R}}}^{{{{N}}\times {{p}}}}$ and confound matrices $C\in {{\mathbb{R}}}^{{{N}}\times {{{D}}}_{{\rm{c}}}}$.

During training

For t = 0, 1, 2…T, do:

Compute the batch mean and variance:

${\mu }_{{\rm{b}}_{\rm{t}}}=\frac{1}{{\rm{b}}_{\rm{t}}}\mathop{\sum }\limits_{n=1}^{{\rm{b}}_{\rm{t}}}{F}({X}_{{{\rm{b}}_{\rm{t}}}},{U})\quad{{\sigma }_{{\rm{b}}_{\rm{t}}}}^{2}=\frac{1}{{\rm{b}}_{\rm{t}}}\mathop{\sum }\limits_{n=1}^{{\rm{b}}_{\rm{t}}}{({F}({X}_{{{\rm{b}}_{\rm{t}}}},{U})-{\mu }_{{\rm{b}}_{\rm{t}}})}^{2}$

$F({X}_{{\rm{b}}_{\rm{t}}},\,U\,){\rm{\leftarrow }}(F({X}_{{\rm{b}}_{\rm{t}}},\,U\,)\,-{\mu }_{{\rm{b}}_{\rm{t}}})/{\sigma }_{{\rm{b}}_{\rm{t}}}$

If t = 0:

Define global variables, moving mean and moving variance:

${\sigma }^{2}={{\sigma }_{{\rm{b}}_{\rm{t}}}}^{2},\mu ={\mu }_{{\rm{b}}_{\rm{t}}}$

Covariance matrices (for orthogonalization):

${{\sum }}_{CC}=\frac{N}{{\rm{b}}_{0}}{{C}_{{\rm{b}}_{0}}}^{\rm{T}}{C}_{{\rm{b}}_{0}}$ and

${{\sum }}_{CF}=\frac{N}{{\rm{b}}_{0}}{{C}_{{\rm{b}}_{0}}}^{\rm{T}}F({X}_{{\rm{b}}_{0}},U\,)$

Else

For subsequent batches, update the batch moving mean and moving variance:

$\sigma ={\rho}\times \sigma +(1-{\rho})\times {\sigma }_{{\rm{b}}_{\rm{t}}}$

$\mu ={\rho}\times \mu +(1-{\rho})\times {\mu }_{{\rm{b}}_{\rm{t}}}$

Update the moving covariance matrices:

$${{\sum }}_{CC}\leftarrow {\rho}\times {{\sum }}_{CC}+(1-{\rho})\times \frac{N}{{\rm{b}}_{\rm{t}}}\times {{\sum }}_{CC{\rm{b}}_{\rm{t}}}$$

and

$${{\sum }}_{CF}\leftarrow {\rho}\times {{\sum }}_{CF}+(1-{\rho})\times \frac{N}{{\rm{b}}_{\rm{t}}}\times {{\sum }}_{CF{\rm{b}}_{\rm{t}}}$$

Use these covariance matrices to remove the confounding effects:

$$\,(F({X}_{{\rm{b}}_{\rm{t}}},U\,)|C\,)=F({X}_{{\rm{b}}_{\rm{t}}},U\,)-{C}_{{\rm{b}}_{\rm{t}}}\times {{{\sum }}_{CC{\rm{b}}_{\rm{t}}}}^{-1}{{\sum }}_{CF{\rm{b}}_{\rm{t}}}$$

During inference

Forward propagate through the network:

$$F({X}_{{\rm{b}}_{\rm{t}}},U\,)$$

$$F({X}_{{\rm{b}}_{\rm{t}}},U\,)\leftarrow(F({X}_{{\rm{b}}_{\rm{t}}},U\,)-\mu )/\sigma$$

$$F({X}_{{\rm{b}}_{\rm{t}}},U\,)\leftarrow F({X}_{{\rm{b}}_{\rm{t}}},U\,)-{C}_{{\rm{b}}_{\rm{t}}}\times {{{\sum }}_{CC}}^{-1}{{\sum }}_{CF}$$

TCGA data

We initially applied DLVPM to data from the TCGA study (https://portal.gdc.cancer.gov/). Here we only used breast cancer patients with a full complement of data including histological images, RNA-seq, miRNA-seq, methylation and SNVs. To ensure we only used the highest-quality data, we subjected it to several selection steps before its use. We only used samples with a tumour purity above 60%. This threshold was chosen to minimize contamination from non-cancerous cells, thereby reducing background noise and increasing the precision of genetic and epigenetic profiling. By focusing on samples with higher tumour purity, we aimed to obtain clearer insights into tumour-specific molecular pathways and genetic alterations. The acquisition site can have a strong effect on both omics and imaging data³². We introduced a method to remove the effect of acquisition site (see the ‘Removing confounds’ section). However, when a small number of samples is associated with a covariate, it is not possible to disentangle biological effects and effects driven by this nuisance covariate. Therefore, we only used data from acquisition sites that contributed at least ten samples to the TCGA study. We only used female participants. 758 patient samples had a full set of SNV, methylation, miRNA-Seq, RNA-Seq and histological data available for the full path-modelling analysis.

Both DLVPM-Twins and full DLVPM path-modelling analysis require the sampling of subsections of each whole slide image (WSI) called image tiles. The first step in that process was to identify which parts of the overall image contain histological tissue. We did this by calculating Sobel’s image gradient across the whole slide. We then split the tissue into 224 × 224-pixel sections, subsequently referred to as image tiles. This tile size is the input size required by the EfficientNetB0 architecture³⁶ used throughout this study. This process was used at ×5, ×10 and ×20 magnifications. Tiles in which the average Sobel’s image gradient was over 15 for over 50% of the image were considered to contain enough tissue to be used in further analyses.

Omics data are very high dimensional. First, we reduced the data dimensionality for each omics modality. In the case of methylation and RNA-seq data, we did this by finding the genes with the top 10% highest variance. miRNA-seq data have much lower dimensionality; therefore, we used the top 50% here. Omics data such as RNA-seq are often heavily skewed. We subjected all omics data to a rank-based inverse Gaussian transform to remove this data skew. TCGA data are typically of very high quality. However, a very small fraction of the methylation data (0.054%) was missing and coded as NaNs. As all omics data are centred on zero after z normalization, we simply replaced these values with zeros, representing the mean, after normalization. This is common practice in multimodal data integration analyses⁸⁷.

DLVPM-Twins model specification and training

DLVPM-Twins model training comprises two steps: a step to train a convolutional-neural-network-based model to learn meaningful features from individual histological image tiles extracted from WSIs, and a step to predict tumoural properties at the patient/WSI level. The dataset was randomly divided into training (80%) and testing (20%) subsets. This training/testing subset was used for both DLVPM-Twins model training and full DLVPM path modelling.

Feature training

We trained the DLVPM-Twins model using the EfficientNetB0 convolutional architecture, pretrained on ImageNet, on tiles extracted at ×20 magnification. Networks were trained using a feed-forward head on this convolutional base. This consisted of two dense layers with an output size of 512 followed by two further dense layers with an output of 4,096, all with rectified linear unit activations. Dropout layers with a dropout rate of 0.2 are used between all dense layers to prevent overfitting. The two larger (4,096) dense layers represent the embedding layer and are only used during training; these are discarded for testing and the output of the after the first 512-output-size dense layer is used as the representation layer. This network is inspired by the network head chosen in the original paper detailing the Barlow twins method⁷⁵. The DLVPM-Iterative method becomes numerically unstable with larger output sizes. For this reason, we used an output size of 128 here, which was the largest size that was usable before the method became numerically unstable at a batch size of 256. For the DLVPM-Twins pretraining, we used a batch size of 256, and a learning rate of 1 × 10⁻⁴. These parameters were selected on the basis of previously published results⁷⁵. The Adam optimizer was used in all the cases. This hyperparameter selection process was also applied to VicReg and Barlow twins methods. This training to produce the DLVPM-Twins image model was carried out for 100 epochs.

Data augmentation

Here a single image tile was extracted at random from the WSIs of each patient included in a training batch. Each tile was then subjected to various data augmentations; specifically, tiles were rotated by up to ±20°; translated horizontally and vertically by up to 20%; sheared by 20% to introduce geometric distortions; zoomed by up to 20% to vary scale; horizontally flipped to diversify the dataset further; brightness adjustments were made within a 70%–130% range to mimic different lighting conditions; tiles were randomly converted to a greyscale-like effect with a 50% probability to prepare the model for variations in stain quality. The ‘reflect’ filling mode was used for newly created pixels. DLVPM-Twins was then trained to maximize the associations between tiles subjected to these different data augmentations. This process was then repeated for subsequent batches, with tiles again extracted at random for each new batch.

Prediction at WSI/patient level

Following model training on individual tiles, the trained convolutional model architecture was applied to 100 tiles randomly extracted from WSIs for each patient at ×20 magnification. For each patient WSI, global mean average pooling was carried out over DLVs extracted over all the tiles. This results in a single set of DLVs for each subject. A single-layer classification head was then trained on these DLVs to predict the molecular and histological statuses, and the presence/absence of the TP53 mutation. The same procedure was used to train the VicReg and Barlow twins models; for these methods, we used the optimal hyperparameter choices specified in the original publications for these methods^31,75. For the single-layer classification head, we used a batch size of 32 and a learning rate of 0.001, which are standard parameters.

For each of the methods compared here, model training took 2 h on a single A100 GPU. Feature extraction before classification took a further 1 h on the same hardware, for each method.

In the initial experiments, our DLVPM-Twins model, trained on the histology data, was used to predict the histological and molecular statuses of TCGA tumours.

Histological subtypes

Breast cancer is primarily categorized into several histological subtypes: invasive ductal carcinoma (which originates from breast ducts) and invasive lobular carcinoma (arising from the lobules). Less common types include inflammatory breast cancer, known for its aggressive nature and inflammatory symptoms, and triple-negative breast cancer, which lacks hormonal receptors and is particularly challenging to treat. Information on the percentages of patients with these different histological subtypes of cancer is given in Supplementary Table 1⁸⁸.

Molecular subtypes

The PAM50 molecular classification system categorizes breast cancer into five distinct subtypes based on gene expression: luminal A, luminal B, HER2 enriched, basal like and normal like. These subtypes inform prognosis and treatment decisions, with luminal A typically having the best outcome and basal like, the poorest owing to its aggressive nature and lack of hormone receptors. Information on the percentages of patients with these different molecular subtypes of cancer is given in Supplementary Table 2⁸⁹.

We only used histological/molecular subtypes for prediction when there were at least 30 instances of that classification in the training set. This threshold ensures reliable predictions by avoiding overfitting and capturing meaningful patterns. It aligns with standard practices requiring sufficient sample sizes for stable model training and generalizability.

Full DLVPM model specification and training

Full DLVPM requires that we specify a neural network for processing different data types included in an analysis. In path-modelling parlance, these different models are known as measurement models. The full DLVPM model encompassing all the measurement models is illustrated in Extended Data Fig. 3.

Histological measurement model

The histological imaging data were processed using a network that aggregates effects visible in the WSI data at different magnifications. To obtain effects from histology at different magnifications, we trained a DLVPM-Twins model at ×5, ×10 and ×20 magnifications. Here we used the whitening formulation of the method owing to its increased numeric stability. The DLVPM-Twins network layers after the convolutional base were assigned as trainable in the DLVPM path model. A trainable feed-forward neural network was then used to combine these multi-magnification effects. L₁ and L₂ weight regularization using standard regularization rates of L₁ = 0.01 and L₂ = 0.01 were used for all layers containing learnable weights, to prevent overfitting. A dropout layer⁹⁰ using a standard dropout rate of 0.5 was applied before the confound removal layer for the same purpose.

Omics measurement model

Each of the omics models uses the same general neural network structure. The model utilizes an embedding layer that reduces the dimensionality of the input to the square root of the initial gene count, a heuristic inspired by natural language processing to efficiently capture the essence of gene expression patterns. Subsequent reshaping introduces a pseudo-sequence dimension, enabling the application of a self-attention layer, which facilitates the model’s focus on critical gene interactions. The attention output, merged with the original input through a residual connection, ensures the preservation of initial gene expression information and incorporating learned interaction effects. As with the histological neural network, regularization was again applied to all layers at rates of L₁ = 0.01 and L₂ = 0.01. Again, a dropout layer using a standard dropout rate of 0.5 was applied before the confound removal layer.

Both histological and omics measurement models end with a custom neural network layer that partials out the effect of confounds using the Moore–Penrose pseudo-inverse. This approach is detailed later, and is illustrated in Extended Data Fig. 3.

Once the individual measurement models are specified, the DLVPM method is used to construct DLVs from each different data type that are maximally associated with DLVs from other data types connected by the user-specified path model. DLVPM path modelling used the same overall train–test split as DLVPM-Twins. For hyperparameter optimization, the training data were further partitioned into 80% training and 20% validation sets through random splitting. Hyperparameter tuning involved multiple runs using batch sizes of 32, 64, 128 and 256. We implemented an exponential decay strategy for the learning rate, starting from initial values of 1 × 10⁻², 1 × 10⁻³ and 1 × 10⁻⁴ and decaying to a value ten times lower. A grid search approach was utilized to determine the optimal batch size and initial learning rate. The hyperparameter combination yielding the highest evaluation metric (mean correlation between modalities connected by the path model) was then selected for further use. Following the selection of hyperparameters, the model was retrained on the entire initial training dataset (80%) using the selected hyperparameters. Each training run was carried out for 300 epochs. Here the histological DLVPM-Twins training step took 6 h on a single A100 GPU. Full DLVPM model training then took 35 min on the same hardware, including hyperparameter selection.

Multimodal methods

We benchmarked the performance of the shallow path-modelling method, PLS-PM, against DLVPM in the task these methods are optimized to carry out: identifying associations between latent variables connected by a path model. We then compared the performance of DLVPM with several other multimodal data integration methods in the task of identifying multiomic loci associated with the model as a whole, and in identifying multiomic loci associated with the histology data. As with the full DLVPM path-modelling analysis, in the case of histological data, we concatenated the multi-magnification features extracted using DLVPM-Twins (see the ‘DLVPM-Twins model specification and training’ section), and used these as inputs to the model.

Shallow PLS-PM

PLS-PM is closely related to DLVPM, and can be thought of as the classical equivalent of this method. PLS-PM is designed to construct sets of latent variables that are optimally correlated between data types connected by a path model. There are two major types of PLS-PM algorithm: mode A and mode B²³. Mode A involves optimizing the association between different data types. This approach requires the calculation of the matrix inverse of within-modality covariance matrices. This is not possible when the number of examples in the data modality is smaller than the number of features. Mode B PLS-PM solves this issue by replacing within-modality covariance matrices with identity matrices. As the data in the present application have many more features than samples, we used mode B PLS-PM for comparison with DLVPM. When training the shallow PLS model, we used the same processed data as for DLVPM.

MOFA+

MOFA+ generates factors derived from multiomics data by modelling each omics dataset as a linear combination of latent factors, with dataset-specific weight matrices capturing the contribution of each feature to the factors^39,40. It uses a probabilistic model with a Gaussian likelihood for continuous data and alternatives for other data types (for example, Bernoulli for binary data), along with sparsity-inducing priors to ensure interpretable factorization. Optimization is performed via variational inference, enabling the efficient estimation of the factors and associated weights and handling missing data. This approach allows MOFA+ to disentangle shared and data-specific sources of variation across modalities. We used MOFA+ with standard parameters.

Multimodal autoencoder

The deep multimodal autoencoder⁴¹ is designed for data integration across multiple modalities by learning a joint representation of multiple input data types. It extends the standard autoencoder structure to handle multimodal data, where the encoder maps inputs from multiple data types into a shared latent space, and the decoder reconstructs each modality from this latent representation. The key idea is to optimize the joint representation such that it captures the shared information across modalities as well as allows for modality-specific reconstructions. The model is trained using a combination of reconstruction loss for individual modalities and cross-modal reconstruction, ensuring that the learned latent space is meaningful even when some modalities are missing.

We used a multimodal autoencoder that integrates data from histology, RNA-seq, methylation, miRNA-seq and SNVs. In this work, each modality has a dedicated encoder with dense layers, rectified linear unit activations and batch normalization, producing a latent representation of size 128. Encoded representations are concatenated into a shared bottleneck layer of size 5 (the same number of DLVs extracted by DLVPM), capturing cross-modal relationships. Decoders, mirroring the encoders, reconstruct inputs from modality-specific representations. Training minimizes the mean squared error loss for each modality using the shared bottleneck as the target, ensuring compact, shared latent representations and retaining modality-specific features. As with the DLVPM, we ran this model for 300 epochs.

Mediation effects

Statistical mediation analysis examines how an independent variable influences a dependent variable through a mediator. It involves assessing three key pathways: the effect of the independent variable on the mediation (path A), the effect of mediator on dependent variable (path B) and the direct effect of independent variable on the dependent variable (path C′). The total effect of the independent variable (path C) is decomposed into the direct effect (path C′) and the indirect effect (path A × path B). To test for significant mediation, the significance of the indirect effect (path A × path B) is evaluated using methods like the Sobel test or bootstrapping. Mediation helps to understand the underlying mechanism of how an independent variable affects a dependent variable through a mediator.

The effect of DLVs constructed from methylation, miRNA-seq and SNV data should act indirectly on histology, with RNA-seq acting as a mediator. We tested for mediation effects using the ‘statsmodels’ package. Our mediation model designated the DLVs derived from methylation, miRNA-seq and SNV data as independent variables, RNA-seq data as the mediator and histological outcomes as the dependent variable. To assess the significance of the mediation effect, statsmodels uses a bootstrapping approach. By using bootstrapping, statsmodels does not rely on the assumption of normality for the indirect effect, making it a robust method for mediation analysis. The results of the bootstrapped mediation analysis provided an estimate of the size and significance of the indirect effects of methylation, miRNA-Seq and SNV data on histology through RNA-seq data.

Individually significant effects

DLVPM is a method for identifying global associations between different data types. We carried out additional analyses to localize effects to individual genetic loci. We ran these analyses to determine both overall significance of individual genetic loci within the path-modelling analysis and significance of their association with histological data. To assess the overall significance of individual genetic loci in the path-modelling analysis, we applied the following procedure. For each multiomic data type, we calculated the harmonic mean of Pearson’s correlation values between each omics locus and the DLVs connected to that data type via the DLVPM path model in the testing set. We chose the harmonic mean over the arithmetic mean because it is more sensitive to smaller values, which was crucial for identifying loci connected to all the modalities in the path model. Since the harmonic mean is always positive, we used the arithmetic mean to determine whether associations were positive or negative.

We then used permutation testing to ascribe significance to the mean of these associations for each multiomic data type. Using permutation testing, it is possible to correct for multiple comparisons by using the maximal statistic across all loci (here the largest mean correlation coefficient) as the statistic of interest in the permutation distribution⁹¹. This procedure controls the family-wise error in the strong sense.

We used the same procedure when determining the significance of associations between the histology data and individual genetic loci. The only difference in this analysis was that we only calculated associations between the individual genetic loci, and the histology DLVs, rather than taking the mean association as our statistic of interest.

GSEA

We carried out a GSEA on the ranking of correlations between the gene expression scores, quantified by RNA-seq, and DLVs from connected datasets (Methods). Results from these analyses are shown in Supplementary Fig. 1. GSEA was carried out using the fgsea package in R. This analysis used the gene set ‘Hallmarks’, downloaded from https://www.gsea-msigdb.org/gsea/msigdb/human/collections.jsp. This analysis was conducted using the default exclusion criteria, in which pathways with fewer than 15, or more than 500, genes were omitted from the analysis. Significance was determined using an adjusted P-value threshold of less than 0.1. The normalized enrichment score was utilized to evaluate the effect sizes.

CPTAC replication

We replicated the primary DLVPM model, originally trained on the TCGA data, using data from the CPTAC project. CPTAC, an initiative by the National Cancer Institute, integrates proteomics, genomics and transcriptomics to advance our understanding of cancer biology, identify biomarkers and drive precision medicine. The project provides publicly available multiomics datasets, including those for breast cancer. For this study, we utilized data from prospectively collected, non-TCGA samples³⁷. These CPTAC samples included miRNA-seq, RNA-seq, SNV and histology data, though methylation data were not available. A total of 105 samples with all four data types were included in our analysis. Molecular data were obtained from https://kb.linkedomics.org/ (ref. ³⁸) and histology data were sourced from https://www.cancerimagingarchive.net/ (ref. ⁷⁸).

Survival analysis

After training the DLVPM model on TCGA, we predicted clinical outcomes using DLVs as predictors in a Cox proportional hazards regression model. In breast cancer, the progression-free interval is the recommended clinical endpoint⁹². The Cox model enables the aggregation of effects across multiple DLVs, providing a comprehensive risk assessment. TCGA has the benefit of extensive omics and imaging characterization. Nevertheless, TCGA has the limitation of short follow-up times and incomplete records, making it less reliable for analysing outcomes requiring extended follow-up or detailed survival trends.

To address this limitation, survival analysis was replicated and extended using the METABRIC dataset^93,94. METABRIC offers a large, well-characterized cohort with extensive genomic and transcriptomic data, complementing the TCGA dataset. Importantly, METABRIC features a longer follow-up time, crucial for capturing long-term survival outcomes and disease progression patterns. This extended follow-up enables a more robust estimation of hazard ratios and better differentiation between short- and long-term prognostic factors.

The TCGA model was trained using histology, RNA-seq, methylation, miRNA-seq and SNV data, enabling a rich, multimodal approach to outcome prediction. However, for METABRIC, only RNA-seq and SNV data were available. Despite this limitation, METABRIC’s extended follow-up and large cohort size provided a robust platform for validating and extending the survival analysis, demonstrating the flexibility of DLVPM in adapting to varying data modalities. We used n = 1,980 subjects from METABRIC, with all the subjects having clinical, RNA-seq and SNV data. We also compared the performance of DLVPM in predicting survival trends with that of several other multimodal data integration methods. METABRIC data were obtained from https://www.cbioportal.org/study/summary?id=brca_metabric. Analyses were carried out using the lifelines package.

Histological visualization

Our DLVPM model undergoes training using tissue tiles and, on completion, we deploy it to analyse each tile individually. This enhances our understanding of the tumour segments that exhibit the most pronounced effects for specific DLVs. This allows us to pinpoint and assess the tumour subsections that have the greatest influence on each DLV.

Single-cell analysis

Single-cell data were obtained from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE176078. We applied the RNA-seq component of the full DLVPM model, trained on the TCGA dataset to data from the single-cell breast cancer encyclopaedia⁵⁹. The single-cell breast cancer encyclopaedia is a collection of 100,064 single cells with transcriptomic data, taken from 26 primary tumours including 11 ER+, 5 HER2+ and 10 TNBC, representing the three major clinical subtypes. These data were preprocessed in the same manner as the TCGA RNA-seq data.

Cell-line data

Cell-line data were obtained from https://depmap.org/portal/. Of the available breast cancer cell-line data, 61 RNA-seq samples, 67 SNV samples, 50 miRNA-seq samples and 45 CRISPR–Cas9 samples were available. All these omics data were used in the analyses presented here. Pairwise associations between omics data types and CRISPR–Cas9 data utilized all the intersecting samples.

Omics data from the depmap project were preprocessed for use in the same manner as data from the TCGA dataset. Although methylation data were collected as part of this project, these data are of a different type to those collected as part of the TCGA project. These data were, therefore, not used as part of the current investigation. In some cases, data were not available for particular genes/loci. These genes/loci were replaced by columns of zeros. The DLVPM model was robust to these changes as it was trained with a dropout layer, simulating this effect.

As noted earlier, we used a confound layer to remove the effects of nuisance covariates when training the DLVPM model on the TCGA data. When we applied the model to the CCLE data, this layer was removed from the model as these covariates are not relevant for the CCLE data. We also used batch-level statistics to ensure that the DLVs were orthogonal in this new dataset.

Spatial transcriptomics

Spatial transcriptomic data were obtained from https://www.10xgenomics.com/. At the time of the analysis, four breast cancer samples were available using the Xenium platform. DLVPM was initially applied to the TCGA data to parse intertumoural heterogeneity. Because histology data are trained on sections of tissue called tiles, it is possible to deconvolve tile-wide effects back into the image space. This allows us to visualize histologic heterogeneity across individual tumours. Recently, a range of spatial transcriptomic methods have been developed with the aim of quantifying heterogeneity in gene expression across individual tumours.

The Xenium platform, from 10x Genomics, is an in situ hybridization-based spatial transcriptomic method⁷⁰. This platform provides subcellular transcript resolution for genes known to be important in breast cancer. We sought to identify relations between the DLVPM models, and genes found to be essential to the functioning of cells scoring highly on these models. The DLVPM histological model has a tile-wise resolution of 224 × 224 pixels. We extracted tile-wise histological DLVs, and calculated the association between these DLVs and the total number of transcripts of genes of interest in the matching tile, normalized by the total number of transcripts.

We assessed the significance of associations between DLV 1, and the genes CCND1, GATA3 and ESR1. As there is a high degree of spatial autocorrelation in these data, an uncritical application of Pearson’s coefficient will lead to inflated significance levels and type-1 errors. For this reason, we used a method to assess statistical significance that fully accounts for spatial autocorrelation⁹⁵ using the SpatialPack package in R.