VOLTA: an enVironment-aware cOntrastive ceLl represenTation leArning for histopathology



Ethics

The Declaration of Helsinki and the International Ethical Guidelines for Biomedical Research Involving Human Subjects were strictly adhered to throughout the course of this study. All study protocols were approved by the University of British Columbia/BC Cancer Research Ethics Board.

Methodology

Fig. 1 provides an overview of the proposed self-supervised method for cell classification. This framework consists of two main blocks: 1) Cell Block; 2) Environment Block. The Cell Block learns the cell embeddings (i.e., representations) by contrasting individual cell-level images while the Environment Block incorporates environment-level information into the cell representations.

Cell block

The architectural design of the Cell Block is similar to our previously proposed model58, which has shown promising performance in cell representation learning tasks. In this block, cell embeddings are learned by pulling the embeddings of two augmentations of the same image together while pushing the embeddings of other images away. Let \(X=\{x_{i} \mid 1\le i\le N\}\) be the input batch of cell images, where N is the number of images in the batch. Each \(x_{i}\) is a small crop of the H&E image around a cell, sized so that it includes only that specific cell. Two different sets of augmentations are applied to X to generate \(Q=\{q_{i} \mid 1\le i\le N\}\) and \(K=\{k_{i} \mid 1\le i\le N\}\). We call these sets the query and the key, respectively; \(q_{i}\) and \(k_{j}\) are augmentations of the same image if and only if i = j. The query batch is encoded using a backbone model, a neural network of choice, while the keys are encoded using a momentum encoder, which has the same architecture as the backbone. This momentum encoder is updated using (1), in which \(\boldsymbol{\theta}_{k}^{t}\) is the parameter of the momentum encoder at time t, m is the momentum factor, and \(\boldsymbol{\theta}_{q}^{t}\) is the parameter of the backbone at time t:

$$\boldsymbol{\theta}_{k}^{t}=m\,\boldsymbol{\theta}_{k}^{t-1}+(1-m)\,\boldsymbol{\theta}_{q}^{t}.$$

(1)
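The update in (1) is an exponential moving average of the backbone weights. The following is a minimal sketch of that rule on plain Python lists; the function name `momentum_update` and the flat-list parameter layout are assumptions made for illustration and are not taken from the original implementation.

```python
def momentum_update(key_params, query_params, m=0.999):
    """Apply theta_k <- m * theta_k + (1 - m) * theta_q element-wise.

    key_params and query_params are flat lists of floats standing in for
    the momentum-encoder and backbone weights (a simplification).
    """
    if len(key_params) != len(query_params):
        raise ValueError("parameter lists must have the same length")
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


# Toy example with an exaggerated momentum of 0.5 so the drift is visible.
theta_k = [0.0, 1.0]
theta_q = [1.0, 1.0]
for _ in range(3):
    theta_k = momentum_update(theta_k, theta_q, m=0.5)
```

With a momentum factor close to 1, the key parameters change slowly, which is why the momentum encoder behaves like a running average of the backbone over training.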

Subsequently, the obtained query and key representations are passed through separate Multi-Layer Perceptron (MLP) layers called projection heads. The query projection head is trainable, whereas the key projection head is updated with momentum using the weights of the query projection head. We restrict these layers to be 2-layer MLPs with an input size of 512, a hidden size of 128, and an output size of 64. In addition to the projection head, we use an extra MLP on the query side of the framework, called the prediction head. This network is a 2-layer MLP with input, hidden, and output sizes of 64, 32, and 64, respectively. Similar to the last fully-connected layers of a conventional classification network, the projection and prediction heads provide more representation power to the model.
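To give a sense of the size of these heads, the sketch below counts the weights of a 2-layer MLP from its layer sizes. It assumes plain fully-connected layers with bias terms and no normalization layers, which the text does not specify; `mlp_param_count` is a hypothetical helper.

```python
def mlp_param_count(sizes, bias=True):
    """Count weights (and optionally biases) of a fully-connected MLP.

    sizes lists the layer widths, e.g. [input, hidden, output].
    """
    total = 0
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        total += fan_in * fan_out
        if bias:
            total += fan_out
    return total


projection_head_params = mlp_param_count([512, 128, 64])  # input, hidden, output
prediction_head_params = mlp_param_count([64, 32, 64])
```

Under these assumptions, both heads are small relative to a ResNet-style backbone.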

The networks of the Cell Block are trained using the InfoNCE39 loss, which is shown in (2):

$${L}_{{q}_{i}}^{cell}=-\log \frac{\exp \left(\frac{\parallel {f}_{q}(\mathbf{q}_{i}){\parallel }^{2}\cdot \parallel {f}_{k}(\mathbf{k}_{i}){\parallel }^{2}}{\tau }\right)}{\sum_{j=0}^{N+Q}\exp \left(\frac{\parallel {f}_{q}(\mathbf{q}_{i}){\parallel }^{2}\cdot \parallel {f}_{k}(\mathbf{k}_{j}){\parallel }^{2}}{\tau }\right)}.$$

(2)

In this equation, τ is the temperature that controls the sharpness of the distribution, \(\parallel \cdot {\parallel }^{2}\) is the normalization operator, Q is the number of items stored in the queue from the key branch (here Q denotes the queue length rather than the query set defined earlier), \(f_{q}\) denotes the composition of the backbone, query projection head, and query prediction head, and \(f_{k}\) denotes the composition of the momentum encoder and the key projection head.
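As an illustration of (2), the sketch below computes the InfoNCE loss for a single query from already-encoded vectors. It is a simplified pure-Python version rather than the original implementation: `negative_keys` stands in for the other keys of the batch together with the queue, and all vectors are assumed to be non-zero.

```python
import math


def l2_normalize(vector):
    """Scale a non-zero vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def info_nce(query, positive_key, negative_keys, tau=0.07):
    """InfoNCE loss for one query; the positive key is placed at index 0."""
    q = l2_normalize(query)
    keys = [l2_normalize(positive_key)] + [l2_normalize(k) for k in negative_keys]
    # Temperature-scaled dot products between the normalized embeddings.
    logits = [sum(a * b for a, b in zip(q, k)) / tau for k in keys]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_denominator - logits[0]


aligned = info_nce([1.0, 0.0], [2.0, 0.0], [[0.0, 1.0]])
misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[3.0, 0.0]])
```

The loss is close to zero when the query matches its positive key and grows when a negative key is more similar to the query than the positive one.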

The augmentation pipelines include cropping, color jitter (brightness of 0.4, contrast of 0.4, saturation of 0.4, and hue of 0.1), gray-scale conversion, Gaussian blur (with a random sigma between 0.1 and 2.0), horizontal and vertical flip, and rotation (with an angle randomly selected between 0 and 180 degrees). To ensure that the model consistently observes the entire cell image on one side, we eliminate the cropping step from one of the pipelines. Consequently, the pipeline that includes cropping generates localized sections of the cell image, while the other pipeline produces global images that cover the entire cell. Because the assignment is random, either view can be passed through the backbone or the momentum encoder.

At inference time, cell embeddings are generated from the trained momentum encoder and clustered using the K-means algorithm. One can use either the encoder or the momentum encoder for embedding generation; however, the momentum encoder provides more robust representations since it aggregates the learned weights of the encoder network from all of the training steps (an ensembled version of the encoder throughout training)33.
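The clustering step can be sketched as follows. This is a small pure-Python K-means on toy two-dimensional embeddings, intended only to illustrate the procedure; the deterministic farthest-point initialization is a choice made here for reproducibility of the example and is not described in the text.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_centroids(points, k):
    """Deterministic farthest-point initialization (an assumption of this sketch)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        centroids.append(list(farthest))
    return centroids


def kmeans(points, k, max_iterations=100):
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    centroids = init_centroids(points, k)
    labels = None
    for _ in range(max_iterations):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments no longer change
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy embeddings forming two well-separated groups.
embeddings = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
labels, centroids = kmeans(embeddings, k=2)
```

In practice the embeddings would be the high-dimensional outputs of the momentum encoder, and a library implementation of K-means would typically be preferred.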

Environment block

Many studies have shown that the tumor microenvironment (TME) plays an important role in tumor progression32,57. Motivated by these findings, we ask: should the representation of a cell reflect its environment as well? We hypothesize that a deeper knowledge of the environment leads to a better general understanding of the cell. Mathematically, this hypothesis is equivalent to assuming that cells and their environment share mutual information. Therefore, to validate this hypothesis, we propose to increase the mutual information between the corresponding cell and environment representations during training. Previous studies59 have shown that minimizing the InfoNCE loss maximizes a lower bound on the mutual information between different views of an image. Thus, we use this loss function to achieve this goal by performing cross-modal contrastive learning as an auxiliary task.

Let \(E=\{e_{i} \mid 1\le i\le N\}\) be the corresponding environment patches of the cells represented by X. Here, we refer to the environment as a large region around a cell that includes the surrounding tissue and cells. Therefore, for i = 1, 2, ..., N, \(x_{i}\) and \(e_{i}\) are centered on the same cell (however, for the cases where the cells are located on the edge of the patch, we limit the patch border to the border of the image). After applying an augmentation pipeline, the environment patches are passed through an encoder network, called the environment encoder. Simultaneously, we apply a new projection head, the environment projection head, to the cell representations obtained from the query backbone in the Cell Block. Finally, the Environment Block is trained using these two sets of representations (environment and cell) and the loss in (3):

$${L}_{{q}_{i}}^{env}=-\log \frac{\exp \left(\frac{\parallel {g}_{cell}(\mathbf{q}_{i}){\parallel }^{2}\cdot \parallel {g}_{env}(\mathbf{e}_{i}){\parallel }^{2}}{\tau }\right)}{\sum_{j=0}^{N}\exp \left(\frac{\parallel {g}_{cell}(\mathbf{q}_{i}){\parallel }^{2}\cdot \parallel {g}_{env}(\mathbf{e}_{j}){\parallel }^{2}}{\tau }\right)}.$$

(3)

Therefore, the final loss of the whole framework can be written as (4), in which λ is a hyperparameter. Increasing the value of λ prioritizes the mutual information of the cell with its environment over the consistency of the representation across different augmentations of the same cell:

$${L}_{{q}_{i}}={L}_{{q}_{i}}^{cell}+\lambda {L}_{{q}_{i}}^{env}.$$

(4)
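The cross-modal loss in (3) and the weighted sum in (4) can be sketched together as below. This is a simplified pure-Python illustration on already-encoded, non-zero vectors; the function names are hypothetical, and the value of `lam` (λ) in the example is arbitrary because the text does not report it here.

```python
import math


def normalize(vector):
    """Scale a non-zero vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cross_modal_info_nce(cell_embeddings, env_embeddings, tau=0.07):
    """Per-cell loss: environment i is the positive for cell i, the rest are negatives."""
    cells = [normalize(v) for v in cell_embeddings]
    envs = [normalize(v) for v in env_embeddings]
    losses = []
    for i, cell in enumerate(cells):
        logits = [sum(a * b for a, b in zip(cell, env)) / tau for env in envs]
        peak = max(logits)  # subtract the maximum for numerical stability
        log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
        losses.append(log_sum - logits[i])
    return losses


def total_loss(cell_losses, env_losses, lam):
    """Combine the two per-cell losses as L = L_cell + lam * L_env."""
    return [lc + lam * le for lc, le in zip(cell_losses, env_losses)]


matched = cross_modal_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = cross_modal_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
combined = total_loss([1.0, 2.0], [0.5, 0.5], lam=2.0)
```

Setting `lam` to zero recovers the Cell Block loss alone, which corresponds to training without the environment.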

The augmentation pipeline of the Environment Block uses the same operations as that of the Cell Block except for cropping.

To prevent the model from focusing on the overlapping regions between the corresponding cell and environment images (a so-called shortcut60, meaning that the model uses undesired features to solve the problem), we mask the target cell in the environment patch. Furthermore, the remaining cells in the environment patch are also masked to ensure that the model does not bias the representation of a cell towards the neighboring cell types. We investigate the effectiveness of the masking operation in the ablation study.
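A minimal sketch of the masking operation is given below. It assumes that an instance segmentation mask is available in which background pixels are 0 and every cell has a positive identifier, and that masked pixels are replaced by a constant `fill_value`; the fill value and the nested-list image layout are assumptions of this example.

```python
def mask_cells(env_patch, instance_mask, fill_value=0):
    """Replace every pixel that belongs to any cell with fill_value.

    env_patch and instance_mask are 2-D nested lists of the same shape.
    A mask value of 0 marks background; any other value marks a cell.
    """
    return [
        [fill_value if label != 0 else pixel for pixel, label in zip(pixel_row, mask_row)]
        for pixel_row, mask_row in zip(env_patch, instance_mask)
    ]


patch = [[10, 20, 30], [40, 50, 60]]
mask = [[0, 1, 1], [2, 0, 0]]  # two cells with identifiers 1 and 2
masked_patch = mask_cells(patch, mask)
```

Because all cell identifiers are treated alike, both the target cell and its neighbors are removed, leaving only the surrounding tissue.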

Data preparation

The aforementioned datasets included patch-level images, while we required cell-level ones for the training of the model. To generate such data, we used the instance segmentation provided in each of the external datasets to find cells and crop a small box around them. However, for the Oracle and SarcCell datasets, the instance segmentation masks were generated by applying HoVer-Net21 segmentation pre-trained on the PanNuke dataset.

An adaptive window size was used to extract cell images from the H&E slides. More specifically, the window is selected based on the size of the cell, which prevents overlap with other cells. The adaptive window size was set to twice the size of the cell for the CoNSeP dataset, while it was equal to the size of the cell for the rest of the datasets. Finally, cell images were resized to 32 × 32 pixels (to enable batch-wise processing operations) and were normalized to zero mean and unit standard deviation before being fed into our proposed framework. The size of the environment patch used in the Environment Block was set to 200 pixels for all datasets.
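The two cropping rules can be sketched as follows. The example interprets the "size of the cell" as its bounding box and treats the border rule as clipping the box to the image; both are assumptions, and the function names are hypothetical.

```python
import math


def adaptive_window(bbox, scale, image_size):
    """Scale a cell bounding box about its center and clip it to the image.

    bbox is (x_min, y_min, x_max, y_max); image_size is (width, height).
    """
    x_min, y_min, x_max, y_max = bbox
    width, height = image_size
    center_x = (x_min + x_max) / 2
    center_y = (y_min + y_max) / 2
    half_w = scale * (x_max - x_min) / 2
    half_h = scale * (y_max - y_min) / 2
    return (
        max(0, math.floor(center_x - half_w)),
        max(0, math.floor(center_y - half_h)),
        min(width, math.ceil(center_x + half_w)),
        min(height, math.ceil(center_y + half_h)),
    )


def environment_box(center, image_size, patch_size=200):
    """Square box of patch_size pixels around a cell center, clipped to the image."""
    center_x, center_y = center
    width, height = image_size
    half = patch_size // 2
    return (
        max(0, center_x - half),
        max(0, center_y - half),
        min(width, center_x + half),
        min(height, center_y + half),
    )


double_window = adaptive_window((10, 10, 20, 30), scale=2, image_size=(256, 256))
tight_window = adaptive_window((10, 10, 20, 30), scale=1, image_size=(256, 256))
edge_env = environment_box((50, 128), image_size=(256, 256))
```

A scale of 2 corresponds to the CoNSeP setting and a scale of 1 to the other datasets; the resulting crop would then be resized to 32 × 32 pixels.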

Ground-truth labels for the cells of the Oracle and SarcCell datasets were generated by finding the most expressed biomarker (by intensity and quantity) at the same position in the corresponding IHC image. To account for the potential noise associated with image registration, two post-processing steps were performed: 1) the size of the window in the IHC image was set to 5 times the window size in the H&E core (however, this scale was set to 1 for the SarcCell dataset due to more accurate co-registration performance); 2) the most expressed biomarker was considered as the label only if it accounted for at least 70% of the biomarker distribution in the IHC window.
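The second post-processing step amounts to a dominance threshold on the biomarker distribution. The sketch below assumes that a non-negative expression score per biomarker has already been computed for the IHC window; how intensity and quantity are combined into that score is not specified here, and the biomarker names are placeholders.

```python
def assign_label(biomarker_scores, min_fraction=0.7):
    """Return the dominant biomarker, or None if no biomarker is dominant enough.

    biomarker_scores maps a biomarker name to a non-negative expression score
    measured inside the IHC window.
    """
    total = sum(biomarker_scores.values())
    if total <= 0:
        return None  # nothing expressed in the window
    name, score = max(biomarker_scores.items(), key=lambda item: item[1])
    return name if score / total >= min_fraction else None


clear_case = assign_label({"marker_a": 80.0, "marker_b": 20.0})
ambiguous_case = assign_label({"marker_a": 60.0, "marker_b": 40.0})
```

Cells for which the function returns `None` would be left unlabeled rather than assigned a noisy label.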

Implementation details

The code was implemented in PyTorch (v1.9.0), and the model was run on one and two V100 GPUs for the with-environment and without-environment settings, respectively. The batch size was set to 1024 (unless specified otherwise), the queue size to 65,536, and a pre-activated ResNet18 (ref. 61) was used for the backbone and momentum encoder in the Cell Block. The LambdaNet model62 was used as the environment encoder, as it extracts more informative patch representations using self-attention while keeping the computation and memory usage tractable. The whole framework was trained using the Adam optimizer for 500 epochs (unless specified otherwise) with a starting learning rate of 0.001, a cosine learning rate scheduler, and a weight decay of 0.0001. We also adopted a 10-epoch warm-up step. The momentum factor in the momentum encoders was 0.999, and the temperature was set to 0.07.
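The learning rate schedule described above can be sketched as a function of the epoch index. The text states a 10-epoch warm-up and a cosine scheduler but not their exact form; the linear warm-up and the decay to zero used below are assumptions.

```python
import math


def learning_rate(epoch, base_lr=0.001, warmup_epochs=10, total_epochs=500):
    """Linear warm-up followed by cosine decay (both shapes are assumed)."""
    if epoch < warmup_epochs:
        # Ramp linearly so that the last warm-up epoch reaches base_lr.
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


schedule = [learning_rate(epoch) for epoch in range(500)]
```

The list `schedule` holds one learning rate per training epoch and can be inspected or plotted to check the shape before training.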

In the experiments reported in Table 1, the number of training epochs and the batch size of our models were set to 200 and 512, respectively, for the PanNuke Breast, Lizard, Oracle, and SarcCell datasets. Additionally, to reduce the training time on the Oracle dataset, we trained our model on 15,000 randomly selected cells from the training set.

In the self-supervised to supervised transfer learning step (cell classification), we adopted SGD (Stochastic Gradient Descent) with a starting learning rate of 0.001 and a cosine learning rate scheduler for 300 epochs with a batch size of 1024. The weight decay was set to 0.00001. When the encoder was allowed to be fine-tuned, we set the encoder's learning rate to 0.0001.

It is worth mentioning that for the cell classification of NuCLS, we followed the same super-class grouping of the original paper22. In this regard, we only used 3 super-classes out of 5 for cell type classification, including tumor, stromal, and sTILs.

Baselines

The performance was also compared against five baselines. The pre-trained ImageNet model used weights that were pre-trained on the ImageNet dataset to generate the cell embeddings. The Morphological Features approach63 adopted morphological features to produce a 30-dimensional feature vector, consisting of geometrical and shape attributes. Prior to clustering, the feature vectors were normalized to zero mean and unit standard deviation, and their dimensionality was reduced to 2 using t-SNE. The third baseline was Manual Feature27, which used a combination of Scale-Invariant Feature Transform (SIFT) and Local Binary Patterns (LBP) features to provide representations for the cells. Similar to the previous baseline, we standardized the computed feature vectors. Additionally, our baseline set included two state-of-the-art unsupervised deep learning models. More specifically, the Auto-Encoder baseline adopted a deep convolutional auto-encoder alongside a clustering layer to learn cell embeddings by performing an image reconstruction task29. Finally, the last baseline was GAN27, which adopted the idea of InfoGAN28 and developed a Generative Adversarial Network (GAN) for cell clustering by increasing the mutual information between the cell representation and a categorical noise vector.

Statistics & reproducibility

The data selection and stratification were performed completely blind without any previous exposure to the patient or cell data. For public datasets, we used the train and test sets provided by the original publication; however, for the rest of the process, we took a completely blind approach.

The sample sizes used in this study are based on the sets provided by the original publications for the public datasets and on the maximum available data for the private datasets. In both cases, we believe these sample sizes are sufficient for the study, as at least 17,000 samples are available for each dataset.

Due to the stochastic nature of deep learning models, the exact reproduction of an experiment is not possible. However, we conducted each experiment multiple times and used the average of the results as the output.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.


