VOLTA: an enVironment-aware cOntrastive ceLl represenTation leArning for histopathology

Ethics

The Declaration of Helsinki and the International Ethical Guidelines for Biomedical Research Involving Human Subjects were strictly adhered to throughout the course of this study. All study protocols were approved by the University of British Columbia/BC Cancer Research Ethics Board.

Methodology

Fig. 1 provides an overview of the proposed self-supervised method for cell classification. This framework consists of two main blocks: 1) Cell Block; 2) Environment Block. The Cell Block learns the cell embeddings (i.e., representations) by contrasting individual cell-level images while the Environment Block incorporates environment-level information into the cell representations.

Cell block

The architectural design of the Cell Block is similar to our previously proposed model⁵⁸, which has shown promising performance in cell representation learning tasks. In this block, cell embeddings are learned by pulling the embeddings of two augmentations of the same image together while pushing the embeddings of other images away. Let X = {x_i∣1 ≤ i ≤ N} be the input batch of cell images, where N is the number of images in the batch. Each x_i is a small crop of the H&E image around a cell, sized so that it includes only that specific cell. Two different sets of augmentations are applied to X to generate Q = {q_i∣1 ≤ i ≤ N} and K = {k_i∣1 ≤ i ≤ N}. We call these sets query and key, respectively. q_i and k_j are augmentations of the same image if and only if i = j. The query batch is encoded using a backbone model, a neural network of choice, while the keys are encoded using a momentum encoder, which has the same architecture as the backbone. This momentum encoder is updated using (1), in which $\boldsymbol{\theta}_{k}^{t}$ denotes the parameters of the momentum encoder at time t, m is the momentum factor, and $\boldsymbol{\theta}_{q}^{t}$ denotes the parameters of the backbone at time t:

$$\boldsymbol{\theta}_{k}^{t}=m\,\boldsymbol{\theta}_{k}^{t-1}+(1-m)\,\boldsymbol{\theta}_{q}^{t}.$$

(1)
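
The update in (1) can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: parameters are held in plain NumPy arrays, the dictionary key `conv1.weight` is a hypothetical name, and m = 0.999 is the momentum factor reported in the implementation details.

```python
import numpy as np

def momentum_update(theta_k, theta_q, m=0.999):
    """Return new key-encoder parameters following update rule (1)."""
    return {name: m * theta_k[name] + (1.0 - m) * theta_q[name]
            for name in theta_k}

# Toy parameters: one weight tensor per encoder (the name is hypothetical).
theta_q = {"conv1.weight": np.ones((2, 2))}
theta_k = {"conv1.weight": np.zeros((2, 2))}

theta_k = momentum_update(theta_k, theta_q, m=0.999)
```

With m close to 1, the key encoder moves only a small step towards the backbone at each iteration, which is what makes it a slowly varying average of past backbone weights.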

Subsequently, the obtained query and key representations are passed through separate Multi-Layer Perceptron (MLP) networks called projection heads. The query projection head is trained directly, whereas the key projection head is updated with momentum from the weights of the query projection head. We restrict these heads to be 2-layer MLPs with an input size of 512, a hidden size of 128, and an output size of 64. In addition to the projection head, we use an extra MLP on the query side of the framework, called the prediction head. This network is a 2-layer MLP with input, hidden, and output sizes of 64, 32, and 64, respectively. Similar to the last fully-connected layers of a conventional classification network, the projection and prediction heads provide more representation power to the model.
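
To make the stated layer sizes concrete, the sketch below builds the two query-side heads as plain NumPy MLPs and checks the resulting shapes. The ReLU activation, the weight initialisation, and the helper names `make_mlp` and `mlp_forward` are assumptions for illustration; the text specifies only the layer sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_mlp(in_dim, hidden_dim, out_dim):
    """Random weights for a 2-layer MLP (initialisation is an assumption)."""
    return {
        "w1": rng.normal(0.0, 0.02, (in_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "w2": rng.normal(0.0, 0.02, (hidden_dim, out_dim)),
        "b2": np.zeros(out_dim),
    }

def mlp_forward(params, x):
    """Linear -> ReLU -> Linear (the ReLU is an assumption)."""
    hidden = np.maximum(x @ params["w1"] + params["b1"], 0.0)
    return hidden @ params["w2"] + params["b2"]

projection_head = make_mlp(512, 128, 64)   # sizes stated in the text
prediction_head = make_mlp(64, 32, 64)     # sizes stated in the text

features = rng.normal(size=(8, 512))       # stand-in for backbone outputs
z = mlp_forward(projection_head, features) # projected embeddings
p = mlp_forward(prediction_head, z)        # query-side predictions
```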

The networks of the Cell Block are trained using the InfoNCE³⁹ loss, which is shown in (2):

$$L_{q_i}^{cell}=-\log \frac{\exp\left(\frac{\parallel f_q(\mathbf{q}_i)\parallel^{2} \cdot \parallel f_k(\mathbf{k}_i)\parallel^{2}}{\tau}\right)}{\sum_{j=0}^{N+Q}\exp\left(\frac{\parallel f_q(\mathbf{q}_i)\parallel^{2} \cdot \parallel f_k(\mathbf{k}_j)\parallel^{2}}{\tau}\right)}.$$

(2)

In this equation, τ is the temperature that controls the sharpness of the distribution, ∥·∥ is the normalization operator, Q is the number of items stored in the queue from the key branch (distinct from the query set Q defined above), f_q denotes the composition of the backbone, query projection head, and query prediction head, and f_k denotes the composition of the momentum encoder and the key projection head.
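
The loss in (2) can be illustrated with the following NumPy sketch. It assumes that the normalization operator is L2 normalization, that the candidate set consists of the N in-batch keys followed by the queued keys, and that the positive for query i is key i; the function names are hypothetical and the example data are random.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm (assumed meaning of the norm operator)."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def info_nce(q, k, queue, tau=0.07):
    """Per-sample InfoNCE loss.

    q: (N, D) outputs of f_q, k: (N, D) outputs of f_k,
    queue: (Q, D) keys stored from earlier batches.
    """
    q, k, queue = l2_normalize(q), l2_normalize(k), l2_normalize(queue)
    candidates = np.concatenate([k, queue], axis=0)       # (N + Q, D)
    logits = q @ candidates.T / tau                       # (N, N + Q)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(q))
    return -log_prob[idx, idx]                            # positive is key i

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 64))
k = q + 0.01 * rng.normal(size=(4, 64))   # keys close to their queries
queue = rng.normal(size=(16, 64))
loss = info_nce(q, k, queue)
```

When each key is close to its query, as in this toy data, the loss is near zero; pairing queries with the wrong keys drives it up.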

The augmentation pipelines include cropping, color jitter (brightness of 0.4, contrast of 0.4, saturation of 0.4, and hue of 0.1), gray-scale conversion, Gaussian blur (with a random sigma between 0.1 and 2.0), horizontal and vertical flips, and rotation (by an angle randomly selected between 0 and 180 degrees). To ensure the model consistently observes the entire cell image on one side, we eliminate the cropping step from one of the pipelines. Consequently, the pipeline that includes cropping generates localized sections of the cell image, while the other augmentation pipeline produces global images encompassing the complete view of the entire cell. Due to the randomness of augmentations, either one can be passed through the backbone or the momentum encoder.
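
The asymmetry between the two pipelines can be sketched as follows. This deliberately simplified NumPy illustration covers only cropping and flips: color jitter, gray-scale conversion, blur, rotation, and the resize back to the network input size are omitted. The flip probability of 0.5, the minimum crop fraction, and the function names are assumptions.

```python
import numpy as np

def random_flip(img, rng):
    """Horizontal and vertical flips, each with probability 0.5 (assumed)."""
    if rng.random() < 0.5:
        img = img[:, ::-1]
    if rng.random() < 0.5:
        img = img[::-1, :]
    return img

def random_crop(img, rng, min_frac=0.5):
    """Crop a random window covering at least min_frac of each side (assumed)."""
    h, w = img.shape[:2]
    ch = int(rng.integers(int(h * min_frac), h + 1))
    cw = int(rng.integers(int(w * min_frac), w + 1))
    top = int(rng.integers(0, h - ch + 1))
    left = int(rng.integers(0, w - cw + 1))
    return img[top:top + ch, left:left + cw]

def two_views(img, rng):
    """Return (query_view, key_view); only one of them is cropped."""
    global_view = random_flip(img, rng)                   # full cell, no crop
    local_view = random_flip(random_crop(img, rng), rng)  # localized section
    views = [global_view, local_view]
    if rng.random() < 0.5:       # either view may feed the backbone
        views.reverse()
    return views[0], views[1]

rng = np.random.default_rng(0)
cell = rng.random((32, 32, 3))
query_view, key_view = two_views(cell, rng)
```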

Cell embeddings are generated from the trained momentum encoder at inference time and are clustered by applying the K-means algorithm. One can use either the encoder or the momentum encoder for embedding generation; however, the momentum encoder provides more robust representations since it aggregates the learned weights of the encoder network from all of the training steps (an ensembled version of the encoder throughout training)³³.
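
The clustering step can be illustrated with a compact K-means (Lloyd's algorithm) in NumPy. The embeddings here are synthetic stand-ins for momentum-encoder outputs, and the initialisation, iteration count, and function name are assumptions; in practice a library implementation would normally be used.

```python
import numpy as np

def kmeans(x, n_clusters, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), n_clusters, replace=False)].copy()
    for _ in range(n_iter):
        # Squared Euclidean distance of every point to every center.
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for c in range(n_clusters):
            if np.any(labels == c):          # keep old center if cluster empty
                centers[c] = x[labels == c].mean(axis=0)
    return labels, centers

# Synthetic embeddings: two well-separated groups of 20 cells each.
rng = np.random.default_rng(1)
embeddings = np.concatenate([
    rng.normal(0.0, 0.1, (20, 8)),
    rng.normal(5.0, 0.1, (20, 8)),
])
labels, centers = kmeans(embeddings, n_clusters=2)
```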

Environment block

Many studies have shown that the Tumor Micro Environment (TME) plays an important role in tumor progression³²,⁵⁷. Motivated by these findings, we ask: should the representation of a cell reflect its environment as well? Inspired by this question, we hypothesize that a deeper knowledge of the environment leads to a better general understanding of the cell. In a mathematical formulation, this hypothesis is equivalent to the assumption that there exists mutual information between cells and their environment. Therefore, to validate this hypothesis, we propose to increase the mutual information between the corresponding cell and environment representations during the training process. Previous studies⁵⁹ have shown that minimizing the InfoNCE loss maximizes a lower bound on the mutual information between different views of the image. Thus, we use this loss function to achieve the aforementioned target by performing cross-modal contrastive learning as an auxiliary task.

Let E = {e_i∣1 ≤ i ≤ N} be the corresponding environment patches of the cells represented by X. Here, we refer to the environment as a large region around a cell that includes the surrounding tissue and cells. Therefore, for all i ∈ {1, 2, …, N}, x_i and e_i are centered on the same cell (however, for the cases where the cells are located on the edge of the patch, we limit the patch border to the border of the image). After applying an augmentation pipeline, the environment patches are passed through an encoder network, called an environment encoder. Simultaneously, we apply a new projection head, the environment projection head, to the cell representations obtained from the query backbone in the Cell Block. Finally, one can train the Environment Block using these two sets of representations (environment and cell) and the loss in (3):

$$L_{q_i}^{env}=-\log \frac{\exp\left(\frac{\parallel g_{cell}(\mathbf{q}_i)\parallel^{2} \cdot \parallel g_{env}(\mathbf{e}_i)\parallel^{2}}{\tau}\right)}{\sum_{j=0}^{N}\exp\left(\frac{\parallel g_{cell}(\mathbf{q}_i)\parallel^{2} \cdot \parallel g_{env}(\mathbf{e}_j)\parallel^{2}}{\tau}\right)}.$$

(3)

Therefore, the final loss of the whole framework can be written as (4), in which λ is a hyperparameter. Increasing the value of λ prioritizes the mutual information of the cell with its environment over the consistency of the representation for different augmentations of the same cell:

$${L}_{{q}_{i}}={L}_{{q}_{i}}^{cell}+\lambda {L}_{{q}_{i}}^{env}.$$

(4)
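
A minimal NumPy sketch of the environment loss in (3) and the weighted sum in (4) is given below. It assumes L2 normalization for the norm operator and in-batch environment embeddings as the only candidates; the value of λ, the placeholder Cell Block loss, and the function names are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def environment_loss(cell_emb, env_emb, tau=0.07):
    """Cross-modal InfoNCE: cell i should match environment i."""
    c, e = l2_normalize(cell_emb), l2_normalize(env_emb)
    logits = c @ e.T / tau                                # (N, N)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.diag(log_prob)

def total_loss(cell_loss, env_loss, lam=1.0):
    """Weighted sum in (4); lam = 1.0 is only a placeholder value."""
    return cell_loss + lam * env_loss

rng = np.random.default_rng(0)
cell_emb = rng.normal(size=(4, 64))                  # g_cell outputs (toy data)
env_emb = cell_emb + 0.01 * rng.normal(size=(4, 64)) # g_env outputs (toy data)
env_loss = environment_loss(cell_emb, env_emb)
cell_loss = np.full(4, 0.5)                          # placeholder Cell Block loss
loss = total_loss(cell_loss, env_loss, lam=1.0)
```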

The augmentation pipeline of the Environment Block uses the same operations as that of the Cell Block except for cropping.

To prevent the model from focusing on the overlapping regions between the corresponding cell and environment images (called shortcut⁶⁰, meaning that the model uses undesired features to solve the problem), we mask the target cell in the environment patch. Furthermore, the rest of the cells in the environment patch are also masked to ensure that the model does not bias the representation of a cell towards the neighboring cell types. We will investigate the effectiveness of the masking operation in the ablation study.
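
The masking step can be sketched as below, assuming an instance map aligned with the environment patch in which 0 marks background and positive integers mark individual cells. The fill value of 0 and the function name are assumptions; the text does not state what the masked pixels are replaced with.

```python
import numpy as np

def mask_cells(env_patch, instance_map, fill_value=0.0):
    """Blank out every segmented cell, including the target cell.

    env_patch: (H, W, 3) image; instance_map: (H, W) integer labels,
    0 = background, k > 0 = pixels of cell k.
    """
    masked = env_patch.copy()
    masked[instance_map > 0] = fill_value   # fill value is an assumption
    return masked

env_patch = np.ones((6, 6, 3))
instance_map = np.zeros((6, 6), dtype=int)
instance_map[1:3, 1:3] = 1     # target cell
instance_map[4:6, 4:6] = 2     # neighbouring cell
masked = mask_cells(env_patch, instance_map)
```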

Data preparation

The aforementioned datasets included patch-level images, while we required cell-level ones for the training of the model. To generate such data, we used the instance segmentation provided in each of the external datasets to find cells and crop a small box around them. However, for the Oracle and SarcCell datasets, the instance segmentation masks were generated by applying HoVer-Net²¹ segmentation pre-trained on the PanNuke dataset.

An adaptive window size was used to extract cell images from the H&E slides. More specifically, this window is selected based on the size of the cell, and this strategy is utilized to prevent overlapping with other cells. The adaptive window size was set to twice the size of the cell for the CoNSeP dataset while it was equal to the size of the cell for the rest of the datasets. Finally, cell images were resized to 32 × 32 pixels (to enable batch-wise processing operations) and were normalized to zero mean and unit standard deviation before being fed into our proposed framework. The environment patch used in the Environment Block was set to 200 pixels for all datasets.
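
The two cropping operations can be sketched as follows. The bounding-box format, the nearest-neighbour resize, and the choice to shift the environment window so that it stays inside the image are assumptions; the text states only the window scale (2 for CoNSeP, 1 otherwise), the 32 × 32 output size, and the 200-pixel environment patch.

```python
import numpy as np

def crop_cell(image, bbox, scale=1.0, out_size=32):
    """Crop a window of `scale` times the cell size and resize it.

    bbox = (top, left, bottom, right) of the cell in pixels (assumed format).
    """
    top, left, bottom, right = bbox
    cy, cx = (top + bottom) / 2.0, (left + right) / 2.0
    half_h = scale * (bottom - top) / 2.0
    half_w = scale * (right - left) / 2.0
    h, w = image.shape[:2]
    y0, y1 = max(int(round(cy - half_h)), 0), min(int(round(cy + half_h)), h)
    x0, x1 = max(int(round(cx - half_w)), 0), min(int(round(cx + half_w)), w)
    crop = image[y0:y1, x0:x1]
    # Nearest-neighbour resize (the interpolation method is an assumption).
    rows = np.arange(out_size) * crop.shape[0] // out_size
    cols = np.arange(out_size) * crop.shape[1] // out_size
    return crop[rows][:, cols]

def crop_environment(image, center, size=200):
    """Crop a size x size patch, shifted to stay inside the image (assumed)."""
    h, w = image.shape[:2]
    cy, cx = center
    y0 = int(min(max(cy - size // 2, 0), max(h - size, 0)))
    x0 = int(min(max(cx - size // 2, 0), max(w - size, 0)))
    return image[y0:y0 + size, x0:x0 + size]

rng = np.random.default_rng(0)
image = rng.random((256, 256, 3))
cell_img = crop_cell(image, bbox=(100, 100, 110, 112), scale=2.0)
env_img = crop_environment(image, center=(5, 250))   # cell near the border
```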

Ground-truth label generation for the cells of the Oracle and SarcCell datasets was performed by finding the most expressed biomarker (by intensity and quantity) at the same position in the corresponding IHC image. To account for the potential noise associated with image registration, two post-processing steps were performed: 1) the size of the window in the IHC image was set to 5 times the window size in the H&E core (however, this scale was set to 1 for the SarcCell dataset due to more accurate co-registration performance); 2) the most expressed biomarker was considered as the label only if it accounted for at least 70% of the biomarker distribution in the IHC window.
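
The second post-processing rule can be sketched as a simple threshold on the dominant biomarker's share. The sketch assumes that each biomarker has already been reduced to a single non-negative expression score within the IHC window; how intensity and quantity are combined into that score is not specified here, and the marker names are hypothetical.

```python
def assign_label(biomarker_scores, min_share=0.70):
    """Return the dominant biomarker, or None if its share is below min_share.

    biomarker_scores: dict mapping biomarker name to a non-negative score
    measured inside the IHC window (the scoring itself is an assumption).
    """
    total = sum(biomarker_scores.values())
    if total <= 0:
        return None                      # no signal in the window
    name, value = max(biomarker_scores.items(), key=lambda item: item[1])
    return name if value / total >= min_share else None

confident = assign_label({"marker_a": 80.0, "marker_b": 20.0})
ambiguous = assign_label({"marker_a": 60.0, "marker_b": 40.0})
empty = assign_label({"marker_a": 0.0, "marker_b": 0.0})
```

Cells for which no biomarker reaches the threshold receive no label in this sketch, which mirrors the idea of discarding ambiguous windows.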

Implementation details

The code was implemented in PyTorch (v1.9.0), and the model was run on one and two V100 GPUs for the with- and without-environment settings, respectively. The batch size was set to 1024 (unless specified otherwise), the queue size to 65,536, and pre-activated ResNet18⁶¹ was used for the backbone and momentum encoder in the Cell Block. The environment encoder architecture was set to the LambdaNet model⁶² as it extracts more informative patch representations using self-attention while keeping the computation and memory usage tractable. The full model was trained using the Adam optimizer for 500 epochs (unless specified otherwise) with a starting learning rate of 0.001, a cosine learning rate scheduler, and a weight decay of 0.0001. We also adopted a 10-epoch warm-up step. The momentum factor in the momentum encoders was 0.999, and the temperature was set to 0.07.
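
The learning rate schedule described above can be sketched as a function of the epoch index. The base rate, warm-up length, and epoch count come from the text; the linear shape of the warm-up, the per-epoch (rather than per-step) update, and the decay towards zero are assumptions.

```python
import math

def learning_rate(epoch, base_lr=1e-3, warmup_epochs=10, total_epochs=500):
    """Warm-up followed by cosine decay, indexed by a 0-based epoch."""
    if epoch < warmup_epochs:
        # Linear warm-up (the exact warm-up shape is an assumption).
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

schedule = [learning_rate(epoch) for epoch in range(500)]
```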

In the Table 1 experiments, the training epoch count and batch size of our models were set to 200 and 512, respectively, for the PanNuke Breast, Lizard, Oracle, and SarcCell datasets. Additionally, for the training of our model on the Oracle dataset, we used 15,000 randomly selected cells from the training set to reduce the training time.

In the self-supervised to supervised transfer learning step (cell classification), we adopted SGD (Stochastic Gradient Descent) with a starting learning rate of 0.001 using a cosine learning rate scheduler for 300 epochs with a batch size of 1024. Also, the weight decay was set to 0.00001. In the case that we allowed the encoder to be fine-tuned, we set the encoder’s learning rate to 0.0001.

It is worth mentioning that for the cell classification of NuCLS, we followed the same super-class grouping of the original paper²². In this regard, we only used 3 super-classes out of 5 for cell type classification, including tumor, stromal, and sTILs.

Baselines

The performance was also compared against five baselines. The pre-trained ImageNet model used weights that were pre-trained on the ImageNet dataset to generate the cell embeddings. The Morphological Features approach⁶³ adopted morphological features to produce a 30-dimensional feature vector, consisting of geometrical and shape attributes. Prior to clustering, the feature vectors were normalized to zero mean and unit standard deviation, and their dimensionality was reduced to 2 using t-SNE. The third baseline was Manual Feature²⁷, which used a combination of Scale-Invariant Feature Transform (SIFT) and Local Binary Patterns (LBP) features to provide representations for the cells. Similar to the previous baseline, we standardized the computed feature vectors. Additionally, our baseline set included two state-of-the-art unsupervised deep learning models. More specifically, the Auto-Encoder baseline adopted a deep convolutional auto-encoder alongside a clustering layer to learn cell embeddings by performing an image reconstruction task²⁹. Finally, the last baseline was GAN²⁷, which adopted the idea of InfoGAN²⁸ and developed a Generative Adversarial Network (GAN) for cell clustering by increasing the mutual information between the cell representation and a categorical noise vector.

Statistics & reproducibility

The data selection and stratification were performed completely blind without any previous exposure to the patient or cell data. For public datasets, we used the train and test sets provided by the original publication; however, for the rest of the process, we took a completely blind approach.

The sample sizes used in this study are based on the sets provided by the original publications for the public datasets and on all available data for the private datasets. In both cases, we believe these sample sizes are sufficient for the study as at least 17,000 samples are available for each dataset.

Due to the stochastic nature of deep learning models, the exact reproduction of an experiment is not possible. However, we conducted each experiment multiple times and used the average of the results as the output.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
