A triple-pronged approach for ulcerative colitis severity classification using multimodal, meta, and transformer-based learning



Problem formulation

For the classification of ulcerative colitis (UC) severity into two categories, mild and severe, we define the dataset as \({D} = \{(x_i, y_i)\}_{i=1}^N\), where \(x_i\) represents the input images and \(y_i \in \{0, 1\}\) denotes the corresponding class labels. Mild UC represents localized inflammation with minimal symptoms, whereas severe UC is characterized by widespread inflammation accompanied by frequent and severe symptoms such as bloody diarrhea and intense abdominal pain. The rationale for this severity classification is grounded in established clinical criteria outlined in the Mayo Clinic Score6, which is widely used to determine disease stage and inform treatment strategies. This study employs a dataset labeled according to these standardized criteria, ensuring clinical relevance and consistency. Our objective is to learn a mapping function \(f: {X} \rightarrow {Y}\), parameterized by \(\theta\), to accurately predict the severity class based on the input features. To address the challenges posed by limited data availability and computational constraints, we propose three distinct approaches: (a) Multimodal approach \(f_{\text {MM}}(x_i) = y_i\), which leverages pre-trained models and requires no additional training; (b) Few-shot meta-learning \(f_{\text {FS}}(x_i) = y_i\), enabling rapid adaptation to new classes with minimal data and lower computational costs; and (c) Vision Transformers with ensemble learning \(f_{\text {ViT}}(x_i) = \text {Ensemble}(\{\text {ViT}_k(x_i)\}_{k=1}^K) = y_i\), designed to enhance model performance through aggregation. The models are trained by minimizing a loss function \(\mathcal {L}(\theta )\), which quantifies the discrepancy between predicted and actual labels across the dataset \({D}\).

Multimodal approach

Fig. 1

A summary of the proposed multimodal approaches for classifying the severity of ulcerative colitis. Three distinct approaches were employed: (a) Classification using pre-trained multimodal models, (b) Ensembling multimodal models (\(M_1, M_2, \dots , M_k\)) with soft voting to enhance performance, and (c) Extracting image features using multimodal models, followed by classification with ensembles of traditional ML classifiers. For the additional classifiers, we used Logistic Regression, Gradient Boost, and GaussianNB, while for the meta-classifier, we used only Logistic Regression.

To classify the severity of ulcerative colitis efficiently, we leveraged multimodal models that eliminate the need for computationally expensive training processes. Our approach involves three distinct methodologies for severity classification: (1) Direct classification using pre-trained multimodal models, (2) Multimodal model ensembles aggregated through soft voting, and (3) Classification using traditional machine learning algorithms applied to features extracted from multimodal models. These strategies are based on the strengths of pre-trained models while minimizing resource demands.

Pre-trained multimodal classification

We employed pre-trained multimodal models to classify ulcerative colitis severity (mild or severe) without any task-specific fine-tuning. This approach provides an efficient and computationally lightweight baseline for our experiments. By leveraging pre-trained models, we reduce the computational overhead substantially while benefiting from their generalized feature extraction capabilities. Specifically, we employed pre-trained versions of CLIP (B/16)18, CLIP (B/32)18, CLIP (L/14)18, BLIP22, and FLAVA28. For classification, we encoded 90% of the image data as standard samples and evaluated the performance on the remaining 10% test set. Cosine similarity and Manhattan distance were used as the distance metrics to classify the test images based on their proximity to the encoded standard samples. Fig. 1a presents the classification process using a pre-trained multimodal model.
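The nearest-standard-sample classification described above can be sketched as follows, assuming the image embeddings have already been produced by a multimodal encoder such as CLIP; the function name and array shapes are illustrative, not the authors' implementation:

```python
import numpy as np

def classify_by_similarity(query_emb, standard_embs, standard_labels, metric="cosine"):
    """Assign each query image the label of its closest encoded standard sample.

    query_emb:       (m, d) embeddings of test images
    standard_embs:   (n, d) embeddings of the 90% 'standard' samples
    standard_labels: (n,) labels (0 = mild, 1 = severe)
    """
    if metric == "cosine":
        # Normalize so the dot product equals cosine similarity (higher = closer).
        q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
        s = standard_embs / np.linalg.norm(standard_embs, axis=1, keepdims=True)
        nearest = (q @ s.T).argmax(axis=1)
    else:
        # Manhattan (L1) distance: lower = closer.
        dist = np.abs(query_emb[:, None, :] - standard_embs[None, :, :]).sum(axis=2)
        nearest = dist.argmin(axis=1)
    return standard_labels[nearest]

# Tiny synthetic demo: two well-separated standard samples, two queries.
standards = np.array([[1.0, 0.0], [0.0, 1.0]])
labels = np.array([0, 1])
queries = np.array([[0.9, 0.1], [0.2, 0.8]])
pred_cos = classify_by_similarity(queries, standards, labels, metric="cosine")
pred_l1 = classify_by_similarity(queries, standards, labels, metric="manhattan")
```

Both metrics agree on this toy data; in practice the choice matters when embedding magnitudes vary across models.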

Multimodal ensemble classification

To enhance the performance of individual multimodal models such as CLIP and BLIP, we implemented a soft voting-based ensembling approach29, as illustrated in Fig. 1b. In this method, each model independently outputs a probability vector \(\hat{y_i}^{(k)}\), representing the probabilities for each of the two classes: \(\hat{y_i}^{(k)} = [p_{mild}^{(k)}, p_{severe}^{(k)}]\). The probability scores from each model \(\hat{y_i}^{(k)}\) are then averaged to compute the final prediction vector \(\hat{y_i}\):

$$\begin{aligned} \hat{y_i} = \frac{1}{K} \sum _{k=1}^K \hat{y_i}^{(k)} \end{aligned}$$

(1)

Thus, \(\hat{y_i}\) becomes the final probability vector: \(\hat{y_i} = [p_{mild}, p_{severe}]\). The class with the highest probability in \(\hat{y_i}\) is then selected as the ensemble’s final prediction:

$$\begin{aligned} \tilde{y}_i = \arg \max _{c \in \{0,1\}} \hat{y}_{i,c} \end{aligned}$$

(2)

We experimented with ensembles comprising three and five multimodal models. This ensemble strategy was chosen because it combines the strengths of different pre-trained models, each of which captures visual and textual features in slightly different ways. Additionally, soft voting is appropriate for this task because it uses the probability scores to improve consistency and better handle the subtle differences in UC severity. Notably, the three-model ensemble—consisting of CLIP (B/16), CLIP (B/32), and CLIP (L/14)—demonstrated the best performance. This ensembling approach improves classification accuracy compared to individual multimodal models while eliminating the need for computationally expensive model training.
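Equations (1) and (2) amount to a few lines of code. The sketch below assumes each model's two-class probability vector has already been computed; the helper name is illustrative:

```python
import numpy as np

def soft_vote(prob_vectors):
    """Average per-model probability vectors (Eq. 1) and take the argmax (Eq. 2).

    prob_vectors: (K, 2) array, row k = [p_mild, p_severe] from model k.
    Returns (avg_probs, predicted_class) with class 0 = mild, 1 = severe.
    """
    probs = np.asarray(prob_vectors, dtype=float)
    avg = probs.mean(axis=0)          # Eq. (1): elementwise mean over K models
    return avg, int(avg.argmax())     # Eq. (2): highest-probability class

# Three hypothetical model outputs for one image.
avg, pred = soft_vote([[0.6, 0.4], [0.8, 0.2], [0.4, 0.6]])
```

Because the inputs are valid probability vectors, the averaged vector also sums to one, so no renormalization is needed.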

ML ensemble-based multimodal feature classification

To further enhance classification performance, we explored traditional machine learning ensemble models by utilizing features extracted from multimodal models such as CLIP (B/32) and EVA-CLIP (B/16). The process involved two key steps: (a) extracting features from the multimodal models, and (b) passing these features as input to machine learning classifiers. We implemented two ensemble strategies to combine the predictions as illustrated in Fig. 1c. This approach was selected because classical machine learning ensemble classifiers are effective for handling high-dimensional feature representations. Additionally, these classifiers are lightweight and computationally efficient, making them easy to train even with limited resources. Moreover, combining multimodal model-based feature extraction with traditional ML classifiers significantly boosts performance.

  1.

    Soft voting. Predictions from three distinct base classifiers—K-Nearest Neighbors (KNN), Support Vector Machine (SVM), and Random Forest (RF)—were aggregated using a soft-voting strategy:

    $$\begin{aligned} P_{\text {soft}}(y) = \frac{1}{n} \sum _{i=1}^n P_i (y) \end{aligned}$$

    (3)

    Here, \(P_i (y)\) represents the probability of class y predicted by the i-th base classifier, and n denotes the total number of classifiers.

    Additional classifiers, including Logistic Regression (LR), Gradient Boost (GB), and GaussianNB (GNB), were incorporated into the ensemble to further improve performance.

  2.

    Stacking. In the stacking ensemble, the outputs of the base classifiers (KNN, SVM, and RF) were combined as inputs to a meta-classifier, Logistic Regression (LR). The base classifiers’ predictions are represented as \(H = [h_1, h_2, h_3]\), and the meta-classifier combines these predictions to produce the final output: \(\hat{y} = f_\text {meta} (H)\), where \(\hat{y}\) is the final prediction. This hierarchical approach integrates multimodal learning with traditional machine learning algorithms, creating a robust and generalizable framework for ulcerative colitis severity detection.
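Both ensemble strategies map directly onto scikit-learn's built-in estimators. The sketch below uses synthetic data as a stand-in for the extracted multimodal features, so it shows the wiring rather than the reported results:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for CLIP/EVA-CLIP feature vectors and binary severity labels.
X, y = make_classification(n_samples=200, n_features=32, random_state=0)

base = [
    ("knn", KNeighborsClassifier()),
    ("svm", SVC(probability=True, random_state=0)),  # probability=True is required for soft voting
    ("rf", RandomForestClassifier(random_state=0)),
]

# (a) Soft voting: average the base classifiers' probabilities (Eq. 3).
voter = VotingClassifier(estimators=base, voting="soft").fit(X, y)

# (b) Stacking: base predictions H = [h1, h2, h3] feed a Logistic Regression meta-classifier.
stack = StackingClassifier(estimators=base, final_estimator=LogisticRegression()).fit(X, y)
```

Swapping in the additional classifiers (LR, GB, GNB) only requires extending the `base` list.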

Hyperparameter tuning for ML classifiers

Hyperparameter tuning is essential for optimizing model performance by adjusting parameters such as learning rate, batch size, and regularization to prevent overfitting and enhance generalization. This study employs GridSearchCV with 5-fold cross-validation to thoroughly evaluate various hyperparameter combinations and identify the optimal configuration. Unlike RandomizedSearchCV, which selects hyperparameters randomly, GridSearchCV systematically explores all possible combinations within the defined search space. Table 1 presents the explored hyperparameter space and the best configurations chosen for our experiments.
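The GridSearchCV procedure with 5-fold cross-validation looks as follows; the grid shown here is a small illustrative example (the actual search space is the one in Table 1), and the data is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Placeholder data standing in for the extracted multimodal features.
X, y = make_classification(n_samples=150, n_features=16, random_state=0)

# Illustrative grid; every combination (here 3 x 2 = 6) is fit on each of 5 folds.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5)  # exhaustive search, 5-fold CV
search.fit(X, y)
best = search.best_params_  # configuration with the highest mean CV score
```

`search.best_estimator_` is then refit on the full training set and used for the final evaluation.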

Table 1 Overview of the hyperparameter search space and optimal parameters identified using GridSearchCV.
Fig. 2

Methodology of the employed few-shot meta-learning framework for categorizing the severity of ulcerative colitis, utilizing a ResNet-18 backbone as a feature extractor. Two different meta-learning approaches were used: (a) Matching Networks and (b) Prototypical Networks. Here, \(S_i\) and \(Q_j\) represent support set images and query set images, respectively, while \(S_e\) and \(Q_e\) represent support embeddings and query embeddings, respectively.

Few-shot meta-learning

In this study, we implemented a few-shot meta-learning framework30 to classify the severity of ulcerative colitis into two categories: mild and severe. To address the challenge of limited labeled data, we adopted two meta-learning techniques: Matching Networks31 and Prototypical Networks32. Both techniques employed a 5-shot binary classification setup, with ResNet-189 used as the backbone feature extractor as illustrated in Fig. 2. ResNet-18 is selected as the backbone model for its strong generalization capability without overfitting, particularly in scenarios with limited data availability, as demonstrated in several existing studies33,34. With fewer layers than its deeper counterparts, ResNet-18 demands less computational power and memory, making it well-suited for efficient training and deployment in meta-learning scenarios. Overall, ResNet-18 strikes an optimal balance between computational efficiency and performance accuracy.

The dataset was partitioned into meta-learning tasks, each consisting of a support set of labeled examples and a query set for inference. The tasks were organized as follows: 12 training tasks, 5 validation tasks, and 5 testing tasks. Each task included a support set with five labeled images per class, forming a 5-shot classification scenario, and a query set containing unlabeled images for evaluation. This setup simulated a realistic few-shot learning environment, where models are expected to generalize effectively from a limited number of labeled examples.
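One such task can be constructed by sampling per class, so that both the support and query sets stay class-balanced. This is a minimal sketch over precomputed feature vectors; the function name is illustrative:

```python
import numpy as np

def make_episode(features, labels, n_shot=5, n_query=5, rng=None):
    """Build one 2-way episode: a 5-shot support set plus a query set.

    Sampling is done separately per class (0 = mild, 1 = severe) so each
    set contains exactly n_shot / n_query examples of each class.
    """
    rng = rng or np.random.default_rng(0)
    support_idx, query_idx = [], []
    for c in (0, 1):
        idx = rng.permutation(np.flatnonzero(labels == c))
        support_idx.extend(idx[:n_shot])                     # labeled support images
        query_idx.extend(idx[n_shot:n_shot + n_query])       # held-out query images
    s, q = np.array(support_idx), np.array(query_idx)
    return (features[s], labels[s]), (features[q], labels[q])

# Demo: 40 images (20 per class) with 8-dimensional features.
feats = np.arange(320, dtype=float).reshape(40, 8)
labs = np.array([0] * 20 + [1] * 20)
(sx, sy), (qx, qy) = make_episode(feats, labs)
```

Repeating this draw yields the 12 training, 5 validation, and 5 testing tasks described above.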

Matching networks

In the matching networks meta-learning approach31, features are first extracted using the ResNet-18 backbone9. Let \(\theta (\cdot )\) represent the pre-trained ResNet-18 encoder. For each support image \(S_i\) and query image \(Q_j\), we compute the support embeddings \(\textbf{f}_i^S\) and query embeddings \(\textbf{f}_j^Q\) as follows:

$$\begin{aligned} \textbf{f}_i^S = \theta (S_i) \quad \text {and} \quad \textbf{f}_j^Q = \theta (Q_j) \end{aligned}$$

(4)

Next, \(\ell _2\) normalization is applied to both the support and query embeddings to obtain normalized embeddings \(\tilde{\textbf{f}}_i^S\) and \(\tilde{\textbf{f}}_j^Q\):

$$\begin{aligned} \tilde{\textbf{f}}_i^S = \frac{\textbf{f}_i^S}{\Vert \textbf{f}_i^S\Vert _2} \quad \text {and} \quad \tilde{\textbf{f}}_j^Q = \frac{\textbf{f}_j^Q}{\Vert \textbf{f}_j^Q\Vert _2} \end{aligned}$$

(5)

Given n support embeddings \({\tilde{\textbf{f}}_1^S, \dots , \tilde{\textbf{f}}_n^S}\) and m query embeddings \({\tilde{\textbf{f}}_1^Q, \dots , \tilde{\textbf{f}}_m^Q}\), we compute the similarity between each query embedding \(\tilde{\textbf{f}}_j^Q\) and each support embedding \(\tilde{\textbf{f}}_i^S\) using a dot product:

$$\begin{aligned} s_{j,i} = \tilde{\textbf{f}}_j^Q \cdot \tilde{\textbf{f}}_i^S \end{aligned}$$

(6)

These similarity scores are then passed through a softmax function over the support embedding dimension for each query j:

$$\begin{aligned} \alpha _{j,i} = \frac{\exp \bigl (s_{j,i}\bigr )}{\sum _{k=1}^n \exp \bigl (s_{j,k}\bigr )} \end{aligned}$$

(7)

Here, \(\alpha _{j,i}\) can be interpreted as an attention weight that quantifies how strongly query j is associated with support sample i. Finally, the predicted label for query j is obtained by computing a weighted sum of all support labels \(\textbf{y}_i^S\):

$$\begin{aligned} \hat{\textbf{y}}_j = \sum _{i=1}^n \alpha _{j,i} \, \textbf{y}_i^S. \end{aligned}$$

(8)

Prototypical networks

Prototypical networks32 classify query images based on their distances to class prototypes, which are computed as the mean embeddings of support set images. The ResNet-18 backbone9, denoted by \(\theta (\cdot )\), is used to extract embeddings for both support (\(S_i\)) and query (\(Q_j\)) images.

$$\begin{aligned} \textbf{f}_i^S = \theta (S_i) \quad \text {and} \quad \textbf{f}_j^Q = \theta (Q_j) \end{aligned}$$

(9)

For each class \(k\), the prototype \(c_k\) is calculated by averaging the embeddings of all support images (\(\textbf{f}_i^S\)) belonging to class \(k\):

$$\begin{aligned} c_k = \frac{1}{|S_k|} \sum _{S_i \in S_k} \textbf{f}_{i}^S \end{aligned}$$

(10)

where \(S_k\) represents the set of support images for class \(k\). The Euclidean distance between the embeddings of a query image \(\textbf{f}_j^Q\) and each class prototype \(c_k\) is computed as:

$$\begin{aligned} d(\textbf{f}_j^Q, c_k) = \Vert \textbf{f}_j^Q - c_k\Vert _2^2. \end{aligned}$$

(11)

The computed distances \(d(\textbf{f}_j^Q, c_k)\) are then converted into probabilities using a softmax activation function:

$$\begin{aligned} P(y = k | j) = \frac{\exp (-d(\textbf{f}_j^Q, c_k))}{\sum _{i} \exp (-d(\textbf{f}_j^Q, c_i))}. \end{aligned}$$

(12)

Finally, the query image is assigned to the class with the highest probability:

$$\begin{aligned} \hat{y}_j = \arg \max _k P(y = k | j). \end{aligned}$$

(13)

Model training and optimization

Both matching networks and prototypical networks were trained on meta-learning tasks using carefully partitioned support and query sets. The training process utilized the categorical cross-entropy loss function, which ensures that the models learn to distinguish between classes effectively by minimizing prediction errors.

This meta-learning framework is designed to generalize well to new tasks by leveraging the few-shot learning paradigm35. The 12:5:5 task split was designed to balance meta-training diversity and evaluation rigor, consistent with existing few-shot medical imaging studies by Singh et al.36. Training tasks encapsulated heterogeneous disease presentations, while validation and testing tasks simulated unseen data scenarios. To mitigate sampling bias, we employed class-stratified sampling in both support and query sets as presented in Laenen et al.37 and Finn et al.38. This ensured balanced class distributions within each task, reducing the risk of bias due to class imbalance and supporting fair generalization assessment. It enables reliable classification of ulcerative colitis severity, even with a limited amount of labeled data. By simulating tasks during training, the models learn to adapt quickly to new data, achieving strong performance on challenging medical image classification problems.

Fig. 3

Ensembled architecture for vision transformers designed for ulcerative colitis severity classification. The ensemble employs two distinct voting strategies: weighted voting and soft voting.

Vision transformers

We experimented with (a) pre-trained ViT-based classification and (b) ViT ensembles aggregated through soft voting. All models in the ensembles were trained independently and combined only during inference, ensuring resource efficiency by avoiding the computational overhead associated with joint training. These ViT-based methods raise the performance metrics to the state-of-the-art level.

Pre-trained ViT for UC classification

Vision Transformers have shown great performance in complex image classification tasks. In our study, we employed several pre-trained vision transformers (ViT) to classify the severity of ulcerative colitis, such as ViT39, DeiT40, and Swin41.

  1.

    ViT model: The Vision Transformer (ViT) is a robust architecture that divides input images into non-overlapping patches of size \(n \times n\) pixels, treating each patch as a “token,” analogous to words in natural language processing (NLP) models. This architecture has demonstrated state-of-the-art performance across various computer vision tasks. The ViT architecture is composed of four key components: (a) Image Patching and Embedding, (b) Positional Encoding, (c) Transformer Encoder, and (d) Classification Head (MLP Head)39. In our study, we utilized two variants of ViT: ViT-Base and ViT-Large. The base version employs a patch size of 16 and contains 85.8 million parameters, while the large version uses a patch size of 32 and includes 305.5 million parameters.

  2.

    DeiT model: The Data-efficient Image Transformer (DeiT) shares the same architecture as Vision Transformer (ViT) models but is specifically optimized for smaller datasets. Similar to ViT-Base, DeiT uses a patch embedding with 16 patches. Additionally, it incorporates knowledge distillation42 into its architecture. The input sequence of the DeiT model includes a distillation token to enhance performance40. In our study, we utilized two variants of DeiT: DeiT-Small and DeiT-Base, which have 21.6 million and 85.8 million parameters, respectively.

  3.

    Swin transformer: The Swin Transformer is a variant of the Vision Transformer (ViT) designed for various computer vision tasks, including image classification and object detection. It processes images hierarchically using a shifted window attention mechanism, effectively capturing both local and global features41. The Swin models employed in our work divide the images into \(4 \times 4\) patches. Specifically, we used two variants: Swin-Tiny, with 27.5 million parameters, and Swin-Base, with 86.7 million parameters.

Fig. 4

Sample images from the HyperKvasir dataset for each of the six grades, where Grade 0–1 and Grade 1 are categorized as mild, and the remaining grades are categorized as severe.

Ensembling ViT for UC classification

Pre-trained Vision Transformers (ViT) have demonstrated promising performance in the classification of ulcerative colitis. To further enhance this performance, we employed ensembling techniques as illustrated in Fig. 3 to aggregate the outputs of multiple ViT models. Two distinct ensembling approaches were utilized: weighted voting and soft voting. Both methods improved accuracy, with the soft voting technique achieving the highest performance.

The soft voting ensemble technique works by averaging the probability scores generated by each ViT model, resulting in a final probability vector. The class corresponding to the highest final probability is then predicted by the ensemble. In contrast, the weighted voting ensemble technique assigns weights to individual models based on their performance and calculates a weighted average of the probabilities. The class with the highest combined probability is chosen as the final prediction. Among all the classification techniques employed, the ViT ensembling approach delivered the best results, achieving the highest performance scores.
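The two voting rules differ only in whether the per-model probabilities are averaged uniformly or scaled by model-specific weights. A minimal sketch, assuming the per-model probability vectors are given and using validation accuracy as an example weighting:

```python
import numpy as np

def ensemble_predict(prob_matrix, weights=None):
    """Combine per-model class probabilities by soft or weighted voting.

    prob_matrix: (K, C) probabilities from K ViT models for one image.
    weights:     optional (K,) per-model weights, e.g. validation accuracies;
                 if omitted, plain soft voting (uniform average) is used.
    """
    probs = np.asarray(prob_matrix, dtype=float)
    if weights is None:
        combined = probs.mean(axis=0)                          # soft voting
    else:
        w = np.asarray(weights, dtype=float)
        combined = (w[:, None] * probs).sum(axis=0) / w.sum()  # weighted voting
    return int(combined.argmax()), combined

# Demo: three hypothetical model outputs; the third model is down-weighted.
probs = [[0.4, 0.6], [0.45, 0.55], [0.9, 0.1]]
soft_cls, _ = ensemble_predict(probs)
wt_cls, _ = ensemble_predict(probs, weights=[1.0, 1.0, 0.1])
```

The demo shows why the weighting matters: with uniform averaging the confident third model dominates, while down-weighting it flips the ensemble decision.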


