How to fine-tune an SLM for emotion recognition

1. Introduction

Small language models (SLMs) fine-tuned for sentiment classification infer sentiment as a single score that captures the overall emotional tone of a text. In many use cases, a positive/negative categorization alone doesn’t tell the entire story a company needs. Emotion recognition models go a step further: they decompose sentiment into emotion classes (“anger”, “approval”, “disappointment”, etc.) and assign probabilities to a set of emotions in a text. This allows businesses to model the emotional content of the data they receive (customer tickets, emails, brand-related discussions) and quickly respond to changing conditions.

One of our recent projects, modeling emotions in online media, required an emotion recognition model with open weights and flexible licensing that maintains high transparency standards and, of course, benefits from the cost savings associated with open models. Although we subjectively prefer European models, Hugging Face did not offer a Mistral-based alternative with a developed model card. One possible reason is that GoEmotions, the most detailed training set for emotion recognition with 28 labels, is highly class-imbalanced. Fine-tuning SLMs on highly class-imbalanced datasets so that they perform well in tests requires extra care.

We combined three techniques to handle the class imbalance problem: (1) undersampling the most frequent categories, (2) oversampling the minority categories with the ISMOTE algorithm published in Scientific Reports in 2025, and (3) weighting the loss function. With this combination, MistralSmall-3.1.GoEmotions, currently released on Hugging Face, infers most of the project’s target emotions with F1 > 0.7.

This article details how to fine-tune your open-weight SLM. You will also learn:

How to preprocess class-imbalanced data for LLM fine-tuning with the 2025 ISMOTE algorithm.
How to decompose sentiment into emotion categories by fine-tuning small language models for emotion recognition in text data.

2. Data

GoEmotions is a human-annotated dataset of 58,000 Reddit comments extracted from English-language subreddits and labeled with 27 emotion categories plus a “neutral” label. It is a multi-label classification dataset: each comment may have multiple TRUE labels (e.g. “She hit me. Even though she wasn’t actually trying to hit her, it just added another interesting dynamic.” is labeled both “amusement” and “annoyance”).

The version released on TensorFlow Datasets under the Apache 2.0 license contains 54,263 labeled texts. A sample is shown below.

Image 1. GoEmotions dataset. Image by author.

A quick check reveals a high degree of imbalance in the data. The “neutral” category dominates:

Image 2. Class imbalance in the GoEmotions dataset. Image by author.
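
A check of this kind can be sketched in a few lines. The snippet below counts how often each label is active in a multi-hot label matrix. The three-label `label_names` list and the toy `labels` rows are illustrative assumptions, not the real GoEmotions data.

```python
from collections import Counter

# Hypothetical toy data: each row is a multi-hot vector over label_names.
label_names = ["amusement", "annoyance", "neutral"]
labels = [
    [0, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
    [1, 0, 0],
]

def label_frequencies(rows, names):
    """Return the share of rows in which each label is set to 1."""
    counts = Counter()
    for row in rows:
        for name, flag in zip(names, row):
            counts[name] += flag
    return {name: counts[name] / len(rows) for name in names}

freqs = label_frequencies(labels, label_names)
# Print labels from most to least frequent.
for name, share in sorted(freqs.items(), key=lambda kv: -kv[1]):
    print(f"{name:<10} {share:.2f}")
```

Because the data is multi-label, the shares do not have to sum to one: a comment with two emotions counts toward both.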

3. Preprocessing the training set

Our goal is to develop a classifier that identifies 15 emotions in common language texts. Preprocessing is essential because training on class-imbalanced data can introduce bias as the fine-tuned model tends to favor the majority class and perform poorly on the minority class.

I applied a combination of methods to the training set to address class imbalance and maximize performance on the target emotions (fear, sadness, disgust, disapproval, annoyance, anger, disappointment, optimism, amusement, surprise, admiration, excitement, confusion, joy, love). The validation and test sets were left unchanged:

We undersampled the majority class by randomly filtering out “neutral” rows.
We generated synthetic samples for the least represented emotion categories using ISMOTE (Improved Synthetic Minority Oversampling Technique).
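
The undersampling step could look roughly like the sketch below. It assumes that only rows whose single active label is the majority label are candidates for removal, so that rows carrying a minority emotion are never dropped. That rule, the function name `undersample_label`, and the toy data are assumptions for illustration, not the project's actual code.

```python
import random

def undersample_label(rows, label_index, keep, seed=42):
    """Keep at most `keep` rows whose ONLY active label is `label_index`.

    Rows carrying any other label are always kept, so minority
    emotions are not lost while the majority class is thinned.
    """
    rng = random.Random(seed)
    majority_only = [r for r in rows if r[label_index] == 1 and sum(r) == 1]
    others = [r for r in rows if not (r[label_index] == 1 and sum(r) == 1)]
    kept = rng.sample(majority_only, min(keep, len(majority_only)))
    return others + kept

# Hypothetical toy label matrix; the last column plays the role of "neutral".
rows = [[0, 0, 1]] * 8 + [[1, 0, 0]] * 2 + [[0, 1, 1]]
balanced = undersample_label(rows, label_index=2, keep=3)
print(len(rows), "->", len(balanced))
```

Fixing the seed keeps the filtered training set reproducible across runs.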

The ISMOTE algorithm extends the popular SMOTE technique by (1) expanding the sample generation space and (2) improving the sampling distribution. The synthetic samples follow a more realistic data distribution than those generated with the original method.

Image 3. Flowchart of ISMOTE algorithm. Source: Scientific Reports.

By reducing the majority class and expanding the minority categories to a total of 4000 samples, we built a relatively balanced set for fine-tuning. The code for ISMOTE oversampling can be found here.

Image 4. Relative label frequencies in the training (augmented), validation, and test sets. Image by author.
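
To give a feel for what oversampling in embedding space does, here is a minimal sketch of the classic SMOTE interpolation step that ISMOTE builds on. It is deliberately not ISMOTE: it omits the expanded generation space and the improved sampling distribution, and it ignores the multi-label aspect. The function name `smote_like` and the 2-D toy points are hypothetical.

```python
import math
import random

def smote_like(samples, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (classic SMOTE)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(samples)
        # Rank the other minority samples by Euclidean distance to `base`.
        neighbours = sorted(
            (s for s in samples if s is not base),
            key=lambda s: math.dist(base, s),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position on the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Hypothetical 2-D embeddings of one minority emotion class.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_like(minority, n_new=5)
print(new_points)
```

Every synthetic point lies on a line segment between two real samples, which is exactly the restriction that ISMOTE relaxes by widening the region in which new samples may be drawn.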

4. Fine-tuning the SLM

Among the Mistral models, I chose the Small class (Small-3.1-24B-Instruct-2503), which is GPU-friendly and provides the multilingual functionality required by the classifier. The Unsloth framework makes the fine-tuning step easier and faster than plain Transformers.

1. Data loading: load the preprocessed training, validation, and test sets. We use a 60:20:20 split.

2. Loading the base model: load Small-3.1-24B-Instruct-2503 locally.

3. Applying LoRA: lower the hardware requirements.

4. Multi-label wrapper with focal loss: wrap the model for multi-label classification. We also add a focal loss that weights the loss function toward a selected set of emotions to favor their performance.

5. Evaluation metrics and training arguments: specify the evaluation metrics and hyperparameters for model training.

6. Model training: set up and launch the trainer.

7. Evaluation: evaluate the best model on the test set.

4.1. Coding

The code implementation is as follows:

4.1.1. Data loading

import json

from datasets import Dataset

# Load the augmented train, validation, and test sets
BASE = r"augmented"

def load_split(path: str) -> Dataset:
    with open(path, encoding="utf-8") as f:
        d = json.load(f)
    return Dataset.from_dict({"input_embeds": d["X"], "labels": d["y"]})

train_dataset = load_split(f"{BASE}/train.json")
val_dataset   = load_split(f"{BASE}/val.json")
test_dataset  = load_split(f"{BASE}/test.json")

# Infer the embedding dimension from the first training example
EMBED_DIM = len(train_dataset[0]["input_embeds"])

# Return PyTorch tensors
train_dataset.set_format("torch")
val_dataset.set_format("torch")
test_dataset.set_format("torch")
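
The loader above reads two keys from each JSON file, `"X"` and `"y"`. A minimal sketch of a file with that layout is shown below. The toy dimensions (4-dimensional embeddings, 3 labels) and the temporary path are assumptions for illustration. The real files hold one precomputed embedding and one multi-hot label row per text.

```python
import json
import os
import tempfile

# Hypothetical toy split: two 4-dimensional embeddings with 3 labels each.
toy_split = {
    "X": [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]],  # one embedding per text
    "y": [[1, 0, 0], [0, 1, 1]],                          # multi-hot label rows
}

# Write the split to a temporary file and read it back.
path = os.path.join(tempfile.mkdtemp(), "train.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(toy_split, f)

with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

embed_dim = len(loaded["X"][0])   # plays the role of EMBED_DIM
num_labels = len(loaded["y"][0])  # number of emotion labels
print(embed_dim, num_labels)
```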

4.1.2. Loading the base model

import torch
from unsloth import FastLanguageModel

# Load the base model with Unsloth FastLanguageModel
MODEL_NAME = "unsloth/Mistral-Small-3.1-24B-Instruct-2503"

base_model, _ = FastLanguageModel.from_pretrained(
    model_name=MODEL_NAME,
    max_seq_length=2,
    load_in_4bit=True,
    dtype=torch.bfloat16,
)

4.1.3. Apply LoRA

# Apply low-rank adaptation (LoRA)
base_model = FastLanguageModel.get_peft_model(
    base_model,
    r=16,
    lora_alpha=32,
    lora_dropout=0,
    bias="none",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    use_rslora=False,
)
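
A bit of arithmetic shows why LoRA lowers the hardware requirements. For a frozen weight matrix of shape d_out x d_in, LoRA trains two small matrices of rank r, and with `use_rslora=False` the update is scaled by `lora_alpha / r`. The sketch below uses a hypothetical 4096 x 4096 projection, which is not the real layer size of the model.

```python
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by LoRA to one d_out x d_in weight matrix:
    A has shape (r x d_in) and B has shape (d_out x r)."""
    return r * (d_in + d_out)

def lora_scaling(lora_alpha: float, r: int) -> float:
    """Scaling applied to the low-rank update when use_rslora=False."""
    return lora_alpha / r

# Hypothetical square projection of width 4096 (not the real model size).
d = 4096
full = d * d
added = lora_trainable_params(d, d, r=16)
print(f"full matrix: {full:,} params, LoRA adds: {added:,} ({added / full:.2%})")
print("scaling:", lora_scaling(32, 16))
```

With r=16 and lora_alpha=32 as in the configuration above, the scaling factor is 2, and the trainable share of each adapted matrix stays below one percent in this toy setting.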

4.1.4. Multi-label wrapper with focal loss functionality

from torch import nn

# Focal loss weights for preferred labels
FOCAL_ALPHA_DEFAULT   = 0.25
FOCAL_ALPHA_PREFERRED = 0.75

PREFERRED_LABELS = {
    "fear", "sadness", "disgust", "disapproval", "annoyance",
    "anger", "disappointment", "optimism", "amusement", "surprise",
    "admiration", "excitement", "confusion","joy","love"
}

# EMOTION_LABELS is assumed to be the ordered list of label names
# matching the columns of the label matrix (defined elsewhere in the project)
FOCAL_ALPHA_PER_LABEL: list[float] = [
    FOCAL_ALPHA_PREFERRED if lbl in PREFERRED_LABELS else FOCAL_ALPHA_DEFAULT
    for lbl in EMOTION_LABELS
]

class FocalLossWithAlpha(nn.Module):
    """Per-label weighted focal binary cross-entropy for multi-label problems."""

    def __init__(self, alpha: list[float], gamma: float = 2.0):
        super().__init__()
        self.register_buffer("alpha", torch.tensor(alpha, dtype=torch.float32))
        self.gamma = gamma

    def forward(self, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        probs   = torch.sigmoid(logits)
        p_t     = probs * targets + (1.0 - probs) * (1.0 - targets)
        alpha_t = self.alpha * targets + (1.0 - self.alpha) * (1.0 - targets)
        focal_w = alpha_t * (1.0 - p_t) ** self.gamma
        bce     = nn.functional.binary_cross_entropy_with_logits(
            logits, targets, reduction="none"
        )
        return (focal_w * bce).mean()

# Multilabel classification wrapper with focal loss class weighting
class MistralForMultiLabel(nn.Module):
    is_loaded_in_4bit = True

    def __init__(self, backbone: nn.Module, num_labels: int,
                 hidden_size: int, embed_dim: int):
        super().__init__()
        self.backbone = backbone
        _device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.projection = nn.Sequential(
            nn.Linear(embed_dim, hidden_size // 2),
            nn.GELU(),
            nn.Linear(hidden_size // 2, hidden_size),
        ).to(_device)
        self.dropout    = nn.Dropout(0.1).to(_device)
        self.classifier = nn.Linear(hidden_size, num_labels).to(_device)
        self.focal_loss = FocalLossWithAlpha(FOCAL_ALPHA_PER_LABEL).to(_device)

    def gradient_checkpointing_enable(self, gradient_checkpointing_kwargs=None):
        self.backbone.gradient_checkpointing_enable(gradient_checkpointing_kwargs)

    def gradient_checkpointing_disable(self):
        self.backbone.gradient_checkpointing_disable()

    def forward(
        self,
        input_embeds: torch.Tensor,
        labels: torch.Tensor | None = None,
        **kwargs,
    ):
        B = input_embeds.size(0)
        projected = self.projection(input_embeds).unsqueeze(1)
        attn_mask = torch.ones(B, 1, device=input_embeds.device)

        outputs = self.backbone.base_model.model.model(
            inputs_embeds=projected,
            attention_mask=attn_mask,
            output_hidden_states=True,
        )
        pooled = outputs.hidden_states[-1][:, 0, :]
        logits = self.classifier(self.dropout(pooled))

        loss = self.focal_loss(logits, labels.float()) if labels is not None else None
        return {"loss": loss, "logits": logits}
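
To see what the focal term and the per-label alpha do numerically, the following dependency-free sketch reproduces the same formula for a single label. The logits are arbitrary illustrative values and the function name `focal_bce` is hypothetical.

```python
import math

def focal_bce(logit: float, target: int, alpha: float, gamma: float = 2.0) -> float:
    """Focal binary cross-entropy for one label, mirroring the formula above."""
    p = 1.0 / (1.0 + math.exp(-logit))           # sigmoid
    p_t = p if target == 1 else 1.0 - p          # probability of the true class
    alpha_t = alpha if target == 1 else 1.0 - alpha
    bce = -math.log(p_t)
    return alpha_t * (1.0 - p_t) ** gamma * bce

easy = focal_bce(logit=3.0, target=1, alpha=0.75)     # confident and correct
hard = focal_bce(logit=-3.0, target=1, alpha=0.75)    # confident and wrong
default = focal_bce(logit=-3.0, target=1, alpha=0.25)
print(f"easy={easy:.5f} hard={hard:.5f} hard(default alpha)={default:.5f}")
```

Two effects are visible. The easy example contributes almost nothing, so training concentrates on hard cases. A missed positive on a preferred label (alpha 0.75) costs three times as much as the same miss on a default label (alpha 0.25), which is how the loss favors the target emotions.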

4.1.5. Evaluation metrics and training arguments

# Specify the evaluation function
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # Sigmoid per label, then threshold at 0.5 for multi-label predictions
    probs = torch.sigmoid(torch.tensor(logits)).numpy()
    preds = (probs >= 0.5).astype(int)
    labels = labels.astype(int)

    exact_accuracy  = accuracy_score(labels, preds)
    macro_f1        = f1_score(labels, preds, average="macro", zero_division=0)
    micro_f1        = f1_score(labels, preds, average="micro", zero_division=0)
    macro_precision = precision_score(labels, preds, average="macro", zero_division=0)
    macro_recall    = recall_score(labels, preds, average="macro", zero_division=0)

    per_class_f1        = f1_score(labels, preds, average=None, zero_division=0)
    per_class_recall    = recall_score(labels, preds, average=None, zero_division=0)
    per_class_precision = precision_score(labels, preds, average=None, zero_division=0)
    per_class_accuracy  = (preds == labels).mean(axis=0)

    per_class_metrics = {}
    for i, emotion in enumerate(EMOTION_LABELS):
        per_class_metrics[f"f1_{emotion}"]        = float(per_class_f1[i])
        per_class_metrics[f"recall_{emotion}"]    = float(per_class_recall[i])
        per_class_metrics[f"precision_{emotion}"] = float(per_class_precision[i])
        per_class_metrics[f"accuracy_{emotion}"]  = float(per_class_accuracy[i])

    return {
        "exact_accuracy":   exact_accuracy,
        "macro_f1":         macro_f1,
        "micro_f1":         micro_f1,
        "macro_precision":  macro_precision,
        "macro_recall":     macro_recall,
        **per_class_metrics,
    }

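# Specify hyperparameters
from transformers import Trainer, TrainingArguments

# OUTPUT_DIR is assumed to be defined earlier as the checkpoint directory
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,            # where checkpoints and logs are written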
    eval_strategy="epoch",            # run evaluation once per epoch
    save_strategy="epoch",            # save checkpoint once per epoch
    per_device_train_batch_size=8,    # samples per GPU per step
    per_device_eval_batch_size=16,    # larger batch is fine — no gradients
    gradient_accumulation_steps=4,    # effective batch = 8 × 4 = 32
    num_train_epochs=15,              # total passes over the training data
    learning_rate=1e-4,               # peak LR after warmup
    bf16=True,                        # bfloat16 mixed precision
    optim="adamw_8bit",               # 8-bit AdamW
    warmup_ratio=0.05,                # first 5 % of steps ramp LR from 0 to peak
    lr_scheduler_type="cosine",       # cosine decay from peak LR to ~0
    logging_steps=25,                 # print loss/LR to console every 25 steps
    logging_first_step=True,          # also log step 1 to catch early instability
    load_best_model_at_end=True,      # restore best checkpoint after training ends
    metric_for_best_model="macro_f1", # criterion used to select the best checkpoint
    greater_is_better=True,           # higher macro_f1 is better in evaluation
    gradient_checkpointing=False,    
    remove_unused_columns=False,      # keep input_embeds column
    save_total_limit=15,              # keep all checkpoints on disk to load the best model
    weight_decay=0.01,                # L2 regularisation on all trainable parameters
)
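
The interaction of `warmup_ratio`, `lr_scheduler_type="cosine"` and `learning_rate` can be sketched without any library. The function below follows the usual linear-warmup-then-cosine-decay shape. The total of 1,000 optimizer steps is a hypothetical value, since the real number depends on the size of the training set.

```python
import math

def lr_at(step: int, total_steps: int, peak_lr: float = 1e-4,
          warmup_ratio: float = 0.05) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Hypothetical run length; the real number of steps depends on the dataset size.
total = 1000
effective_batch = 8 * 4  # per_device_train_batch_size x gradient_accumulation_steps
print("effective batch size:", effective_batch)
for step in (0, 25, 50, 525, 1000):
    print(step, f"{lr_at(step, total):.2e}")
```

The learning rate starts at zero, reaches its peak once the warmup ends, and decays smoothly toward zero by the final step.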

4.1.6. Training the model

import os

# Set up the trainer for multi-label fine-tuning
class MultiLabelTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs, labels=labels)
        loss = outputs["loss"]
        return (loss, outputs) if return_outputs else loss

    def _save_checkpoint(self, model, trial, metrics=None):
        super()._save_checkpoint(model, trial)
        ckpt_dir = self._get_output_dir(trial)
        # Save head
        torch.save({
            "projection": model.projection.state_dict(),
            "classifier":  model.classifier.state_dict(),
        }, os.path.join(ckpt_dir, "head_weights.pt"))
        # Save LoRA adapter explicitly (bypasses bitsandbytes serialization issues)
        model.backbone.save_pretrained(os.path.join(ckpt_dir, "lora_adapter"))

    def _load_best_model(self):
        best_ckpt = self.state.best_model_checkpoint
        if not best_ckpt:
            return
        # Restore head
        head_path = os.path.join(best_ckpt, "head_weights.pt")
        if os.path.exists(head_path):
            head = torch.load(head_path, map_location="cpu")
            self.model.projection.load_state_dict(head["projection"])
            self.model.classifier.load_state_dict(head["classifier"])
            print(f"Head restored from: {best_ckpt}")
        else:
            print(f"WARNING: head_weights.pt not found in {best_ckpt}")
        # Restore LoRA adapter
        lora_path = os.path.join(best_ckpt, "lora_adapter")
        if os.path.exists(lora_path):
            self.model.backbone.load_adapter(lora_path, adapter_name="default")
            print(f"LoRA restored from: {best_ckpt}")
        else:
            print(f"WARNING: lora_adapter/ not found in {best_ckpt}")

# Set up the trainer
# `model` is assumed to be a MistralForMultiLabel instance wrapping base_model
trainer = MultiLabelTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
)

# Launch training
trainer.train()

Fine-tuning for 15 epochs took 9 hours and 30 minutes on a machine with an NVIDIA RTX 6000 GPU and 192 GB of VRAM, after which the best checkpoint was loaded.

4.1.7. Model evaluation

Let’s look at the performance on the test dataset. The standard per-class statistics for model evaluation are precision, recall, and F1. The F1 score shows relatively good performance on the target emotions: it is above 0.7 in most categories. Full performance figures are on the model card.

emotion          precision  recall  F1      N
admiration       0.7415     0.6354  0.6844  993
amusement        0.7810     0.7422  0.7611  543
anger            0.7423     0.7367  0.7395  395
annoyance        0.7049     0.5452  0.6148  609
confusion        0.7576     0.8251  0.7899  303
disappointment   0.8487     0.8459  0.8473  305
disapproval      0.7208     0.5841  0.6453  517
disgust          0.8396     0.9368  0.8856  190
excitement       0.8240     0.9366  0.8767  205
fear             0.9112     0.9686  0.9390  159
joy              0.7577     0.8024  0.7794  339
love             0.7424     0.7903  0.7656  496
optimism         0.8145     0.7636  0.7882  368
sadness          0.8534     0.8899  0.8713  327
surprise         0.8456     0.8555  0.8505  256

Macro precision  0.8295
Macro recall     0.8184
Micro F1         0.7527
Macro F1         0.8215

Table 1: Performance of Mistral Small 3.1-GoEmotions on the test set
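
The per-class F1 values in Table 1 are the harmonic mean of the two score columns next to them, which can be checked by hand. The sketch below does this for the fear row and then uses hypothetical (tp, fp, fn) counts to show one way macro F1 can exceed micro F1, namely when rare classes score better than frequent ones. The counts are invented for illustration and are not a diagnosis of the model.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# The two scores of the "fear" row in Table 1.
fear_f1 = f1(0.9112, 0.9686)
print(round(fear_f1, 4))

# Hypothetical per-class counts (tp, fp, fn) to contrast macro and micro F1.
counts = {"frequent": (90, 30, 30), "rare": (9, 1, 1)}

def precision_recall(tp, fp, fn):
    return tp / (tp + fp), tp / (tp + fn)

# Macro: average the per-class F1 scores, every class counts equally.
macro_f1 = sum(f1(*precision_recall(*c)) for c in counts.values()) / len(counts)

# Micro: pool the counts first, so frequent classes dominate.
tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
micro_f1 = f1(*precision_recall(tp, fp, fn))
print(f"macro={macro_f1:.3f} micro={micro_f1:.3f}")
```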

5. Summary

Let’s summarize the main points of the article. Requirements and complete code can be found in this repository.

Emotion recognition modeling extends sentiment analysis by decomposing the overall sentiment score into emotional components.
MistralSmall-3.1.GoEmotions is available on Hugging Face under the Apache 2.0 license. The repository also contains inference guidelines.
Deployment use cases include brand and social media monitoring and email classification.

Petr Korab is the founder of Text Mining Stories, a development and consulting company based in Prague. To learn more about cutting-edge NLP, check out our blog.

AI statement. Part of the code was reviewed by Sonnet 4.6 (Cursor). No text was generated using AI.

Acknowledgment. The Slovak National Bank Foundation supported this development. We would like to thank Martin Feldkircher, Václav Jež and Michala Moravcová for comments and suggestions.

References

[1] Ying Li, Yali Yang, Peihua Song, Lian Duan, Rui Ren. 2025. Improved SMOTE algorithm to enhance imbalanced data classification by expanding sample generation space. Scientific Reports, 15 (23521).

[2] Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics, 8, pp. 726-742.
