1. Introduction
A small language model (SLM) fine-tuned for sentiment classification infers sentiment as a single score that captures the overall emotional tone of a text. In many use cases, a positive/negative categorization alone doesn’t tell the whole story a company needs. Emotion recognition models go a step further: they decompose sentiment into emotion classes (“anger”, “approval”, “disappointment”, etc.) and assign a probability to each emotion in a text. This allows businesses to model the emotional content of the data they receive (customer tickets, emails, brand-related discussions) and respond quickly to changing conditions.
One of our recent projects, modeling emotions in online media, required an emotion recognition model with open weights and flexible licensing that maintains high transparency standards and, of course, benefits from the cost savings associated with open models. Although we subjectively prefer European models, Hugging Face did not offer a Mistral-based option with a well-developed model card. One possible reason is that GoEmotions, the most detailed training set for emotion recognition with 28 labels, is highly class-imbalanced. Fine-tuning SLMs that perform well in tests on such highly imbalanced datasets requires extra care.
We combined three techniques to handle the class imbalance problem: (1) undersampling the most frequent emotion categories, (2) synthetically expanding the minority classes using the 2025 ISMOTE algorithm published in Nature’s Scientific Reports [1], and (3) weighting the loss function. With this combination, MistralSmall-3.1.GoEmotions, currently released on Hugging Face, infers most of the project’s target emotions with F1 > 0.7.
This article details how to fine-tune an open-weight SLM. You will also learn:
- How to preprocess class-imbalanced data for LLM fine-tuning with the 2025 ISMOTE algorithm.
- How to decompose sentiment into emotion categories by fine-tuning a small language model for emotion recognition in text data.
2. Data
GoEmotions is a human-annotated dataset of 58,000 Reddit comments extracted from English-language subreddits and labeled with 27 emotion categories plus a “neutral” label. It is a multi-label classification dataset: each comment may have multiple TRUE labels representing its emotions (e.g., “She hit me. Even though she wasn’t actually trying to hit her, it just added another interesting dynamic.” is TRUE for “amusement” and “annoyance”).
The dataset is released on TensorFlow Datasets under the Apache 2.0 license and contains 54,263 labeled texts. It is shown below.
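A multi-label target of this kind is commonly stored as a multi-hot vector, with one 0/1 entry per label. The sketch below illustrates the idea on a hypothetical four-label subset; the names `LABELS` and `to_multi_hot` are illustrative and not part of the dataset tooling.

```python
# Hypothetical label subset for illustration; the full dataset has 28 labels.
LABELS = ["amusement", "annoyance", "joy", "neutral"]


def to_multi_hot(active: set[str], labels: list[str]) -> list[int]:
    """Return a 0/1 vector with a 1 for every label assigned to the comment."""
    unknown = active - set(labels)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1 if name in active else 0 for name in labels]


# A comment annotated with two emotions at once
vector = to_multi_hot({"amusement", "annoyance"}, LABELS)
print(vector)  # [1, 1, 0, 0]
```

Because several entries can be 1 at the same time, each label is later treated as its own binary decision rather than as one choice among mutually exclusive classes.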

A quick check reveals a high degree of imbalance in the data, with the “neutral” category dominating:
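A check of this kind can be sketched in a few lines. The annotations below are toy data made up for illustration (they are not taken from GoEmotions), and `label_shares` is a hypothetical helper name.

```python
from collections import Counter

# Toy multi-label annotations (hypothetical, not taken from GoEmotions).
annotations = [
    ["neutral"], ["neutral"], ["neutral"], ["neutral"],
    ["amusement", "annoyance"], ["joy"], ["neutral"], ["fear"],
]


def label_shares(rows):
    """Count each label and return (label, count, share of rows), most frequent first."""
    counts = Counter(label for row in rows for label in row)
    n_rows = len(rows)
    return [(label, c, c / n_rows) for label, c in counts.most_common()]


shares = label_shares(annotations)
for label, count, share in shares:
    print(f"{label:<10} {count:>3} {share:6.1%}")
```

Shares are computed per row, so in a multi-label dataset they can sum to more than 100%.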

3. Preprocessing the training set
Our goal is to develop a classifier that identifies 15 emotions in common language texts. Preprocessing is essential because training on class-imbalanced data can introduce bias as the fine-tuned model tends to favor the majority class and perform poorly on the minority class.
We combined several methods on the training set to address class imbalance and maximize performance on the target emotions (fear, sadness, disgust, disapproval, annoyance, anger, disappointment, optimism, amusement, surprise, admiration, excitement, confusion, joy, love). The validation and test sets were left unchanged:
- We thinned out the data by randomly filtering out “neutral” rows.
- We generated synthetic samples for the least represented emotion categories using ISMOTE (Improved Synthetic Minority Oversampling Technique).
The ISMOTE algorithm [1] extends the popular SMOTE technique by (1) expanding the sample generation space and (2) improving the sampling distribution. The synthetically generated samples have a more realistic data distribution than those generated using the original method.
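For orientation, the sketch below shows the classic SMOTE interpolation step that ISMOTE builds on. It is not ISMOTE itself: it only places new points on the segment between a minority sample and one of its nearest minority neighbours, which is the generation space that ISMOTE is described as expanding. The function name `smote_like` and the toy points are assumptions for illustration.

```python
import math
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (classic SMOTE)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position on the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy 2-D "embeddings" of a minority class (hypothetical values)
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_like(minority, n_new=5)
print(new_points)
```

In this article's setting the points would be text embeddings rather than 2-D toy vectors, and the synthetic samples would inherit the label of the minority class they were generated from.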

By reducing the majority class and expanding the minority categories to a total of 4000 samples, we built a relatively balanced set for fine-tuning. The code for ISMOTE oversampling can be found here.
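The majority-class reduction can be sketched as random undersampling. The version below assumes that only rows whose single label is the majority label are candidates for removal, so no minority annotations are lost; the article does not state this detail, and `undersample_label` is a hypothetical helper.

```python
import random


def undersample_label(rows, label, keep, seed=0):
    """Keep at most `keep` rows whose only label is `label`; keep all other rows."""
    rng = random.Random(seed)
    only_label = [i for i, r in enumerate(rows) if r["labels"] == [label]]
    n_drop = max(0, len(only_label) - keep)
    drop = set(rng.sample(only_label, n_drop))
    return [r for i, r in enumerate(rows) if i not in drop]


# Toy rows: ten neutral-only comments and two comments with other labels
rows = [{"text": f"t{i}", "labels": ["neutral"]} for i in range(10)]
rows += [
    {"text": "a", "labels": ["joy"]},
    {"text": "b", "labels": ["neutral", "fear"]},
]
reduced = undersample_label(rows, "neutral", keep=3)
print(len(rows), "->", len(reduced))  # 12 -> 5
```

Fixing the seed keeps the reduced training set reproducible across runs.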

4. Fine-tuning the SLM
Among the Mistral models, we chose the Small class (Small-3.1-24B-Instruct-2503), which can be fine-tuned on a single GPU and provides the multilingual capability required by the classifier. The Unsloth framework makes the fine-tuning step easier and faster than plain Transformers.
1. Data load: load the preprocessed training, validation, and test sets. We use a 60:20:20 split.
2. Loading the base model: load Small-3.1-24B-Instruct-2503 locally.
3. Apply LoRA: lower the hardware requirements.
4. Multi-label wrapper with a focal loss function: wrap the base model for multi-label classification and add a focal loss that weights the loss for a selected set of emotions to favor their performance.
5. Evaluation metrics and training arguments: specify the evaluation metrics and hyperparameters for model training.
6. Model training: set up and launch the trainer.
7. Evaluation: evaluate the best model on the test set.
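As a small illustration of the 60:20:20 split mentioned in step 1, the sketch below shuffles a list of examples and cuts it into three parts. It is a generic sketch, not the project's splitting code, and `split_60_20_20` is a hypothetical name.

```python
import random


def split_60_20_20(items, seed=0):
    """Shuffle the items and return (train, validation, test) in a 60:20:20 ratio."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 60 // 100  # integer arithmetic avoids float rounding
    n_val = len(shuffled) * 20 // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


train, val, test = split_60_20_20(range(100))
print(len(train), len(val), len(test))  # 60 20 20
```

Splitting before any resampling keeps synthetic or duplicated samples out of the validation and test sets.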
4.1. Coding
The code implementation is as follows:
4.1.1. Data load
import json

from datasets import Dataset

# Load the augmented train set and the validation and test sets
BASE = r"augmented"

def load_split(path: str) -> Dataset:
    """Read one JSON split with input embeddings (X) and labels (y)."""
    with open(path, encoding="utf-8") as f:
        d = json.load(f)
    return Dataset.from_dict({"input_embeds": d["X"], "labels": d["y"]})

train_dataset = load_split(f"{BASE}/train.json")
val_dataset = load_split(f"{BASE}/val.json")
test_dataset = load_split(f"{BASE}/test.json")

# Embedding dimension of the inputs
EMBED_DIM = len(train_dataset[0]["input_embeds"])

# Return PyTorch tensors
train_dataset.set_format("torch")
val_dataset.set_format("torch")
test_dataset.set_format("torch")
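4.1.2. Loading the base model
import torch
from unsloth import FastLanguageModel

# Load the base model with Unsloth FastLanguageModel
MODEL_NAME = "unsloth/Mistral-Small-3.1-24B-Instruct-2503"
base_model, _ = FastLanguageModel.from_pretrained(
    model_name=MODEL_NAME,
    max_seq_length=2,      # each input is a single embedding vector (see the wrapper's forward pass)
    load_in_4bit=True,     # 4-bit quantized weights
    dtype=torch.bfloat16,
)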
4.1.3. Apply LoRA
# Apply low-rank adaptation (LoRA)
base_model = FastLanguageModel.get_peft_model(
    base_model,
    r=16,                # rank of the low-rank update matrices
    lora_alpha=32,       # scaling numerator; the update is scaled by lora_alpha / r
    lora_dropout=0,
    bias="none",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    use_rslora=False,
)
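To make the effect of `r` and `lora_alpha` concrete, the sketch below reproduces the standard LoRA forward pass on toy matrices: the frozen weight `W` is left untouched and a low-rank product `B A`, scaled by `alpha / r`, is added on top. The layer size of 4096 is a hypothetical value chosen only to show the parameter savings, not the actual dimension of the Mistral layers.

```python
def lora_param_counts(d_out, d_in, r):
    """Trainable parameters of a full weight update vs. a rank-r LoRA update."""
    full = d_out * d_in
    lora = r * (d_in + d_out)  # A is (r x d_in), B is (d_out x r)
    return full, lora


def matvec(M, v):
    """Plain matrix-vector product for lists of lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(x, W, A, B, alpha, r):
    """y = W x + (alpha / r) * B (A x), with W frozen and A, B trainable."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]


full, lora = lora_param_counts(4096, 4096, r=16)  # hypothetical layer size
print(full, lora)  # 16777216 131072

W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]          # rank r = 1 for the toy example
x = [2.0, 3.0]
y_init = lora_forward(x, W, A, [[0.0], [0.0]], alpha=2, r=1)   # B starts at zero
y_tuned = lora_forward(x, W, A, [[1.0], [0.0]], alpha=2, r=1)
print(y_init, y_tuned)  # [2.0, 3.0] [12.0, 3.0]
```

With `B` initialised to zero the adapted layer starts out identical to the base layer, and `alpha / r = 2` in the toy call matches the ratio `32 / 16` used in the configuration above.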
4.1.4. Multi-label wrapper with focal loss functionality
import torch
import torch.nn as nn

# EMOTION_LABELS must hold the label names in the column order of the label
# vectors; its definition is not part of this listing.

# Focal loss weights for preferred labels
FOCAL_ALPHA_DEFAULT = 0.25
FOCAL_ALPHA_PREFERRED = 0.75
PREFERRED_LABELS = {
    "fear", "sadness", "disgust", "disapproval", "annoyance",
    "anger", "disappointment", "optimism", "amusement", "surprise",
    "admiration", "excitement", "confusion", "joy", "love",
}
FOCAL_ALPHA_PER_LABEL: list[float] = [
    FOCAL_ALPHA_PREFERRED if lbl in PREFERRED_LABELS else FOCAL_ALPHA_DEFAULT
    for lbl in EMOTION_LABELS
]

class FocalLossWithAlpha(nn.Module):
    """Per-label weighted focal binary cross-entropy for multi-label problems."""

    def __init__(self, alpha: list[float], gamma: float = 2.0):
        super().__init__()
        self.register_buffer("alpha", torch.tensor(alpha, dtype=torch.float32))
        self.gamma = gamma

    def forward(self, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        probs = torch.sigmoid(logits)
        # Probability assigned to the true outcome of each label
        p_t = probs * targets + (1.0 - probs) * (1.0 - targets)
        # alpha for positive targets, (1 - alpha) for negative targets
        alpha_t = self.alpha * targets + (1.0 - self.alpha) * (1.0 - targets)
        # Down-weight labels that are already classified confidently
        focal_w = alpha_t * (1.0 - p_t) ** self.gamma
        bce = nn.functional.binary_cross_entropy_with_logits(
            logits, targets, reduction="none"
        )
        return (focal_w * bce).mean()

# Multi-label classification wrapper with focal loss class weighting
class MistralForMultiLabel(nn.Module):
    is_loaded_in_4bit = True

    def __init__(self, backbone: nn.Module, num_labels: int,
                 hidden_size: int, embed_dim: int):
        super().__init__()
        self.backbone = backbone
        _device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        # Map the input embedding to the backbone's hidden size
        self.projection = nn.Sequential(
            nn.Linear(embed_dim, hidden_size // 2),
            nn.GELU(),
            nn.Linear(hidden_size // 2, hidden_size),
        ).to(_device)
        self.dropout = nn.Dropout(0.1).to(_device)
        self.classifier = nn.Linear(hidden_size, num_labels).to(_device)
        self.focal_loss = FocalLossWithAlpha(FOCAL_ALPHA_PER_LABEL).to(_device)

    def gradient_checkpointing_enable(self, gradient_checkpointing_kwargs=None):
        self.backbone.gradient_checkpointing_enable(gradient_checkpointing_kwargs)

    def gradient_checkpointing_disable(self):
        self.backbone.gradient_checkpointing_disable()

    def forward(
        self,
        input_embeds: torch.Tensor,
        labels: torch.Tensor | None = None,
        **kwargs,
    ):
        B = input_embeds.size(0)
        # Each example becomes a sequence of length 1: (B, 1, hidden_size)
        projected = self.projection(input_embeds).unsqueeze(1)
        attn_mask = torch.ones(B, 1, device=input_embeds.device)
        outputs = self.backbone.base_model.model.model(
            inputs_embeds=projected,
            attention_mask=attn_mask,
            output_hidden_states=True,
        )
        # Last-layer hidden state of the single position
        pooled = outputs.hidden_states[-1][:, 0, :]
        logits = self.classifier(self.dropout(pooled))
        loss = self.focal_loss(logits, labels.float()) if labels is not None else None
        return {"loss": loss, "logits": logits}
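The effect of the focal weighting is easiest to see on single labels. The scalar sketch below follows the same formula as `FocalLossWithAlpha` (alpha_t · (1 − p_t)^gamma · BCE) without PyTorch; the logits are made-up values and `focal_bce` is a hypothetical helper name.

```python
import math


def focal_bce(logit, target, alpha, gamma=2.0):
    """Focal binary cross-entropy for one label: alpha_t * (1 - p_t)**gamma * BCE."""
    p = 1.0 / (1.0 + math.exp(-logit))
    p_t = p if target == 1 else 1.0 - p            # probability of the true outcome
    alpha_t = alpha if target == 1 else 1.0 - alpha
    bce = -math.log(p_t)
    return alpha_t * (1.0 - p_t) ** gamma * bce


# A confidently correct positive vs. a misclassified positive (preferred label)
easy = focal_bce(logit=4.0, target=1, alpha=0.75)
hard = focal_bce(logit=-1.0, target=1, alpha=0.75)
print(f"easy={easy:.6f} hard={hard:.6f}")

# The same positive example under the preferred and the default alpha
preferred = focal_bce(logit=0.0, target=1, alpha=0.75)
default = focal_bce(logit=0.0, target=1, alpha=0.25)
print(f"preferred/default = {preferred / default:.1f}")
```

The easy example contributes almost nothing to the loss, while a positive example of a preferred label counts three times as much as the same example under the default alpha.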
4.1.5. Evaluation metrics and training arguments
import torch
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Specify the evaluation function
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # Independent sigmoid per label, thresholded at 0.5
    probs = torch.sigmoid(torch.tensor(logits)).numpy()
    preds = (probs >= 0.5).astype(int)
    labels = labels.astype(int)

    # Aggregate metrics (exact accuracy requires all labels of a sample to match)
    exact_accuracy = accuracy_score(labels, preds)
    macro_f1 = f1_score(labels, preds, average="macro", zero_division=0)
    micro_f1 = f1_score(labels, preds, average="micro", zero_division=0)
    macro_precision = precision_score(labels, preds, average="macro", zero_division=0)
    macro_recall = recall_score(labels, preds, average="macro", zero_division=0)

    # Per-class metrics
    per_class_f1 = f1_score(labels, preds, average=None, zero_division=0)
    per_class_recall = recall_score(labels, preds, average=None, zero_division=0)
    per_class_precision = precision_score(labels, preds, average=None, zero_division=0)
    per_class_accuracy = (preds == labels).mean(axis=0)

    per_class_metrics = {}
    for i, emotion in enumerate(EMOTION_LABELS):
        per_class_metrics[f"f1_{emotion}"] = float(per_class_f1[i])
        per_class_metrics[f"recall_{emotion}"] = float(per_class_recall[i])
        per_class_metrics[f"precision_{emotion}"] = float(per_class_precision[i])
        per_class_metrics[f"accuracy_{emotion}"] = float(per_class_accuracy[i])

    return {
        "exact_accuracy": exact_accuracy,
        "macro_f1": macro_f1,
        "micro_f1": micro_f1,
        "macro_precision": macro_precision,
        "macro_recall": macro_recall,
        **per_class_metrics,
    }
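Macro and micro F1 can diverge sharply on imbalanced data, which is why both are tracked. The dependency-free sketch below computes them from true/false positive counts on a toy two-label problem; the data and helper names are made up for illustration.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); defined as 0 when there is nothing to count."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(y_true, y_pred):
    """Return (macro F1, micro F1) for multi-hot label matrices."""
    n_labels = len(y_true[0])
    per_label = []
    tot_tp = tot_fp = tot_fn = 0
    for j in range(n_labels):
        tp = sum(t[j] == 1 and p[j] == 1 for t, p in zip(y_true, y_pred))
        fp = sum(t[j] == 0 and p[j] == 1 for t, p in zip(y_true, y_pred))
        fn = sum(t[j] == 1 and p[j] == 0 for t, p in zip(y_true, y_pred))
        per_label.append(f1_from_counts(tp, fp, fn))
        tot_tp, tot_fp, tot_fn = tot_tp + tp, tot_fp + fp, tot_fn + fn
    macro = sum(per_label) / n_labels               # every label counts equally
    micro = f1_from_counts(tot_tp, tot_fp, tot_fn)  # every decision counts equally
    return macro, micro


# Label 0 is frequent and always right; label 1 is rare and missed
y_true = [[1, 0], [1, 0], [1, 0], [0, 1]]
y_pred = [[1, 0], [1, 0], [1, 0], [0, 0]]
macro, micro = macro_micro_f1(y_true, y_pred)
print(f"macro={macro:.3f} micro={micro:.3f}")  # macro=0.500 micro=0.857
```

Selecting the best checkpoint by macro F1 therefore rewards models that do well on rare emotions, not only on frequent ones.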
from transformers import TrainingArguments

# Specify hyperparameters
# OUTPUT_DIR (a path string) must be defined beforehand; it is not part of this listing.
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,             # where checkpoints and logs are written
    eval_strategy="epoch",             # run evaluation once per epoch
    save_strategy="epoch",             # save a checkpoint once per epoch
    per_device_train_batch_size=8,     # samples per GPU per step
    per_device_eval_batch_size=16,     # larger batch is fine, no gradients
    gradient_accumulation_steps=4,     # effective batch = 8 × 4 = 32
    num_train_epochs=15,               # total passes over the training data
    learning_rate=1e-4,                # peak LR after warmup
    bf16=True,                         # bfloat16 mixed precision
    optim="adamw_8bit",                # 8-bit AdamW
    warmup_ratio=0.05,                 # first 5 % of steps ramp LR from 0 to peak
    lr_scheduler_type="cosine",        # cosine decay from peak LR to ~0
    logging_steps=25,                  # print loss/LR to console every 25 steps
    logging_first_step=True,           # also log step 1 to catch early instability
    load_best_model_at_end=True,       # restore the best checkpoint after training ends
    metric_for_best_model="macro_f1",  # criterion used to select the best checkpoint
    greater_is_better=True,            # higher macro_f1 is better
    gradient_checkpointing=False,
    remove_unused_columns=False,       # keep the input_embeds column
    save_total_limit=15,               # keep all 15 epoch checkpoints so the best one can be loaded
    weight_decay=0.01,                 # weight decay applied by AdamW
)
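The interplay of batch size, gradient accumulation, warmup, and cosine decay can be sketched without any library. The functions below approximate a linear-warmup, cosine-decay schedule; the training-set size of 6,400 is a hypothetical value, and the step counts are approximate rather than the exact numbers computed by the trainer.

```python
import math

PER_DEVICE_BATCH = 8
GRAD_ACCUM_STEPS = 4
EFFECTIVE_BATCH = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS  # 32 on a single GPU


def total_steps(n_samples, epochs=15):
    """Approximate number of optimizer steps over the whole run."""
    return math.ceil(n_samples / EFFECTIVE_BATCH) * epochs


def lr_at(step, n_steps, peak_lr=1e-4, warmup_ratio=0.05):
    """Linear warmup from 0 to peak_lr, then cosine decay towards 0."""
    warmup_steps = math.ceil(n_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, n_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


n_steps = total_steps(6400)  # hypothetical training-set size
for step in (0, 75, 150, 1500, n_steps):
    print(step, f"{lr_at(step, n_steps):.2e}")
```

The learning rate rises during the first 5% of steps, peaks at 1e-4, and then falls smoothly so that the last epochs make only small updates.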
4.1.6. Training the model
import os

import torch
from transformers import Trainer
from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR

# Set up the trainer for multi-label fine-tuning
class MultiLabelTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs, labels=labels)
        loss = outputs["loss"]
        return (loss, outputs) if return_outputs else loss

    def _save_checkpoint(self, model, trial, metrics=None):
        super()._save_checkpoint(model, trial)
        # Trainer writes each checkpoint to <run_dir>/checkpoint-<global_step>;
        # _get_output_dir() returns only the run directory, so build the full path
        # to keep the head and adapter next to the checkpoint they belong to.
        run_dir = self._get_output_dir(trial)
        ckpt_dir = os.path.join(run_dir, f"{PREFIX_CHECKPOINT_DIR}-{self.state.global_step}")
        os.makedirs(ckpt_dir, exist_ok=True)
        # Save head
        torch.save({
            "projection": model.projection.state_dict(),
            "classifier": model.classifier.state_dict(),
        }, os.path.join(ckpt_dir, "head_weights.pt"))
        # Save LoRA adapter explicitly (bypasses bitsandbytes serialization issues)
        model.backbone.save_pretrained(os.path.join(ckpt_dir, "lora_adapter"))

    def _load_best_model(self):
        best_ckpt = self.state.best_model_checkpoint
        if not best_ckpt:
            return
        # Restore head
        head_path = os.path.join(best_ckpt, "head_weights.pt")
        if os.path.exists(head_path):
            head = torch.load(head_path, map_location="cpu")
            self.model.projection.load_state_dict(head["projection"])
            self.model.classifier.load_state_dict(head["classifier"])
            print(f"Head restored from: {best_ckpt}")
        else:
            print(f"WARNING: head_weights.pt not found in {best_ckpt}")
        # Restore LoRA adapter
        lora_path = os.path.join(best_ckpt, "lora_adapter")
        if os.path.exists(lora_path):
            self.model.backbone.load_adapter(lora_path, adapter_name="default")
            print(f"LoRA restored from: {best_ckpt}")
        else:
            print(f"WARNING: lora_adapter/ not found in {best_ckpt}")

# Set up the trainer
# NOTE: this listing does not show how `model` is built; it is assumed to be a
# MistralForMultiLabel instance wrapping `base_model`.
trainer = MultiLabelTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
)

# Launch training
trainer.train()
Fine-tuning for 15 epochs took 9 hours and 30 minutes on a machine with an NVIDIA RTX 6000 GPU and 192 GB of VRAM. At the end of training, the best checkpoint was loaded.
4.1.7. Model evaluation
Let’s look at the performance on the test dataset. The standard per-class statistics for model evaluation are F1, precision, and recall. The F1 scores show relatively good performance on the target emotions: they are above 0.7 in most categories. The full results are on the model card.
| Emotion | Precision | Recall | F1 | N |
|---|---|---|---|---|
| admiration | 0.7415 | 0.6354 | 0.6844 | 993 |
| amusement | 0.7810 | 0.7422 | 0.7611 | 543 |
| anger | 0.7423 | 0.7367 | 0.7395 | 395 |
| annoyance | 0.7049 | 0.5452 | 0.6148 | 609 |
| confusion | 0.7576 | 0.8251 | 0.7899 | 303 |
| disappointment | 0.8487 | 0.8459 | 0.8473 | 305 |
| disapproval | 0.7208 | 0.5841 | 0.6453 | 517 |
| disgust | 0.8396 | 0.9368 | 0.8856 | 190 |
| excitement | 0.8240 | 0.9366 | 0.8767 | 205 |
| fear | 0.9112 | 0.9686 | 0.9390 | 159 |
| joy | 0.7577 | 0.8024 | 0.7794 | 339 |
| love | 0.7424 | 0.7903 | 0.7656 | 496 |
| optimism | 0.8145 | 0.7636 | 0.7882 | 368 |
| sadness | 0.8534 | 0.8899 | 0.8713 | 327 |
| surprise | 0.8456 | 0.8555 | 0.8505 | 256 |
| Macro precision | 0.8295 | | | |
| Macro recall | | 0.8184 | | |
| Micro F1 | | | 0.7527 | |
| Macro F1 | | | 0.8215 | |
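As a quick sanity check on how the columns relate, F1 is the harmonic mean of the first two numeric columns. The sketch below recomputes it for the first row and for the “fear” row of the table; `f1_score_from` is a hypothetical helper name.

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


first_row = f1_score_from(0.7415, 0.6354)
fear_row = f1_score_from(0.9112, 0.9686)
print(f"{first_row:.4f} {fear_row:.4f}")  # 0.6844 0.9390
```

Because the harmonic mean is pulled towards the smaller value, a class with high precision but low recall still ends up with a modest F1.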
5. Summary
Let’s summarize the main points of the article. Requirements and complete code can be found in this repository.
- Emotion recognition modeling extends sentiment analysis by decomposing the overall sentiment score into emotional components.
- MistralSmall-3.1.GoEmotions is available on Hugging Face under the Apache 2.0 license. The repository also contains inference guidelines.
- Deployment use cases include brand and social monitoring and email classification.
Petr Koráb is the founder of Text Mining Stories, a development and consulting company based in Prague. To learn more about cutting-edge NLP, check out our blog.
AI statement. Part of the code was reviewed by Sonnet 4.6 (Cursor). No text was generated using AI.
Acknowledgment. The Slovak National Bank Foundation supported this development. We would like to thank Martin Feldkircher, Václav Jež and Michala Moravcová for comments and suggestions.
References
[1] Ying Li, Yali Yang, Peihua Song, Lian Duan, Rui Ren. 2025. Improved SMOTE algorithm to enhance imbalanced data classification by expanding sample generation space. Scientific Reports, 15 (23521).
[2] Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics, 8, pp. 726-742.
