Reinforcement learning enables time-constrained LLM translation with the Sand-Glass benchmark

Machine Learning


While large language models (LLMs) excel at multilingual translation, they often produce overly verbose output, which is a problem for time-sensitive applications such as subtitling and dubbing. Ziang Cui, Mengran Yu, and Tianjiao Li, together with colleagues at Bilibili Inc., tackle this problem with a framework designed to balance translation accuracy against tight time constraints. Their work introduces Sand-Glass, a benchmark for evaluating translation under syllable-level duration constraints, and HOMURA, a reinforcement learning system that explicitly manages the trade-off between semantic fidelity and temporal feasibility. The approach uses a dynamic reward to control output length and clearly outperforms strong LLM baselines, achieving accurate, linguistically appropriate timing without sacrificing translation quality.

Syllable timing and semantic fidelity in translation

Many neural machine translation systems carry an inherent redundancy bias, making them unsuitable for strictly time-sensitive tasks such as subtitling and dubbing. Existing prompt-engineering approaches often fail to resolve the conflict between maintaining semantic fidelity and meeting strict temporal feasibility requirements. To address this, the researchers introduced Sand-Glass, a benchmark designed specifically to assess translation performance under syllable-level duration constraints, enabling a more nuanced evaluation of systems operating within strict temporal budgets.
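
To make the evaluation setup concrete, here is a minimal sketch of a syllable-budget compliance check in the spirit of Sand-Glass. The vowel-group heuristic and the `within_budget` helper are illustrative assumptions; the benchmark's actual per-language syllable counting rules are not reproduced in this summary.

```python
import re

def count_syllables_en(text: str) -> int:
    """Crude English syllable count: one syllable per vowel group in each
    word. A stand-in heuristic only; Sand-Glass's per-language counting
    rules are not reproduced in this summary."""
    total = 0
    for word in re.findall(r"[a-zA-Z]+", text.lower()):
        total += max(1, len(re.findall(r"[aeiouy]+", word)))
    return total

def within_budget(translation: str, src_syllables: int, ratio_cap: float = 1.0) -> bool:
    """Does the translation fit a syllable-level duration budget, expressed
    as a cap on the target/source syllable ratio?"""
    return count_syllables_en(translation) <= ratio_cap * src_syllables

# Example: a verbose draft against an 8-syllable source budget prints False.
print(within_budget("a deliberately overwrought rendering", 8))
```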

Furthermore, the study proposes HOMURA, a reinforcement learning framework that explicitly optimizes the trade-off between semantic preservation and temporal compliance. HOMURA pairs a KL-free policy optimization objective (detailed below) with a novel dynamic syllable ratio reward that controls output length and pushes the model toward translations that are both semantically accurate and temporally feasible. Experimental results demonstrate that HOMURA effectively manages output length while preserving semantic meaning, a significant advance in machine translation for time-sensitive applications.
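
The paper's exact reward shape is not given here, but a minimal sketch of a syllable ratio reward could look like the following: full reward while the translation stays within a target ratio, with a smooth penalty as it overshoots. The exponential form, the `sharpness` parameter, and the idea of annealing `target_ratio` over training are all illustrative assumptions.

```python
import math

def syllable_ratio_reward(tgt_syllables: int, src_syllables: int,
                          target_ratio: float = 1.0, sharpness: float = 5.0) -> float:
    """Illustrative dynamic syllable ratio reward: full reward while the
    target/source syllable ratio stays at or below `target_ratio`, then a
    smooth exponential decay as the translation overshoots. Annealing
    `target_ratio` during training is one way the schedule can be
    'dynamic'; the paper's exact reward shape is not reproduced here."""
    ratio = tgt_syllables / max(1, src_syllables)
    overshoot = max(0.0, ratio - target_ratio)
    return math.exp(-sharpness * overshoot)  # 1.0 when compliant, -> 0 as output inflates
```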

Constrained translation using reinforcement learning with HOMURA

The paper details HOMURA, a reinforcement learning approach to constrained translation that targets high compression while preserving linguistic quality: translations must meet strict syllable constraints without sacrificing the quality of the translated text. HOMURA accomplishes this by removing the KL divergence regularization that is standard in reinforcement learning from human feedback; the authors argue that in this constrained setting the KL penalty impedes the structural rewrites the model needs to hit the syllable budget. Instead, HOMURA builds on group-relative policy optimization (GRPO), a reinforcement learning framework that stabilizes training without a KL term, using group-relative advantage normalization and clipping to control policy updates.
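
A minimal sketch of the group-relative baseline at the heart of GRPO, assuming the standard formulation: rewards for a group of sampled translations of the same source sentence are normalized against the group's own statistics, with no learned critic and no KL term.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray) -> np.ndarray:
    """GRPO-style baseline: sample a group of candidate translations for
    the same source sentence, then center and scale each reward by the
    group's own mean and standard deviation. No learned critic and no KL
    term to a frozen reference policy are involved."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)
```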

Within GRPO, a specific loss function, the Centered Clipped Objective, keeps learning stable by centering updates and preventing overly large policy changes. The syllable ratio reward serves as the primary training signal, penalizing translations that exceed the syllable limit, and doubles as a form of structural regularization. Results show that removing the KL term improves performance, yielding higher CERR and BLEU-ρ scores and allowing the model to restructure sentences to meet syllable constraints rather than simply deleting words.
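
The precise form of HOMURA's Centered Clipped Objective is not spelled out in this summary; the sketch below uses the standard PPO/GRPO-style clipped surrogate over the group-centered advantages from the previous snippet, which captures the stated mechanism of bounding each policy update without a KL penalty.

```python
import torch

def clipped_surrogate_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                           advantages: torch.Tensor, clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate over group-centered advantages. Bounding the
    probability ratio to [1 - eps, 1 + eps] caps the size of each policy
    update, which is what keeps training stable once the KL penalty is
    removed."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```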

LLMs exhibit cross-linguistic redundancy bias

The study also tackles a key limitation of LLMs in time-sensitive translation: a systematic cross-linguistic redundancy bias. The team found that LLMs consistently produce translations significantly longer than the source material, precluding their use in applications with tight time budgets. To quantify the issue, they built Sand-Glass, a benchmark that evaluates translation under syllable-level duration constraints and incorporates information-density statistics drawn from real-world subtitles. Experiments revealed a widespread tendency for LLMs to expand translated text, and a diagnostic round-trip expansion rate was introduced to isolate redundancy induced by the model itself rather than by differences in language density.
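
A hypothetical version of that round-trip diagnostic: translate source to target and back, then compare lengths within the same language, so any residual expansion is attributable to the model rather than to language density. The `length_fn` argument (for example, the `count_syllables_en` helper sketched earlier) is an assumed interface.

```python
def round_trip_expansion(src_text: str, back_translation: str, length_fn) -> float:
    """Hypothetical round-trip diagnostic: after translating src -> tgt ->
    src with the same model, compare lengths in the *same* language. Any
    ratio above 1.0 cannot be explained by density differences between
    languages, so it isolates redundancy the model itself introduces."""
    return length_fn(back_translation) / max(1, length_fn(src_text))
```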

Across the languages tested, including German, English, and Spanish, more than 63% of LLM translations showed an expansion ratio greater than 1.0, indicating systematic inflation even when the semantic content was unchanged. HOMURA, the reinforcement learning framework introduced to optimize the trade-off between meaning preservation and temporal compliance, adjusts translation length accordingly: its KL-free objective, combined with the dynamic syllable ratio reward, allows precise length control while respecting each language's density hierarchy. Results show that HOMURA significantly outperforms strong LLM baselines, achieving large reductions in output length without compromising semantic adequacy and operating close to the rate-distortion limit of the fidelity-compression trade-off.
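
As an illustration of the aggregation behind that 63% figure, a per-language share of inflated translations could be computed as follows; the function name and data layout are assumptions, not the benchmark's actual protocol.

```python
def inflation_share(expansion_ratios: dict[str, list[float]]) -> dict[str, float]:
    """Per-language share of translations whose expansion ratio exceeds
    1.0 (illustrative aggregation only)."""
    return {lang: sum(r > 1.0 for r in ratios) / len(ratios)
            for lang, ratios in expansion_ratios.items() if ratios}

# e.g. inflation_share({"de": [1.2, 0.9, 1.4], "es": [1.1, 1.05, 0.8]})
# -> {"de": 0.666..., "es": 0.666...}
```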

Shorter translations with reinforcement learning and Sand-Glass

This research addresses a key problem in machine translation: large language models tend to produce redundant output, which hinders their use in time-sensitive settings such as subtitling and dubbing. To quantify the problem, the authors introduced Sand-Glass, a benchmark that evaluates translation quality under strict syllable-level duration constraints, grounded in the principle of information efficiency. They then developed HOMURA, a reinforcement learning framework that balances semantic accuracy and temporal feasibility, controlling output length through a novel reward system. Experiments show that HOMURA significantly outperforms existing language models at precise length control while maintaining semantic consistency.

The study also identifies an apparent floor on compression, a Chinese-to-English translation ratio of approximately 0.49, suggesting the minimum information density required to convey the core meaning. The authors acknowledge that syllable count is only a textual proxy for actual acoustic duration and that validation so far covers a limited set of language pairs. They propose future work exploring the generality of the observed compression bounds and extending the framework to end-to-end speech translation, jointly optimizing semantics and duration.
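
For intuition, the reported ~0.49 compression floor can be turned into a budget sanity check; the helper below is an illustrative assumption, not something from the paper.

```python
import math

def min_feasible_budget(src_syllables: int, floor_ratio: float = 0.49) -> int:
    """Smallest target syllable budget likely to preserve core meaning,
    using the ~0.49 Chinese-to-English compression floor reported in the
    study as an illustrative default. Budgets below this floor can only
    be met by discarding content."""
    return math.ceil(floor_ratio * src_syllables)

# A 30-syllable Chinese line implies an English budget of at least
# ceil(0.49 * 30) = 15 syllables before fidelity starts to degrade.
```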

👉 More information
🗞 HOMURA: Taming the Hourglass of Time-Constrained LLM Translation with Reinforcement Learning
🧠 arXiv: https://arxiv.org/abs/2601.10187


