Speaking of Voxtral | Mistral AI

Today, we are releasing Voxtral TTS, our first text-to-speech model with cutting-edge performance in multilingual speech generation. The model is lightweight with 4B parameters, making Voxtral-powered agents natural, reliable, and cost-effective even at scale.

Highlights.

Realistic, emotionally expressive speech in 9 widely spoken languages, with support for diverse dialects.
Very low time-to-first-audio.
Easily adaptable to new voices.
Available to test in Mistral Studio.
Enterprise-grade speech synthesis to power your critical voice agent workflows.

Producing natural speech depends on a model’s ability not only to recite text but also to interpret it accurately. Understanding the context (neutral, happy, sarcastic) determines whether listeners perceive the generated speech as natural or robotic. Our model excels at both context understanding and speaker modeling, capturing how a given person naturally speaks. Our voice adaptation goes beyond traditional read-aloud voices by capturing the speaker’s personality, including natural pauses, rhythm, intonation, and emotional range. With its compact size, low cost and latency, and easy adaptability, Voxtral TTS offers complete control and customization to businesses looking to own their voice AI stack.

Audio is the new UX. Create new interactions for collaboration and understanding that can only be found in voice. Get started with Mistral Voices in American, British, and French dialects with AI Studio.

Listen and decide. Can you hear the difference?

Our team speaks dozens of languages and many dialects, understands the importance of cultural nuance, and built a model that reflects us. Speech builds trust through natural rhythm, emotion, and even humor, so our work on voice emulation focused on authenticity and emotional expression.

Audio demo (voice emulation vs. original voice): Margaret, model behavior architect, English (United States).

Prompt: “Hey there! I’m really looking forward to summer. It’s about to get really warm here and I can’t wait to swim and make cherry pie on Lido.”

Cutting-edge performance.

Automated metrics for multilingual text-to-speech systems, such as word error rate and voice quality scores, fail to capture how natural speech sounds. Naturalness is highly nuanced and depends on a deep understanding of cultural differences and typical conversational patterns, which is why side-by-side human evaluation by native speakers is essential.

For voice agents, latency and quality are in constant tension. Human evaluation shows that Voxtral TTS achieves better naturalness than ElevenLabs Flash v2.5 while maintaining a similar time-to-first-audio (TTFA). Voxtral TTS delivers quality comparable to ElevenLabs v3 and supports emotion steering for more realistic interactions.

We conducted a comparative human evaluation of Voxtral TTS and ElevenLabs Flash v2.5 in a zero-shot custom voice setting. For each of the nine supported languages, three annotators performed pairwise side-by-side preference tests on naturalness, accent adherence, and acoustic similarity to the original reference, using two recognizable voices in their native dialect. In this zero-shot multilingual custom voice setup, Voxtral TTS widens its quality lead over Flash v2.5, highlighting how quickly it can be customized to any voice.
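As a rough illustration of how pairwise side-by-side preference tests can be summarized, the sketch below computes a win rate from annotator votes. The aggregation rule (ties excluded from the denominator), the function name `win_rate`, and the labels `"A"`, `"B"`, and `"tie"` are assumptions made for this example; the post does not describe how votes were aggregated, and the votes shown are placeholders, not results from the evaluation.

```python
from collections import Counter
from typing import Iterable


def win_rate(votes: Iterable[str], system: str = "A") -> float:
    """Return the share of decided (non-tie) votes won by `system`.

    Each vote is "A", "B", or "tie" (hypothetical labels for this sketch).
    """
    counts = Counter(votes)
    decided = sum(n for label, n in counts.items() if label != "tie")
    if decided == 0:
        raise ValueError("no decided votes to compute a win rate from")
    return counts[system] / decided


# Placeholder votes for illustration only, not data from the evaluation.
example_votes = ["A", "A", "B", "tie", "A"]
print(f"win rate for A: {win_rate(example_votes, 'A'):.2f}")
```

Excluding ties is only one convention; counting a tie as half a win for each system is equally common and gives a different number when ties are frequent.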

Spoken by native speakers.

Trained on large audio datasets, Voxtral TTS is built for global applications. It delivers cutting-edge performance in nine languages: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic.

The model is trained to adapt to custom voices with as little as 3 seconds of reference audio, capturing not only the voice but also nuances such as subtle accents, intonations, and even disfluencies like those in the reference. The API provides several preset voices, but it is easy to extend your in-house voice library and customize it for your use case: localize it for a language or accent, and keep it neutral or more emotional, casual or formal, natural and conversational or more robotic.

The model also exhibits zero-shot cross-lingual voice adaptation, even though it was not explicitly trained for it. For example, given a French voice prompt and English text, the model can generate English speech. The resulting audio sounds natural while adopting the accent of the provided voice prompt (in this example, English spoken with a natural French accent). This makes the model useful for building cascaded speech-to-speech translation systems.
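A speech-to-speech translation cascade of this kind can be sketched as three chained stages: transcribe, translate, then synthesize with the speaker's own audio as the voice prompt. This is a minimal sketch under stated assumptions: the class `CascadedTranslator` and the three stage callables are hypothetical placeholders, not the Mistral API, and the stub stages in the usage example only stand in for real models.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CascadedTranslator:
    """Chains speech -> text -> translated text -> speech.

    All three stages are hypothetical callables supplied by the caller.
    """

    transcribe: Callable[[bytes], str]          # speech-to-text stage
    translate: Callable[[str, str], str]        # (text, target language) -> text
    synthesize: Callable[[str, bytes], bytes]   # (text, voice prompt) -> audio

    def run(self, source_audio: bytes, target_language: str) -> bytes:
        text = self.transcribe(source_audio)
        translated = self.translate(text, target_language)
        # Reusing the source audio as the voice prompt is what lets the
        # output keep the original speaker's voice and accent.
        return self.synthesize(translated, source_audio)


# Stub stages for illustration; real stages would call actual models.
pipeline = CascadedTranslator(
    transcribe=lambda audio: "bonjour",
    translate=lambda text, lang: f"[{lang}] {text}",
    synthesize=lambda text, voice_prompt: text.encode("utf-8"),
)
print(pipeline.run(b"fake-audio", "en"))
```

Keeping the stages as plain callables means any speech-to-text or translation component can be swapped in without touching the pipeline itself.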

Interactive demo: cascaded speech-to-speech translation.

Prompt: “Before we begin, we need to confirm a few details. Can I confirm your full name and date of birth?”

Built for low-latency streaming.

Latency is critical for voice agent applications. Voxtral TTS achieves a model latency of 70 ms for a typical input of a 10-second voice sample and 500 characters, with a real-time factor (RTF) of approximately 9.7x. The model natively generates up to 2 minutes of audio, and the API handles generations of arbitrary length with smart interleaving.
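To make the latency figures concrete, the sketch below turns them into generation times. It assumes that "9.7x" means 9.7 seconds of audio produced per second of compute (some sources define RTF as the inverse), and it covers model time only, not network or queueing delay. The constant and function names are chosen for this example.

```python
MODEL_LATENCY_S = 0.070     # 70 ms model latency, as quoted in the text
REAL_TIME_FACTOR = 9.7      # assumed: seconds of audio per second of compute
MAX_NATIVE_AUDIO_S = 120.0  # up to 2 minutes per native generation


def generation_time_s(audio_seconds: float, rtf: float = REAL_TIME_FACTOR) -> float:
    """Compute time needed to generate `audio_seconds` of audio."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return audio_seconds / rtf


def keeps_up_with_playback(rtf: float = REAL_TIME_FACTOR) -> bool:
    """True when audio is generated at least as fast as it is played back."""
    return rtf >= 1.0


for seconds in (10.0, MAX_NATIVE_AUDIO_S):
    print(f"{seconds:.0f} s of audio: about {generation_time_s(seconds):.1f} s to generate")
print(f"first audio after about {MODEL_LATENCY_S * 1000:.0f} ms of model time")
print(f"streaming keeps up with playback: {keeps_up_with_playback()}")
```

Under this reading, a full 2-minute generation takes roughly 12.4 seconds of compute, so once streaming starts the model stays well ahead of playback.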

Voxtral TTS architecture.

Voxtral TTS is a transformer-based, autoregressive flow-matching model built on Ministral 3B. It consists of the following components:

3.4B Parameter Transformer Decoder Backbone
390M Flow Matching Acoustic Transformer
300M Neural Audio Codec (Symmetric Encoder/Decoder)

The model takes a voice prompt (5 to 25 seconds) and a text prompt in any of the nine supported languages. For each audio frame, the transformer backbone predicts semantic tokens, and the flow-matching transformer then runs 16 function evaluations (NFE = 16) to generate the acoustic latents.

We developed an in-house codec that processes audio causally, using semantic VQ (vocabulary of 8192) and acoustic FSQ (36 dimensions, 21 levels) latents at a frame rate of 12.5 Hz.
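The architecture figures in this section imply a few derived quantities, worked out in the sketch below. Two assumptions are made here and are not stated in the post: one semantic token per frame, and every FSQ dimension using all 21 levels independently, so the bits-per-frame value is an upper bound on the information a frame carries rather than a published bitrate.

```python
import math

# Figures quoted in the text.
BACKBONE_PARAMS = 3.4e9
ACOUSTIC_PARAMS = 390e6
CODEC_PARAMS = 300e6
FRAME_RATE_HZ = 12.5
SEMANTIC_VOCAB = 8192
FSQ_DIMS = 36
FSQ_LEVELS = 21
NFE_PER_FRAME = 16


def total_params_billion() -> float:
    """Sum of the three components, in billions of parameters."""
    return (BACKBONE_PARAMS + ACOUSTIC_PARAMS + CODEC_PARAMS) / 1e9


def frame_duration_ms() -> float:
    """Audio covered by one codec frame."""
    return 1000.0 / FRAME_RATE_HZ


def frames_for(audio_seconds: float) -> int:
    """Number of frames needed to cover `audio_seconds` of audio."""
    return math.ceil(audio_seconds * FRAME_RATE_HZ)


def bits_per_frame() -> float:
    """Upper bound: one semantic token plus 36 independent 21-level values."""
    return math.log2(SEMANTIC_VOCAB) + FSQ_DIMS * math.log2(FSQ_LEVELS)


def bitrate_kbps() -> float:
    return bits_per_frame() * FRAME_RATE_HZ / 1000.0


frames = frames_for(120.0)
print(f"total parameters: about {total_params_billion():.2f}B")
print(f"frame duration: {frame_duration_ms():.0f} ms")
print(f"frames in 2 minutes: {frames}")
print(f"flow-matching evaluations in 2 minutes: {frames * NFE_PER_FRAME}")
print(f"bits per frame (upper bound): {bits_per_frame():.1f}")
print(f"bitrate (upper bound): {bitrate_kbps():.2f} kbps")
```

The low frame rate is what keeps the autoregressive sequence short: each step of the backbone covers 80 ms of audio, so a 2-minute generation is 1,500 steps.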

Power your enterprise voice workflow.

Voxtral TTS closes the loop on audio intelligence, providing an output layer for enterprise voice pipelines that stands up to human evaluation. Pair it with Voxtral Transcribe for full speech-to-speech, or integrate it with your existing speech-to-text or LLM stack for cross-language support.

Workflow: customer support.

Voice agents route and resolve queries across channels in a natural, on-brand voice. Deploy Voxtral TTS in your existing contact center system for automated voice responses, and integrate the output into your existing workflows.