Voxtral transcribes at the speed of sound.


Today we are releasing Voxtral Transcribe 2, a family of two next-generation speech-to-text models with cutting-edge transcription quality, diarization, and ultra-low latency. The family includes Voxtral Mini Transcribe V2 for batch transcription and Voxtral Realtime for live applications. Voxtral Realtime is released with open weights under the Apache 2.0 license.

We’re also launching Mistral Studio’s Audio Playground, where you can instantly test transcriptions with diarization and timestamps using Voxtral Transcribe 2.

Highlights.

  • Voxtral Mini Transcribe V2: State-of-the-art transcription with speaker diarization, context bias, and word-level timestamping for 13 languages.

  • Voxtral Realtime: Designed specifically for live transcription, with configurable latency down to less than 200ms, enabling voice agents and real-time applications.

  • Best-in-class efficiency: Voxtral Mini Transcribe V2 delivers industry-leading accuracy at a fraction of the cost, with the lowest word error rate at the lowest price point.

  • Open weights: Voxtral Realtime ships under the Apache 2.0 license and can be deployed at the edge for privacy-first applications.

Voxtral Realtime.

Voxtral Realtime is purpose-built for latency-critical applications. Unlike approaches that adapt offline models by processing audio in chunks, Realtime uses a new streaming architecture that transcribes audio as it arrives. This model offers transcription with configurable latencies down to less than 200ms, unlocking a new class of voice-first applications.

Cross-language word error rates in the FLEURS transcription benchmark (lower is better).

At a 2.4-second delay, ideal for subtitling, Realtime matches our latest batch model, Voxtral Mini Transcribe V2. At a 480ms delay, it stays within 1-2% word error rate, enabling voice agents with near-offline accuracy.
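For readers unfamiliar with the metric, word error rate is conventionally defined as the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference transcript. The sketch below illustrates that general definition only. It is not the evaluation code behind the figures in this post, the function name `word_error_rate` is chosen for illustration, and any text normalization a benchmark applies before scoring is left out.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words gives a WER of 1/6.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because the denominator is the reference length, a hypothesis with many inserted words can score above 100%.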

This model is natively multilingual, delivering strong transcription performance in 13 languages: English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, German, Japanese, Korean, Italian, and Dutch. Its 4B-parameter footprint allows it to run efficiently on edge devices, ensuring privacy and security for sensitive deployments.

We are releasing the model weights under the Apache 2.0 license on the Hugging Face Hub.

Voxtral Mini Transcribe V2.

Average diarization error rate (lower is better), plotted against price per minute, across five English benchmarks (Switchboard, CallHome, AMI-IHM, AMI-SDM, SBCSAE) and the TalkBank multilingual benchmarks (German, Spanish, English, Chinese, Japanese).

Average word error rate (lower is better), plotted against price per minute, across the top 10 languages in the FLEURS transcription benchmark.

Voxtral Mini Transcribe V2 significantly improves transcription and diarization quality across languages and domains. At approximately 4% word error rate on FLEURS and $0.003 per minute, Voxtral offers the best price-performance of any transcription API. It is more accurate than GPT-4o mini Transcribe, Gemini 2.5 Flash, Assembly Universal, and Deepgram Nova, and it processes audio nearly three times faster than ElevenLabs’ Scribe v2 while matching its quality at one-fifth the cost.

Model characteristics.

Voxtral Mini Transcribe V2 introduces several key features.

Speaker diarization.

Generate transcriptions with speaker labels and precise start and end times. Ideal for meeting transcription, interview analysis, and multi-party call handling. Note: when speech overlaps, the model typically transcribes only one speaker.
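As an illustration of what speaker-labeled output makes possible, the sketch below totals speaking time per speaker. The segment shape used here (dictionaries with `speaker`, `start`, `end`, and `text` keys, times in seconds) is an assumption made for the example, not the documented response schema, and `talk_time_by_speaker` is a hypothetical helper.

```python
from collections import defaultdict


def talk_time_by_speaker(segments):
    """Sum segment durations (in seconds) for each speaker label."""
    totals = defaultdict(float)
    for segment in segments:
        totals[segment["speaker"]] += segment["end"] - segment["start"]
    return dict(totals)


# Hypothetical diarized segments; real field names may differ.
segments = [
    {"speaker": "speaker_0", "start": 0.0, "end": 4.5, "text": "Shall we begin?"},
    {"speaker": "speaker_1", "start": 4.5, "end": 6.0, "text": "Yes."},
    {"speaker": "speaker_0", "start": 6.0, "end": 10.0, "text": "First item on the agenda."},
]

print(talk_time_by_speaker(segments))
```

Since overlapping speech is typically attributed to a single speaker, totals computed this way will undercount anyone who is frequently talked over.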

Contextual bias.

Specify up to 100 words or phrases to guide the model toward the correct spelling of names, technical terms, or domain-specific vocabulary. This is especially useful for proper nouns and industry terms that a standard model often gets wrong. Context bias is optimized for English; support for other languages is experimental.
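Because the list is capped at 100 entries, it can help to clean a vocabulary list before submitting it. The sketch below is one way to do that under stated assumptions: `prepare_context_bias` is a hypothetical helper, and treating entries that differ only in case or spacing as duplicates is a choice made for this example rather than documented API behavior. Only the limit of 100 comes from the text above.

```python
MAX_CONTEXT_TERMS = 100  # limit stated in this post


def prepare_context_bias(terms, limit=MAX_CONTEXT_TERMS):
    """Normalize whitespace, drop blanks and duplicates, and enforce the limit."""
    seen = set()
    cleaned = []
    for term in terms:
        normalized = " ".join(term.split())  # trim and collapse inner whitespace
        key = normalized.casefold()          # assumption: case-insensitive duplicates
        if normalized and key not in seen:
            seen.add(key)
            cleaned.append(normalized)
    if len(cleaned) > limit:
        raise ValueError(f"{len(cleaned)} terms given, but at most {limit} are allowed")
    return cleaned


print(prepare_context_bias(["Voxtral", " voxtral ", "", "Le  Chat"]))
```

Raising an error rather than silently truncating keeps the decision about which terms to drop with the caller.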

Word-level timestamp.

Generate accurate start and end timestamps for each word, enabling applications such as subtitle generation, voice search, and content placement.
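To make the subtitle use case concrete, the sketch below groups word-level timestamps into SubRip (SRT) cues. The input shape (dictionaries with `word`, `start`, and `end` keys, times in seconds) is assumed for illustration and may not match the actual response format. The fixed words-per-cue rule is a simplification, and both function names are hypothetical.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp used by SRT files."""
    total_ms = round(seconds * 1000)
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def words_to_srt(words, max_words=7):
    """Group timestamped words into numbered SRT cues of at most max_words each."""
    cues = []
    for index, offset in enumerate(range(0, len(words), max_words), start=1):
        group = words[offset:offset + max_words]
        start = srt_timestamp(group[0]["start"])
        end = srt_timestamp(group[-1]["end"])
        text = " ".join(word["word"] for word in group)
        cues.append(f"{index}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)


# Hypothetical word-level output; real field names may differ.
words = [
    {"word": "Hello", "start": 0.0, "end": 0.4},
    {"word": "world", "start": 0.5, "end": 0.9},
]
print(words_to_srt(words))
```

A production subtitler would also break cues at punctuation and long pauses and cap the characters per line, which this sketch does not attempt.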

Expanded language support.

Like Realtime, this model currently supports 13 languages: English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, German, Japanese, Korean, Italian, and Dutch. Its non-English performance significantly exceeds that of competitors.

Noise resistance.

Maintains transcription accuracy even in difficult acoustic environments such as factory floors, crowded call centers, and field recordings.

Support for longer audio.

Process recordings of up to 3 hours in a single request.
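For recordings longer than that limit, the audio has to be split across several requests. The sketch below only plans the time spans, assuming the duration is already known in seconds. `plan_requests` is a hypothetical helper, and cutting at fixed offsets is a simplification: in practice, splitting at a pause avoids cutting a word in half.

```python
MAX_REQUEST_SECONDS = 3 * 60 * 60  # 3-hour per-request limit stated in this post


def plan_requests(duration_seconds, max_seconds=MAX_REQUEST_SECONDS):
    """Return (start, end) spans in seconds, each no longer than max_seconds."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    spans = []
    start = 0.0
    while start < duration_seconds:
        end = min(start + max_seconds, duration_seconds)
        spans.append((start, end))
        start = end
    return spans


# A 7-hour recording needs three requests: 3 h, 3 h, and 1 h.
print(plan_requests(7 * 3600))
```

Timestamps returned for later spans would need to be shifted by each span's start offset before the transcripts are merged.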

Cross-language word error rates in the FLEURS transcription benchmark (lower is better).

Audio playground.

Test Voxtral Transcribe 2 directly in Mistral Studio. Upload up to 10 audio files, toggle diarization, choose the timestamp granularity, and add context bias terms for domain-specific vocabulary. The playground supports .mp3, .wav, .m4a, .flac, and .ogg files of up to 1GB each.
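The upload limits above are easy to check locally before opening the playground. The sketch below is a hypothetical pre-flight check, not part of any Mistral tooling. It takes file names and sizes directly so that it runs without touching the disk, and it reads "1GB" as 10**9 bytes, which is an assumption.

```python
from pathlib import PurePath

ALLOWED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".ogg"}
MAX_FILES = 10
MAX_BYTES = 1_000_000_000  # assumption: "1GB" interpreted as 10**9 bytes


def check_upload_batch(files):
    """Return a list of problems for a batch of (filename, size_in_bytes) pairs."""
    problems = []
    if len(files) > MAX_FILES:
        problems.append(f"too many files: {len(files)} > {MAX_FILES}")
    for name, size in files:
        extension = PurePath(name).suffix.lower()
        if extension not in ALLOWED_EXTENSIONS:
            problems.append(f"{name}: unsupported format {extension or '(none)'}")
        if size > MAX_BYTES:
            problems.append(f"{name}: larger than 1GB")
    return problems


print(check_upload_batch([("standup.mp3", 48_000_000), ("notes.aac", 1_200_000)]))
```

An empty list means the batch fits the stated limits.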

Transforming voice applications.

Voxtral powers voice workflows across a variety of applications and industries.

  • Meeting intelligence.

    Transcribe multilingual recordings with speaker diarization that clearly shows who said what and when. Voxtral’s price point allows you to annotate large amounts of meeting content with industry-leading cost efficiency.

  • Voice agents and virtual assistants.

    Build conversational AI with sub-200ms transcription latency. Connect Voxtral Realtime to your LLM and TTS pipelines for a naturally responsive voice interface.

  • Contact center automation.

    Transcribe calls in real-time, allowing AI systems to analyze sentiment, suggest responses, and populate CRM fields while the conversation is taking place. Speaker diarization ensures clear attribution between agent and customer.

  • Media and broadcasting.

    Generate live multilingual subtitles with minimal delay. Context bias handles the proper nouns and jargon that trip up generic transcription services.

  • Compliance and documentation.

    Monitor and transcribe interactions for regulatory compliance, with diarization providing clear speaker attribution and timestamps that enable accurate audit trails.

Both models support GDPR- and HIPAA-compliant deployments through secure on-premises or private cloud setups.

Let’s get started.

Voxtral Mini Transcribe V2 is available now via API at $0.003 per minute. Try it in the new Mistral Studio Audio Playground or in Le Chat.

Voxtral Realtime is available via API at $0.006 per minute and as open weights on Hugging Face.
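As a rough budgeting aid, the per-minute prices above translate into totals as sketched below. The dictionary keys are labels chosen for this example, not API model identifiers, and the calculation assumes cost scales linearly with audio minutes, since billing granularity and rounding are not specified here.

```python
from decimal import Decimal

# Per-minute API prices quoted in this post. Keys are illustrative labels only.
PRICE_PER_MINUTE_USD = {
    "voxtral-mini-transcribe-v2": Decimal("0.003"),
    "voxtral-realtime": Decimal("0.006"),
}


def estimate_cost_usd(model_label, audio_minutes):
    """Estimate API cost in USD, assuming simple linear per-minute pricing."""
    if audio_minutes < 0:
        raise ValueError("audio_minutes must not be negative")
    return PRICE_PER_MINUTE_USD[model_label] * Decimal(str(audio_minutes))


# A 3-hour recording (180 minutes) through the batch model.
print(estimate_cost_usd("voxtral-mini-transcribe-v2", 180))

# 1,000 hours of live audio (60,000 minutes) through the realtime model.
print(estimate_cost_usd("voxtral-realtime", 60_000))
```

`Decimal` is used instead of floats so that small per-minute prices do not pick up binary rounding noise when multiplied out.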

See Mistral’s audio and transcription features documentation.

We are currently hiring.

If you’re excited about building world-class voice AI and putting cutting-edge models into the hands of developers around the world, we want to hear from you. Apply to join our team.


