Voxtral TTS: Open-Weights, Low-Latency Text-to-Speech from Mistral AI | EveryCorner

In its post "Speaking of Voxtral," Mistral AI announced Voxtral TTS, the company's first text-to-speech model, positioned as natural voice output for voice agents and enterprise voice workflows. The post is dated March 23, 2026, which conflicts with the June 8, 2026 release date given in the source field; this summary follows the official page.

Voxtral TTS has roughly 4B parameters. Mistral emphasizes its multilingual generation, low latency, fast adaptation to new voices, and speech delivery that is more aware of emotion and context. Supported languages include English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic. The model can be customized with as little as 3 seconds of reference speech, aiming to capture the speaker's rhythm, pauses, intonation, accent, and disfluencies. Mistral also reports zero-shot cross-lingual voice adaptation: for example, a French voice prompt can be used to generate English speech with a French accent.

Technically, Voxtral TTS is a transformer-based autoregressive flow-matching architecture built on Ministral 3B, comprising a 3.4B-parameter transformer decoder backbone, a 390M-parameter acoustic transformer, and a 300M-parameter neural audio codec. Mistral claims that, with a 10-second speech sample and a 500-character input, latency is about 70 ms and the real-time factor (RTF) is about 9.7x, and that the model can natively generate up to two minutes of audio. According to Mistral's human evaluations, Voxtral TTS surpasses ElevenLabs Flash v2.5 in naturalness for zero-shot multilingual custom-voice scenarios, with quality approaching that of ElevenLabs v3.

The model is available via API at $0.016 per 1,000 characters and can also be tried in Mistral Studio and Le Chat. A version with multiple reference voices is released with open weights on Hugging Face under the CC BY-NC 4.0 license.
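The quoted figures can be sanity-checked with a little arithmetic. The Python sketch below is illustrative only and does not call any Mistral API. It rests on two assumptions not stated in the source: that pricing scales linearly with character count (no tiers or minimum charges), and that "RTF of about 9.7x" means roughly 9.7 seconds of audio produced per second of compute. All constant and function names are hypothetical.

```python
"""Back-of-the-envelope checks for the figures quoted in the summary."""

# Figures quoted in the summary.
PRICE_USD_PER_1K_CHARS = 0.016
BACKBONE_PARAMS = 3_400_000_000  # transformer decoder backbone
ACOUSTIC_PARAMS = 390_000_000    # acoustic transformer
CODEC_PARAMS = 300_000_000       # neural audio codec
MAX_AUDIO_SECONDS = 120          # "up to two minutes" of native generation

# ASSUMPTION: "RTF ~9.7x" is read as a speed-up factor, i.e. about
# 9.7 seconds of audio generated per second of compute.
SPEED_FACTOR = 9.7


def total_params() -> int:
    """Sum the three component sizes (consistent with "roughly 4B")."""
    return BACKBONE_PARAMS + ACOUSTIC_PARAMS + CODEC_PARAMS


def api_cost_usd(num_chars: int) -> float:
    """Estimate API cost, ASSUMING strictly linear per-character pricing."""
    if num_chars < 0:
        raise ValueError("num_chars must be non-negative")
    return num_chars / 1000 * PRICE_USD_PER_1K_CHARS


def synthesis_seconds(audio_seconds: float) -> float:
    """Estimate compute time for a clip under the assumed speed factor."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    return audio_seconds / SPEED_FACTOR


if __name__ == "__main__":
    print(f"Total parameters: {total_params() / 1e9:.2f}B")
    print(f"Cost of a 500-character request: ${api_cost_usd(500):.4f}")
    print(f"Cost of 1M characters: ${api_cost_usd(1_000_000):.2f}")
    print(
        "Estimated compute time for a two-minute clip: "
        f"{synthesis_seconds(MAX_AUDIO_SECONDS):.1f} s"
    )
```

Under these assumptions the three components sum to 4.09B parameters, a 500-character request costs under one cent, and a two-minute clip would take on the order of 12 seconds to synthesize. The 70 ms latency figure is quoted separately in the summary and is not modeled here.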