Voxtral TTS: Open-Weights, Low-Latency Text-to-Speech from Mistral AI
Original: "Speaking of Voxtral" (Research), Mistral AI, March 23, 2026. Voxtral TTS: A frontier, open-weights text-to-speech model that’s fast, instantly adaptable, and produces lifelike speech for voice agents.
Mistral AI released Voxtral TTS, a 4B-parameter, multilingual, low-latency text-to-speech model for voice agents.
Mistral AI introduced Voxtral TTS, its first text-to-speech model, focused on realistic multilingual voice generation. The 4B-parameter model supports nine languages, quick voice adaptation from short references, and low-latency streaming for voice agents. Mistral says human evaluations show stronger naturalness than ElevenLabs Flash v2.5. The model is available through the API, Mistral Studio, and Le Chat, with open weights on Hugging Face.
In "Speaking of Voxtral," Mistral AI released Voxtral TTS, the company's first text-to-speech model, positioned for natural voice output in voice agents and enterprise voice workflows. The article is dated March 23, 2026, which conflicts with the June 8, 2026 release time given in the source field; this summary follows the official page.
Voxtral TTS has roughly 4B parameters. Mistral emphasizes multilingual generation, low latency, fast adaptation to new voices, and speech delivery that is more emotionally and contextually aware. Supported languages are English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic. The model can be adapted to a new voice from reference speech as short as 3 seconds, and it attempts to capture the speaker's rhythm, pauses, intonation, accent, and disfluencies. Mistral also reports zero-shot cross-lingual voice adaptation: for example, a French voice prompt can be used to generate English speech with a French accent.
Technically, Voxtral TTS uses a transformer-based autoregressive flow-matching architecture built on Ministral 3B. It comprises a 3.4B transformer decoder backbone, a 390M acoustic transformer, and a 300M neural audio codec. Mistral claims that, with a 10-second speech sample and a 500-character input, latency is about 70 ms and the real-time factor (RTF) is about 9.7x, and that the model can natively generate audio up to two minutes long.
Mistral states that in human evaluations Voxtral TTS surpasses ElevenLabs Flash v2.5 in naturalness for zero-shot multilingual custom-voice scenarios, with quality approaching ElevenLabs v3. The model is available via API at $0.016 per 1,000 characters and can also be tried in Mistral Studio and Le Chat. A version with multiple reference voices is released as open weights on Hugging Face under the CC BY-NC 4.0 license.
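To make the quoted figures concrete, here is a minimal Python sketch of the arithmetic behind them. Only the numbers come from the summary ($0.016 per 1,000 characters, an RTF of about 9.7x, a two-minute native limit, and the three component sizes). The function and constant names are hypothetical, billing is assumed to scale linearly with character count with no minimums or rounding, and "9.7x" is assumed to mean that audio is produced 9.7 times faster than it plays back.

```python
# Figures quoted in the summary; all names below are illustrative, not part of any Mistral API.
PRICE_PER_1K_CHARS_USD = 0.016   # quoted API price
RTF_SPEEDUP = 9.7                # ASSUMPTION: audio is generated 9.7x faster than real time
MAX_NATIVE_AUDIO_S = 120         # "up to two minutes" of natively generated audio

# Component sizes as quoted (parameter counts).
COMPONENT_PARAMS = {
    "transformer decoder backbone": 3.4e9,
    "acoustic transformer": 390e6,
    "neural audio codec": 300e6,
}


def total_params_billions(components: dict[str, float] = COMPONENT_PARAMS) -> float:
    """Sum the quoted component sizes and express the total in billions."""
    return sum(components.values()) / 1e9


def api_cost_usd(num_chars: int) -> float:
    """Estimated cost for num_chars of input text, ASSUMING purely linear billing."""
    return num_chars / 1000 * PRICE_PER_1K_CHARS_USD


def generation_time_s(audio_seconds: float, speedup: float = RTF_SPEEDUP) -> float:
    """Estimated wall-clock time to synthesize audio_seconds of speech."""
    return audio_seconds / speedup


if __name__ == "__main__":
    print(f"Total parameters: {total_params_billions():.2f}B")
    print(f"Cost of a 500-character input: ${api_cost_usd(500):.3f}")
    print(f"Time to generate {MAX_NATIVE_AUDIO_S} s of audio: "
          f"{generation_time_s(MAX_NATIVE_AUDIO_S):.1f} s")
```

Under these assumptions the three components sum to about 4.09B parameters, consistent with the "roughly 4B" headline figure. A 500-character input would cost $0.008, and a full two-minute clip would take roughly 12.4 seconds to generate.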
Summaries are AI-generated; the original article is authoritative.