An r/LocalLLaMA post introduces an offline voice loop for talking to local models through Ollama, LM Studio, or vLLM. The stack uses Silero VAD, Parakeet TDT 0.6B v3 STT, and Supertonic TTS 3, all running on CPU so GPU memory stays available for the LLM. The author reports measured CPU-only benchmarks, agent integrations, cross-platform installers, and an MIT-licensed GitHub release.
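The VAD, STT, LLM, TTS ordering described in that post can be sketched roughly as follows. This is only an illustration of the general loop shape, not the project's code: the `VoiceLoop` class and its four callables are hypothetical stand-ins, and the toy lambdas at the bottom replace the real Silero, Parakeet, LLM, and Supertonic calls.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class VoiceLoop:
    """Hypothetical glue for a VAD -> STT -> LLM -> TTS loop (names are assumptions)."""
    is_speech: Callable[[bytes], bool]   # stand-in for a VAD such as Silero
    transcribe: Callable[[bytes], str]   # stand-in for the STT model
    generate: Callable[[str], str]       # stand-in for the local LLM call
    synthesize: Callable[[str], bytes]   # stand-in for the TTS model

    def run(self, frames: Iterable[bytes]) -> List[bytes]:
        """Buffer speech frames; on silence, answer the buffered utterance."""
        replies: List[bytes] = []
        utterance = b""
        for frame in frames:
            if self.is_speech(frame):
                utterance += frame
            elif utterance:
                # Silence after speech marks the end of an utterance.
                replies.append(self._respond(utterance))
                utterance = b""
        if utterance:
            # Flush speech that ran to the end of the stream.
            replies.append(self._respond(utterance))
        return replies

    def _respond(self, audio: bytes) -> bytes:
        text = self.transcribe(audio)
        reply = self.generate(text)
        return self.synthesize(reply)


# Toy stand-ins so the sketch runs without any models installed.
loop = VoiceLoop(
    is_speech=lambda frame: frame != b"\x00",
    transcribe=lambda audio: audio.decode(),
    generate=lambda text: text.upper(),
    synthesize=lambda reply: reply.encode(),
)
replies = loop.run([b"h", b"i", b"\x00", b"\x00", b"y", b"o"])
```

Keeping each stage behind a plain callable is one way to let the speech components run on CPU while the LLM call goes to whichever local server holds the GPU.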
Omi Health’s founder says he fine-tuned NVIDIA Parakeet TDT 0.6B v2 for clinical speech and released Omi Med STT v1 under CC-BY-4.0. The runtime supports Mac, Windows, and Linux, auto-selecting MLX, NeMo, or GGUF/parakeet.cpp backends. On the author’s held-out medical benchmark, the model reportedly reaches 2.37% medical-WER and runs at 145× realtime on local A10 compute.
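For readers unfamiliar with the two figures, the sketch below computes a standard word error rate (word-level edit distance divided by reference length) and a realtime speed-up (audio duration divided by processing time). Both function names are made up for illustration. The summary does not say how the author defines "medical-WER", so the plain WER here is an assumption about the general idea, not that benchmark's method.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: (substitutions + deletions + insertions) / reference words."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Rolling-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def realtime_speedup(audio_seconds: float, processing_seconds: float) -> float:
    """How many seconds of audio are processed per second of compute."""
    return audio_seconds / processing_seconds


# One substitution ("takes" -> "take") and one deletion ("daily") over 4 words.
wer = word_error_rate("patient takes metformin daily", "patient take metformin")
# A 145x figure corresponds to, e.g., 145 s of audio transcribed in 1 s.
speedup = realtime_speedup(145.0, 1.0)
```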
The title says Mistral AI’s Voxtral can transcribe “at the speed of sound,” suggesting a focus on fast speech-to-text. No article body is available, so details such as benchmarks, languages, pricing, API access, or release status cannot be confirmed. The item is most relevant to developers and researchers tracking Mistral’s work in speech and transcription models.
ElevenLabs introduced Scribe v2 Realtime, a low-latency speech-to-text model built for live transcription, voice agents, meeting assistants, and real-time captions. The company says it transcribes in under 150 ms across several major languages and supports 90 languages. Key features include automatic language detection, VAD, manual commit, text conditioning, multiple audio formats, API access, ElevenLabs Agents integration, and enterprise compliance options.
ElevenLabs published a blog post titled “Introducing Scribe v2.” With no source text provided, the only confirmed information is that it introduces Scribe v2. It likely concerns an updated transcription or speech-to-text product, but features, accuracy claims, pricing, API access, language support, and rollout details cannot be verified from the title alone.
ElevenAPI is a developer category on the ElevenLabs blog rather than a single detailed article. It collects updates and tutorials around speech, music, conversational agents, API keys, web components, and integrations. Listed posts mention Lovable, ElevenLabs UI, Music API, Claude 3.7 Sonnet, Gemini 2.0 Flash, DeepSeek R1, Voice Isolator API, timestamped TTS endpoints, and Speech-to-Speech API.
Abridge is an AI-native startup focused on the healthcare sector. Its core product uses "Ambient Clinical Intelligence" technology to record clinical…
Prominent AI scholar and commentator Nathan Lambert, in the latest edition (#20) of Latest Open Artifacts, has compiled the major recent developments in the…
With the proliferation of GPT-4o, Gemini Live, and various end-to-end voice models, Voice Agents have become an important frontier in AI applications. However…
Vercel has released an update announcing that its AI Gateway service now officially supports the Nova 2 Lite model. Vercel AI Gateway is an AI middleware layer…
Hugging Face recently made a major upgrade to its flagship "Open ASR Leaderboard," officially launching two brand-new evaluation tracks: "Multilingual" and…
Hugging Face recently announced a brand-new, ultra-fast optimized deployment solution for OpenAI's open-source speech recognition model Whisper on its hosted…
Replicate has published its technical newsletter, Replicate Intelligence #4, summarizing recent major developments in the AI field as well as the latest…
This technical blog post from Hugging Face provides a detailed walkthrough of how to use the `transformers` library to fine-tune Meta's open-source W2V2-BERT…
The Hugging Face official blog introduces how to use "Speculative Decoding" to more than double the inference speed of OpenAI's Whisper speech-to-text model…
This official Hugging Face blog post details how to quickly implement AI speech recognition (Automatic Speech Recognition, ASR) functionality in the Unity game…
OpenAI's Whisper is a powerful automatic speech recognition (ASR) model. While its zero-shot capabilities are impressive, there remains significant room for…
In the field of automatic speech recognition (ASR), Wav2Vec2 is a revolutionary model, but it faces a significant challenge when processing long audio files…
This is a landmark technical tutorial published by the Hugging Face team in 2021, detailing how to fine-tune Meta AI's Wav2Vec2 model using the Hugging Face…