A r/LocalLLaMA post introduces an offline voice loop for talking to local models through Ollama, LM Studio, or vLLM. The stack uses Silero VAD, Parakeet TDT 0.6B v3 STT, and Supertonic TTS 3, all running on CPU so GPU memory stays available for the LLM. The author reports measured CPU-only benchmarks, agent integrations, cross-platform installers, and an MIT-licensed GitHub release.
Community developer maximecb has published bebelm, a Rust-native, GPU-free inference implementation of Liquid AI's LFM2.5-8B-A1B model, available on crates.io. Decode speed reaches ~37 tokens/s on a Ryzen 7950x with ~7GB memory footprint; prefill is unoptimized and currently similar in speed to decode. The library supports tool-use callbacks, weight sharing across multiple Agent instances with independent KV caches, and Agent cloning to skip repeated prefill on shared prompts.
Google Cloud has announced a deep collaboration with Intel and Hugging Face on its new C4 instances to comprehensively optimize open-source GPT (Generative…
This technical blog post from Hugging Face provides a detailed benchmark of running large language models (LLMs) on Google Cloud Platform's (GCP) new C4…
AMD has officially launched its 5th-generation EPYC processor, codenamed "Turin," and Hugging Face has promptly published a blog post detailing the deep…
SetFit (Sentence Transformer Fine-Tuning) is a few-shot text classification framework co-developed by Hugging Face, Intel Labs, and other organizations. Rather…
This Hugging Face blog post explores in detail how to use the `Optimum Intel` library to accelerate inference for the StarCoder code-generation model on Intel…
In the current boom of generative AI, image generation models like Stable Diffusion have become widely popular thanks to their remarkable capabilities…
This article introduces the latest outcome of a collaboration between Hugging Face and Intel: "Q8-Chat," a project designed to demonstrate how to efficiently…
This technical blog post from Hugging Face provides a detailed guide on optimizing and accelerating Stable Diffusion model inference on Intel CPUs…
This case study focuses on the performance of "Hugging Face Infinity" — Hugging Face's high-performance inference container solution — on modern CPUs…
In many real-world enterprise production environments, although GPUs offer extremely high throughput for deep learning inference, CPUs remain indispensable due…