A Reddit user with an RTX 3060 12GB and 32GB DDR3 RAM is evaluating new QAT-based Gemma 31B GGUF quantizations. They currently run an older Unsloth Gemma 31B IQ3_XXS build at long context, with some tensor and mmproj offloading to CPU. The post asks which Q2-Q3 quant to choose, whether QAT changes quality expectations, and whether MTP would help or hurt under tight VRAM limits.
Microsoft announced at Computex 2026 that Windows 11 has surpassed one billion users, framing the milestone as a base for its next PC strategy. This fall, AI laptops powered by NVIDIA RTX Spark are expected to arrive, emphasizing local inference. Microsoft also plans broader mainstream hardware upgrades to prepare Windows PCs for future AI agent workflows.
Hugging Face published a tutorial for running Reachy Mini conversations without cloud audio processing or API keys. The setup uses its speech-to-speech library as a cascaded VAD, STT, LLM, and TTS pipeline exposed through a Realtime API-compatible WebSocket. Recommended defaults include llama.cpp with Gemma 4, Silero VAD, Parakeet-TDT, and Qwen3-TTS, while allowing swaps to vLLM, MLX, Transformers, or hosted Responses API providers.