A r/LocalLLaMA post presents an unofficial PyTorch implementation of NanoQuant, a 2026 post-training quantization method for dense transformers. The method factorizes weights into scaling vectors and binary matrices, then quantizes and fine-tunes blocks sequentially to reduce hardware requirements. Early Qwen3-0.6B and Qwen3-4B experiments are promising for base models, but instruct quality remains weak and highly dependent on calibration data.
Cohere has signed strategic Memorandums of Understanding (MOUs) with Spanish multinational tech giant Indra Group and quantum software leader Multiverse Computing. The collaborations aim to accelerate enterprise AI adoption in Europe, combining Cohere's LLMs with Indra's digital transformation expertise and Multiverse's quantum-inspired model optimization capabilities.
Google released new Gemma 4 checkpoints optimized with Quantization-Aware Training to preserve quality after compression. The release includes Q4_0 checkpoints and a mobile-focused quantization format that can reduce Gemma 4 E2B memory use to about 1GB, or below 1GB for a text-only configuration. The models are available through Hugging Face and supported across llama.cpp, Ollama, LM Studio, LiteRT-LM, Transformers.js, SGLang, vLLM, MLX, and Unsloth.
Hugging Face has officially introduced Quanto, a brand-new quantization library designed for PyTorch, which has been integrated as a backend into the Hugging…