Gemma 4 QAT models: Optimizing model compression for mobile and laptop efficiency

Original: Gemma 4 QAT models: Optimizing compression for mobile and laptop efficiency

Google released Gemma 4 QAT checkpoints to reduce memory use for local mobile, laptop, and GPU inference.

Google released new Gemma 4 checkpoints optimized with Quantization-Aware Training to preserve quality after compression. The release includes Q4_0 checkpoints and a mobile-focused quantization format that can reduce Gemma 4 E2B memory use to about 1GB, or below 1GB for a text-only configuration. The models are available through Hugging Face and supported across llama.cpp, Ollama, LM Studio, LiteRT-LM, Transformers.js, SGLang, vLLM, MLX, and Unsloth.
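To put the "about 1GB" figure in context, here is a rough back-of-the-envelope sketch of how a Q4_0 weight footprint can be estimated. It assumes the Q4_0 block layout used by llama.cpp (32 weights per block, stored as one 16-bit scale plus 32 four-bit values, so 18 bytes per block). The parameter count below is a hypothetical placeholder, not a figure from the article, and real checkpoints usually keep some tensors at higher precision and need extra memory for activations and the KV cache.

```python
# Assumption: Q4_0 block layout as in llama.cpp / GGUF.
Q4_0_BLOCK_WEIGHTS = 32                          # weights per block
Q4_0_BLOCK_BYTES = 2 + Q4_0_BLOCK_WEIGHTS // 2   # fp16 scale + packed 4-bit values = 18


def q4_0_bytes(num_weights: int) -> int:
    """Bytes needed to store num_weights weights in Q4_0 blocks."""
    blocks = -(-num_weights // Q4_0_BLOCK_WEIGHTS)  # ceiling division
    return blocks * Q4_0_BLOCK_BYTES


def bf16_bytes(num_weights: int) -> int:
    """Bytes needed to store the same weights in 16-bit floats."""
    return 2 * num_weights


# Hypothetical parameter count, chosen only for illustration.
HYPOTHETICAL_PARAMS = 2_000_000_000

if __name__ == "__main__":
    q4 = q4_0_bytes(HYPOTHETICAL_PARAMS)
    bf16 = bf16_bytes(HYPOTHETICAL_PARAMS)
    print(f"Q4_0: {q4 / 1e9:.3f} GB, bf16: {bf16 / 1e9:.3f} GB")
    print(f"effective bits per weight: {8 * q4 / HYPOTHETICAL_PARAMS:.2f}")
```

The per-block scale is why Q4_0 costs 4.5 bits per weight rather than exactly 4.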

Google has released a new batch of QAT checkpoints for Gemma 4. The focus is on shrinking the models further while preserving as much of their original capability as possible, so that developers can more easily run local AI on everyday edge devices, laptops, and consumer-grade GPUs. QAT stands for Quantization-Aware Training. Unlike post-training quantization (PTQ), which quantizes only after training has finished, QAT simulates the constraints of quantization during training, so the model typically loses less quality once it is compressed. Google says these QAT checkpoints deliver better overall quality than a standard PTQ baseline.
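The idea of simulating quantization during training can be sketched with "fake quantization": weights are rounded to a low-bit grid and mapped straight back to floats in the forward pass, while the gradient update is applied to the full-precision weights. The example below is a minimal illustration of that general technique, assuming symmetric 4-bit rounding and a straight-through gradient estimate. It is not Google's training recipe, and the function names are hypothetical.

```python
def fake_quantize(weights, bits=4):
    """Round weights to a signed integer grid, then map them back to floats."""
    qmax = 2 ** (bits - 1) - 1                 # 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / qmax                     # one scale for the whole list
    quantized = []
    for w in weights:
        level = round(w / scale)               # nearest integer level
        level = max(-qmax - 1, min(qmax, level))
        quantized.append(level * scale)
    return quantized


def qat_step(weights, inputs, target, lr=0.05):
    """One gradient step on squared error for a single linear unit.

    The forward pass sees fake-quantized weights. The backward pass treats
    rounding as the identity (straight-through estimator), so the update
    lands on the full-precision weights.
    """
    quantized = fake_quantize(weights)
    prediction = sum(w * x for w, x in zip(quantized, inputs))
    error = prediction - target
    return [w - lr * 2.0 * error * x for w, x in zip(weights, inputs)]


if __name__ == "__main__":
    weights = [0.30, -0.20, 0.05]
    for _ in range(20):
        weights = qat_step(weights, inputs=[1.0, 2.0, -1.0], target=0.5)
    print("full-precision weights:", weights)
    print("weights as deployed:   ", fake_quantize(weights))
```

Because the loss is computed on the rounded weights, training steers the full-precision weights toward values that still work after rounding. PTQ, by contrast, applies the rounding only once, after training has ended.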

