An analysis of the Gemma 4 QAT GGUF files shows that Google's official 'Q4_0' releases actually use a mixed-precision strategy. For the smaller models, such as E2B and E4B, Google keeps the critical token embeddings in Q6_K and certain projection weights in F16. As a result, Google's Q4_0 files are larger and retain more precision than Unsloth's 'Q4_K_XL' versions, which use plain Q4_0 for almost all tensors.
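To see why the mixed-precision layout produces a larger file, the arithmetic can be sketched from the GGML block formats: Q4_0 stores 32 weights in 18 bytes (4.5 bits per weight), Q6_K stores 256 weights in 210 bytes (6.5625 bits per weight), and F16 uses 16 bits per weight. The tensor names and weight counts below are hypothetical placeholders chosen for illustration, not values read from the actual Gemma 4 files.

```python
from collections import defaultdict

# Bits per weight, derived from each format's block layout:
# block size in bytes * 8 / weights per block.
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q6_K": 210 * 8 / 256,  # 256-weight super-block stored in 210 bytes
    "Q4_0": 18 * 8 / 32,    # 32-weight block stored in 18 bytes
}


def size_by_type(tensors):
    """Sum storage bytes per quantization type.

    tensors: iterable of (name, quant_type, n_weights) tuples.
    """
    totals = defaultdict(float)
    for _name, quant_type, n_weights in tensors:
        totals[quant_type] += n_weights * BITS_PER_WEIGHT[quant_type] / 8
    return dict(totals)


def total_bytes(tensors):
    """Total storage bytes across all tensors."""
    return sum(size_by_type(tensors).values())


# Hypothetical weight counts (assumptions for illustration only).
N_EMBED = 256_000 * 2_048   # token embedding table
N_PROJ = 2_048 * 2_048      # one projection matrix
N_REST = 2_000_000_000      # everything else

# Nearly everything in plain Q4_0.
uniform_q4_0 = [
    ("token_embd", "Q4_0", N_EMBED),
    ("proj", "Q4_0", N_PROJ),
    ("rest", "Q4_0", N_REST),
]

# Embeddings kept in Q6_K and a projection kept in F16, as described above.
mixed = [
    ("token_embd", "Q6_K", N_EMBED),
    ("proj", "F16", N_PROJ),
    ("rest", "Q4_0", N_REST),
]

if __name__ == "__main__":
    for label, layout in (("uniform Q4_0", uniform_q4_0), ("mixed", mixed)):
        print(f"{label}: {total_bytes(layout) / 1e9:.3f} GB")
```

With a large vocabulary, the embedding table alone accounts for most of the size difference, since each of its weights costs about 2 extra bits in Q6_K compared with Q4_0.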
A Reddit user shared their experience with the Gemma 4 31B QAT (Quantization-Aware Training) model. Compared to traditional GGUF quants such as Q6_K_L, the user reports that the QAT version delivers noticeably better quality in roleplay and long-context tasks. In addition, combining the QAT model with Multi-Token Prediction (MTP) raised generation speed from ~20 t/s to as much as 50 t/s, a speedup of up to roughly 2.5x.