llama.cpp Gemma4 MTP Support Merged | EveryCorner

The llama.cpp project has merged PR #23398, "llama: add Gemma4 MTP," bringing MTP support for Gemma4 models to the main branch. MTP (multi-token prediction) serves here as the draft stage of a speculative-decoding flow: draft tokens are predicted ahead of time and then checked by the main model, with the goal of raising generation speed while keeping accuracy at an acceptable level.

The PR author reported that, on his own system, Gemma4's MoE model showed no significant speedup, while the dense model averaged a speedup of more than 2x. On accuracy, the author said he was able to reproduce the roughly 87% AIME-26 result claimed by the Gemma team. Support does not yet cover every Gemma4 variant: the PR explicitly states that it works with the 31B and 26B-4B models, while the E4B and E2B versions are not supported for now.

The accompanying mtp-bench run compared "no MTP" against `--spec-draft-n-max 4` on a DGX Spark. Without MTP, several tasks landed at around 5.9 to 6.2 tok/s, with a total runtime of about 290 seconds. With MTP enabled, the tasks landed at around 11.4 to 19.3 tok/s, with a total runtime of about 120 seconds and an aggregate draft acceptance rate of about 0.588. The PR also provides `llama-server -hf am17an/Gemma4-31B-it-GGUF --spec-type draft-mtp --spec-draft-n-max 4` as a usage example for high-VRAM scenarios.

For local LLM users, this is an important incremental improvement to llama.cpp's Gemma4 inference performance, but the actual gains still depend on the model variant, quantization, hardware, and multi-GPU configuration. The PR specifically notes that multi-GPU works, but that pairing it with `-sm layer` may require specifying `--spec-draft-device`.
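To put the benchmark figures quoted above in perspective, the following Python sketch derives rough speedup ratios from them. It is only arithmetic on the approximate numbers reported in the article. The variable names are illustrative, and the last estimate assumes that every decoding step drafts exactly 4 tokens, which the article does not state.

```python
# Approximate figures quoted from the mtp-bench run on a DGX Spark.
baseline_total_s = 290.0        # total runtime without MTP, in seconds
mtp_total_s = 120.0             # total runtime with --spec-draft-n-max 4
baseline_tok_s = (5.9, 6.2)     # reported throughput range without MTP
mtp_tok_s = (11.4, 19.3)        # reported throughput range with MTP
acceptance_rate = 0.588         # aggregate draft acceptance rate
draft_n_max = 4                 # value passed to --spec-draft-n-max

# End-to-end speedup from the total runtimes.
wall_clock_speedup = baseline_total_s / mtp_total_s

# Loose bounds on the per-task throughput ratio. The article does not pair
# individual tasks, so these combine the extremes of both ranges.
min_ratio = min(mtp_tok_s) / max(baseline_tok_s)
max_ratio = max(mtp_tok_s) / min(baseline_tok_s)

# ASSUMPTION: each step drafts exactly draft_n_max tokens. Under that
# assumption, this is the average number of accepted draft tokens per step.
accepted_per_step = acceptance_rate * draft_n_max

print(f"wall-clock speedup: {wall_clock_speedup:.2f}x")
print(f"throughput ratio bounds: {min_ratio:.2f}x to {max_ratio:.2f}x")
print(f"accepted draft tokens per step (estimate): {accepted_per_step:.2f}")
```

The wall-clock ratio is consistent with the author's "more than 2x" claim for the dense model, but these numbers come from a single system and should not be read as a general prediction for other hardware or quantizations.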