r/LocalLLaMA top day | Jun 7, 2026, 12:53 PM | /u/pinkyellowneon

llama.cpp Gemma4 MTP Support Merged

Original: llama.cpp Gemma4 MTP support merged!

llama.cpp merged Gemma4 MTP support for speculative decoding acceleration.

llama.cpp PR #23398 was merged on June 7, 2026, adding MTP support for Gemma4 models. The author reports over 2x average speedup on dense models, no observed speedup on MoE, and replicated AIME-26 results around 87%. Support currently covers 31B and 26B-4B variants, while E4B and E2B are not supported yet; multi-GPU may need extra draft-device configuration.

llama.cpp has merged PR #23398, "llama: add Gemma4 MTP," which adds MTP support for Gemma4 models to the main branch. MTP is used here as a form of speculative decoding: the model drafts several tokens ahead and the drafts are then verified, with the goal of raising generation speed while keeping accuracy at an acceptable level.

The PR author reports that, on his own system, Gemma4's MoE model showed no significant speedup, while the dense model averaged more than 2x. On accuracy, he says he was able to reproduce the roughly 87% AIME-26 result claimed by the Gemma team. Support does not yet cover every Gemma4 variant: the PR states that it works for the 31B and 26B-4B models, and that E4B and E2B are not supported for now.

The accompanying mtp-bench run compared "no MTP" against `--spec-draft-n-max 4` on a DGX Spark. Without MTP, tasks landed at around 5.9 to 6.2 tok/s, with a total runtime of about 290 seconds. With MTP enabled, tasks landed at around 11.4 to 19.3 tok/s, with a total runtime of about 120 seconds (roughly 2.4x faster overall) and an aggregate draft acceptance rate of about 0.588. For high-VRAM setups, the PR gives `llama-server -hf am17an/Gemma4-31B-it-GGUF --spec-type draft-mtp --spec-draft-n-max 4` as a usage example.

For local LLM users, this is an important incremental improvement to llama.cpp's Gemma4 inference performance, but actual gains still depend on the model variant, quantization, hardware, and multi-GPU configuration. The PR specifically notes that multi-GPU works, but that `--spec-draft-device` may need to be specified when running with `-sm layer`.
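
To make the draft-and-verify idea concrete, here is a minimal toy sketch in Python. It is not llama.cpp's implementation: `toy_target`, `toy_draft`, `speculative_step`, and `generate` are hypothetical names invented for illustration, the "models" are simple deterministic functions, and verification uses plain greedy matching (real engines verify all drafts in one batched forward pass and may use sampling-based acceptance). The acceptance rate is assumed here to mean accepted draft tokens divided by drafted tokens, which may not match how mtp-bench computes its aggregate figure.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def toy_target(ctx: List[Token]) -> Token:
    """Hypothetical stand-in for the full model's greedy next token."""
    return (sum(ctx) + len(ctx)) % 10


def toy_draft(ctx: List[Token]) -> Token:
    """Hypothetical cheap drafter: right most of the time, wrong at every third position."""
    guess = toy_target(ctx)
    return guess if len(ctx) % 3 else (guess + 1) % 10


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[Token], n_max: int) -> Tuple[List[Token], int]:
    """Run one draft-then-verify step; return (emitted tokens, accepted draft count)."""
    # 1. The drafter proposes n_max tokens autoregressively.
    proposed: List[Token] = []
    for _ in range(n_max):
        proposed.append(draft(context + proposed))

    # 2. The target checks the proposals in order (one call per position here,
    #    purely for readability).
    emitted: List[Token] = []
    accepted = 0
    for tok in proposed:
        expected = target(context + emitted)
        if tok != expected:
            emitted.append(expected)  # first mismatch: keep the target's token and stop
            return emitted, accepted
        emitted.append(tok)
        accepted += 1

    # 3. Every draft matched, so the target contributes one extra token.
    emitted.append(target(context + emitted))
    return emitted, accepted


def generate(target: NextToken, draft: NextToken, prompt: List[Token],
             n_tokens: int, n_max: int) -> Tuple[List[Token], float, int]:
    """Generate n_tokens; return (tokens, acceptance rate, number of verify steps)."""
    out = list(prompt)
    drafted = accepted = steps = 0
    while len(out) - len(prompt) < n_tokens:
        new, acc = speculative_step(target, draft, out, n_max)
        out.extend(new)
        drafted += n_max
        accepted += acc
        steps += 1
    start = len(prompt)
    return out[start:start + n_tokens], accepted / drafted, steps


if __name__ == "__main__":
    tokens, rate, steps = generate(toy_target, toy_draft, [1, 2], n_tokens=20, n_max=4)
    print(f"tokens={tokens} acceptance={rate:.3f} verify_steps={steps}")
```

In this sketch every emitted token is one the target would have chosen itself, so the output matches plain greedy decoding with the target alone; the gain comes from needing fewer verify steps than generated tokens. How much that translates into wall-clock speedup depends on how cheap the drafter is relative to the full model, which is consistent with the variation the PR author reports between dense and MoE models.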

Summaries are AI-generated; the original article is authoritative.