Cohere analyzes why speculative decoding behaves differently on Mixture-of-Experts models than on dense LLMs. Its benchmarks show MoE speedups can peak at moderate batch sizes because sparse expert routing keeps verification bandwidth-bound. The post also finds that temporal expert overlap and fixed overhead amortization make multi-token verification cheaper than simple worst-case models predict.
A Reddit user is running Qwen3.6-MTP-27B-MTP in Q4_K_M GGUF format with llama.cpp server on a 32GB Tesla V100. They report one peak of 55 tokens per second, but typical throughput is closer to 44-48 TPS. The post asks whether flags such as parallelism, speculative MTP draft settings, KV cache quantization, flash attention, and a 262K context window are limiting performance without improving output quality.
A LocalLLaMA user shared an early packed-twin-inference experiment for local LLM acceleration. The idea resembles speculative decoding, but uses the same quantized model side-by-side instead of a smaller draft model. On a single AMD MI50, the author reports Qwen3.6-27B improving from 19.4 to 38.1 tk/s, with Q8-or-lower quantization as the main target.
Xiaomi announced MiMo-V2.5-Pro-UltraSpeed with TileRT, claiming over 1,000 tokens/s decode speed on a 1-trillion-parameter MoE model. The company says it runs on a single standard 8-GPU commodity node, not wafer-scale or SRAM-heavy specialized hardware. The claimed stack combines FP4 MoE expert quantization, DFlash speculative decoding, and TileRT low-latency inference kernels, but independent validation is still needed.
llama.cpp PR #23398 was merged on June 7, 2026, adding MTP support for Gemma4 models. The author reports over 2x average speedup on dense models, no observed speedup on MoE, and replicated AIME-26 results around 87%. Support currently covers 31B and 26B-4B variants, while E4B and E2B are not supported yet; multi-GPU may need extra draft-device configuration.
As AI Agent applications become increasingly widespread, running large language models (LLMs) efficiently on personal computers (such as AI PCs powered by…
### Background and the LLM Inference Bottleneck When running large language models (LLMs), autoregressive generation is inherently "memory-bandwidth-bound"…
The slow autoregressive generation speed of large language models (LLMs) has long been a major bottleneck in real-world deployment. While "speculative…
In the deployment and inference of large language models (LLMs), reducing generation latency has always been a critical challenge. The traditional approach of…
Hugging Face has published a technical blog post on "Dynamic Speculation," aimed at optimizing the inference speed of large language models (LLMs)…
Hugging Face, in collaboration with Intel, has announced official support for "Assisted Generation" (also commonly known as Speculative Decoding) on Intel…
This technical blog post from Hugging Face introduces how to build a powerful and efficient speech processing system using Hugging Face Inference Endpoints — a…
This Hugging Face blog post explores in detail how to use the `Optimum Intel` library to accelerate inference for the StarCoder code-generation model on Intel…
The Hugging Face official blog introduces how to use "Speculative Decoding" to more than double the inference speed of OpenAI's Whisper speech-to-text model…
This technical guide from Hugging Face systematically introduces the core strategies for deploying and optimizing large language models (LLMs) in production…
Large language models (LLMs) typically generate text using an "autoregressive" mechanism, meaning the model must generate one token at a time. Each generation…