Pipeline parallelism in llama.cpp may be wasting your VRAM
A Reddit test claims llama.cpp pipeline parallelism may use extra VRAM without improving single-request inference speed.
The author compared three llama.cpp Vulkan builds: the default 4 sched copies, 1 sched copy, and no pipeline parallelism. In their Qwen GGUF test, input and output throughput were nearly identical across all three configurations. However, the default setting used about 1.5 GB more VRAM for compute buffers and reduced usable context from roughly 113K tokens to around 88K. Benefits for parallel requests were not tested.
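To put the context figures in proportion, here is a minimal sketch that computes the loss from the two rounded numbers quoted above. The function name `context_loss` is hypothetical, and the inputs are the approximate values from the summary, not exact measurements.

```python
from typing import Tuple


def context_loss(before_tokens: int, after_tokens: int) -> Tuple[int, float]:
    """Return (tokens lost, fraction of the larger context that was lost)."""
    lost = before_tokens - after_tokens
    return lost, lost / before_tokens


# Rounded figures from the summary: roughly 113K tokens vs. around 88K.
lost, fraction = context_loss(113_000, 88_000)

if __name__ == "__main__":
    print(f"{lost} tokens lost ({fraction:.1%} of the larger context)")
```

On these rounded figures, the default build gives up about 25K tokens, a little over a fifth of the context available without the extra compute buffers.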
This Reddit post shares the author's hands-on results with pipeline parallelism in llama.cpp. The author points out that llama.cpp enables pipeline parallelism by default, presumably to accelerate inference. In the author's test environment, however, it delivered no visible speed gain and instead significantly increased VRAM usage. The author used the Vulkan backend to compare three builds: the default GGML_SCHED_MAX_COPIES=4, a build with GGML_SCHED_MAX_COPIES=1, and a build that indirectly disables pipeline parallelism through GGML_BLAS=ON and GGML_BLAS_VENDOR=OpenBLAS. The test model is a GGUF-quantized version of Qwen3.6-27B-MTP, run with llama-server using full-layer GPU offload, flash attention, an f16 K cache, a q8_0 V cache, and other settings.
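The three builds can be sketched as CMake configure commands. This is an illustration under assumptions, not the author's exact invocation: the build directory names and the helper functions are hypothetical, and `-DGGML_VULKAN=ON` is assumed as the way the Vulkan backend was enabled. Only the `GGML_SCHED_MAX_COPIES`, `GGML_BLAS`, and `GGML_BLAS_VENDOR` settings come from the summary.

```python
from typing import Dict, List

# Hypothetical build directory names mapped to the extra CMake options
# for each variant. Only the -D options are taken from the summary.
VARIANTS: Dict[str, List[str]] = {
    "build-default": [],  # GGML_SCHED_MAX_COPIES stays at its default of 4
    "build-copies1": ["-DGGML_SCHED_MAX_COPIES=1"],
    "build-blas": ["-DGGML_BLAS=ON", "-DGGML_BLAS_VENDOR=OpenBLAS"],
}


def configure_command(build_dir: str, extra: List[str]) -> List[str]:
    """Return the cmake configure argv for one variant (Vulkan backend assumed)."""
    return ["cmake", "-B", build_dir, "-DGGML_VULKAN=ON", *extra]


def all_commands() -> Dict[str, List[str]]:
    """Build the configure argv for every variant without running anything."""
    return {name: configure_command(name, flags) for name, flags in VARIANTS.items()}


if __name__ == "__main__":
    # Print the commands instead of executing them, so the sketch is safe to run.
    for argv in all_commands().values():
        print(" ".join(argv))
```

Keeping each variant in its own build directory lets the three llama-server binaries be run back to back with identical runtime settings, so any VRAM difference can be attributed to the build options alone.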
Source: r/LocalLLaMA. Summaries are AI-generated; the original article is authoritative.