llama-server Router Mode: Pinned Model Grabs CUDA Context on All GPUs, Causing OOM
Original: llama-server router: a model pinned to one GPU still grabs a CUDA context on every card, so it OOMs when my others are full. Am I missing a flag or is this just how it is?
In llama-server router mode, child processes initialize CUDA contexts on all GPUs, causing OOM when other cards are fully loaded.
A Reddit user highlighted a limitation in llama-server's router mode (`--models-preset`): each child process it spawns initializes a CUDA context on every available GPU, even when its model is pinned to a single card. When the memory on the other GPUs is already filled by a large model, launching a smaller model fails with a CUDA OOM error, because the child cannot allocate even its small context stub on the maxed-out cards. Child processes currently inherit the router's base environment, so `CUDA_VISIBLE_DEVICES` cannot be set per model.
This popular discussion on Reddit's r/LocalLLaMA points to a serious multi-GPU memory management flaw that appears when `llama-server`'s router mode (enabled with the `--models-preset` parameter) is used to manage multiple models dynamically.
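To make the environment-inheritance point concrete, here is a minimal Python sketch of what per-child GPU visibility looks like when a parent process launches its children itself. It is not how llama-server's router works and not a fix from the original thread; the helper names `env_for_gpus` and `run_child` are hypothetical, and a small Python probe stands in for a real model server command.

```python
import os
import subprocess
import sys


def env_for_gpus(gpu_ids, base_env=None):
    """Return a copy of the environment that exposes only the given GPU indices.

    Hypothetical helper: a CUDA program started with this environment
    enumerates only the listed devices, so it cannot create a context on
    any other card.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


def run_child(cmd, gpu_ids):
    """Run cmd with its own CUDA_VISIBLE_DEVICES instead of the parent's."""
    return subprocess.run(
        cmd,
        env=env_for_gpus(gpu_ids),
        capture_output=True,
        text=True,
        check=True,
    )


# Stand-in for a model server: the child just reports which GPUs it may see.
# In practice cmd would be the server command for one model (assumption).
probe = [sys.executable, "-c", "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"]

result = run_child(probe, [1])
print(result.stdout.strip())
```

The parent's own environment is left untouched, since each child receives a modified copy. That per-child override is what the post says router mode currently lacks.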
Summaries are AI-generated; the original article is authoritative.